Metadata
Metadata (from the Greek μετα, meta, 'after, beyond' and Latin datum, 'what is given', "data"), literally "about data", are data that describe other data. In general, a set of metadata is a group of data describing the informational content of an object called a resource. The concept of metadata is analogous to using an index to locate objects instead of searching through the data itself. For example, a library uses catalog cards that specify authors, titles, publishers, and shelf locations to help readers find books. Thus, metadata helps to locate data.
For various fields of computing, such as information retrieval or the semantic web, metadata in tags is an important approach for bridging the semantic gap: any resource stored together with others needs to be described so that searches can find it based on its distinctive characteristics. This is true for any kind of resource, whether it is a video, a book in a library, or a bone in a paleontologist's closet.
The concept of metadata predates the Internet and the web, although it is true that new information search needs have sparked an interest in metadata standards and practices hitherto unknown.
Definitions
The term "metadata" does not have a single definition. The most widespread definition is “data about data”. Variants such as "information about data", "data about information" and "information about information" are also common.
Another class of definitions tries to specify the term more precisely, as "structured and optional descriptions that are publicly available to help locate objects" or "structured and encoded data that describe characteristics of instances containing information to help identify, discover, assess, and manage the described instances." This class arose from the criticism that the simplest statements are so fuzzy and general that they make it difficult to agree on standards, but such definitions are not very common.
We can also consider metadata, in the areas of telecommunications and computing, as information that is not relevant to the end user but is of great importance to the system that handles the data. The metadata is sent together with the information when a request or update is made.
In the biological field, metadata has become a fundamental tool for the discovery of data and information. In this context, metadata can be defined as "a standardized description of the characteristics of a data set". This includes the description of the context in which the data was collected and also refers to the use of standards to describe it.
Distinction between data and metadata
Most of the time it is not possible to draw a sharp line between data and metadata. For example, a poem is a data set, but it can also be metadata if it is attached to a song that uses it as its lyrics.
Often the same item is both data and metadata. For example, the title of a text is part of the text and, at the same time, data that refers to the text (data acting as metadata).
Metadata about metadata
Because metadata is itself data, it is possible to create metadata about metadata. Although this seems absurd at first glance, metadata about metadata can be very useful. For example, when two images and their different metadata are merged, it can be very important to know the origin of each group of metadata, and this origin can be recorded as metadata about metadata.
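The image-merging case above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `merge_with_provenance`, the file names and the field values are all hypothetical, and the rule that a later source overrides an earlier one is just one possible conflict policy.

```python
def merge_with_provenance(sources):
    """Merge metadata dicts and record, per field, which source supplied it.

    `sources` maps a source name to its metadata dict. On conflicts the
    later source wins (an assumed policy). The returned `provenance` dict
    is the "metadata about metadata".
    """
    merged = {}
    provenance = {}
    for source_name, metadata in sources.items():
        for key, value in metadata.items():
            merged[key] = value
            provenance[key] = source_name
    return merged, provenance


# Hypothetical metadata for two images that are combined into one.
sources = {
    "left.jpg": {"camera": "Camera A", "taken": "2020-05-01"},
    "right.jpg": {"camera": "Camera B", "location": "Lisbon"},
}
merged, provenance = merge_with_provenance(sources)
print(merged)
print(provenance)
```

Keeping the provenance in a separate dictionary leaves the merged metadata in its usual shape while still allowing each field to be traced back to its source.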
Types of metadata
- Descriptive: to find or understand a source of information.
- Administrative:
  - Technical metadata: to decode and represent files.
  - Preservation metadata: long-term management of archives.
  - Rights metadata: intellectual property rights attached to the content.
- Structural: relationships of parts of resources to each other.
- Markup languages: integrate metadata and markup for other structural or semantic features within the content.
Objectives
The most frequently mentioned use of metadata is refining search engine queries. With the additional information, results are more accurate and the user is spared further manual filtering.
The semantic gap raises the problem that the user and the computer do not understand each other, because the computer does not understand the meaning of the data. Metadata may enable communication by stating how the data is related. That is why knowledge representation uses metadata to categorize information. The same idea supports artificial intelligence, where conclusions are deduced automatically.
Metadata facilitates workflows by allowing data to be converted automatically from one format to another. For this, the metadata must describe the content and structure of the data.
Some metadata enables more efficient data compression. For example, if in a video the software knows how to distinguish the foreground from the background, it can use different compression algorithms and thus improve the compression rate.
Another application is variable data presentation. If there is metadata pointing out the most important details, a program can select the most appropriate form of presentation. For example, if a mobile phone knows where a person is located in an image, it can crop the image around that person to fit the dimensions of its screen. In the same way, a browser can decide to present a diagram to a blind user in tactile or spoken form.
Classification
Metadata is classified using three criteria:
- Contents
- Subdividing metadata by its content is the most common approach. Metadata describing the resource itself can be separated from metadata describing the content of the resource. Each of these two groups can be subdivided further, for example to separate metadata describing the meaning of the content from metadata describing its structure, or metadata describing the resource itself from metadata describing the life cycle of the resource.
- Variability
- According to variability, metadata can be divided into mutable and immutable. Immutable metadata does not change, no matter which part of the resource is viewed, for example the name of a file. Mutable metadata differs from one part to another, for example the content of a video.
- Function
- Data can belong to one of three functional layers: subsymbolic, symbolic or logical. Subsymbolic data contains no information about its meaning. Symbolic data describes subsymbolic data, that is, it adds meaning. Logical data describes how symbolic data can be used to deduce logical conclusions, that is, it adds understanding.
Life Cycle
The metadata life cycle comprises creation, manipulation, and destruction phases. Careful analysis of each stage brings significant issues to light.
Creation
Metadata can be created manually, semi-automatically, or automatically. Depending on the format used and the volume required, the manual process can be so laborious that people cannot keep up with it. The development of semi-automatic or automatic tools is therefore highly desirable.
In automatic production, the software acquires the information it needs without external help. Although such advanced algorithms are currently being researched, it is unlikely that a computer will be able to extract all metadata automatically. Semi-automatic production is considered more realistic: here a human operator supports the autonomous algorithms by resolving uncertainties or supplying information that the software cannot extract without help.
Many experts design tools for creating metadata without questioning the creation process itself. According to those who do address the issue, generation should not start after a resource has been completed but should take place during its production: metadata has to be archived as soon as it originates, drawing on the special knowledge of the producer, to avoid laborious reconstruction later. For this reason, the production of metadata has to be integrated into the procedure for producing the resource.
Manipulation
If the data changes, the metadata has to change too. This raises the question of who is going to adapt the metadata. Some modifications can be handled easily and automatically, but others require the intervention of a human operator.
Metaproduction, the recycling of parts of resources to create other resources, demands particular attention. Merging the associated metadata is not trivial, especially when it involves legally relevant information, such as digital rights management.
Destruction
Metadata destruction also deserves study. In some cases it is appropriate to delete the metadata along with its resources; in others it is reasonable to keep the metadata, for example to track changes to a text document.
Metadata in computing
Metadata has gained great relevance on the Internet because of the need to classify the enormous amount of data available. In addition to classification, metadata can help in searches. For example, if we search for an article about vehicles, that article will have descriptive keywords attached to it as metadata, such as "four wheels" or "engine".
Other examples of metadata uses in computing:
- Metatags in HTML: tags with information about the web document itself: author, editor, encoding, etc.
- Information in the file system itself: HFS or ReiserFS, to name two. They are complemented by desktop search tools (Beagle or Spotlight) that know how to recognize this metadata.
- Photo organizers: F-Spot, Picasa or iPhoto, for example.
- Music organizers: manage metadata about songs, whether from MP3 files (where it is stored in a format called ID3) or from audio CDs. For example: iTunes and Rhythmbox.
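As an illustration of the first item in the list above, the following minimal sketch reads name/content pairs from HTML meta tags using only Python's standard `html.parser` module. The class name `MetaTagParser` and the sample page are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Skip meta tags without a name (for example the charset tag).
        if name and content is not None:
            self.metadata[name.lower()] = content


# Hypothetical page used only for illustration.
page = """
<html><head>
<meta charset="utf-8">
<meta name="author" content="Jane Doe">
<meta name="keywords" content="metadata, indexing">
<title>Example</title>
</head><body></body></html>
"""

parser = MetaTagParser()
parser.feed(page)
print(parser.metadata)
```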
Storage
There are two ways to store metadata: internally, in the same document as the data, or externally, in a separate resource. Initially, metadata was stored internally for ease of administration.
Today, external storage is generally considered the better option because it allows metadata to be concentrated in one place, which optimizes search operations. In return, it raises the problem of how a resource is linked to its metadata. Most standards use URIs, the scheme used to identify documents on the World Wide Web, but this method poses other questions, for example what to do with documents that do not have URIs.
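A minimal sketch of external storage, assuming an in-memory dictionary keyed by URI, might look as follows. The class `ExternalMetadataStore`, its method names and the `example.org` URIs are illustrative assumptions rather than part of any standard.

```python
class ExternalMetadataStore:
    """Keep metadata outside the resources, indexed by resource URI."""

    def __init__(self):
        self._records = {}

    def put(self, uri, metadata):
        # Store a copy so later changes by the caller do not leak in.
        self._records[uri] = dict(metadata)

    def get(self, uri):
        # Returns None when no metadata is known for the URI.
        return self._records.get(uri)

    def find(self, field, value):
        """Return the sorted URIs whose metadata has field == value."""
        return sorted(
            uri
            for uri, record in self._records.items()
            if record.get(field) == value
        )


store = ExternalMetadataStore()
# The URIs below are placeholders, not real documents.
store.put("https://example.org/docs/1", {"author": "A. Writer", "lang": "en"})
store.put("https://example.org/docs/2", {"author": "B. Writer", "lang": "es"})
store.put("https://example.org/docs/3", {"author": "A. Writer", "lang": "es"})

print(store.find("author", "A. Writer"))
```

Because all records live in one place, a search only has to scan the store and never has to open the resources themselves. The sketch also shows the weakness mentioned above: a document without a URI has no key under which its metadata could be filed.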
Encoding
The earliest and simplest metadata formats used cleartext or binary encoding to store metadata in files.
Today, it is common to encode metadata using XML, which makes it readable by both humans and computers. XML also has other advantages, for example it is very easy to integrate into the World Wide Web. But there are drawbacks too: the data needs more storage space than in a binary format, and it is not clear how to convert the tree structure into a data stream.
For this reason, many standards include utilities to convert XML to binary encoding and vice versa, thus combining the advantages of the two.
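The round trip between a readable XML form and a compact binary form can be sketched with Python's standard library. Note the assumptions: `zlib` compression merely stands in for a binary encoding (actual standards define their own binary formats), the function names are hypothetical, and the sketch only handles flat records whose keys are valid XML tag names and whose values are non-empty strings.

```python
import xml.etree.ElementTree as ET
import zlib


def to_xml(metadata):
    """Encode a flat dict of strings as a small XML document (bytes)."""
    root = ET.Element("metadata")
    for key, value in metadata.items():
        field = ET.SubElement(root, key)
        field.text = value
    return ET.tostring(root, encoding="utf-8")


def from_xml(xml_bytes):
    """Decode the XML produced by to_xml back into a dict."""
    root = ET.fromstring(xml_bytes)
    return {child.tag: child.text for child in root}


def to_binary(xml_bytes):
    """Stand-in binary encoding: compress the XML bytes with zlib."""
    return zlib.compress(xml_bytes)


def from_binary(blob):
    """Recover the XML bytes from the compressed form."""
    return zlib.decompress(blob)


# Hypothetical record used only for illustration.
record = {"title": "Sunflowers", "creator": "Vincent Van Gogh", "format": "image"}
xml_bytes = to_xml(record)
blob = to_binary(xml_bytes)
print(len(xml_bytes), len(blob))
print(from_xml(from_binary(blob)) == record)
```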
Controlled vocabularies and ontologies
To ensure uniformity and compatibility of metadata, many suggest the use of a controlled vocabulary that fixes the terms allowed in a field. For example, in the case of synonyms or of terms in different languages, it is necessary to agree on which words are used, to prevent the search engine from finding a resource under one term but missing the same resource labelled with a synonym.
An ontology additionally defines the relationships between vocabulary terms so that the computer can evaluate them automatically. Thus it is possible to present a web page about "Vincent Van Gogh" even though the user typed "Dutch painters": using a suitable ontology, the search engine understands that Van Gogh was a Dutch painter.
Folksonomies are a concept very similar to ontologies. Ontologies are defined by experts in the field who organize the terms, whereas folksonomies are defined by the users themselves.
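Both ideas can be sketched in a few lines of Python. The vocabulary and the ontology fragment below are hypothetical toy data, and the single "is-a" relation is a deliberate simplification of what real ontology languages offer.

```python
# Hypothetical controlled vocabulary: each variant maps to one preferred term.
PREFERRED_TERM = {
    "automobile": "car",
    "motorcar": "car",
    "car": "car",
}

# Hypothetical ontology fragment: each entry reads "key is a value".
IS_A = {
    "Vincent Van Gogh": "Dutch painter",
    "Rembrandt": "Dutch painter",
    "Dutch painter": "painter",
}


def normalize(term):
    """Map a term to its preferred form; unknown terms are only lowercased."""
    lowered = term.lower()
    return PREFERRED_TERM.get(lowered, lowered)


def is_a(entity, category):
    """Follow is-a links upward until category is found or the chain ends."""
    current = entity
    seen = set()
    while current in IS_A and current not in seen:
        seen.add(current)
        current = IS_A[current]
        if current == category:
            return True
    return False


def instances_of(category):
    """Return every known entry that is, directly or indirectly, a category."""
    return sorted(entity for entity in IS_A if is_a(entity, category))


print(normalize("Automobile"))
print(instances_of("Dutch painter"))
```

With the vocabulary, a query for "Automobile" and a record tagged "car" meet on the same term; with the ontology, a query for "Dutch painter" can return the page about Van Gogh even though the page never uses those words.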
Metadata and learning objects
In e-learning, metadata is used to describe learning objects and resources in order to facilitate searches in repositories. The metadata in learning object repositories often complies with the IEEE LOM standard, which defines nine categories of information that describe resources from both a didactic and a technical point of view. This allows users (teachers composing a new course from existing materials, or students interested in a certain topic) to run more precise searches and obtain results that better match their search criteria.
Metadata is an essential part of the learning object paradigm, since:
- The reuse of learning objects is based on the creation and use of metadata, often as descriptions external to the resources themselves.
- Metadata, if provided in the appropriate languages, makes it possible to develop new technological tools that facilitate the search and manipulation of learning objects.
- They facilitate information retrieval by describing content and its relationships with other resources.
- They facilitate interoperability, as they make it easier to share and exchange information.
- They simplify management and storage, as they allow you to store information about the life cycle of resources.
- They help to properly manage and protect intellectual property rights.
Metadata is therefore an element of fundamental value. A digital resource with an excellent pedagogical design is not per se a good learning object, but it will be so to the extent that the metadata that describes it is also of good quality.
Paradata
The set of data generated during the interaction between a user and a resource or service in an educational setting (a virtual learning environment, repository, social network, etc.) is known as paradata. Depending on the resource or service accessed and the operations carried out with it, the paradata generated will contain more or less information. This information can be stored and analyzed later in order to better understand how users interact in that educational setting and which process they follow, and to detect possible problems and opportunities to improve both the setting itself and the tools used. The results of this analysis can then be used to build recommendation systems, reputation schemes, and interaction visualizations, among others.
For example, if a user downloads a document from a learning object repository, it will be possible to know that user U downloaded document D at time T. This information can be used to detect the most downloaded resources or, conversely, those that are never accessed by users, and also to find out at what times of the academic semester the repository is used most. Another possibility is that a user U gives a rating X to a resource R at time T, for example by rating a comment in a Facebook group. This reveals which comments are rated best or worst by users, as well as who the most active users are.
In general, the goal is to store user interaction with the system in the form of (U, T, S, R, X) tuples: a user U at time T uses a service S on a resource R with a result X. This is the minimum information that must be stored for later analysis.
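The (U, T, S, R, X) tuple can be modelled directly, for instance with a named tuple. The following is a minimal sketch; the field names, the sample events and the two helper functions are assumptions made for illustration.

```python
from collections import Counter, namedtuple

# One paradata record: user U, time T, service S, resource R, result X.
Paradata = namedtuple(
    "Paradata", ["user", "time", "service", "resource", "result"]
)

# Hypothetical interaction log.
events = [
    Paradata("u1", "2024-03-01T10:00", "download", "doc-A", "ok"),
    Paradata("u2", "2024-03-01T10:05", "download", "doc-A", "ok"),
    Paradata("u1", "2024-03-02T09:30", "rate", "doc-B", 4),
    Paradata("u3", "2024-03-02T11:15", "download", "doc-B", "ok"),
    Paradata("u2", "2024-03-03T16:40", "download", "doc-A", "ok"),
]


def most_downloaded(events):
    """Count downloads per resource, most downloaded first."""
    counts = Counter(e.resource for e in events if e.service == "download")
    return counts.most_common()


def most_active_users(events):
    """Count interactions per user, most active first."""
    return Counter(e.user for e in events).most_common()


print(most_downloaded(events))
print(most_active_users(events))
```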
Paradata storage
Since interaction with a virtual learning environment usually takes place through a web browser, a user who visits the spaces it offers leaves a trace in the form of page accesses, which are collected in the log files of the web servers that support the system. It therefore seems feasible to analyze the log files to extract the information related to the interaction. The problem, however, is that these files contain many more entries related to loading the elements that make up a web page than entries that actually reflect the result of the user interaction itself, which must be reconstructed from the sequence of page transitions captured in the log files. The computational cost of analyzing log files is very high (they contain millions of lines), and the analysis is not easy in complex systems with multiple servers, where the trace left by a user may be fragmented across different files.
Therefore, if paradata needs to be collected for further analysis, it is better for the system to be designed with a dedicated collection service that stores only the information required for the analysis, reducing both the storage needed and the time required for the analysis process. Depending on the purpose of the analysis, paradata can be stored within a resource's metadata, within the user's profile, or, most commonly, in a separate table (or database, depending on its complexity).
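A separate paradata table could be sketched as follows, using SQLite through Python's standard `sqlite3` module. The table layout, column names and helper functions are assumptions chosen for this example, not a prescribed schema.

```python
import sqlite3

# In-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE paradata (
        user_id  TEXT NOT NULL,
        ts       TEXT NOT NULL,   -- ISO 8601 timestamp
        service  TEXT NOT NULL,
        resource TEXT NOT NULL,
        result   TEXT
    )
    """
)


def record(conn, user_id, ts, service, resource, result=None):
    """Store one interaction; only the fields needed for analysis are kept."""
    conn.execute(
        "INSERT INTO paradata VALUES (?, ?, ?, ?, ?)",
        (user_id, ts, service, resource, result),
    )


def never_downloaded(conn, all_resources):
    """Return the resources that have no download entry."""
    rows = conn.execute(
        "SELECT DISTINCT resource FROM paradata WHERE service = 'download'"
    )
    downloaded = {row[0] for row in rows}
    return sorted(set(all_resources) - downloaded)


def usage_by_month(conn):
    """Count interactions per month (the YYYY-MM prefix of the timestamp)."""
    return conn.execute(
        "SELECT substr(ts, 1, 7), COUNT(*) FROM paradata "
        "GROUP BY substr(ts, 1, 7) ORDER BY substr(ts, 1, 7)"
    ).fetchall()


record(conn, "u1", "2024-03-01T10:00", "download", "doc-A", "ok")
record(conn, "u2", "2024-03-15T12:30", "download", "doc-A", "ok")
record(conn, "u1", "2024-04-02T09:00", "rate", "doc-B", "4")
conn.commit()

print(never_downloaded(conn, ["doc-A", "doc-B", "doc-C"]))
print(usage_by_month(conn))
```

Compared with parsing web server logs, each row here already corresponds to one meaningful interaction, so the queries stay short and no page-layout requests have to be filtered out.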
Criticism and problems associated with the use of metadata
Some experts strongly criticize the use of metadata. Their most substantial arguments are:
- Metadata is expensive and takes too much time to produce. Companies will not produce metadata because there is no demand, and private users will not invest that much time.
- Metadata is too complicated. People do not accept the standards because they do not understand them and do not want to learn them.
- Metadata depends on point of view and context. No two people add the same metadata. In addition, the same data can be interpreted entirely differently depending on the context.
- Metadata is unlimited. It is always possible to add more useful metadata, and there is no end to it.
- Metadata is superfluous. There are already powerful search engines for text, and in the future the query by example technique (“search based on an example”) will be improved for locating images as well as music and video.
Some metadata standards are available but not widely adopted: critics see this as proof of the flaws in the metadata concept. It should be noted that this effect can also be caused by insufficient compatibility between formats or by an enormous diversity that intimidates companies. Apart from that, there are metadata formats that are very popular.
Although the inclusion of metadata is necessary, to facilitate and enhance a number of important tasks, there are also problems associated with its use. Some of the most cited problems in the scientific literature are summarized in the following points:
- Lack of completion: entering metadata is often a thankless task that requires considerable effort. This leads to a tendency not to complete metadata records (or to complete them poorly), because organizations do not properly perceive the need to offer complete metadata records or cannot cope with the high cost, in terms of effort, of completing their collections. The problem is more evident when the number of learning objects to annotate is large (collections of thousands of learning objects) or when the number of metadata elements to be filled in is high (over 20).
- Difficulties of interoperability: some metadata elements rely on vocabularies, closed collections of terms from which the value of the element must be chosen. The standards allow different vocabularies to be used for the same metadata element, so vocabularies can vary from one institution to another, which makes it difficult to exchange resources or to have external systems operate on metadata of different origins. For example, the value of element 5.8 of the IEEE LOM standard must be chosen (according to IEEE LOM) from the following list of terms: very easy / easy / medium / difficult / very difficult. However, IEEE LOM itself allows another vocabulary to be used if it is deemed appropriate, so an institution could reduce the number of categories to 3, extend it to 10 to offer a more detailed scale, or choose another scale more appropriate to its context, such as: core / basic-need-to-support / essential-support-external / complex.
- Semantic inconsistency and other problems derived from the established standards: it is not rare for two different institutions to offer different information for the same metadata element, often because the metadata standard used is unclear. Taking the IEEE LOM standard as an example, some elements, such as element 5.8, depend to a large extent on the subjective opinion of the person who creates the metadata record and are consequently bound to be inconsistent with records created by other people, since one person might consider difficult what another considers very difficult. A similar problem is semantic incompleteness, that is, entering incomplete information for a metadata element and not providing all the information that it would be possible (and desirable) to offer.
- They are "oriented to humans": metadata are textual descriptions that people interpret easily. IT systems and applications, however, cannot easily process this information to provide added services, because the metadata has not been written to be understood by machines. Thus, it is difficult to program a search engine that prioritizes the results of a search for learning objects in a repository depending, for example, on access rights information (LOM element 6.2) or on geographical or temporal coverage (element 1.8, Coverage), simply because such information is contained in text written in a human language from which it is difficult to extract: the text has to be processed with complex linguistic analysis techniques that divide it into its essential parts, analyze each part and extract the information, always bearing in mind that texts such as the description of the coverage may contain localisms, omissions that are obvious to people but incomprehensible to a machine, or other complexities.
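The vocabulary mismatch described under "Difficulties of interoperability" above is commonly handled with an explicit mapping between vocabularies. The sketch below normalizes a local scale onto the five difficulty terms quoted above; the three-level local scale and the mapping itself are hypothetical, and any real mapping would have to be agreed between the institutions involved.

```python
# The five terms quoted in the text for IEEE LOM element 5.8.
LOM_DIFFICULTY = ["very easy", "easy", "medium", "difficult", "very difficult"]

# Hypothetical three-level local scale mapped onto the five terms above.
LOCAL_TO_LOM = {
    "low": "easy",
    "intermediate": "medium",
    "high": "difficult",
}


def to_lom_difficulty(value, mapping=LOCAL_TO_LOM):
    """Translate a difficulty value into one of the five listed terms.

    Values already in the target vocabulary pass through unchanged;
    values with no known mapping raise ValueError instead of being guessed.
    """
    term = value.strip().lower()
    if term in LOM_DIFFICULTY:
        return term
    if term in mapping:
        return mapping[term]
    raise ValueError(f"no mapping for difficulty term: {value!r}")


print(to_lom_difficulty("Very easy"))
print(to_lom_difficulty("high"))
```

Raising an error for unmapped terms, rather than silently picking a default, makes the gaps between two vocabularies visible when records are exchanged.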
Formats and standards
Two fields drive the development of metadata formats: multimedia technology and the semantic web. The aim of the multimedia formats is to describe a single multimedia resource; the aim of the semantic web is to describe resources of any type and to link knowledge together. The most popular and most extensive formats are:
- ID3 makes it possible to record very simple metadata, such as title and performer, in MP3 audio files. The format is very popular and shows that metadata can be useful.
- MPEG-7
- MPEG-21
- TV-Anytime
- EXIF
- Dublin Core
- LOM used in learning objects
- Resource description framework (RDF)
- RDF Schema
- OWL
- NewsML
- SportsML
- ONIX for Books, a standard used in the publishing industry as a means of transmitting the metadata of the books necessary for trade.
- ISO 19115, which regulates the metadata of geographical information.