Information extraction
Information extraction (IE) is a type of information retrieval whose objective is to automatically extract structured or semi-structured information from computer-readable documents.
A typical IE application scans a series of documents written in natural language and populates a database with the extracted information. The documents may be semi-structured or unstructured and can be very diverse, from newspaper articles to scientific reports. Current IE approaches rely on natural language processing techniques that focus on very restricted domains.
The goal is to process these documents with natural language processing (NLP) software and extract useful information from them. The task is very complex, so these programs usually operate within very restricted domains, which makes it difficult to extract information from texts written in informal language or from images.
For example, the Message Understanding Conference (MUC) was a series of competitions, each edition focused on a specific domain:
- MUC-1 1987, MUC-2 1989: Messages for naval operations.
- MUC-3 1991: Terrorism in Latin American countries.
- MUC-5 1993: Microelectronics.
- MUC-6 1995: News articles on management changes.
- MUC-7 1998: Satellite launch reports.
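The typical application described above, scanning documents and populating a database with what was extracted, can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the gazetteer, the table layout, the function names and the sample sentences are all hypothetical, and a real IE system would use far richer NLP techniques than a word lookup.

```python
import re
import sqlite3

# Hypothetical gazetteer: a tiny hand-made lookup of known names and categories.
GAZETTEER = {
    "Luis": "PERSON",
    "Juan": "PERSON",
    "SEAT": "ORGANIZATION",
}

def extract_entities(text):
    """Return (name, category) pairs for gazetteer names found in the text."""
    found = []
    for token in re.findall(r"\w+", text):
        if token in GAZETTEER:
            found.append((token, GAZETTEER[token]))
    return found

def populate(documents):
    """Scan each document and store the extracted entities in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entity (doc_id INTEGER, name TEXT, category TEXT)")
    for doc_id, text in enumerate(documents):
        rows = [(doc_id, name, cat) for name, cat in extract_entities(text)]
        conn.executemany("INSERT INTO entity VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

# Made-up sample documents, for illustration only.
documents = [
    "Luis bought a computer from Juan.",
    "SEAT presented a new car.",
]
conn = populate(documents)
rows = conn.execute(
    "SELECT doc_id, name, category FROM entity ORDER BY rowid"
).fetchall()
print(rows)
```

The extraction step is deliberately kept separate from the storage step, so the lookup could be swapped for a more capable extractor without touching the database code.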
Typical IE Tasks
- Named entity recognition (NER). Searches for, locates and classifies atomic elements in text into predefined categories such as names of people, organizations, places, expressions of time, quantities, monetary values, percentages, etc., using knowledge of the domain or information from other sentences. To carry out this location and identification, a unique identifier must be assigned to each extracted entity. When nothing is known about the entity instances, a technique called named entity detection is used. For example, given the text “Luis enjoys riding a bike”, the detection task would extract the name Luis as referring to a person, who is probably the subject of the sentence.
- Coreference resolution (CR). Its purpose is to detect coreference links between the entities in the text. This task is restricted to finding links between named entities that have been previously extracted. For example, Sociedad Española de Automóviles de Turismo and SEAT refer to the same entity. Anaphora is a type of coreference.
- Terminology extraction. Identifies and extracts candidate terms from the texts explored. A related analysis, semantic role labeling, detects the semantic arguments associated with the predicates or verbs of a sentence so that they can be classified according to their specific roles. For example: “Luis bought a computer from Juan”. In this case “Luis” is the buyer, “Juan” is the seller, “a computer” is the object bought, and the verb of the sentence is “to buy”.
- Relationship extraction. Requires detecting and classifying mentions of semantic relationships (such as a client's office number or a customer's address). For example, finding out that the customer Jorge has the telephone number 94220033 and that the client Luis has the number 911230001.
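Two of the tasks above can be illustrated with a small sketch built on the examples from the list. It is only a rough approximation under stated assumptions: the acronym heuristic for coreference, the regular expression for the telephone relationship, the relation label and all function names are hypothetical, and real systems rely on much more robust linguistic analysis.

```python
import re

def acronym(name):
    """Build an acronym from the capitalized words of a name."""
    return "".join(word[0] for word in name.split() if word[0].isupper())

def may_corefer(mention_a, mention_b):
    """Heuristic: two mentions may corefer if they are equal
    or if one is the acronym of the other."""
    return (
        mention_a == mention_b
        or acronym(mention_a) == mention_b
        or acronym(mention_b) == mention_a
    )

# Assumed sentence pattern: "<customer|client> <Name> has the [telephone] number <digits>".
PHONE_RELATION = re.compile(
    r"(?:customer|client)\s+(?P<name>[A-Z]\w*)\s+has\s+the\s+"
    r"(?:telephone\s+)?number\s+(?P<number>\d+)"
)

def extract_phone_relations(text):
    """Return (person, relation, number) triples found in the text."""
    return [
        (m.group("name"), "has_phone_number", m.group("number"))
        for m in PHONE_RELATION.finditer(text)
    ]

same_entity = may_corefer("Sociedad Española de Automóviles de Turismo", "SEAT")
relations = extract_phone_relations(
    "The customer Jorge has the telephone number 94220033 "
    "and the client Luis has the number 911230001."
)
print(same_entity)
print(relations)
```

A fixed pattern like this only covers sentences phrased exactly as assumed, which reflects why IE systems tend to be limited to very restricted domains.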