Linguistic corpus


A linguistic corpus is a large, structured set of real examples of language use. These examples can be obtained from written texts (the most common source) or from oral samples (usually transcribed). Corpora can be textual, when they compile oral or written texts, or reference corpora, when they record concordances extracted from texts. In Spanish, an example of a reference corpus is the Corpus Básico del Español de Chile.

A linguistic corpus is a relatively large set of texts, created independently of its possible forms or uses. In terms of its structure, variety and complexity, a corpus must reflect a language, or one of its modalities, as exactly as possible; in terms of its use, the concern is that this representation be realistic. Corpora resemble texts because they are composed of them, but they are not texts in themselves: unlike a text, a corpus is not meant to be analyzed in its entirety. A text has a beginning and an end, and is cohesive and coherent to a greater or lesser degree, while a corpus lacks such characteristics because it does not have a structure, only a composition. For this reason, a corpus is best analyzed with tools and a methodology of its own.

Because of the size, accessibility, linguistic and encyclopedic information, high reliability and other characteristics of corpora, their compilation has become one of the main methods and instruments, if not the main one, of language research in general linguistics.

The need to work with the collected samples efficiently and economically, given their enormous size, has encouraged the development of one of the most promising branches of contemporary linguistics: computational linguistics. Corpora are now collected and stored electronically.

Applications of corpora

Linguistic corpora are used to perform statistical analyses and to test hypotheses about the area of language they cover. Corpus-based study has more and more supporters, and the evidence it provides has called into question some linguistic postulates that had enjoyed wide support within the linguistic community.
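As a minimal sketch of the kind of statistical analysis meant here, the following Python snippet counts word frequencies in a tiny sample and compares the relative frequency of one word in two subsets. The two sample passages, the `tokenize` and `relative_frequency` helpers, and the naive regular-expression tokenizer are all assumptions made for illustration; real corpus work relies on far larger samples and more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency(word, tokens):
    """Occurrences of `word` per 1,000 tokens."""
    if not tokens:
        return 0.0
    return 1000 * tokens.count(word) / len(tokens)

# Invented toy samples standing in for a written and a spoken subcorpus.
written = "The committee shall review the report. The report shall be published."
spoken = "Well, we will look at the report, and then we will put it out."

written_tokens = tokenize(written)
spoken_tokens = tokenize(spoken)

# Raw frequency list over both samples.
freq = Counter(written_tokens + spoken_tokens)
print(freq.most_common(3))

# Normalized frequencies make samples of different sizes comparable.
print(relative_frequency("shall", written_tokens))
print(relative_frequency("shall", spoken_tokens))
```

Normalizing per 1,000 tokens is one common convention; any fixed base works as long as it is applied consistently to every subset being compared.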

Corpus linguistics is the subdiscipline of linguistics that studies language through these samples. This approach contrasts with the Chomskyan approach, which tends to study language through the linguistic intuition of the speaker. Given the volume of data it handles, corpus linguistics is often associated with computational linguistics, which deals with natural language processing applications.

This discipline began in 1967, when Henry Kučera and W. Nelson Francis published the now classic Computational Analysis of Present-Day American English, based on the Brown Corpus, a compilation of American English of approximately one million words selected from a wide variety of sources.

Classification criteria

Degree of generality

The degree of generality of a corpus depends on how far its texts have been selected to cover the different varieties of a language. Specialized corpora are oriented to a particular linguistic variety (sublanguage) or to a restricted domain (journalistic, legal, medical language, etc.). For this reason they have the minimum degree of generality. The texts of general corpora, by contrast, belong to different linguistic varieties and are selected because together they make up the descriptive framework of the language as a whole. They are therefore multifunctional corpora that are often used as reference resources when studying a language, for example as a data source for the preparation of a dictionary. General corpora are frequently made up of, or can be divided into, several subcorpora, that is, subsets of texts that belong to a particular variety.

Size

For written-language corpora, size is measured by the number of words the corpus contains; for spoken-language corpora, the hours of recording are counted instead. We can distinguish closed and open corpora. A closed corpus is the traditional standard corpus, in which the number of texts and words is fixed in the initial phase of the project. Closed corpora are a kind of photograph of a language taken through the selected texts, but they are not designed to follow the changes and evolution of a language, which is intrinsically dynamic. To overcome this limitation, John Sinclair proposed extending the traditional notion of corpus to that of an instrument of linguistic observation. A corpus whose main function is to observe the language (monitor corpus) is an open set of texts that changes over time, since new texts are added to it according to the same criteria used to select the earlier ones. This type of corpus makes it possible, for example, to observe the changing nature of the lexicon of the language in question, and it can therefore be used in lexicography as a data source for updated dictionaries.
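The difference between a closed corpus and a monitor corpus can be sketched in a few lines of Python. The `MonitorCorpus` class, its genre-based selection criterion and the sample texts below are hypothetical and invented for illustration; the only ideas taken from the text are that size is counted in words and that new texts are admitted under the same criteria as earlier ones.

```python
from dataclasses import dataclass, field

@dataclass
class MonitorCorpus:
    """An open corpus: texts can be added over time under a fixed selection criterion."""
    accepted_genres: frozenset
    texts: list = field(default_factory=list)

    def add(self, text, genre, year):
        """Add a text only if it satisfies the criterion applied to earlier texts."""
        if genre not in self.accepted_genres:
            return False
        self.texts.append({"text": text, "genre": genre, "year": year})
        return True

    def size_in_words(self):
        """Corpus size measured as the number of whitespace-separated words."""
        return sum(len(entry["text"].split()) for entry in self.texts)

# Hypothetical criterion: only press and fiction texts are admitted.
corpus = MonitorCorpus(accepted_genres=frozenset({"press", "fiction"}))
corpus.add("Prices rose again this month", "press", 2001)
corpus.add("She opened the door slowly", "fiction", 2002)
rejected = corpus.add("lol see you later", "chat", 2003)  # fails the criterion

print(corpus.size_in_words())
print(rejected)
```

A closed corpus would correspond to the same structure with `add` disabled once the initial collection phase is over.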

Representativeness

A corpus is representative when it exhibits the full range of variability and properties of a language. This means that a corpus must provide as plausible a model as possible of the linguistic properties of the language analyzed, so that what is observed in the texts of the corpus can be generalized to the entire population of texts in that language.

Authenticity

This is another condition required to achieve a representative corpus. Authentic texts are those that have been created under conditions of natural communication. This is especially true for spoken language. For example, plays, poetry, movie subtitles and similar material are considered not very authentic and too specific as corpus sources. Because of the inevitable influence of the original language, translations are not included in general corpora (they are included in parallel corpora). There are other problems related to authenticity, such as prescription. As a rule, the collected samples are not corrected, shortened or changed. Even typical errors in English-language newspapers and magazines are considered valuable in some way: they make it possible to notice regularities in how spelling rules or other formal norms are violated, and to check the direction of trends in linguistic development.

Balance

Since corpora first emerged, efforts have been made to build them in a balanced way, drawing on various sources and following clear criteria. Only later did corpora appear that used all the texts collected (opportunistic corpora). Balance is achieved by fixing the proportions of the different sources according to certain criteria. The possible criteria are the following:

Elitism (the most valid sources)
Readership (best sellers, journalism)
Demographic indicators (widest variety of authors)
Accessibility

Regarding representativeness, the essential question is what a corpus should reflect. It is not enough to say that it should reflect the language or its variability, because this answer is not informative. It is therefore useful to break the composition down into four spheres of use (speaking, writing, listening and reading) and to take into account the number of users in each sphere. Opportunistic and balanced corpora are often related as distinct stages in the composition of a single corpus: first an opportunistic corpus is created and then, according to certain proportions, texts are selected from it to form a balanced corpus.
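The idea of fixing proportions per source can be illustrated with a small sampling sketch in Python. Everything concrete here is an assumption made for the example: the `build_balanced_corpus` function, the three categories, the 50/30/20 split and the placeholder text identifiers. The sketch also counts texts rather than words, which is a simplification.

```python
import random

def build_balanced_corpus(pool, proportions, total, seed=0):
    """Draw `total` texts from `pool` so that each category gets its target share.

    pool: dict mapping category -> list of texts (the full collection)
    proportions: dict mapping category -> fraction of the balanced corpus
    """
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = {}
    for category, share in proportions.items():
        quota = round(total * share)
        if quota > len(pool[category]):
            raise ValueError(f"not enough texts in category {category!r}")
        # Sample without replacement within each category.
        balanced[category] = rng.sample(pool[category], quota)
    return balanced

# Hypothetical collection of text identifiers, grouped by source type.
pool = {
    "press": [f"press_{i}" for i in range(50)],
    "fiction": [f"fiction_{i}" for i in range(30)],
    "academic": [f"academic_{i}" for i in range(20)],
}
proportions = {"press": 0.5, "fiction": 0.3, "academic": 0.2}

balanced = build_balanced_corpus(pool, proportions, total=20)
print({category: len(texts) for category, texts in balanced.items()})
```

In practice the proportions themselves are the hard part: they encode a decision about what the corpus should reflect, and the sampling step only enforces that decision.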

Corpus types

The typology of corpora

Linguistic corpora can be classified by the type of information they collect. Typologically, corpora can be subdivided according to:

the communicative channel (written and oral);
the form of presentation and storage (textual and multimodal; the latter consist of sound recordings, images and the corresponding transcripts);
the number of languages represented (mono-, bi- and multilingual, parallel corpora);
the annotation of the language (unannotated, or morphologically, phonetically or syntactically annotated);
the scope (general and specialized corpora);
the temporal approach to the language (synchronic and diachronic);
the degree of completion (finite (static) and continuous (dynamic)).

The different types of corpus are not mutually exclusive: a single corpus may combine the characteristics of two or more of the types described below. Some of these types are explained here.

General Corpus

Contains a wide variety of oral and written examples of the language that have been produced by people of various ages, regions, and social classes. A well-known example of this type of corpus is the British National Corpus.

Specialized Corpus

Although there is controversy about whether it is necessary to know in advance who is going to use a corpus and how, the tendency to create general-purpose corpora rather than specialized ones is becoming more and more noticeable. Specialized corpora are small, represent a specific area of the language, and are encoded (annotated) to meet the needs of individual researchers directly. Corpus compilers follow the "clean text policy": the original version of the corpus is not encoded and is not contaminated with markup of any kind, so that the needs of some researchers do not obstruct the work of others. Specialized and annotated corpora are therefore usually presented as separate versions of general corpora.

Synchronic corpus and diachronic corpus

A synchronic corpus contains linguistic examples collected at a single moment, that is, from one particular period. An example could be a corpus of Spanish from the beginning of the 19th century. A diachronic corpus collects texts from different periods, such as different centuries. It is used to see how words disappear, are introduced or change their meaning.
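A diachronic comparison of this kind can be sketched as a relative-frequency count per period. The Python snippet below is a minimal illustration: the `frequency_by_period` function, the period labels and the three one-sentence "texts" are invented for the example and do not come from any real corpus.

```python
from collections import defaultdict

def frequency_by_period(documents, word):
    """Return occurrences of `word` per 1,000 tokens for each period."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for period, text in documents:
        tokens = text.lower().split()
        totals[period] += len(tokens)
        hits[period] += tokens.count(word)
    # Periods are reported in sorted order; empty periods are skipped.
    return {p: 1000 * hits[p] / totals[p] for p in sorted(totals) if totals[p]}

# Invented toy data: (period, text) pairs standing in for dated texts.
documents = [
    ("1800s", "send the letter by coach"),
    ("1900s", "send the letter by post today"),
    ("2000s", "send the email today or send an email tomorrow"),
]

trend = frequency_by_period(documents, "email")
print(trend)
```

A word that is absent in early periods and present in later ones, as in this toy data, is the simplest signature of a newly introduced word; changes of meaning require looking at contexts, not just counts.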

Finite and continuous corpora

Finite corpora show the state of the language at a given moment. They are useful when compared with other similar corpora created at another point in time or for another language or dialect. Continuous corpora keep collecting new linguistic facts, and filters can be applied to them. They are normally composed of entire texts rather than fragments and are therefore not balanced. However, their size compensates for the imbalance.

Mono-, bi- and multilingual corpora

Monolingual corpora allow research on one language, while bi- or multilingual corpora collect examples from more than one. Three different subtypes of multilingual corpora can be distinguished:

Comparable corpora

These are corpora in which the texts of the different languages are comparable in size and content, but in which not all languages are necessarily treated with the same precision.

Parallel corpus

These are corpora that contain the same texts in every language included. A famous parallel corpus is the Bible in all the languages into which it has been translated.

Aligned corpus

These are parallel corpora that not only contain the same texts in all languages but also record which fragment of a text corresponds to which fragment of the text in the other language. These annotations are made at either the paragraph level or the sentence level.
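Sentence-level alignment can be pictured as a list of linked pairs. The following Python sketch is only an illustration under strong assumptions: the `AlignedPair` record, the `align_by_position` and `find_translations` helpers and the two-sentence Spanish and English samples are hypothetical, and the alignment is done naively by position, whereas real aligners must handle merged, split and omitted sentences.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlignedPair:
    """One alignment link between a source fragment and its translation."""
    pair_id: int
    source: str
    target: str

def align_by_position(source_sentences, target_sentences):
    """Naively align sentences one-to-one by their position in the text."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("one-to-one alignment needs equal sentence counts")
    return [
        AlignedPair(i, src, tgt)
        for i, (src, tgt) in enumerate(zip(source_sentences, target_sentences))
    ]

def find_translations(pairs, word):
    """Return the target sentences aligned with source sentences containing `word`."""
    return [p.target for p in pairs if word in p.source.lower().split()]

# Invented two-sentence sample in Spanish with an English translation.
spanish = ["el corpus es grande", "los textos son reales"]
english = ["the corpus is large", "the texts are real"]

pairs = align_by_position(spanish, english)
print(find_translations(pairs, "textos"))
```

Storing the link explicitly, rather than relying on line order, is what lets a query in one language retrieve the corresponding fragment in the other.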

Oral Corpus

In addition to written corpora, there are also corpora that collect samples of oral language (dialogues, interviews, lectures, etc.). In most cases, the spoken fragments are accompanied by orthographic or phonetic transcriptions.

The most familiar form of orthographic transcription is movie subtitling, while phonetic transcription uses a phonetic alphabet.

The quality of a corpus of oral language depends on the situation in which the communication takes place: background noise, speech errors, hesitations and other phenomena typical of spoken language are reflected in the transcription, as are volume and intonation.

Oral corpora are used to analyze the peculiarities of oral discourse (in this case, we usually work with transcribed corpora) and to study the phonic component (with recordings).

Learner corpus

These are data sets produced by foreign language learners, such as written essays or recordings.

Treebanks

(From English 'tree bank'.) Corpora with syntactic annotations. They are used for research and for setting up parsing programs (parsers).
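To make the notion of syntactic annotation concrete, here is a minimal Python sketch of one annotated sentence stored as a tree. The nested-tuple representation, the label set (S, NP, VP, DET, N, V), the example sentence and the helper functions are assumptions chosen for illustration; actual treebanks use their own formats and annotation schemes.

```python
# A constituency tree as nested tuples: (label, child, child, ...).
# A leaf is a (part_of_speech, word) pair whose second element is a string.
tree = (
    "S",
    ("NP", ("DET", "the"), ("N", "corpus")),
    ("VP", ("V", "contains"), ("NP", ("N", "texts"))),
)

def is_leaf(node):
    """A node is a leaf when it directly pairs a tag with a word."""
    return len(node) == 2 and isinstance(node[1], str)

def leaves(node):
    """Return the words of the sentence in left-to-right order."""
    if is_leaf(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def count_label(node, label):
    """Count how many nodes in the tree carry the given label."""
    own = 1 if node[0] == label else 0
    if is_leaf(node):
        return own
    return own + sum(count_label(child, label) for child in node[1:])

print(leaves(tree))
print(count_label(tree, "NP"))
```

Queries such as "how many noun phrases occur in the corpus" reduce to tree traversals of this kind, repeated over every annotated sentence.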
