Parser
A syntactic analyzer, or simply parser, is a computer program that parses a string of symbols according to the rules of a formal grammar. The term comes from the Latin pars, which means part (of speech). It is usually an integral part of a compiler, in which case it transforms the input into a syntax (derivation) tree.
Parsing converts the input text into other structures (commonly trees) that are more useful for subsequent processing and that capture the implicit hierarchy of the input. A lexical analyzer creates tokens from a sequence of input characters, and it is these tokens that are processed by the parser to build the data structure, for example a parse tree or an abstract syntax tree.
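For instance, for an input such as "3 + 4 * 5", the lexical and syntactic stages could look roughly like the following sketch (the token names and the nested-tuple tree shape are purely illustrative, not any fixed standard):

```python
# Hand-written token list for "3 + 4 * 5" (illustrative token names).
tokens = [("NUM", 3), ("OP", "+"), ("NUM", 4), ("OP", "*"), ("NUM", 5)]

# One possible parse tree, as nested tuples. Because "*" binds more
# tightly than "+", the multiplication ends up as a subtree of the
# addition, making the implicit hierarchy of the input explicit.
tree = ("+", ("NUM", 3), ("*", ("NUM", 4), ("NUM", 5)))
```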
For natural languages, it is used to generate diagrams of languages that use grammatical inflection, such as the Romance languages or Latin. The languages commonly recognized by parsers are the context-free languages. There is a formal justification for this: the context-free languages are exactly those recognizable by a pushdown automaton, so any parser that recognizes a context-free language is computationally equivalent to a pushdown automaton.
Parsers were studied extensively during the 1970s, and many recurring patterns of operation were identified, which made it possible to write programs that generate parsers automatically from a specification of the language syntax, for example in Backus-Naur form; examples are yacc, GNU Bison and JavaCC.
Languages
In some machine translation and natural language processing systems, texts written in human languages are parsed by computer programs. Human sentences are not easily parsed by programs because of the substantial ambiguity in the structure of human language. In order to parse natural language, researchers must first agree on the grammar to be used, and this choice is influenced by both linguistic and computational criteria; for example, some parsing systems use lexical functional grammar, but in general, parsing for grammars of this type is NP-complete.
The "Head-driven phrase structure grammar" is another formalism that has been popular in the community, but research efforts have focused on less complex algorithms like Penn Treebank's. The light parsing "Shallow parsing" is only responsible for finding the main components of the sentence such as nouns or verbs. Another popular strategy to avoid linguistic controversy is the dependency grammar.
Most modern parsers are at least partly statistical, meaning that they are based on training data that has been annotated by hand. This approach allows the system to gather information about how frequently certain constructions occur in specific contexts. Approaches of this kind have included probabilistic context-free grammars, maximum entropy systems, and neural networks.
The most successful systems use lexical statistics, that is, they take into account the grammatical categories of individual words. Such systems are vulnerable, however, because they end up with an excessive number of parameters and eventually require simplifications.
Natural language parsing algorithms cannot rely on the grammar having the convenient properties of a deliberately designed grammar, as grammars for programming languages do. Some grammatical formalisms are very difficult to parse computationally, so a context-free approximation is generally used as a first simplification, even if the underlying structure itself is not context-free.
Algorithms that use context-free grammars are often based on some variant of the Cocke-Younger-Kasami (CYK) algorithm, together with heuristics for pruning unpromising analyses. In any case, some approaches trade accuracy for speed, using, for example, linear-time versions of the shift-reduce algorithm. More recently developed approaches use one component that generates multiple candidate analyses and another that chooses the best option among them.
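As a rough sketch of the CYK approach mentioned above (the toy grammar, its encoding, and the function name are assumptions made for illustration; this version only answers yes or no rather than building a parse forest or doing any pruning):

```python
def cyk_recognize(words, grammar, start="S"):
    """Recognizer for a context-free grammar in Chomsky normal form.
    grammar maps each nonterminal to its productions: 1-tuples are
    terminal rules A -> w, 2-tuples are binary rules A -> B C."""
    n = len(words)
    # table[i][j] holds the nonterminals that derive the span words[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                      # length-1 spans
        for lhs, rhss in grammar.items():
            if (w,) in rhss:
                table[i][i].add(lhs)
    for span in range(2, n + 1):                       # longer spans
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                      # split point
                for lhs, rhss in grammar.items():
                    for rhs in rhss:
                        if (len(rhs) == 2
                                and rhs[0] in table[i][k]
                                and rhs[1] in table[k + 1][j]):
                            table[i][j].add(lhs)
    return start in table[0][n - 1]

# Toy grammar: S -> NP VP, VP -> V NP, NP -> "she" | "fish", V -> "eats"
toy = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP")],
    "NP": [("she",), ("fish",)],
    "V": [("eats",)],
}
print(cyk_recognize(["she", "eats", "fish"], toy))     # True
```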
Programming languages
The most common use of parsers is as part of the analysis phase of compilers, where they must analyze the source code of a programming language. Programming languages tend to be specified in terms of context-free grammars because fast and efficient parsers can be written for them.
Context-free grammars have limited expressiveness and can describe only a limited set of languages. Informally, the reason is that the memory of such a grammar is limited: it cannot remember the presence of a construct across an arbitrarily long input, which is necessary in a language where, for example, a variable must be declared before it can be referenced. More expressive grammars, however, cannot be parsed efficiently. For these reasons it is common to create a permissive parser for a context-free grammar that accepts a superset of the language (it accepts some invalid constructs); the incorrect constructs are then filtered out after the initial parse.
It is usually easy to define a context-free grammar that accepts all the constructs of a language; conversely, it is practically impossible to build a context-free grammar that admits only the desired constructs. In any case, most parsers are not written by hand but are produced by automatic parser generators.
Overview of the process
The following example illustrates the common case of parsing a programming language with two levels of grammar: lexical and syntactic.
The first stage is token generation, or lexical analysis, in which the input character stream is split into meaningful symbols defined by a grammar of regular expressions. For example, a calculator program with the input "12*(3+4)^2" would split it into the tokens 12, *, (, 3, +, 4, ), ^ and 2, each of which is a meaningful symbol in the context of an arithmetic expression. The lexical analyzer contains rules indicating that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens such as "12*" or "(3" will not be generated.
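A minimal sketch of such a tokenizer could look as follows (the token class names and the tokenize function are illustrative, not part of any particular tool):

```python
import re

# Each token class is defined by a regular expression (names are illustrative).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("OP",     r"[*+^()]"),
    ("SKIP",   r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup != "SKIP":      # drop whitespace
            yield (match.lastgroup, match.group())

print(list(tokenize("12*(3+4)^2")))
# [('NUMBER', '12'), ('OP', '*'), ('OP', '('), ('NUMBER', '3'), ('OP', '+'),
#  ('NUMBER', '4'), ('OP', ')'), ('OP', '^'), ('NUMBER', '2')]
```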
The next stage is syntactic analysis (parsing), which checks that the tokens form a valid expression. This is usually done with reference to a context-free grammar that recursively defines the components that can appear in an expression and the order in which they must appear. Not all rules that define a programming language can be expressed with a context-free grammar alone; examples include type validity and the proper declaration of identifiers. Such rules can be formally expressed with attribute grammars.
The final phase is semantic analysis, which works out the implications of the validated expression and carries out the appropriate actions. In the case of the calculator, the action is to evaluate the expression; a compiler, on the other hand, would generate code. Attribute grammars can also be used to define these actions.
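Continuing the calculator example, a minimal recursive-descent sketch (using the hypothetical tokenize function above) can carry out both phases at once: the mutually recursive functions mirror context-free rules such as expr -> term ('+' term)*, and the value each one returns plays the role of a synthesized attribute that evaluates the expression on the fly. The grammar and function names are assumptions made for illustration.

```python
def parse(tokens):
    """Recursive-descent sketch for:
    expr -> term ('+' term)*, term -> power ('*' power)*,
    power -> atom ('^' power)?, atom -> NUMBER | '(' expr ')'."""
    pos = 0

    def peek():
        return tokens[pos][1] if pos < len(tokens) else None

    def expect(symbol):
        nonlocal pos
        if peek() != symbol:
            raise SyntaxError(f"expected {symbol!r}, got {peek()!r}")
        pos += 1

    def expr():                    # sums
        value = term()
        while peek() == "+":
            expect("+")
            value += term()
        return value

    def term():                    # products
        value = power()
        while peek() == "*":
            expect("*")
            value *= power()
        return value

    def power():                   # right-associative exponentiation
        value = atom()
        if peek() == "^":
            expect("^")
            value **= power()
        return value

    def atom():                    # numbers and parenthesized expressions
        nonlocal pos
        kind, text = tokens[pos]
        if kind == "NUMBER":
            pos += 1
            return int(text)
        expect("(")
        value = expr()
        expect(")")
        return value

    return expr()

print(parse(list(tokenize("12*(3+4)^2"))))   # 588
```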
Dependency Analysis
Another method of performing syntactic analysis is to use dependency grammars, which arose as an alternative to phrase-structure grammars. In general terms, these grammars define a dependency relationship between each element of a construction (usually a sentence, but it may also be just a phrase) and its "head" (or nucleus). These elements can be words, tokens, lemmas, or even punctuation marks. In addition, an element 0, the "root", is designated as the head of the main constituent, usually the main verb of the sentence. It is important not to confuse dependencies with constituents, since dependency relationships generate unique, ordered pairs.
The criteria for determining the head H of a dependent D in a construction C are as follows:
- H determines the syntactic category of C and H can replace C.
- H determines the semantic category of C; D specifies H.
- H is mandatory; D can be optional.
- H selects D and determines whether D is mandatory.
- The form of D depends on H (agreement or government).
- The linear position of D is specified with respect to H.
However, these criteria can produce contradictions or inconsistencies with morphological or semantic criteria, and it is not always clear whether dependents are optional.
The task of a dependency parser is, given a sentence, to determine the head and the dependency type of each of its elements. An advantage of this kind of analysis is that certain problems posed by languages with relatively free word order can be avoided. There are many different ways to classify dependency types, but CoNLL (the Conference on Computational Natural Language Learning) has created a format in universal use for dependency parsing: CoNLL-U.
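For illustration, a CoNLL-U annotation is plain text with one token per line and ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the toy sentence and its annotation below are assumptions made for this example, not taken from any treebank:

```python
# Hypothetical CoNLL-U fragment for "She eats fish".
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
conllu = (
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\teats\teat\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tfish\tfish\tNOUN\t_\t_\t2\tobj\t_\t_\n"
)

for line in conllu.splitlines():
    cols = line.split("\t")
    form, head, deprel = cols[1], cols[6], cols[7]
    print(f"{form} -> head {head} ({deprel})")
# She -> head 2 (nsubj)
# eats -> head 0 (root)
# fish -> head 2 (obj)
```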
The results of the latest systems on different parsing benchmarks have been compiled on the shared task site; in 2017 the task was to create a multilingual parser, that is, one capable of parsing several different languages.
Classification
The essential task of a parser is to determine whether, and how, a given input can be derived from the start symbol using the rules of a formal grammar. There are essentially two ways of doing this:
- Top-down parser: the parser starts with the start symbol and tries to transform it into the input; intuitively, it progressively divides the input into smaller and smaller parts. LL parsers work this way; an example is JavaCC. A generalization of this approach is GLL (Generalized Left-to-right, Leftmost derivation) parsing.
- Bottom-up parser: the parser starts with the input and tries to work back to the start symbol; intuitively, it finds the smallest constituents first and progressively builds up the hierarchy of symbols until it reaches the start symbol. LR parsers work this way; an example is Yacc. Variants include SLR (Simple LR) and LALR (Look-Ahead LR) parsers, as well as GLR (Generalized Left-to-right, Rightmost derivation) parsing. A small sketch of both derivation orders follows this list.
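As a rough illustration of the two strategies (the toy grammar and the input string are made up for this example), consider the grammar E -> T '+' E | T, T -> 'n' and the input "n + n"; both kinds of parser arrive at the same structure but build it in opposite orders:

```python
# Top-down (as in LL parsers): start from the start symbol E and expand
# it step by step toward the input (a leftmost derivation).
top_down = ["E", "T + E", "n + E", "n + T", "n + n"]

# Bottom-up (as in LR parsers): start from the input and reduce substrings
# to nonterminals until only E remains (a rightmost derivation in reverse).
bottom_up = ["n + n", "T + n", "T + T", "T + E", "E"]
```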
Other types of parsers are:
- Recursive descent parser
- Chart parser
- Left corner parser
- LR parser
- LALR parser