Compiler

[Figure: block diagram of the operation of a typical multilanguage compiler.]

In computing, a compiler is a program that translates code written in one programming language (the source language) into another language (the object or target language). In this type of translator, the source language is usually a high-level language and the object language a low-level one, such as assembly language or machine code. This translation process is known as compilation.

Building a compiler involves dividing the process into a series of phases that vary in complexity. Generally, these phases are grouped into two tasks: analysis of the source program and synthesis of the object program.

  • Analysis: this checks the correctness of the source program according to the definition of the language in terms of formal language theory. It includes the phases of lexical analysis (decomposing the source program into lexical components, or tokens), syntactic analysis (grouping lexical components into grammatical phrases) and semantic analysis (checking the semantic validity of the sentences accepted in the syntactic analysis phase).
  • Synthesis: its objective is the generation of output expressed in the target language. It is usually formed by one or several combinations of code generation phases (usually producing intermediate code or object code) and code optimization phases (which seek an object program that is as efficient as possible according to its computational complexity or Kolmogorov complexity: execution time, space during execution, storage space out of execution, etc.).

Alternatively, the phases described for the analysis and synthesis tasks can be grouped into:

  • Analyzer or front-end: the part that analyzes the source code, checks its validity, generates the derivation tree, and fills in the values of the symbol table. This part is usually independent of the platform or system for which it compiles, and it comprises the phases from lexical analysis through the generation of intermediate code.
  • Generator or back-end: the part that generates machine code, specific to a platform, from the results of the analysis phase.

This split allows the same back-end to be reused to create machine code for several different source languages, and the front-end that analyzes a particular programming language to be combined with back-ends that produce machine code for multiple platforms.

History

In 1938, Konrad Zuse developed in Germany the Z1, the first electromechanical digital computer. It was followed in 1946 by the ENIAC, the first fully electronic computer, and later by the EDVAC (1951), one of the first stored-program electronic computers. At first, these machines executed instructions consisting of numeric codes that signaled to the machine's circuits the states corresponding to each operation; this notation was called machine language.

The first users of these computers soon discovered the advantage of writing their programs using mnemonic keys that were easier to remember than the numeric codes; in the end, all of those keys together were translated by hand into machine language. These sets of keys constitute the so-called assembly languages.

In spite of everything, assembly language was still tied to a machine, just easier to handle (machine instructions are replaced by mnemonics). Research was therefore directed toward creating a language that expressed the different actions to be performed in a way as simple as possible for a person. The first compiler was written by Grace Hopper in 1952, for the A-0 programming language. In 1950 John Backus led research at IBM on an algebraic language, and in 1954 his team began to develop a language that allowed mathematical formulas to be written in a form translatable by a computer. They called it FORTRAN (FORmula TRANslator); it was the first high-level language and was introduced in 1957 for use on the IBM 704.

Thus arose for the first time the concept of a translator: a program that translates one language into another. In the particular case where the language to be translated is a high-level language and the translated language is a low-level one, the term compiler is used.

Writing a compiler was difficult: the first FORTRAN compiler took 18 person-years to build, and it was still very simple. Its development was heavily influenced by the target machine on which it was to run. As an example, blank spaces were ignored because the peripheral used for program input (a punched-card reader) did not handle them correctly.

The first self-hosting compiler, that is, one capable of compiling its own source code, was created for Lisp by Hart and Levin at MIT in 1962. Since the 1970s it has become common practice to write a compiler in the same language it compiles, although Pascal and C have been widely used alternatives.

Creating a self-hosting compiler raises a problem called bootstrapping: the first compiler for a language must itself either be compiled by a compiler written in another language or be run in an interpreter.

Types of compilers

This taxonomy of compiler types is not exclusive, so there may be compilers that fall into several categories:

  • Cross compilers: they generate code for a platform other than that in which they are working.
  • Optimizing compilers: they make changes to the code to improve its efficiency while maintaining the functionality of the original program.
  • Single-pass compilers: generate the machine code from a single reading of the source code.
  • Multi-pass compilers: they need to read the source code several times before they can produce the machine code.
  • Just-in-time (JIT) compilers: they are part of an interpreter and compile parts of the code as needed.

In the early days of computing, compilers were considered some of the most complex software available.[citation needed]

The first compilers were written directly in machine language or in assembler. Once a compiler exists for a language, new versions of that compiler (or different compilers) can be written in the language it compiles.

There are tools that make it easier to write compilers or interpreters. These tools allow the skeleton of the parser to be generated from a formal definition of the source language, normally specified as a context-free grammar, leaving the compiler programmer only the task of programming the associated semantic actions.

Compilation process

It is the process by which instructions written in a particular programming language are translated into machine language. In addition to a translator, other programs may be needed to create an executable object program. A source program can be divided into modules stored in different files; the task of collecting the source program is often entrusted to a separate program called a preprocessor. The preprocessor may also expand abbreviations, called macros, into statements in the source language.

Normally, creating an executable program (a typical .exe file for Windows or DOS) involves two steps. The first step is called compilation (proper) and translates the source code, written in a programming language and stored in a file, into low-level code (usually object code, not directly machine language). The second step is called linking, in which the low-level code generated from all the compiled files and subprograms is combined with the code of the functions that exist in the compiler's libraries, so that the executable can communicate directly with the operating system; this finally turns the object code into machine code and generates an executable module.

These two steps can be performed separately, storing the output of the compilation phase in object files (a typical .obj for Microsoft Windows or DOS, .o for Unix) and linking them in a later phase, or the executable can be created directly, in which case the output of the compilation phase is kept only temporarily. A program might have parts written in several languages (e.g. C, C++, and assembler) that are compiled independently and then linked together to form a single executable module.

Stages of the process

The translation process is internally composed of several stages or phases, which perform different logical operations. It is useful to think of these phases as separate pieces within the translator, and they can actually be written as separately coded operations even though in practice they are often integrated together.

Analysis phase

Lexical analysis

Lexical analysis constitutes the first phase. The source program is read from left to right and grouped into lexical components (tokens), which are sequences of characters with a meaning. All white space, blank lines, comments, and other unnecessary information are removed from the source program. This phase also checks that the symbols of the language (keywords, operators, etc.) have been written correctly.

Since the task performed by the lexical analyzer is a special case of pattern matching, methods for specifying and recognizing patterns are needed; mainly, finite automata that accept regular expressions are used. Moreover, the lexical analyzer is the part of the translator that handles source-code input, and since this input often involves significant time expenditure, the lexical analyzer must operate as efficiently as possible.
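As a concrete illustration, the following is a minimal sketch of such a lexical analyzer for simple arithmetic expressions, written in Python. The token names, patterns, and the tokenize function are illustrative assumptions, not taken from any particular compiler; the sketch simply shows regular expressions driving the left-to-right grouping of characters into tokens and the discarding of white space.

import re

# Token specification: each token class is given by a regular expression.
# Names and patterns are illustrative inventions for this sketch.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),          # integer or real literal
    ("ID",     r"[A-Za-z][A-Za-z0-9]*"),   # a letter followed by letters/digits
    ("ASSIGN", r":="),
    ("OP",     r"[+\-*/]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"[ \t\n]+"),               # white space never reaches the parser
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Scan the source from left to right, yielding (token type, lexeme) pairs."""
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"illegal character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":      # discard white space, keep real tokens
            yield (match.lastgroup, match.group())
        pos = match.end()

print(list(tokenize("id1 := id2 + id3 * 60")))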

Syntactic analysis

In this phase the characters or tokens are hierarchically grouped into grammatical phrases that the compiler uses to synthesize the output. It checks if what is obtained from the previous phase is syntactically correct (obeys the grammar of the language). Usually, grammar phrases in the source program are represented by a parse tree.

The hierarchical structure of a program is usually expressed using recursive rules. For example, the following rules can be given as part of the definition of expressions:

  1. An identifier can be an expression.
  2. A number can be an expression.
  3. If expression1 and expression2 are expressions, then the following are also expressions:
    • expression1 + expression2
    • expression1 * expression2
    • (expression1)

Rules 1 and 2 are basic (non-recursive) rules, while rule 3 defines expressions based on operators applied to other expressions.
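A recursive-descent parser makes rule 3's recursion directly visible as functions that call themselves. The following Python sketch parses exactly the expressions defined by the three rules above; the function names and the tuple-based tree representation are illustrative assumptions, and the usual precedence of * over + is obtained by splitting rule 3 across two levels.

def parse_expression(tokens, pos=0):
    """expression -> term { '+' term }   (rule 3, '+' case)"""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        node = ("+", node, right)
    return node, pos

def parse_term(tokens, pos):
    """term -> factor { '*' factor }     (rule 3, '*' case)"""
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        right, pos = parse_factor(tokens, pos + 1)
        node = ("*", node, right)
    return node, pos

def parse_factor(tokens, pos):
    """factor -> identifier | number | '(' expression ')'   (rules 1 and 2)"""
    if tokens[pos] == "(":                 # parenthesized expression: recursion
        node, pos = parse_expression(tokens, pos + 1)
        assert tokens[pos] == ")", "unbalanced parentheses"
        return node, pos + 1
    return tokens[pos], pos + 1            # an identifier or a number is an expression

tree, _ = parse_expression(["a", "+", "b", "*", "(", "c", "+", "2", ")"])
print(tree)   # ('+', 'a', ('*', 'b', ('+', 'c', '2')))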

The division between lexical analysis and syntactic analysis is somewhat arbitrary. One factor in deciding where to split is whether a source-language construct is inherently recursive: lexical constructions do not require recursion, while syntactic constructions usually do. For example, no recursion is needed to recognize identifiers, which are typically strings of letters and digits beginning with a letter. They are recognized by simply examining the input stream until a character is found that is neither a letter nor a digit, and then grouping all the letters and digits read up to that point into an identifier token. On the other hand, this kind of analysis is not powerful enough to analyze expressions or statements; for example, it cannot properly match parentheses in expressions, or the words begin and end in statements, without imposing some kind of nesting or hierarchical structure on the input.

Semantic analysis

The semantic analysis phase reviews the source program to try to find semantic errors and gathers information about types for the later code generation phase. In it, the hierarchical structure determined by the parsing phase is used to identify the operators and operands of expressions and propositions.

An important component of semantic analysis is type checking. Here, the compiler checks whether each operator has operands permitted by the source-language specification. For example, the definitions of many programming languages require the compiler to report an error whenever a real number is used as the index of an array. However, the language specification may permit some operand coercions, for example when a binary arithmetic operator is applied to an integer and a real number. The compiler also checks that arrays are defined with the correct size.
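The following Python sketch shows what such type checking can look like for expression trees; the node shapes and type names are illustrative assumptions. It rejects a real-valued array index, as in the example above, and coerces an integer operand to real when a binary arithmetic operator mixes the two types.

def check_types(node):
    """Return the type of an expression node or raise a semantic error.
    Node shapes ('int', v), ('real', v), ('index', array, idx), (op, l, r)
    are illustrative assumptions for this sketch."""
    kind = node[0]
    if kind in ("int", "real"):
        return "integer" if kind == "int" else "real"
    if kind == "index":                       # array subscript a[i]
        if check_types(node[2]) != "integer":
            raise TypeError("array index must be an integer, not a real")
        return "element"
    if kind in ("+", "-", "*", "/"):          # binary arithmetic operator
        left, right = check_types(node[1]), check_types(node[2])
        if left == right:
            return left
        if {left, right} == {"integer", "real"}:
            return "real"                     # coerce the integer operand to real
        raise TypeError(f"cannot apply {kind} to {left} and {right}")
    raise TypeError(f"unknown construct {kind!r}")

print(check_types(("+", ("int", 2), ("*", ("real", 60.0), ("int", 3)))))  # real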

Synthesis phase

It consists of generating the object code equivalent to the source program. Object code is generated only when the source program is free of analysis errors.

The result can be machine language or assembly code. Memory locations are selected for each of the variables used by the program; then each intermediate instruction is translated into a sequence of machine instructions that performs the same task. A decisive aspect is the assignment of variables to registers.
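As a sketch of this step, the following Python function translates three-address instructions (introduced in the next section) into code for a hypothetical one-register machine. The mnemonics LD, ADD, MUL, and ST and the instruction format are illustrative assumptions, not a real instruction set.

def generate(instructions):
    """Naively translate (dest, op, arg1, arg2) three-address instructions:
    load the first operand, apply the operator, store into dest's location."""
    asm = []
    for dest, op, arg1, arg2 in instructions:
        asm.append(f"LD  R1, {arg1}")                        # operand into register R1
        asm.append(f"{'ADD' if op == '+' else 'MUL'} R1, {arg2}")
        asm.append(f"ST  R1, {dest}")                        # result back to memory
    return asm

print("\n".join(generate([("temp1", "*", "id3", "60.0"),
                          ("id1", "+", "id2", "temp1")])))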

Intermediate code generation

After syntactic and semantic analysis, some compilers generate an explicit intermediate representation of the source program. This intermediate representation must have two important properties: it must be easy to produce and easy to translate into the object program.

The intermediate representation can take various forms. One common intermediate form, called "three-address code", is similar to assembly language in that each instruction performs a single operation. Three-address code consists of a sequence of instructions, each of which has at most three operands. This intermediate representation has several properties (an example follows the list):

  • First: each three-address instruction has at most one operator, in addition to the assignment.
  • Second: the translator must generate a temporary name to hold the value computed by each instruction.
  • Third: some three-address instructions have fewer than three operands.
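For example, using entreal() to denote an integer-to-real conversion (the notation used in the code optimization section below), the assignment id1 := id2 + id3 * 60 could be translated into the sequence:

temp1 := entreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

Each statement applies at most one operator besides the assignment, and temp1, temp2, and temp3 are temporary names generated by the translator, exactly as the properties above describe.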

Code optimization

The code optimization phase consists of improving the intermediate code, so that it results in a machine code that is faster to execute. This phase of the synthesis stage is possible especially if the translator is a compiler (an interpreter can hardly optimize the object code). There is a lot of variation in the amount of code optimization that different compilers perform. In those that do a lot of optimization, called "optimizing compilers", a significant part of the compiler's time is spent in this phase. However, there are simple optimizations that significantly improve the execution time of the object program without slowing down the compilation too much.

Main data structures

The interaction between the algorithms used by the compiler phases and the data structures that support these phases is naturally very strong. The compiler writer strives to implement these algorithms as efficiently as possible, without increasing the complexity too much. Ideally, a compiler should be able to compile a program in a time proportional to the size of the program.

Tokens

When a lexical analyzer assembles the characters into a token, it generally represents the token symbolically, that is, as a value of an enumerated data type that represents the set of tokens in the source language. Sometimes it is also necessary to keep the string itself or other information derived from it, such as the name associated with an identifier token or the value of a number token.

In most languages the lexical analyzer only needs to generate one token at a time. In this case a simple global variable can be used to hold the token information. In other cases (the most notable example of which is FORTRAN), an array (or vector) of tokens may be necessary.
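A sketch of this representation in Python follows; the enumeration members and field names are illustrative assumptions.

from dataclasses import dataclass
from enum import Enum, auto

class TokenType(Enum):       # enumerated type representing the language's token set
    IDENTIFIER = auto()
    NUMBER = auto()
    PLUS = auto()
    ASSIGN = auto()

@dataclass
class Token:
    type: TokenType          # the symbolic token class
    lexeme: str              # the string itself, kept when needed
    value: object = None     # derived information, e.g. the numeric value

# A single global variable holding the current token, as most languages allow:
current_token = Token(TokenType.NUMBER, "60", 60)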

Syntax tree

If the parser generates a syntax tree, it is usually constructed as a standard pointer-based structure that is dynamically allocated as parsing occurs. The entire tree can then be kept as a single variable pointing to the root node. Each node in the structure is a record whose fields represent the information collected both by the parser and, later, by the semantic analyzer. For example, the data type of an expression can be held as a field in the syntax tree node for the expression.

Sometimes, to save space, these fields are allocated dynamically, or stored in other data structures, such as the symbol table, which allow for selective allocation and deallocation. In reality, each syntax tree node by itself may require different attributes to be stored, depending on the kind of language structure it represents. In this case, each node in the syntax tree may be represented by a variable record, with each node class containing only the information necessary for that case.
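A sketch of such a node record in Python (the field names are illustrative assumptions); the data_type field is the kind of attribute filled in later by the semantic analyzer.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SyntaxNode:
    """One dynamically allocated record per construct, pointing at its children."""
    kind: str                                    # e.g. "binary_op", "identifier"
    lexeme: str = ""                             # associated source text, if any
    children: List["SyntaxNode"] = field(default_factory=list)
    data_type: Optional[str] = None              # filled in by the semantic analyzer

# The whole tree is kept as a single variable pointing to the root node:
root = SyntaxNode("binary_op", "+",
                  [SyntaxNode("identifier", "id2"), SyntaxNode("identifier", "temp1")])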

Symbol table

This data structure holds the information associated with identifiers: functions, variables, constants, and data types. The symbol table interacts with almost every stage of the compiler: the lexer, parser, or semantic analyzer can introduce identifiers into the table; the semantic analyzer will add data types and other information; and the optimization and code generation phases will use the information provided by the symbol table to make appropriate object code selections.

Since the symbol table is accessed very frequently, insertion, deletion, and access operations need to be efficient, preferably constant-time. A standard data structure for this purpose is the hash (address-calculation) table, although various tree structures may also be used. Sometimes multiple tables are used and kept in a list or stack.
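A minimal sketch in Python, using the language's built-in hash table (dict) so that insert, delete, and access are constant time on average; the method names and stored attributes are illustrative assumptions.

class SymbolTable:
    """Hash-based symbol table; each identifier maps to its attributes."""
    def __init__(self):
        self._entries = {}

    def insert(self, name, **info):          # e.g. kind="variable", data_type="real"
        self._entries[name] = info

    def lookup(self, name):
        return self._entries.get(name)       # None if the identifier is unknown

    def delete(self, name):
        self._entries.pop(name, None)

table = SymbolTable()
table.insert("id3", kind="variable", data_type="real")
print(table.lookup("id3"))                    # {'kind': 'variable', 'data_type': 'real'}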

Table of literals

Fast search and insert are also essential for the table of literals, which stores constants and strings used in the program. However, a table of literals needs to prevent deletes because its data applies globally to the program and a constant or string will appear only once in this table. The table of literals is important in reducing the size of a program in memory by allowing the reuse of constants and strings. It is also needed for the code generator to construct symbolic addresses for literals and to insert data definitions into the object code file.
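A sketch of an insert-only literal table in Python (illustrative, with assumed method names): inserting the same literal twice returns the same symbolic position, so each constant or string is stored only once.

class LiteralTable:
    """Insert-only pool of constants and strings, with no delete operation."""
    def __init__(self):
        self._position = {}      # literal value -> index in the pool
        self._pool = []

    def insert(self, value):
        if value not in self._position:          # reuse an existing entry if possible
            self._position[value] = len(self._pool)
            self._pool.append(value)
        return self._position[value]             # symbolic address for the code generator

pool = LiteralTable()
assert pool.insert("hello") == pool.insert("hello")   # the string appears only once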

Intermediate code

Depending on the kind of intermediate code (for example, three-address code or P-code) and the kinds of optimizations performed, this code can be kept as an array of text strings, a temporary text file, or a list of linked structures. In compilers that perform complex optimizations, particular attention should be paid to the selection of representations that allow for easy reorganization.

Code optimization

The code optimization phase tries to improve the intermediate code so that it results in faster-executing machine code. Some optimizations are trivial. For example, a natural algorithm generates intermediate code using one statement for each operator in the tree representation after semantic analysis (the four-statement sequence shown in the section on intermediate code generation), even though the same calculation can be performed with just two statements:

temp1 := id3 * 60.0
id1 := id2 + temp1

There is nothing wrong with the simple algorithm, since the problem can be fixed in the code optimization phase. The compiler can deduce that the conversion of 60 from integer to real can be done once and for all at compile time, so the entreal() operation can be eliminated. Moreover, temp3 is used only once, to pass its value to id1, so it is safe to replace temp3 with id1, whereupon the last statement of the four-statement sequence is no longer needed and the two statements above are obtained.
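The following Python sketch mechanizes two of the simple rewrites just described on (destination, expression) pairs: folding the entreal() conversion at compile time, and removing a copy through a temporary that is used only once. The representation and helper logic are illustrative assumptions; a real optimizer would also propagate the folded constant into its use to reach the two-statement form shown above.

def optimize(code):
    """Fold entreal(integer constant) and eliminate single-use copy temporaries.
    Assumes the copied temporary is defined by the immediately preceding
    statement and used nowhere else, as in the example above."""
    folded = []
    for dest, expr in code:
        if expr.startswith("entreal(") and expr[8:-1].isdigit():
            expr = expr[8:-1] + ".0"             # do the conversion at compile time
        folded.append((dest, expr))
    result = []
    for dest, expr in folded:
        if result and expr == result[-1][0] and expr.startswith("temp"):
            _, previous_expr = result.pop()      # e.g. replace temp3 with id1
            result.append((dest, previous_expr))
        else:
            result.append((dest, expr))
    return result

code = [("temp1", "entreal(60)"), ("temp2", "id3 * temp1"),
        ("temp3", "id2 + temp2"), ("id1", "temp3")]
for dest, expr in optimize(code):
    print(f"{dest} := {expr}")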


Temporary files

In the beginning, computers did not have enough memory to store an entire program during compilation. This problem was solved by using temporary files to hold the products of intermediate steps during translation, or by compiling "on the fly", i.e. keeping only enough information from earlier parts of the source program to allow translation to proceed.

Memory limitations are now much less of a problem, and it is possible to require an entire compilation unit to be kept in memory, especially if separate compilation is available in the language. Even so, compilers occasionally find it useful to generate intermediate files during some stage of processing; a typical case is the need to backpatch addresses during code generation.
