What is a Compiler in Programming?

“I am not able to understand what this programmer is saying. Where is my compiler? Call the compiler, I want to do what this programmer is saying.” - Computer.

A compiler is a computer program that converts source code into target code. Source code is the program written in one of the programming languages, and target code is the machine-language program the compiler generates from it.

The compiler translates the source code, written in a high-level language, into target code in a low-level language. The target code, also known as assembly language, object code, or machine code, is used to create an executable program.
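As a small illustration of this translation, Python's built-in `compile()` function and `dis` module can show a high-level statement being lowered into simpler target instructions. The target here is Python bytecode rather than native machine code, but the principle is the same: high-level source in, low-level instructions out.

```python
import dis

# High-level source code: a single assignment statement.
source = "result = (2 + 3) * 4"

# compile() translates the source string into a code object,
# Python's "target code" (bytecode, not native machine code).
code_object = compile(source, filename="<example>", mode="exec")

# dis.dis() prints the low-level instructions the statement
# was translated into.
dis.dis(code_object)
```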

There exist many different compilers, such as Ada compilers, ALGOL 60 compilers, ALGOL 68 compilers, assemblers (Intel x86), BASIC compilers, C compilers, C++ compilers, and COBOL compilers; each language has its own associated set of compilers.

Each compiler produces target code for the CPU and operating system on which it runs. A cross-compiler, in contrast, generates target code whose instruction format matches a different CPU or operating system.

A bootstrap compiler is written in the very language it is meant to compile: it converts the source code of that programming language into machine code.

It is also possible to convert target code back into source code using a decompiler. A decompiler is a program that transforms a low-level language back into a high-level language.

There also exist compilers that translate high-level language programs into another high-level language. Compilers that convert source code into another form of source code are called source-to-source compilers (or transpilers).

A compiler performs a number of different operations, organized into phases. The phases of compilation include preprocessing, lexical analysis, parsing, semantic analysis, intermediate code generation, code optimization, and finally code generation.

The compiler first preprocesses the source code into output that is used as the input to the next stage. In this phase it performs textual substitutions and macro expansions. There exist lexical preprocessors, syntactic preprocessors, and a few general-purpose preprocessors.

Lexical preprocessors perform lexical analysis and substitute tokens according to defined rules. They carry out different kinds of substitution, such as macro substitution, textual inclusion, and conditional compilation.
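A minimal sketch of textual macro substitution, assuming a hypothetical macro table (the names `MAX_USERS` and `GREETING` are made up for illustration): the preprocessor simply rewrites every occurrence of a defined macro name before the lexer ever sees the text.

```python
import re

# Hypothetical macro table: macro names and their replacement text.
MACROS = {
    "MAX_USERS": "100",
    "GREETING": '"hello"',
}

def preprocess(source: str) -> str:
    """Replace each whole-word macro name with its definition."""
    # \b ensures only complete identifiers match, so a macro named
    # MAX would not rewrite part of MAX_USERS.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, MACROS)) + r")\b")
    return pattern.sub(lambda m: MACROS[m.group(1)], source)

print(preprocess("limit = MAX_USERS; msg = GREETING"))
# -> limit = 100; msg = "hello"
```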

Next are syntactic preprocessors, best known from the Lisp family of languages. A syntactic preprocessor transforms the syntax tree according to user-defined rules.

The remaining kind is the general-purpose preprocessor: a program that does not target any specific language and can perform many different text-processing tasks.

The next phase in the process of compilation is lexical analysis, the technique by which a stream of characters is converted into a sequence of tokens. The lexical analyzer performs this first pass of analysis over the source code.

A lexical analyzer typically performs lexical analysis in one pass. The source code is scanned, the input string is segmented into lexemes, and these lexemes are then categorized into their respective token classes.

Lexical analysis is also done in natural language processing, to segment text into words and other units. A lexeme is a sequence of characters in the source program that matches the pattern of some token. A token is a string with a name and, where relevant, an associated value. Commonly used token classes are identifier, keyword, separator, operator, literal, and comment.

Each programming language has its own lexical grammar, which defines the lexical syntax and is usually specified with regular expressions; the lexical syntax describes which sequences of characters form a token. The lexical analyzer scans the source code, picks out the matching strings, and takes the appropriate action to produce a token.
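A minimal regex-based tokenizer sketch for a toy expression language (the token names and patterns are illustrative, not taken from any particular compiler). Note how white space is matched but discarded, as discussed below.

```python
import re

# The lexical grammar: each token class is a regular expression,
# tried in priority order.
TOKEN_SPEC = [
    ("NUMBER",   r"\d+(\.\d+)?"),   # integer or floating-point literal
    ("IDENT",    r"[A-Za-z_]\w*"),  # identifier
    ("OP",       r"[+\-*/=]"),      # operator
    ("LPAREN",   r"\("),
    ("RPAREN",   r"\)"),
    ("SKIP",     r"[ \t]+"),        # white space: non-significant
    ("MISMATCH", r"."),             # anything else is an error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})"
                             for name, pattern in TOKEN_SPEC))

def tokenize(source: str):
    """Segment the input string into (token_class, lexeme) pairs."""
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "SKIP":
            continue            # white space produces no token
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {lexeme!r}")
        yield kind, lexeme

print(list(tokenize("x = (price + 42) * 2")))
```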

Most programming languages allow white space and comments, and their treatment is defined by each language's compiler. Most compilers treat white space as non-significant and discard it, along with comments.

The source code is a stream of input characters; once classified, the resulting tokens are fed as input to the next stage. The parser then uses the identified tokens to build an abstract syntax tree, on which semantic analysis is later performed.

The abstract syntax tree is built using predefined rules for treating the tokens identified in the source code, so building a correct tree requires correct identification of tokens. To identify tokens correctly, the source code is searched for specific character sequences (flags), then for delimiters, and finally matched against a dictionary of known words. Lexers also search the source code for special characters and punctuation so that the abstract syntax tree can be built correctly.

The source code consists of identifiers, operators, grouping symbols, and other elements, and all of these must be identified to build a correct syntax tree. The identified tokens in the source code are then numbered and fed into the parser.

A finite-state machine identifies tokens in the source code by recognizing the sequences of characters that form them. To produce a complete token, an evaluator is used: the evaluator attaches a value to the lexeme, and a lexeme together with its associated value forms a token. There are different evaluators for identifiers, literals, and floating-point numbers.
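The same idea can be seen directly as a hand-written finite-state machine for a single token class. This is only a sketch of the mechanism; real lexers usually generate such machines from the lexical grammar rather than writing them by hand.

```python
def scan_identifier(source: str, pos: int):
    """A two-state finite-state machine recognizing identifiers.

    State 0 accepts a letter or underscore; state 1 loops on letters,
    digits, and underscores. Returns (token, new_pos), or (None, pos)
    if no identifier starts at this position.
    """
    start = pos
    # State 0: the first character must be a letter or '_'.
    if pos < len(source) and (source[pos].isalpha() or source[pos] == "_"):
        pos += 1
        # State 1: consume the remaining identifier characters.
        while pos < len(source) and (source[pos].isalnum() or source[pos] == "_"):
            pos += 1
        lexeme = source[start:pos]
        # The evaluator attaches a value to the lexeme; for an
        # identifier the value is simply the name itself.
        return ("IDENT", lexeme), pos
    return None, pos

print(scan_identifier("count1 = 0", 0))   # (('IDENT', 'count1'), 6)
```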

Tokenization is thus driven by the words used in the source code: the sequences of characters in the source are what build the tokens.

In lexical analysis, sequences of characters are classified into a stream of tokens. The lexical analyzer may either omit tokens or add them: white space and comments are omitted, while extra tokens (such as implicit statement terminators) may be added around the programming instructions.

Languages also differ in how tokenization handles constructs such as line continuations, semicolons, and the off-side rule.

Parsing is the process of analyzing a string of symbols. Parsing is done to determine the structure, and hence the meaning, of the sentences used in the source code. The analogy is the sentence diagram, which grammatically divides a natural-language sentence into subject and predicate.

In parsing, each sentence is analyzed to build a parse tree, whose nodes stand in syntactical relationships with one another. The parse tree can also carry semantic information as well as other kinds of information. If the source code is syntactically ambiguous, a parse forest, a collection of possible parse trees, is generated.

Thus parsing refers to analyzing the source code and categorizing its constituents into their respective parts for use by compilers and interpreters.

A parser is used to build a correct syntax tree. A parser takes source code as input and generates a data structure called a parse tree or abstract syntax tree. Although parsing is a process with multiple steps, these steps can often be combined into a single pass. The input to the parser is the output of the lexical analyzer.
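A minimal recursive-descent parser (a top-down technique, described further below) gives the flavor. It consumes (kind, lexeme) pairs such as the tokenizer sketch above produces and builds an abstract syntax tree out of nested tuples; the grammar is an illustrative toy, not any specific language's.

```python
# Toy grammar:  expr   -> term (('+' | '-') term)*
#               term   -> factor (('*' | '/') factor)*
#               factor -> NUMBER | '(' expr ')'

def parse(tokens):
    """Tokens in, abstract syntax tree out."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def expect(kind):
        nonlocal pos
        actual, lexeme = peek()
        if actual != kind:
            raise SyntaxError(f"expected {kind}, got {actual}")
        pos += 1
        return lexeme

    def expr():
        node = term()
        while peek()[1] in ("+", "-"):           # left to right
            node = (expect("OP"), node, term())
        return node

    def term():
        node = factor()
        while peek()[1] in ("*", "/"):
            node = (expect("OP"), node, factor())
        return node

    def factor():
        if peek()[0] == "NUMBER":
            return ("num", float(expect("NUMBER")))
        expect("LPAREN")
        node = expr()
        expect("RPAREN")
        return node

    return expr()

tokens = [("NUMBER", "2"), ("OP", "+"), ("NUMBER", "3"),
          ("OP", "*"), ("NUMBER", "4")]
print(parse(tokens))
# ('+', ('num', 2.0), ('*', ('num', 3.0), ('num', 4.0)))
```

Note how the resulting tree already encodes that `*` binds tighter than `+`: the multiplication sits deeper in the tree, which is exactly the kind of syntactical relationship a parse tree is meant to capture.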

A parser for a language is often produced by a tool called a parser generator. The parser generator takes a description of the grammar as input, and the generated parser extracts just the relevant structure of the text to build a parse tree. Token patterns are written as regular expressions; a set of regular expressions defines a regular language, and this is what is used for pattern matching and text extraction.

A parser is the compiler component that parses the source code into an internal representation from which the program's output is generated. The syntax of a programming language is typically described by a context-free grammar, and parsers for context-free grammars are fast and efficient.

There can be one-pass compilers or multi-pass compilers. Unfortunately, a context-free grammar cannot express all the requirements of a language: constraints that depend on remembering arbitrarily distant parts of the input, such as declaring a name before using it, are beyond it, because a parser with limited memory cannot remember arbitrarily long input.

Parsers based on context-free grammars are therefore deliberately relaxed, deferring such checks to later phases. In the parsing process, the input character stream has already been split into symbols with specified meanings, and regular expressions are what give each symbol its meaning.

The next process in parsing is syntactic analysis proper, in which the components that make up each expression are determined. After that comes evaluation: the expression is validated, and actions are taken accordingly.

There exist different types of parsers, chiefly the top-down parser and the bottom-up parser. In top-down parsing, the parser attempts to find the left-most derivation of the input character stream, using top-down expansion rules: tokens are processed in sequence, beginning at the left and ending at the right.

Next is bottom-up parsing. In bottom-up parsing, the parser takes the input and rewrites it back up to the start symbol, locating the most basic elements first. An example of a bottom-up parser is the LR parser.

The next step in the compilation process is semantic analysis, the step that comes after parsing. In semantic analysis, information is collected from the source code and checks are performed that syntax alone cannot express, such as type checking and verifying that each variable is declared before it is used. In practice these checks may be interleaved with parsing.
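A sketch of one such semantic check, declared-before-use, over a hypothetical list of simplified statements (real compilers walk the abstract syntax tree with full symbol tables, but the idea is the same):

```python
def check_declared_before_use(statements):
    """Flag variables that are used before they are declared.

    Each statement is a simplified (target, used_names) pair,
    e.g. ('x', ['y']) standing for `x = y`.
    """
    declared = set()          # the symbol table so far
    errors = []
    for target, used_names in statements:
        for name in used_names:
            if name not in declared:
                errors.append(f"'{name}' used before declaration")
        declared.add(target)  # the assignment declares its target
    return errors

program = [
    ("a", []),       # a = 1
    ("b", ["a"]),    # b = a + 2  -> fine, 'a' is already declared
    ("c", ["d"]),    # c = d * 3  -> 'd' was never declared
]
print(check_declared_before_use(program))
# ["'d' used before declaration"]
```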

After semantic analysis, the next step is the intermediate representation, a data structure designed for the compiler's own use during optimization and translation. An intermediate representation must be able to represent the source code faithfully, without being tied to any particular source or target language. It can take different forms, such as an in-memory data structure, a program written as tuples, or a program written for a stack machine.

An intermediate language is used in the analysis of the program before machine code is generated: the source code is translated into a form that is convenient to improve. An intermediate language differs from a machine language in several ways: each instruction represents exactly one basic operation, the instruction set need not track control-flow information, and the number of available registers may be large, even unlimited, rather than fixed by the processor.

The most common format used for an intermediate language is three-address code.
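A sketch of lowering the nested-tuple AST from the parser example above into three-address code; temporary names like `t1` are conventional, not mandated. Every generated instruction performs exactly one basic operation.

```python
from itertools import count

def to_three_address_code(ast):
    """Lower a nested-tuple expression AST into three-address code."""
    code = []
    temps = count(1)

    def emit(node):
        if node[0] == "num":          # a literal is its own address
            return str(node[1])
        op, left, right = node
        left_addr, right_addr = emit(left), emit(right)
        temp = f"t{next(temps)}"      # fresh temporary for this result
        code.append(f"{temp} = {left_addr} {op} {right_addr}")
        return temp

    emit(ast)
    return code

ast = ("+", ("num", 2.0), ("*", ("num", 3.0), ("num", 4.0)))   # 2 + 3 * 4
print("\n".join(to_three_address_code(ast)))
# t1 = 3.0 * 4.0
# t2 = 2.0 + t1
```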

A language that targets a virtual machine or p-code machine can also be treated as an intermediate language. A few examples of intermediate languages include Java bytecode, Microsoft's Common Intermediate Language, MATLAB, and Pascal p-code.

To achieve platform independence within a compiler, further intermediate languages are used; among these are Register Transfer Language, GENERIC, and SSA form.

The next step that the compiler conducts after intermediate code generation is code optimization. The main objective of code optimization is to generate code that is more efficient and effective: it aims at faster execution of the program and better utilisation of the program's memory.
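One classic local optimization is constant folding: subexpressions whose operands are all known at compile time are evaluated once, by the compiler, instead of every time the program runs. A sketch over the same nested-tuple AST used above:

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def fold_constants(node):
    """Evaluate constant subexpressions at compile time."""
    if node[0] == "num":
        return node
    op, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    if left[0] == "num" and right[0] == "num":
        # Both operands are known now: do the arithmetic once,
        # at compile time, and emit the result as a literal.
        return ("num", OPS[op](left[1], right[1]))
    return (op, left, right)

print(fold_constants(("+", ("num", 2.0), ("*", ("num", 3.0), ("num", 4.0)))))
# ('num', 14.0)
```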

Code optimization can happen at a number of different levels. High-level optimization tends to influence the source code strongly, and a large amount of code may need to change. However, it is also possible for a high degree of optimization to be achieved with only small changes to the source, or for only a small improvement to result from extensive changes.

Much of a program's overall performance is usually determined by a small portion of the code. Moreover, code optimization is not achieved in a single step: it is a cyclic process that continues until further optimization becomes too costly.

Code optimization is a process best begun at the start of the project so that the developed software is delivered on time and already optimized. One way to achieve this is to develop prototypes with the objective that each prototype is optimized; when the prototypes are optimized, the resulting system is optimized as well.

The overall design of the software influences the performance of the system, and the system's goals decide its design. For example, when the objective is fast compilation, a one-pass compiler is used rather than a multi-pass compiler.

Code optimization is also influenced by the choice of programming language and by the hardware on which the program will execute. It further depends on the framework on which the software is developed. Finally, code optimization is easier to achieve when the program is written in a modular way.

The next element that influences code optimization is the efficient use of algorithms and their associated data structures. If the algorithms and data structures are implemented efficiently, the code is to a large extent optimized already. Selecting an appropriate data structure is a must: an efficient, appropriate data structure influences the code's overall performance.

The selection of an algorithm is also critical in achieving code optimization. For example, an algorithm whose time complexity is constant is preferred over an algorithm whose time complexity is quadratic.
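A small, concrete illustration of why this matters: a membership test on a Python list scans elements one by one (linear time), while the same test on a set is a hash lookup (constant time on average).

```python
import timeit

items = list(range(100_000))
as_list, as_set = items, set(items)

# Searching a list examines elements one by one: O(n) per lookup.
list_time = timeit.timeit(lambda: 99_999 in as_list, number=100)

# A set is hash-based, so each lookup is O(1) on average.
set_time = timeit.timeit(lambda: 99_999 in as_set, number=100)

print(f"list: {list_time:.4f}s   set: {set_time:.6f}s")
```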

Memoization is another technique that can be used to improve the overall performance of the code. Memoization avoids duplicate computations by caching results that have already been computed; it is mostly used in recursive and iterative processes, and it reduces both the amount of computation and the time complexity of the program.
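In Python, memoization is available directly from the standard library via `functools.lru_cache`. Here it turns a naively exponential recursive Fibonacci into a linear-time one by caching every previously computed result:

```python
from functools import lru_cache

@lru_cache(maxsize=None)      # remember every result already computed
def fib(n: int) -> int:
    """Naive recursion, but each fib(k) is evaluated only once."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(200))   # answers instantly; uncached, this would never finish
```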

Code optimization can also be achieved by using an optimizing compiler, although making good use of one requires some understanding of what the compiler does. Compiler-level optimizations include transformations such as loop-invariant code motion and return-value optimization. Using an optimizing compiler helps ensure that the program being executed is optimized.
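Loop-invariant code motion can be shown by hand: a computation whose value does not change across iterations is hoisted out of the loop. Optimizing compilers perform this transformation automatically; the function names here are purely illustrative.

```python
import math

def before(values, x):
    total = 0.0
    for v in values:
        # math.sqrt(x) does not depend on the loop variable,
        # yet it is recomputed on every iteration.
        total += v * math.sqrt(x)
    return total

def after(values, x):
    scale = math.sqrt(x)      # hoisted: computed once, outside the loop
    total = 0.0
    for v in values:
        total += v * scale
    return total

assert before([1.0, 2.0, 3.0], 9.0) == after([1.0, 2.0, 3.0], 9.0)
```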

The most aggressive code optimization has traditionally been achieved by writing in assembly language, whose code is machine-dependent. Most programs, however, are written in a high-level language and thus depend on compiler optimization; and as technology evolves, it is becoming ever harder to beat the compiler with hand-written assembly.

These phases of compilation are implemented as modular components, so that the compiler can transform source input into target output with an efficient design while guaranteeing the correctness of the transformation.

Compilers also target the platform on which their output will be used, and this influences the compilation process. If the compiler's output is meant to run on the same machine that hosts the compiler, it is called a native compiler. If the compiler's output executes on a different machine, a cross-compiler is used.