15. Compiler Design
15. Compiler Design
15. Compiler Design
specifying each of its seven parts. However, going to that level of detail can be cumbersome for all but
the tiniest Turing machines. Accordingly, we won‘t spend much time giving such descriptions. Mostly
wewill give only higher-level descriptions because they are precise enough for our purposes and are much
easier to understand. Nevertheless, it is important to remember that every higher-level description is
actually just shorthand for its formal counterpart. With patience and care we could describe any of the
Turing machines in this module in complete formal detail.
Activity 4.5
1. What is a compiler?
2. List and Explain steps of Compiler?
3. Why we design compiler?
The high-level language is converted into binary language in various phases. A compiler is a
program that converts high-level language to assembly language. Similarly, an assembler is a
program that converts the assembly language to machine-level language.
For example, a typical program using C compiler, is executed on a host machine in the following
manner.
Before diving straight into the concepts of compilers, we should understand a few other tools that
work closely with compilers.
Preprocessor
A preprocessor, generally considered as a part of compiler, is a tool that produces input for
compilers. It deals with macro-processing, augmentation, file inclusion, language extension, etc.
Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language.
The difference lies in the way they read the source code or input. A compiler reads the whole
source code at once, creates tokens, checks semantics, generates intermediate code, executes the
whole program and may involve many passes. In contrast, an interpreter reads a statement from
the input, converts it to an intermediate code, executes it, then takes the next statement in
sequence. If an error occurs, an interpreter stops execution and reports it. Whereas a compiler
reads the whole program even if it encounters several errors.
Assembler
An assembler translates assembly language programs into machine code. The output of an
assembler is called an object file, which contains a combination of machine instructions as well
as the data required to place these instructions in memory.
Linker
Linker is a computer program that links and merges various object files together in order to make
an executable file. All these files might have been compiled by separate assemblers. The major
task of a linker is to search and locate referenced module/routines in a program and to determine
the memory location where these codes will be loaded, making the program instruction to have
absolute references.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for platform
(B) is called a cross-compiler.
Source-to-source Compiler
A compiler that takes the source code of one programming language and translates it into the
source code of another programming language is called a source-to-source compiler.
Pass: A pass refers to the traversal of a compiler through the entire program.
Phase: A phase of a compiler is a distinguishable stage, which takes input from the previous
stage, processes and yields output that can be used as input for the next stage. A pass can have
more than one phase. The compilation process is a sequence of various phases. Each phase takes
input from its previous stage, has its own representation of source program, and feeds its output
to the next phase of the compiler. Let us understand the phases of a compiler.
The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the
stream of characters making up the source program. And groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of
the form (token-name, attribute-value) that it passes on to the subsequent phase, syntax analysis.
In the token, the first component token-name is an abstract symbol that is used during syntax
analysis, and the second component attribute-value points to an entry in the symbol table for this
token. Information from the symbol-table entry is needed for semantic analysis and code
generation.
poison=initial + rate*60
The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:
1. Position is a lexeme that would be mapped into a token (id, I), where id is an abstract
symbol standing for identifier and 1 point to the symbol table entry for position. The
symbol-table entry for an identifier holds information about the identifier, such as its
name and type.
2. The assignment symbol = is a lexeme that is mapped into the token (=). Since this token
needs no attribute-value, we have omitted the second component. We could have used
any abstract symbol such as assign for the token-name, but for notational convenience we
have chosen to use the lexeme itself as the name of the abstract symbol.
3. Initial is a lexeme that is mapped into the token (id,2), where 2 points to the symbol-table
entry for initial .
4. + is a lexeme that is mapped into the token (+).
5. Rate is a lexeme that is mapped into the token (id,3), where 3 points to the symbol-table
entry for rate.
6. * Is a lexeme that is mapped into the token (*).
7. 60 is a lexeme that is mapped into the token (60).'
Syntax Analysis
The second phase of the compiler is syntax analysis or parsing. The parser uses the first
components of the tokens produced by the lexical analyzer to create a tree-like intermediate
representation that depicts the grammatical structure of the token stream. A typical
representation is a syntax tree in which each interior node represents an operation and the
poison=initial + rate*60
Are to be performed. The tree has an interior node labeled * with (id,3) as its left child and the
integer 60 as its right child. The node (id,3) represents the identifier rate. The node labeled *
makes it explicit that we must first multiply the value of r a t e by 60. The node labeled +
indicates that we must add the result of this multiplication to the value of initial. The root of the
tree, labeled =, indicates that we must store the result of this addition into the location for the
identifier position. This ordering of operations is consistent with the usual conventions of
arithmetic which tell us that multiplication has higher precedence than addition, and hence that
the multiplication is to be performed before the addition. The subsequent phases of the compiler
use the grammatical structure to help analyze the source program and generate the target
program.
Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the symbol table to check the
source program for semantic consistency with the language definition. It also gathers type
information and saves it in either the syntax tree or the symbol table, for subsequent use during
intermediate-code generation.
An important part of semantic analysis is typing checking, where the compiler checks that each
operator has matching operands. For example, many programming language definitions require
an array index to be an integer; the compiler must report an error if a floating-point number is
used to index an array.
The language specification may permit some type conversions called coercions. For example, a
binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-point
numbers. If the operator is applied to a floating-point number and an integer, the compiler may
convert or coerce the integer into a floating-point number
Semantic analysis checks whether the parse tree constructed follows the rules of language. For
example, assignment of values is between compatible data types, and adding string to an integer.
After syntax and semantic analysis of the source program, many compilers generate an explicit
low-level or machine-like intermediate representation, which we can think of as a program for an
abstract machine. This intermediate representation should have two important properties:
Code Optimization
The machine-independent code-optimization phase attempts to improve the intermediate code so
that better target code will result. Usually better means faster, but other objectives may be
desired, such as shorter code, or target code that consumes less power. For example, a
straightforward algorithm generates the intermediate code (1.3), using an instruction for each
t1 = id3 * 60.0
id1 = id2 + t1
There is a great variation in the amount of code optimization different compilers perform. In
those that do the most, the so-called "optimizing compilers," a significant amount of time is
spent on this phase. There are simple optimizations that significantly improve the running time
of the target program without slowing down compilation too much Code Generation
Code Generation
The code generator takes as input an intermediate representation of the source program and maps
it into the target language. If the target language is machine code, registers or memory locations
are selected for each of the variables used by the program. Then, the intermediate instructions are
translated into sequences of machine instructions that perform the same task. A crucial aspect of
code generation is the judicious assignment of registers to hold variables. For example, using
registers R 1and R2,the intermediate code in (1.4) might get translated into the machine code
LDF R2 , id3
MULF R2 , R2 , #60.0
LDF Rl , id2
ADDF Rl , Rl , R2
STF idl , Rl
The first operand of each instruction specifies a destination. The F in each instruction tells us that
it deals with floating-point numbers. The code in above loads the contents of address id3 into
register R2, then multiplies it with floating-point constant 60.0. The # signifies that 60.0 is to be
treated as an immediate constant. The third instruction moves id2 into register R1 and the fourth
adds to it the value previously computed in register R2. Finally, the value in register R1 is stored
into the address of idl, so the code correctly implements the assignment statement
These tools use specialized languages for specifying and implementing specific components, and
many use quite sophisticated algorithms. The most successful tools are those that hide the details
of the generation algorithm and produce components that can be easily integrated into the
remainder of the compiler. Some commonly used compiler-construction tools include
SELF-TEST QUESTIONS
1. Assume B is a subset of S and S is universal set, Based on this assumption, which one of
the following is not equal to B
A. B- B. B C. D.B S
2. Which of the following alternative is true about Demorgan‘s Law
A. = ⋂ C. S1 S2 = ⋂
B. . = ⋃ D. = ⋃
3. If we have a relation R in the set C= {0,1,2,. ,. ,. ,20} if our relation is a divides b, so
which statement can be true about R
A. R is symmetric C. R is transitive
B. . R is equivalence D. R is not a Relation
4. If S1 and S2 are two sets. Assume both sets contain same elements, then which of the
following statement can‘t true?
A. S1 S2 B. S2 S1 C. S1 S1 D. S1 S2
5. If a set A={ 0,1,2} then, identify the false statement from the following list
A. 2A B. {0,1,2} 2A C. 2|A|=3 D. {{0,1,2}} 2A
6. When do we say a function is injective?
A. If different element in domain have different images
B. If all element in the range are image of some element in domain
C. If all element in the domain are image of some element in range
D. If different element in range have different domain
7. One of the following cannot true statement about a tree
A. A tree is a connected graph with no circuits or loops.
B. In a tree there is one and only one path between every pair of vertices.
C. A tree with n vertices has n-1 edges.
D. If a disconnected graph with n vertices has n - 1 edge, then it is a tree
8. From the following list of automata which one is the most powerful
A. Finite Automata B. Turing Machine C. Pushdown Automata D. None
9. Sequence of finite symbols, which contain variables as well as terminals, are called ____
A. Strings B. Sentence C. Sentential Form D. Word
10. If we have a grammar in the form of A → Bx, or A → x. where A, B V, x T*
A. Left Linear Grammar C. Right Linear Grammar
B. Simple Grammar D. Context-Free Grammar
a b
b b b
q 2
a
A. All sets of strings from ={a,b} that ends with substring abb
B. All sets of strings from ={a,b} that starts with substring bab
C. All sets of strings from ={a,b} that ends with substring bab
D. All sets of strings from ={a,b} that starts with substring ab
12. Which of the following strings accepted by the finite automata in question 1
A. aaababaab C. bbbbabab
B. aaabbbaaabba D. aaabbaabbabb
13. From the following lists of automata, which one is with memory?
A. FA B. DFA C. NFA D. PDA
14. Considering the transition graph below, the language accepted by it is
q 0
a q b q b q
1 3 4
b a a
a, b
q 2
a, b
A. The set of all strings (a, b)* starting with ab
B. Set of all string (a, b)* starting with abb
C. The set of strings contains {abb}
D. set of strings contains (ab, abb}
15. What type of finite automata is the graph in question 4
A. Deterministic C. Nondeterministic
B. Context free D. Derivation tree
0,1 0,
A. 0110 B. 01001 C. 100100000 D. 000000
Considering the automaton given below and choose the correct answers for Questions 7-9:
0
0,1
a, b 1 a
1 1
q 2
0
17. M is a _____________
A. nondeterministic finite automaton
B. deterministic finite automaton accepting {0,1}*
C. DFA accepting all strings over {0,1} having 3m 0's and 3n 1's, m, n >=1
D. Deterministic finite automaton
18. M accepts
A.01110 B. 10001 C.01010 D. 11111
19. L (M) is equal to
A. {03m 13n| m,n>=0}
B. {03m 13n| m,n>=1}
C. {w| w has 111 as substring}
D. {w| w has 3n 1‘s, n>=1}
20. Which of the following statements are acceptable?
A. Automata is a complex model of digital computer
B. If a Language is accepted by any DFA there is equivalent NFA that accepts it.
C. Many languages may accept by single machine
D. Some regular languages may not have finite automata
21. The regular expression r= b*ab*ab* represents
A. All sets of strings from ={a,b} that have exactly two a‘s
B. All sets of strings from ={a,b} that have at least two a‘s
C. All sets of strings from ={a,b} that have at most two a‘s
D. All are possible answers