15. Compiler Design


As we did for finite and pushdown automata, we can formally describe a particular Turing machine by

specifying each of its seven parts. However, going to that level of detail can be cumbersome for all but
the tiniest Turing machines. Accordingly, we won't spend much time giving such descriptions. Mostly
we will give only higher-level descriptions because they are precise enough for our purposes and are much
easier to understand. Nevertheless, it is important to remember that every higher-level description is
actually just shorthand for its formal counterpart. With patience and care we could describe any of the
Turing machines in this module in complete formal detail.

Representation of Turing Machines

We can describe a Turing machine by employing


(i) Instantaneous descriptions (IDs) using move relations (├),
(ii) Transition table, and
(iii) Transition diagram (transition graph).

4.2. Compiler Design

Activity 4.5
1. What is a compiler?
2. List and explain the phases of a compiler.
3. Why do we design compilers?

 Implement a small compiler using modern compiler-writing tools.
 Provide the student with skills and knowledge (such as lexical analysis and parsing)
which are applicable to a broad range of computer science application areas (such as
text editors, information retrieval, etc.).
A compiler translates code written in one language into another language without changing
the meaning of the program. It is also expected that a compiler should make the target code
efficient and optimized in terms of time and space. Compiler design principles provide an in-
depth view of the translation and optimization process. Compiler design covers basic translation
mechanisms and error detection and recovery. It includes lexical, syntax, and semantic analysis as
the front end, and code generation and optimization as the back end.



Computers are a balanced mix of software and hardware. Hardware is just a piece of mechanical
equipment whose functions are controlled by compatible software. Hardware understands
instructions in the form of electronic charge, which is the counterpart of binary language in
software programming. Binary language has only two alphabets, 0 and 1. To instruct the
hardware, codes must be written in binary format, which is simply a series of 1s and 0s. It would
be a difficult and cumbersome task for programmers to write such codes directly, which is why
we have compilers to generate such codes.

4.2.1. Language Processing System


We have learnt that any computer system is made of hardware and software. The hardware
understands a language which humans cannot understand. So, we write programs in a high-level
language, which is easier for us to understand and remember. These programs are then fed into a
series of tools and operating system components to get the desired code that can be used by the
machine. This is known as the Language Processing System.



Figure: A language-processing system

The high-level language is converted into binary language in various phases. A compiler is a
program that converts high-level language to assembly language. Similarly, an assembler is a
program that converts the assembly language to machine-level language.

For example, a typical program using C compiler, is executed on a host machine in the following
manner.

 The user writes a program in C language (high-level language).
 The C compiler compiles the program and translates it into an assembly program (low-level language).
 An assembler then translates the assembly program into machine code (an object file).
 A linker tool is used to link all the parts of the program together for execution (executable machine code).
 A loader loads all of them into memory and then the program is executed.

Before diving straight into the concepts of compilers, we should understand a few other tools that
work closely with compilers.

Preprocessor
A preprocessor, generally considered a part of the compiler, is a tool that produces input for
compilers. It deals with macro processing, augmentation, file inclusion, language extension, etc.

Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language.
The difference lies in the way they read the source code. A compiler reads the whole
source code at once, creates tokens, checks semantics, generates intermediate code, and translates
the whole program, possibly in many passes. In contrast, an interpreter reads a statement from
the input, converts it to an intermediate code, executes it, and then takes the next statement in
sequence. If an error occurs, an interpreter stops execution and reports it, whereas a compiler
reads the whole program even if it encounters several errors.
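The contrast can be sketched with a toy "language" of name = integer assignments. The functions and the mini-language below are invented purely for illustration:

```python
# Interpreter style: translate and execute one statement at a time.
def interpret(lines):
    env = {}
    for line in lines:
        name, _, value = line.partition("=")
        env[name.strip()] = int(value)      # executed immediately
    return env

# Compiler style: translate the whole program first, run it later.
def compile_program(lines):
    ops = [(name.strip(), int(value))
           for name, _, value in (line.partition("=") for line in lines)]
    def run():
        env = {}
        for name, value in ops:
            env[name] = value
        return env
    return run                              # translation done; nothing run yet

program = ["x = 1", "y = 2"]
assert interpret(program) == compile_program(program)() == {"x": 1, "y": 2}
```

The interpreter executes each statement as it reads it; the compiler-style function finishes translating the entire program into `ops` before anything executes.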

Assembler
An assembler translates assembly language programs into machine code. The output of an
assembler is called an object file, which contains a combination of machine instructions as well
as the data required to place these instructions in memory.

Linker
A linker is a computer program that links and merges various object files together in order to make
an executable file. All these files might have been compiled by separate assemblers. The major
task of a linker is to search for and locate the referenced modules/routines in a program and to
determine the memory locations where these codes will be loaded, so that the program
instructions have absolute references.



Loader
The loader is a part of the operating system and is responsible for loading executable files into memory
and executing them. It calculates the size of a program (instructions and data) and creates memory
space for it. It initializes various registers to initiate execution.

Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for platform
(B) is called a cross-compiler.

Source-to-source Compiler
A compiler that takes the source code of one programming language and translates it into the
source code of another programming language is called a source-to-source compiler.

4.2.2. The Structure of a Compiler


Up to this point we have treated a compiler as a single box that maps a source program into a
semantically equivalent target program. If we open up this box a little, we see that there are two
parts to this mapping: analysis and synthesis.
Analysis Phase
Known as the front end of the compiler, the analysis phase reads the source
program, divides it into core parts, and then checks for lexical, grammar, and syntax errors. The
analysis phase generates an intermediate representation of the source program and a symbol table,
which are fed to the synthesis phase as input.



Synthesis Phase
Known as the back end of the compiler, the synthesis phase generates the target program with
the help of the intermediate source code representation and the symbol table.

A compiler can have many phases and passes.

Pass: A pass refers to the traversal of a compiler through the entire program.

Phase: A phase of a compiler is a distinguishable stage, which takes input from the previous
stage, processes and yields output that can be used as input for the next stage. A pass can have
more than one phase. The compilation process is a sequence of various phases. Each phase takes
input from its previous stage, has its own representation of source program, and feeds its output
to the next phase of the compiler. Let us understand the phases of a compiler.



Lexical Analysis

The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the
stream of characters making up the source program and groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of
the form (token-name, attribute-value) that it passes on to the subsequent phase, syntax analysis.
In the token, the first component token-name is an abstract symbol that is used during syntax
analysis, and the second component attribute-value points to an entry in the symbol table for this
token. Information from the symbol-table entry is needed for semantic analysis and code
generation.

For example, suppose a source program contains the assignment statement

position = initial + rate * 60

The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:

1. position is a lexeme that would be mapped into a token (id, 1), where id is an abstract
symbol standing for identifier and 1 points to the symbol-table entry for position. The
symbol-table entry for an identifier holds information about the identifier, such as its
name and type.
2. The assignment symbol = is a lexeme that is mapped into the token (=). Since this token
needs no attribute-value, we have omitted the second component. We could have used
any abstract symbol such as assign for the token-name, but for notational convenience we
have chosen to use the lexeme itself as the name of the abstract symbol.
3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table
entry for initial.
4. + is a lexeme that is mapped into the token (+).
5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table
entry for rate.
6. * is a lexeme that is mapped into the token (*).
7. 60 is a lexeme that is mapped into the token (60).



Blanks separating the lexemes would be discarded by the lexical analyzer. The figure below shows
the representation of the assignment statement after lexical analysis as a sequence of tokens.
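As a rough illustration of this phase, the sketch below scans the running example and produces the (token-name, attribute-value) pairs just described. The regular expressions and helper names are assumptions made for the example, not a fixed design:

```python
import re

# Token classes for the toy scanner; order matters (NUM before ID).
TOKEN_SPEC = [("NUM", r"\d+"), ("ID", r"[A-Za-z_]\w*"),
              ("OP", r"[=+*]"), ("WS", r"\s+")]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(source):
    symtab, tokens = [], []
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "WS":
            continue                        # blanks are discarded
        if kind == "ID":
            if lexeme not in symtab:
                symtab.append(lexeme)       # enter identifier once
            tokens.append(("id", symtab.index(lexeme) + 1))
        else:
            tokens.append((lexeme,))        # constants and operators
    return tokens, symtab

tokens, symtab = scan("position = initial + rate * 60")
assert tokens == [("id", 1), ("=",), ("id", 2), ("+",),
                  ("id", 3), ("*",), ("60",)]
assert symtab == ["position", "initial", "rate"]
```

Each identifier gets a symbol-table index as its attribute, while operators and constants carry no attribute beyond the lexeme itself, matching the token list above.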

Syntax Analysis
The second phase of the compiler is syntax analysis or parsing. The parser uses the first
components of the tokens produced by the lexical analyzer to create a tree-like intermediate
representation that depicts the grammatical structure of the token stream. A typical
representation is a syntax tree in which each interior node represents an operation and the



children of the node represent the arguments of the operation. A syntax tree for the token stream
is shown as the output of the syntactic analyzer in Fig. above. This tree shows the order in which
the operations in the assignment

position = initial + rate * 60

are to be performed. The tree has an interior node labeled * with (id, 3) as its left child and the
integer 60 as its right child. The node (id, 3) represents the identifier rate. The node labeled *
makes it explicit that we must first multiply the value of rate by 60. The node labeled +
indicates that we must add the result of this multiplication to the value of initial. The root of the
tree, labeled =, indicates that we must store the result of this addition into the location for the
identifier position. This ordering of operations is consistent with the usual conventions of
arithmetic, which tell us that multiplication has higher precedence than addition, and hence that
the multiplication is to be performed before the addition. The subsequent phases of the compiler
use the grammatical structure to help analyze the source program and generate the target
program.
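The precedence behaviour just described can be sketched with a small recursive-descent parser over the token stream. The node shapes and function names below are illustrative only:

```python
def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok
    def factor():                 # an identifier or a number token
        return take()
    def term():                   # handles *, binding tighter than +
        node = factor()
        while peek() and peek()[0] == "*":
            take()
            node = ("*", node, factor())
        return node
    def expr():                   # handles +
        node = term()
        while peek() and peek()[0] == "+":
            take()
            node = ("+", node, term())
        return node
    target = take()               # (id, 1)
    take()                        # the "=" token
    return ("=", target, expr())

tree = parse([("id", 1), ("=",), ("id", 2), ("+",),
              ("id", 3), ("*",), ("60",)])
assert tree == ("=", ("id", 1),
                ("+", ("id", 2), ("*", ("id", 3), ("60",))))
```

Because `term` consumes all * operators before `expr` sees a +, the multiplication ends up deeper in the tree, exactly the ordering the text describes.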

Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the symbol table to check the
source program for semantic consistency with the language definition. It also gathers type
information and saves it in either the syntax tree or the symbol table, for subsequent use during
intermediate-code generation.

An important part of semantic analysis is type checking, where the compiler checks that each
operator has matching operands. For example, many programming language definitions require
an array index to be an integer; the compiler must report an error if a floating-point number is
used to index an array.

The language specification may permit some type conversions called coercions. For example, a
binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-point
numbers. If the operator is applied to a floating-point number and an integer, the compiler may
convert or coerce the integer into a floating-point number.

Semantic analysis checks whether the parse tree constructed follows the rules of the language; for
example, that assignments are between compatible data types, and that a string is not added to an integer.



Also, the semantic analyzer keeps track of identifiers, their types and expressions; whether
identifiers are declared before use or not etc. The semantic analyzer produces an annotated
syntax tree as an output.
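A minimal sketch of the coercion rule described above, assuming a tree and symbol-table types like those of the running example. The node shapes and the `annotate` helper are invented for illustration:

```python
def annotate(node, types):
    """Return (typed_tree, type); `types` maps symbol-table ids to types."""
    op = node[0]
    if op == "id":
        return node, types[node[1]]
    if op.isdigit():                        # integer constant like ("60",)
        return node, "int"
    left, ltype = annotate(node[1], types)
    right, rtype = annotate(node[2], types)
    if ltype != rtype:                      # coerce the integer side
        if ltype == "int":
            left, ltype = ("inttofloat", left), "float"
        else:
            right, rtype = ("inttofloat", right), "float"
    return (op, left, right), ltype

types = {1: "float", 2: "float", 3: "float"}   # position, initial, rate
tree = ("+", ("id", 2), ("*", ("id", 3), ("60",)))
typed, t = annotate(tree, types)
assert typed == ("+", ("id", 2),
                 ("*", ("id", 3), ("inttofloat", ("60",))))
assert t == "float"
```

The integer constant 60 is wrapped in an inttofloat node because its sibling operand rate has type float, which is exactly the coercion the intermediate code in the next section makes explicit.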

4.2.3. Intermediate Code Generation


In the process of translating a source program into target code, a compiler may construct one or
more intermediate representations, which can have a variety of forms. Syntax trees are a form of
intermediate representation; they are commonly used during syntax and semantic analysis.

After syntax and semantic analysis of the source program, many compilers generate an explicit
low-level or machine-like intermediate representation, which we can think of as a program for an
abstract machine. This intermediate representation should have two important properties:

 it should be easy to produce and


 it should be easy to translate into the target machine.

We consider an intermediate form called three-address code, which consists of a sequence of
assembly-like instructions with three operands per instruction. Each operand can act like a
register. The output of the intermediate code generator in the diagram above consists of the
three-address code sequence

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
There are several points worth noting about three-address instructions. First, each three-address
assignment instruction has at most one operator on the right side. Thus, these instructions fix the
order in which operations are to be done; the multiplication precedes the addition in the source
program.
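The sequence above can be produced by a simple tree walk that allocates a fresh temporary per operator. The helper below is an illustrative sketch, not a prescribed design:

```python
def gen_tac(tree):
    code = []
    n = 0
    def new_temp():
        nonlocal n
        n += 1
        return f"t{n}"
    def walk(node):
        op = node[0]
        if op == "id":
            return f"id{node[1]}"           # variable: symbol-table ref
        if op.isdigit():
            return op                       # constant: use its lexeme
        if op == "inttofloat":
            arg = walk(node[1])
            t = new_temp()
            code.append(f"{t} = inttofloat({arg})")
            return t
        left, right = walk(node[1]), walk(node[2])
        if op == "=":
            code.append(f"{left} = {right}")
            return left
        t = new_temp()                      # one temp per operator
        code.append(f"{t} = {left} {op} {right}")
        return t
    walk(tree)
    return code

tree = ("=", ("id", 1),
        ("+", ("id", 2), ("*", ("id", 3), ("inttofloat", ("60",)))))
assert gen_tac(tree) == ["t1 = inttofloat(60)", "t2 = id3 * t1",
                         "t3 = id2 + t2", "id1 = t3"]
```

Because each instruction carries at most one operator, the emission order of the temporaries fixes the evaluation order: the multiplication is emitted, and hence performed, before the addition.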

Code Optimization
The machine-independent code-optimization phase attempts to improve the intermediate code so
that better target code will result. Usually better means faster, but other objectives may be
desired, such as shorter code, or target code that consumes less power. For example, a
straightforward algorithm generates the intermediate code (1.3), using an instruction for each



operator in the tree representation that comes from the semantic analyzer. A simple intermediate
code generation algorithm followed by code optimization is a reasonable way to generate good
target code. The optimizer can deduce that the conversion of 60 from integer to floating point
can be done once and for all at compile time, so the int to float operation can be eliminated by
replacing the integer 60 by the floating-point number 60.0. Moreover, t3 is used only once to
transmit its value to id1 so the optimizer can transform (1.3) into the shorter sequence

t1 = id3 * 60.0
id1 = id2 + t1
There is great variation in the amount of code optimization different compilers perform. In
those that do the most, the so-called "optimizing compilers," a significant amount of time is
spent on this phase. There are simple optimizations that significantly improve the running time
of the target program without slowing down compilation too much.
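The two improvements just described, folding inttofloat(60) at compile time and eliminating the single-use temporary t3, can be sketched as a toy string-level pass. A real optimizer works on a proper intermediate representation; this version is illustration only, and its names are invented:

```python
def optimize(code):
    insns = [line.split(" = ", 1) for line in code]
    # 1. Fold inttofloat of an integer constant at compile time.
    consts = {}
    for ins in insns:
        for t, v in consts.items():
            ins[1] = ins[1].replace(t, v)
        if ins[1].startswith("inttofloat(") and ins[1][11:-1].isdigit():
            consts[ins[0]] = ins[1][11:-1] + ".0"
    insns = [ins for ins in insns if ins[0] not in consts]
    # 2. Substitute a temporary whose value is only copied once.
    uses = {}
    for _, rhs in insns:
        for tok in rhs.split():
            uses[tok] = uses.get(tok, 0) + 1
    for k in range(len(insns) - 1, 0, -1):
        if insns[k] is None or insns[k - 1] is None:
            continue
        (dest, rhs), (pdest, prhs) = insns[k], insns[k - 1]
        if pdest.startswith("t") and uses.get(pdest, 0) == 1 and rhs == pdest:
            insns[k] = [dest, prhs]         # e.g. id1 = t3 -> id1 = id2 + t2
            insns[k - 1] = None
    insns = [ins for ins in insns if ins is not None]
    # 3. Renumber the surviving temporaries (naive string rename).
    rename = {d: f"t{i + 1}"
              for i, (d, _) in enumerate(ins for ins in insns
                                         if ins[0].startswith("t"))}
    result = []
    for dest, rhs in insns:
        line = f"{dest} = {rhs}"
        for old, new in rename.items():
            line = line.replace(old, new)
        result.append(line)
    return result

assert optimize(["t1 = inttofloat(60)", "t2 = id3 * t1",
                 "t3 = id2 + t2", "id1 = t3"]) == \
       ["t1 = id3 * 60.0", "id1 = id2 + t1"]
```

The four-instruction sequence shrinks to the two-instruction sequence shown above, with the conversion done once at compile time.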

Code Generation
The code generator takes as input an intermediate representation of the source program and maps
it into the target language. If the target language is machine code, registers or memory locations
are selected for each of the variables used by the program. Then, the intermediate instructions are
translated into sequences of machine instructions that perform the same task. A crucial aspect of
code generation is the judicious assignment of registers to hold variables. For example, using
registers R1 and R2, the intermediate code in (1.4) might get translated into the machine code

LDF R2 , id3
MULF R2 , R2 , #60.0
LDF R1 , id2
ADDF R1 , R1 , R2
STF id1 , R1
The first operand of each instruction specifies a destination. The F in each instruction tells us that
it deals with floating-point numbers. The code above loads the contents of address id3 into
register R2, then multiplies it by the floating-point constant 60.0. The # signifies that 60.0 is to be
treated as an immediate constant. The third instruction moves id2 into register R1 and the fourth
adds to it the value previously computed in register R2. Finally, the value in register R1 is stored
into the address of id1, so the code correctly implements the assignment statement.
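A toy sketch of this last step translates the optimized three-address code into LDF/MULF/ADDF/STF instructions with a naive one-register-per-value policy. The register numbering therefore differs from the listing above (which uses R2 first), and the helper names are invented for illustration:

```python
OPS = {"*": "MULF", "+": "ADDF"}

def codegen(tac):
    asm, where = [], {}                     # where: value name -> register
    nreg = 0
    def fresh():
        nonlocal nreg
        nreg += 1
        return f"R{nreg}"
    def operand(x):
        if x.replace(".", "", 1).isdigit():
            return f"#{x}"                  # immediate constant
        if x not in where:                  # variable: load from memory
            where[x] = fresh()
            asm.append(f"LDF {where[x]} , {x}")
        return where[x]
    for line in tac:
        dest, rhs = line.split(" = ", 1)
        parts = rhs.split()
        if len(parts) == 3:                 # dest = a op b
            a, op, b = parts
            ra, rb = operand(a), operand(b)
            asm.append(f"{OPS[op]} {ra} , {ra} , {rb}")  # reuse left reg
            where[dest] = ra
            if not dest.startswith("t"):    # program variable: store it
                asm.append(f"STF {dest} , {ra}")
        else:                               # dest = single operand: store
            asm.append(f"STF {dest} , {operand(rhs)}")
    return asm

assert codegen(["t1 = id3 * 60.0", "id1 = id2 + t1"]) == [
    "LDF R1 , id3", "MULF R1 , R1 , #60.0",
    "LDF R2 , id2", "ADDF R2 , R2 , R1", "STF id1 , R2"]
```

Real code generators use far more careful register-allocation strategies; the point here is only the shape of the translation from three-address code to machine instructions.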



4.2.4. Compiler-Construction Tools
The compiler writer, like any software developer, can profitably use modern software
development environments containing tools such as language editors, debuggers, version
managers, profilers, test harnesses, and so on. In addition to these general software-development
tools, other more specialized tools have been created to help implement various phases of a
compiler.

These tools use specialized languages for specifying and implementing specific components, and
many use quite sophisticated algorithms. The most successful tools are those that hide the details
of the generation algorithm and produce components that can be easily integrated into the
remainder of the compiler. Some commonly used compiler-construction tools include

1. Parser generators that automatically produce syntax analyzers from a grammatical
description of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description
of the tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for walking a
parse tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a
target machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values
are transmitted from one part of a program to each other part. Data-flow analysis is a key
part of code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing
various phases of a compiler.
Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All the identifiers'
names along with their types are stored here. The symbol table makes it easier for the compiler to
quickly search the identifier record and retrieve it. The symbol table is also used for scope
management.
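A common way to support scope management is a stack of dictionaries, pushed on block entry and popped on exit, with lookup walking outward from the innermost scope. The class below is an illustrative sketch with invented method names:

```python
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                  # start with the global scope
    def enter_scope(self):
        self.scopes.append({})              # push on block entry
    def exit_scope(self):
        self.scopes.pop()                   # pop on block exit
    def declare(self, name, type_):
        self.scopes[-1][name] = type_       # declare in current scope
    def lookup(self, name):
        for scope in reversed(self.scopes): # innermost declaration wins
            if name in scope:
                return scope[name]
        return None                         # undeclared identifier

st = SymbolTable()
st.declare("rate", "float")
st.enter_scope()
st.declare("rate", "int")                   # shadows the outer rate
assert st.lookup("rate") == "int"
st.exit_scope()
assert st.lookup("rate") == "float"
```

Lookup returning None is one way the semantic analyzer can detect use of an identifier before its declaration.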



Chapter Review Questions

 SELF-TEST QUESTIONS
1. Assume B is a subset of S and S is the universal set. Based on this assumption, which one of
the following is not equal to B?
A. B- B. B C. D.B S
2. Which of the following alternatives is true about De Morgan's Law?
A. = ⋂ C. S1 S2 = ⋂
B. = ⋃ D. = ⋃
3. If we have a relation R on the set C = {0,1,2,…,20}, where a R b means a divides b, which
statement can be true about R?
A. R is symmetric C. R is transitive
B. R is equivalence D. R is not a relation
4. If S1 and S2 are two sets and both contain the same elements, then which of the
following statements cannot be true?
A. S1 S2 B. S2 S1 C. S1 S1 D. S1 S2
5. If a set A={ 0,1,2} then, identify the false statement from the following list
A. 2A B. {0,1,2} 2A C. 2|A|=3 D. {{0,1,2}} 2A
6. When do we say a function is injective?
A. If different element in domain have different images
B. If all element in the range are image of some element in domain
C. If all element in the domain are image of some element in range
D. If different element in range have different domain
7. One of the following is not a true statement about a tree
A. A tree is a connected graph with no circuits or loops.
B. In a tree there is one and only one path between every pair of vertices.
C. A tree with n vertices has n-1 edges.
D. If a disconnected graph with n vertices has n - 1 edge, then it is a tree
8. From the following list of automata which one is the most powerful
A. Finite Automata B. Turing Machine C. Pushdown Automata D. None
9. A finite sequence of symbols, which may contain variables as well as terminals, is called ____
A. Strings B. Sentence C. Sentential Form D. Word
10. If we have a grammar with rules of the form A → Bx or A → x, where A, B ∈ V and x ∈ T*, it is a
A. Left Linear Grammar C. Right Linear Grammar
B. Simple Grammar D. Context-Free Grammar



11. Which one of the following languages is represented by the transition graph?
[transition graph over states q0–q4 omitted]
A. All strings from Σ = {a,b} that end with the substring abb
B. All strings from Σ = {a,b} that start with the substring bab
C. All strings from Σ = {a,b} that end with the substring bab
D. All strings from Σ = {a,b} that start with the substring ab

12. Which of the following strings is accepted by the finite automaton in question 11?

A. aaababaab C. bbbbabab
B. aaabbbaaabba D. aaabbaabbabb
13. From the following lists of automata, which one is with memory?
A. FA B. DFA C. NFA D. PDA
14. Considering the transition graph below, the language accepted by it is
[transition graph over states q0–q4 omitted]
A. The set of all strings in (a, b)* starting with ab
B. The set of all strings in (a, b)* starting with abb
C. The set of strings containing {abb}
D. The set of strings containing {ab, abb}
15. What type of finite automaton is the graph in question 14?
A. Deterministic C. Nondeterministic
B. Context free D. Derivation tree



16. Which of the listed strings below is accepted by the following NFA?
[transition graph over states q0–q2 omitted]
A. 0110 B. 01001 C. 100100000 D. 000000
Considering the automaton M given below, choose the correct answers for Questions 17-19:
[transition graph omitted]
17. M is a _____________
A. nondeterministic finite automaton
B. deterministic finite automaton accepting {0,1}*
C. DFA accepting all strings over {0,1} having 3m 0's and 3n 1's, m, n >=1
D. Deterministic finite automaton
18. M accepts
A.01110 B. 10001 C.01010 D. 11111
19. L (M) is equal to
A. {03m 13n| m,n>=0}
B. {03m 13n| m,n>=1}
C. {w| w has 111 as substring}
D. {w| w has 3n 1‘s, n>=1}
20. Which of the following statements is acceptable?
A. Automata are a complex model of the digital computer
B. If a language is accepted by any DFA, there is an equivalent NFA that accepts it.
C. Many languages may be accepted by a single machine
D. Some regular languages may not have finite automata
21. The regular expression r = b*ab*ab* represents
A. All strings from Σ = {a,b} that have exactly two a's
B. All strings from Σ = {a,b} that have at least two a's
C. All strings from Σ = {a,b} that have at most two a's
D. All are possible answers



22. The regular expression r = (a+b)*bb(a+b)* represents
A. All strings from Σ = {a,b} that have exactly two b's
B. All strings from Σ = {a,b} containing the substring bb
C. All strings from Σ = {a,b} that have at most two b's
D. All strings from Σ = {a,b} that have at least two b's
23. If you have a grammar G = ({A,B}, {0,1}, A, {A → 0B, B → 0B, B → 01}), what is the
language generated by the grammar?
A. 0*01 B. 00*01 C. 001 D. 00*01*
24. Which of the following regular grammars represents the regular expression r = a*b(a +
b)*?
A. A → aA/bB/b, B → bB/aB/a/b C. A → aA/b, B → bB/aB/a/b
B. A → bB/b, B → bB/aB/a/b D. A → b, B → bB/aB/a/b
25. Which one of the following statements is not true about regular languages?
A. If we can draw a DFA for a language, then it is a regular language.
B. A set of strings represented by a regular expression is a regular language.
C. A set of strings generated by a left linear grammar is a regular language.
D. A set of strings generated by a linear grammar is a regular language.
26. Which of the following is not a regular grammar?
A. A → aA/aB/b B. A → Aa/Ba/b C. A → Aa/aB/b D. A → Aa/Ba/bb
27. Which of the following languages is accepted by the grammar S → aaS1, S1 → aabS | b?
A. aaaab(aab)* B. (aaab)*aab C. (aaaab)*aab D. (aaaab)*ab
28. The set of all strings over {a,b} having exactly 3 b's is represented by the regular
expression
A. a*bbb B. a*ba*ba*b C. ba*ba*b D. a*ba*ba*ba*
29. If L is the set of all strings over {a, b} containing at least one a, then it is not represented
by the regular expression
A. b*a(a + b)* C. (a + b)*a(b + a)*
B. (a + b)*ab* D. (a + b)*a
30. (0*1*)* is the same as
A. (0 + 1)* C. (10)*
B. (01)* D. none

31. Grammar of the programming language is checked at the ________ phase of a compiler.
A. Syntax analysis C. Semantic analysis
B. Lexical analysis D. Code generation



32. Which component is different in the implementation of all types of LR parser?
A. input buffer C. parsing algorithm
B. parsing table D. stack
33. A sequence of replacement of structure names by choices on the right-hand side of
grammar rules is:
A. Derivation C. precedence
B. Parsing D. association
34. All of the following are primary tasks of code generator except
A. Register allocation C. instruction selection
B. Declaration of variables D. instruction ordering
35. Which one of the following is different from the others?
A. quadruples C. Syntax tree
B. triples D. indirect triples
36. Which of the following represents a dynamic semantic error?
A. type mismatch C. illegal character
B. Missing semi-colon D. division by zero
37. Which of the following represents a target Code?
A. Three address code C. Triplets
B. Assembly language D. Syntax tree
38. The process of converting a non-deterministic grammar into a deterministic grammar is
called:
A. left recursion C. left factoring
B. association D. precedence
39. All of the following can represent intermediate code except
A. Syntax tree C. lexemes
B. Three address code D. Quadruples
40. The recognizer for syntax analysis is:
A. Deterministic finite automata C. Non-deterministic pushdown automata
B. Non-deterministic finite automata D. Deterministic pushdown automata



 ANSWER FOR SELF-TEST QUESTIONS (Formal and Compiler)
1. B    2. D    3. C    4. D    5. C
6. A    7. D    8. B    9. C    10. A
11. C   12. C   13. D   14. C   15. A
16. B   17. D   18. A   19. D   20. B
21. A   22. B   23. B   24. A   25. D
26. C   27. C   28. D   29. C   30. A
31. A   32. A   33. A   34. B   35. C
36. D   37. B   38. C   39. C   40. A

