UNIT 1 COMPILER DESIGN


COMPILER DESIGN LECTURE

COMPLETE NOTES
SYLLABUS

UNIT - I

Introduction to Compiling:
Compilers, Analysis of the source program, The phases of a compiler, Cousins of the compiler,
The grouping of phases, Compiler-construction tools
A Simple One-Pass Compiler:
Overview, Syntax definition, Syntax-directed translation, Parsing, A translator for simple
expressions, Lexical analysis, Incorporating a symbol table, Abstract stack machines, Putting the
techniques together
Lexical Analysis:
The role of the lexical analyzer, Input buffering, Specification of tokens, Recognition of tokens,
A language for specifying lexical analyzers, Finite automata, From a regular expression to an
NFA, Design of a lexical analyzer generator, Optimization of DFA-based pattern matchers
Introduction to Compiling:
INTRODUCTION TO THE LANGUAGE PROCESSING SYSTEM

Fig 1.1: Language Processing System


Preprocessor

A preprocessor produces input to compilers. It may perform the following functions:

1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessing: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by means of built-in macros.

COMPILER

A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in machine-level language (MLL), the target program. An important part of a compiler's job is reporting errors to the programmer.

Fig 1.2: Structure of Compiler


Executing a program written in an HLL programming language is basically a two-part process: the source program must first be compiled (translated) into an object program; the resulting object program is then loaded into memory and executed.
Fig 1.3: Execution process of source program in Compiler

ASSEMBLER
Programmers found it difficult to write or read programs in machine language. They began to use mnemonics (symbols) for each machine instruction, which they would subsequently translate into machine language. Such a mnemonic machine language is now called an assembly language. Programs known as assemblers were written to automate the translation of assembly language into machine language. The input to an assembler program is called the source program; the output is a machine language translation (object program).

INTERPRETER
An interpreter is a program that appears to execute a source program as if it were machine language.

Fig1.4: Execution in Interpreter

Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. JAVA also uses an interpreter. The process of interpretation can be carried out in the following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct execution

Advantages:
Modification of the user program can easily be made and implemented as execution proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simplified task for a program used for interpretation.
The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is higher.

LOADER AND LINK-EDITOR:

Once the assembler produces an object program, that program must be placed into memory and executed. The assembler could place the object program directly in memory and transfer control to it, thereby causing the machine language program to be executed. However, this would waste core by leaving the assembler in memory while the user's program was being executed. Also, the programmer would have to retranslate the program with each execution, thus wasting translation time. To overcome these problems of wasted translation time and memory, system programmers developed another component called a loader.
"A loader is a program that places programs into memory and prepares them for execution." It would be more efficient if subroutines could be translated into object form and the loader could "relocate" them directly behind the user's program. The task of adjusting programs so they may be placed in arbitrary core locations is called relocation. Relocating loaders perform four functions.

TRANSLATOR
A translator is a program that takes as input a program written in one language and produces as output a program in another language. Besides program translation, the translator performs another very important role, error detection. Any violation of the HLL specification would be detected and reported to the programmer. The important roles of a translator are:
1. Translating the HLL program input into an equivalent ML program.
2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.

LIST OF COMPILERS
1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. Common Lisp compilers
9. ECMAScript interpreters
10. Fortran compilers
11. Java compilers
12. Pascal compilers
13. PL/I compilers
14. Python compilers
15. Smalltalk compilers

STRUCTURE OF THE COMPILER DESIGN


Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation. The phases of a compiler are shown below.
There are two phases of compilation:
a. Analysis (machine independent / language dependent)
b. Synthesis (machine dependent / language independent)

The compilation process is partitioned into a number of sub-processes called 'phases'.


Lexical Analysis:-
The lexical analyzer (LA), or scanner, reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens.
Fig 1.5: Phases of Compiler

Syntax Analysis:-
The second stage of translation is called Syntax analysis or parsing. In this phase expressions,
statements, declarations etc… are identified by using the results of lexical analysis. Syntax analysis is
aided by using techniques based on formal grammar of the programming language.

Intermediate Code Generations:-


An intermediate representation of the final machine language code is produced. This phase bridges
the analysis and synthesis phases of translation.

Code Optimization :-
This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space.

Code Generation:-
The last phase of translation is code generation. A number of optimizations to reduce the length of the machine language program are carried out during this phase. The output of the code generator is the machine language program for the specified computer.
Table Management (or) Book-keeping:- This portion keeps track of the names used by the program and records essential information about each. The data structure used to record this information is called a 'Symbol Table'.

Error Handlers:-
This is invoked when a flaw (error) in the source program is detected.

The output of the LA is a stream of tokens, which is passed to the next phase, the syntax analyzer or parser. The SA groups the tokens together into syntactic structures such as expressions. Expressions may further be combined to form statements. The syntactic structure can be regarded as a tree whose leaves are the tokens; such trees are called parse trees.

The parser has two functions. It checks whether the tokens from the lexical analyzer occur in patterns that are permitted by the specification for the source language. It also imposes on the tokens a tree-like structure that is used by the subsequent phases of the compiler.

For example, if a program contains the expression A+/B, after lexical analysis this expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer should detect an error situation, because the presence of two adjacent binary operators violates the formation rules of an expression. The job of syntax analysis is to make explicit the hierarchical structure of the incoming token stream by identifying which parts of the token stream should be grouped.

For example, A/B*C has two possible interpretations:


1. divide A by B and then multiply by C, or
2. multiply B by C and then use the result to divide A.
Each of these two interpretations can be represented in terms of a parse tree.

Intermediate Code Generation:-


The intermediate code generator uses the structure produced by the syntax analyzer to create a stream of simple instructions. Many styles of intermediate code are possible. One common style uses instructions with one operator and a small number of operands. The output of the syntax analyzer is some representation of a parse tree; the intermediate code generation phase transforms this parse tree into an intermediate language representation of the source program.

Code Optimization
This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space. Its output is another intermediate code program that does the same job as the original, but in a way that saves time and/or space.
a. Local Optimization:-
There are local transformations that can be applied to a program to make an improvement. For example,
If A > B goto L2
Goto L3
L2 :

This can be replaced by the single statement


If A ≤ B goto L3

Another important local optimization is the elimination of common sub-expressions


A := B + C + D
E := B + C + F
Might be evaluated as

T1 := B + C
A := T1 + D
E := T1 + F
This takes advantage of the common sub-expression B + C.

b. Loop Optimization:-
Another important source of optimization concerns increasing the speed of loops. A typical loop improvement is to move a computation that produces the same result each time around the loop to a point in the program just before the loop is entered.
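The effect of this transformation can be sketched in C. This is an illustrative sketch only, assuming invented function names and values; the point is that the hoisted version computes the loop-invariant product x * y once, just before the loop is entered, rather than on every iteration.

```c
/* Hypothetical illustration of loop-invariant code motion.
   Both functions add x*y to the total n times; in the optimized
   version the invariant product x*y is computed once, before
   the loop is entered. */
int sum_unoptimized(int x, int y, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        int t = x * y;   /* recomputed on every iteration */
        total += t;
    }
    return total;
}

int sum_optimized(int x, int y, int n) {
    int t = x * y;       /* hoisted: computed once, before the loop */
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += t;
    }
    return total;
}
```

An optimizing compiler performs this motion on the intermediate code automatically; both versions compute the same result.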

Code generator :-
Code Generator produces the object code by deciding on the memory locations for data, selecting
code to access each datum and selecting the registers in which each computation is to be done. Many
computers have only a few high speed registers in which computations can be performed quickly. A
good code generator would attempt to utilize registers as efficiently as possible.

Table Management OR Book-keeping :-


A compiler needs to collect information about all the data objects that appear in the source program.
The information about data objects is collected by the early phases of the compiler-lexical and
syntactic analyzers. The data structure used to record this information is called as Symbol Table.

Error Handling :-
One of the most important functions of a compiler is the detection and reporting of errors in the source program. The error messages should allow the programmer to determine exactly where the errors have occurred. Errors may occur in any of the phases of a compiler.

Whenever a phase of the compiler discovers an error, it must report the error to the error handler, which issues an appropriate diagnostic message. Both the table-management and error-handling routines interact with all phases of the compiler.
Example:

Fig 1.6: Compilation Process of a source code through phases


2. A simple One Pass Compiler:

INTRODUCTION: In computer programming, a one-pass compiler is a compiler


that passes through the parts of each compilation unit only once, immediately translating
each part into its final machine code. This is in contrast to a multi-pass compiler which
converts the program into one or more intermediate representations in steps between
source code and machine code, and which reprocesses the entire compilation unit in each
sequential pass.

OVERVIEW

 Language Definition
o Appearance of programming language :
Vocabulary : Regular expression
Syntax : Backus-Naur Form (BNF) or Context-Free Grammar (CFG)
o Semantics : Informal language or some examples

 Fig 2.1. Structure of our compiler frontend

SYNTAX DEFINITION

 To specify the syntax of a language : CFG and BNF


o Example : the if-else statement in C has the form
statement → if ( expression ) statement else statement
 An alphabet of a language is a set of symbols.
o Examples : {0,1} for a binary number system (language) = {0,1,100,101,...}
{a,b,c} for a language = {a,b,c,ac,abcc,...}
{if,(,),else,...} for if-statements = {if(a==1)goto10, ...}
 A string over an alphabet
o is a sequence of zero or more symbols from the alphabet.
o Examples : 0, 1, 10, 00, 11, 111, ... are strings over the alphabet {0,1}
o The null string is a string which does not have any symbol of the alphabet.
 Language
o is a subset of all the strings over a given alphabet.
o Alphabets Ai and languages Li for Ai :
A0={0,1} L0={0,1,100,101,...}
A1={a,b,c} L1={a,b,c,ac,abcc,...}
A2={all C tokens} L2={all sentences of C programs}
 Example 2.1. Grammar for expressions consisting of digits and plus and minus signs.
o Language of expressions L={9-5+2, 3-1,...}
o The productions of the grammar for this language L are:
list → list + digit
list → list - digit
list → digit
digit → 0|1|2|3|4|5|6|7|8|9
o list, digit : grammar variables, grammar symbols
o 0,1,2,3,4,5,6,7,8,9,-,+ : tokens, terminal symbols
 Conventions for specifying a grammar:
o Terminal symbols : boldface strings if, num, id
o Nonterminal symbols, grammar symbols : italicized names list, digit, A, B

 Grammar G=(N,T,P,S)
o N : a set of nonterminal symbols
o T : a set of terminal symbols, tokens
o P : a set of production rules
o S : a start symbol, S∈N
 Grammar G for a language L={9-5+2, 3-1,...}
o G=(N,T,P,S)
N={list,digit}
T={0,1,2,3,4,5,6,7,8,9,-,+}
P : list -> list + digit
list -> list - digit
list -> digit
digit ->0|1|2|3|4|5|6|7|8|9
S=list

 Some definitions for a language L and its grammar G


 Derivation:
A sequence of replacements S⇒α1⇒α2⇒…⇒αn is a derivation of αn.
Example: a derivation of 1+9 from the grammar G

 left most derivation

list ⇒ list + digit ⇒ digit + digit ⇒ 1 + digit ⇒ 1 + 9


 right most derivation

list ⇒ list + digit ⇒ list + 9 ⇒ digit + 9 ⇒ 1 + 9

 Language of a grammar, L(G)

L(G) = { x | S ⇒* x }, where x is a sequence of terminal symbols


L(G) is the set of sentences that can be generated from the grammar G.

 Example: Consider a grammar G=(N,T,P,S):


N={S} T={a,b}
S=S P={S → aSb | ε}
 Is aabb a sentence of L(G)? (derivation of the string aabb)
S⇒aSb⇒aaSbb⇒aaεbb⇒aabb (or S⇒* aabb)
so aabb ∈ L(G)
 There is no derivation for aa, so aa∉L(G)
 Note L(G)={a^n b^n | n≥0}, where a^n b^n means n a's followed by n b's.
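The grammar S → aSb | ε maps directly onto a recognizer for L(G) = { a^n b^n | n ≥ 0 }. The following is a hedged C sketch written for this example (the function names match_s and in_language are ours, not from the text): each call to match_s plays the role of one expansion of S, trying S → aSb first and falling back to S → ε.

```c
/* Sketch of a recognizer for L(G) = { a^n b^n | n >= 0 },
   following the derivation S => aSb => ... => a^n b^n.
   match_s consumes one balanced a...b span starting at s[*i]. */
static int match_s(const char *s, int *i) {
    if (s[*i] == 'a') {          /* try S -> aSb */
        (*i)++;                  /* consume 'a' */
        if (!match_s(s, i)) return 0;
        if (s[*i] != 'b') return 0;
        (*i)++;                  /* consume matching 'b' */
        return 1;
    }
    return 1;                    /* S -> epsilon */
}

/* A string is in the language iff S derives it and nothing is left over. */
int in_language(const char *s) {
    int i = 0;
    return match_s(s, &i) && s[i] == '\0';
}
```

For example, aabb is accepted while aa and abab are rejected, matching the derivations above.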

 Parse Tree
A derivation can be conveniently represented by a derivation tree (parse tree).
o The root is labeled by the start symbol.
o Each leaf is labeled by a token or ε.
o Each interior node is labeled by a nonterminal symbol.
o When a production A→x1…xn is applied in a derivation, nodes labeled by x1…xn are made children of the node labeled by A.
 root : the start symbol
 internal nodes : nonterminals
 leaf nodes : terminals

o Example G:
list → list + digit | list - digit | digit
digit → 0|1|2|3|4|5|6|7|8|9
 left most derivation for 9-5+2 :
list ⇒ list+digit ⇒ list-digit+digit ⇒ digit-digit+digit ⇒ 9-digit+digit
⇒ 9-5+digit ⇒ 9-5+2
 right most derivation for 9-5+2 :
list ⇒ list+digit ⇒ list+2 ⇒ list-digit+2 ⇒ list-5+2
⇒ digit-5+2 ⇒ 9-5+2

parse tree for 9-5+2

Fig 2.2. Parse tree for 9-5+2 according to the grammar in Example

Ambiguity
 A grammar is said to be ambiguous if the grammar has more than one parse tree for a given string of tokens.
 Example 2.5. Suppose a grammar G that cannot distinguish between lists and digits as in Example 2.1.
 G : string → string + string | string - string | 0|1|2|3|4|5|6|7|8|9
Fig 2.3. Two parse trees for 9-5+2
 9-5+2 has 2 parse trees => grammar G is ambiguous.

Associativity of operators
An operator is said to be left associative if an operand with operators on both sides of it is taken by the operator to its left.
eg) 9+5+2≡(9+5)+2 (left associative); a=b=c≡a=(b=c) (right associative)
 Left associative grammar:
list → list + digit | list - digit
digit → 0|1|…|9
 Right associative grammar:
right → letter = right | letter
letter → a|b|…|z

Fig 2.4. Parse tree left- and right-associative operators.

Precedence of operators
We say that an operator (*) has higher precedence than another operator (+) if the operator (*) takes its operands before the other operator (+) does.
 ex. 9+5*2≡9+(5*2), 9*5+2≡(9*5)+2
 left associative operators : + , - , * , /
 right associative operators : = , **
 Syntax of full expressions:
operator associativity precedence

+,- left 1 (low)


*,/ left 2 (high)

 expr → expr + term | expr - term |term


term → term * factor | term / factor | factor
factor → digit | ( expr )
digit → 0 | 1 | … | 9

 Syntax of statements
o stmt → id = expr ;
| if ( expr ) stmt ;
| if ( expr ) stmt else stmt ;
| while ( expr ) stmt ;
expr → expr + term | expr - term | term
term → term * factor | term / factor | factor
factor → digit | ( expr )
digit → 0 | 1 | … | 9
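The expression grammar above can be sketched as a recursive-descent evaluator in C. This is an illustrative sketch under our own naming (eval, and a global cursor p); the while-loops stand in for the left-recursive productions, which is exactly what makes + - * / left associative, and the call chain expr → term → factor encodes the two precedence levels.

```c
/* Recursive-descent evaluator for:
     expr   -> expr + term | expr - term | term
     term   -> term * factor | term / factor | factor
     factor -> digit | ( expr )
   Left recursion is replaced by iteration, which also yields
   left associativity; precedence follows the call structure. */
static const char *p;          /* cursor into the input string */

static int expr(void);

static int factor(void) {
    if (*p == '(') {
        p++;                   /* consume '(' */
        int v = expr();
        p++;                   /* consume ')' */
        return v;
    }
    return *p++ - '0';         /* a single digit */
}

static int term(void) {
    int v = factor();
    while (*p == '*' || *p == '/') {
        char op = *p++;
        int r = factor();
        v = (op == '*') ? v * r : v / r;
    }
    return v;
}

static int expr(void) {
    int v = term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        int r = term();
        v = (op == '+') ? v + r : v - r;
    }
    return v;
}

int eval(const char *s) { p = s; return expr(); }
```

With this sketch, 9+5*2 evaluates to 19 and (9+5)*2 to 28, reflecting the precedence the grammar encodes.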
SYNTAX-DIRECTED TRANSLATION (SDT)
A formalism for specifying translations for programming language constructs
(attributes of a construct: type, string, location, etc.).
 Syntax-directed definition (SDD) for the translation of constructs
 Syntax-directed translation scheme (SDTS) for specifying translation
Postfix notation for an expression E
 If E is a variable or constant, then the postfix notation for E is E itself (E.t≡E).
 If E is an expression of the form E1 op E2, where op is a binary operator:
o E1' is the postfix of E1,
o E2' is the postfix of E2,
o then E1' E2' op is the postfix for E1 op E2.
 If E is (E1), and E1' is the postfix of E1, then E1' is the postfix for E.

eg) 9 - 5 + 2 ⇒ 9 5 - 2 +

9 - (5 + 2) ⇒ 9 5 2 + -

Syntax-Directed Definition(SDD) for translation


 An SDD is a set of semantic rules predefined for each production respectively for translation.
 A translation is an input-output mapping procedure: for translation of an input X,
o construct a parse tree for X,
o synthesize attributes over the parse tree.
 Suppose a node n in the parse tree is labeled by X and X.a denotes the value of attribute a of X at that node.
 Compute X's attributes X.a using the semantic rules associated with X.

Example 2.6. SDD for infix to postfix translation

Fig 2.5. Syntax-directed definition for infix to postfix translation.

An example of synthesized attributes for input X=9-5+2

Fig 2.6. Attribute values at nodes in a parse tree.

Syntax-directed Translation Schemes(SDTS)


 A translation scheme is a context-free grammar in which program fragments called translation actions are embedded within the right sides of the productions.

production SDD for postfix notation SDTS
list → list + term list.t = list.t || term.t || "+" list → list + term {print("+")}
 {print("+");} : translation (semantic) action.
 An SDTS generates an output for each sentence x generated by the underlying grammar by executing the actions in the order they appear during a depth-first traversal of a parse tree for x.
1. Design a translation scheme (SDTS) for the translation.
2. Translate:
a) parse the input string x, and
b) emit the action results encountered during the depth-first traversal of the parse tree.

Fig 2.7. Example of a depth-first traversal of a tree. Fig 2.8. An extra leaf is constructed for a semantic action.

Example 2.8.
 SDD vs. SDTS for infix to postfix translation.

productions SDD SDTS


expr → expr + term expr.t = expr.t || term.t || "+" expr → expr + term {printf("+")}
expr → expr - term expr.t = expr.t || term.t || "-" expr → expr - term {printf("-")}
expr → term expr.t = term.t expr → term
term → 0 term.t = "0" term → 0 {printf("0")}
term → 1 term.t = "1" term → 1 {printf("1")}
… … …
term → 9 term.t = "9" term → 9 {printf("9")}

 Actions translating the input 9-5+2

Fig 2.9. Actions translating 9-5+2 into 95-2+.


1) Parse.
2) Translate.
Do we have to maintain the whole parse tree?
No. Semantic actions are performed during parsing, and we don't need the nodes whose semantic actions are already done.
PARSING
If the token string x ∈ L(G), then produce a parse tree;
else produce an error message.
Top-Down Parsing
1. At a node n labeled with nonterminal A, select one of the productions whose left part is A and construct children of node n with the symbols on the right side of that production.
2. Find the next node at which a subtree is to be constructed.
ex. G: type → simple
| ↑id
| array [ simple ] of type
simple → integer
| char
| num dotdot num

Fig 2.10. Top-down parsing while scanning the input from left to right.
Fig 2.11. Steps in the top-down construction of a parse tree.
 The selection of a production for a nonterminal may involve trial and error. => backtracking

 G : { S→aSb | c | ab }

According to the top-down parsing procedure, are acb, aabb ∈ L(G)?

 S/acb ⇒ aSb/acb ⇒ aSb/acb ⇒ aaSbb/acb ⇒ X
(S→aSb) move (S→aSb) backtracking

⇒ aSb/acb ⇒ acb/acb ⇒ acb/acb ⇒ acb/acb
(S→c) move move


so, acb ∈ L(G).
It is finished in 7 steps including one backtracking.

 S/aabb ⇒ aSb/aabb ⇒ aSb/aabb ⇒ aaSbb/aabb ⇒ aaSbb/aabb ⇒ aaaSbbb/aabb ⇒ X
(S→aSb) move (S→aSb) move (S→aSb) backtracking

⇒ aaSbb/aabb ⇒ aacbb/aabb ⇒ X
(S→c) backtracking

⇒ aaSbb/aabb ⇒ aaabbb/aabb ⇒ X
(S→ab) backtracking

⇒ aSb/aabb ⇒ acb/aabb ⇒ X
(S→c) backtracking

⇒ aSb/aabb ⇒ aabb/aabb ⇒ aabb/aabb ⇒ aabb/aabb ⇒ aabb/aabb
(S→ab) move move move


so, aabb ∈ L(G),
but the process is laborious: it needs 18 steps including 5 backtrackings.
 Procedure of top-down parsing:
Let the pointed grammar symbol and pointed input symbol be g and a respectively.

o if ( g ∈ N ) select and expand a production whose left part equals g, next to the current production.
o else if ( g = a ) then make g and a be the symbols next to the current symbols (move).
o else if ( g ≠ a ) backtrack:
 let the pointed input symbol a move back as many steps as the number of symbols matched by the current production, and
 eliminate the right-side symbols of the current production and let the pointed symbol g be the left-side symbol of the current production.

Predictive Parsing (Recursive Descent Parsing, RDP)


 A strategy for general top-down parsing:


Guess a production, see if it matches; if not, backtrack and try another.


 It may fail to recognize a correct string in some grammars G and is tedious in processing.

 Predictive parsing
o is a kind of top-down parsing that predicts a production whose derived terminal symbol is equal to the next input symbol while expanding in top-down parsing,
o without backtracking.
o A recursive descent parser is a kind of predictive parser that is implemented by disjoint recursive procedures, one procedure for each nonterminal; the procedures are patterned after the productions.
 Procedure of predictive parsing (RDP):
Let the pointed grammar symbol and pointed input symbol be g and a respectively.

o if ( g ∈ N )
 select the next production P whose left symbol equals g and whose set of first terminal symbols (derivable from the right side of production P) includes the input symbol a;
 expand the derivation with that production P.
o else if ( g = a ) then make g and a be the symbols next to the current symbols (move).
o else if ( g ≠ a ) error

 G : { S→aSb | c | ab } => G1 : { S→aS' | c, S'→Sb | b }


According to the predictive parsing procedure, are acb, aabb ∈ L(G)?
o S/acb ⇒ confused between { S→aSb, S→ab }
o So, a predictive parser requires some restriction on the grammar; that is, among the productions whose left part is a given nonterminal A, each must begin with a distinct first terminal symbol.
 Requirements for a grammar to be suitable for RDP: for each nonterminal, either
1. A → Bα, or
2. A → a1α1 | a2α2 | … | anαn, where
1) for 1 ≤ i, j ≤ n and i ≠ j, ai ≠ aj
2) ε may also occur if none of the ai can follow A in a derivation and if we have A→ε


 If the grammar is suitable, we can parse efficiently without backtracking.
General top-down parser with backtracking

Recursive descent parser without backtracking

Picture parsing (a kind of predictive parsing) without backtracking

Left Factoring
 If a grammar contains two productions of the form
S → aα and S → aβ
it is not suitable for top-down parsing without backtracking. Troubles of this form can sometimes be removed from the grammar by a technique called left factoring.
 In left factoring, we replace { S→aα, S→aβ } by
{ S → aS', S'→ α, S'→ β } cf. S→ a(α|β)
(Hopefully α and β start with different symbols.)
 left factoring for G { S→aSb | c | ab }
S→aS' | c cf. S ( = aSb | ab | c = a(Sb | b) | c ) → aS' | c
S'→Sb | b
 A concrete example:
<stmt> → IF <boolean> THEN <stmt> |
IF <boolean> THEN <stmt> ELSE <stmt>
is transformed into
<stmt> → IF <boolean> THEN <stmt> S'
S' → ELSE <stmt> | ε

 Example,
o for G : { S→aSb | c | ab }
According to the predictive parsing procedure, are acb, aabb ∈ L(G)?
 S/aabb ⇒ unable to choose between { S→aSb, S→ab }
o According to the left-factored grammar G1, are acb, aabb ∈ L(G)?
G1 : { S→aS' | c, S'→Sb | b } <= { S = a(Sb | b) | c }
o S/acb ⇒ aS'/acb ⇒ aS'/acb ⇒ aSb/acb ⇒ acb/acb ⇒ acb/acb ⇒ acb/acb
(S→aS') move (S'→Sb) (S→c) move move
so acb ∈ L(G)
It needs only 6 steps without any backtracking.
cf. General top-down parsing needs 7 steps and 1 backtracking.
o S/aabb ⇒ aS'/aabb ⇒ aS'/aabb ⇒ aSb/aabb ⇒ aaS'b/aabb ⇒ aaS'b/aabb ⇒ aabb/aabb ⇒ aabb/aabb
(S→aS') move (S'→Sb) (S→aS') move (S'→b) move move
so aabb ∈ L(G)
The process is finished in 8 steps without any backtracking.
cf. General top-down parsing needs 18 steps including 5 backtrackings.

Left Recursion
 A grammar is left recursive iff it contains a nonterminal A such that A⇒+ Aα, where α is any string.
o Grammar {S→ Sα | c} is left recursive because of S⇒Sα
o Grammar {S→ Aα, A→ Sb | c} is also left recursive because of S⇒Aα⇒Sbα
 If a grammar is left recursive, you cannot build a predictive top-down parser for it.
1) If a parser is trying to match S and S→Sα, it has no idea how many times S must be applied.
2) Given a left recursive grammar, it is always possible to find another grammar that generates the same language and is not left recursive.
3) The resulting grammar might or might not be suitable for RDP.

 After this, if we still need left factoring, it may not be suitable for RDP.


 Right recursion: needs special care / is harder than left recursion / SDT can handle it.

Eliminating Left Recursion


Let G be S→ S A | A.
Note that a top-down parser cannot parse the grammar G, regardless of the order in which the productions are tried.
⇒ The productions generate strings of the form A A … A
⇒ They can be replaced by S→A S' and S'→A S' | ε

Example :
 A →Aα∣β
=>
A → βR
R → αR | ε

Fig 2.12. Left-and right-recursive ways of generating a string.

 In general, the rule is that:


o If A→ Aα1 | Aα2 | … | Aαn and
A→ β1 | β2 | … | βm (no βi starts with A),
then replace by
A → β1R | β2R | … | βmR and
R → α1R | α2R | … | αnR | ε

Exercise: Remove the left recursion in the following grammar:


expr → expr + term | expr - term
expr → term
solution:
expr → term rest
rest → + term rest | - term rest | ε
A TRANSLATOR FOR SIMPLE EXPRESSIONS
 Convert infix into postfix (Polish notation) using SDT.
 Abstract syntax tree (annotated parse tree) vs. concrete syntax tree

 Concrete syntax tree : parse tree.


 Abstract syntax tree : syntax tree
 Concrete syntax : underlying grammar

Adapting the Translation Scheme


 Embed the semantic actions in the productions
 Design a translation scheme
 Eliminate left recursion and apply left factoring
 Example
1) Design a translation scheme and eliminate left recursion:

with left recursion                left recursion eliminated
E→ E + T {'+'}                     E→ T {} R
E→ E - T {'-'}                     R→ + T {'+'} R
E→ T {}                            R→ - T {'-'} R
T→ 0{'0'}|…|9{'9'}                 R→ ε
                                   T→ 0{'0'}|…|9{'9'}
2) Translation of the input string 9-5+2 : parsing and SDT

Result: 9 5 - 2 +
Example of translator design and execution
 A translation scheme with left recursion, and the same scheme with left recursion eliminated.

Initial specification for infix-to-postfix translator     with left recursion eliminated
expr → expr + term {printf("+")}                          expr → term rest
expr → expr - term {printf("-")}                          rest → + term {printf("+")} rest
expr → term                                               rest → - term {printf("-")} rest
term → 0 {printf("0")}                                    rest → ε
term → 1 {printf("1")}                                    term → 0 {printf("0")}
…                                                         term → 1 {printf("1")}
term → 9 {printf("9")}                                    …
                                                          term → 9 {printf("9")}

Fig 2.13. Translation of 9-5+2 into 95-2+.

Procedure for the Nonterminal expr, term, and rest

Fig 2.14. Function for the nonterminals expr, rest, and term.
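The three procedures can be sketched in C as follows. This is a minimal sketch under our own assumptions (the names to_postfix, out, and a string cursor named lookahead are ours, not the book's exact listing); output is collected into a buffer instead of printed so the result can be inspected. The grammar is the left-recursion-eliminated one: expr → term rest, rest → + term {print('+')} rest | - term {print('-')} rest | ε, term → digit {print(digit)}.

```c
/* Infix-to-postfix translator for single-digit expressions with + and -,
   patterned after the nonterminals expr, rest, and term. */
static const char *lookahead;   /* cursor into the input string */
static char out[64];            /* translated output */
static int  outlen;

static void emit(char c) { out[outlen++] = c; out[outlen] = '\0'; }

static void term_(void) {
    emit(*lookahead);           /* term -> digit { print(digit) } */
    lookahead++;
}

static void rest_(void) {
    /* rest -> + term {print('+')} rest | - term {print('-')} rest | e ;
       the tail recursion is written as a loop */
    while (*lookahead == '+' || *lookahead == '-') {
        char op = *lookahead++;
        term_();
        emit(op);               /* semantic action after the term */
    }
}

const char *to_postfix(const char *s) {
    lookahead = s; outlen = 0; out[0] = '\0';
    term_();                    /* expr -> term rest */
    rest_();
    return out;
}
```

Calling to_postfix("9-5+2") reproduces the running example, yielding "95-2+".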
Optimizer and Translator

LEXICAL ANALYSIS
 reads and converts the input into a stream of tokens to be analyzed by the parser.
 lexeme : a sequence of characters which comprises a single token.
 Lexical Analyzer → Lexeme / Token → Parser
Removal of White Space and Comments
 Remove white space (blank, tab, newline, etc.) and comments
Constants
 Constants: for a while, consider only integers
 eg) for input 31 + 28, what is the output (token representation)?
input : 31 + 28
output: <num, 31> <+, > <num, 28>
num, + : tokens
31, 28 : attributes, values (or lexemes) of the integer token num
Recognizing
 Identifiers
o Identifiers are names of variables, arrays, functions, ...
o A grammar treats an identifier as a token.
o eg) input : count = count + increment;
output : <id,1> <=, > <id,1> <+, > <id,2> ;
Symbol table
tokens attributes (lexeme)

0
1 id count
2 id increment
3
 Keywords are reserved, i.e., they cannot be used as identifiers.
A character string then forms an identifier only if it is not a keyword.
 punctuation symbols
o operators : + - * / := <> …

Interface to lexical analyzer

Fig 2.15. Inserting a lexical analyzer between the input and the parser

A Lexical Analyzer

Fig 2.16. Implementing the interactions in Fig. 2.15.

 c = getchar(); ungetc(c, stdin);
 token representation
o #define NUM 256
 Function lexan()
eg) input string 76 + a
input , output (returned value)
76 NUM, tokenval=76 (integer)
+ +
a id, tokenval="a"

 A way that the parser handles the token NUM returned by lexan():


o consider a translation scheme
factor → ( expr )
| num { print(num.value) }
#define NUM 256
...
factor() {
    if (lookahead == '(') {
        match('('); expr(); match(')');
    } else if (lookahead == NUM) {
        printf(" %d ", tokenval); match(NUM);
    } else error();
}
 The implementation of function lexan:
#include <stdio.h>
#include <ctype.h>
int lineno = 1;
int tokenval = NONE;
int lexan() {
    int t;
    while (1) {
        t = getchar();
        if (t == ' ' || t == '\t')
            ; /* strip blanks and tabs */
        else if (t == '\n')
            lineno += 1;
        else if (isdigit(t)) {
            tokenval = t - '0';
            t = getchar();
            while (isdigit(t)) {
                tokenval = tokenval*10 + t - '0';
                t = getchar();
            }
            ungetc(t, stdin);
            return NUM;
        } else {
            tokenval = NONE;
            return t;
        }
    }
}
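For experimentation, the same scanner logic can be sketched so that it reads from a string rather than stdin. This is our own variant (the names lexan_str, lexan_init, and src are ours), with NUM and NONE given explicit values since the listing assumes they are defined elsewhere.

```c
#include <ctype.h>

/* String-based variant of lexan(): scans integers and single-character
   tokens from a string set by lexan_init(). Returns NUM with the value
   in tokenval for an integer, the character itself otherwise, and 0
   (the string terminator) at end of input. */
#define NUM  256
#define NONE (-1)

static const char *src;   /* cursor into the input string */
int tokenval = NONE;

void lexan_init(const char *s) { src = s; }

int lexan_str(void) {
    for (;;) {
        int t = *src++;
        if (t == ' ' || t == '\t')
            ;                              /* strip blanks and tabs */
        else if (isdigit(t)) {
            tokenval = t - '0';
            while (isdigit((unsigned char)*src)) {
                tokenval = tokenval * 10 + *src - '0';
                src++;                     /* consume further digits */
            }
            return NUM;                    /* value left in tokenval */
        } else {
            tokenval = NONE;
            return t;                      /* one-character token, or 0 */
        }
    }
}
```

For the input "31 + 28" this returns NUM (tokenval 31), then '+', then NUM (tokenval 28), matching the token representation shown earlier.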

INCORPORATING A SYMBOL TABLE
 The symbol table interface: operations, usually called by the parser.
o insert(s,t): input s: lexeme
t: token
output: index of the new entry
o lookup(s): input s: lexeme
output: index of the entry for string s, or 0 if s is not found in the symbol table.
 Handling reserved keywords
1. Insert all keywords into the symbol table in advance.
ex) insert("div", div)
insert("mod", mod)
2. While parsing:
 whenever an identifier s is encountered,
if (lookup(s)'s token is in {keywords}) s is a keyword; else s is an identifier;

 example
o preset
insert("div", div);
insert("mod", mod);
o while parsing
lookup("count") => 0, so insert("count", id);
lookup("i") => 0, so insert("i", id);
lookup("i") => 4, id
lookup("div") => 1, div

Fig 2.17. Symbol table and array for storing strings.
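A minimal C sketch of insert(s,t) and lookup(s) as described above. Entry 0 is kept unused so that a return value of 0 can mean "not found"; the fixed table sizes and the name symtable are our illustrative assumptions, not the book's exact layout.

```c
#include <string.h>

/* Minimal symbol table: lookup(s) returns the entry index for lexeme s,
   or 0 if absent; insert(s,t) stores a new entry and returns its index. */
#define MAXENTRIES 100
#define MAXLEXEME   32

struct entry { char lexeme[MAXLEXEME]; int token; };

struct entry symtable[MAXENTRIES];
int nentries = 1;                 /* entry 0 reserved: 0 means "not found" */

int lookup(const char *s) {
    for (int i = 1; i < nentries; i++)
        if (strcmp(symtable[i].lexeme, s) == 0)
            return i;
    return 0;
}

int insert(const char *s, int t) {
    strcpy(symtable[nentries].lexeme, s);
    symtable[nentries].token = t;
    return nentries++;
}
```

With keywords preloaded via insert, the parser's test "lookup(s)'s token is in {keywords}" distinguishes keywords from ordinary identifiers exactly as outlined above.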

ABSTRACT STACK MACHINE
o An abstract machine is used for intermediate code generation/execution.
o Instruction classes: arithmetic / stack manipulation / control flow

 3 components of the abstract stack machine:
1) Instruction memory : abstract machine code, intermediate code (instructions)
2) Stack
3) Data memory
 An example of stack machine operation:
o for the input (5+a)*b, intermediate code : push 5, rvalue 2, ....
L-values and r-values
 l-value of a : the address of location a
 r-value of a : if a is a location, then the content of location a;
if a is a constant, then the value a
 eg) a := 5 + b;
lvalue a ⇒ 2, rvalue 5 ⇒ 5, rvalue of b ⇒ 7

Stack Manipulation
 Some instructions for the assignment operation:
o push v : push v onto the stack.
o rvalue a : push the contents of data location a.
o lvalue a : push the address of data location a.
o pop : throw away the top element of the stack.
o := : assignment using the top 2 elements of the stack: the r-value on top is stored into the l-value below it, and both are popped.
o copy : push a copy of the top element of the stack.
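These instructions can be sketched as C functions over an explicit stack and data memory. This is an illustrative sketch with our own names (add and assign stand in for the machine's + and := instructions); addresses are simply indices into the data array.

```c
/* Tiny abstract stack machine: a value stack plus a data memory.
   Each function implements one instruction from the list above. */
int stack[64];
int top = -1;                    /* stack pointer; -1 means empty */
int data[16];                    /* data memory, indexed by address */

void push(int v)   { stack[++top] = v; }
void rvalue(int a) { stack[++top] = data[a]; }   /* push contents of a */
void lvalue(int a) { stack[++top] = a; }         /* push address a    */
void add(void)     { int r = stack[top--]; stack[top] += r; }
void assign(void)  {                             /* the := instruction */
    int r = stack[top--];        /* r-value on top */
    int a = stack[top--];        /* l-value (address) below it */
    data[a] = r;
}
```

For a := 5 + b, with a at address 0 and b at address 1 holding 7, the sequence lvalue(0); push(5); rvalue(1); add(); assign(); leaves 12 in data[0] and an empty stack.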

Translation of Expressions
 Infix expression (IE) → SDD/SDTS → abstract machine code of the postfix expression,
for stack machine evaluation.

eg)
o IE: a + b (⇒ PE: a b +) ⇒ IC: rvalue a
                                rvalue b
                                +
o day := (1461 * y) div 4 + (153 * m + 2) div 5 + d
  (⇒ PE: day 1461 y * 4 div 153 m * 2 + 5 div + d + :=)
  ⇒ IC:
  1) lvalue day    5) push 4     9) push 2     13) +
  2) push 1461     6) div       10) +          14) rvalue d
  3) rvalue y      7) push 153  11) push 5     15) +
  4) *             8) rvalue m  12) div        16) :=
 A translation scheme for assignment-statements into abstract stack machine code can be
expressed formally as follows:
stmt → id := expr    { stmt.t := 'lvalue' || id.lexeme || expr.t || ':=' }

eg) day := a + b ⇒ lvalue day rvalue a rvalue b + :=
Control Flow
 3 types of jump instructions:
o Absolute target location
o Relative target location (distance: current ↔ target)
o Symbolic target location (i.e. the machine supports labels)
 Control-flow instructions:
o label a  : marks the jump's target a
o goto a   : the next instruction is taken from the statement labeled a
o gofalse a: pop the top; if it is 0 then jump to a
o gotrue a : pop the top; if it is nonzero then jump to a
o halt     : stop execution
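One way to picture these jump instructions is a tiny interpreter over an array of encoded instructions. The opcode names follow the text, but the encoding and the label-resolution-by-scanning below are invented for illustration.

```c
/* A tiny interpreter sketch for the control-flow instructions. */
enum { PUSH, GOTO, GOFALSE, GOTRUE, LABEL, HALT };
struct instr { int op, arg; };

static int jstack[64];
static int jtop = -1;

/* Execute prog until HALT; returns the top of the stack (or -1 if empty).
   Jump targets are symbolic label numbers, resolved by scanning for the
   matching LABEL instruction. */
static int run(const struct instr *prog)
{
    int pc = 0;
    while (prog[pc].op != HALT) {
        int op = prog[pc].op, arg = prog[pc].arg, t, v;
        switch (op) {
        case PUSH:  jstack[++jtop] = arg; pc++; break;
        case LABEL: pc++; break;               /* labels are no-ops */
        default:                               /* GOTO, GOFALSE, GOTRUE */
            if (op != GOTO) {
                v = jstack[jtop--];            /* pop the condition */
                if ((op == GOFALSE && v != 0) ||
                    (op == GOTRUE  && v == 0)) { pc++; break; }
            }
            for (t = 0; ; t++)                 /* find "label arg" */
                if (prog[t].op == LABEL && prog[t].arg == arg)
                    break;
            pc = t;
            break;
        }
    }
    return jtop >= 0 ? jstack[jtop] : -1;
}
```

For example, pushing a false condition and executing gofalse jumps over the true branch to the labeled statement.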

Translation of Statements
 Translation scheme for translating an if-statement into abstract machine code:
stmt → if expr then stmt1
       { out := newlabel;
         stmt.t := expr.t || 'gofalse' out || stmt1.t || 'label' out }

Fig 2.18. Code layout for conditional and while statements.

 Exercise: give a translation scheme for the while-statement.

Emitting a Translation
 Semantic Actions (Translation Scheme):
1. stmt → if
   expr  { out := newlabel; emit('gofalse', out) }
   then
   stmt1 { emit('label', out) }
2. stmt → id { emit('lvalue', id.lexeme) }
   :=
   expr  { emit(':=') }
3. stmt → if
   expr  { out := newlabel; emit('gofalse', out) }
   then
   stmt1 { out1 := newlabel; emit('goto', out1); emit('label', out); }
   else
   stmt2 { emit('label', out1); }
   The generated code layout:
       if (expr == false) goto out
       stmt1
       goto out1
   out:  stmt2
   out1:

Implementation
 procedure stmt()
   var test, out : integer;
 begin
   if lookahead = id then begin
     emit('lvalue', tokenval); match(id);
     match(':='); expr(); emit(':=');
   end
   else if lookahead = 'if' then begin
     match('if');
     expr();
     out := newlabel();
     emit('gofalse', out);
     match('then');
     stmt();
     emit('label', out)
   end
   else error();
 end
Control Flow with Analysis
 if E1 or E2 then S vs. if E1 and E2 then S
E1 or E2  = if E1 then true else E2
E1 and E2 = if E1 then E2 else false
 The code for E1 or E2:
o code for E1 (evaluation result: e1)
o copy
o gotrue OUT
o pop
o code for E2 (evaluation result: e2)
o label OUT

 The full code for if E1 or E2 then S;
o code for E1
o copy
o gotrue OUT1
o pop
o code for E2
o label OUT1
o gofalse OUT2
o code for S
o label OUT2
 Exercise: How about if E1 and E2 then S;
o and if E1 and E2 then S1 else S2;

Putting the techniques together!
 infix expression ⇒ postfix expression
eg) id+(id-id)*num/id ⇒ id id id - num * id / +

Description of the Translator

 Syntax-directed translation scheme (SDTS) to translate the infix expressions
into the postfix expressions.
Fig 2.19. Specification for infix-to-postfix translation

Structure of the translator,

Fig 2.19. Modules of infix to postfix translator.

o global header file "header.h"

The Lexical Analysis Module lexer.c


o Description of tokens:
+ - * / DIV MOD ( ) ID NUM DONE
Fig 2.20. Description of tokens.

The Parser Module parser.c

SDTS
|| ← left recursion elimination
New SDTS

Fig 2.20. Specification for infix to postfix translator & syntax directed translation scheme after
eliminating left-recursion.
The Emitter Module emitter.c
emit (t,tval)

The Symbol-Table Modules symbol.c and init.c


Symbol.c
data structure of the symbol table (Fig 2.29, p. 62)
insert(s,t)
lookup(s)

The Error Module error.c


Example of execution
input 12 div 5 + 2
output 12
5
div
2
+
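The output above is the postfix form 12 5 div 2 +. A small postfix evaluator (not one of the translator's modules, just a checking sketch) confirms that it computes the same value as the infix input: 12 div 5 = 2 under integer division, and 2 + 2 = 4.

```c
#include <stdlib.h>
#include <string.h>

/* Evaluate a postfix token sequence such as the translator's output above.
   Operands are decimal numerals; div/mod are integer operations. */
static int eval_postfix(const char *toks[], int n)
{
    int st[32], top = -1, i;
    for (i = 0; i < n; i++) {
        const char *t = toks[i];
        if      (strcmp(t, "+")   == 0) { st[top - 1] += st[top]; top--; }
        else if (strcmp(t, "-")   == 0) { st[top - 1] -= st[top]; top--; }
        else if (strcmp(t, "*")   == 0) { st[top - 1] *= st[top]; top--; }
        else if (strcmp(t, "div") == 0) { st[top - 1] /= st[top]; top--; }
        else if (strcmp(t, "mod") == 0) { st[top - 1] %= st[top]; top--; }
        else st[++top] = atoi(t);       /* a NUM token */
    }
    return st[top];
}
```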
3. Lexical Analysis:

OVERVIEW OF LEXICAL ANALYSIS


 To identify the tokens we need some method of describing the possible tokens that can
appear in the input stream. For this purpose we introduce the regular expression, a notation
that can be used to describe essentially all the tokens of a programming language.
 Secondly, having decided what the tokens are, we need some mechanism to recognize
these in the input stream. This is done by the token recognizers, which are designed using
transition diagrams and finite automata.

ROLE OF THE LEXICAL ANALYZER
The LA is the first phase of a compiler. Its main task is to read the input characters and produce
as output a sequence of tokens that the parser uses for syntax analysis.

Fig. 3.1: Role of Lexical analyzer

Upon receiving a 'get next token' command from the parser, the lexical analyzer reads
the input characters until it can identify the next token. The LA returns to the parser a
representation of the token it has found. The representation will be an integer code if the token
is a simple construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is
stripping out from the source program comments and white space in the form of blank, tab
and newline characters. Another is correlating error messages from the compiler with the
source program.

TOKEN, LEXEME, PATTERN:
Token: A token is a sequence of characters that can be treated as a single logical entity.
Typical tokens are:
1) identifiers 2) keywords 3) operators 4) special symbols 5) constants
Pattern: A set of strings in the input for which the same token is produced as output. This set
of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the
pattern for a token.
Fig. 3.2: Example of Token, Lexeme and Pattern

LEXICAL ERRORS:
Lexical errors are the errors thrown by your lexer when it is unable to continue, which means
that there is no way to recognise a lexeme as a valid token for your lexer. Syntax errors, on the
other hand, will be thrown by your parser when a given set of already recognised valid tokens
does not match any of the right-hand sides of your grammar rules. A simple panic-mode
error-handling system requires that we return to a high-level parsing function when a parsing
or lexical error is detected.

Error-recovery actions are:


i. Delete one character from the remaining input.
ii. Insert a missing character into the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.

REGULAR EXPRESSIONS
A regular expression is a formula that describes a possible set of strings. Components of a
regular expression:
x       the character x
.       any character, usually except a newline
[xyz]   any of the characters x, y, z, ...
R?      an R or nothing (= optionally an R)
R*      zero or more occurrences of R
R+      one or more occurrences of R
R1R2    an R1 followed by an R2
R1|R2   either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain type. If we view the
set of strings in each token class as a language, we can use the regular-expression notation to
describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or digits.
In regular expression notation we would write:
Identifier = letter (letter | digit)*
Here are the rules that define the regular expressions over an alphabet Σ:
 ε is a regular expression denoting { ε }, that is, the language containing only the empty
string.
 For each 'a' in Σ, 'a' is a regular expression denoting { a }, the language with only one
string, consisting of the single symbol 'a'.
 If R and S are regular expressions, then

(R) | (S) denotes L(R) ∪ L(S)
(R).(S)  denotes L(R).L(S)
(R)*     denotes (L(R))*

REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular expressions and to define
regular expressions using these names as if they were symbols.
Identifiers are the set of strings of letters and digits beginning with a letter. The following
regular definition provides a precise specification for this class of strings.
Example-1:
ab*|cd? is equivalent to (a(b*)) | (c(d?))
Pascal identifier:
letter → A | B | ... | Z | a | b | ... | z
digit  → 0 | 1 | 2 | ... | 9
id     → letter (letter | digit)*
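The regular definition for id can be checked directly with a small hand-coded matcher; the following sketch is illustrative (the function name is an assumption), and note that it admits no underscore, exactly as the definition above demands.

```c
#include <ctype.h>

/* A direct hand-coded check of the definition id -> letter (letter | digit)*. */
static int is_identifier(const char *s)
{
    if (!isalpha((unsigned char)s[0]))
        return 0;                       /* must begin with a letter */
    for (s++; *s != '\0'; s++)
        if (!isalnum((unsigned char)*s))
            return 0;                   /* only letters or digits afterwards */
    return 1;
}
```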

Recognition of tokens:
We have learned how to express patterns using regular expressions. Now, we must study how to
take the patterns for all the needed tokens and build a piece of code that examines the input
string and finds a prefix that is a lexeme matching one of the patterns.
stmt → if expr then stmt
     | if expr then stmt else stmt

expr → term relop term
     | term
term → id
     | number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is "equals"
and <> is "not equals", because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id and number, are the names of
tokens as far as the lexical analyzer is concerned. The patterns for the tokens are described
using regular definitions:
digit  → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id     → letter (letter | digit)*
if     → if
then   → then
else   → else
relop  → < | > | <= | >= | = | <>

In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing
the "token" ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters
of the same names. Token ws is different from the other tokens in that, when we recognize it, we
do not return it to the parser, but rather restart the lexical analysis from the character that
follows the white space. It is the following token that gets returned to the parser.

Lexeme       Token Name   Attribute Value

Any ws       -            -
if           if           -
then         then         -
else         else         -
Any id       id           Pointer to table entry
Any number   number       Pointer to table entry
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE

TRANSITION DIAGRAM:
A transition diagram has a collection of nodes or circles, called states. Each state represents a
condition that could occur during the process of scanning the input looking for a lexeme that
matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled by a
symbol or set of symbols.
If we are in a state s, and the next input symbol is a, we look for an edge out of state s labeled
by a. If we find such an edge, we advance the forward pointer and enter the state of the
transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has
been found, although the actual lexeme may not consist of all positions between the
lexemeBegin and forward pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we shall
additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge labeled
"start" entering from nowhere. The transition diagram always begins in the start state
before any input symbols have been used.
Fig. 3.3: Transition diagram of Relational operators

As an intermediate step in the construction of an LA, we first produce a stylized flowchart,
called a transition diagram. Positions in a transition diagram are drawn as circles and are
called states.

Fig. 3.4: Transition diagram of Identifier

The above TD is for an identifier, defined to be a letter followed by any number of letters or
digits. A sequence of transition diagrams can be converted into a program to look for the tokens
specified by the diagrams. Each state gets a segment of code.
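As a sketch of this conversion, the relational-operator diagram of Fig. 3.3 can be hand-translated into a function in which each state is a position in the code and each edge a test on the next input character. The token codes are illustrative, and retraction of the forward pointer on the "other" edges is left to the caller in this simplified version.

```c
/* Hand-translation of the relational-operator transition diagram into code. */
enum { LT = 1, LE, EQ, NE, GT, GE };

static int relop(const char *in)   /* returns a token code, or 0 for no match */
{
    int pos = 0;                   /* the forward pointer */
    if (in[pos] == '<') {          /* state 0 --'<'--> state 1 */
        pos++;
        if (in[pos] == '=') return LE;   /* state 1 --'='--> accept LE */
        if (in[pos] == '>') return NE;   /* state 1 --'>'--> accept NE */
        return LT;                 /* any other character: accept LT, retract */
    }
    if (in[pos] == '=')
        return EQ;
    if (in[pos] == '>') {
        pos++;
        if (in[pos] == '=') return GE;
        return GT;                 /* other character: accept GT, retract */
    }
    return 0;
}
```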

FINITE AUTOMATA
 A recognizer for a language is a program that takes a string x, and answers "yes" if x is a
sentence of that language, and "no" otherwise.
 We call the recognizer of the tokens a finite automaton.
 A finite automaton can be: deterministic (DFA) or non-deterministic (NFA).
 This means that we may use a deterministic or non-deterministic automaton as a lexical
analyzer.
 Both deterministic and non-deterministic finite automata recognize regular sets.
 Which one?
– deterministic: a faster recognizer, but it may take more space
– non-deterministic: slower, but it may take less space
– Deterministic automata are widely used as lexical analyzers.
 First, we define regular expressions for tokens; then we convert them into a DFA to get a
lexical analyzer for our tokens.
Non-Deterministic Finite Automaton (NFA)
 A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
o S - a set of states
o Σ - a set of input symbols (alphabet)
o move - a transition function mapping state-symbol pairs to sets of states
o s0 - a start (initial) state
o F - a set of accepting states (final states)
 ε-transitions are allowed in NFAs. In other words, we can move from one state to
another one without consuming any symbol.
 An NFA accepts a string x if and only if there is a path from the starting state to one of
the accepting states such that the edge labels along this path spell out x.
Example:

Deterministic Finite Automaton (DFA)

 A Deterministic Finite Automaton (DFA) is a special form of an NFA.
 No state has an ε-transition.
 For each symbol a and state s, there is at most one edge labeled a leaving s; i.e., the
transition function maps a state-symbol pair to a state (not a set of states).

Example:
Converting RE to NFA
 This is one way to convert a regular expression into an NFA.
 There can be other (more efficient) ways for the conversion.
 Thompson's Construction is a simple and systematic method.
 It guarantees that the resulting NFA will have exactly one final state, and one start state.
 Construction starts from the simplest parts (alphabet symbols).
 To create an NFA for a complex regular expression, the NFAs of its sub-expressions
are combined to create its NFA.
 To recognize the empty string ε:

 To recognize a symbol a in the alphabet Σ:

 For the regular expression r1 | r2:

N(r1) and N(r2) are the NFAs for the regular expressions r1 and r2.
 For the regular expression r1 r2:

Here, the final state of N(r1) becomes the final state of N(r1r2).

 For the regular expression r*:
Example:
For a RE (a|b) * a, the NFA construction is shown below.

Converting NFA to DFA (Subset Construction)

We merge together NFA states by looking at them from the point of view of the input characters:
 From the point of view of the input, any two states that are connected by an ε-transition
may as well be the same, since we can move from one to the other without consuming
any character. Thus states which are connected by an ε-transition will be represented by
the same state in the DFA.
 If it is possible to have multiple transitions based on the same symbol, then we can regard
a transition on a symbol as moving from a state to a set of states (i.e. the union of all those
states reachable by a transition on the current symbol). Thus these states will be
combined into a single DFA state.
To perform this operation, let us define two functions:
 The ε-closure function takes a state and returns the set of states reachable from it based on
(one or more) ε-transitions. Note that this will always include the state itself. We should be
able to get from a state to any state in its ε-closure without consuming any input.
 The function move takes a state and a character, and returns the set of states reachable
by one transition on this character.
We can generalise both these functions to apply to sets of states by taking the union of the
application to individual states.

For example, if A, B and C are states, move({A,B,C},'a') = move(A,'a') ∪ move(B,'a') ∪
move(C,'a').
The Subset Construction Algorithm is as follows:

put ε-closure({s0}) as an unmarked state into the set of DFA (DS)


while (there is one unmarked S1 in DS) do
begin
mark S1
for each input symbol a do
begin
S2 ← ε-closure(move(S1,a))
if (S2 is not in DS) then
add S2 into DS as an unmarked state
transfunc[S1,a] ← S2
end
end

 A state S in DS is an accepting state of the DFA if a state in S is an accepting state of the NFA.
 The start state of the DFA is ε-closure({s0}).
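The two helper functions can be sketched with bit masks over a small hard-coded NFA for (a|b)*a, the regular expression of the earlier example: state 0 has transitions on a to {0,1} and on b to {0}, and state 1 is accepting. This particular NFA has no ε-transitions, but eps_closure is shown anyway since the construction always applies it; the whole encoding is an assumption made for illustration.

```c
/* Bit-mask sketch of eps-closure and move for a hard-coded 2-state NFA. */
enum { NSTATES = 2 };
static const unsigned eps_edges[NSTATES] = { 0u, 0u };  /* eps edges per state */
static const unsigned on_a[NSTATES] = { 0x3u, 0x0u };   /* move(s, 'a') */
static const unsigned on_b[NSTATES] = { 0x1u, 0x0u };   /* move(s, 'b') */

static unsigned eps_closure(unsigned set)
{
    unsigned old;
    do {                               /* add eps-reachable states to a fixpoint */
        int s;
        old = set;
        for (s = 0; s < NSTATES; s++)
            if (set & (1u << s))
                set |= eps_edges[s];
    } while (set != old);
    return set;
}

static unsigned move_set(unsigned set, char c)
{
    unsigned out = 0;
    int s;
    for (s = 0; s < NSTATES; s++)
        if (set & (1u << s))
            out |= (c == 'a') ? on_a[s] : on_b[s];
    return out;
}

/* Simulate the DFA that the subset construction would produce. */
static int accepts(const char *w)
{
    unsigned S = eps_closure(1u << 0);        /* start = eps-closure({s0}) */
    for (; *w; w++)
        S = eps_closure(move_set(S, *w));
    return (S & (1u << 1)) != 0;              /* accept if NFA state 1 is in S */
}
```

Strings ending in a, such as "bba", are accepted; "ab" and the empty string are rejected.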

Lexical Analyzer Generator

Lex specifications:
A Lex program (the .l file) consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants (a
manifest constant is an identifier that is declared to represent a constant, e.g.
#define PIE 3.14), and regular definitions.
2. The translation rules of a Lex program are statements of the form:

p1 {action1}
p2 {action2}
p3 {action3}
……
……
where each p is a regular expression and each action is a program fragment describing
what action the lexical analyzer should take when a pattern p matches a lexeme. In Lex
the actions are written in C.
3. The third section holds whatever auxiliary procedures are needed by the
actions. Alternatively, these procedures can be compiled separately and loaded with the
lexical analyzer.

Note: You can refer to a sample lex program given in page no. 109 of chapter 3 of the book:
Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman for more clarity.
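As a minimal illustration of the three-part layout only (a sketch, not the book's sample program; the token codes and definitions here are invented), a .l file might look like:

```lex
%{
/* Illustrative token codes; a real program shares these with the parser. */
#define NUM 256
#define ID  257
%}
digit    [0-9]
letter   [A-Za-z]
id       {letter}({letter}|{digit})*
%%
[ \t\n]+     { /* strip white space: no token is returned */ }
{digit}+     { return NUM; }
{id}         { return ID; }
%%
/* auxiliary procedures, if any, go here */
```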

INPUT BUFFERING

The LA scans the characters of the source program one at a time to discover tokens. Because a
large amount of time can be consumed scanning characters, specialized buffering techniques
have been developed to reduce the amount of overhead required to process an input character.
Buffering techniques:
1. Buffer pairs
2. Sentinels

The lexical analyzer scans the characters of the source program one at a time to discover tokens.
Often, however, many characters beyond the next token may have to be examined before the
next token itself can be determined. For this and other reasons, it is desirable for the lexical
analyzer to read its input from an input buffer. The figure shows a buffer divided into two
halves of, say, 100 characters each. One pointer marks the beginning of the token being
discovered. A look-ahead pointer scans ahead of the beginning point, until the token is
discovered. We view the position of each pointer as being between the character last read and
the character next to be read. In practice each buffering scheme adopts one convention: either a
pointer is at the symbol last read or at the symbol it is ready to read.

The distance which the look-ahead pointer may have to travel past the actual token may be
large. For example, in a PL/I program we may see:
DECLARE (ARG1, ARG2, ..., ARGn)
without knowing whether DECLARE is a keyword or an array name until we see the character
that follows the right parenthesis. In either case, the token itself ends at the second E. If the
look-ahead pointer travels beyond the buffer half in which it began, the other half must be
loaded with the next characters from the source file.
Since the buffer shown in the above figure is of limited size, there is an implied constraint on
how much look-ahead can be used before the next token is discovered. In the above example, if
the look-ahead traveled to the left half and all the way through the left half to the middle, we
could not reload the right half, because we would lose characters that had not yet been grouped
into tokens. While we can make the buffer larger if we choose, or use another buffering
scheme, we cannot ignore the fact that the amount of look-ahead is limited.
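The sentinel technique can be sketched in a few lines: with a sentinel character stored right after each buffer half, the inner scanning loop needs only one test per character (the pattern test) instead of two (pattern test plus end-of-buffer test). The helper below is illustrative, using '\0' as the sentinel and scanning a run of lower-case letters.

```c
/* The sentinel fails the range test, so the loop stops at the end of the
   half without a separate bounds check on every character. */
#define SENTINEL '\0'

static int scan_letters(const char *buffer)
{
    const char *forward = buffer;       /* the look-ahead pointer */
    while (*forward >= 'a' && *forward <= 'z')
        forward++;
    return (int)(forward - buffer);     /* length of the lexeme found */
}
```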
