Parsing - 1

Download as pdf or txt
Download as pdf or txt
You are on page 1of 59

Syntax Analysis

• Check syntax and construct abstract syntax tree

if
== = ;
b 0 a b

• Error reporting and recovery


• Model using context free grammars
• Recognize using Push down automata/Table
Driven Parsers
1
Limitations of regular languages
• How to describe language syntax precisely and
conveniently. Can regular expressions be
used?
• Many languages are not regular, for example,
string of balanced parentheses
– ((((…))))
– { (i)i | i ≥ 0 }
– There is no regular expression for this language
• A finite automata may repeat states, however,
it cannot remember the number of times it
has been to a particular state
• A more powerful language is needed to
describe a valid string of tokens 2
Syntax definition
• Context free grammars <T, N, P, S>
– T: a set of tokens (terminal symbols)
– N: a set of non terminal symbols
– P: a set of productions of the form
nonterminal →String of terminals & non terminals
– S: a start symbol
• A grammar derives strings by beginning with a
start symbol and repeatedly replacing a non
terminal by the right hand side of a production
for that non terminal.
• The strings that can be derived from the start
symbol of a grammar G form the language L(G)
defined by the grammar.
3
Examples
• String of balanced parentheses
S → ( S ) S |Є

• Grammar
list → list + digit
| list – digit
| digit
digit → 0 | 1 | … | 9

Consists of the language which is a list of digit


separated by + or -.
4
Derivation
list  list + digit
 list – digit + digit
 digit – digit + digit
 9 – digit + digit
 9 – 5 + digit
9–5+2
Therefore, the string 9-5+2 belongs to the
language specified by the grammar
The name context free comes from the fact
that use of a production X  … does not
depend on the context of X
5
Examples …
• Simplified Grammar for C block
block  ‘{‘ decls statements ‘}’
statements  stmt-list | Є
stmt–list  stmt-list stmt ‘;’
| stmt ‘;’
decls  decls declaration | Є
declaration  …
6
Syntax analyzers
• Testing for membership whether w belongs
to L(G) is just a “yes” or “no” answer
• However the syntax analyzer
– Must generate the parse tree
– Handle errors gracefully if string is not in the
language
• Form of the grammar is important
– Many grammars generate the same language
– Tools are sensitive to the grammar
7
What syntax analysis cannot do!
• To check whether variables are of types on
which operations are allowed

• To check whether a variable has been


declared before use

• To check whether a variable has been


initialized

• These issues will be handled in semantic


analysis
8
Derivation
• If there is a production A  α then we
say that A derives α and is denoted by A
α
• α A β  α γ β if A  γ is a production
• If α1  α2  …  αn then α1 +αn
• Given a grammar G and a string w of
terminals in L(G) we can write S +w
• If S *α where α is a string of terminals
and non terminals of G then we say
that α is a sentential form of G
9
Derivation …
• If in a sentential form only the leftmost non
terminal is replaced then it becomes leftmost
derivation
• Every leftmost step can be written as
wAγ lm* wδγ
where w is a string of terminals and A  δ is a
production
• Similarly, right most derivation can be defined
• An ambiguous grammar is one that produces
more than one leftmost (rightmost) derivation
of a sentence
10
Parse tree
• shows how the start symbol of a
grammar derives a string in the language
• root is labeled by the start symbol
• leaf nodes are labeled by tokens
• Each internal node is labeled by a non

terminal
• if A is the label of anode and x1, x2, …xn
are labels of the children of that node
then A  x1 x2 … xn is a production in the
grammar 11
Example
Parse tree for 9-5+2
list

list + digit

list - digit 2

digit 5

9
12
Ambiguity
• A Grammar can have more than one
parse tree for a string
• Consider grammar
list  list+ list
| list – list
|0|1|…|9

• String 9-5+2 has two parse trees


13
list list

list + list list - list

list - list 2 9 list + list

9 5 5 2

14
Ambiguity …
• Ambiguity is problematic because meaning
of the programs can be incorrect
• Ambiguity can be handled in several ways
– Enforce associativity and precedence
– Rewrite the grammar (cleanest way)
• There is no algorithm to convert
automatically any ambiguous grammar to
an unambiguous grammar accepting the
same language
• Worse, there are inherently ambiguous
languages! 15
Ambiguity in Programming Lang.
• Dangling else problem
stmt  if expr stmt
| if expr stmt else stmt
• For this grammar, the string
if e1 if e2 then s1 else s2
has two parse trees

16
if e1
stmt
if e2
s1
else s2
if expr stmt else stmt

if e1 e1 if expr stmt s2
if e2
s1
else s2 e2 s1
stmt

if expr stmt

e1 if expr stmt else stmt

e2 s1 s2 17
Resolving dangling else problem
• General rule: match each else with the closest
previous unmatched if. The grammar can be
rewritten as
stmt  matched-stmt
| unmatched-stmt
matched-stmt  if expr matched-stmt
else matched-stmt
| others
unmatched-stmt  if expr stmt
| if expr matched-stmt
else unmatched-stmt 18
Associativity
• If an operand has operator on both the
sides, the side on which operator takes this
operand is the associativity of that
operator
• In a+b+c b is taken by left +
• +, -, *, / are left associative
• ^, = are right associative
• Grammar to generate strings with right
associative operators
right  letter = right | letter
letter  a| b |…| z
19
Precedence
• String a+5*2 has two possible
interpretations because of two
different parse trees corresponding to
(a+5)*2 and a+(5*2)
• Precedence determines the correct
interpretation.
• Next, an example of how precedence
rules are encoded in a grammar
20
Precedence/Associativity in the
Grammar for Arithmetic Expressions
Ambiguous • Unambiguous,
with precedence
E E +E and associativity
| E*E rules honored
| (E) E E+T|T
| num | id T T*F|F
F  ( E ) | num
3+2+5 | id
3+2*5 21
Parsing
• Process of determination whether a string
can be generated by a grammar
• Parsing falls in two categories:
– Top-down parsing:
Construction of the parse tree starts at the root
(from the start symbol) and proceeds towards
leaves (token or terminals)
– Bottom-up parsing:
Construction of the parse tree starts from the
leaf nodes (tokens or terminals of the grammar)
and proceeds towards root (start symbol)
22
Top down Parsing
• Following grammar generates types of
Pascal

type  simple
|  id
| array [ simple] of type

simple  integer
| char
| num dotdot num

1
Example …
• Construction of a parse tree is done by starting
the root labeled by a start symbol

• repeat following two steps

– at a node labeled with non terminal A select one of the


productions of A and construct children nodes
(Which production?)
– find the next node at which subtree is Constructed

(Which node?)

2
• Parse
array [ num dotdot num ] of integer

Start symbol type


Expanded using the
rule type  simple
simple
• Cannot proceed as non terminal “simple” never generates
a string beginning with token “array”. Therefore, requires
back-tracking.

• Back-tracking is not desirable, therefore, take help of a


“look-ahead” token. The current token is treated as look-
ahead token. (restricts the class of grammars)

3
array [ num dotdot num ] of integer

Start symbol
look-ahead Expand using the rule
type type  array [ simple ] of type

array [ simple ] of type


Left most non terminal
Expand using the rule
Simple  num dotdot num num dotdot num simple

all the tokens exhausted Left most non terminal integer


Parsing completed Expand using the rule
type  simple

Left most non terminal


Expand using the rule
simple  integer 4
Recursive descent parsing
First set:

Let there be a production


A

then First() is the set of tokens that appear as


the first token in the strings generated from 

For example :
First(simple) = {integer, char, num}
First(num dotdot num) = {num}

5
Define a procedure for each non terminal
procedure type;
if lookahead in {integer, char, num}
then simple
else if lookahead = 
then begin match(  );
match(id)
end
else if lookahead = array
then begin match(array);
match([);
simple;
match(]);
match(of);
type
end
else error;
6
procedure simple;
if lookahead = integer
then match(integer)
else if lookahead = char
then match(char)
else if lookahead = num
then begin match(num);
match(dotdot);
match(num)
end
else
error;

procedure match(t:token);
if lookahead = t
then lookahead = next token
else error; 7
Left recursion
• A top down parser with production
A  A  may loop forever

• From the grammar A  A  |  left


recursion may be eliminated by
transforming the grammar to

A R
R  R|
8
Parse tree corresponding Parse tree corresponding
to a left recursive grammar to the modified grammar

A A

A R

A R

β α α β α Є

Both the trees generate string βα*


9
Example
• Consider grammar for arithmetic expressions

E E+T|T
T  T * F |F
F  ( E ) | id

• After removal of left recursion the grammar becomes

E  T E’
E’  + T E’ | Є
T  F T’
T’ * F T’ | Є
F  ( E ) | id

10
Removal of left recursion
In general

A  A1 | A2 | ….. |Am


|1 | 2 | …… |n

transforms to

A  1A' | 2A' | …..| nA'


A'  1A' | 2A' |…..| mA' | Є
11
Left recursion hidden due to many
productions
• Left recursion may also be introduced by two or more grammar rules.
For example:

S  Aa | b
A  Ac | Sd | Є

there is a left recursion because

S  Aa  Sda

• In such cases, left recursion is removed systematically

– Starting from the first rule and replacing all the occurrences of the first
non terminal symbol

– Removing left recursion from the modified grammar

12
Removal of left recursion due to
many productions …
• After the first step (substitute S by its rhs in the rules) the
grammar becomes

S  Aa | b
A  Ac | Aad | bd | Є

• After the second step (removal of left recursion) the


grammar becomes

S  Aa | b
A  bdA' | A'
A'  cA' | adA' | Є
13
Left factoring
• In top-down parsing when it is not clear which production to choose
for expansion of a symbol
defer the decision till we have seen enough input.

In general if A  1 | 2

defer decision by expanding A to A'

we can then expand A’ to 1 or 2

• Therefore A   1 | 2

transforms to

A  A’
A’  1 | 2

14
Dangling else problem again
Dangling else problem can be handled by left factoring

stmt  if expr then stmt else stmt


| if expr then stmt

can be transformed to

stmt  if expr then stmt S'


S'  else stmt | Є

15
Predictive parsers
• A non recursive top down parsing method

• Parser “predicts” which production to use

• It removes backtracking by fixing one production for every


non-terminal and input token(s)

• Predictive parsers accept LL(k) languages


– First L stands for left to right scan of input
– Second L stands for leftmost derivation
– k stands for number of lookahead token

• In practice LL(1) is used

16
Predictive parsing
• Predictive parser can be implemented by
maintaining an external stack
input
Parse table is a
two dimensional array
M*X,a+ where “X” is a
stack

parser output non terminal and “a” is


a terminal of the grammar

Parse
table
17
Example
• Consider the grammar

E  T E’
E'  +T E' | Є
T  F T'
T'  * F T' | Є
F  ( E ) | id

18
Parse table for the grammar

id + * ( ) $
E ETE’ ETE’

E’ E’+TE’ E’Є E’Є

T TFT’ TFT’

T’ T’Є T’*FT’ T’Є T’Є

F Fid F(E)

Blank entries are error states. For example


E cannot derive a string starting with ‘+’

19
Parsing algorithm
• The parser considers 'X' the symbol on top of stack, and 'a' the
current input symbol

• These two symbols determine the action to be taken by the parser

• Assume that '$' is a special token that is at the bottom of the stack
and terminates the input string

if X = a = $ then halt

if X = a ≠ $ then pop(x) and ip++

if X is a non terminal
then if M[X,a] = {X  UVW}
then begin pop(X); push(W,V,U)
end
else error
20
Example
Stack input action
$E id + id * id $ expand by ETE’
$E’T id + id * id $ expand by TFT’
$E’T’F id + id * id $ expand by Fid
$E’T’id id + id * id $ pop id and ip++
$E’T’ + id * id $ expand by T’Є
$E’ + id * id $ expand by E’+TE’
$E’T+ + id * id $ pop + and ip++
$E’T id * id $ expand by TFT’

21
Example …
Stack input action
$E’T’F id * id $ expand by Fid
$E’T’id id * id $ pop id and ip++
$E’T’ * id $ expand by T’*FT’
$E’T’F* * id $ pop * and ip++
$E’T’F id $ expand by Fid
$E’T’id id $ pop id and ip++
$E’T’ $ expand by T’Є
$E’ $ expand by E’Є
$ $ halt

22
Constructing parse table
• Table can be constructed if for every non terminal, every lookahead
symbol can be handled by at most one production

• First(α) for a string of terminals and non terminals α is


– Set of symbols that might begin the fully expanded (made of only tokens)
version of α

• Follow(X) for a non terminal X is


– set of symbols that might follow the derivation of X in the input stream

first follow
23
Compute first sets
• If X is a terminal symbol then First(X) = {X}

• If X  Є is a production then Є is in First(X)

• If X is a non terminal
and X  YlY2 … Yk is a production
then
if for some i, a is in First(Yi)
and Є is in all of First(Yj) (such that j<i)
then a is in First(X)

• If Є is in First (Y1) … First(Yk) then Є is in First(X)

24
Example
• For the expression grammar
E  T E’
E'  +T E' | Є
T  F T'
T'  * F T' | Є
F  ( E ) | id

First(E) = First(T) = First(F) = { (, id }


First(E') = {+, Є}
First(T') = { *, Є}

25
Compute follow sets
1. Place $ in follow(S)
2. If there is a production A → αBβ
then everything in first(β) (except ε) is in follow(B)

3. If there is a production A → αB
then everything in follow(A) is in follow(B)

4. If there is a production A → αBβ


and First(β) contains ε
then everything in follow(A) is in follow(B)

Since follow sets are defined in terms of follow sets last two steps
have to be repeated until follow sets converge

26
Example
• For the expression grammar
E  T E’
E'  + T E' | Є
T  F T'
T'  * F T' | Є
F  ( E ) | id

follow(E) = follow(E’) = , $, ) -
follow(T) = follow(T’) = , $, ), + -
follow(F) = { $, ), +, *}
27
Construction of parse table
• for each production A  α do
– for each terminal ‘a’ in first(α)
M[A,a] = A  α

– If Є is in First(α)
M[A,b] = A  α
for each terminal b in follow(A)

– If ε is in First(α) and $ is in follow(A)


M[A,$] = A  α

• A grammar whose parse table has no multiple entries is called LL(1)

28
Practice Assignment
• Construct LL(1) parse table for the expression grammar
bexpr  bexpr or bterm | bterm
bterm  bterm and bfactor | bfactor
bfactor  not bfactor | ( bexpr ) | true | false

• Steps to be followed
– Remove left recursion
– Compute first sets
– Compute follow sets
– Construct the parse table

29
Error handling
• Stop at the first error and print a message
– Compiler writer friendly
– But not user friendly

• Every reasonable compiler must recover from errors and identify as


many errors as possible

• However, multiple error messages due to a single fault must be


avoided

• Error recovery methods


– Panic mode

– Phrase level recovery

– Error productions

– Global correction
30
Panic mode
• Simplest and the most popular method

• Most tools provide for specifying panic mode


recovery in the grammar

• When an error is detected


– Discard tokens one at a time until a set of tokens is
found whose role is clear
– Skip to the next token that can be placed reliably in the
parse tree

31
Panic mode …
• Consider following code
begin
a = b + c;
x=pr;
h = x < 0;
end;

• The second expression has syntax error

• Panic mode recovery for begin-end block


skip ahead to next ‘;’ and try to parse the next expression

• It discards one expression and tries to continue parsing

• May fail if no further ‘;’ is found

32
Phrase level recovery
• Make local correction to the input

• Works only in limited situations


– A common programming error which is easily detected
– For example insert a “;” after closing “-” of a class
definition

• Does not work very well!

33
Error productions
• Add erroneous constructs as productions in the grammar

• Works only for most common mistakes which can be


easily identified

• Essentially makes common errors as part of the grammar

• Complicates the grammar and does not work very well

34
Global corrections
• Considering the program as a whole find a correct
“nearby” program

• Nearness may be measured using certain metric

• PL/C compiler implemented this scheme:


anything could be compiled!

• It is complicated and not a very good idea!

35
Error Recovery in LL(1) parser
• Error occurs when a parse table entry M[A,a] is empty

• Skip symbols in the input until a token in a selected set


(synch) appears

• Place symbols in follow(A) in synch set. Skip tokens until


an element in follow(A) is seen.
Pop(A) and continue parsing

• Add symbol in first(A) in synch set. Then it may be


possible to resume parsing according to A if a symbol in
first(A) appears in input.

36
Practice Assignment
• Reading assignment: Read about error
recovery in LL(1) parsers
• Assignment to be submitted:
– introduce synch symbols (using both follow
and first sets) in the parse table created for the
boolean expression grammar in the previous
assignment
– Parse “not (true and or false)” and show how
error recovery works

37

You might also like