Chapter 5 Syntax Analysis
Syntax Analysis (Parser)
An algorithm that groups the set of tokens sent by the scanner to form
syntax structures such as expressions, statements, blocks, etc.
Simply put, the parser checks whether the source code follows the
grammar (production rules) of the language.
The syntax structure of programming languages, and even spoken languages, can
be expressed in what is called BNF notation, which stands for Backus-Naur Form.
Note : The BNF notation uses special symbols. For example, a sentence is
defined as :
< sentence > ::= < noun-phrase > < verb-phrase >
In the same way, the parser tries to derive your source program
from the starting symbol of the grammar. Let us say we have these
sentences :
Syntax-wise, all of these sentences are correct. However, their meaning is not
correct, and they are not useful. What differentiates two sentences that are
grammatically correct is their meaning, or their semantics. You and I can
agree that the meaning of a grammatically correct sentence may still be wrong, but
how does the computer determine that?
Grammar
A grammar is a quadruple G = (VN, VT, S, P), where:
VN is the set of nonterminal symbols,
VT is the set of terminal symbols,
S ∈ VN is the starting symbol, and
P is the set of productions (rewriting rules).
Note :
1. VN ∩ VT = ∅.
2. VN ∪ VT = V (the vocabulary of the grammar).
Strings are formed from symbols of V = VN ∪ VT. For example, with VN = {S, A, B}
and VT = {0, 1}, the following are strings over V:
α = A11B
β = S110B
γ = 0010
Productions
1. A production α --> β (alpha derives beta) is a rewriting rule such that any
occurrence of α can be substituted by β in any string.
Note that α must contain at least one nonterminal from VN.
For example, consider the grammar with productions P:
S --> aSBC
S --> abC
CB --> BC
bB --> bb
bC --> bc
cC --> cc
What is L(G)?
S --> aSBC --> aabCBC --> aabcBC --> blocked, so we try another path
S --> aSBC --> aabCBC --> aabBCC --> aabbCC -->
aabbcC --> aabbcc ∈ L(G) <--- A sentence
Therefore, L(G) = { aⁿbⁿcⁿ | n ≥ 1 }
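As a quick sanity check, membership in this language is easy to test directly. The following Python sketch (an illustration, not part of the grammar machinery; the function name is ours) compares the lengths of the a, b, and c runs:

```python
import re

def in_L(s):
    # L(G) = { a^n b^n c^n | n >= 1 }: equal-length runs of a's, b's, c's, in order
    m = re.fullmatch(r"(a+)(b+)(c+)", s)
    return bool(m) and len(m.group(1)) == len(m.group(2)) == len(m.group(3))
```

Note that no regular expression alone can check the equal-length condition; this is exactly why the grammar above needs context-sensitive productions such as CB --> BC.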
E --> E+T <-- we can write the productions 1 and 2 as a single production E --> E+T | T
E --> T
T --> T*F
T --> F
F --> (E) <-- we can write the productions 5 and 6 as a single production F --> (E) | n
F --> n
Let us take another example (tokens between double quotes are terminals)
{READ ; WRITE ;} #
Consider the grammar:
V --> S R $
S --> + | - | λ
R --> .dN | dN.N
N --> dN | λ
VN = {V,R,S,N}
VT = {+, - , ., d, $}
V --> SR$ --> -R$ --> -dN.N$ --> -ddN.N$ --> -dddN.N$ --> -ddd.N$ -->
-ddd.dN$ --> -ddd.d$ <-- A sentence.
V --> SR$ --> SdN.N$ --> SdN.dN$ --> SdN.d$ --> SddN.d$ --> SdddN.d$ -->
Sddd.d$ --> -ddd.d$ <-- A sentence.
Derivation Trees
A Derivation Tree is a Tree that displays the derivation of some sentence in the
language. For example, let us look at the tree for the previous example
Note that if we traverse the tree in order, recording only the leaves, we obtain the
sentence.
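The leaves-in-order observation can be demonstrated with a small sketch. The tuple encoding of trees below is an assumption made for illustration; the tree is one derivation tree for the sentence -d.d$ in the number grammar above, with λ represented as the empty string:

```python
# A derivation tree as nested tuples (label, children); a leaf has no children.
tree = ("V", [
    ("S", [("-", [])]),
    ("R", [("d", []),
           ("N", [("", [])]),                        # N -> λ
           (".", []),
           ("N", [("d", []), ("N", [("", [])])])]),  # N -> dN, then N -> λ
    ("$", []),
])

def leaves(node):
    """In-order traversal collecting only the leaves."""
    label, children = node
    if not children:
        return [label]
    out = []
    for child in children:
        out.extend(leaves(child))
    return out
```

Joining the leaves, `"".join(leaves(tree))`, reproduces the sentence -d.d$.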
Classes of Grammars
Grammars are classified by the form of their productions:
1. Unrestricted (Phrase-Structure) Grammar : no restriction on the form of the
productions.
2. Context-Sensitive Grammar : for every production α --> β, |α| ≤ |β|.
3. Context-Free Grammar (CFG) : each production is of the form A --> α, where
A ∈ VN. The grammars of most programming-language constructs are CFG.
4. Regular Grammar (Regular Expressions) : Each production in this
grammar class is of the form A --> aB or A --> a, where A,B ∈ VN
and a ∈ VT, with the exception of S --> λ.
For example, given the grammar:
A --> aA
A --> a
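This regular grammar generates the language a⁺ (one or more a's). A small Python sketch (illustrative; the function name is ours) that applies the two productions mechanically:

```python
import re

def derive(n):
    """Generate a^n by applying A -> aA (n-1 times), then A -> a."""
    s = "A"
    for _ in range(n - 1):
        s = s.replace("A", "aA", 1)
    return s.replace("A", "a", 1)
```

Every string this grammar derives matches the regular expression a+, which illustrates the correspondence between regular grammars and regular expressions.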
Parsing Techniques
Top-Down Parsing
In Top-Down Parsing, the parser builds the derivation tree from the
root (S : the starting symbol) down to the leaves (the sentence).
In simple words, the parser tries to derive the sentence using leftmost
derivations. Consider, for example, the productions:
S --> + | - | λ
N --> dN | λ
Problem : the parser does not know which production it should select in each
derivation step. We will learn how to solve this issue later in the
course.
Bottom-Up Parsing
In Bottom-Up Parsing, the parser builds the derivation tree from the
leaves (the sentence) up to the root (S : the starting symbol).
In simple words, the parser starts with the given sentence and applies
reduction steps (a reduction is the opposite of a derivation) until the
starting symbol is reached.
+dd.d$ --> +ddλ.d$ --> +ddN.d$ --> +dN.d$ --> +dN.dλ$ -->
+dN.dN$ --> +dN.N$ --> +R$ --> SR$ --> V, which means that the
sentence is valid.
Note that we can run into deadlocks here. Say we took this path instead :
+dd.d$ --> +ddλ.d$ --> +ddN.d$ --> +dN.d$ --> +dN.dλ$ --> +dN.dN$ -->
+dNR$ --> +NR$ --> SNR$ --> Deadlock.
This technique also has a major problem: which substring should the parser
select in each reduction step?
Ambiguity
Can a sentence have more than one derivation tree? For some grammars yes, and
for others no.
If there is only one derivation tree representing the sentence, this means
there is only one way to derive the sentence. Based on this, we can say
that :
A grammar is ambiguous if there exists a sentence with more than one
derivation tree. That is, there is more than one way to derive that sentence.
For example, the expression grammar can be written as :
E --> E + E | T
T --> T*T | F
F--> (E) | a
Now, take the sentence a + a * a and find the derivation tree.
There is only one possible derivation tree now. This solves the precedence
problem between + and * that the previous grammar had.
Next, for a sentence such as a + a + a, let us try to find the derivation tree
and any alternative trees.
We can see here that there is more than one derivation tree, so the grammar is
still ambiguous.
We can solve this if we rewrite the grammar with the left-associative rule
E --> E + T | T
T --> T * F | F
F --> (E) | a
This new grammar is not ambiguous, and it also enforces left associativity
for + and *, in line with our standard.
Left-Recursive Grammars
A grammar is said to be left-recursive if there is a production of the form:
A --> Aα
Left recursion causes problems for top-down parsing techniques (we will see why later).
Conversely, a grammar is right-recursive if there is a production of the form:
A --> αA
which causes no problems in top-down parsing.
Algorithm (eliminating immediate left recursion).
Given that:
A --> Aα1 | Aα2 | ... | Aαn
A --> β1 | β2 | ... | βm
replace these productions with:
A --> β1A` | β2A` | ... | βmA`
A` --> α1A` | α2A` | ... | αnA` | λ
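The transformation can be sketched in code. The encoding of productions as lists of symbols and the helper name are assumptions made for illustration:

```python
def eliminate_left_recursion(nt, prods):
    """Remove immediate left recursion for nonterminal nt.

    prods: list of right-hand sides, each a list of symbols.
    Returns a dict of rewritten productions; the new nonterminal is
    named nt + "'" and λ is the empty list [].
    """
    new = nt + "'"
    alphas = [p[1:] for p in prods if p[:1] == [nt]]   # A -> A alpha
    betas = [p for p in prods if p[:1] != [nt]]        # A -> beta
    if not alphas:
        return {nt: prods}                             # nothing to do
    return {
        nt: [b + [new] for b in betas],                # A  -> beta A'
        new: [a + [new] for a in alphas] + [[]],       # A' -> alpha A' | λ
    }
```

For example, E --> E+T | T becomes E --> T E', E' --> + T E' | λ, which derives the same language without left recursion.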
This grammar is now in good shape: it solves our ambiguity issues, and it is a
grammar we can use to construct the production rules for our programming language.
A well-known remaining ambiguity is the dangling ELSE in the IF statement.
There are two common ways to handle it:
1. Add a delimiter to the IF statement, such as ENDIF or END or FI, at the end of the
statement, resulting in these
productions :
2. Make the compiler always prefer to shift the ELSE when it sees the ELSE in the
source code.
Left Factoring
Consider the productions :
A --> αβ
A --> αγ
Note how the first part of the productions is the same. This grammar can be
transformed by introducing a new non-terminal B,
So what happens now is:
A --> αB
B --> β | γ
For our grammar, this results in
if-stmt --> IF condition stmt
if-stmt --> IF condition stmt ELSE stmt
becomes:
if-stmt --> IF condition stmt else-part
else-part --> ELSE stmt | λ
Does this solve the ambiguity? No, but it helps by removing choices, since the if-stmt is
now a single production. If we look at the
statement :
IF C
IF C
S
ELSE
S
It still has 2 derivation trees
Extended BNF (EBNF)
Recall the left-recursive expression grammar:
E --> E + T | T
T --> T * F | F
F --> (E) | a
which gives derivations of the form
E --> E + T --> E + T + T --> E + T + T + T --> T + T + T + ... + T
In EBNF, such repetition is written with *, so the same grammar can be expressed as
E --> T (+ T)*
T --> F (* F)*
F --> (E) | a
Syntax Diagrams
Another way to express languages is with syntax diagrams. These are used only with
the Extended-BNF notation.
A square shape represents a nonterminal and an oval shape represents a terminal.
Parsing Techniques
1. Top-Down Parsers : The parser starts from the starting symbol S of the grammar,
doing derivation steps, until it derives the given sentence.
The major problem with this parsing technique is that the parser doesn't know which
production it should select in each derivation step.
2. Bottom-Up Parsers : The parser in this parsing technique starts from the sentence,
doing reduction steps, until it reaches the starting symbol S of the grammar.
The major problem with this technique is that the parser doesn't know which
substring it should select in each reduction step.
FIRST(α) = the set of all terminals that may begin strings derived from α.
For example, if
α --*--> cBx
α --*--> ayD
α --*--> ab
α -----> ddd
Then
FIRST(α) = {c,a,d}
Assume as well that
α --*--> λ
then
FIRST(α) = {c,a,d,λ}
That is to say, λ may appear in the FIRST() set.
FOLLOW(A) = the set of all terminals that may appear after A in any derivation.
For example, if
S --*--> aaXdd
S --*--> Xa
S --*--> BXc
Then
FOLLOW(X) = {d,a,c}
Rules To Compute FIRST() and FOLLOW() Sets
1. FIRST(λ) = {λ}.
2. FIRST(a) = { a }.
3. FIRST(aα)= {a}.
4. FIRST(XY) = FIRST(FIRST(X).FIRST(Y)) OR
FIRST(X.FIRST(Y)) OR
FIRST(FIRST(X).Y).
5. Given the production A --> αXβ, then :
a. FIRST(β) − {λ} ⊆ FOLLOW(X) if β ≠ λ.
b. FOLLOW(A) ⊆ FOLLOW(X) if β = λ or λ ∈ FIRST(β).
Note that the FIRST() and FOLLOW() sets are made of terminals only
Notes :
1. λ may appear in FIRST() but it doesn't appear in FOLLOW(). We will see this when
we define augmented grammars.
2. Generally, we start computing the FIRST() from bottom to top, But FOLLOW() from
top to bottom.
3. When we compute FOLLOW(X), we search for X in the right side of any production.
Augmented Grammars
Example 1 :
S` --> S$
S --> AB
A --> a | λ
B --> b | λ
Example 2 :
S` --> S$
S --> aAcb
S --> Abc
A --> b | c | λ
Let us take the FIRST() for this grammar :
FIRST(A) = {b,c,λ}
FIRST(S) = FIRST(aAcb) ∪ FIRST(Abc) = {a} ∪ {b,c}
= {a,b,c}
FIRST(S`) = FIRST(S$) = FIRST(FIRST(S).FIRST($))
= FIRST({a,b,c}.{$})
= {a,b,c}
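The same bottom-up FIRST computation can be sketched as a fixed-point iteration. The grammar encoding below is an assumption made for illustration (λ is the empty production []):

```python
def first_sets(grammar, terminals):
    """grammar: dict nonterminal -> list of productions (lists of symbols).
    Returns the FIRST set of every symbol; "λ" marks nullability."""
    FIRST = {t: {t} for t in terminals}
    FIRST.update({nt: set() for nt in grammar})
    changed = True
    while changed:                      # iterate until no set grows
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                add, nullable = set(), True
                for sym in prod:
                    add |= FIRST[sym] - {"λ"}
                    if "λ" not in FIRST[sym]:
                        nullable = False
                        break
                if nullable:            # whole right-hand side can vanish
                    add.add("λ")
                if not add <= FIRST[nt]:
                    FIRST[nt] |= add
                    changed = True
    return FIRST

# Example 2 from above: S` -> S$, S -> aAcb | Abc, A -> b | c | λ
G = {"S`": [["S", "$"]],
     "S": [["a", "A", "c", "b"], ["A", "b", "c"]],
     "A": [["b"], ["c"], []]}
F = first_sets(G, {"a", "b", "c", "$"})
```

Running this reproduces the sets computed by hand: FIRST(A) = {b, c, λ} and FIRST(S) = FIRST(S`) = {a, b, c}. FOLLOW sets can be computed with a similar fixed-point loop over rule 5.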
Example 3:
G --> E$
E --> E + T | T
T --> T * F | F
F --> (E) | a
Recursive Descent Parsing
We write a piece of code for each right-hand-side form:
1- For X = a (a terminal):
Code(X):
{ If(token == "a")
get-next()
else
report-error()
}
2- For X = α1 α2 ...αn :
Code(X):
{ Code(α1);
Code(α2);
.
.
Code(αn);
}
3- For X = α1 | α2 | ... | αn :
Code(X):
{ If (token ϵ FIRST(α1))
Code(α1);
Else
If (token ϵ FIRST(α2))
Code(α2);
Else
.
.
Else
If (token ϵ FIRST(αn))
Code(αn);
Else
Report-error();
}
4- For X = α1 | α2 | ... | αn , if one of the αi's = λ, say αn = λ :
Code(X):
{ If (token ϵ FIRST(α1))
Code(α1);
Else
If (token ϵ FIRST(α2))
Code(α2);
Else
.
.
Else
If (token ϵ FIRST(αn-1))
Code(αn-1);
Else
If (token is not ϵ FOLLOW(X))
Code(αn);
Else
Report-error();
}
5- For X = α* :
Code(X):
{ While (token ϵ FIRST(α))
Code(α);
}
Notes :
1. Every nonterminal has a code(a function).
2. S` in augmented grammar is represented by the function "main".
3. We only start with calling "get-token" in function "main".
Example:
G --> E$
E --> T( + T )*
T --> F( * F )*
F --> ( E ) | a
main(){ //represents G
get-token();
call E();
if(token!="$")
Error;
else
SUCCESS;
}
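For the same EBNF grammar, a complete, runnable recursive-descent recognizer can be sketched in Python (tokens are simplified to single characters; the function names are ours):

```python
# Recursive-descent recognizer for: G -> E$, E -> T(+T)*, T -> F(*F)*, F -> (E) | a
def parse(s):
    toks = list(s) + ["$"]
    pos = 0

    def token():
        return toks[pos]

    def advance():
        nonlocal pos
        pos += 1

    def E():                      # E -> T ( + T )*
        T()
        while token() == "+":
            advance()
            T()

    def T():                      # T -> F ( * F )*
        F()
        while token() == "*":
            advance()
            F()

    def F():                      # F -> ( E ) | a
        if token() == "(":
            advance()
            E()
            if token() != ")":
                raise SyntaxError("expected )")
            advance()
        elif token() == "a":
            advance()
        else:
            raise SyntaxError("expected ( or a")

    E()
    return token() == "$"         # the whole input must be consumed
```

Note how each (…)* in the EBNF grammar becomes a while loop, exactly as in rule 5 above, and how left recursion never arises.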
Now let us consider a small programming language with:
VN = { Program, body, stmt, block}
VT = { ., Begin, ;, End, Read, Write}
Begin
Read;
Write;
Read;
Write;
End.
Or
Begin
Read;
End.
Or
Begin
Read;
Begin
Read;
Write;
End.
Write;
End.
Or
Begin;
;
;
;
;
End.
Let us write the recursive descent code for this programming language.
main(){
get-token();
call body();
if(token != "."){
ERROR;
}
Else {
SUCCESS;
}
}
function body()
{ if(token == "Begin")
{get-token();
call stmt();
while(token == ";")
{ get-token();
call stmt();
}
if(token == "End")
get-token();
else
ERROR;
}
else
ERROR;
}
function stmt()
{
if(token == "Read")
get-token();
else if (token == "Write")
get-token();
else if(token == "Begin")
call body();
else
if(token != ";" && token != "End" )
ERROR();
}
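The same recursive-descent code can be written as a runnable Python sketch (tokens are given as a list of strings; the function names are ours):

```python
def parse_program(tokens):
    """Recursive-descent recognizer for the Begin/Read/Write language.
    tokens: e.g. ["Begin", "Read", ";", "Write", ";", "End", "."]."""
    pos = 0

    def token():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def body():                      # body -> Begin stmt ( ; stmt )* End
        if token() != "Begin":
            raise SyntaxError("expected Begin")
        advance()
        stmt()
        while token() == ";":
            advance()
            stmt()
        if token() != "End":
            raise SyntaxError("expected End")
        advance()

    def stmt():                      # stmt -> Read | Write | body | λ
        if token() in ("Read", "Write"):
            advance()
        elif token() == "Begin":
            body()
        elif token() not in (";", "End"):
            raise SyntaxError("bad statement")
        # otherwise stmt -> λ: consume nothing

    body()                           # program -> body .
    if token() != ".":
        raise SyntaxError("expected .")
    return True
```

The λ-alternative in stmt is handled exactly as in rule 4: if the token is in FOLLOW(stmt) = {;, End}, the function simply returns without consuming anything.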
LL(1) Parsing
This Parsing method is a table-driven parsing method. The LL(1) parsing table
selects which production to choose for the next derivation step.
Let us assume that we have a grammar that is LL(1). How do we build the LL(1) parsing
table?
FIRST(SR $) = {+,-,d,.}
FIRST(+) = {+}
FOLLOW(S) = {d,.}
FIRST(R) = {d, .}
FIRST(d) = { d }
FOLLOW(N) = {., $}
VN\VT |  +  |  -  |  d  |  .  |  $
  V   |  1  |  1  |  1  |  1  |
  S   |  2  |  3  |  4  |  4  |
  R   |     |     |  6  |  5  |
  N   |     |     |  7  |  8  |  8
Stack  | Input  | Action
V      | -dd.d$ | Production 1 (V --> SR$)
SR$    | -dd.d$ | Production 3 (S --> -)
-R$    | -dd.d$ | match -
R$     | dd.d$  | Production 6 (R --> dN.N)
dN.N$  | dd.d$  | match d
N.N$   | d.d$   | Production 7 (N --> dN)
dN.N$  | d.d$   | match d
N.N$   | .d$    | Production 8 (N --> λ)
.N$    | .d$    | match .
N$     | d$     | Production 7 (N --> dN)
dN$    | d$     | match d
N$     | $      | Production 8 (N --> λ)
$      | $      | match $
λ      | λ      | Accept
If at any point the parser reaches a place where the input and the stack have 2
different terminal symbols, it throws a syntax error.
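The LL(1) driver itself can be sketched in Python (the dictionary encoding of the table is an assumption; the entries are copied from the table above, productions numbered 1 through 8):

```python
# 1 V->SR$  2 S->+  3 S->-  4 S->λ  5 R->.dN  6 R->dN.N  7 N->dN  8 N->λ
PRODS = {1: ["S", "R", "$"], 2: ["+"], 3: ["-"], 4: [],
         5: [".", "d", "N"], 6: ["d", "N", ".", "N"], 7: ["d", "N"], 8: []}
TABLE = {("V", "+"): 1, ("V", "-"): 1, ("V", "d"): 1, ("V", "."): 1,
         ("S", "+"): 2, ("S", "-"): 3, ("S", "d"): 4, ("S", "."): 4,
         ("R", "d"): 6, ("R", "."): 5,
         ("N", "d"): 7, ("N", "."): 8, ("N", "$"): 8}
NONTERMS = {"V", "S", "R", "N"}

def ll1_parse(s):
    stack = ["V"]                     # parser stack, top at the end
    i = 0
    while stack:
        top = stack.pop()
        tok = s[i] if i < len(s) else None
        if top in NONTERMS:
            prod = TABLE.get((top, tok))
            if prod is None:          # empty table entry: syntax error
                return False
            stack.extend(reversed(PRODS[prod]))
        elif top == tok:              # terminal on stack must match input
            i += 1
        else:
            return False
    return i == len(s)
```

Running it on -dd.d$ performs exactly the stack trace shown above and accepts.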
Let us take another example. Let the grammar be :
program --> block $                                                (1)
block --> { decls stmts }                                          (2)
decls --> D ; decls (3) | λ (4)
stmts --> statement ; stmts (5) | λ (6)
statement --> if (7) | while (8) | ass (9) | scan (10) | print (11) | block (12) | λ (13)
VT = {$, {, }, D, ;, if, while, ass, scan, print}
VN\VT     |  $  |  {  |  }  |  D  |  ;  | if | while | ass | scan | print
program   |     |  1  |     |     |     |    |       |     |      |
block     |     |  2  |     |     |     |    |       |     |      |
decls     |     |  4  |  4  |  3  |  4  |  4 |   4   |  4  |  4   |  4
stmts     |     |  5  |  6  |     |  5  |  5 |   5   |  5  |  5   |  5
statement |     | 12  |     |     | 13  |  7 |   8   |  9  | 10   | 11
Another example is the if..else statement (the dangling-ELSE problem). The grammar looks like this :
S` --> S$        (1)
S --> iCSE       (2)
E --> eS (3) | λ (4)
S --> a          (5)
C --> c          (6)
VN\VT |  i  |  a  |  e  |  c  |  $
  S`  |  1  |  1  |     |     |
  S   |  2  |  5  |     |     |
  E   |     |     | 3,4 |     |  4
  C   |     |     |     |  6  |
There is a conflict (two entries in one cell). To solve this, we can add a delimiter.
S` --> S$        (1)
S --> iCSEd      (2)
E --> eS (3) | λ (4)
S --> a          (5)
C --> c          (6)
VN\VT |  i  |  a  |  e  |  c  |  d  |  $
  S`  |  1  |  1  |     |     |     |
  S   |  2  |  5  |     |     |     |
  E   |     |     |  3  |     |  4  |
  C   |     |     |     |  6  |     |
Alternatively, we can keep the original grammar and simply remove production 4 from the
conflicting entry of the LL(1) table (i.e., always prefer production 3 on e):
VN\VT |  i  |  a  |  e  |  c  |  $
  S`  |  1  |  1  |     |     |
  S   |  2  |  5  |     |     |
  E   |     |     |  3  |     |  4
  C   |     |     |     |  6  |
Note:
- If a grammar is LL(1), then it is unambiguous. However, the opposite is not
necessarily true.
- Another thing to note is that in Top-Down parsing, we should avoid grammars
that are not LL(1).
Bottom-Up Parsing
Recall that in Bottom-Up parsing, the parser starts from the given sentence, applying
reductions until it reaches the starting symbol of the grammar or a deadlock.
The major problem with Bottom-Up parsing is which substring we should select in
each reduction step.
V --> S R $
S --> +|-|λ
R --> .dN | dN.N
N --> dN | λ
V <-- SR$ <-- SdN.N$<-- SdN.dN$ <-- SdN.dλ$ <-- SddN.d$ <-- Sddλ.d$ <-- -dd.d$
But compilers do not work like this. Here we obtained the reductions simply by
reversing a derivation we already knew; the parser has to find them on its own.
We cannot build a deterministic Bottom-Up parser for every Context-Free Grammar.
However, we are fortunate that there exist subsets of the Context-Free Grammars for
which we can build a deterministic Bottom-Up parser, i.e. the parser can
determine/decide precisely where the handle is in each reduction step.
Some of these subsets are:
LR Parsers :
- SLR(Simple-LR).
- LALR(Look-Ahead LR).
- LR.
Operator Precedence.
We will only be talking about the LR parsers, just to get an idea of how Bottom-Up
parsing works.
SLR Parsing
An SLR parser consists of:
1. A parsing table.
2. A stack.
3. The input string.
I0: E`-->.E
E-->.E+T
E-->.T
T--> .T*F
T-->.F
F-->.(E)
F-->.a
function GOTO(I,X) =
CLOSURE(all items A-->αX.β Where A-->α.Xβ in I)
Applying GOTO to I0 on each grammar symbol gives the kernels:
I1 : E`-->E. , E-->E.+T
I2 : E-->T. , T-->T.*F
I3 : T-->F.
I4 : F-->(.E)
I5 : F-->a.
and take the CLOSURE for all these sets. The resultant is :
I0: E`-->.E
E-->.E+T
E-->.T
T--> .T*F
T-->.F
F-->.(E)
F-->.a
I1 : E-->E.
E-->E.+T
I2 : E-->T.
T--> T.*F
I3 : T-->F.
I4 : F-->(.E)
E-->.E+T
E-->.T
T-->.T*F
T-->.F
F-->.(E)
F-->.a
I5 : F-->a.
I6 : E-->E+.T
T-->.T*F
T-->.F
F-->.(E)
F-->.a
I7 : T-->T*.F
F-->.(E)
F-->.a
I8 : F-->(E.)
E-->E.+T
I9 : E --> E+T.
T --> T.*F
I10 : T --> T*F.
I11 : F --> (E).
Let us apply this to the example above and generate the table:
               ACTION                      |    GOTO
State |  a  |  +  |  *  |  (  |  )  |  $  |  E  |  T  |  F
  0   | S5  |     |     | S4  |     |     |  1  |  2  |  3
  1   |     | S6  |     |     |     |  A  |     |     |
  2   |     | R2  | S7  |     | R2  | R2  |     |     |
  3   |     | R4  | R4  |     | R4  | R4  |     |     |
  4   | S5  |     |     | S4  |     |     |  8  |  2  |  3
  5   |     | R6  | R6  |     | R6  | R6  |     |     |
  6   | S5  |     |     | S4  |     |     |     |  9  |  3
  7   | S5  |     |     | S4  |     |     |     |     | 10
  8   |     | S6  |     |     | S11 |     |     |     |
  9   |     | R1  | S7  |     | R1  | R1  |     |     |
 10   |     | R3  | R3  |     | R3  | R3  |     |     |
 11   |     | R5  | R5  |     | R5  | R5  |     |     |
The reductions Ri refer to the productions:
(1) E --> E+T
(2) E --> T
(3) T --> T*F
(4) T --> F
(5) F --> (E)
(6) F --> a
Stack    | Input | Action
0        | a+a$  | S5
0a5      | +a$   | R6
0F3      | +a$   | R4
0T2      | +a$   | R2
0E1      | +a$   | S6
0E1+6    | a$    | S5
0E1+6a5  | $     | R6
0E1+6F3  | $     | R4
0E1+6T9  | $     | R1
0E1      | $     | Accept
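The whole SLR driver can be sketched in Python using the table above (the dictionary encoding is an assumption; the entries are copied from the table, with A for Accept):

```python
# Productions: 1 E->E+T  2 E->T  3 T->T*F  4 T->F  5 F->(E)  6 F->a
PROD = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
        4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}   # (lhs, rhs length)
ACTION = {
    (0, "a"): ("S", 5), (0, "("): ("S", 4),
    (1, "+"): ("S", 6), (1, "$"): ("A", 0),
    (2, "+"): ("R", 2), (2, "*"): ("S", 7), (2, ")"): ("R", 2), (2, "$"): ("R", 2),
    (3, "+"): ("R", 4), (3, "*"): ("R", 4), (3, ")"): ("R", 4), (3, "$"): ("R", 4),
    (4, "a"): ("S", 5), (4, "("): ("S", 4),
    (5, "+"): ("R", 6), (5, "*"): ("R", 6), (5, ")"): ("R", 6), (5, "$"): ("R", 6),
    (6, "a"): ("S", 5), (6, "("): ("S", 4),
    (7, "a"): ("S", 5), (7, "("): ("S", 4),
    (8, "+"): ("S", 6), (8, ")"): ("S", 11),
    (9, "+"): ("R", 1), (9, "*"): ("S", 7), (9, ")"): ("R", 1), (9, "$"): ("R", 1),
    (10, "+"): ("R", 3), (10, "*"): ("R", 3), (10, ")"): ("R", 3), (10, "$"): ("R", 3),
    (11, "+"): ("R", 5), (11, "*"): ("R", 5), (11, ")"): ("R", 5), (11, "$"): ("R", 5),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (4, "E"): 8, (4, "T"): 2,
        (4, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

def slr_parse(s):
    toks = list(s) + ["$"]
    stack = [0]                        # stack of states
    i = 0
    while True:
        act = ACTION.get((stack[-1], toks[i]))
        if act is None:                # empty entry: syntax error
            return False
        kind, n = act
        if kind == "S":                # shift: push the new state
            stack.append(n)
            i += 1
        elif kind == "R":              # reduce: pop |rhs| states, then GOTO
            lhs, size = PROD[n]
            del stack[len(stack) - size:]
            stack.append(GOTO[(stack[-1], lhs)])
        else:                          # accept
            return True
```

Running it on a+a reproduces the parse trace shown above, step by step.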
LR Parsing Techniques
An LR(1) item has the form [A --> α.β , a], where a is the look-ahead. The look-ahead
symbol "a" has no effect whatsoever on an item [A --> α.β , a] with β ≠ λ
(a non-complete item). However, if the item is complete, [A --> α. , a], it means we
reduce by the production A --> α on token "a".
(0) S` --> S
(1) S --> CC
(2) C --> cC
(3) C --> d
I0:
S` --> .S, $ I1
S --> .CC $ I2
C --> .cC c,d I3
C --> .d c,d I4
I1:
S` --> S. , $ Accept
I2:
S --> C.C $ I5
C --> .cC $ I6
C --> .d $ I7
I3:
C --> c.C c,d I8
C --> .cC c,d I3
C --> .d c,d I4
I4:
C --> d., c,d Complete
I5 :
S --> CC. $ Complete
I6 :
C --> c.C $ I9
C --> .cC $ I6
C --> .d $ I7
I7 :
C --> d. $ Complete
I8 :
C --> cC. c,d Complete
I9 :
C --> cC. $ Complete
      ACTION        |  GOTO
V |  c  |  d  |  $  |  S  |  C
0 | S3  | S4  |     |  1  |  2
1 |     |     |  A  |     |
2 | S6  | S7  |     |     |  5
3 | S3  | S4  |     |     |  8
4 | R3  | R3  |     |     |
5 |     |     | R1  |     |
6 | S6  | S7  |     |     |  9
7 |     |     | R3  |     |
8 | R2  | R2  |     |     |
9 |     |     | R2  |     |
If we look at the above example, we can see that some sets of items have the
same core items(LR(0) items), but the look-ahead is different.
For example (I7, I4), (I3, I6), (I8, I9). Let us merge these states.
      ACTION        |  GOTO
V |  c  |  d  |  $  |  S  |  C
0 | S3  | S4  |     |  1  |  2
1 |     |     |  A  |     |
2 | S3  | S4  |     |     |  5
3 | S3  | S4  |     |     |  8
4 | R3  | R3  | R3  |     |
5 |     |     | R1  |     |
8 | R2  | R2  | R2  |     |
This is now a smaller table. If the parsing table after merging has no conflicts (as in the
above example), then the grammar is an LALR(1) grammar.
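The merging step itself can be sketched in a few lines of Python (item sets are encoded as sets of (item, look-ahead) pairs; this encoding is an assumption made for illustration):

```python
def lalr_merge(states):
    """Merge LR(1) states that share the same core (the LR(0) items),
    taking the union of their look-aheads: the LALR(1) construction step.
    states: dict of name -> set of (item, lookahead) pairs."""
    by_core = {}
    for items in states.values():
        core = frozenset(item for item, la in items)   # drop look-aheads
        by_core.setdefault(core, set()).update(items)  # union of look-aheads
    return list(by_core.values())

# I4 and I7 from the example share the core {C -> d.}, so they merge;
# I5 has a different core and stays separate.
states = {
    "I4": {("C -> d.", "c"), ("C -> d.", "d")},
    "I7": {("C -> d.", "$")},
    "I5": {("S -> CC.", "$")},
}
merged = lalr_merge(states)
```

After merging, the combined state reduces by C --> d on c, d, and $, which is exactly the row for state 4 in the simplified table above.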