SE Compiler Chapter 3-Parser
3. Syntax Analysis
Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall learn the basic
concepts used in the construction of a parser.
We have seen that a lexical analyzer can identify tokens with the help of regular expressions and pattern
rules. But a lexical analyzer cannot check the syntax of a given sentence due to the limitations of the
regular expressions. Regular expressions cannot check balancing tokens, such as parentheses. Therefore,
this phase uses context-free grammar (CFG), which is recognized by push-down automata.
This implies that every regular grammar is also context-free, but there exist some problems that are
beyond the scope of regular grammars. CFG is a helpful tool in describing the syntax of programming
languages.
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams. The parser
analyzes the source code (token stream) against the production rules to detect any errors in the code. The
output of this phase is a parse tree.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use error
recovery strategies, which we will learn later in this chapter.
In this section, we will first see the definition of context-free grammar and introduce terminologies used in
parsing technology.
A context-free grammar has four components:
· A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings. The
non-terminals define sets of strings that help define the language generated by the grammar.
· A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which
strings are formed.
· A set of productions (P). The productions of a grammar specify the manner in which the terminals
and non-terminals can be combined to form strings. Each production consists of a non-terminal
called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals,
called the right side of the production.
· One of the non-terminals is designated as the start symbol (S), from where the production begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start
symbol) by the right side of a production, for that non-terminal.
Example
We take the problem of palindrome language, which cannot be described by means of Regular Expression.
That is, L = {w | w = wR} is not a regular language. But it can be described by means of CFG, as illustrated
below:
G = (V, Σ, P, S)
Where:
V = {Q, Z, N}
Σ = {0, 1}
P = {Q → Z, Q → N, Q → ε, Z → 0Q0, N → 1Q1}
S = {Q}
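The grammar above can be turned into a small recursive recognizer. The Python sketch below (the function name is our own) checks membership in L by mirroring the productions: Q → ε accepts the empty string, and Z → 0Q0 / N → 1Q1 strip a matching symbol from both ends.

```python
def in_language(w: str) -> bool:
    """Recognize strings derivable from Q in the grammar
    Q -> Z | N | epsilon, Z -> 0Q0, N -> 1Q1 (even-length binary palindromes)."""
    if w == "":
        return True                      # Q -> epsilon
    if len(w) >= 2 and w[0] == w[-1] and w[0] in "01":
        return in_language(w[1:-1])      # Z -> 0Q0 or N -> 1Q1
    return False

print(in_language("0110"))  # True
print(in_language("01"))    # False
```

Note that this grammar generates only even-length palindromes; a finite automaton cannot do this check, because matching the two halves requires the unbounded memory of a push-down automaton's stack, which the recursion models here.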
Derivation
A derivation is a sequence of production rule applications used to obtain the input string. During parsing,
we take two decisions for some sentential form of input:
· Deciding the non-terminal which is to be replaced.
· Deciding the production rule, by which, the non-terminal will be replaced.
At each point in this derivation process, the string is a collection of terminal or nonterminal symbols. Such
a string is called a sentential form if it occurs in some step of a valid derivation. Any sentential form can
be derived from the start symbol in zero or more steps. Similarly, from any sentential form we can derive a
valid sentence in zero or more steps.
To decide which non-terminal to replace with a production rule, we have two options.
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-most derivation.
The sentential form derived by the left-most derivation is called the left-sentential form.
Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as right-most
derivation. The sentential form derived from the right-most derivation is called the right-sentential form.
Example
Production rules:
E → E + E
E → E * E
E → id
Input string: id + id * id
The left-most derivation is:
E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived from the
start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us see this by an
example from the last topic.
(Figure: the parse tree for id + id * id is built step by step, adding the nodes introduced by each derivation step above.)
In a parse tree:
· All leaf nodes are terminals.
· All interior nodes are non-terminals.
· In-order traversal gives original input string.
A parse tree depicts associativity and precedence of operators. The deepest sub-tree is traversed first,
therefore the operator in that sub-tree gets precedence over the operator which is in the parent nodes.
Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (left or right derivation) for at least
one string.
Example
E → E + E
E → E * E
E → id
For the string id + id * id, this grammar has two parse trees: one in which the + node is the root and one in which the * node is the root. Hence the grammar is ambiguous.
Associativity
If an operand has operators on both sides, the side on which the operator takes this operand is decided by
the associativity of those operators. If the operation is left-associative, then the operand will be taken by the
left operator; or if the operation is right-associative, the right operator will take the operand.
Example
Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If the
expression contains:
id op id op id
it will be evaluated as:
(id op id) op id
Operations like Exponentiation are right associative, i.e., the order of evaluation in the same expression
will be:
id op (id op id)
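The difference is easy to see numerically. A small Python illustration, using subtraction as a left-associative operation and exponentiation as a right-associative one:

```python
from functools import reduce

# Subtraction is left-associative: 10 - 4 - 3 = (10 - 4) - 3 = 3
left = reduce(lambda a, b: a - b, [10, 4, 3])
print(left)   # 3

# Exponentiation is right-associative: 2 ** 3 ** 2 = 2 ** (3 ** 2) = 512
right = 2 ** 3 ** 2
print(right)  # 512
```

Had the associativity been reversed, the results would have been 10 - (4 - 3) = 9 and (2 ** 3) ** 2 = 64.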
Precedence
If two different operators share a common operand, the precedence of the operators decides which one takes the operand first; the operator with higher precedence binds more tightly. For example, since * has higher precedence than +, the expression 2 + 3 * 4 is evaluated as:
2 + (3 * 4)
Left Recursion
A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation contains ‘A’ itself as the
left-most symbol. Left-recursive grammar is considered to be a problematic situation for top-down parsers.
Top-down parsers start parsing from the Start symbol, which in itself is non-terminal. So, when the parser
encounters the same non-terminal in its derivation, it becomes hard for it to judge when to stop parsing the
left non-terminal and it goes into an infinite loop.
Example:
(1) A => Aα | β
(2) S => Aα | β
A => Sd
(1) is an example of immediate left recursion, where A is any non-terminal symbol and α and β represent
strings of terminals and non-terminals.
(2) is an example of indirect-left recursion.
A top-down parser will first parse A, which in-turn will yield a string consisting of A itself and the parser
may go into a loop forever.
START
Arrange the non-terminals in some order A1, A2, A3, …, An
for each i from 1 to n
{
   for each j from 1 to i-1
   {
      replace each production of the form Ai ⟹ Ajγ
      with Ai ⟹ δ1γ | δ2γ | … | δnγ
      where Aj ⟹ δ1 | δ2 | … | δn are the current Aj productions
   }
   eliminate the immediate left recursion among the Ai productions
}
END
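The elimination of immediate left recursion (the last step of the algorithm) can be sketched as follows. The representation of productions as tuples of symbols, the empty tuple for ε, and the A' naming scheme are our own conventions:

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Rewrite A -> A a1 | ... | b1 | ... as
    A -> b1 A' | ...,  A' -> a1 A' | ... | epsilon.
    Bodies are tuples of symbols; the new non-terminal is head + "'"."""
    recursive = [b[1:] for b in bodies if b and b[0] == head]
    others = [b for b in bodies if not b or b[0] != head]
    if not recursive:
        return {head: bodies}           # nothing to do
    new = head + "'"
    return {
        head: [b + (new,) for b in others],
        new: [a + (new,) for a in recursive] + [()],  # () stands for epsilon
    }

# A -> A a | b  becomes  A -> b A',  A' -> a A' | epsilon
print(eliminate_immediate_left_recursion("A", [("A", "a"), ("b",)]))
```

Applied to A → Aα | β this yields A → βA' and A' → αA' | ε, which derives the same strings without left recursion.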
Example
The production set
S => Aα | β
A => Sd
after substituting the S-productions into A => Sd (the inner loop of the algorithm) becomes:
S => Aα | β
A => Aαd | βd
and then, removing the immediate left recursion in A using the first technique, we get:
S => Aα | β
A => βdA’
A’ => αdA’ | ε
Now none of the productions has either direct or indirect left recursion.
Left Factoring
If more than one grammar production rule has a common prefix string, then the top-down parser cannot
make a choice as to which of the productions it should take to parse the string in hand.
Example
If a top-down parser encounters a production like
A ⟹ αβ | α | …
Then it cannot determine which production to follow to parse the string, as both productions start with
the same symbol α. To remove this confusion, we use a technique called left factoring.
Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we make
one production for each common prefix and the rest of the derivation is added by new productions.
A => αA’
A’ => β | ε
Now the parser has only one production per prefix which makes it easier to take decisions.
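One round of left factoring can be sketched in Python. Grouping bodies by their first symbol only is a simplification made here for brevity (full left factoring factors out the longest common prefix), and the A' naming is our own convention:

```python
from itertools import groupby

def left_factor(head, bodies):
    """One round of left factoring on common single-symbol prefixes:
    A -> a b1 | a b2 | c  becomes  A -> a A' | c,  A' -> b1 | b2.
    Bodies are tuples of symbols; () stands for epsilon."""
    result, new_rules = [], {}
    for prefix, group in groupby(sorted(bodies), key=lambda b: b[:1]):
        group = list(group)
        if len(group) == 1 or not prefix:
            result.extend(group)            # no common prefix to factor
        else:
            new = head + "'"
            result.append(prefix + (new,))  # A -> a A'
            new_rules[new] = [b[1:] for b in group]
    new_rules[head] = result
    return new_rules

# A -> a b | a  becomes  A -> a A',  A' -> b | epsilon
print(left_factor("A", [("a", "b"), ("a",)]))
```

After the transformation the parser sees only one production per prefix, so a single symbol of look-ahead suffices to choose.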
Limitations of Syntax Analyzers
Syntax analyzers receive their inputs, in the form of tokens, from lexical analyzers. Lexical analyzers are
responsible for the validity of the tokens supplied to the syntax analyzer. A syntax analyzer has the following
drawbacks:
· It cannot determine if a token is valid,
· It cannot determine if a token is declared before it is being used,
· It cannot determine if a token is initialized before it is being used,
· It cannot determine if an operation performed on a token type is valid or not.
These tasks are accomplished by the semantic analyzer, which we shall study in Semantic Analysis.
Types of Parsing
Syntax analyzers follow production rules defined by means of context-free grammar. The way the
production rules are implemented (derivation) divides parsing into two types: top-down parsing and
bottom-up parsing.
Top-down Parsing
When the parser starts constructing the parse tree from the start symbol and then tries to transform the start
symbol to the input, it is called top-down parsing.
· Recursive descent parsing: It is a common form of top-down parsing. It is called recursive, as it
uses recursive procedures to process the input. Recursive descent parsing suffers from backtracking.
· Backtracking: It means, if one derivation of a production fails, the syntax analyzer restarts the
process using different rules of same production. This technique may process the input string more
than once to determine the right production.
Bottom-up Parsing
As the name suggests, bottom-up parsing starts with the input symbols and tries to construct the parse tree
up to the start symbol.
Example:
Input string: a + b * c
Production rules (reconstructed from the reduction sequence below):
S → E
E → E + T | E * T | T
T → a | b | c
Read the input and check if any production matches with the input:
a + b * c
T + b * c
E + b * c
E + T * c
E * c
E * T
E
S
Top-Down parsing
We have learnt that the top-down parsing technique parses the input and starts constructing a parse tree
from the root node, gradually moving down to the leaf nodes. Top-down parsers are broadly classified into
parsers that use backtracking and non-backtracking (predictive) parsers.
This parsing technique is regarded as recursive, as it uses context-free grammar, which is recursive in nature.
Back-tracking
Top- down parsers start from the root node (start symbol) and match the input string against the production
rules to replace them (if matched). To understand this, take the following example of CFG:
S → rXd
X → oa | ea
For an input string: read, a top-down parser will behave like this:
It will start with S from the production rules and will match its yield to the left-most letter of the input, i.e.
‘r’. The very production of S (S → rXd) matches with it. So the top-down parser advances to the next input
letter (i.e. ‘e’). The parser tries to expand non-terminal ‘X’ and checks its production from the left (X →
oa). It does not match with the next input symbol. So the top-down parser backtracks to obtain the next
production rule of X, (X → ea).
Now the parser matches all the input letters in an ordered manner. The string is accepted.
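The backtracking behaviour described above can be sketched for this grammar (S → rXd, X → oa | ea, as referenced in the text). Here backtracking amounts to trying the alternatives of X in order and falling back to the next one on failure; the function names are our own:

```python
def parse_S(s, i):
    """Try S -> rXd starting at position i; return the end position or None."""
    if i < len(s) and s[i] == "r":
        j = parse_X(s, i + 1)
        if j is not None and j < len(s) and s[j] == "d":
            return j + 1
    return None

def parse_X(s, i):
    """Try X -> oa first, and backtrack to X -> ea on failure."""
    for body in ("oa", "ea"):            # alternatives tried left to right
        if s[i:i + len(body)] == body:
            return i + len(body)         # this alternative matched
    return None                          # all alternatives failed

print(parse_S("read", 0) == len("read"))  # True: "read" is accepted
```

For "read", the first alternative X → oa fails at position 1, so the parser backtracks and succeeds with X → ea, exactly as in the walkthrough above.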
Predictive Parser
Predictive parser is a recursive descent parser, which has the capability to predict which production is to be
used to replace the input string. The predictive parser does not suffer from backtracking.
To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points to the next input
symbols. To make the parser back-tracking free, the predictive parser puts some constraints on the
grammar and accepts only a class of grammar known as LL(k) grammar.
A predictive parser uses the following components:
· Input buffer: our string to be parsed. We will assume that its end is marked with a special symbol $.
· Output: a production rule representing a step of derivation sequence (left-most derivation) of the string
in the input buffer.
· Stack: contains the grammar symbols
o At the bottom of the stack, there is a special end mark symbol $.
o Initially the stack contains only the symbol $ and the starting symbol S ($S ← initial stack)
o When the stack is emptied (i.e. only $ left in the stack), the parsing is completed.
· Parsing Table:
o A two-dimensional array M[A, a]
o Each row is a non-terminal symbol.
o Each column is a terminal symbol or the special symbol $.
o Each entry holds a production rule.
In recursive descent parsing, the parser may have more than one production to choose from for a single
instance of input; whereas in predictive parser, each step has at most one production to choose. There
might be instances where there is no production matching the input string, causing the parsing procedure
to fail.
LL Parser
An LL Parser accepts LL grammar. LL grammar is a subset of context-free grammar but with some
restrictions to get the simplified version, in order to achieve easy implementation. LL grammar can be
implemented by means of both algorithms, namely, recursive-descent or table-driven.
An LL parser is denoted as LL(k). The first L in LL(k) stands for parsing the input from left to right, the
second L stands for left-most derivation, and k itself represents the number of look-ahead symbols.
Generally k = 1, so LL(k) may also be written as LL(1).
FIRST(α) is the set of terminals that can begin strings derived from α.
FOLLOW(X) is the set of terminals that can immediately follow X in some sentential form. That is,
t ∈ FOLLOW(X) if there is any derivation containing Xt. This can also occur if the derivation contains
XYZt, where Y and Z both derive ε.
Parser Actions: The symbol at the top of the stack (say X) and the current symbol in the input string (say
a) determine the parser action.
• There are four possible parser actions:
1. If X and a are $ → parser halts (Successful completion)
2. If X and a are the same terminal symbol (different from $) → parser pops X from the stack and
moves to the next symbol in the input buffer.
3. If X is a non-terminal → parser looks at the parsing table entry M[X, a]. If M[X, a] holds a
production rule X → Y1Y2…Yk, it pops X from the stack and pushes Yk, Yk-1, …, Y1 onto the stack.
The parser also outputs the production rule X → Y1Y2…Yk to represent a step of the derivation.
4. None of the above → error
• All empty entries in the parsing table are errors.
• If X is a terminal symbol different from a, this is also an error case.
Example:
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
        FIRST           FOLLOW
E       { (, id }       { $, ) }
E'      { +, ε }        { $, ) }
T       { (, id }       { +, $, ) }
T'      { *, ε }        { +, $, ) }
F       { (, id }       { +, *, $, ) }
Parsing Table (built from the FIRST and FOLLOW sets above):
        id          +            *            (           )         $
E       E → TE'                               E → TE'
E'                  E' → +TE'                             E' → ε    E' → ε
T       T → FT'                               T → FT'
T'                  T' → ε       T' → *FT'                T' → ε    T' → ε
F       F → id                                F → (E)

LL(1) grammar:
Each filled entry of the parsing table is a single production; no location has more than one entry. This
type of grammar is called an LL(1) grammar.
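A table-driven LL(1) driver for this grammar can be sketched as follows. The parsing table is hand-coded here from the FIRST/FOLLOW sets; empty entries are simply absent from the dictionary, and looking one up reports an error, matching parser action 4 above:

```python
# LL(1) table for: E -> TE', E' -> +TE' | e, T -> FT', T' -> *FT' | e, F -> (E) | id
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"], ("T'", "+"): [], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens):
    """Return the list of productions used, or None on a syntax error."""
    tokens = tokens + ["$"]
    stack, pos, output = ["$", "E"], 0, []
    while stack:
        top = stack.pop()
        if top == tokens[pos]:                  # match a terminal (or the final $)
            pos += 1
        elif top in NONTERMINALS and (top, tokens[pos]) in TABLE:
            body = TABLE[(top, tokens[pos])]
            output.append((top, body))          # one step of the left-most derivation
            stack.extend(reversed(body))        # push RHS, leftmost symbol on top
        else:
            return None                         # empty table entry: error
    return output

print(ll1_parse(["id", "+", "id", "*", "id"]) is not None)  # True
print(ll1_parse(["id", "+"]))                               # None
```

The sequence of productions appended to `output` is exactly the left-most derivation of the input, as described for the Output component of the predictive parser.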
To construct a parsing table, we need FIRST() and FOLLOW() for all the non-terminals. Consider, for
example, the grammar:
S → iEtSS’ | a
S’ → eS | ε
E → b
FIRST(S) = { i, a }
FIRST(S’) = { e, ε }
FIRST(E) = { b }
FOLLOW(S) = { $, e }
FOLLOW(S’) = { $, e }
FOLLOW(E) = { t }
Parsing Table:
Here the entry M[S’, e] receives two productions: S’ → eS (since e ∈ FIRST(eS)) and S’ → ε (since
e ∈ FOLLOW(S’)). Since there is more than one production in a single entry, the grammar is not an
LL(1) grammar.
Bottom-Up Parsing
Bottom-up parsing starts from the leaf nodes of a tree and works in the upward direction till it reaches the
root node. Here, we start from a sentence and then apply production rules in reverse manner in order to
reach the start symbol. The bottom-up parsers available include shift-reduce parsers, operator-precedence
parsers, and LR parsers (SLR, LALR, and CLR).
Shift-Reduce Parsing
Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known as the shift-step
and the reduce-step.
· Shift step: The shift step refers to the advancement of the input pointer to the next input symbol,
which is called the shifted symbol. This symbol is pushed onto the stack. The shifted symbol is
treated as a single node of the parse tree.
· Reduce step: When the parser finds a complete grammar rule (RHS) on top of the stack and replaces
it with the corresponding LHS, it is known as the reduce-step. This occurs when the top of the stack
contains a handle. To reduce, a POP function is performed on the stack, which pops off the handle,
and the LHS non-terminal symbol is pushed in its place.
Shift-reduce parsing is a type of bottom-up parsing that attempts to construct a parse tree for an input string
beginning at the leaves (the bottom) and working up towards the root (the top).
Example:
Consider the grammar:
S → aABe
A → Abc | b
B→d
The sentence to be recognized is abbcde.
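The handle-pruning reductions for this sentence can be checked mechanically. In the sketch below, the handle positions are hand-picked to follow the rightmost derivation in reverse; the loop only verifies that each step replaces an actual occurrence of a production's RHS:

```python
# Verify the handle-pruning sequence for "abbcde" with
#   S -> aABe,  A -> Abc | b,  B -> d
steps = [
    # (current sentential form, handle, LHS, position of handle)
    ("abbcde", "b",    "A", 1),   # reduce the first b (the handle) to A
    ("aAbcde", "Abc",  "A", 1),   # reduce Abc to A
    ("aAde",   "d",    "B", 2),   # reduce d to B
    ("aABe",   "aABe", "S", 0),   # reduce aABe to the start symbol
]
form = "abbcde"
for current, handle, lhs, pos in steps:
    assert form == current and form[pos:pos + len(handle)] == handle
    form = form[:pos] + lhs + form[pos + len(handle):]
print(form)  # S
```

Note that in the second step the parser must not reduce the second b to A; the handle is the leftmost substring whose reduction follows the rightmost derivation in reverse, which is what an LR parser's states determine automatically.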
E→E+E
E→E*E
E → (E)
E → id
Handle Pruning: A handle is a substring of a sentential form that matches the right side of a production
and whose reduction represents one step of a right-most derivation in reverse. A shift-reduce parser
performs four possible actions:
· Shift: The next input symbol is shifted onto the top of the stack.
· Reduce: Replace the handle on the top of the stack by the non-terminal.
· Accept: Successful completion of parsing.
· Error: Parser discovers a syntax error, and calls an error recovery routine.
Types of Shift Reduce Parsing: There are two main categories of shift-reduce parsers.
1. Operator-Precedence Parser: simple, but handles only a small class of grammars.
2. LR Parsers: cover a wide range of grammars.
a. SLR – Simple LR parser
b. CLR – Canonical LR parser (the most general)
c. LALR – Look-Ahead LR parser (intermediate)
SLR, CLR and LALR work the same way; only their parsing tables are different.
(An operator grammar has no ε-productions and no two adjacent non-terminals on any right side.)
Examples:
E → AB                  E → EOE                 E → E + E
A → a                   E → id                  E → E * E
B → b                   O → + | * | /           E → E / E | id
Not operator grammar    Not operator grammar    Operator grammar
Precedence Relations
These parsers rely on the following three precedence relations between terminals a and b:
a <• b : a yields precedence to b
a =• b : a has the same precedence as b
a •> b : a takes precedence over b
The determination of correct precedence relations between terminals is based on the traditional notions of
associativity and precedence of operators (unary minus causes a problem). The intention of the precedence
relations is to delimit the handle of a right-sentential form, with
<• marking the left end,
=• appearing in the interior of the handle, and
•> marking the right end.
1. If operator θ1 has higher precedence than operator θ2, then make θ1 •> θ2 and θ2 <• θ1.
2. If operators θ1 and θ2 are of equal precedence (they may in fact be the same operator), then make
θ1 •> θ2 and θ2 •> θ1 if the operators are left-associative, or make θ1 <• θ2 and θ2 <• θ1 if they
are right-associative.
For example, if + and - are left-associative, then make + •> +, + •> -, - •> -, - •> +.
3. For all operators θ: θ <• id, id •> θ, θ <• (, ( <• θ, ) •> θ, θ •> ), θ •> $, and $ <• θ.
4. Also, let:
( =• )      $ <• (      $ <• id
( <• (      id •> $     ) •> $
( <• id     id •> )     ) •> )
NB: id has higher precedence than any other symbol, and $ has the lowest precedence.
The input string is w$, the initial stack is $ and a table holds precedence relations between certain
terminals.
Repeat forever
if ($ is on top of the stack and p points to $) then return
else
{
let a be the topmost terminal symbol on the stack and let b be the symbol pointed to by p;
if (a <• b or a =• b) then
{ /* SHIFT */
push b onto the stack;
advance p to the next input symbol;
}
else if (a •> b ) then /* REDUCE*/
repeat pop stack until ( the top of stack terminal is related by <• to the terminal most
recently popped);
else
error();
}
        id      +       *       $
id              •>      •>      •>
+       <•      •>      <•      •>
*       <•      •>      •>      •>
$       <•      <•      <•
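The driver algorithm above, together with this table, can be sketched in Python. Like the pseudocode, this simplified recognizer keeps only terminals on the stack, so it checks the precedence relations rather than full grammatical structure; the encoding of <•, =•, •> as "<", "=", ">" is our own:

```python
# Operator-precedence relations for the terminals id, +, *, $
PREC = {
    ("id", "+"): ">", ("id", "*"): ">", ("id", "$"): ">",
    ("+", "id"): "<", ("+", "+"): ">", ("+", "*"): "<", ("+", "$"): ">",
    ("*", "id"): "<", ("*", "+"): ">", ("*", "*"): ">", ("*", "$"): ">",
    ("$", "id"): "<", ("$", "+"): "<", ("$", "*"): "<",
}

def op_precedence_parse(tokens):
    """Return True if tokens (without the trailing $) are accepted, False on error."""
    tokens = tokens + ["$"]
    stack, p = ["$"], 0
    while True:
        a, b = stack[-1], tokens[p]
        if a == "$" and b == "$":
            return True                 # successful completion
        rel = PREC.get((a, b))
        if rel in ("<", "="):           # SHIFT
            stack.append(b)
            p += 1
        elif rel == ">":                # REDUCE: pop until top <• last popped
            last = stack.pop()
            while PREC.get((stack[-1], last)) != "<":
                last = stack.pop()
        else:
            return False                # no relation defined: error

print(op_precedence_parse(["id", "+", "id", "*", "id"]))  # True
print(op_precedence_parse(["id", "id"]))                  # False
```

For id + id * id, the relations force id and then id * id to be reduced before +, reflecting the higher precedence of * encoded in the table.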
LR Parser
The LR parser is a non-recursive, shift-reduce, bottom-up parser. It can handle a wide class of context-free
grammars, which makes it the most efficient syntax analysis technique. LR parsers are also known as LR(k)
parsers, where L stands for left-to-right scanning of the input stream, R stands for the construction of a
right-most derivation in reverse, and k denotes the number of lookahead symbols used to make decisions.
Advantages of LR parsing:
· It recognizes virtually all programming language constructs for which CFG can be written.
· It is an efficient, non-backtracking shift-reduce parsing method.
· The class of grammars that can be parsed using the LR method is a proper superset of the class of
grammars that can be parsed with predictive parsers.
Drawbacks of LR parsing:
· It is too much work to construct an LR parser by hand for a programming language grammar. A
specialized tool, called an LR parser generator, is needed. Examples: YACC/Bison.
There are three widely used algorithms available for constructing an LR parser: SLR(1) (simple LR),
LR(1) (canonical LR), and LALR(1) (look-ahead LR).
It consists of an input, an output, a stack, a driver program, and a parsing table that has two parts (action
and goto).
· The driver program is the same for all LR parsers; only the parsing table changes from one parser
to another.
· The parsing program reads characters from an input buffer one at a time.
· The program uses a stack to store a string of the form s0X1s1X2s2…Xmsm, where sm is on top. Each
Xi is a grammar symbol and each si is a state.
· The parsing table consists of two parts: action and goto functions.
An LR parser makes shift-reduce decisions by maintaining states to keep track of where we are in a parse.
States represent sets of "items." An LR(0) item (item for short) of a grammar G is a production of G with a
dot at some position of the body. Thus, production A → XYZ yields the four items:
A → •XYZ
A → X•YZ
A → XY•Z
A → XYZ•
One collection of sets of LR(0) items, called the canonical LR(0) collection, provides the basis for
constructing a deterministic finite automaton that is used to make parsing decisions. Such an automaton is
called an LR(0) automaton. In particular, each state of the LR(0) automaton represents a set of items in
the canonical LR(0) collection.
Example: G: S → E + S | E
E →num
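LR(0) items and the closure operation can be sketched for this grammar. The tuple representation (head, body, dot position) is our own convention; closure repeatedly adds B → •γ whenever some item has the dot immediately before B:

```python
# LR(0) items as (head, body, dot) for the grammar  S -> E + S | E,  E -> num
GRAMMAR = {"S": [("E", "+", "S"), ("E",)], "E": [("num",)]}

def items(head, body):
    """All items of one production: the dot at every position of the body."""
    return [(head, body, d) for d in range(len(body) + 1)]

def closure(item_set):
    """If A -> a . B b is in the set, add every item B -> . g."""
    result = set(item_set)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(result):
            if dot < len(body) and body[dot] in GRAMMAR:   # dot before a non-terminal
                for b in GRAMMAR[body[dot]]:
                    item = (body[dot], b, 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return result

print(len(items("S", ("E", "+", "S"))))    # 4 items for S -> E + S
print(len(closure({("S'", ("S",), 0)})))   # 4: the start item plus 3 added by closure
```

The closure of the augmented start item {S' → •S} is the initial state I0 of the LR(0) automaton; the remaining states are obtained by computing GOTO on each grammar symbol and taking closures again.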
1. Action: Look up the Action-table entry for the state on top of the stack and the current input symbol;
it tells the parser whether to shift, reduce, accept, or report an error.
2. Goto State: After a reduce, the top of the stack contains a state and a symbol (terminal or non-
terminal). In the Goto table, look up the row for the state and the column for the symbol.
a. If there is a number there, put this number on the top of the stack (it is the new state number).
b. If there is no Goto entry, then return ERROR (the input cannot be parsed by the grammar).
0. S’ → S
1. S→(L)
2. S→x
3. L→S
4. L→L,S
Parsing input: ( x , x )
LL Vs LR
LL                                                  LR
Does a leftmost derivation.                         Does a rightmost derivation in reverse.
Starts with the root nonterminal on the stack.      Ends with the root nonterminal on the stack.
Ends when the stack is empty.                       Starts with an empty stack.
Uses the stack for designating what is still        Uses the stack for designating what is
to be expected.                                     already seen.
Builds the parse tree top-down.                     Builds the parse tree bottom-up.
Continuously pops a nonterminal off the stack,      Tries to recognize a right hand side on the
and pushes the corresponding right hand side.       stack, pops it, and pushes the corresponding
                                                    nonterminal.
Expands the non-terminals.                          Reduces the non-terminals.
Reads the terminals when it pops one off the        Reads the terminals while it pushes them on
stack.                                              the stack.
Does a pre-order traversal of the parse tree.       Does a post-order traversal of the parse tree.