PLDI Week 06 Parsing
PLDI Week 06 Parsing
PLDI Week 06 Parsing
Ilya Sergey
ilya@nus.edu.sg
ilyasergey.net/CS4212/
Compilation in a Nutshell
Source Code
(Character stream)
if (b == 0) { a = 1; }
Lexical Analysis
Token stream:
if ( b == 0 ) { a = 0 ; }
Parsing
Abstract Syntax Tree:
If Intermediate code:
l1:
Analysis &
Transformation
%cnd = icmp eq i64 %b, 0
Eq Assn None br i1 %cnd, label %l2, label %l3
l2:
store i64* %a, 1
br label %l3
l3:
b 0 a 1
Backend
Assembly Code
l1:
cmpq %eax, $0
jeq l2
jmp l3
l2:
…
Last Week: Lexing
Source Code
(Character stream)
if (b == 0) { a = 1; }
Lexical Analysis
Token stream:
if ( b == 0 ) { a = 0 ; }
Parsing
Abstract Syntax Tree:
If Intermediate code:
l1:
Analysis &
Transformation
%cnd = icmp eq i64 %b, 0
Eq Assn None br i1 %cnd, label %l2, label %l3
l2:
store i64* %a, 1
br label %l3
l3:
b 0 a 1
Backend
Assembly Code
l1:
cmpq %eax, $0
jeq l2
jmp l3
l2:
…
Regular Expressions
• Regular expressions precisely describe sets of strings.
• Useful extensions:
– “foo” Strings, equivalent to 'f''o''o'
– R+ One or more repetitions of R, equivalent to RR*
– R? Zero or one occurrences of R, equivalent to (ε|R)
– ['a'-'z'] One of a or b or c or … z, equivalent to (a|b|…|z)
– [^'0'-'9'] Any character except 0 through 9
– R as x Name the string matched by R as x
Example Regular Expressions
• Regular expressions alone are ambiguous, need a rule to choose between the options above
• Most languages choose “longest match”
– So the 2nd option above will be picked
– Note that only the first option is “correct” for parsing purposes
• Conflicts: arise due to two tokens whose regular expressions have a shared prefix
– Ties broken by giving some matches higher priority
– Example: keywords have priority over identifiers
– Usually specified by order the rules appear in the lex input file
Lexer Generators
• Reads a list of regular expressions: R1,…,Rn , one per token.
• Each token has an attached “action” Ai
(just a piece of code to run when the regular expression is matched)
• Other approaches:
– Brzozowski derivatives
– Idea: directly manipulate the (abstract syntax of) the regular expression
– Compute partial “derivatives”
• Regular expression that is “left-over” after seeing the next character
– Elegant, purely functional, implementation (very cool!)
– See “Regular-expression derivatives re-examined” (2009) by Owens et al.
Finite Automata
• Consider the regular expression: ‘”’[^’”’]*’”’
• An automaton (DFA) can be represented as:
Non-"
– A graph:
" "
0 1 2
RE to Finite Automaton?
'a' a
What about?
ε R1|R2
R1 R2
??
R 1R 2
Nondeterministic Finite Automata
b
a
ε
a
ε b
a
RE to NFA?
a
‘a’
R1 R2
ε
R 1R 2
RE to NFA (cont’d)
• Alternatives and Kleene star are easy with NFAs
R1
ε ε
R1|R2
R2
ε ε
R
R*
ε ε
ε
DFA versus NFA
• DFA:
– Action of the automaton for each input is fully determined
– Automaton accepts if the input is consumed upon reaching an accepting state
– Obvious table-based implementation
• NFA:
– Automaton potentially has a choice at every step
– Automaton accepts an input string if there exists a way to reach an accepting state
– Less obvious how to implement efficiently
NFA to DFA conversion (Intuition)
• Idea: Run all possible executions of the NFA “in parallel”
• Keep track of a set of possible states: “finite fingers”
• Consider: -?[0-9]+
[0-9]
-
[0-9] ε
• NFA representation: 0 1 2 3
{1} [0-9]
-
• DFA representation: {0,1} {2,3} [0-9]
[0-9]
Summary of Lexer Generator Behavior
• Take each regular expression Ri and it’s action Ai
• Compute the NFA formed by (R1 | R2 | … | Rn)
– Remember the actions associated with the accepting states of the Ri
• Compute the DFA for this big NFA
– There may be multiple accept states (why?)
– A single accept state may correspond to one or more actions (why?)
• Compute the minimal equivalent DFA
– There is a standard algorithm due to Myhill & Nerode
• Produce the transition table
• Implement longest match:
– Start from initial state
– Follow transitions, remember last accept state entered (if any)
– Accept input until no transition is possible (i.e. next state is “ERROR”)
– Perform the highest-priority action associated with the last accept state;
if no accept state there is a lexing error
Lexer Generators in Practice
• Many existing implementations: lex, Flex, Jlex, ocamllex, …
– For example ocamllex program
• see lexlex.mll, olex.mll, piglatin.mll
• Error reporting:
– Associate line number/character position with tokens
– Use a rule to recognise ‘\n’ and increment the line number
– The lexer generator itself usually provides character position info.
• Lexer generators are usually designed to work closely with parser generators…
5 min break
Compilation in a Nutshell
Source Code
(Character stream)
if (b == 0) { a = 1; }
Lexical Analysis
Token stream:
if ( b == 0 ) { a = 0 ; }
Parsing
Abstract Syntax Tree:
If Intermediate code:
l1:
Analysis &
Transformation
%cnd = icmp eq i64 %b, 0
Eq Assn None br i1 %cnd, label %l2, label %l3
l2:
store i64* %a, 1
br label %l3
l3:
b 0 a 1
Backend
Assembly Code
l1:
cmpq %eax, $0
jeq l2
jmp l3
l2:
…
This week: Parsing
Source Code
if (b == 0) { a = 1; }
Lexical Analysis
Token stream:
if ( b == 0 ) { a = 0 ; }
Parsing
Abstract Syntax Tree:
If Intermediate code:
l1:
Analysis &
Transformation
%cnd = icmp eq i64 %b, 0
Eq Assn None br i1 %cnd, label %l2, label %l3
l2:
store i64* %a, 1
br label %l3
l3:
b 0 a 1
Backend
Assembly Code
l1:
cmpq %eax, $0
jeq l2
jmp l3
l2:
…
Parsing: Finding Syntactic Structure
Block
If While
{
if (b == 0) a = b;
while (a != 1) {
Bop … … Bop Block
print_int(a);
a = a – 1;
} Expr …
} b == 0 a != 1
Source input
Call
Abstract Syntax tree
…
Syntactic Analysis (Parsing): Overview
• Input: stream of tokens (generated by lexer)
• Output: abstract syntax tree
• Strategy:
– Parse the token stream to traverse the “concrete” syntax
– During traversal, build a tree representing the “abstract” syntax
a + b Bop
(a + ((b))) Same abstract syntax tree
((a) + (b)) a + b
• Idea: “derive” a string in the language by starting with S and rewriting according to the rules:
– Example: S ⟼ (S)S ⟼ ((S)S)S ⟼ ((ε)S)S ⟼ ((ε)S)ε ⟼ ((ε)ε)ε = (())
S ⟼ (S)S
S⟼ε
S ⟼ E+S | E
E ⟼ number | (S)
e.g.: (1 + 2 + (3 + 4)) + 5
S⟼E+S 4 productions
S⟼E 2 nonterminals: S, E
E ⟼ number 4 terminals: (, ), +, number
E ⟼ (S) Start symbol: S
Derivations in CFGs
• Example: derive (1 + 2 + (3 + 4)) + 5
S ⟼ E+S | E
• S⟼E+S
⟼ (S) + S E ⟼ number | (S)
⟼ (E + S) + S
⟼ (1 + S) + S For arbitrary strings α, β, γ and
⟼ (1 + E + S) + S production rule A ⟼ β
⟼ (1 + 2 + S) + S a single step of the derivation is:
⟼ (1 + 2 + E) + S
αAγ ⟼ αβγ
⟼ (1 + 2 + (S)) + S
⟼ (1 + 2 + (E + S)) + S ( substitute β for an occurrence of A)
⟼ (1 + 2 + (3 + S)) + S
⟼ (1 + 2 + (3 + E)) + S
In general, there are many possible derivations for a given string
⟼ (1 + 2 + (3 + 4)) + S
⟼ (1 + 2 + (3 + 4)) + E Note: Underline indicates symbol being expanded.
⟼ (1 + 2 + (3 + 4)) + 5
From Derivations to Parse Trees
S
E + S
• Tree representation of the derivation
• Leaves of the tree are terminals ( S ) E
derivation steps ( S )
E + S
• (1 + 2 + (3 + 4)) + 5
S⟼E+S | E 3 E
E ⟼ number | ( S )
4
From Parse Trees to Abstract Syntax
S
• Parse tree: • Abstract syntax tree (AST):
“concrete syntax” E + S
+
( S ) E
+ 5
E + S 5
1 E + S 1 +
2 E 2 +
( S )
3 4
E + S
3 E • Hides, or abstracts,
unneeded information.
4
Derivation Orders
• Note that for this grammar both strategies (and any other)
yield the same parse tree!
– Parse tree doesn’t contain the information about what order the productions were applied.
Example: Left- and rightmost derivations
• Leftmost derivation: Rightmost derivation:
• S⟼E+S S⟼E+S (1 + 2 + (3 + 4)) + 5
⟼ (S) + S ⟼E+E
⟼ (E + S) + S ⟼E+5
S⟼E+S | E
⟼ (1 + S) + S ⟼ (S) + 5
E ⟼ number | ( S )
⟼ (1 + E + S) + S ⟼ (E + S) + 5
⟼ (1 + 2 + S) + S ⟼ (E + E + S) + 5
⟼ (1 + 2 + E) + S ⟼ (E + E + E) + 5
⟼ (1 + 2 + (S)) + S ⟼ (E + E + (S)) + 5
⟼ (1 + 2 + (E + S)) + S ⟼ (E + E + (E + S)) + 5
⟼ (1 + 2 + (3 + S)) + S ⟼ (E + E + (E + E)) + 5
⟼ (1 + 2 + (3 + E)) + S ⟼ (E + E + (E + 4)) + 5
⟼ (1 + 2 + (3 + 4)) + S ⟼ (E + E + (3 + 4)) + 5
⟼ (1 + 2 + (3 + 4)) + E ⟼ (E + 2 + (3 + 4)) + 5
⟼ (1 + 2 + (3 + 4)) + 5 ⟼ (1 + 2 + (3 + 4)) + 5
Loops and Termination
• Some care is needed when defining CFGs to avoid loops
• Consider: S⟼ E
E⟼ S
• Consider: S⟼ (S)
• This grammar is productive, but again there is no finite derivation starting from S, so the language is
empty. One can easily generalise these examples to a “chain” of many nonterminals, which can be
harder to find in a large grammar
S
Leftmost derivation: Rightmost derivation:
S⟼E+S S⟼E+S E + S +
⟼1+S ⟼E+E+S
⟼E+E+E 1 E + S
⟼1+E+S 1 + AST
⟼E+E+3
⟼1+2+S ⟼E+2+3 2 E
⟼1+2+E ⟼1+2+3 2 3
⟼1+2+3 3
Parse Tree
Associativity
S⟼E+S | E
E ⟼ number | ( S )
S ⟼ S + S | ( S ) | number
+ +
• One derivation gives left
associativity, the other gives
right associativity to ‘+’
+ 3 1 +
– Which is which?
1 2 2 3
AST 1 AST 2
Precedence
• The ‘+’ operation is associative, so it doesn’t matter which tree we pick.
Mathematically, x + (y + z) = (x + y) + z
– But, some operations aren’t associative. Examples?
– Some operations are only left (or right) associative. Examples?
• Input: 1 + 2 * 3 +
– One parse = (1 + 2) * 3 = 9
*
– The other = 1 + (2 * 3) = 7 + 3 vs. 1 *
1 2 2 3
Eliminating Ambiguity
• We can often eliminate ambiguity by adding nonterminals and allowing recursion
only on the left (or right) .
• Higher-precedence operators go farther from the start symbol.
• Example:
S ⟼ S + S | S * S | ( S ) | number
• To disambiguate:
– Decide (following math) to make ‘*’ higher precedence than ‘+’
– Make ‘+’ left associative
S0 ⟼ S0 + S1 | S1
– Make ‘*’ right associative
• Note: S1 ⟼ S2 * S1 | S2
– S2 corresponds to ‘atomic’ S2 ⟼ number | ( S0 )
expressions
Context Free Grammars: Summary
• Even with an unambiguous CFG, there may be more than one derivation
– Though in this case all derivations correspond to the same abstract syntax tree.
• Still to come: how to find a derivation that matches the string of tokens?
– But first, let’s see some tools: menhir
Demo: Parsing for Boolean Logic
• https://github.com/cs4212/week-06-parsing
• Definitions:
- ast.ml
- parser.mly
- lexer.mll
- range.ml
• What about precedence of binary connectives? Associativity?
• Running: main.ml
LL & LR Parsing
• Given the only one look-ahead symbol: ‘(‘ it isn’t clear whether to pick
S⟼E or S ⟼ E + S first.
LL(1) Grammars
Grammar is the problem
• Not all grammars can be parsed “top-down” with only a single lookahead symbol.
• Top-down: starting from the start symbol (root of the parse tree) and going down
• LL(1) means
– Left-to-right scanning
– Left-most derivation,
– 1 lookahead symbol
S⟼E+S | E
• This language isn’t “LL(1)” E ⟼ number | ( S )
• Construct the set of all input tokens that may appear first in strings
that can be derived from γ
– Add the production γ to the entry (A, token) for each such token.
• Note: if there are two different productions for a given entry, the
grammar is not LL(1)
Example T ⟼ S$
S ⟼ ES’
• First(T) = First(S) S’ ⟼ ε
• First(S) = First(E) S’ ⟼ + S
• First(S’) = { + } E ⟼ number | ( S )
• First(E) = { number, ‘(‘ }
Note: we want the least
solution to this system of
• Follow(S’) = Follow(S) set equations… a fixpoint
computation. More on
• Follow(S) = { $, ‘)’ } ∪ Follow(S’) these later in the course.
number + ( ) $ (EOF)
T ⟼ S$ ⟼S$
S ⟼ E S’ ⟼E S’
S’ ⟼+S ⟼ε ⟼ε
E ⟼ num. ⟼(S)
Converting the table to code
• Define n mutually recursive functions
– one for each nonterminal A: parse_A
– Assuming the stream of tokens is globally available, the type of parse_A is unit -> ast,
if A is not an auxiliary nonterminal
– Parse functions for auxiliary nonterminals (e.g. S’) take extra ast’s as inputs, one for each
nonterminal in the “factored” prefix.
• Each function “peeks” at the lookahead token and then follows the production rule in the
corresponding entry.
– Consume terminal tokens from the input stream
– Call parse_X to create sub-tree for nonterminal X
– If the rule ends in an auxiliary nonterminal, call it with appropriate ast’s.
(The auxiliary rule is responsible for creating the ast after looking at more input.)
– Otherwise, this function builds the ast tree itself and returns it.
Demo: LL(1) Parsing
• https://github.com/cs4212/week-06-parsing
• ll1_parser.ml
• Hand-generated LL(1) code for the table below.
number + ( ) $ (EOF)
T ⟼ S$ ⟼S$
S ⟼ E S’ ⟼E S’
S’ ⟼+S ⟼ε ⟼ε
E ⟼ num. ⟼(S)
LL(1) Summary
• Problems:
– Grammar must be LL(1)
– Can extend to LL(k) (it just makes the table bigger)
– Grammar cannot be left recursive (parser functions will loop!)
– There are CF grammars that cannot be transformed to LL(k)
• LR parsing
• Parsing in OCaml via Menhir
• Introducing HW4