Lexical Analyzer Parser
Lexical Analyzer Parser
Lexical Analyzer Parser
• Union
– L1 L2 = { s | s L1 or s L2 }
• Exponentiation:
– L0 = {} L1 = L L2 = LL
• Kleene Closure
– L* = L i
i 0
• Positive Closure
– L+ = i
L
i 1
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
• L1 L2 = {a,b,c,d,1,2}
• (r)+ = (r)(r)*
• (r)? = (r) |
CS416 Compiler Design 7
Regular Expressions (cont.)
• We may remove parentheses by using precedence rules.
– * highest
– concatenation next
– | lowest
• ab*|c means (a(b)*)|(c)
• Ex:
= {0,1}
– 0|1 => {0,1}
– (0|1)(0|1) => {00,01,10,11}
– 0* => { ,0,00,000,0000,....}
– (0|1)* => all strings with 0 and 1, including the empty string
a
b a
The language recognized by
a b
0 1 2
this DFA is also (a|b) * a b
b
N(r1)
i f NFA for r1 | r2
N(r2)
NFA for r1 r2
i N(r) f
NFA for r*
(a | b)
b b
b:
a
(a|b) *
b
a
(a|b) * a
a
b
S1
S0 b a
S2
Syntax tree of (a|b) * a #
#
4
* a
3 • each symbol is numbered (positions)
| • each symbol is at a leave
a b
2
1 • inner nodes are operators
For example, ( a | b) * a #
1 2 3 4
• If firstpos and lastpos have been computed for each node, followpos
of each position can be computed by making one depth-first traversal
of the syntax tree.
S1=firstpos(root)={1,2,3}
mark S1
a: followpos(1) followpos(3)={1,2,3,4}=S2 move(S1,a)=S2
b: followpos(2)={1,2,3}=S1 move(S1,b)=S1
mark S2
a: followpos(1) followpos(3)={1,2,3,4}=S2 move(S2,a)=S2
b: followpos(2)={1,2,3}=S1 move(S2,b)=S1
b a
a
S1 S2
start state: S1
accepting states: {S2} b
S1=firstpos(root)={1,2}
mark S1
a: followpos(1)={2}=S2 move(S1,a)=S2
b: followpos(2)={3,4}=S3 move(S1,b)=S3
mark S2
b: followpos(2)={3,4}=S3 move(S2,b)=S3
S2
mark S3 a
b
c: followpos(3)={3,4}=S3 move(S3,c)=S3 S1
b
S3 c
start state: S1
CS416 Compiler Design 33
Minimizing Number of States of a DFA
• partition the set of states into two groups:
– G1 : set of accepting states
– G2 : set of non-accepting states
b a
{1,3} a {2}
1 4
Groups: {1,2,3} {4}
b
b a
{1,2} {3} a b
3 b no more partitioning 1->2 1->3
2->2 2->3
b 3->4 3->3
{3}
a b
{1,2} a b
a {4}
• What is the end of a token? Is there any character which marks the end
of a token?
– It is normally not defined.
– If the number of characters in a token is fixed, in that case no problem: + -
– But < < or <> (in Pascal)
– The end of an identifier : the characters cannot be in an identifier can mark the end of
token.
– We may need a lookhead
• In Prolog: p :- X is 1. p :- X is 1.5.
The dot followed by a white space character can mark the end of a number.
But if that is not the case, the dot must be treated as a part of the number.