Lecture 3 Lexical Analyzer

Lexical Analysis
Lecture 03
Example or Regular Expression / Regular Definition:
Regular Expression for numbers

digit → 0|1|…|9
digits → digit digit*
optional_fraction → .digits|ε
optional_exponent → ( E (+|-| ε) digits ) | ε
num → digits optional_fraction optional_exponent
Using shorthands:
digit → 0|1|…|9
digits → digit+
optional_fraction → (.digits)?
optional_exponent → ( E (+|-| ε) digits ) ?
num → digits optional_fraction optional_exponent
Transition Diagram
• A stylized flowchart produced intermediately in the construction

of lexical analyzer
• Depicts the actions take place in lexical analyzer
• states: positions in a transition diagram

• edges: arrows connecting the states
• start state: initial state of transition diagram
• accepting state: token recognized
• action: (optional) associated with a state that is executed when
the state is entered
start > =
0 1 2
other *
3
Example
• Transition diagram for token relop

start < =
0 1 2 Return (relop, LE)
>
3 Return (relop, NE)
other
4
*
Return (relop, LT)
=
5 Return (relop, EQ)
> =
6 7 Return (relop, GE)
other *
8 Return (relop, GT)
Example Letter or digit
start letter other *

9 10 11
Return (gettoken(), install_id())
digit digit digit
start digit . digit E + or - digit other *

12 13 14 15 16 17 18 11
E
digit
digit digit
start digit . digit other *

20 21 22 23 24
digit
start digit other *

25 26 27
Capturing Multiple Tokens
Capturing keyword “begin”
start b e g i n WS
WS – white space
Capturing variable names
A – alphabetic
start A WS AN – alphanumeric
AN
What if both need to happen at the same time?

Capturing Multiple Tokens
start b e g i n WS
A-b
AN WS – white space
A – alphabetic
AN WS AN – alphanumeric
Machine is much more complicated – just for these two tokens!

Implementing a Transition Diagram
• Systematic approach for all transition diagrams

– Program size ∝ number of states and edges
• We try each diagram and when we fail then we go to
try the next diagram
• Transition diagram for WS should be placed at the

beginning rather at the end
– Generalize: frequently occurring tokens should come
earlier
Finite State Automata (FSAs)
• AKA “Finite State Machines”, “Finite Automata”,

“FA”
• One start state

• Many final states
• Each state is labeled with a state name
• Directed edges, labeled with symbols
• Two types
– Deterministic (DFA)
– Non-deterministic (NFA)
Nondeterministic Finite Automata
A nondeterministic finite automaton (NFA) is a mathematical

model that consists of
1. A set of states S
• S= {s0, s1, …., sN}
2. A set of input symbols Σ
• Σ = {a, b, …}
3. A transition function that maps state/symbol pairs to a set of
states:
S x {Σ + ε} Æ set of S
4. A special state s0 called the start state
5. A set of states F (subset of S) of final states
INPUT: string
OUTPUT: yes or no
Example: NFA
Transition Table:
ε
start
0 a 1 b 2 b 3
STATE
a b ε
a,b 0 0,1 0 3
S = { 0,1,2,3} 1 2
S0 = 0 2 3
Σ = {a,b}
F = {3} 3
Deterministic Finite Automata
A deterministic finite automaton (DFA) is a mathematical

model that consists of
1. A set of states S
• S= {s0, s1, …., sN}
2. A set of input symbols Σ
• Σ = {a, b, …}
3. A transition function that maps state/symbol pairs to a state:
SxΣÆS
4. A special state s0 called the start state
5. A set of states F (subset of S) of final states
INPUT: string
OUTPUT: yes or no
DFA Execution
DFA(int start_state) {
state current = start_state;
input_element = next_token();
while (input to be processed) {
current = transition(current,table[input_element])
if current is an error state return No;
input_element = next_token();
}
if current is a final state return Yes;
else return No;
}
Relation between RE, NFA and DFA
1. There is an algorithm for converting any RE into an NFA.

2. There is an algorithm for converting any NFA to a DFA.
3. There is an algorithm for converting any DFA to a RE.
These facts tell us that REs, NFAs and DFAs have

equivalent expressive power.
All three describe the class of regular languages.

DFA vs NFA
• Both DFA and NFA are the recognizers of regular

sets.
• But – time-space trade space exists
• DFAs are faster recognizers
– Can be much bigger too..
Converting Regular Expressions to NFAs
Thompson’s Construction
The regular expressions over finite Σ are the strings

over the alphabet Σ + { ), (, |, *} such that:
• { } (empty set) is a regular expression for the empty set
• Empty string ε is a regular expression denoting { ε }
start i ε f
• a is a regular expression denoting {a } for any a in Σ
start i a f
If P and Q are regular expressions with NFAs Np, Nq:
P | Q (union)
ε Np ε
start
i f
ε Nq ε
PQ (concatenation)
start
i Np Nq f
If Q is a regular expression with NFA Nq:

Q* (closure)
ε
start
i
ε Nq ε f
ε
Example (ab* | a*b)*
Starting with:
ab* start a a*b start b
1 2 3 4
b a
ab* | a*b
a
1 2
ε ε
start b
5 6
ε ε
3 b 4
a
Example (ab* | a*b)*
ab* | a*b
a
1 2
ε ε
start b
5 6
ε b ε
3 4
(ab* | a*b)*
a
ε
1 2 ε
start ε b
ε
7 5 6 8
ε ε
ε b
ε 3 4
a
Terminology: ε-closure
Defn: ε-closure(T) = T + all NFA states reachable

from any state in T using only ε transitions.
b
1 2 b
a ε
b 5 ε-closure({1,2,5}) = {1,2,5}
a
ε-closure({4}) = {1,4}
3
ε
4 ε-closure({3}) = {1,3,4}
ε-closure({3,5}) = {1,3,4,5}
Converting NFAs to DFAs (subset construction)
• Idea: Each state in the new DFA will correspond to

some set of states from the NFA. The DFA will be in
state {s0,s1,…} after input if the NFA could be in any of
these states for the same input.
• Input: NFA N with state set SN, alphabet Σ, start state

sN, final states FN, transition function TN: SN x {Σ U ε} Æ
SN
• Output: DFA D with state set SD, alphabet Σ, start state
sD = ε-closure(sN), final states FD, transition function
TD: SD x Σ Æ SD
Algorithm: Computation of ε-closure
push all states a ∈ T onto stack STK
initialize: ε-closure(T) = T
while STK is not empty do begin
pop t, the top element, off STK
for each stat u with and edge from t to u labeled ε do begin
if u is not in ε-closure(T) do begin
add u to ε-closure(T)
push u onto STK
end if
end for
end while
Algorithm: Subset Construction
sD = ε-closure(sN) -- create start state for DFA
SD = {sD} (unmarked)
while there is some unmarked state R in SD
mark state R
for all a in Σ do
s = ε-closure(ΤΝ(R,a));
if s not already in SD then add it (unmarked)
TD(R,a) = s;
end for
end while
FD = any element of SD that contains a state in FN
Example 1: Subset Construction
NFA N with
NFA • State set SN = {1,2,3,4,5},
• Alphabet Σ = {a,b}
• Start state sN=1,
start ε
1 2 a,b • Final states FN={5},
b • Transition function TN: SN x {Σ ∪ ε} Æ SN
a 5
a,b
3
b
4 a b ε
1 3 - 2
2 5 5, 4 -
3 - 4 -
4 5 5 -
5 - - -
NFA
start 1,2
start ε
1 2 a,b
a b 5
a,b
3 4 a b
b
{1,2}
NFA
start b
1,2 4,5
start ε a
1 2 a,b
a b 5 3,5
a,b a b
3 4
b {1,2} {3,5} {4,5}
{3,5}
{4,5}
NFA
start b
1,2 4,5
start ε a
1 2 a,b
b 3,5 b 4
a 5
a,b a b
3 4
b {1,2} {3,5} {4,5}
{3,5} - {4}
{4,5}
{4}
NFA
start b a,b
1,2 4,5 5
start ε a
1 2 a,b
b 3,5 b 4
a 5
a,b a b
3 4
b {1,2} {3,5} {4,5}
{3,5} - {4}
{4,5} {5} {5}
{4}
{5}
NFA
start b a,b
1,2 4,5 5
start ε a a,b
1 2 a,b
b 3,5 b 4
a 5
a,b a b
3 4
b {1,2} {3,5} {4,5}
{3,5} - {4}
{4,5} {5} {5}
{4} {5} {5}
{5} - -
start b
a,b
1,2 4,5 5
NFA
a
a,b
start ε b
1 2 a,b 3,5 4
a b 5
a,b a b
3 4
b {1,2} {3,5} {4,5}
{3,5} - {4}
{4,5} {5} {5}
All final states since the
NFA final state is included {4} {5} {5}
{5} - -
NFA
start b
1 2 b
a ε b 5
a
3 4
ε
NFA DFA
start b start a
1 2 b 1 1,3,4
ε b b
a 5 b a
a 2 b a
3 4 1,3,
ε b b 4,5
a
1,4,5
NFA DFA
b
start ε b start
1 2 a 3 5 1,2,4
ε a
b b
b 3,4 3,5 3
a 4 5
b b
a
4
a
Converting DFAs to REs
1. Combine serial links by concatenation

2. Combine parallel links by alternation
3. Remove self-loops by Kleene closure
4. Select a node (other than initial or final) for removal. Replace
it with a set of equivalent links whose path expressions
correspond to the in and out links
5. Repeat steps 1-4 until the graph consists of a single link
between the entry and exit nodes.
Example
a
start d b d a d
0 1 2 3 4 5
c
d b b
6 c 7
parallel edges become alternation
start d a|b|c d a d
0 1 2 3 4 5
d b
b|c
6 7
Example
start d a|b|c d a d
0 1 2 3 4 5
d b
b|c
6 7
serial edges become concatenation
start d (a|b|c) d a d
0 3 4 5
b (b|c) d
Example
0 3 4 5
b (b|c) d
Find paths that can be “shortened”
0 3 4 5
b(b|c)da
Example
0 3 4 5
b(b|c)da eliminate self-loops
start d (a|b|c) d a (b(b|c)da)*d

0 3 4 5
serial edges become concatenation
start d (a|b|c) d a (b(b|c)da)*d

0 5
Describing Regular Languages
• Generate all strings in the language

• Generate only strings in the language
Try the following:

– Strings of {a,b} that end with ‘abb’
– Strings of {a,b} where every a is followed by at least
one b
Strings of (a|b)* that end in abb
re: (a|b)*abb
start a b b
0 1 2 3
a,b NFA
b
a
start a b b
0 1 2 3
b a a
DFA
Relationship among RE, NFA, DFA
• The set of strings recognized by an NFA can be described by a

Regular Expression.
• The set of strings described by a Regular Expression can be

recognized by an NFA.
• The set of strings recognized by an DFA can be described by a

Regular Expression.
• The set of strings described by a Regular Expression can be

recognized by an DFA.
• DFAs, NFAs, and Regular Expressions all have the same “power”.
They describe “Regular Sets” (“Regular Languages”)
• The DFA may have a lot more states than the NFA. (May have
exponentially as many states, but...)
Suggestions for writing NFA/DFA/RE
• Typically, one of these formalisms is more natural for

the problem. Start with that and convert if necessary.
• In DFAs, each state typically captures some partial
solution
• Be sure that you include all relevant edges (ask –
does every state have an outgoing transition for all
alphabet symbols?)
Non-Regular Languages
Not all languages are regular”

• The language ww where w=(a|b)*
Non-regular languages cannot be described using REs,

NFAs and DFAs.

Lecture 3 Lexical Analyzer

Uploaded by

Copyright:

Available Formats

Lecture 3 Lexical Analyzer

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lecture 3 Lexical Analyzer

Uploaded by

Copyright:

Available Formats

Lexical Analysis

Regular Expression for numbers

• A stylized flowchart produced intermediately in the construction

• states: positions in a transition diagram

• Transition diagram for token relop

start letter other *

digit digit digit

start digit . digit E + or - digit other *

start digit . digit other *

start digit other *

Capturing keyword “begin”

What if both need to happen at the same time?

Machine is much more complicated – just for these two tokens!

• Systematic approach for all transition diagrams

• Transition diagram for WS should be placed at the

• AKA “Finite State Machines”, “Finite Automata”,

• One start state

A nondeterministic finite automaton (NFA) is a mathematical

A deterministic finite automaton (DFA) is a mathematical

1. There is an algorithm for converting any RE into an NFA.

These facts tell us that REs, NFAs and DFAs have

All three describe the class of regular languages.

• Both DFA and NFA are the recognizers of regular

The regular expressions over finite Σ are the strings

• { } (empty set) is a regular expression for the empty set

• Empty string ε is a regular expression denoting { ε }

• a is a regular expression denoting {a } for any a in Σ

If P and Q are regular expressions with NFAs Np, Nq:

If Q is a regular expression with NFA Nq:

Defn: ε-closure(T) = T + all NFA states reachable

• Idea: Each state in the new DFA will correspond to

• Input: NFA N with state set SN, alphabet Σ, start state

1. Combine serial links by concatenation

parallel edges become alternation

serial edges become concatenation

Find paths that can be “shortened”

b(b|c)da eliminate self-loops

start d (a|b|c) d a (b(b|c)da)*d

serial edges become concatenation

start d (a|b|c) d a (b(b|c)da)*d

• Generate all strings in the language

Try the following:

• The set of strings recognized by an NFA can be described by a

• The set of strings described by a Regular Expression can be

• The set of strings recognized by an DFA can be described by a

• The set of strings described by a Regular Expression can be

• Typically, one of these formalisms is more natural for

Not all languages are regular”

Non-regular languages cannot be described using REs,

You might also like