Lecture 3 Lexical Analyzer

Download as pdf or txt
Download as pdf or txt
You are on page 1of 44

Lexical Analysis

Lecture 03
Example or Regular Expression / Regular Definition:

Regular Expression for numbers


digit → 0|1|…|9
digits → digit digit*
optional_fraction → .digits|ε
optional_exponent → ( E (+|-| ε) digits ) | ε
num → digits optional_fraction optional_exponent

Using shorthands:
digit → 0|1|…|9
digits → digit+
optional_fraction → (.digits)?
optional_exponent → ( E (+|-| ε) digits ) ?
num → digits optional_fraction optional_exponent
Transition Diagram

• A stylized flowchart produced intermediately in the construction


of lexical analyzer
• Depicts the actions take place in lexical analyzer

• states: positions in a transition diagram


• edges: arrows connecting the states
• start state: initial state of transition diagram
• accepting state: token recognized
• action: (optional) associated with a state that is executed when
the state is entered

start > =
0 1 2

other *
3
Example

• Transition diagram for token relop


start < =
0 1 2 Return (relop, LE)

>
3 Return (relop, NE)
other
4
*
Return (relop, LT)

=
5 Return (relop, EQ)

> =
6 7 Return (relop, GE)

other *
8 Return (relop, GT)
Example Letter or digit

start letter other *


9 10 11
Return (gettoken(), install_id())

digit digit digit

start digit . digit E + or - digit other *


12 13 14 15 16 17 18 11
E
digit
digit digit

start digit . digit other *


20 21 22 23 24

digit

start digit other *


25 26 27
Capturing Multiple Tokens

Capturing keyword “begin”

start b e g i n WS

WS – white space
Capturing variable names
A – alphabetic
start A WS AN – alphanumeric
AN

What if both need to happen at the same time?


Capturing Multiple Tokens

start b e g i n WS

A-b
AN WS – white space
A – alphabetic
AN WS AN – alphanumeric

Machine is much more complicated – just for these two tokens!


Implementing a Transition Diagram

• Systematic approach for all transition diagrams


– Program size ∝ number of states and edges
• We try each diagram and when we fail then we go to
try the next diagram

• Transition diagram for WS should be placed at the


beginning rather at the end
– Generalize: frequently occurring tokens should come
earlier
Finite State Automata (FSAs)

• AKA “Finite State Machines”, “Finite Automata”,


“FA”

• One start state


• Many final states
• Each state is labeled with a state name
• Directed edges, labeled with symbols
• Two types
– Deterministic (DFA)
– Non-deterministic (NFA)
Nondeterministic Finite Automata

A nondeterministic finite automaton (NFA) is a mathematical


model that consists of
1. A set of states S
• S= {s0, s1, …., sN}
2. A set of input symbols Σ
• Σ = {a, b, …}
3. A transition function that maps state/symbol pairs to a set of
states:
S x {Σ + ε} Æ set of S
4. A special state s0 called the start state
5. A set of states F (subset of S) of final states

INPUT: string
OUTPUT: yes or no
Example: NFA

Transition Table:
ε
start
0 a 1 b 2 b 3
STATE
a b ε
a,b 0 0,1 0 3

S = { 0,1,2,3} 1 2
S0 = 0 2 3
Σ = {a,b}
F = {3} 3
Deterministic Finite Automata

A deterministic finite automaton (DFA) is a mathematical


model that consists of
1. A set of states S
• S= {s0, s1, …., sN}
2. A set of input symbols Σ
• Σ = {a, b, …}
3. A transition function that maps state/symbol pairs to a state:
SxΣÆS
4. A special state s0 called the start state
5. A set of states F (subset of S) of final states

INPUT: string
OUTPUT: yes or no
DFA Execution

DFA(int start_state) {
state current = start_state;
input_element = next_token();
while (input to be processed) {
current = transition(current,table[input_element])
if current is an error state return No;
input_element = next_token();
}
if current is a final state return Yes;
else return No;
}
Relation between RE, NFA and DFA

1. There is an algorithm for converting any RE into an NFA.


2. There is an algorithm for converting any NFA to a DFA.
3. There is an algorithm for converting any DFA to a RE.

These facts tell us that REs, NFAs and DFAs have


equivalent expressive power.

All three describe the class of regular languages.


DFA vs NFA

• Both DFA and NFA are the recognizers of regular


sets.
• But – time-space trade space exists
• DFAs are faster recognizers
– Can be much bigger too..
Converting Regular Expressions to NFAs
Thompson’s Construction

The regular expressions over finite Σ are the strings


over the alphabet Σ + { ), (, |, *} such that:

• { } (empty set) is a regular expression for the empty set

• Empty string ε is a regular expression denoting { ε }

start i ε f

• a is a regular expression denoting {a } for any a in Σ

start i a f
Converting Regular Expressions to NFAs

If P and Q are regular expressions with NFAs Np, Nq:

P | Q (union)

ε Np ε
start
i f
ε Nq ε

PQ (concatenation)

start
i Np Nq f
Converting Regular Expressions to NFAs

If Q is a regular expression with NFA Nq:


Q* (closure)
ε

start
i
ε Nq ε f

ε
Example (ab* | a*b)*

Starting with:
ab* start a a*b start b
1 2 3 4

b a

ab* | a*b
a
1 2
ε ε
start b
5 6
ε ε
3 b 4

a
Example (ab* | a*b)*

ab* | a*b
a
1 2
ε ε
start b
5 6
ε b ε
3 4

(ab* | a*b)*
a
ε
1 2 ε
start ε b
ε
7 5 6 8
ε ε
ε b
ε 3 4
a
Terminology: ε-closure

Defn: ε-closure(T) = T + all NFA states reachable


from any state in T using only ε transitions.
b
1 2 b

a ε
b 5 ε-closure({1,2,5}) = {1,2,5}
a
ε-closure({4}) = {1,4}
3
ε
4 ε-closure({3}) = {1,3,4}
ε-closure({3,5}) = {1,3,4,5}
Converting NFAs to DFAs (subset construction)

• Idea: Each state in the new DFA will correspond to


some set of states from the NFA. The DFA will be in
state {s0,s1,…} after input if the NFA could be in any of
these states for the same input.

• Input: NFA N with state set SN, alphabet Σ, start state


sN, final states FN, transition function TN: SN x {Σ U ε} Æ
SN
• Output: DFA D with state set SD, alphabet Σ, start state
sD = ε-closure(sN), final states FD, transition function
TD: SD x Σ Æ SD
Algorithm: Computation of ε-closure
push all states a ∈ T onto stack STK
initialize: ε-closure(T) = T
while STK is not empty do begin
pop t, the top element, off STK
for each stat u with and edge from t to u labeled ε do begin
if u is not in ε-closure(T) do begin
add u to ε-closure(T)
push u onto STK
end if
end for
end while
Algorithm: Subset Construction
sD = ε-closure(sN) -- create start state for DFA
SD = {sD} (unmarked)
while there is some unmarked state R in SD
mark state R
for all a in Σ do
s = ε-closure(ΤΝ(R,a));
if s not already in SD then add it (unmarked)
TD(R,a) = s;
end for
end while
FD = any element of SD that contains a state in FN
Example 1: Subset Construction

NFA N with
NFA • State set SN = {1,2,3,4,5},
• Alphabet Σ = {a,b}
• Start state sN=1,
start ε
1 2 a,b • Final states FN={5},
b • Transition function TN: SN x {Σ ∪ ε} Æ SN
a 5

a,b
3
b
4 a b ε
1 3 - 2
2 5 5, 4 -
3 - 4 -
4 5 5 -
5 - - -
Example 1: Subset Construction

NFA
start 1,2
start ε
1 2 a,b

a b 5

a,b
3 4 a b
b
{1,2}
Example 1: Subset Construction

NFA
start b
1,2 4,5

start ε a
1 2 a,b

a b 5 3,5

a,b a b
3 4
b {1,2} {3,5} {4,5}
{3,5}
{4,5}
Example 1: Subset Construction

NFA
start b
1,2 4,5

start ε a
1 2 a,b
b 3,5 b 4
a 5

a,b a b
3 4
b {1,2} {3,5} {4,5}
{3,5} - {4}
{4,5}
{4}
Example 1: Subset Construction

NFA
start b a,b
1,2 4,5 5

start ε a
1 2 a,b
b 3,5 b 4
a 5

a,b a b
3 4
b {1,2} {3,5} {4,5}
{3,5} - {4}
{4,5} {5} {5}
{4}
{5}
Example 1: Subset Construction

NFA
start b a,b
1,2 4,5 5

start ε a a,b
1 2 a,b
b 3,5 b 4
a 5

a,b a b
3 4
b {1,2} {3,5} {4,5}
{3,5} - {4}
{4,5} {5} {5}
{4} {5} {5}
{5} - -
Example 1: Subset Construction

start b
a,b
1,2 4,5 5

NFA
a
a,b

start ε b
1 2 a,b 3,5 4
a b 5

a,b a b
3 4
b {1,2} {3,5} {4,5}
{3,5} - {4}
{4,5} {5} {5}
All final states since the
NFA final state is included {4} {5} {5}
{5} - -
Example 2: Subset Construction

NFA

start b
1 2 b

a ε b 5

a
3 4
ε
Example 2: Subset Construction

NFA DFA

start b start a
1 2 b 1 1,3,4
ε b b
a 5 b a
a 2 b a
3 4 1,3,
ε b b 4,5
a
1,4,5
Example 3: Subset Construction

NFA DFA

b
start ε b start
1 2 a 3 5 1,2,4

ε a
b b
b 3,4 3,5 3
a 4 5
b b
a

4
a
Converting DFAs to REs

1. Combine serial links by concatenation


2. Combine parallel links by alternation
3. Remove self-loops by Kleene closure
4. Select a node (other than initial or final) for removal. Replace
it with a set of equivalent links whose path expressions
correspond to the in and out links
5. Repeat steps 1-4 until the graph consists of a single link
between the entry and exit nodes.
Example
a
start d b d a d
0 1 2 3 4 5
c
d b b

6 c 7

parallel edges become alternation

start d a|b|c d a d
0 1 2 3 4 5

d b
b|c
6 7
Example

start d a|b|c d a d
0 1 2 3 4 5

d b
b|c
6 7

serial edges become concatenation

start d (a|b|c) d a d
0 3 4 5
b (b|c) d
Example

start d (a|b|c) d a d
0 3 4 5
b (b|c) d

Find paths that can be “shortened”

start d (a|b|c) d a d
0 3 4 5

b(b|c)da
Example

start d (a|b|c) d a d
0 3 4 5

b(b|c)da eliminate self-loops

start d (a|b|c) d a (b(b|c)da)*d


0 3 4 5

serial edges become concatenation

start d (a|b|c) d a (b(b|c)da)*d


0 5
Describing Regular Languages

• Generate all strings in the language


• Generate only strings in the language

Try the following:


– Strings of {a,b} that end with ‘abb’
– Strings of {a,b} where every a is followed by at least
one b
Strings of (a|b)* that end in abb

re: (a|b)*abb

start a b b
0 1 2 3

a,b NFA

b
a

start a b b
0 1 2 3

b a a

DFA
Relationship among RE, NFA, DFA

• The set of strings recognized by an NFA can be described by a


Regular Expression.

• The set of strings described by a Regular Expression can be


recognized by an NFA.

• The set of strings recognized by an DFA can be described by a


Regular Expression.

• The set of strings described by a Regular Expression can be


recognized by an DFA.

• DFAs, NFAs, and Regular Expressions all have the same “power”.
They describe “Regular Sets” (“Regular Languages”)

• The DFA may have a lot more states than the NFA. (May have
exponentially as many states, but...)
Suggestions for writing NFA/DFA/RE

• Typically, one of these formalisms is more natural for


the problem. Start with that and convert if necessary.
• In DFAs, each state typically captures some partial
solution
• Be sure that you include all relevant edges (ask –
does every state have an outgoing transition for all
alphabet symbols?)
Non-Regular Languages

Not all languages are regular”


• The language ww where w=(a|b)*

Non-regular languages cannot be described using REs,


NFAs and DFAs.

You might also like