Unit 2 Lexical Analyzer
Unit 2 Lexical Analyzer
Unit 2 Lexical Analyzer
Chapter 3
Outline
Lexical Analysis
Role of Lexical Analyzer
Recognition of Tokens
Lexical Analyzer Generator: Lex/Flex
Design Aspects of Lexical Analyzer
Improves portability
Non-standard symbols and alternate character
encodings can be more easily translated
Lexical
Analyzer
Token,
tokenval
Parser
Get next
token
error
error
Symbol Table
Attributes of Tokens
y := 31 + 28*x
Lexical analyzer
<id, y> <assign, > <num, 31> <+, > <num, 28> <*, > <id, x>
token
tokenval
(token attribute)
Parser
Example
Example
Consider Pascal Statement
(C) 2014, Prepared by Partha Sarathi Chakraborty
const pi = 3.1416;
10
Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of
other components, that there is a source-code error.
Example:
fi(a==f(x))
11
Input Buffering
Two-buffer input scheme to look ahead on
the input and identify tokens
Buffer pairs
Sentinels (Guards)
12
13
14
15
Basis symbols:
is a regular expression denoting language {}
a is a regular expression denoting {a}
16
d n rn
where ri is a regular expression over
{d1, d2, , di-1 }
Each dj in ri is textually substituted in ri
17
Example:
letter A | B | | Z | a | b | | z
digit 0 | 1 | | 9
id letter ( letter | digit )*
Cannot use recursion, this is illegal:
18
For example:
digit [0-9]
num digit+ (. digit+)? ( E (+|-)? digit+ )?
19
Grammar
stmt if expr then stmt
| if expr then stmt else stmt
|
expr term relop term
| term
term id
| num
Regular definitions
if if
then then
else else
relop < | <= | <> | > | >= | =
id letter ( letter | digit )*
num digit+ (. digit+)? ( E (+|-)? digit+ )?
20
start
<
=
>
other
=
>
letter
5
6
return(relop, LE)
return(relop, NE)
4 * return(relop, LT)
return(relop, EQ)
=
7 return(relop, GE)
other
8 * return(relop, GT)
letter or digit
10
other
11 * return(gettoken(),
install_id())
case 9: c = nextchar();
if (isletter(c)) state = 10;
else state = fail();
break;
case 10: c = nextchar();
if (isletter(c)) state = 10;
else if (isdigit(c)) state = 10;
else state = 11;
break;
Decides what
other start state
is applicable
int fail()
{ forward = token_beginning;
swith (start) {
case 0: start = 9; break;
case 9: start = 12; break;
case 12: start = 20; break;
case 20: start = 25; break;
case 25: recover(); break;
default: /* error */
}
return start;
}
21
22
23
lex.yy.c
input
stream
lex or flex
compiler
lex.yy.c
C
compiler
a.out
a.out
sequence
of tokens
24
Lex Specification
A lex specification consists of three parts:
regular definitions, C declarations in %{ %}
%%
translation rules
%%
user-defined auxiliary procedures
The translation rules are of the form:
p1
{ action1 }
p2
{ action2 }
pn
{ actionn }
25
26
Lex actions
BEGIN: It indicates the start state. The lexical analyzer starts at
state 0.
ECHO: It emits the input as it is.
yytext: When lexer matches or recognizes the token from input
token then the lexeme is stored in a null terminated string called
yytext.
yylex(): As soon as call to yylex() is encountered scanner starts
scanning the source program.
yywrap(): It is called when scanner encounters eof i.e. return 0. If
returns 0 then scanner continues scanning.
yyin: It is the standard input file that stores input source program.
yyleng: when a lexer reconizes token then the lexeme is stored in a
null terminated string called yytext. It stores the length or number of
characters in the input string. The value in yyleng is same as strlen()
functions.
27
Installing Software
28
For Windows 8
29
Translation
rules
%{
#include <stdio.h>
%}
%%
[0-9]+ { printf("%s\n", yytext); }
.|\n {}
%%
int main( )
{
yylex( );
}
int yywrap( )
{
return 1;
}
Contains
the matching
lexeme
Invokes
the lexical
analyzer
lex spec.l
gcc lex.yy.c -ll
./a.out < spec.l
30
Digit
only
printed
31
Translation
rules
%{
#include <stdio.h>
int ch = 0, wd = 0, nl = 0;
%}
delim
[ \t]+
%%
\n
{ ch++; wd++; nl++; }
^{delim} { ch+=yyleng; }
{delim}
{ ch+=yyleng; wd++; }
.
{ ch++; }
%%
main()
{ yylex();
printf("%8d%8d%8d\n", nl, wd, ch);
}
Regular
definition
32
Translation
rules
%{
#include <stdio.h>
Regular
%}
definitions
digit
[0-9]
letter
[A-Za-z]
id
{letter}({letter}|{digit})*
%%
{digit}+ { printf(number: %s\n, yytext); }
{id}
{ printf(ident: %s\n, yytext); }
.
{ printf(other: %s\n, yytext); }
%%
main()
{ yylex();
}
33
%}
delim
[ \t\n]
ws
{delim}+
letter
[A-Za-z]
digit
[0-9]
id
{letter}({letter}|{digit})*
number
{digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws}
{ }
if
{return IF;}
then
{return THEN;}
else
{return ELSE;}
{id}
{yylval = install_id(); return ID;}
{number} {yylval = install_num(); return NUMBER;}
<
{yylval = LT; return RELOP;}
<=
{yylval = LE; return RELOP;}
=
{yylval = EQ; return RELOP;}
<>
{yylval = NE; return RELOP;}
>
{yylval = GT; return RELOP;}
>=
{yylval = GE; return RELOP;}
%%
int install_id()
Return
token to
parser
Token
attribute
Install yytext as
identifier in symbol table
Lex Program
34
%}
%%
35
NFA
DFA
Simulate NFA
to recognize
tokens
Simulate DFA
to recognize
tokens
36
Nondeterministic Finite
Automata
(C) 2014, Prepared by Partha Sarathi Chakraborty
37
Transition Graph
An NFA can be diagrammatically
represented by a labeled directed graph
called a transition graph
a
start
0
b
S = {0,1,2,3}
= {a,b}
s0 = 0
F = {3}
38
Transition Table
(0,a) = {0,1}
(0,b) = {0}
(1,b) = {2}
(2,b) = {3}
State
Input
a
Input
b
{0, 1}
{0}
{2}
{3}
39
40
pn
{ action1 }
{ action2 }
{ actionn }
NFA
start
s0
N(p1)
N(p2)
N(pn)
action1
action2
actionn
Subset construction
(optional)
DFA
41
start
start
r1 | r2
start
r1 r2
r*
start
start
i
i
N(r1)
N(r2)
i N(r1)
N(r2) f
N(r)
42
start
a
{ action1 }
abb { action2 }
a*b+ { action3 }
start
start
1
3
start
1
3
a
a
b
4
8
b
b
43
start
7
a
action1
action2
action3
none
action3
44
start
7
a
action1
action2
action3
none
action2
action3
45
46
Example DFA
A DFA that accepts (a|b)*abb
b
start
a
b
1
a
2
a
47
48
start
b
a
-closure({0}) = {0,1,3,7}
move({0,1,3,7},a) = {2,4,7}
-closure({2,4,7}) = {2,4,7}
move({2,4,7},a) = {7}
-closure({7}) = {7}
move({7},b) = {8}
-closure({8}) = {8}
move({8},a) =
none
49
50
end do
51
start
C
start
B
a
b
a
D
a
Dstates
A = {0,1,2,4,7}
B = {1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
D = {1,2,4,5,6,7,9}
E = {1,2,4,5,6,7,10}
10
52
start
1
3
a
a
b
a1
2
4
8
a3
a2
a3
C
b
start
D
a
a1
a3
a2 a3
Dstates
A = {0,1,3,7}
B = {2,4,7}
C = {8}
D = {7}
E = {5,8}
F = {6,8}
53
C
start
B
a
b
a
D
a
start
D
a
54
55
56
#
6
closure
b
4
alternation
|
a
position
number
(for leafs )
57
nullable(n)
firstpos(n)
lastpos(n)
Leaf
true
Leaf i
false
{i}
{i}
|
/ \
c1
c2
nullable(c1)
or
nullable(c2)
firstpos(c1)
firstpos(c2)
lastpos(c1)
lastpos(c2)
/ \
c1
c2
nullable(c1)
and
nullable(c2)
if nullable(c1) then
firstpos(c1)
firstpos(c2)
else firstpos(c1)
if nullable(c2) then
lastpos(c1)
lastpos(c2)
else lastpos(c2)
*
|
c1
true
firstpos(c1)
lastpos(c1)
58
59
{1, 2, 3}
{1, 2, 3}
nullable
{1, 2, 3}
{1, 2}
{1, 2}
{1, 2}
{1, 2}
{1} a {1}
1
{3}
{4}
{6} # {6}
6
{5} b {5}
5
{4} b {4}
4
{3} a {3}
3
{2} b {2}
2
{5}
{6}
firstpos
lastpos
60
end do
61
62
followpos
{1, 2, 3}
{1, 2, 3}
{4}
{5}
{6}
1,2,3
b
start
b
a
a
b
1,2,
3,4
a
1,2,
3,5
a
1,2,
3,6
63
Time-Space Tradeoffs
Automaton
Space
(worst case)
Time
(worst case)
NFA
O(|r|)
O(|r||x|)
DFA
O(2|r|)
O(|x|)