1
Lex Programming
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Compilation Sequence
2
• The main job of a lexical
analyzer (scanner) is to break
up an input stream into more
usable elements (tokens)
a = b + c * d;
ID ASSIGN ID PLUS ID
MULT ID SEMICOLON
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
What is Lex?
3
• A Lex is a tool for automatically generating a Lexical
analyzer or lexer or Scanner given a lex specification
(.l file). It takes as its input a LEX source program and
produces lexical Analyzer as its output.
• A lexer (Lexical analyzer) or scanner is used to perform
lexical analysis, or the breaking up of an input stream
into meaningful units, or tokens.
• For example, consider breaking a text file up into
individual words.
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
What is Lex?
4
Lexical analyzers tokenize input streams
Tokens are the terminals of a language
English
words, punctuation marks, …
Programming language
Identifiers, operators, keywords, …
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
An Overview of Lex
5
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
An Overview of Lex
6
Note :
• The program lex.yy.c consists of a tabular representation
of a transition diagram constructed from the regular
expressions of lex.l, together with a standard routine that
uses the table to recognize the tokens.
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Lex program structure
7
A Lex program is divided into three sections:
1. Global C and Lex declarations section
2. Lex rules section
3. C code section.
These sections are delimited by %%.
… Definition section …
%%
… Lex rules …
%%
… C subroutines …
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Lex program structure
8
Global C and Lex declarations section
This is the place to define macros ,declare C
variables and import header files written in C.
We'll also perform token declarations of Lex here
(i.e regular definition)
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Lex program structure
9
Lex rules section
Lex rules describe the tokens that we want to match
The Lex rules each have the form
Pattern { Action }
Each pattern is a regular expression, which may use the
regular definitions of the declaration section.
The actions are typically C code which the lexer will
execute when the input matches a given pattern
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Lex program structure
10
C code section
The C code section contains C statements and
functions
Note that this section has to include the yylex()
and yywrap() function.
(Lex has a set of functions and variables that are
available to the user.)
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Lex program Example
11
/*** Definition section ***/
%{
#include <stdio.h> C declarations and header files
int c=0;
%}
digit [0-9]
letter [a-zA-Z] Regular definition
id {letter}({letter}|{digit})*
%%
/*** LEX Rules section ***/
{id} {c++;}
%%
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Lex program Example
12
/*** C Code section***/
int main() {
printf("Enter the input");
yylex();
printf(“No. of identifiers = %d”,c);
}
int yywrap()
{
return 1;
}
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Lex program structure
13
Note the following:
1. Anything within %{ and %} is copied directly to
the file lex . yy . C
2. Actions associated with regular expressions in lex
rules section are carried over directly to lex.yy.c
3. Also functions in C code section are copied
directly to the file lex . yy . C
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Lex functions
14
yylex() The function that starts the analysis. It is automatically
generated by Lex. (this is the entry point to LEX)
yywrap() This function is called by LEX when end of file (or
input) is encountered. If this function returns 1, the
scanning stops.
yyless(int n) This function can be used to push back all but first ‘n’
characters of the read token.
yymore() This function tells the lexer to append the next token to
the current token.
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Lex variables
Lex has a set of functions and variables that are available to the user.
15
yyin Of the type FILE*. This points to the current file being parsed
by the lexer. (input file)
yyout Of the type FILE*. This points to the output file where the
output of the lexer will be written.
By default, both yyin and yyout point to standard input and
output.
yytext The text of the matched pattern is stored in this variable (char*).
yyleng Gives the length of the matched pattern.
yylineno Provides current line number information. (May or may not be
supported by the lexer.)
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Lex variables
16
yylval Purpose of this variable to hold the token
value(lexical value) returned by a function that is
defined in C code section
INITIAL initial start condition
BEGIN condition switch start condition
ECHO write matched string
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
LEX REGULAR EXPRESSIONS
17
A LEX regular expression is a word made of
text characters (letters of the alphabet, digits, ...)
operators " \ { } [ ] ^ $ < > ? . * + | () /
Moreover
An operator can be used as a text character if it preceded with
the escape operator (backslash).
The quotation marks indicate that whatever is contained
between a pair of quotes is to be taken as text characters.
For instance, xyz"++” matches the string xyz++.
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
LEX REGULAR EXPRESSIONS
18
A CHARACTER CLASS is a class of characters
specified using the operator pair [ ].
The expression [ab] matches the string a or b.
Within square brackets most operators are ignored
except the three special characters \ - ^ are which
used as follows
the escape character \ as above,
the minus character - which is used for ranges like in digit
[0-9]
the hat character ^ as first character after the opening square
bracket, it is used for complemented matches like in NOTabc
[^abc]
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
LEX REGULAR EXPRESSIONS
19
OPTIONAL EXPRESSIONS:
The ? operator indicates an optional element of an expression.
For instance, ab?c matches either ac or abc.
REPEATED EXPRESSIONS:
Repetitions of patterns are indicated by the operators * and +.
The pattern a* matches any number of consecutive a
characters (including zero).
The pattern [a-z]+ is any positive number of consecutive
lower-case alphabetic characters.
Hence we can recognize identifiers in a typical computer
language with [A-Za-z][A-Za-z0-9]*
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
LEX REGULAR EXPRESSIONS
20
Repetitions can also be obtained with the pair
operator {}.
If {} encloses numbers, it specifies repetitions.
For instance a{1,5} matches 1 to 5 repetitions of a.
Note that if {} encloses a name, this name should be defined in
the definition section. Then LEX substitutes the definition for
the name.
ALTERNATING:
The operator | indicates alternation.
For instance, (ab|cd) matches the language consisting of both
words ab and cd.
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
LEX REGULAR EXPRESSIONS
21
GROUPING:
Parentheses are used for grouping.
For instance, (ab|cd+)?(ef)* denotes the language of the
words that are either empty or
optionally starts with
ab or
c followed by any positive number of d
and continues with any number of repetition of ef
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
LEX REGULAR EXPRESSIONS
22
CONTEXT SENSITIVITY:
LEX provides some support for contextual grammatical rules.
If ^ is the first character in an expression, then this
expression will only be matched at the beginning of a line.
If $ is the last character in an expression, then this
expression will only be matched at the end of a line.
If r and s are two LEX regular expressions then r/s is
another LEX regular expression.
It matches r if and only if it is followed by an s.
It is called a trailing context.
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Regular expressions in Lex- summary
23
Character Meaning
A-Z, 0-9, a-z Characters and numbers that form part of the pattern.
. matches any single character except \n
* matches 0 or more instances of the preceding regular
expression
+ matches 1 or more instances of the preceding regular
expression
? matches 0 or 1 of the preceding regular expression
| matches the preceding or following regular expression
[ ] Defines a character class. Matches any character in the
brackets. Example: [abC] matches either of a, b, and C.
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Regular expressions in Lex
24
Character groups enclosed regular expression into a new
regular expression
[^S] Matches any characters not in S
() groups enclosed regular expression into a new regular
expression
“…” matches everything within the “ " literally (even
meta characters)
{ } Indicates how many times a pattern can be present.
Example: A{1,3} implies one or three occurrences of
A may be present
^ Matches beginning of line as the first character of
the pattern.
$ Matches end of line as the last character of the
pattern.
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Defining regular expressions in Lex
25
Character groups enclosed regular expression into a new
regular expression
\ Used to escape meta characters. Also used to remove
the special meaning of characters as defined in this
table.
/ Look ahead. Matches the preceding pattern only if
followed by the succeeding expression. Example: A0/1
matches A0 only if A01 is
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Meta-characters
26
• meta-characters (do not match themselves, because
they are used in the preceding regular expressions):
()[]{}<>+/,^*|.\"$?-%
to match a meta-character, prefix with "\"
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
27
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Examples of regular expressions
28
Regular Meaning
expression
joke[rs] Matches either jokes or joker.
A{1,2}shis Matches AAshis, Ashis
(A)+ [b-e] Matches one or more occurrences of
A followed by any character from b
to e.
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Examples of regular expressions
29
Regular expression Meaning
ab?c ac or abc.
[a-zA-Z][a-zA-Z0-9]* all alphanumeric strings
with a leading alphabetic
character
[-+0-9] A –ve or +ve single digit
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Examples of regular expressions
30
Regular expression Meaning
[^a-zA-Z] any character which is not a
letter
[a-zA-Z][a-zA-Z0-9]* all alphanumeric strings
with a leading alphabetic
character
[-+0-9] A –ve or +ve single digit
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Examples of token declarations
31
• an integer: 12345
[1-9][0-9]*
• a word: cat
[a-zA-Z]+
• a (possibly) signed integer: 12345 or -12345
[-+]?[1-9][0-9]*
• a floating point number: 1.2345
[0-9]*”.”[0-9]+
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Word counting Program: Example
32
%{
#include <stdio.h>
int wordCount = 0;
%}
word [^ \t\n,\.:]+
delim [" "\n\t]+
%%
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Word counting Program: Example
33
{word} { wordCount++;}
{delim} {;}
%%
int main()
{
printf(" Enter the input string");
yylex(); /* start the analysis*/
printf(" No of words: %d\n", wordCount);
}
int yywrap()
{
return 1;
}
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
Execution steps:
34
Text editor type the code and save as example.l file
Open terminal
lex example.l
gcc lex.yy.c -ll
./a.out
Open terminal
lex example.l
gcc -o example lex.yy.c -ll
./example
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
options
35
gcc –ll
link the code with the library files
gcc -o output file
Build the output generated to output file
-o example: This will compile the lex.yy.c file but
instead of giving default name (a), it will give output
file as example.
-o is for output file option.
gcc –w
Disables all warning messages during the compilation
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022
36
THANK YOU
Dr. Anisha P Rodrigues,NMAMIT,Nitte 8/23/2022