Chapter-2
Lexical Analysis
The Role of the Lexical Analyzer
As the first phase of a compiler, the main task of the lexical analyzer is to read the
input characters of the source program, group them into lexemes, and produce as
output a sequence of tokens for each lexeme in the source program.

The stream of tokens is sent to the parser for syntax analysis. It is common for the
lexical analyzer to interact with the symbol table as well. When the lexical analyzer
discovers a lexeme constituting an identifier, it needs to enter that lexeme into the
symbol table. In some cases, information regarding the kind of identifier may be read
from the symbol table by the lexical analyzer to assist it in determining the proper
token it must pass to the parser.
These interactions are suggested in Fig. 3.1. Commonly, the interaction is
implemented by having the parser call the lexical analyzer. The call, suggested
by the getNextToken command, causes the lexical analyzer to read characters
from its input until it can identify the next lexeme and produce for it the next
token, which it returns to the parser.
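As a rough sketch of this interaction, the parser can be pictured as repeatedly asking the lexical analyzer for the next token. The C outline below is illustrative only; the names Token, TokenType, and getNextToken are assumptions for this sketch, not a standard interface.

/* Minimal sketch: the parser drives the lexical analyzer via getNextToken(). */
typedef enum { TOK_ID, TOK_NUMBER, TOK_IF, TOK_RELOP, TOK_EOF } TokenType;

typedef struct {
    TokenType type;      /* token name handed to the parser               */
    int       attribute; /* e.g., symbol-table index or operator constant */
} Token;

Token getNextToken(void);  /* reads characters until the next lexeme is found */

void parse(void)
{
    Token t = getNextToken();
    while (t.type != TOK_EOF) {
        /* ... syntax analysis uses t.type (and t.attribute) here ... */
        t = getNextToken();
    }
}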

Since the lexical analyzer is the part of the compiler that reads the source text, it may
perform certain other tasks besides identification of lexemes. One such task is
stripping out comments and whitespace (blank, newline, tab, and perhaps other
characters that are used to separate tokens in the input). Another task is correlating
error messages generated by the compiler with the source program.

Sometimes, lexical analyzers are divided into a cascade of two processes:


a) Scanning consists of the simple processes that do not require tokenization
of the input, such as deletion of comments and compaction of consecutive whitespace
characters into one (a small C sketch of this stage follows the list below).
b) Lexical analysis proper is the more complex portion, which produces tokens from
the output of the scanner.
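As a hedged illustration of stage (a), the routine below copies its input while deleting // comments and compacting consecutive whitespace characters into a single blank; the function name scan and the exact behaviour are assumptions made for this sketch.

#include <stdio.h>
#include <ctype.h>

/* Sketch of the simple "scanning" stage: strip // comments and
   compact runs of whitespace into one blank before tokenization. */
void scan(FILE *in, FILE *out)
{
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (c == '/') {
            int d = fgetc(in);
            if (d == '/') {                      /* // comment: skip to end of line */
                while ((c = fgetc(in)) != EOF && c != '\n')
                    ;
                fputc(' ', out);                 /* comment acts as a separator */
                continue;
            }
            if (d != EOF) ungetc(d, in);
            fputc(c, out);
        } else if (isspace(c)) {                 /* compact consecutive whitespace */
            while ((c = fgetc(in)) != EOF && isspace(c))
                ;
            if (c != EOF) ungetc(c, in);
            fputc(' ', out);
        } else {
            fputc(c, out);
        }
    }
}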

Tokens, Patterns, and Lexemes

Token
A token is a sequence of characters that is treated as a single unit because it cannot be
broken down further.
In a programming language such as C, the following can be considered tokens:
keywords (int, char, float, const, goto, continue, etc.),
identifiers (user-defined names),
operators (+, -, *, /),
delimiters/punctuators such as the comma (,), semicolon (;), and braces ({ }),
and string literals.

Let us now see how to count the tokens in a piece of C source code:

Example 1:
int a = 10; //Input Source code
Tokens
int (keyword), a (identifier), = (operator), 10 (constant), and ; (punctuator: semicolon)
Answer – Total number of tokens = 5

Example 2:
int main ()
{
// printf() sends the string inside quotation to
// the standard output (the display)
printf("Compiler Design Subject");
return 0;
}
Tokens
'int', 'main', '(', ')', '{', 'printf', '(', '"Compiler Design Subject"',
')', ';', 'return', '0', ';', '}'
Answer – Total number of tokens = 14

Lexeme
A lexeme is a sequence of characters in the source code that matches the predefined rule
(pattern) of the language for some token and is thereby classified as an instance of that token.
Example:
main is a lexeme of the token type identifier.
(, ), {, and } are lexemes of the token type punctuator.

Pattern
A pattern specifies the set of rules that the scanner follows to recognize a lexeme as a valid token.
Examples from programming languages such as C and C++:

For a keyword to be recognized as a valid token, the pattern is simply the sequence of
characters that forms the keyword.
For an identifier to be recognized as a valid token, the pattern is the predefined rule that it
must start with a letter, followed by any number of letters or digits (a small C sketch of this
check follows below).
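As a small hedged sketch of how such a pattern can be checked in C, the function below tests whether a string matches the identifier rule just stated; the name is_identifier is an assumption used only for illustration.

#include <ctype.h>

/* Returns 1 if s matches letter (letter | digit)*, 0 otherwise. */
int is_identifier(const char *s)
{
    if (!isalpha((unsigned char)s[0]))
        return 0;                          /* must start with a letter    */
    for (int i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i]))
            return 0;                      /* then only letters or digits */
    return 1;
}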

The Output of the Lexical Analysis Phase:

The output of the lexical analyzer serves as input to the syntax analyzer as a sequence of
tokens, not as a series of lexemes, because during the syntax analysis phase the individual
lexeme is not important; what matters is the category or class (token) to which the lexeme
belongs.

Example:
z = x + y;
This statement has the following form for the syntax analyzer:
<id> = <id> + <id>; // <id> - identifier (token)
The lexical analyzer not only produces the sequence of tokens but also creates a symbol
table that records the tokens found in the source code, excluding whitespace and comments.

Attributes for Tokens


When a lexeme is encountered in a source program, it is often necessary to keep track of
other occurrences of the same lexeme, i.e., whether this lexeme has been seen before or not.

The most important example is the token id, where we need to associate with
the token a great deal of information. Normally, information about an identifier e.g., its
lexeme, its type, and the location at which it is first found (in case an error message
about that identifier must be issued) is kept in the symbol table. Thus, the appropriate
attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
Example:
s = a + b + c;
Avg = s / 3;
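For the first statement above, the lexical analyzer would hand the parser a token stream along these lines (the operator token names shown are illustrative; the id attributes are pointers to symbol-table entries):

<id, pointer to entry for s> <assign_op>
<id, pointer to entry for a> <add_op>
<id, pointer to entry for b> <add_op>
<id, pointer to entry for c> <;>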

Recognition of Tokens
Now, we must study how to take the patterns for all the needed tokens and build a
piece of code that examines the input string and finds a prefix that is a lexeme
matching one of the patterns. Our discussion will make use of the following running
example.

Example 3.8: The grammar fragment of Fig. 3.10 describes a simple form of
branching statements and conditional expressions. This syntax is similar to that of the
language Pascal, in that then appears explicitly after conditions.
For relop, we use the comparison operators of languages like Pascal or SQL,
where = is "equals" and <> is "not equals," because it presents an interesting
structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id, and
number, are the names of tokens as far as the lexical analyzer is concerned. The
patterns for these tokens are described using regular definitions, as in Fig. 3.11.
The patterns for id and number are similar to what we saw in Example 3.7.
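Since Fig. 3.11 itself is not reproduced here, the regular definitions typically used for these tokens are sketched below; take this as a reconstruction of the standard textbook definitions rather than a verbatim copy of the figure.

digit  -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z]
id     -> letter (letter | digit)*
if     -> if
then   -> then
else   -> else
relop  -> < | > | <= | >= | = | <>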

For this language, the lexical analyzer will recognize the keywords if, then, and else, as
well as lexemes that match the patterns for relop, id, and number. To simplify
matters, we make the common assumption that keywords are also reserved words:
that is, they are not identifiers, even though their lexemes match the pattern for
identifiers.
In addition, we assign the lexical analyzer the job of stripping out whitespace, by
recognizing the "token" ws defined by:

ws -> ( blank | tab | newline )+

Here, blank, tab, and newline are abstract symbols that we use to express the ASCII
characters of the same names. Token ws is different from the other tokens in that,
when we recognize it, we do not return it to the parser, but rather restart the lexical
analysis from the character that follows the whitespace. It is the following token that
gets returned to the parser.
Our goal for the lexical analyzer is summarized in Fig. 3.12. That table
shows, for each lexeme or family of lexemes, which token name is returned to
the parser and what attribute value, as discussed in Section 3.1.3, is returned.
Note that for the six relational operators, symbolic constants LT, LE, and so
on are used as the attribute value, to indicate which instance of the
token relop we have found. The particular operator found will influence the
code that is output from the compiler.
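Since Fig. 3.12 itself is not reproduced here, the usual token names and attribute values for this example are summarized below; this is a reconstruction following the standard presentation, not a copy of the figure.

Lexemes          Token name    Attribute value
any ws           -             -
if               if            -
then             then          -
else             else          -
any id           id            pointer to symbol-table entry
any number       number        pointer to symbol-table entry
<                relop         LT
<=               relop         LE
=                relop         EQ
<>               relop         NE
>                relop         GT
>=               relop         GE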

LEX
o Lex is a program that generates a lexical analyzer. It is used with the YACC parser
generator.
o The lexical analyzer is a program that transforms an input stream into a sequence
of tokens.
o Lex reads the patterns described in its input specification and produces, as output,
C source code that implements the lexical analyzer.

The function of Lex is as follows:

o Firstly, a specification of the lexical analyzer is written as a program lex.l in the Lex
language. The Lex compiler then translates lex.l into a C program lex.yy.c.
o Next, the C compiler compiles lex.yy.c and produces an object program a.out.
o a.out is the lexical analyzer that transforms an input stream into a sequence of
tokens (typical build commands are sketched below).
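As a usage sketch on a typical Unix system, these steps correspond to commands such as the following; the library flag -ll may be -lfl for Flex, depending on the installation, and the input file name is only a placeholder.

lex lex.l            (or: flex lex.l)    -> produces lex.yy.c
cc lex.yy.c -ll                          -> produces a.out
./a.out < input.c                        -> reads the input stream and emits its tokens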

Lex file format


A Lex program is separated into three sections by %% delimiters. The format of a Lex
source file is as follows:

{definitions}
%%
{rules}
%%
{user subroutines}
The definitions section includes declarations of constants, variables, and regular definitions.

The rules section contains statements of the form p1 {action1} p2 {action2} ... pn {actionn},
where each pi is a regular expression and each actioni specifies what the lexical analyzer
should do when pattern pi matches a lexeme.

User subroutines are auxiliary procedures needed by the actions. These subroutines can be
compiled separately and loaded with the lexical analyzer. A complete minimal specification
is sketched below.
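Putting the three sections together, a minimal Lex specification for the running example (keywords, relational operators, identifiers, and numbers) might look as follows; the token codes and the trivial yywrap subroutine are illustrative assumptions, not fixed values.

%{
/* Illustrative token codes; a real program would share these with the parser. */
#define IF     1
#define THEN   2
#define ELSE   3
#define RELOP  4
#define ID     5
#define NUMBER 6
%}

delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}       { /* whitespace: no token is returned to the parser */ }
if         { return IF; }
then       { return THEN; }
else       { return ELSE; }
{id}       { return ID; }
{number}   { return NUMBER; }
"<"|"<="|"="|"<>"|">"|">="  { return RELOP; }

%%

int yywrap(void) { return 1; }  /* user subroutine: stop at end of input */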
