Chapter 2
Lexical Analysis
The Role of the Lexical Analyzer
As the first phase of a compiler, the main task of the lexical analyzer is to read the
input characters of the source program, group them into lexemes, and produce as
output a token for each lexeme in the source program.
The stream of tokens is sent to the parser for syntax analysis. It is common for the
lexical analyzer to interact with the symbol table as well. When the lexical analyzer
discovers a lexeme constituting an identifier, it needs to enter that lexeme into the
symbol table. In some cases, information regarding the kind of identifier may be read
from the symbol table by the lexical analyzer to assist it in determining the proper
token it must pass to the parser.
These interactions are suggested in Fig. 3.1. Commonly, the interaction is
implemented by having the parser call the lexical analyzer. The call, suggested
by the getNextToken command, causes the lexical analyzer to read characters
from its input until it can identify the next lexeme and produce for it the next
token, which it returns to the parser.
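This pull-style interaction can be pictured with a short C sketch. The names Token,
TokenKind, and the stubbed-out lexer body below are hypothetical, chosen only to
show the control flow in which the parser drives the lexical analyzer:

#include <stdio.h>

/* Hypothetical token kinds, for illustration only. */
typedef enum { TOK_ID, TOK_NUMBER, TOK_RELOP, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    int       attribute;   /* e.g. which relational operator was seen */
} Token;

/* Stub lexer: a real one would read input characters until it
   has identified the next lexeme, then build its token. */
Token getNextToken(void) {
    static int calls = 0;
    Token t;
    t.kind = (calls++ < 3) ? TOK_ID : TOK_EOF;
    t.attribute = 0;
    return t;
}

/* The parser pulls one token at a time from the lexer. */
void parse(void) {
    for (Token t = getNextToken(); t.kind != TOK_EOF; t = getNextToken())
        printf("parser received a token of kind %d\n", t.kind);
}

int main(void) { parse(); return 0; }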
Since the lexical analyzer is the part of the compiler that reads the source text, it may
perform certain other tasks besides identification of lexemes. One such task is
stripping out comments and whitespace (blank, newline, tab, and perhaps other
characters that are used to separate tokens in the input). Another task is correlating
error messages generated by the compiler with the source program.
Token
A token is a sequence of characters that is treated as a single unit because it cannot
be meaningfully broken down any further.
In a programming language such as C, the following can all be considered tokens:
keywords (int, char, float, const, goto, continue, etc.),
identifiers (user-defined names),
operators (+, -, *, /),
delimiters/punctuators such as the comma (,), semicolon (;), and braces ({ }),
and string literals.
Example 1:
int a = 10; //Input Source code
Tokens
int (keyword), a (identifier), = (operator), 10 (constant), and ; (punctuator)
Answer – Total number of tokens = 5
Example 2:
int main ()
{
// printf() sends the string inside quotation to
// the standard output (the display)
printf("Compiler Design Subject");
return 0;
}
Tokens
'int', 'main', '(', ')', '{', 'printf', '(', '"Compiler Design Subject"',
')', ';', 'return', '0', ';', '}'
Answer – Total number of tokens = 14
Lexeme
It is a sequence of characters in the source code that matches the pattern of some
token; each lexeme is thereby identified as a valid instance of that token.
Example:
main is a lexeme of the token type identifier.
(, ), {, and } are lexemes of the token type punctuator.
Pattern
It specifies the rule that describes the set of lexemes a scanner accepts as instances
of a given token.
Example (C, C++):
z = x + y;
This statement has the below form for syntax analyzer
<id> = <id> + <id>; //<id>- identifier (token)
The lexical analyzer not only provides a series of tokens but also builds a symbol
table, in which it records information about the identifiers it finds; whitespace and
comments are stripped out and produce no tokens at all.
The most important example is the token id, where we need to associate with
the token a great deal of information. Normally, information about an identifier (e.g.,
its lexeme, its type, and the location at which it is first found, in case an error
message about that identifier must be issued) is kept in the symbol table. Thus, the
appropriate attribute value for an identifier is a pointer to the symbol-table entry for
that identifier.
Example: For the statements
s = a + b + c;
Avg = s / 3;
the identifiers s, a, b, c, and Avg are entered into the symbol table, and each id
token returned to the parser carries a pointer to the corresponding entry.
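As a concrete sketch of this attribute-value idea, the C fragment below shows one
possible representation in which an id token carries a pointer to its symbol-table
entry. The type names (Token, SymEntry), field layout, and table size are
assumptions for illustration, not a prescribed design:

#include <stdio.h>
#include <string.h>

/* Hypothetical symbol-table entry: lexeme plus attributes. */
typedef struct {
    char lexeme[32];   /* the identifier's spelling, e.g. "Avg"  */
    char type[16];     /* its declared type, filled in later     */
    int  first_line;   /* where first seen, for error messages   */
} SymEntry;

/* Hypothetical token: a name plus an attribute value.
   For an id, the attribute points to its table entry. */
typedef struct {
    const char *name;  /* "id", "number", "relop", ...           */
    SymEntry   *entry; /* non-NULL only for identifiers          */
} Token;

static SymEntry table[100];
static int      nsyms = 0;

/* Return the entry for a lexeme, inserting it on first sight. */
SymEntry *lookup_or_insert(const char *lexeme, int line) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].lexeme, lexeme) == 0)
            return &table[i];
    strcpy(table[nsyms].lexeme, lexeme);
    table[nsyms].first_line = line;
    return &table[nsyms++];
}

int main(void) {
    /* Identifiers from  s = a + b + c;  Avg = s / 3;  */
    Token t1 = { "id", lookup_or_insert("s",   1) };
    Token t2 = { "id", lookup_or_insert("Avg", 2) };
    Token t3 = { "id", lookup_or_insert("s",   2) };  /* reuses t1's entry */
    printf("%s -> %s (first seen on line %d)\n",
           t2.name, t2.entry->lexeme, t2.entry->first_line);
    printf("both uses of s share one entry? %s\n",
           t1.entry == t3.entry ? "yes" : "no");
    return 0;
}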
Recognition of Tokens
Now, we must study how to take the patterns for all the needed tokens and build a
piece of code that examines the input string and finds a prefix that is a lexeme
matching one of the patterns. Our discussion will make use of the following running
example.
Example 3.8: The grammar fragment of Fig. 3.10 describes a simple form of
branching statements and conditional expressions. This syntax is similar to that of the
language Pascal, in that then appears explicitly after conditions.
For relop, we use the comparison operators of languages like Pascal or SQL,
where = is "equals" and <> is "not equals," because it presents an interesting
structure of lexemes.
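Since Fig. 3.10 itself is not reproduced here, the grammar fragment it contains is,
as it appears in the Dragon Book (from which this example is drawn):

stmt -> if expr then stmt
      | if expr then stmt else stmt
      | ε
expr -> term relop term
      | term
term -> id
      | number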
The terminals of the grammar, which are if, then, else, relop, id, and
number, are the names of tokens as far as the lexical analyzer is concerned. The
patterns for these tokens are described using regular definitions, as in Fig. 3.11.
The patterns for id and number are similar to what we saw in Example 3.7.
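Since Fig. 3.11 is likewise not reproduced here, the regular definitions it gives for
these tokens are:

digit  -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z]
id     -> letter (letter | digit)*
if     -> if
then   -> then
else   -> else
relop  -> < | > | <= | >= | = | <>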
For this language, the lexical analyzer will recognize the keywords if, then, and else, as
well as lexemes that match the patterns for relop, id, and number. To simplify
matters, we make the common assumption that keywords are also reserved words:
that is, they are not identifiers, even though their lexemes match the pattern for
identifiers.
In addition, we assign the lexical analyzer the job of stripping out white space, by
recognizing the "token" ws defined by:
ws -> (blank | tab | newline)+
Here, blank, tab, and newline are abstract symbols that we use to express the ASCII
characters of the same names. Token ws is different from the other tokens in that,
when we recognize it, we do not return it to the parser, but rather restart the lexical
analysis from the character that follows the whitespace. It is the following token that
gets returned to the parser.
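A minimal sketch in C of how a hand-written scanner might consume ws without
returning anything to the parser (the function name skip_ws is hypothetical):

#include <stdio.h>

/* Consume blanks, tabs, and newlines, then return the first
   non-whitespace character (or EOF); the scanner resumes token
   recognition from that character. */
int skip_ws(FILE *in) {
    int c;
    do {
        c = fgetc(in);
    } while (c == ' ' || c == '\t' || c == '\n');
    return c;
}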
Our goal for the lexical analyzer is summarized in Fig. 3.12. That table
shows, for each lexeme or family of lexemes, which token name is returned to
the parser and what attribute value, as discussed in Section 3.1.3, is returned.
Note that for the six relational operators, symbolic constants LT, LE, and so
on are used as the attribute value, to indicate which instance of the
token relop we have found. The particular operator found will influence the
code that is output from the compiler.
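Since Fig. 3.12 is not reproduced here, the table it contains is, as given in the
Dragon Book:

Lexemes       Token name   Attribute value
Any ws        -            -
if            if           -
then          then         -
else          else         -
Any id        id           Pointer to table entry
Any number    number       Pointer to table entry
<             relop        LT
<=            relop        LE
=             relop        EQ
<>            relop        NE
>             relop        GT
>=            relop        GE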
LEX
o Lex is a program that generates lexical analyzers. It is commonly used together
with the YACC parser generator.
o The lexical analyzer is a program that transforms an input stream into a
sequence of tokens.
o Lex reads a specification of token patterns and associated actions, and produces
as output C source code that implements the lexical analyzer.
User subroutines are auxiliary procedures needed by the actions. These subroutines
can be compiled separately and loaded with the lexical analyzer.
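As a small illustration, a Lex specification has three parts separated by %%:
declarations, translation rules (pattern-action pairs), and user subroutines. The
sketch below is a hypothetical toy specification, not a complete scanner for the
running example; it strips whitespace and counts identifiers and numbers:

%{
/* C declarations, copied verbatim into the generated scanner. */
#include <stdio.h>
int ids = 0, numbers = 0;
%}

delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?

%%
{ws}       { /* no action and no return: whitespace is stripped */ }
{id}       { ids++; }
{number}   { numbers++; }
.          { /* ignore any other character in this toy example */ }
%%

/* User subroutines: auxiliary code compiled with the scanner. */
int yywrap(void) { return 1; }

int main(void) {
    yylex();
    printf("identifiers: %d, numbers: %d\n", ids, numbers);
    return 0;
}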