SE Compiler Chapter 2
2. Lexical Analysis
The primary function of a scanner is to transform a character stream into a token stream. A scanner is also called a lexical analyzer or lexer; the three names are used interchangeably.
Lexical analysis is the first phase of a compiler. It takes as input the source code, possibly already modified by a language preprocessor, and breaks it into a series of tokens, discarding whitespace and comments along the way.
If the lexical analyzer finds an invalid token, it reports an error. The lexical analyzer works closely with the syntax analyzer: it reads the character stream from the source code, checks for legal tokens, and supplies them to the syntax analyzer on demand.
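To make the scanner/parser hand-off concrete, here is a minimal illustrative sketch in Python (the token classes and the tokenize generator are our own invention, not a fragment of any particular compiler):

    import re

    # Each token class is named and given a pattern; SKIP is discarded.
    TOKEN_SPEC = [
        ("NUMBER", r"\d+"),
        ("ID",     r"[A-Za-z_]\w*"),
        ("OP",     r"[+\-*/=]"),
        ("SKIP",   r"[ \t\n]+"),
    ]
    MASTER = re.compile("|".join("(?P<%s>%s)" % (n, p) for n, p in TOKEN_SPEC))

    def tokenize(source):
        """Yield (kind, lexeme) pairs; raise on a lexical error."""
        pos = 0
        while pos < len(source):
            m = MASTER.match(source, pos)
            if m is None:
                raise SyntaxError("illegal character %r at position %d"
                                  % (source[pos], pos))
            pos = m.end()
            if m.lastgroup != "SKIP":       # whitespace is removed here
                yield (m.lastgroup, m.group())

    # Example: list(tokenize("count = 42")) yields
    # [('ID', 'count'), ('OP', '='), ('NUMBER', '42')]

A parser would simply pull one token at a time from this generator, which is exactly the on-demand interaction described above.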
Tokens
A lexeme is a sequence of characters that forms a token. There are predefined rules for deciding whether a lexeme is a valid token; these rules are given by the grammar, by means of patterns. A pattern describes what can be a token, and patterns are defined by means of regular expressions.
In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can all be considered tokens.
Token: a token is a sequence of characters that can be treated as a single logical entity. Typical tokens are:
1) identifiers 2) keywords 3) operators 4) special symbols 5) constants
Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is
described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a
token.
Example (description of tokens):

    Token    Informal description (pattern)            Sample lexemes
    if       the characters i, f                       if
    id       a letter followed by letters and digits   pi, score, D2
    num      any numeric constant                      3.14, 0, 6.02e23
    relop    < or <= or = or <> or >= or >             <, <=

A pattern is a rule describing the set of lexemes that can represent a particular token in the source program.
Lexical errors are the errors raised by your lexer when it is unable to continue, that is, when there is no way to recognize the current lexeme as a valid token. Syntax errors, on the other hand, are raised by your parser when a sequence of already recognized, valid tokens does not match any of the right-hand sides of your grammar rules. A simple panic-mode error-handling scheme requires that we return to a high-level parsing function when a parsing or lexical error is detected.
To identify the tokens we need some method of describing the possible tokens that can appear in the input stream. For this purpose we introduce regular expressions, a notation that can be used to describe essentially all the tokens of a programming language.
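For instance, anticipating the notation developed below, a few typical token patterns written as regular expressions might look like this (the pattern names are illustrative, not taken from any particular language definition):

    identifier : letter (letter | digit)*
    integer    : digit digit*
    relop      : < | <= | = | <> | > | >=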
Let us see how language theory defines the following terms:
Alphabets
Any finite set of symbols is an alphabet: ∑ = {0,1} is the binary alphabet, ∑ = {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and ∑ = {a–z, A–Z} is the alphabet of English letters.
Strings
Any finite sequence of symbols drawn from an alphabet is called a string. The length of a string is the total number of occurrences of symbols in it; e.g., the length of the string tutorialspoint is 14, denoted |tutorialspoint| = 14. A string having no symbols, i.e., a string of length zero, is known as the empty string and is denoted by ε (epsilon).
Example: 01101, 111, 0001 are strings over the binary alphabet ∑ = { 0, 1 }.
• Power of an alphabet:
– If ∑ is an alphabet, we can express the set of all strings of a certain length from that alphabet by using an exponential notation. We define ∑^k to be the set of strings of length k, each of whose symbols is in ∑. For example, if ∑ = {0,1}, then ∑^2 = {00, 01, 10, 11}.
• Operations on strings
– Concatenation (product):
• Let x and y be strings. Then xy denotes the concatenation of x and y, that is, the string formed by making a copy of x followed by a copy of y.
• Example: let x = 01101 and y = 110; then xy = 01101110 (and yx = 11001101).
• Note: for any string w, εw = wε = w (i.e., ε is the identity for concatenation).
A language is a set of strings over some finite alphabet. A language may be finite or infinite, and since languages are sets, the usual mathematical set operations can be performed on them. The languages of interest here, the regular languages, can be described by means of regular expressions.
Example: A language L over an alphabet V is a subset of V*. For instance, if V = {a, b, c}, the following are languages on V: ∅ (the empty language), {ε}, {a, b, c}, {a, ab, abc}, and the infinite language of all strings of a's.
Operations on languages
The principal operations on languages are:
• Union: L ∪ M = { s | s is in L or s is in M }
• Concatenation: LM = { st | s is in L and t is in M }
• Kleene closure: L* = the union of L^i for all i ≥ 0, i.e., zero or more concatenations of L
• Positive closure: L+ = the union of L^i for all i ≥ 1, i.e., one or more concatenations of L
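Because finite languages are simply sets of strings, these operations can be tried out directly. Here is a small illustrative sketch in Python (the names L, M, and power are our own):

    # Finite languages represented as Python sets of strings.
    L = {"a", "ab"}
    M = {"c", "cc"}

    union = L | M                              # {'a', 'ab', 'c', 'cc'}
    concat = {s + t for s in L for t in M}     # {'ac', 'acc', 'abc', 'abcc'}

    def power(lang, k):
        """lang concatenated with itself k times (lang^0 = {''})."""
        result = {""}
        for _ in range(k):
            result = {s + t for s in result for t in lang}
        return result

    # Kleene closure is the union of all powers; truncated here at k = 2:
    kleene_up_to_2 = power(L, 0) | power(L, 1) | power(L, 2)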
Regular expressions are widely used in computer applications other than compilers. The Unix utility grep
uses them to define search patterns in files. Unix shells allow a restricted form of regular expressions when
specifying file lists for a command. Most editors provide a “context search” command that enables you to
specify desired matches using regular expressions.
Our definition of regular expressions starts with a finite character set, or vocabulary (denoted ∑). This
vocabulary is normally the character set used by a computer. Today, the ASCII character set, which
contains 128 characters, is very widely used. Java, however, uses the Unicode character set. This set
includes all of the ASCII characters as well as a wide variety of other characters.
The lexical analyzer needs to scan and identify only those strings/tokens/lexemes that belong to the language in hand. It searches for the patterns defined by the language rules.
Regular expressions express languages by defining patterns for finite strings of symbols. The grammar corresponding to regular expressions is known as a regular grammar, and the language defined by a regular grammar is known as a regular language.
A regular expression is an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for sets of strings. Programming language tokens can be described by regular languages. The specification of regular expressions is an example of a recursive definition. Regular languages are easy to understand and have efficient implementations.
The lexical tokens of a language may also be specified by using a restricted form of BNF notation known as a regular grammar. A regular grammar has production rules in which a nonterminal derives a terminal symbol, optionally followed by another nonterminal symbol. Thus, only the following two forms are allowable in a regular grammar, but they are sufficient to specify lexical tokens:
A→a
A → aA
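For instance (an illustrative grammar, with letter and digit standing for the individual terminal symbols a–z and 0–9), identifiers consisting of a letter followed by any number of letters and digits can be specified as:

    Ident → letter Rest
    Ident → letter
    Rest  → letter Rest
    Rest  → digit Rest
    Rest  → letter
    Rest  → digit

Every production is of the form A → a or A → aA, so this is a regular grammar.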
Regular expressions are convenient for specifying lexical tokens, but we need a formalism that can be
implemented as a computer program. For this we can use finite automata.
A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly. Finite automata are recognizers for regular expressions. When an input string is fed into a finite automaton, the automaton changes its state for each symbol. If the input string is successfully processed and the automaton reaches a final state, the string is accepted, i.e., the string just fed was a valid token of the language in hand.
The transition function (δ) maps a state from the finite set of states (Q) together with an input symbol from the finite input alphabet (Σ) to a state: δ : Q × Σ → Q.
Formally, a finite automaton is described by a 5-tuple (Q, Σ, δ, q0, F), where Q is the finite set of states, Σ the input alphabet, δ the transition function, q0 ∈ Q the start state, and F ⊆ Q the set of final (accepting) states.
DETERMINISTIC AUTOMATA
A deterministic finite automaton (DFA) has at most one transition from each state on any input. A DFA is a special case of an NFA in which:
· it has no transitions on input ε, and
· each input symbol has at most one transition from any state.
A finite automaton is called a DFA if there is only one path for a specific input from the current state to the next state.
The regular expression is converted into minimized DFA by the following procedure:
Regular expression → NFA → DFA → Minimized DFA
An example of a finite state machine which accepts any identifier beginning with a letter and followed by any number of letters and digits is sketched below. Valid keywords of programming languages can also be recognized with finite state automata.
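A direct encoding of that identifier automaton in Python (a sketch; the state names start, in_id, and dead are invented for illustration):

    # DFA for identifiers: a letter, then any number of letters/digits.
    # States: "start" (initial), "in_id" (accepting), "dead" (reject).
    def is_identifier(s):
        state = "start"
        for ch in s:
            if state == "start":
                state = "in_id" if ch.isalpha() else "dead"
            elif state == "in_id":
                state = "in_id" if (ch.isalpha() or ch.isdigit()) else "dead"
            else:                   # dead state: no way back
                return False
        return state == "in_id"

    # is_identifier("x25") -> True;  is_identifier("25x") -> False

Keywords can be handled the same way, or by looking up an accepted identifier in a keyword table.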
A nondeterministic finite automaton (NFA) is one that has a choice of edges – labeled with the same
symbol – to follow out of a state. Or it may have special edges labeled with ε (the Greek letter epsilon) that
can be followed without eating any symbol from the input.
Lex/Flex constructs NFA's from regular expressions using a system called Thompson's Construction,
developed by Ken Thompson at Bell Labs for the QED editor. It works as follows:
a. The simplest possible regular expression is a single character, and this expression can be
represented by a correspondingly simple NFA. For example, a machine that matches an a is shown
below:
b. The concatenation of two regular expressions is also straightforward. The following machine
represents the expression ab by constructing individual machines for each subexpression (the a and
b), and then connecting the machines with an ε edge:
c. The OR operator (the vertical bar |) which can appear in the regular expression a|b can be
represented using NFA as follows:
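The three rules can be mirrored directly in code. The sketch below uses our own illustrative representation, not Lex's internal data structure: states are integers from a counter, transitions is a dict keyed by (state, symbol), and EPS marks ε edges. Each fragment is a (start, accept) pair of states:

    EPS = None                # marker for epsilon edges
    transitions = {}          # (state, symbol) -> set of next states
    counter = 0

    def new_state():
        global counter
        counter += 1
        return counter

    def add_edge(s, sym, t):
        transitions.setdefault((s, sym), set()).add(t)

    def char(c):              # rule (a): a single character
        s, t = new_state(), new_state()
        add_edge(s, c, t)
        return (s, t)

    def concat(f1, f2):       # rule (b): join fragments with an epsilon edge
        add_edge(f1[1], EPS, f2[0])
        return (f1[0], f2[1])

    def alt(f1, f2):          # rule (c): a|b via new start/accept states
        s, t = new_state(), new_state()
        for f in (f1, f2):
            add_edge(s, EPS, f[0])
            add_edge(f[1], EPS, t)
        return (s, t)

    # The machine for a|b:
    nfa = alt(char("a"), char("b"))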
Implementing deterministic finite automata (DFAs) as computer programs is easy. But implementing
NFAs is a bit harder, since most computers don’t have good “guessing” hardware.
You can translate NFA's to DFA's by computing the ε-closure and move sets for every possible input character. This generalized method, called the subset construction, takes advantage of the fact that all NFA states connected by ε edges are effectively the same state. A single DFA state represents all NFA states that are connected in this manner. The outgoing edges from the DFA state are just the union of the outgoing edges from all states in the ε-closure set.
We formally define ε-closure as follows. Let edge(s, c) be the set of all NFA states reachable by following a single edge with label c from state s. For a set of states S, closure(S) is the set of states that can be reached from a state in S without consuming any of the input, that is, by going only through ε-edges; formally, closure(S) is the smallest set T satisfying T = S ∪ (⋃s∈T edge(s, ε)).
Manipulating sets of states is expensive – too costly to want to do on every character in the source program
that is being lexically analyzed. But it is possible to do all the sets-of-states calculations in advance. We
make a DFA from the NFA, such that each set of NFA states corresponds to one DFA state. Since the NFA
has a finite number n of states, the DFA will also have a finite number (at most 2^n) of states.
Start state of the DFA: the starting DFA state consists of the ε-closure of the starting NFA state (i.e., the set of states reachable from the NFA's start state by traveling only along ε edges, plus the start state itself).
The next states in the DFA are then computed by figuring move(current_state, c), followed by its ε-closure, for every possible input character c.
Accept states of the DFA: the new accept states (of the DFA) are those containing the NFA's accepting state.
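Putting these definitions together, here is an illustrative sketch of closure and the subset construction in Python, reusing the transitions table and EPS marker from the Thompson sketch above (edge and move follow the definitions just given):

    def edge(s, c):
        """NFA states reachable from s on one edge labeled c."""
        return transitions.get((s, c), set())

    def closure(S):
        """All states reachable from S through epsilon edges only."""
        result, work = set(S), list(S)
        while work:
            s = work.pop()
            for t in edge(s, EPS):
                if t not in result:
                    result.add(t)
                    work.append(t)
        return frozenset(result)

    def subset_construction(start, accept, alphabet):
        """Build a DFA whose states are sets of NFA states."""
        d_start = closure({start})
        dfa, work = {}, [d_start]
        while work:
            S = work.pop()
            if S in dfa:
                continue
            dfa[S] = {}
            for c in alphabet:
                move = set()
                for s in S:                 # move(S, c)
                    move |= edge(s, c)
                T = closure(move)
                dfa[S][c] = T
                work.append(T)
        # A DFA state accepts iff it contains the NFA accept state.
        accepting = {S for S in dfa if accept in S}
        return d_start, dfa, accepting

The empty set of NFA states serves as the implicit failure (dead) state: every missing transition leads to it.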
How does it work? You find equivalent states by systematically eliminating those states that can't be equivalent: you partition the initial array into potentially equivalent states, and then gradually partition the partitions. When you're done, states that share a partition are equivalent. Initially, you partition the matrix into two parts, one for accepting states and another for nonaccepting states:
(The partition number is on the right.) The implicit failure state is alone in a special, unnumbered partition that's not shown in the foregoing table. The next step in the minimization process creates a set of new partitions by looking at the goto transitions for each state and eliminating states that are not equivalent. If the outgoing transitions from two states go to different partitions, then the states are not equivalent. You do the elimination on a column-by-column basis. Starting with the D column: States 0, 1, and 4 all go to a state in Partition 0 on a D, but State 2 goes to a state in Partition 1 on a D. Consequently, State 2 must be removed from Partition 0. Formally, you say that State 2 is distinguished from the other states in the partition by a D. Continuing in this manner, State 3 is also distinguished from States 5, 6, and 7 by a D, because State 3 goes to the failure state on a D but the other states go to a state in Partition 1. The failure state is not in Partition 1, so State 3 is distinguished by a D.
Now you go down the dot (.) column. The dot distinguishes State 1 from States 0 and 4 because State 1 goes to a state in Partition 1 on a dot, but States 0 and 4 both go to states in Partition 2. The new partitions are:
The next step is to build a new transition matrix. Each partition is a single state in the minimized DFA, and
all next states in the original table are replaced by the number of the partition in which the state is found.
For example, since States 5, 6, and 7 are all in Partition 1, all references to one of these states in the
original table are replaced by a reference to the new State {5,6,7}. The new table looks like this:
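The same partition-refinement idea can be written compactly as code. The following is an illustrative sketch, not Lex's actual implementation; delta is assumed to be a total transition table, so an explicit failure state must be included among the states:

    def minimize(states, alphabet, delta, accepting):
        """Partition refinement: delta maps state -> {symbol -> state}."""
        # Initial split: accepting vs. nonaccepting states.
        part = {s: (s in accepting) for s in states}
        while True:
            # Two states stay together only if, for every symbol,
            # they go to states in the same partition.
            signature = {
                s: (part[s],) + tuple(part[delta[s][c]] for c in alphabet)
                for s in states
            }
            groups = {}
            for s in states:
                groups.setdefault(signature[s], set()).add(s)
            new_part = {s: sig for sig, grp in groups.items() for s in grp}
            if len(groups) == len(set(part.values())):
                return groups       # each group is one minimized state
            part = new_part

Each returned group plays the role of a single state such as {5,6,7} in the minimized table, and the loop stops as soon as a refinement pass fails to split any partition.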