SE Compiler Chapter 2


Principles of Compiler Design – CENG 2042

Chapter II – Lexical Analysis


Software Engineering Department, School of Computing
Ethiopian Institute of Technology – Mekelle (EiT-M), Mekelle University

2. Lexical Analysis

The primary function of a scanner is to transform a character stream into a token stream. A scanner is also called a lexical analyzer or lexer; the three terms are used interchangeably.

Lexical analysis is the first phase of a compiler. It takes the source code, possibly already modified by language preprocessors, and breaks it into a series of tokens, discarding whitespace and comments along the way.

If the lexical analyzer finds an invalid token, it generates an error. The lexical analyzer works closely with the syntax analyzer: it reads the character stream from the source code, checks for legal tokens, and passes them to the syntax analyzer on demand.

Tokens
A lexeme is a sequence of characters that forms a token. There are predefined rules that determine whether a lexeme is a valid token; these rules are given by the grammar in the form of patterns, and the patterns themselves are defined by means of regular expressions.

In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can all be considered tokens.

For example, in C language, the variable declaration line

int value = 100;

contains the tokens: int (keyword), value (identifier), = (operator), 100 (constant), and ; (symbol).
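As a sketch of how a scanner produces such a token stream, the following fragment (in Python for brevity; the token class names and keyword list are illustrative, not a complete C lexer) tokenizes the declaration above:

```python
import re

# A minimal scanner sketch: token classes are tried in order,
# whitespace is skipped, and each match is emitted as a (type, lexeme) pair.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|if|else|while|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"[=+\-*/]"),
    ("SYMBOL",     r"[;,(){}]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":          # drop whitespace, keep everything else
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("int value = 100;"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'value'), ('OPERATOR', '='),
#  ('CONSTANT', '100'), ('SYMBOL', ';')]
```

Note how `KEYWORD` is listed before `IDENTIFIER`: the alternation tries patterns left to right, so `int` is classified as a keyword rather than an identifier.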

Examples of non-tokens:

    comment                  /* try again */
    preprocessor directive   #include <stdio.h>
    preprocessor directive   #define NUMS 5, 6
    macro                    NUMS
    blanks, tabs, and newlines

Ins: Fkrezgy Yohannes, Compiler Design

2.1. TOKEN, LEXEME, PATTERN:

Token: a token is a sequence of characters that can be treated as a single logical entity. Typical tokens are:
1) identifiers 2) keywords 3) operators 4) special symbols 5) constants
Pattern: a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.
Lexeme: a lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
Example:
Description of tokens

    Token      Sample lexeme           Pattern
    const      const                   const
    if         if                      if
    relation   <, <=, =, <>, >=, >     < or <= or = or <> or >= or >
    id         pi                      letter followed by letters and digits
    num        3.14                    any numeric constant
    literal    "core"                  any characters between " and " except "
A pattern is a rule describing the set of lexemes that can represent a particular token in the source program.

2.2. LEXICAL ERRORS:

Lexical errors are the errors thrown by the lexer when it is unable to continue: there is no way to recognize the current lexeme as a valid token. Syntax errors, on the other hand, are thrown by the parser when a given sequence of already recognized, valid tokens does not match any of the right-hand sides of the grammar rules. Simple panic-mode error handling requires that we return to a high-level parsing function when a parsing or lexical error is detected.

Error-recovery actions are:


i. Delete one character from the remaining input.
ii. Insert a missing character into the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.
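The four recovery actions above can be illustrated by generating all single-edit repair candidates for a misspelled lexeme and checking them against a keyword set. This is a minimal sketch (the keyword set and lowercase alphabet are assumptions), not a production error handler:

```python
# Single-edit repair candidates for panic-mode recovery:
# delete, insert, replace, and transpose, checked against a keyword set.
KEYWORDS = {"if", "else", "while", "int", "return"}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def repairs(lexeme):
    candidates = set()
    for i in range(len(lexeme)):
        candidates.add(lexeme[:i] + lexeme[i+1:])                 # i. delete
        for c in ALPHABET:
            candidates.add(lexeme[:i] + c + lexeme[i:])           # ii. insert
            candidates.add(lexeme[:i] + c + lexeme[i+1:])         # iii. replace
    for c in ALPHABET:
        candidates.add(lexeme + c)                                # ii. insert at end
    for i in range(len(lexeme) - 1):                              # iv. transpose
        candidates.add(lexeme[:i] + lexeme[i+1] + lexeme[i] + lexeme[i+2:])
    return candidates & KEYWORDS

print(repairs("whlie"))   # {'while'}: a transposition of "while"
```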

2.3. Specifications of Tokens

To identify tokens, we need some method of describing the possible tokens that can appear in the input stream. For this purpose we introduce regular expressions, a notation that can be used to describe essentially all the tokens of a programming language.



Secondly, having decided what the tokens are, we need some mechanism to recognize these in the input
stream. This is done by the token recognizers, which are designed using transition diagrams and finite
automata.

Let us understand how the language theory undertakes the following terms:

2.3.1. Alphabets, Strings and Languages

Alphabets
An alphabet is any finite set of symbols. For example, ∑ = {0,1} is the binary alphabet, ∑ = {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and ∑ = {a-z, A-Z} is the alphabet of English letters.

Strings
Any finite sequence of symbols from an alphabet is called a string. The length of a string is the total number of symbol occurrences in it; e.g., the length of the string tutorialspoint is 14, denoted |tutorialspoint| = 14. A string of zero length is known as the empty string and is denoted by ε (epsilon).

Example: 01101, 111, 0001, … are strings over the binary alphabet ∑ = { 0, 1 }.

• Power of an alphabet:

– If ∑ is an alphabet, we can express the set of all strings of a certain length from that alphabet by using exponential notation. We define ∑^k to be the set of strings of length k, each of whose symbols is in ∑.

– ∑^0 = {ε}, regardless of what alphabet ∑ is.

– If ∑ = {0,1}, then ∑^1 = {0,1}, ∑^2 = {00, 01, 10, 11}, ∑^3 = {000, 001, 010, 011, 100, 101, 110, 111}

• The set of all strings over an alphabet ∑ is conventionally denoted ∑*.


Example: {0,1}* = {ε, 0, 1, 00, 01, 10, 11, 000, …}
∑* = ∑^0 U ∑^1 U ∑^2 U ∑^3 U …
• The set of nonempty strings over an alphabet ∑ is denoted ∑^+ (excluding the empty string from the set of strings)
∑^+ = ∑^1 U ∑^2 U ∑^3 U …
∑* = ∑^+ U {ε}.
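The definitions of ∑^k, ∑*, and ∑^+ can be sketched with Python sets (∑* is infinite, so only a bounded prefix of it is generated; the function names are illustrative):

```python
from itertools import product

def power(sigma, k):
    """All strings of length k over sigma, i.e. Sigma^k."""
    return {"".join(p) for p in product(sigma, repeat=k)}

def kleene_star(sigma, up_to):
    """Sigma^0 U Sigma^1 U ... U Sigma^up_to, a finite prefix of Sigma*."""
    result = set()
    for k in range(up_to + 1):
        result |= power(sigma, k)
    return result

sigma = {"0", "1"}
print(sorted(power(sigma, 0)))        # ['']  -- Sigma^0 is {epsilon}
print(sorted(power(sigma, 2)))        # ['00', '01', '10', '11']
print(sorted(kleene_star(sigma, 1)))  # ['', '0', '1']
```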

• Operations on strings
– Concatenation (product):
• Let x and y be strings. Then xy denotes the concatenation of x and y, that is, the string formed by making a copy of x followed by a copy of y.
• Example: let x = 01101 and y = 110; then xy = 01101110 (and yx = 11001101).
• Note: for any string w, εw = wε = w (i.e., ε is the identity for concatenation).



Language

A language is a set of strings over some finite alphabet. A language may be finite or infinite, and mathematically the usual set operations can be performed on languages. Regular languages are exactly those that can be described by means of regular expressions.

If ∑ is an alphabet, and L ⊆ ∑*, then L is a Language over ∑.

Example: A language L over an alphabet V is a subset of V*. For instance, if V= {a, b, c}, the following
are languages on V.

• L1 = ∅ (the empty language; i.e. the empty subset of V)


• L2 = {ε } (The language containing just the empty string; notice that L1 ≠ L2)
• L3 = {a, b, c} = V (the language whose elements are just the strings of length 1)
• L4 = {aa, ba, ab}
• L5 = {a, aaa, aaaaa, bc}
• L6 = {ab, aab, aaab, aaaab, …} (the infinite language whose strings consist of any number of a’s followed by a single b; L6 can also be defined in the more compact way L6 = {a^n b | n ≥ 1})
• L7 = {(ab)^n c^m | n ≥ 1, m ≥ 2}
• L8 = {a^n b^n | n ≥ 1} = {ab, aabb, aaabbb, …}

Note: it is common to describe a language using a “set former”:

{w | something about w}. This expression is read “the set of words w such that (whatever is said about w to the right of the vertical bar).”

Operations on languages

The most important operations on languages are union (L ∪ M), concatenation (LM = {xy | x ∈ L, y ∈ M}), and Kleene closure (L* = L^0 ∪ L^1 ∪ L^2 ∪ …).
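On finite languages, the standard operations (union, concatenation, Kleene closure) can be sketched directly with Python sets; closure is truncated to a finite bound, since L* is generally infinite:

```python
# Standard operations on (finite) languages, modeled as sets of strings.
def union(l1, l2):
    return l1 | l2

def concat(l1, l2):
    return {x + y for x in l1 for y in l2}

def closure(lang, up_to):
    """L^0 U L^1 U ... U L^up_to, a finite approximation of L*."""
    result, current = {""}, {""}
    for _ in range(up_to):
        current = concat(current, lang)
        result |= current
    return result

L3 = {"a", "b", "c"}
L4 = {"aa", "ba", "ab"}
print(sorted(union(L3, L4)))             # ['a', 'aa', 'ab', 'b', 'ba', 'c']
print(sorted(concat({"a"}, {"b", "c"}))) # ['ab', 'ac']
print(sorted(closure({"ab"}, 2)))        # ['', 'ab', 'abab']
```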

2.3.2. Regular Expressions


Regular expressions are a convenient way to specify various simple (although possibly infinite) sets of
strings. They are of practical interest because they can specify the structure of the tokens used in a
programming language. In particular, you can use regular expressions to program a scanner generator.

Regular expressions are widely used in computer applications other than compilers. The Unix utility grep
uses them to define search patterns in files. Unix shells allow a restricted form of regular expressions when
specifying file lists for a command. Most editors provide a “context search” command that enables you to
specify desired matches using regular expressions.



A set of strings defined by regular expressions is called a regular set. For purposes of scanning, a token
class is a regular set whose structure is defined by a regular expression. A particular instance of a token
class is sometimes called a lexeme; however, we simply call a string in a token class an instance of that
token. For example, we call the string abc an identifier if it matches the regular expression that defines the
set of valid identifier tokens.

Our definition of regular expressions starts with a finite character set, or vocabulary (denoted ∑). This
vocabulary is normally the character set used by a computer. Today, the ASCII character set, which
contains 128 characters, is very widely used. Java, however, uses the Unicode character set. This set
includes all of the ASCII characters as well as a wide variety of other characters.

The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes that belong to the language at hand. It searches for the patterns defined by the language rules.

Regular expressions define patterns for strings of symbols. The grammar defined by regular expressions is known as a regular grammar, and the language defined by a regular grammar is known as a regular language.

Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings, so regular expressions serve as names for sets of strings. Programming language tokens can be described by regular languages. The specification of regular expressions is an example of a recursive definition. Regular languages are easy to understand and have efficient implementations.

Definition: Regular expression over the alphabet ∑ can be defined as:


1. ε is a RE denoting the set {ε}
2. If a ∈ ∑, then a is RE denoting { a }
3. If r and s are REs, denoting L(r) and L(s), then
3.1. Union: (r)|(s) is a regular expression denoting L(r) U L(s)
3.2. Concatenation: (r)(s) is a regular expression denoting L(r)L(s)
3.3. Kleene closure: (r)* is a regular expression denoting (L(r))*
3.4. (r) is a regular expression denoting L(r)

Precedence and Associativity


· *, concatenation (.), and | (pipe sign) are left associative
· * has the highest precedence
· Concatenation (.) has the second highest precedence.
· | (pipe sign) has the lowest precedence of all.

Notation and their Meaning


If x is a regular expression, then:
· x* means zero or more occurrences of x, i.e., it can generate { ε, x, xx, xxx, … }
· x+ means one or more occurrences of x, i.e., it can generate { x, xx, xxx, … }; equivalently, x+ = x.x*
· x? means at most one occurrence of x, i.e., it can generate either {x} or {ε}.
· xy means x followed by y
· x|y means x or y
· (x) means grouping
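The notations above can be checked against Python's `re` module, whose syntax for `*`, `+`, `?`, `|`, and grouping matches this section (used here only as a convenient regular-expression engine):

```python
import re

# re.fullmatch returns a match object only when the WHOLE string matches,
# so it behaves like asking "is this string in the language of the RE?"
assert re.fullmatch(r"x*", "")          # zero occurrences: epsilon is accepted
assert re.fullmatch(r"x*", "xxxx")      # many occurrences
assert not re.fullmatch(r"x+", "")      # x+ needs at least one x
assert re.fullmatch(r"x+", "xxx")
assert re.fullmatch(r"x?", "") and re.fullmatch(r"x?", "x")
assert not re.fullmatch(r"x?", "xx")    # at most one occurrence
assert re.fullmatch(r"xy", "xy")        # concatenation
assert re.fullmatch(r"x|y", "y")        # alternation
assert re.fullmatch(r"(xy)+", "xyxy")   # grouping combined with closure
print("all notation checks passed")
```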



Representing language tokens using regular expressions
• digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• sign = (+ | - )
• Decimal = (sign)?(digit)+
• FloatingPoint = (sign)? ((digit)* . (digit)+)
• unsigned_integer = digit digit*
• signed_integer = (+ | - ) unsigned_integer
• Letter = A|B|C|…|Y|Z|a|b|c|…|y|z
• Keywords = BEGIN|END|IF|THEN|ELSE|….
• Identifier = letter (letter|digit|_)*
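The token definitions above translate almost directly into Python `re` syntax. The variable names below mirror the definitions and are illustrative:

```python
import re

# The token patterns from this section, written as Python regular expressions.
digit          = r"[0-9]"
sign           = r"[+-]"
decimal        = rf"{sign}?{digit}+"              # (sign)?(digit)+
floating_point = rf"{sign}?{digit}*\.{digit}+"    # (sign)?((digit)* . (digit)+)
letter         = r"[A-Za-z]"
identifier     = rf"{letter}(?:{letter}|{digit}|_)*"

assert re.fullmatch(decimal, "-42")
assert not re.fullmatch(decimal, "1.5")           # no fraction part allowed
assert re.fullmatch(floating_point, "+3.14")
assert re.fullmatch(floating_point, ".5")         # (digit)* allows an empty integer part
assert not re.fullmatch(floating_point, "3.")     # the fraction needs at least one digit
assert re.fullmatch(identifier, "value_1")
assert not re.fullmatch(identifier, "1value")     # must start with a letter
print("token patterns behave as specified")
```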

2.3.3. Regular Grammar

The lexical tokens of a language may also be specified by using a restricted form of BNF notation known as a regular grammar. A regular grammar has production rules in which a nonterminal derives a terminal symbol, optionally followed by another nonterminal symbol. Thus, only the following two forms are allowed in a regular grammar, but they are sufficient to specify lexical tokens.

A→a
A → aA

2.3.4. Finite State Automata

Regular expressions are convenient for specifying lexical tokens, but we need a formalism that can be
implemented as a computer program. For this we can use finite automata.

A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly. A finite automaton is a recognizer for regular expressions. When an input string is fed into the automaton, it changes its state for each symbol. If the input string is successfully processed and the automaton reaches a final state, the string is accepted, i.e., the string just fed is a valid token of the language at hand.

The mathematical model of finite automata consists of:


1. Finite set of states (Q)
2. Finite set of input symbols (Σ)
3. One Start state (q0)
4. Set of final states (qf)
5. Transition function (δ)

The transition function (δ) maps a state and an input symbol to a state: δ : Q × Σ → Q
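As a sketch, δ can be stored as a dictionary keyed by (state, symbol) pairs; the table below is a hypothetical DFA for identifiers (a letter followed by any number of letters or digits), with input characters classified into the symbols "letter" and "digit":

```python
# A DFA as a transition table: delta maps (state, symbol) pairs to states.
def run_dfa(delta, start, finals, text):
    state = start
    for ch in text:
        sym = "letter" if ch.isalpha() else "digit" if ch.isdigit() else ch
        key = (state, sym)
        if key not in delta:
            return False            # no transition defined: reject
        state = delta[key]
    return state in finals

# Hypothetical identifier DFA: q0 --letter--> q1, q1 loops on letters/digits.
DELTA = {
    ("q0", "letter"): "q1",
    ("q1", "letter"): "q1",
    ("q1", "digit"):  "q1",
}

print(run_dfa(DELTA, "q0", {"q1"}, "value1"))   # True
print(run_dfa(DELTA, "q0", {"q1"}, "1value"))   # False: starts with a digit
```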

Finite Automata Construction

Let L(r) be a regular language recognized by some finite automata (FA).


· States: States of FA are represented by circles. State names are written inside circles.
· Start state: The state from where the automata starts is known as the start state. Start state has an
arrow pointed towards it.



· Intermediate states: all intermediate states have at least two arrows, one pointing in and another pointing out.
· Final state: if the input string is successfully parsed, the automaton is expected to be in this state. A final state is represented by double circles.
· Transition: the transition from one state to another happens when a desired symbol is found in the input. Upon a transition, the automaton can either move to the next state or stay in the same state. Movement from one state to another is shown as a directed arrow pointing to the destination state. If the automaton stays in the same state, an arrow from the state to itself is drawn.

Description of Automata

1. An automaton has a mechanism to read input from an input tape.

2. Any language is recognized by some automaton; hence automata are basically language ‘acceptors’ or ‘language recognizers’.

Types of Finite Automata


1. Deterministic Automata
2. Non-Deterministic Automata.

DETERMINISTIC AUTOMATA
A deterministic finite automaton (DFA) has at most one transition from each state on any input. A DFA is a special case of an NFA in which:
· there are no transitions on input ε,
· each input symbol has at most one transition from any state.
A finite automaton is called a DFA if there is only one path for a specific input from the current state to the next state.

The regular expression is converted into minimized DFA by the following procedure:
Regular expression → NFA → DFA → Minimized DFA

Examples of Finite State Machines for Lexical Analysis

An example of a finite state machine which accepts any identifier beginning with a letter and followed by
any number of letters and digits is shown below. Valid keywords of programming languages can also be
recognized with finite state automata.



The following two finite state automata can recognize numeric constants and relational operators.



Finite state automata for floating point

Non-Deterministic Finite Automata (NFA)

A nondeterministic finite automaton (NFA) is one that has a choice of edges labeled with the same symbol to follow out of a state, or special edges labeled with ε (the Greek letter epsilon) that can be followed without consuming any symbol from the input.

Here is an example of an NFA:
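Whatever the particular diagram, NFA acceptance can be simulated by tracking the set of all currently possible states and expanding through ε-edges after every step. The sketch below (the transition table is a hypothetical NFA for (a|b)*a, which happens to need no ε-edges, though the ε-closure machinery is included) illustrates the idea:

```python
# NFA simulation: track the SET of reachable states instead of one state,
# expanding through epsilon-edges ("eps") after every input symbol.
EPS = "eps"

def eps_closure(states, delta):
    stack, closed = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), set()):
            if t not in closed:
                closed.add(t)
                stack.append(t)
    return closed

def run_nfa(delta, start, finals, text):
    current = eps_closure({start}, delta)
    for ch in text:
        moved = set()
        for s in current:
            moved |= delta.get((s, ch), set())   # all choices in parallel
        current = eps_closure(moved, delta)
    return bool(current & finals)

# Hypothetical NFA for (a|b)*a: stay in q0 on a or b, "guess" the final a.
DELTA = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}}
print(run_nfa(DELTA, "q0", {"q1"}, "aba"))   # True
print(run_nfa(DELTA, "q0", {"q1"}, "ab"))    # False
```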

2.3.5. Thompson's Construction: From a Regular Expression to an NFA

Lex/Flex constructs NFAs from regular expressions using a method called Thompson's construction, developed by Ken Thompson at Bell Labs for the QED editor. It works as follows:

a. The simplest possible regular expression is a single character, and this expression can be
represented by a correspondingly simple NFA. For example, a machine that matches an a is shown
below:

b. The concatenation of two regular expressions is also straightforward. The following machine
represents the expression ab by constructing individual machines for each subexpression (the a and
b), and then connecting the machines with an ε edge:



This method needlessly wastes states, however. A better solution merges the ending state of the first
machine with the start state of the second one, like this:

c. The OR operator (the vertical bar |) which can appear in the regular expression a|b can be
represented using NFA as follows:

d. Representing the Closure Operators (*, +, ?)

Example: For a RE (a|b)* a, the NFA construction is shown below.
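Under rules (a)-(d), a tiny Thompson-style constructor can be sketched in Python. The state numbering, the use of None as the ε label, and the fragment representation are implementation choices, and the gadgets follow the textbook form rather than Lex's exact output:

```python
import itertools

# Each fragment is (start, accept, edges), where edges maps
# (state, symbol-or-None) to a set of next states; None is the epsilon edge.
_counter = itertools.count()

def _new():
    return next(_counter)

def _merge(*edge_maps):
    merged = {}
    for m in edge_maps:
        for k, v in m.items():
            merged.setdefault(k, set()).update(v)
    return merged

def symbol(c):                       # rule (a): a single-character machine
    s, a = _new(), _new()
    return (s, a, {(s, c): {a}})

def concat(f1, f2):                  # rule (b): link accept of f1 to start of f2
    s1, a1, e1 = f1
    s2, a2, e2 = f2
    return (s1, a2, _merge(e1, e2, {(a1, None): {s2}}))

def union(f1, f2):                   # rule (c): new start/accept with eps edges
    s1, a1, e1 = f1
    s2, a2, e2 = f2
    s, a = _new(), _new()
    eps = {(s, None): {s1, s2}, (a1, None): {a}, (a2, None): {a}}
    return (s, a, _merge(e1, e2, eps))

def star(f):                         # rule (d): loop back, plus a bypass edge
    s1, a1, e1 = f
    s, a = _new(), _new()
    eps = {(s, None): {s1, a}, (a1, None): {s1, a}}
    return (s, a, _merge(e1, eps))

def accepts(frag, text):
    start, accept, edges = frag

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            for t in edges.get((stack.pop(), None), set()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({start})
    for ch in text:
        moved = set().union(*(edges.get((s, ch), set()) for s in current))
        current = closure(moved)
    return accept in current

# The example above: (a|b)* a
nfa = concat(star(union(symbol("a"), symbol("b"))), symbol("a"))
print(accepts(nfa, "bba"))   # True
print(accepts(nfa, "ab"))    # False
```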



2.4. Subset Construction: Converting an NFA to A DFA

Implementing deterministic finite automata (DFAs) as computer programs is easy. But implementing
NFAs is a bit harder, since most computers don’t have good “guessing” hardware.

You can apply the foregoing procedure to translate NFAs to DFAs by computing the ε-closure and move sets for every possible input character. This generalized method, called subset construction, takes advantage of the fact that all NFA states connected by ε edges are effectively the same state. A single DFA state represents all NFA states that are connected in this manner. The outgoing edges from the DFA state are just the union of the outgoing edges from all states in the ε-closure set.

We formally define ε – closure as follows. Let edge(s, c) be the set of all NFA states reachable by
following a single edge with label c from state s. For a set of states S, closure(S) is the set of states that can
be reached from a state in S without consuming any of the input, that is, by going only through ε - edges.

Manipulating sets of states is expensive, too costly to want to do for every character in the source program being lexically analyzed. But it is possible to do all the sets-of-states calculations in advance. We make a DFA from the NFA such that each set of NFA states corresponds to one DFA state. Since the NFA has a finite number n of states, the DFA will also have a finite number (at most 2^n) of states.

Start state of the DFA: the starting DFA state is the ε-closure of the NFA's start state (i.e., the set of states reachable from the NFA's start state by traveling along ε-arrows, plus the NFA's start state itself).

The next states in the DFA are then computed by figuring move(current_state, c), for every possible input
character c.
Accept states of DFA: The new accept states (of DFA) are those containing NFA’s accepting state;
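The three ingredients just listed (start state, move, accept states) combine into a worklist algorithm. The sketch below uses the hypothetical two-state NFA for (a|b)*a from earlier; each DFA state is a frozenset of NFA states:

```python
# Subset construction: each DFA state is a frozenset of NFA states.
EPS = None   # the epsilon label in the NFA's transition table

def eps_closure(states, delta):
    stack, seen = list(states), set(states)
    while stack:
        for t in delta.get((stack.pop(), EPS), set()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction(delta, start, finals, alphabet):
    d_start = eps_closure({start}, delta)
    d_states, d_delta = {d_start}, {}
    worklist = [d_start]
    while worklist:
        S = worklist.pop()
        for c in alphabet:
            moved = set()
            for s in S:                       # move(S, c)
                moved |= delta.get((s, c), set())
            T = eps_closure(moved, delta)
            d_delta[(S, c)] = T
            if T not in d_states:
                d_states.add(T)
                worklist.append(T)
    # a DFA state accepts iff it contains an NFA accepting state
    d_finals = {S for S in d_states if S & finals}
    return d_states, d_delta, d_start, d_finals

NFA = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}}   # (a|b)*a
states, dd, ds, df = subset_construction(NFA, "q0", {"q1"}, "ab")
print(len(states))                # 2 DFA states: {q0} and {q0, q1}
print(sorted(map(sorted, df)))    # [['q0', 'q1']]
```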

Example: Convert the following NFA to DFA using ε – closure



Solution:



2.5. DFA Minimization (State Reduction Using Partitioning)

Not every DFA produced by the subset construction is minimal.

How does it work? You find equivalent states by systematically eliminating those states that can't be equivalent: partitioning the initial array into potentially equivalent states, and then gradually partitioning the partitions. When you're done, states that share a partition are equivalent. Initially, you partition the matrix into two parts, one for accepting states and another for nonaccepting states:
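The refinement loop can be sketched compactly: start from the accepting/nonaccepting split, then split any partition whose states disagree on which partition some symbol sends them to. This sketch refines by whole transition signatures rather than column by column as in the worked example, but it computes the same equivalence; the three-state DFA is hypothetical:

```python
# DFA minimization by partition refinement.
def minimize(states, delta, finals, alphabet):
    partitions = [set(finals), set(states) - set(finals)]
    partitions = [p for p in partitions if p]

    def part_of(s):
        for i, p in enumerate(partitions):
            if s in p:
                return i
        return None   # missing transition: the implicit failure state

    changed = True
    while changed:
        changed = False
        new_parts = []
        for p in partitions:
            # group states by the partition each symbol sends them to
            groups = {}
            for s in p:
                sig = tuple(part_of(delta.get((s, c))) for c in alphabet)
                groups.setdefault(sig, set()).add(s)
            new_parts.extend(groups.values())
            if len(groups) > 1:
                changed = True    # a partition was split: refine again
        partitions = new_parts
    return partitions

# Hypothetical DFA where accepting states q1 and q2 behave identically.
DELTA = {("q0", "a"): "q1", ("q0", "b"): "q2",
         ("q1", "a"): "q1", ("q2", "a"): "q2"}
parts = minimize({"q0", "q1", "q2"}, DELTA, {"q1", "q2"}, "ab")
print(sorted(sorted(p) for p in parts))   # [['q0'], ['q1', 'q2']]
```

States that end up in the same partition (here q1 and q2) are merged into a single state of the minimized DFA.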

Example: consider the following DFA that accepts floating number.



Solution:

(The partition number is on the right.) The implicit failure state is alone in a special, unnumbered partition that's not shown in the foregoing table. The next step in the minimization process creates a set of new partitions by looking at the goto transitions for each state and eliminating states that are not equivalent. If the outgoing transitions from two states go to different partitions, then the states are not equivalent. You do the elimination on a column-by-column basis. Starting with the D column: States 0, 1, and 4 all go to a state in Partition 0 on a D, but State 2 goes to Partition 1 on a D. Consequently, State 2 must be removed from Partition 0. Formally, you say that State 2 is distinguished from the other states in the partition by a D. Continuing in this manner, State 3 is also distinguished from States 5, 6, and 7 by a D, because State 3 goes to the failure state on a D, but the other states go to a state in Partition 1. The failure state is not in Partition 1, so State 3 is distinguished by a D.

Now you go down the dot (.) column. The dot distinguishes State 1 from States 0 and 4, because State 1 goes to a state in Partition 1 on a dot, but States 0 and 4 both go to states in Partition 2. The new partitions are:



Next, you go through the array a second time, column by column. Now, D distinguishes State 0 from State
4 because State 0 goes to a state in Partition 4 on a D, but State 4 goes to a state in Partition 0 on a D. No
other states can be distinguished from each other, so we're done. The final partitions look like this:

The next step is to build a new transition matrix. Each partition is a single state in the minimized DFA, and
all next states in the original table are replaced by the number of the partition in which the state is found.
For example, since States 5, 6, and 7 are all in Partition 1, all references to one of these states in the
original table are replaced by a reference to the new State {5,6,7}. The new table looks like this:



The final minimized DFA:
