2018 Com 414 (Compiler Construction)


COMPILER CONSTRUCTION LECTURE NOTE

(COM 414)

FOR HND II COMPUTER SCIENCE

Introducing Compilers and Interpreters
A compiler is a computer program that implements a programming language specification
to "translate" programs, usually a set of files which constitute the source code written in
a source language, into their equivalent machine-readable instructions (the target
language, often having a binary form known as object code). This translation process
is called compilation. We compile the source program to create the compiled program.
The compiled program can then be run (or executed) to do what was specified in the
original source program.
The source language is always a higher-level language in comparison to machine code,
written using some mixture of English words and mathematical notation, assembly
language being the lowest compilable language (an assembler being a special case of a
compiler that translates assembly language into machine code). Higher-level languages
are the most complex to support in a compiler/interpreter, not only because they increase
the level of abstraction between the source code and the resulting machine code, but
because increased complexity is required to formalize those abstract structures.
The target language is normally a low-level language such as assembly, written with
somewhat cryptic abbreviations for machine instructions; in this case the compiler will also
run an assembler to generate the final machine code.
But some compilers can directly generate machine code for some actual or virtual
computer e.g. byte-code for the Java Virtual Machine.
Another common approach to the resulting compilation effort is to target a virtual
machine. That will do just-in-time compilation and byte-code interpretation and blur the
traditional categorizations of compilers and interpreters.
For example, C and C++ will generally be compiled for a target 'architecture'. The draw-
back is that because there are many types of processor there will need to be as many
distinct compilations. In contrast Java will target a Java Virtual Machine, which is an
independent layer above the 'architecture'. The difference is that the generated byte-code,
not true machine code, brings the possibility of portability, but will need a Java Virtual
Machine (the byte-code interpreter) for each platform. The extra overhead of this byte-
code interpreter means slower execution speed.
An interpreter is a computer program which executes the translation of the source
program at run-time. It generates neither independent executable programs nor object
libraries ready to be included in other programs.
A program which does a lot of calculation or internal data manipulation will generally run
faster in compiled form than when interpreted. But a program which does a lot of

input/output and very little calculation or data manipulation may well run at about the
same speed in either case.
Being themselves computer programs, both compilers and interpreters must be written in
some implementation language. Up until the early 1970s, most compilers were written in
assembly language for some particular type of computer. The advent of C and Pascal
compilers, each written in their own source language, led to the more general use of high-
level languages for writing compilers. Today, operating systems will provide at least a free
C compiler to the user and some will even include it as part of the OS distribution.
Compiler construction is normally considered as an advanced rather than a novice
programming task, mainly due to the quantity of code needed rather than the difficulty of
any particular coding constructs. For this, most books about compilers bear some blame.
The large gap between production compilers and educational exercises promotes this
defeatist view.
For the first version of a compiler written in its own source language you have a
bootstrapping problem. Once you get a simple version working, you can then use it to
improve it.
The compilation process
At the highest level, compilation is broken into a number of parts:
1. Lexical analysis (tokenizing)
2. Syntax analysis (parsing)
3. Type checking
4. Code generation

Requirements
Any compiler has some essential requirements, which are perhaps more stringent than for
most programs:
• Any valid program must be translated correctly, i.e. no incorrect translation is allowed.
• Any invalid program must be rejected and not translated.
There will inevitably be some valid programs which can't be translated due to their size or
complexity in relation to the hardware available, for example problems due to memory
size. The compiler may also have some fixed-size tables which place limits on what can be
compiled (some language definitions place explicit lower bounds on the sizes of certain
tables, to ensure that programs of reasonable size/complexity can be compiled).
There are also some desirable requirements, some of which may be mutually exclusive:
• Errors should be reported in terms of the source language or program.

• The position at which an error was detected should be indicated; if the actual error
probably occurred somewhat earlier then some indication of possible cause(s) should also
be provided.
• Compilation should be fast.
• The translated program should be fast.
• The translated program should be small.
• If the source language has some national or international standard:
• Ideally the entire standard should be implemented.
• Any restrictions or limits should be well and clearly documented.
• If extensions to the standard have been implemented:
• These extensions should be documented as such.
• There should be some way of turning these extensions off.
There are also some possibly controversial requirements to consider (see chapter on
dealing with errors):
• Errors detected when the translated program is running should still be reported in relation
to the original source program e.g. line number.
• Errors detected when the translated program is running should include division by 0,
running out of memory, use of an array subscript/index which is too big or too small,
attempted use of an undefined variable, incorrect use of pointers, etc.

Overall Structure
For ease of exposition we will divide the compiler into a front end and a back end. These
need not even be written in the same implementation language, providing they can
communicate effectively via some intermediate representation.
The following list itemizes the tasks carried out by the front end and the back end. The
tasks are outlined below and discussed in more detail in subsequent sessions; note that they
are not necessarily carried out in the exact order listed.

• Front end
• lexical analysis - convert characters to tokens
• syntax analysis - check for valid sequence of tokens
• semantic analysis - check for meaning
• some global/high-level optimization

• Back end
• some local optimization

• register allocation
• peep-hole optimization
• code generation
• instruction scheduling
Almost all the source-language aspects are handled by the front end. It is possible to have
different front ends for different high-level languages, and a common back end which does
most of the optimization.
Almost all the machine-dependent aspects are handled by the back end. It is possible to
have different back ends for different computers so that the compiler can produce code for
different computers.
The front end is normally controlled by the syntax analysis processing. As necessary, the
syntax analysis code will call a routine which performs some lexical analysis and returns
the next token. At selected points during syntax analysis, appropriate semantic routines are
called which perform any relevant semantic checks and/or add information to the internal
representation.

What are Compilers


Compilers translate information from one representation to another. Thus, a tool that
translates, say, Russian into English could be labeled as a compiler. In this course,
however, information = program in a computer language. In this context, we will talk of
compilers such as VC, VC++, GCC, JavaC, FORTRAN, Pascal, VB. Applications that
convert, for example, a Word file to PDF or PDF to Postscript will be called “translators”.
In this course we will study typical compilation: from programs written in high-level
languages to low-level object code and machine code.

Typical Compilation
Consider the source code of a C function:

int expr( int n )
{
    int d;
    d = 4*n*n*(n+1)*(n+1);
    return d;
}

Expressing an algorithm or a task in C is optimized for human readability and
comprehension. This presentation matches human notions of the grammar of a programming
language. The function uses named constructs such as variables and procedures which aid

human readability. Now consider the assembly code that the C compiler gcc generates for
the Intel platform:

.globl _expr
_expr:
        pushl %ebp
        movl %esp,%ebp
        subl $24,%esp
        movl 8(%ebp),%eax
        movl %eax,%edx
        leal 0(,%edx,4),%eax
        movl %eax,%edx
        imull 8(%ebp),%edx
        movl 8(%ebp),%eax
        incl %eax
        imull %eax,%edx
        movl 8(%ebp),%eax
        incl %eax
        imull %eax,%edx
        movl %edx,-4(%ebp)
        movl -4(%ebp),%edx
        movl %edx,%eax
        jmp L2
        .align 4
L2:
        leave
        ret

The assembly code is optimized for the hardware it is to run on. The code consists of
machine instructions and uses registers and unnamed memory locations. This version is
much harder for humans to understand.

Issues in Compilation
The translation of code from some human readable form to machine code must be
“correct”, i.e., the generated machine code must execute precisely the same computation
as the source code. In general, there is no unique translation from source language to a
destination language. No algorithm exists for an “ideal translation”.
Translation is a complex process. The source language and generated code are very
different. To manage this complex process, the translation is carried out in multiple passes.
Two-pass Compiler
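In outline (a textual rendering, following the description below):

    source code → [ Front End ] → IR → [ Back End ] → machine code
                       (errors)             (errors)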

The figure above shows the structure of a two-pass compiler. The front end maps legal
source code into an intermediate representation (IR). The back end maps IR into target
machine code. An immediate advantage of this scheme is that it admits multiple front ends
and multiple passes.
The algorithms employed in the front end have polynomial time complexity, while the
majority of those in the back end are NP-complete. This makes compiler writing a
challenging task.
Let us look at the details of the front and back ends.

Front End
source code → [ scanner ] → tokens → [ parser ] → IR
              (errors are reported by both modules)

The front end recognizes legal and illegal programs presented to it. When it encounters
errors, it attempts to report them in a useful way. For legal programs, the front end
produces IR and a preliminary storage map for the various data structures declared in the
program. The front end consists of two modules:
1. Scanner, and 2. Parser
Scanner

The scanner takes a program as input and maps the character stream into “words” that are
the basic unit of syntax. It produces pairs – a word and its part of speech. For example, the
input

x = x + y

becomes

<id,x>
<assign,=>
<id,x>
<op,+>
<id,y>

We call the pair “<token type, word>” a token. Typical tokens are: number, identifier, +, -,
new, while, if.

Parser

The parser takes in the stream of tokens, recognizes context-free syntax and reports errors.
It guides context-sensitive (“semantic”) analysis for tasks like type checking. The parser
builds the IR for the source program.
The syntax of most programming languages is specified using Context-Free Grammars
(CFG). Context-free syntax is specified with a grammar G=(S, N, T, P) where

• S is the start symbol


• N is a set of non-terminal symbols
• T is set of terminal symbols or words
• P is a set of productions or rewrite rules

For example, the Context-Free Grammar for arithmetic expressions is

1. goal → expr
2. expr → expr op term
3.       | term
4. term → number
5.       | id
6. op   → +
7.       | –

For this CFG,

S = goal
T = { number, id, +, -}
N = { goal, expr, term, op}
P = { 1, 2, 3, 4, 5, 6, 7}

Given a CFG, we can derive sentences by repeated substitution. Consider the sentence

x+2–y

Production                 Result
                           goal
1: goal → expr             expr
2: expr → expr op term     expr op term
5: term → id               expr op y
7: op → –                  expr – y
2: expr → expr op term     expr op term – y
4: term → number           expr op 2 – y
6: op → +                  expr + 2 – y
3: expr → term             term + 2 – y
5: term → id               x + 2 – y

To recognize a valid sentence in some CFG, we reverse this process and build up a parse,
thus the name “parser”. A parse can be represented by a tree: parse tree or syntax tree.
For example, here is the parse tree for the expression x+2-y
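In outline form (reconstructed from the derivation above):

    goal
     └─ expr
         ├─ expr
         │    ├─ expr ── term ── <id,x>
         │    ├─ op ── +
         │    └─ term ── <number,2>
         ├─ op ── –
         └─ term ── <id,y>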

The parse tree captures all the rewrites applied during the derivation. The derivation can be extracted
by starting at the root of the tree and working towards the leaf nodes.

Abstract Syntax Trees

The parse tree contains a lot of unneeded information. Compilers often use an abstract
syntax tree (AST). For example, the AST for the above parse tree is

            –
           / \
          +   <id,y>
         / \
    <id,x>   <number,2>

This is much more concise; AST summarizes grammatical structure without the details of
derivation. ASTs are one kind of intermediate representation (IR).
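As a concrete illustration, here is a minimal C++ sketch of how such an AST might be
represented. The node kinds and names (Expr, Id, Num, BinOp) are illustrative assumptions
on our part, not a prescribed design:

    // A minimal AST sketch for expressions such as x + 2 - y.
    // Node kinds and field names are illustrative only.
    #include <memory>
    #include <string>
    #include <utility>

    struct Expr {                       // base class of all expression nodes
        virtual ~Expr() = default;
    };

    struct Id : Expr {                  // <id,x>
        std::string name;
        explicit Id(std::string n) : name(std::move(n)) {}
    };

    struct Num : Expr {                 // <number,2>
        int value;
        explicit Num(int v) : value(v) {}
    };

    struct BinOp : Expr {               // an operator with two children
        char op;                        // '+', '-', ...
        std::unique_ptr<Expr> left, right;
        BinOp(char o, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
            : op(o), left(std::move(l)), right(std::move(r)) {}
    };

    // x + 2 - y is then one BinOp('-', BinOp('+', Id("x"), Num(2)), Id("y")),
    // mirroring the picture above.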
The Back End
The back end of the compiler translates IR into target machine code. It chooses machine
(assembly) instructions to implement each IR operation. The back end ensures
conformance with system interfaces. It decides which values to keep in registers in order
to avoid memory access; memory access is far slower than register access.

The back end is responsible for instruction selection so as to produce fast and compact
code. Modern processors have a rich instruction set. The back end takes advantage of
target features such as addressing modes. Usually, instruction selection is viewed as a
pattern matching problem that can be solved by dynamic programming based algorithms.
Instruction selection in compilers was spurred by the advent of the VAX-11 which had a
CISC (Complex Instruction Set Computer) architecture. The VAX-11 had a large
instruction set that the compiler could work with.

CISC architecture provided a rich set of instructions and addressing modes, but it made the
job of the compiler harder when it came to generating efficient machine code. The RISC
architecture simplified this problem.

Register Allocation

Registers in a CPU play an important role in providing high-speed access to operands.
Memory access is an order of magnitude slower than register access. The back end
attempts to have each operand value in a register when it is used. However, the back end
has to manage a limited set of resources when it comes to the register file. The number of
registers is small and some registers are pre-allocated for specialized uses, e.g., the program
counter, and thus are not available to the back end. Optimal register allocation is
NP-Complete.

Instruction Scheduling

Modern processors have multiple functional units. The back end needs to schedule
instructions to avoid hardware stalls and interlocks. The generated code should use all
functional units productively. Optimal scheduling is NP-Complete in nearly all cases.

Three-pass Compiler
There is yet another opportunity to produce efficient translation: most modern compilers
contain three stages. An intermediate stage is used for code improvement or optimization.
The topology of a three-pass compiler is shown in the following figure:

Front End → IR → Middle End → IR → Back End → machine code
              (errors reported by every stage)

The middle end analyzes IR and rewrites (or transforms) it. Its primary goal is to reduce
the running time of the compiled code. This may also improve space usage, power
consumption, etc. The middle end is generally termed the “Optimizer”. Modern optimizers
are structured as a series of passes:

IR → [ Opt 1 ] → IR → [ Opt 2 ] → IR → [ Opt 3 ] → … → [ Opt n ] → IR
                  (errors reported by each pass)

Typical transformations performed by the optimizer are:

• Discover & propagate some constant value


• Move a computation to a less frequently executed place
• Specialize some computation based on context
• Discover a redundant computation & remove it
• Remove useless or unreachable code
• Encode an idiom in some particularly efficient form
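As a small illustration of the first and fourth transformations, consider this hypothetical
fragment; it is a sketch only, since a real optimizer performs these rewrites on IR rather
than on source text:

    // Before optimization:
    int f(int a, int b, int c, int d) {
        int x = 4;
        int y = x * 2;        // x is provably constant here
        int z = a * b + c;
        int w = a * b + d;    // a * b is computed twice
        return y + z + w;
    }

    // After (what the optimizer conceptually produces):
    int g(int a, int b, int c, int d) {
        int y = 8;            // constant propagated, then folded
        int t = a * b;        // redundant computation removed
        int z = t + c;
        int w = t + d;
        return y + z + w;
    }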

Role of Run-time System


The executable code typically runs as a process in an Operating System environment. The
application will need a number of resources from the OS, for example dynamic memory
allocation and input/output. The process may spawn more processes or threads. If the
underlying architecture has multiple processors, the application may want to use them.
Processes communicate with each other and may share resources. Compilers need to have an
intimate knowledge of the runtime system to make effective use of the runtime
environment and machine resources. The issues in this context are:

• Memory management
• Allocate and de-allocate memory
• Garbage collection
• Run-time type checking
• Error/exception processing
• Interface to OS – I/O
• Support for parallelism
• Parallel threads
• Communication and synchronization

Lexical Analysis
The scanner is the first component of the front-end of a compiler; the parser is the second.

source code → [ scanner ] → tokens → [ parser ] → IR
              (errors are reported by both modules)

The task of the scanner is to take a program written in some programming language as a
stream of characters and break it into a stream of tokens. This activity is called lexical
analysis. A token, however, contains more than just the words extracted from the input.
The lexical analyzer partitions the input string into substrings, called words, and classifies
them according to their role.

Tokens

A token is a syntactic category in a sentence of a language. Consider the sentence:


He wrote the program
of the natural language English. The words in the sentence are: “He”, “wrote”, “the” and
“program”. The blanks between words have been ignored. These words are classified as
subject, verb, object, etc. These are the roles. Similarly, for the sentence in a programming
language like C:

if(b == 0) a = b

the words are “if”, “(”, “b”, “==”, “0”, “)”, “a”, “=” and “b”. The roles are keyword,
variable, boolean operator, assignment operator. The pair <role, word> is given the name
token. Here are some familiar tokens in programming languages:
• Identifiers: x y11 maxsize
• Keywords: if else while for
• Integers: 2 1000 -44 5L
• Floats: 2.0 0.0034 1e5
• Symbols: ( ) + * / { } <> ==
• Strings: “enter x” “error”
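In code, a token can be represented as a <category, word> pair. The following sketch uses
the same pseudo-C++ style as the Lexer template in the next section; the names TokenType,
TID, TNUM are illustrative assumptions:

    // One possible shape for the scanner's output.
    // TID and TNUM reappear in the Lexer template below.
    enum TokenType { TID, TNUM, TSTR, TKEYWORD, TOP };

    struct Token {
        TokenType type;   // the syntactic category (the role)
        string    text;   // the actual word, e.g. "maxsize" or "1000"
        Token(TokenType t, string s) { type = t; text = s; }
    };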

Ad-hoc Lexer

The task of writing a scanner is fairly straightforward. We can hand-write code to
generate tokens. We do this by partitioning the input string by reading left-to-right,
recognizing one token at a time. We will need to look ahead in order to decide where one
token ends and the next token begins. The following C++ code presents a template for a
Lexer class. An object of this class can produce the desired tokens from the input stream.

class Lexer
{
    InputStream s;
    char next;                          // one character of look-ahead

    Lexer(InputStream _s)
    {
        s = _s;
        next = s.read();
    }

    Token nextToken()
    {
        if( idChar(next) )  return readId();
        if( isDigit(next) ) return readNumber();
        if( next == '"' )   return readString();
        ...
    }

    Token readId()
    {
        string id = "";
        while( true ) {
            if( idChar(next) == false )   // look-ahead ends the identifier
                return new Token(TID, id);
            id = id + string(next);       // append the character ...
            next = s.read();              // ... and advance the look-ahead
        }
    }

    boolean idChar(char c)
    {
        if( isAlpha(c) ) return true;
        if( isDigit(c) ) return true;
        if( c == '_' )   return true;
        return false;
    }

    Token readNumber()
    {
        string num = "";
        while( true ) {
            if( !isDigit(next) )          // look-ahead ends the number
                return new Token(TNUM, num);
            num = num + string(next);
            next = s.read();
        }
    }
}

This works OK; however, there are some problems that we need to tackle.

• We do not know what kind of token we are going to read from seeing the first
character alone.
• If token begins with “i”, is it an identifier “i” or keyword “if”?
• If token begins with “=”, is it “=” or “==”?

We can extend the Lexer class but there are a number of other issues that can make the
task of hand-writing a lexer tedious. We need a more principled approach. The most
frequently used approach is to use a lexer generator that generates efficient tokenizer
automatically.

How to Describe Tokens?

Regular languages are the most popular for specifying tokens because:
• they are based on simple and useful theory,
• they are easy to understand, and
• efficient implementations exist for generating lexical analysers based on such
languages.

GRAMMAR AND LANGUAGE


Introduction
Until we present a sharp definition later in this session, stay with your roadside
understanding of what a sentence is. Everything about language centers on the sentence. If
you doubt that, try identifying those things about language that do not hang from the
sentence.
A language is, to us, merely a set of sentences. There are two distinct ways of defining a
language (and any other set for that matter);
(a) by enumerating all the members (sentences), and
(b) Presenting a brief generative device with which all the members (sentences) can be
derived or produced.

When the set is small, the former method will suffice. When the set is large, the former
option is hopeless and the second is to be preferred. The latter approach is how languages
are defined, and the sentence-generating device is called a grammar for the language. Note
the article before the word grammar in the last sentence – ‘a’.
In this session, we shall give a formal definition of the grammar and show how it is used
to generate sentences. We also indicate how it is used to decide which string of words is
actually a sentence. This converse function of the grammar is called syntax analysis,
language recognition, or parsing and is nearly the subject of the next session.

1.1 Basic Concepts
Alphabet
Whenever we say ‘alphabet’, we shall mean a non-empty finite set of elements. We will
call these elements symbols. We shall overwork the word symbol in this course. In fact,
whenever we don’t know what to call an object, we call it a symbol. But we shall in all
cases make sense.
String
A finite ordered sequence of symbols taken from an alphabet is said to be a string over that
alphabet. Notice that the symbols of a string are ordered, i.e., string AB is different from
string BA although both strings contain the same symbols. Apart from the symbols it
contains, a string has another property – its length. This is the number of symbols it
contains. If it contains no symbols, we call it the null string and denote it by Ɵ. There is
nothing odd here; if a person owes you nothing, you can state quite categorically that he
owes you ₦0. We shall denote by |x| or l(x) the length of a string x.
//// Example 1.1.1
(i) the set {0, 1} is an alphabet, A.
(ii) the set {a, b, c, d} is an alphabet, B.
(iii) the set {0, 1, 2, 3, 4, …} is not an alphabet because it is not finite.
(iv) the set { } is not an alphabet because it is empty.
(v) 0, 00, 01010011 are strings over A;
(vi) l(‘0’) = l(‘a’) = l(‘d’) = 1
     l(‘01010011’) = 8
N1.1.1 Notational Convention
We will use upper case characters such as X, Y, Z to represent names of symbols as well
as sets of strings over an alphabet. We will use lower case characters such as x, y, z to
represent strings. Thus we may write
X = ‘CS331’
and
X = {‘CS331’, ‘CS909’, ‘GS1000’}
String Operations
Several string operations are available and interesting, but the ones that shall be in frequent
use for us are

(i) Catenation

If x and y are strings, the string obtained by writing the symbols of y immediately
following the symbols of x is called catenation of string y to string x or the concatenation
of x and y. It is denoted by xy (a notation that shows clearly which string comes first). For
example, the concatenation of x = ‘CS’ and y = ‘331’ is xy = ‘CS331’. The concatenation of
x and Ɵ, the null string, is xƟ = ‘CS’.
Q1.1.1 Show that for any string x, xƟ = Ɵx = x.
(Hint: beware of points that appear trite. Often they form the weak link in a strong chain
and when ignored …)
Q1.1.2 What exactly does the expression “let Z = xy be a string” mean, if x and y are
strings?
Note particularly that in the above, it was nowhere stated that x must differ from y.
(ii) h(.), ph(.), t(.), pt(.)
Let Z = xy be a string. Then x is said to be a head, and y a tail, of Z. Thus if ‘CS331’
is a string, its heads are ‘C’, ‘CS’, ‘CS3’, ‘CS33’, and ‘CS331’, and its tails ‘CS331’, ‘1’,
‘31’, etc.
If, for example, you view ‘CS331’ as Z, x as ‘CS’ (and so Y as ‘331’) you obtain
the head ‘CS’. If you view ‘CS331’ as Z, x as Ɵ (why not?), then y = ‘CS331’ is a clear
tail. By moving the Ɵ to the rear, you obtain ‘CS331’ as head. For brevity, we introduce the
notations h(.), t(.) … to denote the operations of computing heads and tails of a string. If x
is a head and y is not Ɵ, then we call x a proper head. Similarly, should y be a tail and x
non-null, we call y a proper tail. Thus
h(‘CS33’) = {‘C’, ‘CS’, ‘CS3’, …}, and
t(‘CS3’) = {‘CS3’, ‘S3’, ‘3’}, and
‘CS’ is a proper head and ‘S3’ is a proper tail of Z = ‘CS3’.
(iii) String multiplication
In this we carry (i) one step further. The catenation of two strings x and y when x =
y may be written x² = xx. Similarly, xⁿ = xx…x (n times) shall denote the catenation of a
string x n times. x⁰ is defined to be Ɵ for any string x, and x¹ shall return x.
We start not with two strings but with two sets A and B of strings. Take for
example,
A = {‘01’, ‘ab’}
B = {‘+’, ‘-’, ‘xyz’}
We define the product of A and B by
AB = {xy | x є A, y є B}

In words, the product of A and B is
(a) A set of strings p
(b) Each p is the concatenation of a string from A with a string from B.

That’s all.

//// Example 1.1.2

Using A = {‘01’, ‘ab’}

B = {‘+’, ‘-’, ‘xyz’}

Then

AB = {‘01+’, ‘01-’, ‘01xyz’, ‘ab+’, …}

While

BA = {‘+01’, ‘+ab’, ‘xyz01’, …} ////

The product becomes interesting when A = B. Then the above example extends to, e.g.,

AA = {‘0101’, ‘01ab’, …}

The product becomes very interesting when a multiplicand (see GCE ‘O’ level maths) is
itself a product, e.g., AB, when A was in fact created from XX or X². Then

AB = XXB

and if B = X, then we have

XXX, or X³

There is nothing new. To calculate XXX, we calculate XX and then finish as

Xⁿ = X(Xⁿ⁻¹)

//// Example 1.1.3

Let X = {0,1}

Then

X² = {‘00’, ‘01’, ‘10’, ‘11’}

i.e., for every member x, l(x) = 2. Also

X³ = X(X²) = {‘000’, ‘001’, ‘010’, ‘011’, ‘100’, ‘101’, ‘110’, ‘111’}

i.e., every member x has l(x) = 3. For every member x of Xⁿ, l(x) shall be n. ////

As expected, we shall denote by X⁰ the set {Ɵ} containing just the null string.

Thus we summarise:

X⁰ = {Ɵ}, X¹ = X, …, Xⁿ = X(Xⁿ⁻¹), n > 0

Two key, central, super-critical concepts follow: the closure X* and the transitive closure
X⁺ of a set X.

Closure of X

The closure X* = X⁰ Ս X¹ Ս X² Ս … Ս Xⁿ Ս …

The positive closure X⁺ of a set X is defined as

X⁺ = X¹ Ս X² Ս … Ս Xⁿ Ս …

Note that

X* = X⁰ Ս X⁺ and X⁺ = (X*)X

//// Example 1.1.4

If A = {a, b}, then

A* = {Ɵ, ‘a’, ‘b’, ‘ab’, …, ‘aaa’, ‘aba’, …}

i.e., all strings over A of arbitrary length ≥ 0, including the null string. To obtain A⁺, drop
the null string from the above list. ////

All the above has been in preparation for the

1.2 Definition of Grammar

A production, definition, or rewriting rule is an ordered pair

(U, x) or U ::= x

where U is a symbol and x is a non-empty finite string of symbols. U is called the left part
and x is called the right part of the production.

D1.2.1 Grammar

A grammar G[Z] is a finite non-empty set of rules. Z is a symbol which must appear as
the left part of at least one rule in the set. Z is called the distinguished or sentence symbol.
‘Sentence symbol’ is a better name since Z represents the generic object defined by the
grammar. All the symbols appearing in either the left parts or the right parts form the
vocabulary, V.

Any symbol appearing as the left part of a rule is called a non-terminal; a symbol that
appears only in right parts of rules is called a terminal. The set of non-terminals of a
grammar will be denoted by VN; the set of terminals, called the alphabet of the grammar,
will be denoted by VT.

Q1.2.1 Surely you can show that

V = VN U VT

//// Example 1.2.1

(a) The rules


<identifier> ::= ‘A’
<identifier> ::= ‘B’
<identifier> ::= ‘C’

form a grammar G[<identifier>]. The symbol Z = <identifier> is the sentence symbol.


VN = {<identifier>} and VT = {‘A’, ‘B’, ‘C’}.

The rules define a class called identifier to be ‘A’, ‘B’ or ‘C’; the symbol ::= in each rule
is to be read as ‘is defined to be’.

N1.2.1 Notation

When several rules have the same left part, we may combine the rules into one composite
rule with the left part and the right parts, now called alternates, separated by dots ‘•’ or
vertical bars ‘ | ’. Thus, the three rules in example 1.2.1 may be combined to

<identifier> ::= ‘A’ • ‘B’ • ‘C’

or

<identifier> ::= ‘A’ | ‘B’ | ‘C’

N1.2.2 Notation

The word ‘symbol’ will be used in several contexts. The point of this ‘notation’ is that
‘symbol’ could be a single character but need not be. For example, each of identifier, ::=,
‘A’ is a symbol. Whenever we use the term ‘symbol’ the meaning will be clear (perhaps
on second reading).

//// Example 1.2.2

(a) The following grammar has 8 rules:


G [<assignment>]
<assignment> ::= <identifier> := <expression>;
<expression> ::= <term>•<expression>+<term>
<term> ::= <identifier>•<term>*<identifier>
<identifier> ::= ‘A’ • ’B’ • ‘C’

The alphabet includes +, ;, *, ‘A’, ‘B’, ‘C’.

VN = {<assignment>,<identifier>,<expression>,<term>}.

(b) The grammar GN [<number>] has the following 13 rules:


<number> ::= <no>
<no> ::= <no> <digit>|<digit>
<digit> ::= 0|1|2|3|…|9
and
V = {0,1,2,3,…,9, <number>, <no>, <digit>}
We shall return to this grammar several times in the sequel. ////

The formalism we have used so far to represent a grammar is called the Backus-Naur
Form (BNF). It is a notation for expressing productions. Another common formalism is
given in

// D1.2.2 Grammar

A grammar G[Z] is a 4-tuple (VN, VT, P, Z) where VN, VT are alphabets of non-
terminals and terminals, respectively, with non-terminals specifying sets used to build up
the language;

P is a finite set of productions, also called rules, of the form U → u, where U belongs to
VN and u is a string in V*.

Z is the distinguished non-terminal, which denotes the set that constitutes the whole
language defined by the grammar. //

//// Example 1.2.3

For the language of arithmetic operation with operators + and *, we have the grammar

G[E] = ({E}, {+, *, (, ), id}, P, E)

Where P consists of

E → E + E | E*E | (E) | id

and the operands are represented by id.

Z is E.

Nothing in the above has shown the real meaning of the claim that the ‘grammar defines
the language’. We now bring this out by showing that the grammar is a device that indeed
generates its language. In the next session, we show that the (same) grammar is a device
for deciding which strings are in its language. The subject of derivation is usually the
easiest part of the course.

1.3 Derivations

//D1.3.1 Direct Derivation

Let G[Z] be a grammar and v and w strings over V. We make any of the following
statements:

(a) v directly produces w


(b) w is a direct derivation of v
(c) w directly reduces to v,

which may be written as

v ⇒ w

if it is true that

v = xUy and w = xuy

for some strings x, y, where U ::= u is a rule //

//// Example 1.3.1

Consider the grammar (b) of example 1.2.2.


v                w              x   y         U
<no>             <no><digit>    Ɵ   Ɵ         <no>
<digit><digit>   3<digit>       Ɵ   <digit>   <digit>
3<digit>         34             3   Ɵ         <digit>    ////

D1.3.2 Derivation
Now we drop the qualifier ‘direct’. We say that
(a) v produces w
(b) w is a derivation of v
(c) w reduces to v,

which may be written


v ⇒⁺ w
if there exists a sequence of direct derivations
v = u₀ ⇒ u₁ ⇒ u₂ ⇒ ••• ⇒ uₙ = w
with n > 0. The sequence is called a derivation of length n. w is then said to be a word for
v, or derivable from v. If v ⇒⁺ w or v = w, then we abbreviate this to v ⇒* w. For
example, in the grammar (b) of example 1.2.2,
<number> ⇒⁺ 59
since we can find the sequence
<number> ⇒ <no> ⇒ <no><digit> ⇒ <digit><digit> ⇒ 5<digit> ⇒ 59
The length of our derivation here is 5.
D1.3.3 Sentence
Let G[Z] be a grammar. A string w is called a sentential form if it is derivable from the
start symbol Z, i.e., if Z ⇒* w. If a sentential form consists only of terminal symbols, it is
called a sentence. The language L(G[Z]) defined by the grammar G[Z] is the set of
sentences
L(G) = {x | Z ⇒* x, x є VT*}
Q1.3.1 Clearly the student can put the above definition in plain GS 101.
//// Example 1.3.2
(a) 59, and in fact any natural number, is a sentence in L(G) where G is grammar (b) of
example 1.2.2.
(b) Let G[S] be
S ::= a | Saa

Then we consider the following strings for membership of L(G):


(i) a
(ii) aa
(iii) aaa
(i) S ⇒ a (by rule 1)

Hence a is a sentence
(ii) To derive aa, we must proceed via rule 2, it being obvious that rule 1 won’t take us
too far. Suppose we thus start out as
S ⇒ Saa

Our next move must be such that Saa ⇒* aa. But, since the rule S ::= Ɵ does not appear
in the grammar, we have reached a dead end. Conclusion: ‘aa’ is not a sentence in L(G).
(iii) S ⇒ Saa (rule 2)
       ⇒ aaa (rule 1)

So the rule sequence 21 takes us from the sentence symbol to the trial sentence. ‘aaa’ is in
L(G). A little reflection or induction on the length of the string of a’s will convince the
reader that in fact
L(G) = {a²ᵐ⁺¹ | m = 0, 1, 2, •••}
(c) Consider the grammar G[a]
a ::= i := e;
e ::= t | e + t
t ::= i | t * i
i ::= A | B | C
Now, is the string
‘C := B + C * A’ in L(G[a])?
As before, this translates to trying to see if
a ⇒⁺ C := B + C * A
we take off:

a ⇒ i := e;           (r1)
  ⇒ C := e;           (r8)
  ⇒ C := e + t;       (r3)
  ⇒ C := t + t;       (r2)
  ⇒ C := i + t;       (r4)
  ⇒ C := B + t;       (r7)
  ⇒ C := B + t * i;   (r5)
  ⇒ C := B + i * i;   (r4)
  ⇒ C := B + C * i;   (r8)
  ⇒ C := B + C * A    (r6)

and thus confirm that ‘C := B + C * A’ is in L(G[a]) and have the rule sequence
1832475486 to prove it. ////
The derivation, as we have presented it, follows a single algorithm which we key to the
syntactic class as follows:
A1.3 String derivation from a syntactic class
1. Write down the definition of the class <C> as a string S.
2. If S contains no class names, stop.
3. Select a class name (scanning left to right) from S; say <X>.
4. Select a rule (scanning left to right) in which <X> appears on the left.
5. Replace the occurrence of <X> in S by the right-hand side of that rule.
6. Return to step 2.

Derivation Tree
Our formalism for expressing a derivation has been linear. A derivation may also be
presented in a graphical form, as a tree structure called a derivation (or syntax or parse)
tree.
Formally, let G = (V, T, P, Z) as before. A tree is a derivation tree for G if:
1. Every vertex has a label which is a symbol of VN Ս T Ս {Ɵ}.
2. The label of the root is Z.
3. If a vertex is interior and has a label A, then A must be in VN.
4. If a vertex n has a label A and vertices n₁, n₂, …, nₖ are the sons of vertex n in order
from left to right, with labels X₁, X₂, …, Xₖ, respectively, then
A ::= X₁ X₂ … Xₖ
must be a production in P.
5. If a vertex n has a label in VT, then n is a leaf.

//// Example 1.3.3


Let G = ({Z, A}, {a, b}, P, Z)
where P consists of
Z → aAZ | a
A → ZbA | ZZ | ba
We draw a tree:

                      Z (1)
                   /    |    \
               a (2)  A (3)  Z (4)
                     /  |  \     \
                Z (5) b (6) A (7)  a (8)
                  |        /   \
                a (9)  b (10)  a (11)
By checking the satisfaction or otherwise of conditions in the definition, the reader can
easily show that the tree as given is (not) a parse tree for the given grammar.
D1.3.4 Phrase
Let G[Z] be a grammar. Let w = xuy be a sentential form. Then u is called a phrase of w
for a non-terminal U if:
(i) Z ⇒* xUy and
(ii) U ⇒⁺ u

u is called a simple phrase if Z ⇒* xUy and

U ::= u
//// Example 1.3.4
(a) Consider the grammar GN (of example 1.2.2) and the sentential form w = <no> 1.
Any of the following is a candidate for a phrase of w:
<no> 1, <no> and 1.
Then:
(i) Writing w = Ɵ <no> 1 Ɵ with U = <no>,
Z ⇒* Ɵ <no> Ɵ and U ⇒⁺ <no> 1
We conclude that <no> 1 is a phrase.
(ii) For u = 1:
<number> ⇒* <no> <digit>
and <digit> ⇒⁺ 1
Since <digit> ::= 1 is a rule of GN, 1 is a simple phrase as well. ////
D1.3.5 Handle
A handle of a sentential form is a leftmost simple phrase. The similarity between a
derivation and the syntax tree for the same sentential form is obvious.

Language Hierarchy
2.0 Introduction
As you already know, a grammar is a generative structure or device. Given a grammar G,
the set of strings it generates is called the language L(G) of the grammar. Elements of this
set are called sentences. Note that L doesn’t have to be Igbo, but it could be. L can be
BASIC, FORTRAN, Pascal, an Assembly Language, etc. There is a hierarchy, at times
called the Chomsky hierarchy (in honour of the American philosopher and linguist who
developed it), of language types. In this session, we present this hierarchy and thus define
the class of languages described as regular. The usual convention in language theory in
computer science, as in linguistics, is that if a language L is generated by a type x
grammar, then L is said to be a type x language.

2.1 The Chomsky Hierarchy

To force the reader to refresh his knowledge of the concept of a grammar and to introduce
a notation which will make us smile after a little frown, we define a grammar G by

//// Definition 2.1.1

A grammar is a quadruple G = <T, N, P, S>, where T and N are an alphabet and an auxiliary
alphabet, respectively, S є N, and P ⊆ ((T Ս N)*)² is a relation.

By the way, in case you have forgotten, a binary relation on a set S is a subset of S×S, where
S×S is defined as the Cartesian product, S×S = {(a, b) | a, b є S}. Take, for example, the set S of
natural numbers. S×S is the familiar Cartesian product consisting of the Cartesian co-
ordinates, e.g., (0, 0), (1, 3), (17, 2), etc. The subset consisting of (0, 0), (1, 4), (3, 12),
(13, 42), (2, 8), (9, 36), etc. defines a relation on S. In fact, whenever you draw a line graph

on the Euclidean plane you are defining a relation on R, the set of real numbers. Back
home, if V is a vocabulary, we can obtain a binary relation by listing the pairs consisting
of left part and right part of rules of the grammar, i.e., we include the pair (x, y) only if in
our grammar there is a rule in which x is the left part and y is the right part.

Members of P are called productions or (rewriting) rules. A rule (α, β) is usually written as
α → β. Given a rule (α, β), α is called the antecedent and β is called the consequent of (α,
β). Members of T are called terminal symbols. Members of N are called non-terminals, or
syntactic categories, or metalinguistic variables, or simply variables. S is called the
distinguished metalinguistic variable. This kind of grammar is called a constituent
structure grammar. The term constituent structure grammar implies that the grammar so
described does, by means of variables, impose structure on the constituents (or parts) of
sentences in the language generated by the grammar. If you are still confused, remember
the English language teachers in the secondary school. They were sweating on the nose
trying to tell you about ‘parts of speech’, i.e., things like noun, adjective, clause, gerund,
etc. We are doing the same thing here from an advanced standpoint; after all, we are in the
higher institution and as you go higher, you learn more about less.

//// Definition 2.1.2

Let G = <T, N, P, S> be a constituent structure grammar. We write σᵢ ⇒ σᵢ₊₁ if there
exist σᵢ = αβγ and σᵢ₊₁ = αβ′γ in (T Ս N)* with β → β′ є P, and we write σ₀ ⇒* σₜ if
either σ₀ = σₜ or there exists a sequence σ₀, σ₁, …, σₜ such that σᵢ ⇒ σᵢ₊₁ for all 0 ≤ i < t.
The sequence is called the σ₀-derivation of σₜ; we denote it also by σ₀ ⇒ σ₁ ⇒ ••• ⇒ σₜ.

In case you are still feeling uncomfortable with these symbols, what does the expression
σᵢ = αβγ say? Simple: consider the grammar G[<Number>] of example 1.2.2. Relative to
this grammar, we can say that <Number> ⇒ <no> because there is a rule

<Number> ::= <no>

σᵢ is <Number> and αβ′γ is <no>. This will be true if we take α = γ = λ, the null string, i.e.,
<Number> is β and <no> is β′, because (<Number>, <no>) is a rule, or <Number> →
<no>. Thus, whatever precedes the β (or β′) in a sentential form is the α, and whatever
follows it is the γ. Hopefully you can now breathe more easily.

//// Definition 2.1.3

If G = <T, N, P, S> is a constituent structure grammar, then

L(G) = {α | αT and S α}

is the constituent structure language, i.e., the sentences generated by G.

//// Example 2.1.1

Let T = {a, mango, ate, bought, child, green, man, ube, the}

N = {S, A, n, nP, t, V, VP}

and G = <T, N, P, S> be a grammar with

P = {S → nP.VP, t → a | the, n → mango | child | man | ube | A.n,

nP → t.n, V → ate | bought, VP → V | V.nP, A → green}.

Using the above notation, the reader can show that the string ‘a mango bought the green.
green. green.ube’ is in L(G), where the dot denotes concatenation. That is, the atrocity is
syntactically correct in the given grammar. ////

//// Definition 2.1.4

A grammar is said to be of Type 1 if the consequent of every production of the grammar


contains at least as many symbols as the antecedent. ////

//// Definition 2.1.4’

A Type 1′ grammar is a grammar in which each production has the form

αAβ → αγβ, where α, β є (T Ս N)*, A є N, and γ є (T Ս N)* − {λ}, where λ is the null string.

A Type 1′ grammar is described as context-sensitive (or context-dependent). Referring to
Definition 2.1.4, if (α, β) is a production of a context-sensitive grammar, then |α| ≤ |β|. ////

The term context-sensitive refers to the fact that we are allowed to rewrite A as γ only in
the context α – β.

The grammar G defined by

S → aSBC | abC

CB → BC

bB → bb

bC → bc

cC → cc

is context-sensitive.

//// Definition 2.1.5

A grammar is of Type 2 only if all its rules take the form A → γ, where A є N and γ є
(T Ս N)* − {λ}. A Type 2 grammar is also known as a context-free grammar. ////

The grammar in Example 2.1.1 is context-free. The following important and classical
grammar is also context-free:

G = <{a, +, *, (, )}, {E, T, F}, P, E>

with P = {E → E + T | T, T → T * F | F, F → (E) | a}.

//// Definition 2.1.6

A grammar is of Type 3 only if all its productions are of the form A → α or A → αB, where A,
B є N and α є T* − {λ}, or alternately, they are all of the forms

A → α or A → Bα.

A Type 3 grammar is also described as a right-linear, one-sided linear, finite state or


regular grammar. Such a grammar is said to be right-linear because the nonterminal is
written to the right of the terminals in every consequent containing a nonterminal. Every
right-linear grammar has an equivalent left-linear grammar. By way of comforting the
unhappy reader, we state that you will not be required to generate the left-linear form of a
grammar in the examination and that the lexical (token) structure of most computer
programming languages is regular.

//// Example 2.1.2

G is a regular grammar where

G = <{a, b, c}, {S, A}, P, S>, where P consists of the productions

S ::= abS | cA

A ::= baA | a.

L(G) = (ab)*c(ba)*a. The right-hand side is an example of what is called a regular
expression (discussed later).

The equivalent left-linear grammar to G is G’ with the productions

S ::= Aa

A ::= Aba | Bc | c.

B ::= Bab | ab (B being an additional nonterminal).

The grammars we have described above are called restricted grammars. A grammar that
has no restrictions is called an unrestricted grammar (as if you don’t know that already).

Regular Expressions and Regular Sets

Definition

Let Ʃ be a finite alphabet. A regular set over Ʃ is defined as follows:

1. Ø is a regular set over Ʃ

2. {λ} is a regular set over Ʃ
3. {a} is a regular set over Ʃ, for every a є Ʃ
4. If P, Q are regular sets over Ʃ, so are (a) P Ս Q, (b) PQ, and (c) P*
5. Nothing else is a regular set over Ʃ

Qu: So is it true that a regular set over Ʃ must be a subset of Ʃ*?

A regular expression is a convenient method of denoting a regular set over an alphabet Ʃ.

1. Ø is a regular expression denoting the regular set Ø

2. λ is a regular expression denoting the regular set {λ}
3. a є Ʃ is a regular expression denoting the regular set {a}.
4. If p, q are regular expressions denoting regular sets P, Q respectively,
then
(a) (p+q) is a regular expression denoting the regular set P Ս Q
(b) (pq) is a regular expression denoting the regular set PQ
(c) (p)* is a regular expression denoting the regular set P*
5. Nothing else is a regular expression.

We use the shorthand p⁺ to denote pp*. Also, if ambiguities can arise, we eliminate
redundant parentheses from expressions. For this purpose, we assume that

* has highest precedence

• (concatenation) has next highest precedence

+ has next highest precedence

Thus 0 + 10* means (0+(1(0*)))

Example

1. 0 + 1 denotes {0, 1}.


2. 0* denotes {0}*
3. (0+1)* denotes {0,1}*
4. (0+1)*011 denotes the set of all strings of 0’s and 1’s ending in 011
5. (a+b)(a+b+0+1)* denotes the set of all strings over {0, 1, a, b} beginning with a or b.

Qu: Can it be true that (00+11)*((01+10)(00+11)*(01+10)(00+11)*)* denotes the set of
all strings of 0’s and 1’s with an even number of 0’s and 1’s?

LEXICAL ANALYSIS
OVERVIEW OF LEXICAL ANALYSIS
• To identify the tokens we need some method of describing the possible tokens that
can appear in the input stream. For this purpose we introduce regular expressions, a
notation that can be used to describe essentially all the tokens of a programming
language.
• Secondly, having decided what the tokens are, we need some mechanism to
recognize these in the input stream. This is done by the token recognizers, which are
designed using transition diagrams and finite automata, as sketched below.
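As a minimal sketch of that second point, here is one transition diagram, for identifiers
(a letter followed by letters or digits), turned into table-driven C++ code. The state
numbering is our own illustration:

    // A table-driven recognizer for identifiers, i.e. the transition
    // diagram: start --letter--> accept --letter/digit--> accept.
    #include <cctype>
    #include <string>

    // states: 0 = start, 1 = in identifier (accepting), -1 = dead
    int step(int state, char c) {
        switch (state) {
            case 0:  return std::isalpha((unsigned char)c) ? 1 : -1;
            case 1:  return std::isalnum((unsigned char)c) ? 1 : -1;
            default: return -1;
        }
    }

    bool isIdentifier(const std::string& word) {
        int state = 0;
        for (char c : word) {
            state = step(state, c);
            if (state == -1) return false;   // dead state: reject
        }
        return state == 1;                   // accept only in state 1
    }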

ROLE OF LEXICAL ANALYZER


The first step in compilation is to break a source program into its smallest basic units,
usually called tokens. Most tokens of a programming language fall into one of the following
classes:
I. Identifiers or variable names used by programmer e.g. X or YNUM.
II. KEYWORDS, e.g. DO, IF, BEGIN, etc.
III. Special operators e.g. <,>,=,+,*, etc.
IV. Delimiters, e.g. commas, parenthesis, semi-colon, etc.
V. Literals, both numeric (e.g. 99) and string (e.g. “Hello”).
VI. Separators e.g. blanks.

At the lexical analysis stage, a program called a lexical analyser or scanner scans the
characters of a source program from left to right and builds up the tokens of a program.
The actual procedure of doing this is as follows: when a source statement is read, the first
character scanned enables a token category to be identified. For example, scanning a digit
first will indicate the token category integer. Similarly, scanning a letter first will indicate
the token category identifier or reserved word. On identifying this token category, the lexical
analyser keeps scanning and concatenating, character by character, consecutive characters
of that token until a separator is encountered.

A separator can be any of the following:
• a blank character
• a conventional arithmetic operator
• a delimiter such as a comma, semi-colon, etc.
• a character that does not actually belong to the set of the expected characters that
form a particular token class. For example, in the token class integer, a scan of
any character other than digits 0 – 9 will serve as a separator.

The tokens of a source program are usually passed to the next phase of compilation
(syntax analysis) in an internal form. For example, ID, LT, OP, KW and DL may be the
internal representations of the token types identifier, literal, operator, keyword and other
delimiters respectively. These internal representations and the actual values of tokens (i.e.
the actual identifier name and actual literal) are passed to the next phase of compilation.
Alternatively, different integers may be used to denote each of these token types. For
example, integers 01, 02, and 03 may be used to represent token identifier, integer, and
other literals respectively; while other unique integers may be used to represent each
unique reserved words, operators and delimiters in the language. This method allows the
token to be treated in a uniform way. The internal forms in which tokens are represented
have the same length regardless of whether a token is 1 or 30 characters long. This allows
the remaining steps in compilation to be done in an efficient manner with fixed-length
symbols rather than variable-length strings of characters.

As an example, consider the program segment:


10 LET B = 5.3 * 3.2 + A
The scanner would pass the following internal representation of tokens to the syntax
analyser

Token Passed
Step   Internal Representation   Value
1      LT                        10
2      KW                        LET
3      ID                        B
4      OP                        =
5      LT                        5.3
6      OP                        *
7      LT                        3.2
8      OP                        +
9      ID                        A
Table 1: Sample internal representation of tokens

In addition to recognizing tokens, the scanner must recognize and delete blanks and
redundant statements like comments. It is also the responsibility of the scanner to note and
report lexical errors such as invalid characters or improperly formed identifiers.

Lexical analysis can be programmed as a separate pass which performs a complete lexical
analysis on the source program and produces an output. A pass involves the reading of the
entire source program statements. However, it is sometimes common to program the
scanner as a subroutine called by the syntax analyser. Using the latter approach, when the
syntax analyser needs a new token, it calls the scanner which recognizes the next source
program token and passes it to the syntax analyser.

LEXICAL ANALYSIS VS PARSING:


Lexical analysis: A scanner simply turns an input string (say a file) into a list of tokens.
These tokens represent things like identifiers, parentheses, operators etc. The lexical
analyzer (the "lexer") parses individual symbols from the source code file into tokens;
from there, the "parser" proper turns those whole tokens into sentences of your grammar.

Parsing: A parser converts this list of tokens into a tree-like object that represents how
the tokens fit together to form a cohesive whole (sometimes referred to as a sentence). A
parser does not give the nodes any meaning beyond structural cohesion; the next thing to
do is extract meaning from this structure (sometimes called contextual analysis).

TOKEN, LEXEME, PATTERN:


Token: A token is a sequence of characters that can be treated as a single logical entity.
Typical tokens are,
1) Identifiers 2) keywords 3) operators 4) special symbols 5) constants
Pattern: A set of strings in the input for which the same token is produced as output. This
set of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched
by the pattern for a token.

Example: description of tokens

A pattern is a rule describing the set of lexemes that can represent a particular token in a
source program.
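A hypothetical illustration of the three terms (the sample entries below are our own, not
drawn from any particular language definition):

Token     Sample lexemes       Informal pattern
id        x, y11, maxsize      a letter followed by letters or digits
num       2, 1000, 5.3         any numeric constant
relop     <, >, ==             < or > or ==
if        if                   the characters i, f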
SYNTAX ANALYSIS
The second step in compilation is syntax analysis. At this stage, a complete syntactic
check is performed on the source program. The syntax of a programming language relates
to the grammatical rules governing the formation of legal statements in a language i.e.,
rules of combining tokens together in appropriate sequence to form a valid sentence or
syntactic construct. Common examples of syntactic constructs in programming languages
are:

• Expressions (arithmetic, relational, logical, etc.)

• Assignment statements
• Declarative statements
• Jump instructions (conditional and unconditional)
• For-loops, etc.

ROLE OF THE SYNTAX ANALYSER (PARSER)


It is the responsibility of the syntax analyser (or parser) to ensure that the rules of the
programming language in question are conformed with. For instance, using a FORTRAN
Compiler as an example, the syntax analyser ensures that the structure of the source
program follows the rules of the FORTRAN language. It ensures, for example, that an array
variable is declared using a DIMENSION statement before it is used, and that two operators
do not immediately follow each other in an arithmetic expression. Again, it ensures that
left parentheses are matched with corresponding right parentheses. It also recognizes B =
5.3 * 3.2 + A as a valid assignment statement. The parser should report any syntax errors in
an intelligible fashion.
an intelligible fashion.

BASIC PARSING TECHNIQUES


The two types of parsers employed are:

1. Top-down parsers, which build parse trees from the top (root) to the bottom (leaves).
2. Bottom-up parsers, which build parse trees from the leaves and work up to the root.
Accordingly, there are two types of parsing methods – top-down parsing and bottom-up
parsing.

TOP-DOWN/ BOTTOM UP PARSING


A program that performs syntax analysis is called a parser. A syntax analyzer takes tokens
as input and outputs error messages if the program syntax is wrong. The parser uses
single-symbol look-ahead and an approach called top-down parsing without backtracking.
Top-down parsers check to see if a string can be generated by a grammar by creating a
parse tree starting from the initial symbol and working down. Bottom-up parsers, however,
check to see if a string can be generated from a grammar by creating a parse tree from the
leaves and working up. Early parser generators such as YACC create bottom-up parsers,
whereas many Java parser generators such as JavaCC create top-down parsers.

RECURSIVE DESCENT PARSING


Typically, top-down parsers are implemented as a set of recursive functions that descend
through a parse tree for a string. This approach is known as recursive descent parsing, also
known as LL(k) parsing, where the first L stands for left-to-right, the second L stands for
leftmost-derivation, and k indicates k-symbol look-ahead. Therefore, a parser using the
single-symbol look-ahead method and top-down parsing without backtracking is called an
LL(1) parser. In the following sections, we will also use an extended BNF notation in
which some regular expression operators are incorporated.

A syntax expression of the form α | β defines sentences of the form α, or of the form β. A
syntax of the form αβ defines sentences that consist of a sentence of the form α followed
by a sentence of the form β. A syntax of the form [α] defines zero or one occurrence of the
form α. A syntax of the form {α} defines zero or more occurrences of the form α.
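For instance, a list of one or more <ldn>s separated by delimiters, and an optional sign,
could be written in this extended notation as (illustrative rules of ours, not from a specific
language definition):

<ldn list> ::= <ldn> { <Del> <ldn> }
<sign>     ::= [ + | – ]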
A usual implementation of an LL(1) parser is:

• initialize its data structures,
• get the look-ahead token by calling scanner routines, and
• call the routine that implements the start symbol, as in the sketch below.
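The following is a minimal C++ sketch of such a recursive descent parser, for the toy
grammar expr ::= term { (+ | –) term }, term ::= number | id. It is our own illustration,
not a generated parser; the scanner is collapsed into single-character look-ahead for
brevity:

    // A minimal LL(1) recursive-descent sketch for
    //   expr ::= term { (+ | -) term }
    //   term ::= number | id
    #include <cctype>
    #include <stdexcept>
    #include <string>

    struct Parser {
        std::string src;
        size_t pos = 0;

        char look() { return pos < src.size() ? src[pos] : '\0'; }
        void advance() { ++pos; }

        void expr() {                       // one routine per nonterminal
            term();
            while (look() == '+' || look() == '-') {
                advance();                  // consume the operator
                term();
            }
        }

        void term() {
            if (std::isdigit((unsigned char)look())) {
                while (std::isdigit((unsigned char)look())) advance();
            } else if (std::isalpha((unsigned char)look())) {
                while (std::isalnum((unsigned char)look())) advance();
            } else {
                throw std::runtime_error("syntax error at position " +
                                         std::to_string(pos));
            }
        }

        // start symbol: succeed only if all input is consumed
        bool parse() { expr(); return pos == src.size(); }
    };

    // Parser{"x+2-y"}.parse() accepts; Parser{"x++2"}.parse() throws.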

In a formal approach, without going into too much detail, BNF (Backus Normal Form) is a
notation for describing the formation of syntactic entities in a PL. For example, the BNF
representation of the syntax of a simplified PRINT statement in a simple PL is:

<PRINT statement> → PRINT | PRINT <Expr>

<Expr> → <ldn> | <ldn><Del><Expr>

<ldn> → <Letter> | <Letter><Var>

<Var> → <Letter> | <Digit>

<Del> → , | ;

<Letter> → A | B | … | Y | Z

<Digit> → 0 | 1 | … | 8 | 9

In BNF, items enclosed in corner brackets (i.e. “<>”) are called syntactic entities while
those not so enclosed are the tokens in the language. Thus, <Expr>, <Var>, <PRINT

statement>, etc. are syntactic entities while the word PRINT, A, B, 1, 2, etc. are tokens in
the language.

In simple English, the first line of the BNF notation means the syntactic entity <PRINT
statement> can be formed by either the word PRINT or PRINT followed by <Expr>. The
second line of the notation means the syntactic entity <Expr> is formed either by <ldn>
or by <ldn> followed by <Del> and then followed by <Expr>. In the same vein, <Letter> is
composed of any of the letters A or B or … or Z.

Through a number of techniques, the syntax analyser is able to detect whether or not
syntactic entities are formed by appropriate combinations of smaller syntactic entities
and/or tokens.

Symbol Table
At the syntax analysis stage, information about identifiers, literals, loops, and so on, is
kept in tables called tables of information. A particularly important table is the
symbol table or identifier table. It is a table containing the identifiers used in the
source program along with their attributes. The attributes of an identifier include its type
(e.g. integer, real, string), form (e.g. constant, simple variable, subscripted variable), block
number for a structured language like ALGOL, location in memory and other attributes
relevant to the language.

A sample structure of a symbol table containing 3 identifiers is shown in figure 5.1.

Identifiers Attributes
Entry 1 AREA
Entry 2 BREADTH
Entry 3 LENGTH

Figure 5.1: A symbol table containing 3 entries.

Since the number of characters in identifier names varies, pointers to the identifiers are
often stored in the column for identifiers, while the identifiers themselves and their lengths
are stored in a special string list. This approach keeps the size of each identifier entry fixed
and thus facilitates searching for information about a particular identifier. Figure 5.2 shows
this approach to structuring a symbol table. Some simple methods of organising symbol
tables are given in the next Modules.

[Figure 5.2 showed the identifier column holding (length, pointer) entries into a string
list containing AREA, BREADTH and LENGTH, with the attributes alongside.]
Figure 5.2: A symbol table with 3 entries, with the actual identifiers stored in a string list.

Each time an identifier is encountered, the symbol table is searched to see whether the
name has been previously entered. If the identifier is new, it is entered into the table. The
information collected in the symbol table is used in semantic analysis, that is, in checking
that uses of variable names are consistent with their implicit and explicit declarations.
There we need to know how much and what kind of run-time storage must be allocated to
a variable name.

The whole translation process revolves around the symbol table. Every occurrence (use or
declaration) of an identifier calls for a search of the symbol table. As a result, it is
necessary that a symbol table be structured in a manner which allows efficient extraction
and insertion of information; one possible structure is sketched below.
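As a concrete illustration of this organisation, the following C sketch keeps fixed-size table entries holding an (offset, length) pair that points into a shared string list, in the spirit of figure 5.2; the linear search and all names are illustrative assumptions, not a prescribed layout.

/* Sketch: fixed-size symbol table entries pointing into a string list. */
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 100
#define STRLIST_SIZE 1000

static char strlist[STRLIST_SIZE];        /* identifier spellings, packed */
static int  strlist_used = 0;

struct entry { int offset, length; const char *attributes; };
static struct entry table[MAX_ENTRIES];
static int n_entries = 0;

/* Return the index of name, inserting it first if it is new. */
static int lookup_or_insert(const char *name, const char *attrs) {
    int len = (int)strlen(name);
    for (int i = 0; i < n_entries; i++)
        if (table[i].length == len &&
            memcmp(strlist + table[i].offset, name, (size_t)len) == 0)
            return i;                     /* previously entered */
    memcpy(strlist + strlist_used, name, (size_t)len);
    table[n_entries].offset = strlist_used;
    table[n_entries].length = len;
    table[n_entries].attributes = attrs;
    strlist_used += len;
    return n_entries++;
}

int main(void) {
    lookup_or_insert("AREA",    "integer");
    lookup_or_insert("BREADTH", "integer");
    lookup_or_insert("LENGTH",  "integer");
    printf("entries: %d\n", n_entries);   /* prints: entries: 3 */
    return 0;
}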

The syntax analyser also reports syntactic errors, that is, violations of the syntax of the
language, such as malformed statements and expressions.

SEMANTIC ANALYSIS

The semantic analysis phase deals with the interpretation of the meanings associated with
a syntactic unit of the source program. When a parser has recognized a source language
construct, the semantic analyser will check for its semantic correctness, and store
necessary information about it into the symbol table or create an intermediate form of the
source program for it. For instance, when an assignment statement of the form:
<variable name> = <expression> is recognized, the semantic analyser will check to ensure
that there is type compatibility between the <variable name> and the <expression>, and
will create an intermediate form of the source program for the assignment statement. For a
simple declarative statement, the semantic analyser will check to ensure that no variable
has been doubly declared, and the necessary additional information about the declared
variable is entered into the symbol table.
Other semantic checks include ensuring that every “BEGIN” statement has a matching
“END”, every “DO” statement has a matching “CONTINUE” in FORTRAN, and every
“ELSE” statement has a matching “IF”. In addition, complex or compound statements,
such as a DO loop, are broken down into simpler, easier-to-follow statements at the
semantic analysis stage.

For example, consider the FORTRAN statement

DO 10 I = 1, 50, 2

Body of Loop

10 CONTINUE

This group of statements may be broken down as shown below:

I = 1
J = 50
K = 2

11 ……………..
……………..
……………..
……………..

10 CONTINUE
I = I + K
IF (I .LE. J) GO TO 11
……………..
……………..
……………..
The semantic analyser also reports semantic errors discovered at this stage. Examples of
such errors include type incompatibility, transfer of control to a non-existent statement,
duplicate labels on statements, etc.

THE INTERMEDIATE FORM OF SOURCE PROGRAM


Once the syntactic and semantic analysis has been performed on a syntactic unit, it is usual
to put the source program in an intermediate form. The intermediate form of a source
program depends largely on how it is to be manipulated later. These include the parse
tree, polish notation, or a list of (operator, operand 1, operand 2, result) quadruples in
the order in which they are to be executed.

(a) Parse Tree


One intermediate form of an arithmetic statement is the parse tree (also called a syntax tree
or binary tree). The rules for converting an arithmetic statement into a parse tree are:
i. Any variable is a terminal node of the tree.
ii. For every operator, construct (in the order dictated by the rules of algebra) a
binary (two-branch) tree whose left branch is the tree for the first operand and
whose right branch is the tree for the second operand.

Examples are as shown below.

i. A + B                ii. C = A * B

      +                        =
     / \                      / \
    A   B                    C   *
                                / \
                               A   B

iii. Y = A + B * C      iv. Y = A * B + C

      =                        =
     / \                      / \
    Y   +                    Y   +
       / \                      / \
      A   *                    *   C
         / \                  / \
        B   C                A   B
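To make the representation concrete, the C sketch below defines a binary tree node and builds the tree of example ii (C = A * B) by hand; printing the tree in postorder also previews the polish notation of the next sub-section. The node layout is an assumed one, not a prescribed format.

/* Sketch: a binary parse-tree node, built by hand for C = A * B. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node {
    char label[8];                 /* operator or variable name */
    struct node *left, *right;     /* NULL for a terminal (variable) node */
};

static struct node *mknode(const char *label, struct node *l, struct node *r) {
    struct node *n = malloc(sizeof *n);
    strncpy(n->label, label, sizeof n->label - 1);
    n->label[sizeof n->label - 1] = '\0';
    n->left = l; n->right = r;
    return n;
}

/* Postorder print: operands first, operator after. */
static void postorder(const struct node *n) {
    if (!n) return;
    postorder(n->left);
    postorder(n->right);
    printf("%s ", n->label);
}

int main(void) {
    /* C = A * B: '=' with left branch C and right subtree '*'(A, B) */
    struct node *t = mknode("=", mknode("C", NULL, NULL),
                                 mknode("*", mknode("A", NULL, NULL),
                                             mknode("B", NULL, NULL)));
    postorder(t);                  /* prints: C A B * = */
    printf("\n");
    return 0;
}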

(b) Polish Notation


Polish notation is used to represent arithmetic or logical expressions in a manner which
specifies simply and exactly the order in which operators are to be evaluated. In this
notation, the operators come immediately after their operands. It is therefore sometimes
called suffix or postfix notation. For instance, we can have the following expressions in
infix notation with their corresponding polish notation.

Infix Notation          Polish Notation

i. X + Y                X Y +
ii. X * Y + Z           X Y * Z +

Polish notation is the usual intermediate form of source program used by languages that
are commonly interpreted, e.g. SNOBOL, BASIC, etc.

It is easy to extend polish notation to other operators, as long as operands are immediately
followed by their operators. For instance, an assignment statement with the usual syntax

<variable name> = <Expression>

now has the syntax

<variable name> <Expression> =

A typical example is the statement

W = X * Y + Z

which would now look like

W X Y * Z + =
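As an illustration of why this notation suits interpretation, the C sketch below evaluates a polish (postfix) string of one-digit operands with a simple stack; the input string "12+3*", i.e. (1 + 2) * 3, is an assumed example.

/* Sketch: evaluating a postfix string of one-digit operands. */
#include <stdio.h>

static int eval_postfix(const char *p) {
    int stack[64], top = 0;
    for (; *p; p++) {
        if (*p >= '0' && *p <= '9') {
            stack[top++] = *p - '0';          /* operand: push its value */
        } else {                              /* operator: pop two, push result */
            int b = stack[--top], a = stack[--top];
            switch (*p) {
            case '+': stack[top++] = a + b; break;
            case '-': stack[top++] = a - b; break;
            case '*': stack[top++] = a * b; break;
            case '/': stack[top++] = a / b; break;
            }
        }
    }
    return stack[0];
}

int main(void) {
    printf("%d\n", eval_postfix("12+3*"));    /* prints 9 */
    return 0;
}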

(c) List of Quadruples


A convenient form for a single binary operation is a list of quadruples:

(Operator, Operand 1, Operand 2, Result)

where Operand 1 and Operand 2 specify the arguments and Result the result of the
operation. Thus X * Y might be represented by *, X, Y, T, where T is an arbitrary
temporary variable to which the result of X * Y is assigned.
Similarly, W = X + Y * Z would be represented by the sequence

*, Y, Z, T1
+, X, T1, T2
=, T2, , W

where T1 and T2 are temporary variables created by the compiler. Note that the quadruples
appear in the order in which they are to be executed. Furthermore, the operands in the
above example would not be the symbolic names but pointers (or indexes) to the symbol
table elements which describe the operands.
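The C sketch below shows one possible representation of a quadruple as a record, with the sequence for W = X + Y * Z written out by hand; the field names, and the use of strings rather than symbol table pointers, are simplifying assumptions.

/* Sketch: quadruples as records; sequence for W = X + Y * Z. */
#include <stdio.h>

struct quad { const char *op, *opd1, *opd2, *result; };

int main(void) {
    struct quad code[] = {
        { "*", "Y",  "Z",  "T1" },    /* T1 := Y * Z               */
        { "+", "X",  "T1", "T2" },    /* T2 := X + T1              */
        { "=", "T2", "",   "W"  },    /* W  := T2 (no 2nd operand) */
    };
    for (int i = 0; i < 3; i++)
        printf("%s, %s, %s, %s\n",
               code[i].op, code[i].opd1, code[i].opd2, code[i].result);
    return 0;
}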

It is important to realize that compiler-directed instructions like PL/I’s DECLARE
statement or FORTRAN’s DIMENSION statement do not have an intermediate form. They
are not machine instructions, but instructions that provide information to help the
compiler in its work of compilation. All information within these statements is
normally entered into the symbol table, and no machine code needs to be generated
for them. They are sometimes called compiler-directed statements. Extension of these
intermediate forms of source program to other constructs is provided in the next modules.

CODE GENERATION
The last step in compilation is code generation. This is the actual translation of the
intermediate form of the source program into machine language.

Types of Code
There are 3 types of code into which a source program can be translated. Each of them
has its own pros and cons. A compiler designer normally makes use of one of the 3 forms.
These codes are:
i. Absolute ML instructions, placed directly in fixed locations in memory, which the
compiler then executes immediately (a load-and-go compiler).
ii. Assembly language, which must then be assembled.
iii. A machine language program placed on a secondary storage device, which must be
linked with other sub-programs and then loaded in order to be executed.

a. CODE IN ABSOLUTE ML INSTRUCTION


Advantage
• Most efficient from the time point of view. Examples are PUFFT, ALGOL.

Disadvantages
• Subprograms cannot be pre-compiled separately; all subprograms must be compiled
at the same time.
• Storage is wasted, because the compiler occupies a large part of memory.
• Compilation has to be done every time the program is to be executed.
• This code should be generated by a compiler in an environment where programs tend
to be small and where much debugging is done, e.g. a training environment.

b. CODE IN ASSEMBLY LANGUAGE
Advantages
• Easiest, since AL is easier to deal with than binary digits.
• Calls on macros whose definitions have previously been written in AL can easily be
generated. For example, on some machines it takes 3 or 4 instructions to convert
a floating-point number to fixed point. Instead of generating these 3 or 4 instructions,
the compiler generates a call on a macro FIX and lets the assembler produce the
instructions later. This also helps to reduce the size of the compiler.
Disadvantage
• It adds one extra step (assembly) to the process of translating a program, and
this extra step can take as long as the compilation itself.

c. CODE IN MACHINE LANGUAGE


Advantages
• Flexible, as in the AL format, but it does not require the extra assembly time.
• It is the standard in many systems.

Disadvantage
• It requires linking and loading, and this can be time consuming.

There are two types of this:

i. Absolute object code – it can only be loaded at a specific location in memory.
ii. Relocatable object code – it can be loaded into any part of memory.

In IBM terminology, the linking of several object modules to form a load module is
performed by a linkage editor. The final loading of the load module is done by a loader.

Storage Allocation
Prior to the actual generation of code in either assembly language or machine language,
storage has to be assigned to all identifiers (both user-defined and compiler-generated,
such as those needed for temporarily storing an intermediate result) and literals. For
instance, consider the FORTRAN declarative statement below:

INTEGER AREA, BREADTH, LENGTH

For this statement, a portion of the symbol table similar to the one in figure 5.3 may be
built. It is the responsibility of the storage allocation routine to assign the required location
to each identifier.

As shown in the table, four bytes are reserved for each integer variable. The first variable
will be assigned relative location zero, the second 4, and so on. Of course, the
absolute address cannot be known until load time. These relative addresses are
used by the compiler in later phases for proper accessing.

Symbol Name    Other Attributes    Storage
AREA                               0
BREADTH                            4
LENGTH                             8
.              .                   .
.              .                   .

Figure 5.3: sample portion of symbol table showing assignment of storage addresses

In addition to storage allocation, some form of intermediate symbolic code is usually
generated. The primary difference between this intermediate symbolic code and assembly
code is that the intermediate code need not specify the registers to be used for each
operation.

Sample Code Generation


Code generation shall now be illustrated by generating symbolic intermediate code for the
hypothetical computer together with its ML and AL codes. This will be illustrated by
generating code for an assignment statement.

W = X * Y + Z

This would be represented by a list of quadruples as

*, X, Y, T1
+, T1, Z, T2
=, T2, , W

Using the common symbolic codes for the hypothetical computer described in chapter 3,
the codes for the quadruples of the assignment statement would be:

LDA X
MUL Y
STA T1
LDA T1
ADD Z
STA T2
LDA T2
STA W

Code Optimization
During intermediate code generation, it is possible to generate some redundant codes,
redundant in the sense that their elimination would not have any adverse effect on the
expected result. The removal of these redundant codes is called code optimization. It is a
step designed to improve the efficiency of the object code in terms of both storage
utilization and execution time.

To illustrate this, it will be observed that eight instructions were generated for the
statement W = X * Y + Z. However, it is possible to generate more efficient code for this
statement. For instance, the third and fourth instructions are not really necessary: their
elimination will not affect the result provided by the rest of the code, because the product
of the contents of memory locations X and Y is already in the accumulator; all that is
needed is to add the content of memory location Z to it.
The same principle applies to the sixth and seventh instructions. At the code optimization
stage the 3rd, 4th, 6th and 7th instructions would be eliminated to give:

LDA X
MUL Y
ADD Z
STA W

which is more efficient code than the first.


This type of optimization is called machine-independent optimization: the optimization is
done without regard to the features of the machine. Other machine-independent
optimization methods include:
i. Performing operations whose operands are known at compile time. For
example, in the FORTRAN statement

A = 7*22/7 *R**2

the compiler can perform the multiplication and division involving the constants and
substitute 22 * R ** 2 for the original expression. This is especially useful if such an
expression occurs within a loop.
ii. Moving operations whose operands do not change within a DO-loop out of the
loop. Consider the program segment below as an example:

KSUM = 0
DO 10 K = 1, 100
KSUM = KSUM + K
IX = 25
10 CONTINUE

It should not be difficult to see that the statement IX = 25 within the DO-loop will
always produce the same result. It could therefore be moved out of the loop.
There is also machine-dependent optimization, which is done while actually generating
code, by exploiting the features of the machine, such as better utilization of the available
registers and the instruction set of the machine.
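As a concrete illustration, the C sketch below performs the redundant store/load elimination described earlier on the symbolic code for W = X * Y + Z; the instruction encoding, and the convention that compiler temporaries begin with T, are assumptions made for the example.

/* Sketch: removing a 'STA t' immediately followed by 'LDA t', since the
   value is then already in the accumulator. */
#include <stdio.h>
#include <string.h>

struct instr { char op[4]; char operand[8]; };

static int peephole(struct instr *code, int n) {
    int out = 0;
    for (int i = 0; i < n; i++) {
        if (i + 1 < n &&
            strcmp(code[i].op, "STA") == 0 &&
            strcmp(code[i + 1].op, "LDA") == 0 &&
            strcmp(code[i].operand, code[i + 1].operand) == 0 &&
            code[i].operand[0] == 'T') {       /* only compiler temporaries */
            i++;                               /* skip both instructions */
            continue;
        }
        code[out++] = code[i];
    }
    return out;                                /* new instruction count */
}

int main(void) {
    struct instr code[] = {
        {"LDA","X"}, {"MUL","Y"}, {"STA","T1"}, {"LDA","T1"},
        {"ADD","Z"}, {"STA","T2"}, {"LDA","T2"}, {"STA","W"},
    };
    int n = peephole(code, 8);
    for (int i = 0; i < n; i++)                /* LDA X, MUL Y, ADD Z, STA W */
        printf("%s %s\n", code[i].op, code[i].operand);
    return 0;
}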

ERROR HANDLING DURING COMPILATION


Errors can be detected at any phase of compilation, especially the first 3 phases. When
errors are discovered, compilation cannot be completed. The usual practice is to continue
to the end of the semantic analysis phase, in case more errors are encountered; code
generation does not take place.
In some cases it may be difficult to continue either the syntactic or semantic analysis, as
some assumptions would have to be made about the correctness of the code near the error.
Analysis generally re-commences from the start of the next source program statement.

LEXICAL ERRORS:
Lexical errors are the errors thrown by the lexer when it is unable to continue, that is,
when there is no way to recognise a lexeme as a valid token for the lexer. Syntax errors, on
the other hand, are thrown by the parser when a given set of already recognised valid
tokens does not match any of the right-hand sides of the grammar rules. A simple panic-
mode error handling system requires that we return to a high-level parsing function when a
parsing or lexical error is detected.

Error-recovery actions are:


i. Delete one character from the remaining input.
ii. Insert a missing character into the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.

Exercise
1 (a) What is a compiler?
(b) Describe the four basic steps involved in the compilation of a program written in a
high level language.
2 (a) What do you understand by
i. Polish notation ii. quadruple form?
(b) Represent the arithmetic expression
a - b(c/d * e/f) in
(i) Polish notation (ii) Quadruple form (iii) Syntax tree

3. Write brief notes on the following:
(i) Token (ii) Symbol Table (iii) Semantic analysis (iv) Polish notation (v) Code
optimization.

IMPLEMENTATION OF A LEXICAL ANALYSER


The logic flow for implementing a scanner can be described in the form of a state diagram
(a diagram showing how tokens are parsed or recognized by the scanner) or in the form of
a generalised algorithm. Both of these are now described.

Generalised algorithm for a scanner


The pseudocode below is a generalised but simplified algorithm of how a scanner
recognises and classifies tokens in a programming language.
Note:
1. The algorithm assumes a source statement is contained in one line (or record).
2. Single (‘) or double (") quotes are used to delimit string literals; where different
characters are used, the algorithm would be modified suitably (see the next section on
the state diagram).
3. Single-character delimiters or special operators are assumed; where two or more
characters are used as a special operator, this would be handled specially.
4. TOKEN is a variable containing the next token, while CHAR is a variable
containing the next character recognised.

The algorithm now follows.


Do while not EOF (end of source program file)
Read a source statement
Set TOKEN to null string
Set CHAR to null string
Do until end of the source statement string
IF CHAR = null string or blank character, scan the string for the next non-blank character
Endif

Do CASE
CASE CHAR = digit (0-9)
Do while CHAR = digit
TOKEN = TOKEN + CHAR
Scan next character (CHAR)
Enddo
TOKEN = Numeric literal

CASE CHAR = alphabet (A-Z)
Do while CHAR = alphabet or digit
TOKEN = TOKEN + CHAR
Scan next character
Enddo
Check the token in the table of reserved words
If found
TOKEN = Reserved word
Else
TOKEN = Identifier
Endif

CASE CHAR = any valid character (other than ‘ or ")
TOKEN = Delimiter/Special operator

CASE CHAR = " (or ‘)
Do until CHAR = " (or ‘)
TOKEN = TOKEN + CHAR
Scan next character
Enddo
TOKEN = String literal

OTHERWISE
CALL error handling routine
ENDCASE
ENDDO
ENDDO

Figure 8.1: Generalised algorithm for a simple scanner.
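A compact C rendering of part of the algorithm in figure 8.1 is sketched below; it handles identifiers, numeric literals and single-character delimiters only, the reserved-word lookup and string literals being omitted for brevity, and the input statement is an assumed example.

/* Sketch: a scanner loop in the spirit of figure 8.1 (simplified). */
#include <ctype.h>
#include <stdio.h>

static const char *src = "SUM = SUM + 42 ;";   /* assumed source statement */

static void next_token(char *token, const char **kind) {
    int i = 0;
    while (*src == ' ') src++;                  /* skip blank characters */
    if (isdigit((unsigned char)*src)) {         /* numeric literal */
        while (isdigit((unsigned char)*src)) token[i++] = *src++;
        *kind = "numeric literal";
    } else if (isalpha((unsigned char)*src)) {  /* identifier */
        while (isalnum((unsigned char)*src)) token[i++] = *src++;
        *kind = "identifier";                   /* reserved-word check omitted */
    } else if (*src) {                          /* delimiter / special operator */
        token[i++] = *src++;
        *kind = "delimiter/operator";
    } else {
        *kind = "end of statement";
    }
    token[i] = '\0';
}

int main(void) {
    char token[32]; const char *kind;
    for (;;) {
        next_token(token, &kind);
        if (token[0] == '\0') break;
        printf("%-8s %s\n", token, kind);
    }
    return 0;
}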

State diagram for a simple scanner


In implementing a scanner, a state diagram can be drawn to show how a token is parsed.
Let us assume the following descriptions for a simple language L.
A. The tokens of L are:
i. Delimiters/operators: /, +, -, *, **, ( and )

ii. Reserved words, consisting of BEGIN, ABS, INT, and END
iii. Identifiers (which may not be a reserved word and must start with an
alphabet, followed by zero or more alphabets or digits)
iv. Integers
B. At least one blank character must separate adjacent identifiers, integers and/or
reserved words; and no blank may separate adjacent characters of a token.
C. The double character /* begins a comment, and the comment ends at the first
occurrence of the double character */.

For the above language, figure 6 below is a simplified version of the state diagram for the
scanner. In the diagram, D stands for the digits 0-9, L for the alphabets A-Z, and DELIM
represents the single-character delimiters +, ( and ), excluding / and *.
For example, beginning from the state START, if the first character scanned makes us
proceed to the state INT, we stay in this state as long as we can scan a digit; if a non-digit
is scanned, we proceed along the arc to output the token and then return to START.

The state diagram of figure 6 is rendered textually below.

1. START: on D go to INT; on L go to ID/RW; on / go to SLASH; on * go to STAR; on a
DELIM character output the delimiter token and return to START; on any other
character go to ERROR.
2. INT: on D remain in INT; on any other character output an integer token and return
to START.
3. ID/RW: on L or D remain in ID/RW; on any other character output an identifier or
reserved-word token and return to START.
4. SLASH: on * go to COMMENT and remain there until the character pair */ ends the
comment, then return to START; on any other character output the / token and return
to START.
5. STAR: on * output the ** token; on any other character output the * token; in either
case return to START.
6. ERROR: report the invalid character and return to START.

Figure 6: state diagram of a simple scanner for the simple language L.


Similarly, from START, if the first character scanned makes us proceed to the state ID/RW
(identifier or reserved word), we stay in this state as long as we can scan a digit or
alphabet. If a character other than a digit or alphabet is scanned, we proceed along the arc
to output the token and then return to START.

ORGANIZATION OF SYMBOL TABLES

Tables of all types have the general form

Argument        Value
Entry 1
Entry 2
...
Entry n

In the table, 'argument' refers to the identifiers and 'value' to their attributes. As already
described in the preceding chapter, it is pointers to the identifiers that are usually stored in
the argument column, while the identifiers themselves and their lengths are stored in a
special string list.
There are many techniques of organising tables. Three simple methods are hereby
considered briefly.

(a) Unordered Table


In this technique, entries for arguments are added in the order in which they arrive,
without any attempt at ordering.
- A search for a symbol requires a comparison with each entry of the table until a
match is found.
- For a table containing n entries, an average of n/2 comparisons would be
required.
- If n ≥ 20, the method is inefficient.

(b) Ordered Table


In this technique, entries in the table are stored in their alphabetic sequence.
The common search method is binary search. In this technique,
- The symbol, S, being sought is compared with the middle entry, i.e. entry (n + 1)/2.
- If that entry is not the desired one, we have only to look either in the block of entries
numbered 1 to ((n + 1)/2) - 1 or ((n + 1)/2) + 1 to n, depending on whether S is
less than or greater than the compared entry.
- The process is iterated until the symbol is found.
- A maximum of 1 + log2 n comparisons is required.
- If n = 2, at most 2 comparisons are needed; 3, if n = 4; 4, if n = 8.
- If n = 128, binary search requires a maximum of 8 comparisons while the unordered
method requires an average of 64.
A C sketch of this search is given below.
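The following sketch renders this binary search over an alphabetically ordered table in C; the table contents are an assumed example.

/* Sketch: binary search over an ordered symbol table. */
#include <stdio.h>
#include <string.h>

/* Return the index of name in table[0..n-1], or -1 if absent. */
static int bin_search(const char *table[], int n, const char *name) {
    int low = 0, high = n - 1;
    while (low <= high) {
        int mid = (low + high) / 2;            /* the middle entry */
        int cmp = strcmp(name, table[mid]);
        if (cmp == 0) return mid;
        if (cmp < 0) high = mid - 1;           /* search the lower block */
        else         low  = mid + 1;           /* search the upper block */
    }
    return -1;
}

int main(void) {
    const char *table[] = { "AREA", "BREADTH", "LENGTH", "SUM", "TOTAL" };
    printf("%d\n", bin_search(table, 5, "LENGTH"));   /* prints 2 */
    printf("%d\n", bin_search(table, 5, "WIDTH"));    /* prints -1 */
    return 0;
}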

(c) Hashing Techniques


In this method, symbols are converted to indexes of entries in the table (the entries are
numbered 0, 1, 2, …, n - 1 for a table with n entries). The hashing is done by
performing some arithmetic or logical operations on the symbol (and possibly its length).
The result usually yields a unique index.
One simple hash function is the internal representation of the first character of the symbol.
Assuming the binary representation of A is 11000001, the symbol ADE will hash to
11000001 (C1 in hexadecimal). Thus, the first entry to look at in searching for one with
argument ADE is entry 11000001. Table 6 illustrates this for the identifiers ADE, BELLO
and J.

Symbol    Hash value    Symbol table
ADE       11000001      Entry 11000001
BELLO     11000010      Entry 11000010
J         11001010      Entry 11001010

Table 6: Symbol table organization using the hashing technique.

In this technique, the cost of a search is only the cost of doing the hashing, as long as no 2
symbols hash to the same index.
When 2 symbols hash to the same index, a collision is said to occur (for example, the
symbols ADE and AJAO will collide). A good hash function minimizes the occurrence
of collisions. Two methods of resolving collisions, the details of which will not be
discussed, are re-hashing and chaining; a sketch of chaining is given below.
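Below is a C sketch of the hashing technique with chaining as the collision-resolution method; the first-character hash deliberately mirrors the weak function in the text, so that ADE and AJAO collide and end up on the same chain.

/* Sketch: hash table with chaining; hash = code of first character. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 256

struct sym { const char *name; struct sym *next; };
static struct sym *bucket[BUCKETS];            /* all chains start empty */

static unsigned hash(const char *name) {
    return (unsigned char)name[0];             /* internal code of 1st character */
}

static void insert(const char *name) {
    struct sym *s = malloc(sizeof *s);
    s->name = name;
    s->next = bucket[hash(name)];              /* chain in front on collision */
    bucket[hash(name)] = s;
}

static struct sym *search(const char *name) {
    for (struct sym *s = bucket[hash(name)]; s; s = s->next)
        if (strcmp(s->name, name) == 0) return s;
    return NULL;
}

int main(void) {
    insert("ADE"); insert("BELLO"); insert("J"); insert("AJAO");
    printf("%s\n", search("AJAO") ? "found" : "missing");   /* prints found */
    return 0;
}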

EXTENDING INTERMEDIATE FORMS OF SOURCE PROGRAM TO OTHER
CONSTRUCTS

It is possible to extend the intermediate forms described in the previous module to other
syntactic constructs of a PL. Some examples are given in this section.

Extending Polish Notation


Polish notation can be extended easily to other operations provided we keep the rule that
operands are immediately followed by their operators.

a) Assignment Operator

The assignment statement with the usual syntax

<variable> = <expression>

now has the syntax

<variable> <expression> =

A typical example is the statement

W = X * Y + Z

which would now become W X Y * Z + =.

b) Unary Operator

One simple way of handling a unary operator like unary minus is to write it as a binary
operator, for example, 0 - B for -B. Thus

Z + (-W + Y * X)

in polish notation becomes

Z 0 W - Y X * + +

An alternative approach would be to introduce a special symbol for the operator, e.g.
@ for unary minus. Thus Z + (-W + Y * X) in polish notation becomes Z W @ Y X * + +.

c) Unconditional Jump

An unconditional jump of the form


GO TO L
would now have the syntax
L BR
where L represents a number corresponding to the symbol table entry address of the label,
and the operator BR means branch. BR is a unary operator.

d) Conditional Jump

A conditional jump would be of the form


e I BZ
where e has a numeric value and I is the number or location of a symbol in the polish
string. This polish notation means that if e is zero then a branch should be made to I;
otherwise the usual order of evaluation should be followed.

Similarly, operations such as branch if negative (BN), branch if positive (BP), etc. would
be allowable. Note that BZ, BN, BP, etc. are binary operators, unlike BR, which is a unary
operator.

e) Conditional Statement

A conditional statement of the form


IF <expr> THEN<Stmt 1 > ELSE <Stmt 2>

would have the syntax


<expr> L1 BZ <stmt1> L2 BR <stmt2>

where <expr> has a boolean value (1 = true, 0 = false), L1 is the label of the statement
beginning <stmt 2>, and L2 the label of the statement or symbol following <stmt 2>
(i.e. the statement that follows the last statement in <stmt 2>). The operators BZ and BR
are as defined previously. For example, the statement

IF A = B THEN C = B + A ELSE D = A - B

would have its polish notation as

A B - L1 BNZ C B A + = L2 BR D A B - =

where BNZ means branch if not zero, L1 represents the label for the statement D = A - B
and L2 represents the number of the statement or symbol following the complete IF
statement given above.

Alternatively, the polish notation could be written as:

(1) A B - (13) BNZ
(6) C B A + = (18) BR
(13) D A B - =
(18) -----------------

where the number in parenthesis represents the position, location or number of a symbol in
the polish string. For example, the first A, the BNZ, the C and the last B respectively
occupy the 1st, 5th, 6th and 15th positions in the string.

f) For – Loops
This construct would need to be broken down into its primitive form, with
appropriate initialization, testing and branching.

For example, the primitive form of

DO 10 I = 1, 20 (FORTRAN DO-loop)
10 S = S + 1

is:

I = 1
10 S = S + 1
I = I + 1
IF (I .LE. 20) GOTO 10
20 ---------------

N.B.
20 is assumed to be the statement number of the statement following the end of the DO-
loop body.

The polish notation for the above DO LOOP is:

(1) I 1 =

(4) S S 1 + =

(9) I I 1 + =

(14) 20 I - (4) BNN

(19) ----------------

Here, BNN means branch if not negative.

Extending Quadruples
Extension of quadruples to other operators is not difficult either; some examples are shown
below.

(a) Unary Operator

In quadruple notation, the common convention for unary operators is that they do not make
use of the second operand. For example, -X in quadruple form would be

-, X, , T

which stores the value of -X in T.

Consequently, the quadruples for -X * (Y + Z) are

+, Y, Z, T1
-, X, , T2
*, T1, T2, T3

(b)Assignment Operator

Like most unary operators, the quadruple for the assignment operation is

=, OP1, , OP3

which means store the content of OP1 in the location of OP3. Thus, the quadruples for

W = X + Y * Z

are the sequence

*, Y, Z, T1
+, X, T1, T2
=, T2, , W
(c) Unconditional Jump

The quadruple for an unconditional branch instruction is

BR, i

This means branch to quadruple i.

(d)Conditional jump

A quadruple for a conditional branch statement can be

BZ, i, OP

This notation means that if the operand OP is zero then a branch is made to quadruple i.
BN (if OP is negative) and BP (if OP is positive) are similarly defined.

Another quadruple for a conditional jump (with a relational operator) can be

BE, i, OP1, OP2

where BE means branch to quadruple i if OP1 = OP2.

BG (for OP1 > OP2), BL (for OP1 < OP2), etc. are similarly defined.

(e) Conditional statement

The quadruples defined for the jump instructions can be applied to conditional statements
as shown in the following examples.
Example 1
The quadruples for
IF I > J THEN K = K * G could be
(1) -, I, J, T1
(2) BP, (4), T1
(3) BR, (6)
(4) *, K, G, T2
(5) =, T2, , K
(6) ---------
Alternatively, they could be written as
(1) BG, (3), I, J
(2) BR, (5)
(3) *, K, G, T1
(4) =, T1, , K
(5) ........

Note
The number in parenthesis indicates the quadruple sequence number.

Example 2
The quadruples for
IF A = B THEN A = C + D ELSE B = C - D
could be either of the following:

(1) -, A, B, T1          (1) BNE, (5), A, B
(2) BZ, (6), T1          (2) +, C, D, T1
(3) -, C, D, T2          (3) =, T1, , A
(4) =, T2, , B           (4) BR, (7)
(5) BR, (8)              (5) -, C, D, T2
(6) +, C, D, T3          (6) =, T2, , B
(7) =, T3, , A           (7) ----------------
(8) ------------------

where BNE means branch if the 2 specified operands are not equal, and BR is as
previously defined.

(f) For-Loops

The approach to writing quadruples for this construct is to break it down into its primitive
form, with the appropriate initialization, testing and branching. For example, the primitive
form of

FOR I = N1 TO N2 STEP N3 (BASIC FOR-loop)

is

I = N1
10 IF I > N2 THEN 99
.
.
I = I + N3
GO TO 10
99 ---------------

This primitive form assumes that the given for-loop is a pretest iteration, meaning that
the test for the stopping point is carried out before executing the body of the loop.
Furthermore, it is assumed that the final value (N2) is greater than or equal to the initial
value (N1). The quadruples for the for-loop are as shown below.

1) =, N1, , I
2) -, I, N2, T1
3) BP, (99), T1
-------------
-------------
95) ------------
96) +, I, N3, T2
97) =, T2, , I
98) BR, (2)
99) ----------

In the above, quadruples 4 and 95 respectively begin and end the body of the loop, while
quadruples 96 to 98 increment the loop variable and branch back to the test.

Extending the syntax tree to other operators
The syntax tree could similarly be extended, as graphically illustrated in the examples
below.

Example 1: The syntax tree for A = B could be

      =
     / \
    A   B

Example 2
The syntax tree for
IF e THEN S1
could be

    IF-THEN
     /   \
    e     S1

Example 3
The syntax tree for
IF e THEN S1 ELSE S2
could be

    IF-THEN-ELSE
     /    |    \
    e     S1    S2

Example 4
The syntax tree for
IF A = B THEN A = C + D ELSE B = C - D
could be

         IF-THEN-ELSE
        /      |      \
       =       =       =
      / \     / \     / \
     A   B   A   +   B   -
                / \     / \
               C   D   C   D

Example 5
The syntax tree for
DO <Stmtno> V1 = V2, V3, V4 (FORTRAN DO loop)
could be

       DO-FROM-TO-INCREASE-BY
       /       |      |      \
 <Stmtno>      =      =       V4
              / \    / \
            V1   V2 V1  V3

Exercise
Draw the syntax tree for: i + i + i

COMPILER IMPLEMENTATION APPROACH

A compiler is generally implemented either as a single-pass or a multi-pass compiler. Any
step (or group of steps) of the compilation process which involves reading the whole
source program, or a modified version of it, is called a pass. Compilers that can perform
their translation in a single pass are called single-pass compilers, while those that perform
their translation in more than one pass are called multi-pass compilers.

Single-pass compilers perform the four steps of compilation (lexical, syntax and semantic
analyses, and code generation) in succession for each source statement. In a typical single-
pass compiler, the parser calls the scanner to obtain tokens and calls a semantic routine
when a construct is recognized. This routine does semantic checking, storage allocation
and code generation for the construct before returning to the parser.

In a multi-pass compiler these steps are performed in sequence rather than interleaved;
that is, each step may be wholly or partially completed for all the source statements before
the next step is begun.

WATFOR is a single-pass compiler implemented on the IBM 360 for batch processing of
FORTRAN IV programs. Its scanner translates one FORTRAN statement at a time into an
intermediate form, determines the type of the statement and calls the appropriate semantic
routine (a procedure) to process it. This routine analyses the syntax and semantics of the
statement and generates object code for it immediately. An example of a multi-pass
compiler is the IBM 360 FORTRAN IV compiler, which has four passes.

Single-pass compilers are usually faster than multi-pass compilers; however, they may
produce less efficient object code, since they are not likely to perform extensive code
optimization. It is also required that all identifiers be declared before they are used. A
single-pass compiler is useful in a training environment; a production environment will
require a compiler that does more optimization.

Criteria for determining the number of passes


Criteria for determining the number of passes in implementing a compiler include:

i. Amount of available main memory: few passes for very limited memory.
ii. How fast the compiler should be: the fewer the passes, the faster.
iii. How efficient the object program should be: fewer passes implies less
efficiency.
iv. Environment of use: fewer passes for a training environment, many passes for a
production environment.

In summary, a single-pass approach will be preferred to a multi-pass approach in
implementing a compiler when:
i. main memory is very limited;
ii. compilation efficiency is of prime importance;
iii. execution efficiency is not critical;
iv. the compiler is to be used mainly in a training environment.

LANGUAGE IMPLEMENTATION APPROACH

This section briefly compares and contrasts the 2 methods of implementing a high level
language. A HLL is generally implemented either as a compiler or as an interpreter. A
compiler translates the HLL program to its machine language equivalent for later
execution, while an interpreter translates and executes the program statement by statement.

A compiler processes the program statements in their physical input sequence, while an
interpreter follows the logical flow of control through the program. An interpreter may
process some statements repeatedly (if they are part of a loop) and might ignore others
completely (if control never reaches them).

Object code is produced for subsequent executions with compilation; but with
interpretation, no object code is produced and re-translation is required every time
execution is required.

A pure interpreter normally analyses a source program statement each time it is to be
executed, in order to know how to carry out the execution. In the analysis, syntactic and
semantic checks are performed on the source statement, and the statement is then
translated into an intermediate form; polish notation is often used. This intermediate form
is then executed (i.e. interpreted or simulated). This stage of analysis is similar to the first
part of a multi-pass compiler.

Again, interpreters are very useful for debugging purposes, since partial output can be
obtained as the program runs; but with a compiler, the whole program must be error free
before it can be executed to obtain any result.
Moreover, compilation requires that both the source program and the object program be
stored, while for interpretation no object program is stored; only the source program has to
be stored. This feature often makes interpreters preferable to compilers for very small
computers.

On the other hand, a compiled program may execute 20 to 100 times faster than an
interpreted one. This is because certain portions of a program may need to be executed
repeatedly, perhaps hundreds of times. An interpreter would have to analyse each
statement every time it is to be executed; a compiler, however, translates each statement
only once, regardless of how many times the translated statement will eventually be
executed.

Another significant difference between a compiler and an interpreter is that only one error
can be detected at a time by an interpreter, and this error will cause program execution to
halt until the error is corrected. A compiler, however, normally detects as many syntax
errors as possible.

Moreover, there is a guarantee of detection of all syntax errors in compilation, but there is
no such guarantee for interpretation, because control might never reach some statements
that have syntax errors.

Criteria for determining language implementation approach


The following criteria are important factors to consider when deciding whether to
implement a language as a compiler or an interpreter:
i. Execution efficiency: compiler, if this is a prime concern; otherwise, interpreter.
ii. Environment: compiler for a production environment, interpreter for a training
environment.
iii. Available main memory: compiler for unlimited memory, interpreter for
limited memory.
iv. Debugging and diagnostic support: interpreter, if this is critical; otherwise,
compiler.
v. Detection of all translation errors: compiler, if this guarantee is of prime concern;
otherwise, interpreter.

Note: Criteria (iv) and (v) are related to criterion (ii).

Exercise
1. What factors will inform your implementing a language as a compiler rather than
an interpreter? Why would you implement the language as a single-pass rather than a
multi-pass compiler?
2. Develop a suitable algorithm for the binary search technique.
3. Put the following into polish notation:

(i) A = B * C + D (ii) IF <expr> THEN <S1>
(iii) FOR <var1> = <var2> TO <var3> STEP <var4> (BASIC For-loop)
4. Write quadruples for the following:

(i) A = -B * (C + D) (ii) IF <expr> THEN <S1> ELSE <S2>
(iii) DO <stmt-No> <var1> = <var2>, <var3>, <var4> (FORTRAN DO-loop)
5. Draw syntax trees for:

(i) i - i * i (ii) i * i + i (iii) i + i * i

6. Contrast a compiler with an interpreter.

CHAPTER ONE
What is a compiler?
A compiler is a program that takes as input a program written in one language (the source
language) and translates it into a functionally equivalent program in another language (the
target language). The source language is usually a high-level language like Pascal or C,
and the target language is usually a low-level language like assembly or machine
language. As it translates, a compiler also reports errors and warnings to help the
programmer make corrections to the source, so the translation can be completed.
Theoretically, the source and target can be any language, but the most common use of a
compiler is translating an ASCII source program written in a language such as C into a
machine specific result like SPARC assembly that can execute on that designated
hardware.

Although we will focus on writing a compiler for a programming language, the techniques
you learn can be valuable and useful for a wide variety of parsing and translating tasks,
e.g. converting javadoc comments to HTML, generating a table from the results of an SQL
query, collating responses from an e-mail survey, implementing a server that responds to a
network protocol like http or imap, or “screen-scraping” information from an on-line
source. Your printer uses parsing to render PostScript files. Hardware engineers use a full-
blown compiler to translate from a hardware description language to the schematic of the
circuit. Your email spam filter quite possibly uses scanning and parsing to detect
unwanted email. And the list goes on…

How does a compiler work?
There are two main stages in the compiling process: analysis and synthesis. The analysis
stage breaks up the source program into pieces and creates a generic (language-
independent) intermediate representation of the program. Then, the synthesis stage
constructs the desired target program from the intermediate representation. Typically, a
compiler’s analysis stage is called its front-end and the synthesis stage its back-end. Each
of the stages is broken down into a set of “phases” that handle different parts of the task.

What advantage do you see to structuring the compiler this way (separating the front and
back ends)?

The analysis stage (front-end)


There are four phases in the analysis stage of compiling:

1) Lexical Analysis or Scanning: The stream of characters making up a source program
is read from left to right and grouped into tokens, which are sequences of characters
that have a collective meaning. Examples of tokens are identifiers (user-defined
names), reserved words, integers, doubles or floats, delimiters, operators, and
special symbols.

Example of lexical analysis:

int a;
a = a + 2;

A lexical analyser scanning the code fragment above might return:

int T_INT (reserved word)


a T_IDENTIFIER (variable name)
; T_SPECIAL (special symbol with value of “;”)
a T_IDENTIFIER (variable name)
= T_OP (operator with value of “=”)
a T_IDENTIFIER (variable name)
+ T_OP (operator with value of “+”)
2 T_INTCONSTANT (integer constant with value of 2)
; T_SPECIAL (special symbol with value of “;”)

2) Syntax Analysis or Parsing: The tokens found during scanning are grouped together
using a context-free grammar. A grammar is a set of rules that define valid
structures in the programming language. Each token is associated with a specific
rule, and grouped together accordingly. This process is called parsing. The output of
this phase is called a parse tree or a derivation, i.e., a record of which grammar
rules were used to create the source program.

Example of syntax analysis:


Part of a grammar for simple arithmetic expressions in C might look like this:

Expression -> Expression + Expression |
              Expression - Expression |
              ...
              Variable |
              Constant |
              ...
Variable -> T_IDENTIFIER
Constant -> T_INTCONSTANT |
            T_DOUBLECONSTANT

The symbol on the left side of the “->” in each rule can be replaced by the symbols on the
right. To parse a + 2, we would apply the following rules:

Expression -> Expression + Expression


-> Variable + Expression
-> T_IDENTIFIER + Expression
-> T_IDENTIFIER + Constant
-> T_IDENTIFIER + T_INTCONSTANT

When we reach a point in the parse where we have only tokens, we have finished. By
knowing which rules are used to parse, we can determine the structures present in the
source program.

3) Semantic Analysis: The parse tree or derivation is checked for semantic errors i.e., a
statement that is syntactically correct (associates with a grammar rule correctly), but
disobeys the semantic rules of the source language. Semantic analysis is the phase
where we detect such things as use of an undeclared variable, a function called with
improper arguments, access violations, and incompatible operands and type
mismatches, e.g. an array variable added to a function name.

Example of semantic analysis:

int arr[2], c;
c = arr * 10;

Most semantic analysis pertains to the checking of types. Although the C fragment above
will scan into valid tokens and successfully match the rules for a valid expression, it isn't
semantically valid. In the semantic analysis phase, the compiler checks the types and
reports that you cannot use an array variable in a multiplication expression and that the
type of the right-hand side of the assignment is not compatible with the left.

4) Intermediate Code Generation: This is where the intermediate representation of the


source program is created. We want this representation to be easy to generate, and
easy to translate into the target program. The representation can have a variety of
forms, but a common one is called three-address code (TAC) which is a lot like a
generic assembly language. Three-address code is a sequence of simple instructions,
each of which can have at most three operands.

Example of intermediate code generation:

a = b * c + b * d        _t1 = b * c
                         _t2 = b * d
                         _t3 = _t1 + _t2
                         a = _t3

The single C statement on the left is translated into a sequence of four instructions in
three-address code on the right. Note the use of temp variables that are created by the
compiler as needed to keep the number of operands down to three.
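To connect the parse tree with TAC, the C sketch below walks a hand-built expression tree for b * c + b * d and emits one three-address instruction per operator, creating a fresh temporary each time; all names and the tree encoding are illustrative assumptions.

/* Sketch: emitting TAC from a binary expression tree. */
#include <stdio.h>

struct node { char label; struct node *left, *right; };  /* leaf: a variable */

static int ntemp = 0;

/* Emit TAC for the subtree; return the temp number holding its value,
   or a negative number for a plain variable leaf (reported via *leaf). */
static int gen(const struct node *n, char *leaf) {
    if (!n->left) { *leaf = n->label; return -1; }
    char ll, rl;
    int lt = gen(n->left, &ll), rt = gen(n->right, &rl);
    int t = ++ntemp;
    printf("_t%d = ", t);
    if (lt < 0) printf("%c", ll); else printf("_t%d", lt);
    printf(" %c ", n->label);
    if (rt < 0) printf("%c", rl); else printf("_t%d", rt);
    printf("\n");
    return t;
}

int main(void) {
    /* b * c + b * d */
    struct node b = {'b', 0, 0}, c = {'c', 0, 0}, d = {'d', 0, 0};
    struct node m1 = {'*', &b, &c}, m2 = {'*', &b, &d};
    struct node plus = {'+', &m1, &m2};
    char unused;
    int t = gen(&plus, &unused);
    printf("a = _t%d\n", t);        /* same four instructions as above */
    return 0;
}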

Of course, it's a little more complicated than this, because we have to translate branching
and looping instructions, as well as function calls. Here is some TAC for a branching
translation:

if (a <= b)              _t1 = a > b
    a = a - c;           if _t1 goto L0
c = b * c;               _t2 = a - c
                         a = _t2
                         L0: _t3 = b * c
                         c = _t3

The synthesis stage (back-end)


There can be up to three phases in the synthesis stage of compiling:

1) Intermediate Code Optimisation: The optimiser accepts input in the intermediate
representation (e.g. TAC) and outputs a streamlined version still in the intermediate
representation. In this phase, the compiler attempts to produce the smallest, fastest
and most efficient running result by applying various techniques such as:

• inhibiting code generation of unreachable code segments
• getting rid of unused variables
• eliminating multiplication by 1 and addition by 0
• loop optimisation (e.g., moving computations that are not modified in the loop out of the loop)
• common sub-expression elimination
• strength reduction
....

The optimisation phase can really slow down a compiler, so typically it is an optional
phase. The compiler may even have fine-grain controls that allow the developer to make
tradeoffs between time spent compiling versus optimisation quality.
Example of code optimisation:

_t1 = b * c              _t1 = b * c
_t2 = _t1 + 0            _t2 = _t1 + _t1
_t3 = b * c              a = _t2
_t4 = _t2 + _t3
a = _t4

In the example shown above, the optimiser was able to eliminate an addition of zero and
a re-evaluation of the same expression, allowing the original five TAC statements to be
re-written in just three statements using two fewer temporary variables.

2) Object Code Generation: This is where the target program is generated. The output
of this phase is usually machine code or assembly code. Memory locations are
selected for each variable. Instructions are chosen for each operation. The three-
address code is translated into a sequence of assembly or machine language
instructions that perform the same tasks.

Example of code generation:

_t1 = b * c          lw $t1, -16($fp)     # load
_t2 = _t1 + _t1      lw $t2, -20($fp)     # load
a = _t2              mul $t0, $t1, $t2    # mult
                     add $t3, $t0, $t0    # add
                     sw $t3, -24($fp)     # store

In the example above, the code generator translated the TAC input into MIPS assembly
output.

3) Object Code Optimisation: There may also be another optimisation pass that
follows code generation, this time transforming the object code into tighter, more
efficient object code. This is where we consider features of the hardware itself to
make efficient usage of the processor(s) and registers. The compiler can take
advantage of machine-specific idioms (specialised instructions, pipelining, branch
prediction, and other peephole optimisations) in reorganising and streamlining the
object code itself. As with IR optimisation, this phase of the compiler is usually
configurable or can be skipped entirely.
The symbol table
There are a few activities that interact with various phases across both stages. One is
symbol table management; a symbol table contains information about all the identifiers in
the program along with important attributes such as type and scope. Identifiers can be
found in the lexical analysis phase and added to the symbol table. During the two phases
that follow (syntax and semantic analysis), the compiler updates the identifier entry in the
table to include information about its type and scope.

When generating intermediate code, the type of the variable is used to determine which
instructions to emit. During optimisation, the “live range” of each variable may be placed
in the table to aid in register allocation. The memory location determined in the code
generation phase might also be kept in the symbol table.

Error-handling
Another activity that occurs across several phases is error handling. Most error handling
occurs in the first three phases of the analysis stage. The scanner keeps an eye out for stray
tokens, the syntax analysis phase reports invalid combinations of tokens, and the semantic
analysis phase reports type errors and the like. Sometimes these are fatal errors that stop
the entire process, while at other times the compiler can recover and continue.

One-pass versus multi-pass


In looking at this phased approach to the compiling process, one might think that each
phase generates output that is then passed on to the next phase. For example, the scanner
reads through the entire source program and generates a list of tokens. This list is the input
to the parser that reads through the entire list of tokens and generates a parse tree or
derivation. If a compiler works in this manner, we call it a multi-pass compiler. The “pass”
refers to how many times the compiler must read through the source program. In reality,
most compilers are one-pass up to the code optimisation phase. Thus, scanning, parsing,
semantic analysis and intermediate code generation are all done simultaneously as the
compiler reads through the source program once. Once we get to code optimisation,
several passes are usually required which is why this phase slows the compiler down so
much.

STRUCTURE OF THE COMPILER DESIGN

Phases of a compiler: A compiler operates in phases. A phase is a logically
interrelated operation that takes the source program in one representation and produces
output in another representation. The phases of a compiler are described below.
There are two stages of compilation:
a. Analysis (machine independent / language dependent)
b. Synthesis (machine dependent / language independent)
The compilation process is partitioned into a number of sub-processes called 'phases'.

Lexical Analysis:-
The LA or scanner reads the source program one character at a time, converting the source
program into a sequence of atomic units called tokens.
Syntax Analysis:-
The second stage of translation is called syntax analysis or parsing. In this phase
expressions, statements, declarations etc. are identified by using the results of lexical
analysis. Syntax analysis is aided by using techniques based on the formal grammar of the
programming language.
Intermediate Code Generation:-
An intermediate representation of the final machine language code is produced. This
phase bridges the analysis and synthesis phases of translation.
Code Optimization:-
This is an optional phase designed to improve the intermediate code so that the output runs
faster and takes less space.

Code Generation:-
The last phase of translation is code generation. A number of optimizations to reduce the
length of the machine language program are carried out during this phase. The output of
the code generator is the machine language program for the specified computer.
Table Management (or Book-keeping):-
This portion keeps the names used by the program and records essential information
about each. The data structure used to record this information is called a 'Symbol Table'.
Error Handlers:-
Invoked when a flaw or error in the source program is detected.

The output of the LA is a stream of tokens, which is passed to the next phase, the syntax
analyzer or parser. The SA groups the tokens together into syntactic structures called
expressions. Expressions may further be combined to form statements. The syntactic
structure can be regarded as a tree whose leaves are the tokens; such trees are called
parse trees.

The parser has two functions. It checks whether the tokens from the lexical analyzer occur
in patterns that are permitted by the specification for the source language. It also
imposes on the tokens a tree-like structure that is used by the subsequent phases of the
compiler.

For example, if a program contains the expression A+/B, then after lexical analysis this
expression might appear to the syntax analyzer as the token sequence id+/id. On
seeing the /, the syntax analyzer should detect an error situation, because the
presence of these two adjacent binary operators violates the formation rules of an
expression.

Syntax analysis is to make explicit the hierarchical structure of the incoming token
stream by identifying which parts of the token stream should be grouped.

For example, A/B*C has two possible interpretations:

1. divide A by B and then multiply by C; or
2. multiply B by C and then use the result to divide A.

Each of these two interpretations can be represented in terms of a parse tree.

Intermediate Code Generation:-
The intermediate code generator uses the structure produced by the syntax analyzer to
create a stream of simple instructions. Many styles of intermediate code are possible. One
common style uses instructions with one operator and a small number of operands.
The output of the syntax analyzer is some representation of a parse tree. The intermediate
code generation phase transforms this parse tree into an intermediate language
representation of the source program.

Code Optimization:-
This is an optional phase designed to improve the intermediate code so that the output runs
faster and takes less space. Its output is another intermediate code program that does the
same job as the original, but in a way that saves time and/or space.
1. Local Optimization:-
There are local transformations that can be applied to a program to make an
improvement. For example,

IF A > B GOTO L2
GOTO L3
L2:

can be replaced by the single statement

IF A <= B GOTO L3

Another important local optimization is the elimination of common sub-expressions.
For example,

A := B + C + D
E := B + C + F

might be evaluated as

T1 := B + C
A := T1 + D
E := T1 + F

taking advantage of the common sub-expression B + C.

2. Loop Optimization:-
Another important source of optimization concerns increasing the speed of
loops. A typical loop improvement is to move a computation that produces the same result
each time around the loop to a point in the program just before the loop is entered.

Code Generator:-
The code generator produces the object code by deciding on the memory locations for
data, selecting code to access each datum, and selecting the registers in which each
computation is to be done. Many computers have only a few high-speed registers in which
computations can be performed quickly. A good code generator would attempt to utilize
the registers as efficiently as possible.
Table Management or Book-keeping:-
A compiler needs to collect information about all the data objects that appear in the
source program. The information about data objects is collected by the early phases of the
compiler (the lexical and syntactic analyzers). The data structure used to record this
information is called the Symbol Table.

Error Handling:-
One of the most important functions of a compiler is the detection and reporting of errors
in the source program. The error messages should allow the programmer to determine
exactly where the errors have occurred. Errors may occur in any of the phases of a
compiler.
Whenever a phase of the compiler discovers an error, it must report the error to the error
handler, which issues an appropriate diagnostic message. Both the table-management and
error-handling routines interact with all phases of the compiler.

Example:
Position := initial + rate * 60

Lexical Analyzer

tokens: id1 := id2 + id3 * 60

Syntax Analyzer

        :=
       /  \
    id1    +
          / \
       id2   *
            / \
         id3   60

Semantic Analyzer

        :=
       /  \
    id1    +
          / \
       id2   *
            / \
         id3   inttoreal
                   |
                   60

Intermediate Code Generator

temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

Code Optimizer

temp1 := id3 * 60.0
id1 := id2 + temp1

Code Generator

MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
TOKEN

The LA reads the source program one character at a time, carving the source program into
a sequence of atomic units called 'tokens'. Each token has two parts:

1) the type of the token, and
2) the value of the token.

Type: variable, operator, keyword, constant.
Value: name of the variable, current value, or pointer to the symbol table entry.

If the symbols are given in the standard format, the LA accepts them and produces tokens
as output. Each token is a sub-string of the program that is to be treated as a single unit.
Tokens are of two types:

1) specific strings such as IF or the semicolon;
2) classes of strings such as identifiers, labels and constants.

2.2 An Exercise in Recognition

The interested reader is encouraged (since, in a higher institution, everything,
including getting the degree, is optional) to implement the following algorithm in
his/her favourite programming language. We recommend Pascal, C, Basic,
FORTRAN 99, or C++ in that order, i.e., in ascending order of magnitude of
reduction of one's life span and happiness. Note that, being a computer science
student, you must present your work in line with the rules of program development
as studied in, e.g., systems analysis.

Given a Type 1 grammar, G = <T, N, P, S>, and a string α over T, where |α| = k, it is
strongly rumoured that this algorithm will determine whether α ∈ L(G) or not.

Algorithm 2.2.1

1. i ← 1; Q1 ← {S}
2. i ← i + 1
3. Qi ← Qi-1 ∪ { γ | β ⇒ γ and β ∈ Qi-1, |γ| ≤ k }
4. If Qi = Qi-1, stop. String α cannot be generated by G.
5. If α ∉ Qi, go to step 2.
6. Stop. String α has been generated by G.

