2018 Com 414 (Compiler Construction)
Introducing Compilers and Interpreters
A compiler is a computer program that implements a programming language specification
to "translate" programs, usually from a set of files which constitute the source code
written in a source language, into their equivalent machine-readable instructions (the
target language, often having a binary form known as object code). This translation
process is called compilation. We compile the source program to create the compiled
program. The compiled program can then be run (or executed) to do what was specified
in the original source program.
The source language is always a higher-level language in comparison to machine code,
written using some mixture of English words and mathematical notation, assembly
language being the lowest compilable language (an assembler being a special case of a
compiler that translates assembly language into machine code). Higher-level languages
are the most complex to support in a compiler/interpreter, not only because they increase
the level of abstraction between the source code and the resulting machine code, but
because increased complexity is required to formalize those abstract structures.
The target language is normally a low-level language such as assembly, written with
somewhat cryptic abbreviations for machine instructions; in this case the compiler will
also run an assembler to generate the final machine code.
But some compilers can directly generate machine code for some actual or virtual
computer e.g. byte-code for the Java Virtual Machine.
Another common approach to the resulting compilation effort is to target a virtual
machine. That will do just-in-time compilation and byte-code interpretation and blur the
traditional categorizations of compilers and interpreters.
For example, C and C++ are generally compiled for a target 'architecture'. The drawback
is that because there are many types of processor, there must be as many distinct
compilations. In contrast, Java targets a Java Virtual Machine, which is an independent
layer above the 'architecture'. The difference is that the generated byte-code, not true
machine code, brings the possibility of portability, but it needs a Java Virtual Machine
(the byte-code interpreter) for each platform. The extra overhead of this byte-code
interpreter means slower execution speed.
An interpreter is a computer program which executes the translation of the source
program at run-time. It will not generate independent executable programs nor object
libraries ready to be included in other programs.
A program which does a lot of calculation or internal data manipulation will generally run
faster in compiled form than when interpreted. But a program which does a lot of
input/output and very little calculation or data manipulation may well run at about the
same speed in either case.
Being themselves computer programs, both compilers and interpreters must be written in
some implementation language. Up until the early 1970s, most compilers were written in
assembly language for some particular type of computer. The advent of C and Pascal
compilers, each written in its own source language, led to the more general use of high-
level languages for writing compilers. Today, operating systems will provide at least a free
C compiler to the user, and some will even include it as part of the OS distribution.
Compiler construction is normally considered an advanced rather than a novice
programming task, mainly due to the quantity of code needed rather than the difficulty of
any particular coding constructs. For this, most books about compilers deserve some
blame: the large gap between production compilers and educational exercises promotes
this defeatist view.
For the first version of a compiler written in its own source language you have a
bootstrapping problem. Once you get a simple version working, you can then use it to
improve it.
The compilation process
At the highest level, compilation is broken into a number of parts:
1. Lexical analysis (tokenizing)
2. Syntax analysis (parsing)
3. Type checking
4. Code generation
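As a rough illustration of how these parts chain together, here is a minimal C++ sketch;
the phase functions are stubs with hypothetical names, so the emphasis is on the order of
the four parts rather than on real logic.

    #include <iostream>
    #include <string>
    #include <vector>

    using Token = std::string;
    using Ast = std::vector<Token>;            // stand-in for a real tree

    std::vector<Token> lex(const std::string& src) {   // 1. tokenizing
        std::vector<Token> toks;
        std::string w;
        for (char c : src) {
            if (c == ' ') { if (!w.empty()) { toks.push_back(w); w.clear(); } }
            else w += c;
        }
        if (!w.empty()) toks.push_back(w);
        return toks;
    }
    Ast parse(const std::vector<Token>& toks) { return toks; }  // 2. parsing (stub)
    void typecheck(const Ast&) {}                               // 3. type checking (stub)
    std::string codegen(const Ast& a) {                         // 4. code generation (stub)
        std::string out;
        for (const Token& t : a) out += t + "\n";
        return out;
    }

    int main() {
        std::cout << codegen(parse(lex("a = b + 2")));
    }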
Requirements
Any compiler has some essential requirements, which are perhaps more stringent than for
most programs:
• Any valid program must be translated correctly, i.e. no incorrect translation is allowed.
• Any invalid program must be rejected and not translated.
There will inevitably be some valid programs which can't be translated due to their size or
complexity in relation to the hardware available, for example problems due to memory
size. The compiler may also have some fixed-size tables which place limits on what can be
compiled (some language definitions place explicit lower bounds on the sizes of certain
tables, to ensure that programs of reasonable size/complexity can be compiled).
There are also some desirable requirements, some of which may be mutually exclusive:
• Errors should be reported in terms of the source language or program.
• The position at which an error was detected should be indicated; if the actual error
probably occurred somewhat earlier then some indication of possible cause(s) should also
be provided.
• Compilation should be fast.
• The translated program should be fast.
• The translated program should be small.
• If the source language has some national or international standard:
• Ideally the entire standard should be implemented.
• Any restrictions or limits should be well and clearly documented.
• If extensions to the standard have been implemented:
• These extensions should be documented as such.
• There should be some way of turning these extensions off.
There are also some possibly controversial requirements to consider (see chapter on
dealing with errors):
• Errors detected when the translated program is running should still be reported in relation
to the original source program e.g. line number.
• Errors detected when the translated program is running should include division by 0,
running out of memory, use of an array subscript/index which is too big or too small,
attempted use of an undefined variable, incorrect use of pointers, etc.
Overall Structure
For ease of exposition we will divide the compiler into a front end and a back end. These
need not even be written in the same implementation language, providing they can
communicate effectively via some intermediate representation.
The following list itemizes the tasks carried out by the front end and the back end. Note
that the tasks are not necessarily carried out in the order outlined below; they are discussed
in more detail in subsequent sessions.
• Front end
• lexical analysis - convert characters to tokens
• syntax analysis - check for valid sequence of tokens
• semantic analysis - check for meaning
• some global/high-level optimization
• Back end
• some local optimization
• register allocation
• peep-hole optimization
• code generation
• instruction scheduling
Almost all the source-language aspects are handled by the front end. It is possible to have
different front ends for different high-level languages, and a common back end which does
most of the optimization.
Almost all the machine-dependent aspects are handled by the back end. It is possible to
have different back ends for different computers so that the compiler can produce code for
different computers.
The front end is normally controlled by the syntax analysis processing. As necessary, the
syntax analysis code will call a routine which performs some lexical analysis and returns
the next token. At selected points during syntax analysis, appropriate semantic routines are
called which perform any relevant semantic checks and/or add information to the internal
representation.
Typical Compilation
Consider the source code of a C function. [Figure: C source listing] The C code is written
for human readability. Now consider the assembly code that the C compiler gcc generates
for the Intel platform: [Figure: gcc assembly listing]
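For concreteness, here is a function of the kind such a figure might show, together with
hand-written x86-64 assembly of the sort gcc could emit (an illustration only, not actual
gcc output):

    int square(int x)
    {
        return x * x;
    }

    square:                  ; illustrative x86-64 (System V convention)
        mov  eax, edi        ; argument x arrives in edi
        imul eax, edi        ; eax = x * x
        ret                  ; the result is returned in eax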
The assembly code is optimized for the hardware it is to run on. The code consists of
machine instructions and uses registers and unnamed memory locations. This version is
much harder for humans to understand.
Issues in Compilation
The translation of code from some human readable form to machine code must be
“correct”, i.e., the generated machine code must execute precisely the same computation
as the source code. In general, there is no unique translation from source language to a
destination language. No algorithm exists for an “ideal translation”.
Translation is a complex process. The source language and generated code are very
different. To manage this complex process, the translation is carried out in multiple passes.
Two-pass Compiler
[Figure: a two-pass compiler — source code → front end → IR → back end → machine code, with errors reported from both ends]
The figure above shows the structure of a two-pass compiler. The front end maps legal
source code into an intermediate representation (IR). The back end maps IR into target
machine code. An immediate advantage of this scheme is that it admits multiple front ends
and multiple passes.
The algorithms employed in the front end have polynomial time complexity, while the
majority of those in the back end are NP-complete. This makes compiler writing a
challenging task.
Let us look at the details of the front and back ends.
Front End
[Figure: source code → scanner → tokens → parser → IR; both modules report errors]
The front end recognizes legal and illegal programs presented to it. When it encounters
errors, it attempts to report them in a useful way. For legal programs, the front end
produces IR and a preliminary storage map for the various data structures declared in the
program. The front end consists of two modules: the scanner and the parser.
Scanner
The scanner takes a program as input and maps the character stream into “words” that are
the basic unit of syntax. It produces pairs – a word and its part of speech. For example, the
input
x = x + y
becomes
<id,x>
<assign,=>
<id,x>
<op,+>
<id,y>
We call the pair “<token type, word>” a token. Typical tokens are: number, identifier, +, -,
new, while, if.
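A scanner might represent such pairs with a small structure; the following C++ sketch
(the names are illustrative, not a prescribed layout) shows one possibility.

    #include <iostream>
    #include <string>
    #include <vector>

    enum class TokenType { Id, Assign, Op, Number };

    struct Token {
        TokenType type;      // the "part of speech"
        std::string word;    // the lexeme, e.g. "x" or "+"
    };

    int main() {
        // x = x + y as a token stream:
        std::vector<Token> toks = {
            {TokenType::Id, "x"}, {TokenType::Assign, "="},
            {TokenType::Id, "x"}, {TokenType::Op, "+"}, {TokenType::Id, "y"}
        };
        std::cout << toks.size() << " tokens\n";
    }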
Parser
The parser takes in the stream of tokens, recognizes the context-free syntax and reports
errors. It guides context-sensitive (“semantic”) analysis for tasks like type checking. The
parser builds the IR for the source program.
The syntax of most programming languages is specified using Context-Free Grammars
(CFG). Context-free syntax is specified with a grammar G = (S, N, T, P), where S is the
start symbol, N the set of non-terminals, T the set of terminals and P the set of productions:

1. goal → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op → +
7.    | –

S = goal
T = { number, id, +, – }
N = { goal, expr, term, op }
P = { 1, 2, 3, 4, 5, 6, 7 }
Given a CFG, we can derive sentences by repeated substitution. Consider the sentence
x+2–y
Production                    Result
                              goal
1: goal → expr                expr
2: expr → expr op term        expr op term
5: term → id                  expr op y
7: op → –                     expr – y
2: expr → expr op term        expr op term – y
4: term → number              expr op 2 – y
6: op → +                     expr + 2 – y
3: expr → term                term + 2 – y
5: term → id                  x + 2 – y
To recognize a valid sentence in some CFG, we reverse this process and build up a parse;
thus the name “parser”. A parse can be represented by a tree: a parse tree or syntax tree.
For example, here is the parse tree for the expression x + 2 – y: [Figure: parse tree]
The parse tree captures all the rewrites applied during the derivation. The derivation can
be extracted by starting at the root of the tree and working towards the leaf nodes.
Abstract Syntax Trees
The parse tree contains a lot of unneeded information. Compilers often use an abstract
syntax tree (AST). For example, the AST for the above parse tree is
         –
        / \
       +   <id,y>
      / \
 <id,x>   <number,2>
This is much more concise; AST summarizes grammatical structure without the details of
derivation. ASTs are one kind of intermediate representation (IR).
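A compiler might realize such nodes with a small tagged structure; the following C++
sketch (an assumption for illustration, not a prescribed layout) builds the AST above and
prints it in preorder.

    #include <iostream>
    #include <memory>
    #include <string>

    struct Node {
        std::string label;                   // "-", "+", "<id,x>", ...
        std::unique_ptr<Node> left, right;   // null for leaves
    };

    std::unique_ptr<Node> leaf(std::string s) {
        return std::unique_ptr<Node>(new Node{std::move(s), nullptr, nullptr});
    }

    void preorder(const Node* n) {
        if (!n) return;
        std::cout << n->label << " ";
        preorder(n->left.get());
        preorder(n->right.get());
    }

    int main() {
        // AST for (x + 2) - y
        auto plus = std::unique_ptr<Node>(
            new Node{"+", leaf("<id,x>"), leaf("<number,2>")});
        Node root{"-", std::move(plus), leaf("<id,y>")};
        preorder(&root);                     // - + <id,x> <number,2> <id,y>
        std::cout << "\n";
    }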
The Back End
The back end of the compiler translates IR into target machine code. It chooses machine
(assembly) instructions to implement each IR operation. The back end ensures
conformance with system interfaces. It decides which values to keep in registers in order
to avoid memory accesses, since memory access is far slower than register access.
The back end is responsible for instruction selection so as to produce fast and compact
code. Modern processors have a rich instruction set. The back end takes advantage of
target features such as addressing modes. Usually, instruction selection is viewed as a
pattern matching problem that can be solved by dynamic programming based algorithms.
Instruction selection in compilers was spurred by the advent of the VAX-11 which had a
CISC (Complex Instruction Set Computer) architecture. The VAX-11 had a large
instruction set that the compiler could work with.
The CISC architecture provided a rich set of instructions and addressing modes, but it
made the job of the compiler harder when it came to generating efficient machine code.
The RISC architecture simplified this problem.
Register Allocation
Registers in a CPU play an important role in providing high-speed access to operands.
Memory access is an order of magnitude slower than register access. The back end
attempts to have each operand value in a register when it is used. However, the back end
has to manage a limited set of resources when it comes to the register file. The number of
registers is small, and some registers are pre-allocated for specialized uses, e.g., the
program counter, and thus are not available to the back end. Optimal register allocation is
NP-complete.
Instruction Scheduling
Modern processors have multiple functional units. The back end needs to schedule
instructions to avoid hardware stalls and interlocks. The generated code should use all
functional units productively. Optimal scheduling is NP-Complete in nearly all cases.
Three-pass Compiler
There is yet another opportunity to produce efficient translation: most modern compilers
contain three stages. An intermediate stage is used for code improvement or optimization.
The topology of a three-pass compiler is shown in the following figure:
[Figure: source code → front end → IR → middle end → IR → back end → machine code; errors reported from each stage]
The middle end analyzes IR and rewrites (or transforms) IR. Its primary goal is to reduce
running time of the compiled code. This may also improve space usage, power
consumption, etc. The middle end is generally termed the “Optimizer”. Modern optimizers
are structured as a series of passes:
IR → Opt 1 → IR → Opt 2 → IR → Opt 3 → … → Opt n → IR   (errors may be reported from any pass)

Besides the translation itself, the generated code depends on run-time support for:
• Memory management
• Allocate and de-allocate memory
• Garbage collection
• Run-time type checking
• Error/exception processing
• Interface to OS – I/O
• Support for parallelism
• Parallel threads
• Communication and synchronization
Lexical Analysis
The scanner is the first component of the front end of a compiler; the parser is the second.

[Figure: source code → scanner → tokens → parser → IR; both modules report errors]
The task of the scanner is to take a program written in some programming language as a
stream of characters and break it into a stream of tokens. This activity is called lexical
analysis. A token, however, contains more than just the words extracted from the input.
The lexical analyzer partitions the input string into substrings, called words, and classifies
them according to their role.
Tokens
Consider the statement
if(b == 0) a = b
The words are “if”, “(”, “b”, “==”, “0”, “)”, “a”, “=” and “b”. The roles are keyword,
variable, boolean operator, assignment operator. The pair <role, word> is given the name
token. Here are some familiar tokens in programming languages:
• Identifiers: x y11 maxsize
• Keywords: if else while for
• Integers: 2 1000 -44 5L
• Floats: 2.0 0.0034 1e5
• Symbols: ( ) + * / { } <> ==
• Strings: “enter x” “error”
Ad-hoc Lexer
The task of writing a scanner is fairly straightforward. We can hand-write code to
generate tokens. We do this by partitioning the input string, reading left to right and
recognizing one token at a time. We will need look-ahead in order to decide where one
token ends and the next token begins. The following C++ code presents a template for a
Lexer class. An object of this class can produce the desired tokens from the input stream.
class Lexer
{
    Inputstream s;
    char next;                 // look-ahead character

public:
    Lexer(Inputstream _s)
    {
        s = _s;
        next = s.read();       // prime the look-ahead
    }
    // ... methods that recognize and return one token at a time
};
This works OK; however, there are some problems that we need to tackle.
• We do not know what kind of token we are going to read from seeing the first
character.
• If a token begins with “i”, is it an identifier “i” or the keyword “if”?
• If a token begins with “=”, is it “=” or “==”?
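One character of look-ahead resolves such cases. Below is a self-contained C++ sketch
using std::istream::peek to supply the look-ahead (the function name is illustrative):

    #include <iostream>
    #include <sstream>
    #include <string>

    // Decide between "=" and "==" with one character of look-ahead.
    std::string nextOperator(std::istream& s) {
        char c = s.get();
        if (c == '=' && s.peek() == '=') { s.get(); return "=="; }
        return std::string(1, c);
    }

    int main() {
        std::istringstream in("== =");
        std::cout << nextOperator(in) << "\n";   // prints ==
    }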
We can extend the Lexer class but there are a number of other issues that can make the
task of hand-writing a lexer tedious. We need a more principled approach. The most
frequently used approach is to use a lexer generator that generates efficient tokenizer
automatically.
Regular languages are the most popular for specifying tokens because:
• they are based on simple and useful theory,
• they are easy to understand, and
• efficient implementations exist for generating lexical analysers based on such
languages.
There are two ways to describe the set of sentences of a language: list them all, or give a
device that generates exactly the sentences of the language. When the set is small, the
former method will suffice. When the set is large, the former option is hopeless and the
second is to be preferred. The latter approach is how languages are defined, and the
sentence-generating device is called a grammar for the language. Note the article before
the word grammar in the last sentence – ‘a’ grammar: a language may have many grammars.
In this session, we shall give a formal definition of the grammar and show how it is used
to generate sentences. We also indicate how it is used to decide which string of words is
actually a sentence. This converse function of the grammar is called syntax analysis,
language recognition, or parsing, and is precisely the subject of the next session.
1.1 Basic Concepts
Alphabet
Whenever we say ‘alphabet’, we shall mean a non-empty finite set of elements. We will
call these elements symbols. We shall overwork the word symbol in this course. In fact,
whenever we don’t know what to call an object, we call it a symbol. But we shall in all
cases make sense.
String
A finite ordered sequence of symbols taken from an alphabet is said to be a string over that
alphabet. Notice that the symbols of a string are ordered, i.e., string AB is different from
string BA although both strings contain the same symbols. Apart from the symbols it
contains, a string has another property – its length. This is the number of symbols it
contains. If it contains no symbols, we call it the null string and denote it by Ɵ. There is
nothing odd here; if a person owes you nothing, you can state quite categorically that he
owes you ₦0. We shall denote by |x| or l(x) the length of a string x.
//// Example 1.1.1
(i) the set {0, 1} is an alphabet, A.
(ii) the set {a, b, c, d} is an alphabet, B.
(iii) the set {0, 1, 2, 3, 4, …} is not an alphabet because it is not finite.
(iv) the set { } is not an alphabet because it is empty.
(v) 0, 00, 01010011 are strings over A;
(vi) 1 (‘0’) = 1 (‘a’) = 1 (‘d’) = 1
1 (‘01010011’) = 8
N1.1.1 Notational Convention
We will use upper case such as X, Y, Z, characters to represent names of symbols as well
as sets of strings over an alphabet. We will use lower case characters such as x, y, z to
represent strings. Thus we may write
X = ‘CS331’
and
X = {‘CS331’, ‘CS909’, ‘GS1000’}
String Operations
Several string operations are available and interesting, but the ones that shall be in
frequent use to us are:
(i) Catenation
If x and y are strings, the string obtained by writing the symbols of y immediately
following the symbols of x is called the catenation of string y to string x, or the
concatenation of x and y. It is denoted by xy (a notation that shows clearly which string
comes first). For example, the concatenation of x = ‘CS’ and y = ‘331’ is xy = ‘CS331’.
The concatenation of x and Ɵ, the null string, is xƟ = ‘CS’.
Q1.1.1 Show that for any string x, xƟ = Ɵx = x.
(Hint: beware of points that appear trite. Often they form the weak link in a strong chain
and when ignored …)
Q1.1.2 What exactly does the expression “let Z = xy be a string” mean, if x and y are
strings?
Note particularly that in the above, it was nowhere stated that x must differ from y.
(ii) h(.), ph(.), t(.), pt(.)
Let Z = xy be a string. Then x is said to be a head, and y a tail, of Z. Thus if ‘CS331’
is a string, its heads are ‘C’, ‘CS’, ‘CS3’, ‘CS33’ and ‘CS331’, and its tails ‘CS331’, ‘1’,
‘31’, etc.
If, for example, you view ‘CS331’ as Z and x as ‘CS’ (and so y as ‘331’), you obtain
the head ‘CS’. If you view ‘CS331’ as Z and x as Ɵ (why not?), then y = ‘CS331’ is a clear
tail. By moving the Ɵ to the rear, you obtain ‘CS331’ as a head. For brevity, we introduce
the notations h(.), t(.), … to denote the operations of computing heads and tails of a string.
If x is a head and y is not Ɵ, then we call x a proper head. Similarly, should y be a tail and
x non-null, we call y a proper tail. Thus
h(‘CS33’) = {‘C’, ‘CS’, ‘CS3’, ‘CS33’, …}, and
t(‘CS3’) = {‘CS3’, ‘S3’, ‘3’}, and
‘CS’ is a proper head and ‘S3’ is a proper tail of Z = ‘CS3’.
(iii) String multiplication
In this we carry (i) one step further. The catenation of two strings x and y when x = y
may be written x² = xx. Similarly, xⁿ = xx…x (n times) shall denote the catenation of a
string x n times. x⁰ is defined to be Ɵ for any string x, and x¹ shall return x.
We start not with two strings but with two sets A and B of strings. Take for
example,
A = {‘01’, ‘ab’}
B = {‘+’, ‘-’, ‘xyz’}
We define the product of A and B by
AB = {xy | x ∈ A, y ∈ B}
In words, the product of A and B is
(a) A set of strings p
(b) Each p is the concatenation of a string from A with a string from B.
That’s all.
Then
AB = {‘01+’, ‘01-’, ‘01xyz’, ‘ab+’, ‘ab-’, ‘abxyz’}
while
BA = {‘+01’, ‘+ab’, ‘-01’, ‘-ab’, ‘xyz01’, ‘xyzab’}
The product becomes interesting when A = B. Then the above example extends to, e.g.,
AA = {‘0101’, ‘01ab’, …}
The product becomes very interesting when a multiplicand (see GCE ‘O’ level maths) is
itself a product, e.g., AB when A was in fact created from XX, or X². Then
AB = XXB
and XX·X = XXX, or X³, and in general
Xⁿ = X(Xⁿ⁻¹)
Let X = {0, 1}. Then
X² = {‘00’, ‘01’, ‘10’, ‘11’}
i.e., every member x has l(x) = 2. For every member x of Xⁿ, l(x) shall be n. ////
Thus we summarise:
Two key, central, super-critical concepts follow: the closure X* and the transitive closure
X⁺ of a set X.
Closure of X
The closure X* = X⁰ ∪ X¹ ∪ X² ∪ … ∪ Xⁿ ∪ …
The transitive closure X⁺ = X¹ ∪ X² ∪ … ∪ Xⁿ ∪ …
Note that
X* = X⁰ ∪ X⁺ and X⁺ = (X*)X
i.e., X* contains the null string together with the strings over X of arbitrary length ≥ 1. To
obtain X⁺, drop the null string from the above list. ////
1.2 Definition of Grammar
A rule (or production) is an ordered pair
(U, x) or U ::= x
where U is a symbol and x is a non-empty finite string of symbols. U is called the left part
and x is called the right part of the production.
D1.2.1 Grammar
A grammar G[Z] is a finite non-empty set of rules. Z is a symbol which must appear as
the left part of at least one rule in the set. Z is called the distinguished or sentence symbol.
‘Sentence symbol’ is a better name since Z represents the generic object defined by the
grammar. All the symbols appearing in either the left parts or the right parts form the
vocabulary, V.
Any symbol appearing as the left part of a rule is called a non-terminal; a symbol which
appears only in right parts is called a terminal. The set of non-terminals of a grammar will
be denoted by VN; the set of terminals, called the alphabet of the grammar, will be
denoted by VT.
V = VN ∪ VT
//// Example 1.2.1
The three rules
<identifier> ::= A
<identifier> ::= B
<identifier> ::= C
define a class called identifier, whose members are ‘A’, ‘B’ and ‘C’; the symbol ‘::=’ is to
be read as ‘is defined to be’. ////
N1.2.1 Notation
When several rules have the same left part, we may combine the rules into one composite
rule with the same left part and the right parts, now called alternates, separated by vertical
bars ‘ | ’. Thus, the three rules in example 1.2.1 may be combined to
<identifier> ::= A | B | C
N1.2.2 Notation
The word ‘symbol’ will be used in several contexts. The point of this ‘notation’ is that
‘symbol’ could be a single character but need not be. For example, each of identifier, ::=,
‘A’ is a symbol. Whenever we use the term ‘symbol’ the meaning will be clear (perhaps
on second reading).
VN = {<assignment>,<identifier>,<expression>,<term>}.
The formalism we have used so far to represent a grammar is called the Backus-Naur
Form (BNF). It is a notation for expressing productions. Another common formalism is
given in
// D1.2.2 Grammar
A grammar G[Z] is a 4-tuple (VN, VT, P, Z) where VN, VT are alphabets of non-
terminals and terminals, respectively, with the non-terminals naming sets used to build up
the language;
P is a finite set of productions, also called rules, of the form U → u, where U belongs to
VN and u is a string over VN ∪ VT;
Z is the distinguished non-terminal which names the set that constitutes the whole
language defined by the grammar. //
For the language of arithmetic expressions with operators + and *, we have the grammar
where P consists of
E → E + E | E * E | (E) | id
and Z is E.
Nothing in the above has shown the real meaning of the claim that the ‘grammar defines
the language’. We now bring this out by showing that the grammar is a device that indeed
generates its language. In the next session, we show that the (same) grammar is a device
for deciding which strings are in its language. The subject of derivation is usually the
easiest part of the course.
1.3 Derivations
Let G[Z] be a grammar and v and w strings over V. We make the following statements:
D1.3.1 Direct Derivation
We say that v directly produces w, written
v ⇒ w
if it is true that we can write v = xUy and w = xuy for some strings x and y, where
U ::= u is a rule of G.
D1.3.2 Derivation
Now we drop the qualifier ‘direct’. We say that
(a) v produces w,
(b) w is a derivation of v,
(c) w reduces to v,
written v ⇒+ w, if there is a sequence of direct derivations leading from v to w.
//// Example: consider the grammar G[S] with rules 1. S ::= a and 2. S ::= Saa.
(i) S ⇒ a by rule 1; hence a is a sentence.
(ii) To derive aa, we must proceed via rule 2, it being obvious that rule 1 won’t take us
far. Suppose we thus start out as
S ⇒ Saa
Our next move must be such that Saa ⇒ aa. But, since the rule S ::= Ɵ does not appear
in the grammar, we have reached a dead end. Conclusion: ‘aa’ is not a sentence in L(G).
(iii) S ⇒ Saa (rule 2)
     S ⇒ Saa ⇒ aaa (rule 1)
So the rule sequence 21 takes us from the sentence symbol to the trial sentence: ‘aaa’ is in
L(G). A little reflection, or induction on the length of the string of a’s, will convince the
reader that in fact
L(G) = {a^(2m+1) : m = 0, 1, 2, …}
(c) Consider the grammar G[a]
a ::= i := e;
e ::= t | e + t
t ::= i | t * i
i ::= A | B | C
Now, is the string
‘C := B + C * A;’ in L(G[a])?
As before, this translates to trying to see whether
a ⇒+ C := B + C * A;
We take off:
a ⇒ i := e;           (r1)
  ⇒ C := e;           (r8)
  ⇒ C := e + t;       (r3)
  ⇒ C := t + t;       (r2)
  ⇒ C := i + t;       (r4)
  ⇒ C := B + t;       (r7)
  ⇒ C := B + t * i;   (r5)
  ⇒ C := B + i * i;   (r4)
  ⇒ C := B + C * i;   (r8)
  ⇒ C := B + C * A;   (r6)
and thus confirm that ‘C := B + C * A;’ is in L(G[a]), with the rule sequence
1832475486 to prove it. ////
The derivation, as we have presented it, follows a simple algorithm which we key to the
syntactic class as follows:
A1.3 String derivation from a syntactic class
1. Write down the definition of the class <C> as a string S.
2. If S contains no class names, stop.
3. Select a class name (scanning left to right) from S; say <X>.
4. Select a rule (scanning left to right) in which <X> appears on the left.
5. Replace the occurrence of <X> in S by the right-hand side of that rule.
6. Return to step 2.
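A1.3 is mechanical enough to code directly. Below is a C++ sketch for a toy grammar;
the rules, and the policy of a single fixed right side per class name, are assumptions made
purely for illustration.

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        // Hypothetical rules: each class name <X> has one right side.
        std::map<std::string, std::string> rules = {
            {"<assign>", "<id> := <expr>"},
            {"<expr>",   "<id> + <id>"},
            {"<id>",     "x"}
        };
        std::string s = "<assign>";                      // step 1
        for (;;) {
            std::string::size_type b = s.find('<');      // step 3: leftmost class name
            if (b == std::string::npos) break;           // step 2: no class names left
            std::string::size_type e = s.find('>', b);
            std::string name = s.substr(b, e - b + 1);
            s.replace(b, name.size(), rules.at(name));   // steps 4-5: rewrite
            std::cout << s << "\n";                      // step 6: repeat
        }
        // Output ends with the derived sentence: x := x + x
    }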
Derivation Tree
Our formalism for expressing a derivation has been linear. A derivation may also be
presented in graphical form, as a tree structure called a derivation (or syntax or parse)
tree.
Formally, let G = (VN, VT, P, Z) as before. A tree is a derivation tree for G if:
1. every vertex has a label which is a symbol of VN ∪ VT ∪ {Ɵ};
2. the label of the root is Z;
3. if a vertex is interior and has label A, then A must be in VN;
4. if a vertex n has label A, and vertices n1, n2, …, nk are the sons of vertex n in order
from left to right, with labels X1, X2, …, Xk respectively, then
A ::= X1 X2 … Xk
must be a production in P;
5. if a vertex n has a label in VT, then n is a leaf.
//// Example: consider a grammar containing the rules
A ::= ZbA | ZZ | ba
We draw a tree. [Figure: a candidate tree with vertices numbered 1–11 and labels drawn
from Z, A, a and b]
By checking the satisfaction or otherwise of the conditions in the definition, the reader can
easily show whether or not the tree as given is a parse tree for the given grammar.
D1.3.2 Phrase
Let G[Z] be a grammar. Let w = xuy be a sentential form. Then u is called a phrase of w
for a non-terminal U if:
(i) Z ⇒* xUy, and
(ii) U ⇒+ u
We conclude that <no> 1 is a phrase.
(ii) For 1: <number> ⇒* <no> <digit> and <digit> ⇒+ 1, so 1 is a phrase; and since
<digit> ::= 1 is a rule of GN, 1 is a simple phrase as well. ////
D1.3.3 Handle
A handle of a sentential form is a leftmost simple phrase. The similarity between a
derivation and the syntax tree for the same sentential form is obvious.
Language Hierarchy
2.0 Introduction
As you already know, a grammar is a generative structure or device. Given a grammar G,
the set of strings it generates is called the language L(G) of the grammar. Elements of this
set are called sentences. Note that L doesn’t have to be Igbo, but it could be. L can be
BASIC, FORTRAN, Pascal, an assembly language, etc. There is a hierarchy, at times
called the Chomsky hierarchy (in honour of the American linguist who developed it), of
language types. In this session, we present this hierarchy and thus define the class of
languages described as regular. The usual convention in language theory, in computer
science as in linguistics, is that if a language L is generated by a type x grammar, then L is
said to be a type x language.
To force the reader to refresh his knowledge of the concept of a grammar, and to introduce
a notation which will make us smile after a little frown, we define a grammar G by:
A grammar is a quadruple G = <T, N, P, S>, where T and N are an alphabet and an
auxiliary alphabet, respectively, S ∈ N, and P ⊆ (T ∪ N)* × (T ∪ N)* is a relation.
By the way, in case you have forgotten, a binary relation on a set S is a subset of S×S,
where S×S is defined as the Cartesian product, S×S = {(a, b) | a, b ∈ S}. Take, for
example, the set S of natural numbers. S×S is the familiar Cartesian product consisting of
the Cartesian co-ordinates, e.g., (0, 0), (1, 3), (17, 2), etc. The subset consisting of (0, 0),
(1, 4), (3, 12), (13, 42), (2, 8), (9, 36), etc. defines a relation on S. In fact, whenever you
draw a line graph
on the Euclidean plane you are defining a relation on R, the set of real numbers. Back
home, if V is a vocabulary, we can obtain a binary relation by listing the pairs consisting
of left part and right part of rules of the grammar, i.e., we include the pair (x, y) only if in
our grammar there is a rule in which x is the left part and y is the right part.
Members of P are called productions or (rewriting) rules. A rule (α, β) is usually written as
α → β. Given a rule (α, β), α is called the antecedent and β is called the consequent of
(α, β). Members of T are called terminal symbols. Members of N are called non-terminals,
or syntactic categories, or metalinguistic variables, or simply variables. S is called the
distinguished metalinguistic variable. This kind of grammar is called a constituent
structure grammar. The term constituent structure grammar implies that the grammar so
described does, by means of variables, impose structure on the constituents (or parts) of
sentences in the language generated by the grammar. If you are still confused, remember
the English language teachers in the secondary school. They were sweating on the nose
trying to tell you about ‘parts of speech’, i.e., things like noun, adjective, clause, gerund,
etc. We are doing the same thing here from an advanced standpoint; after all, we are in the
higher institution, and as you go higher, you learn more about less.
In case you are still feeling uncomfortable with these symbols, what does an expression
like σi = αβγ ⇒ αβ′γ = σi+1 say? Simple: consider the grammar G[<Number>] of example
1.2.2. Relative to this grammar, we can say that <Number> ⇒ <no> because there is a rule
(<Number>, <no>); here σi is <Number> and σi+1 is <no>. This will be true if we take
α = γ = λ, the null string, i.e., <Number> is β and <no> is β′, because (<Number>, <no>)
is a rule, or <Number> → <no>. Thus, if the expression looks like αβγ ⇒ αβ′γ, then the
prefix of each side is α and the suffix is γ. Hopefully you can now breathe more easily.
//// Definition 2.1.3
Let T = {a, mango, ate, bought, child, green, man, ube, the}
Using the above notation, the reader can show that the string
‘a.mango.bought.the.green.green.green.ube’ is in L(G), where the dot denotes
concatenation. That is, the atrocity is syntactically correct in the given grammar. ////
A grammar is of Type 1 (context-sensitive) only if all its rules take the form
αAβ → αγβ, where α, β ∈ (T ∪ N)*, A ∈ N, γ ∈ (T ∪ N)* − {λ}, and λ is the null string.
The term context-sensitive refers to the fact that we are allowed to rewrite A as γ only in
the context α…β.
The grammar G defined by
S → aSBC | abC
CB → BC
bB → bb
bC → bc
cC → cc
is context-sensitive.
A grammar is of Type 2 only if all its rules take the form A → γ, where A ∈ N and
γ ∈ (T ∪ N)* − {λ}. A Type 2 grammar is also known as a context-free grammar. ////
The grammar in Example 2.1.1 is context-free. The following important and classical
grammar is also context-free:
A grammar is of Type 3 only if all its productions are of the form A → α or A → αB,
where A, B ∈ N and α ∈ T* – {λ}; or alternately, they are all of the forms
A → α or A → Bα.
//// Example 2.1.2
S ::= abS | cA
A ::= baA | a.
L(G) = (ab)*c(ba)*a. The right-hand side is an example of what is called a regular
expression (discussed later).
S ::= Aa
A ::= Aba | Bc | c.
The grammars we have described above are called restricted grammars. A grammar that
has no restrictions is called an unrestricted grammar (as if you don’t know that already).
Definition
1. φ is a regular expression denoting the empty set.
2. λ is a regular expression denoting the regular set {λ}.
3. a ∈ Σ is a regular expression denoting the regular set {a}.
4. If p, q are regular expressions denoting regular sets P, Q respectively,
then
(a) (p+q) is a regular expression denoting the regular set P ∪ Q,
(b) (pq) is a regular expression denoting the regular set PQ,
(c) (p)* is a regular expression denoting the regular set P*.
5. Nothing else is a regular expression.
Example
The expression (0+1)*1, for instance, denotes the set of all strings of 0’s and 1’s ending
in 1.
LEXICAL ANALYSIS
OVERVIEW OF LEXICAL ANALYSIS
To identify the tokens, we need some method of describing the possible tokens that
can appear in the input stream. For this purpose we introduce regular expressions, a
notation that can be used to describe essentially all the tokens of a programming
language.
Secondly, having decided what the tokens are, we need some mechanism to
recognize these in the input stream. This is done by token recognizers, which are
designed using transition diagrams and finite automata.
At the lexical analysis stage, a program called a lexical analyser or scanner scans the
characters of a source program from left to right and builds up the tokens of the program.
The actual procedure is as follows: when a source statement is read, the first character
scanned enables a token category to be identified. For example, scanning a digit first will
indicate the token category integer. Similarly, scanning a letter first will indicate the token
category identifier or reserved word. On identifying this token category, the lexical
analyser keeps scanning and concatenating, character by character, consecutive characters
of that token until a separator is encountered.
A separator can be any of the following:
a blank character
a conventional arithmetic operator
a delimiter such as comma, semi-colon, etc.
a character that does not actually belong to the set of expected characters that
form a particular token class. For example, in the token class integer, a scan of
any character other than the digits 0 – 9 will serve as a separator.
The tokens of a source program are usually passed to the next phase of compilation
(syntax analysis) in an internal form. For example, ID, LT, OP, KW and DL may be the
internal representations of the token types identifiers, literals, operators, keywords and
other delimiters respectively. These internal representations and the actual values of the
tokens (i.e. the actual identifier name and actual literal) are passed to the next phase of
compilation. Alternatively, different integers may be used to denote each of these token
types. For example, the integers 01, 02, and 03 may be used to represent token identifier,
integer, and other literals respectively, while other unique integers may be used to
represent each unique reserved word, operator and delimiter in the language. This method
allows the tokens to be treated in a uniform way. The internal forms in which tokens are
represented have the same length regardless of whether a token is 1 or 30 characters long.
This allows the remaining steps in compilation to be done in an efficient manner with
fixed-length symbols rather than variable-length strings of characters.
Token Passed
Step   Internal Representation   Value
1      LT                        10
2      KW                        LET
3      ID                        B
4      OP                        =
5      LT                        5.3
6      OP                        *
7      LT                        3.2
8      OP                        +
9      ID                        A
Table 1: Sample internal representation of tokens
In addition to recognizing tokens, the scanner must recognize and delete blanks and
redundant elements such as comments. It is also the responsibility of the scanner to note
and report lexical errors such as invalid characters or improperly formed identifiers.
Lexical analysis can be programmed as a separate pass which performs a complete lexical
analysis on the source program and produces an output. A pass involves the reading of the
entire source program statements. However, it is sometimes common to program the
scanner as a subroutine called by the syntax analyser. Using the latter approach, when the
syntax analyser needs a new token, it calls the scanner which recognizes the next source
program token and passes it to the syntax analyser.
Example:
Description of tokens
A pattern is a rule describing the set of lexemes that can represent a particular token in a
source program. For instance, for the token identifier, the pattern might be “a letter
followed by zero or more letters or digits”, and maxsize would be one matching lexeme.
SYNTAX ANALYSIS
The second step in compilation is syntax analysis. At this stage, a complete syntactic
check is performed on the source program. The syntax of a programming language relates
to the grammatical rules governing the formation of legal statements in the language, i.e.,
the rules for combining tokens in appropriate sequence to form a valid sentence or
syntactic construct. Common examples of syntactic constructs in programming languages
are arithmetic expressions and assignment statements.
For example, the parser ensures that two operators do not immediately follow each other
in an arithmetic expression. Again, it ensures that left parentheses are matched with
corresponding right parentheses. It also recognizes B = 5.3 * 3.2 + A as a valid assignment
statement. The parser should report any syntax errors in an intelligible fashion.
There are, therefore, two types of parsing methods – top-down parsing and bottom-up
parsing:
1. Top-down parsers, which build parse trees from the top (root) to the bottom (leaves);
2. Bottom-up parsers, which build parse trees from the leaves and work up to the root.
In EBNF-style descriptions: a syntax expression of the form X | Y defines sentences of the
form X or of the form Y; a syntax of the form X Y Z defines sentences that consist of a
sentence of the form X followed by a sentence of the form Y followed by a sentence of the
form Z; a syntax of the form [X] defines zero or one occurrence of the form X; and a
syntax of the form {X} defines zero or more occurrences of the form X.
A usual implementation of an LL(1) parser is a recursive-descent parser with one
procedure per non-terminal and one token of look-ahead.
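Here is a minimal sketch of that scheme in C++, for the expression grammar used earlier
rewritten without left recursion (expr → term { (+|-) term }, term → number | id); the class
and helper names are illustrative assumptions.

    #include <cctype>
    #include <iostream>
    #include <stdexcept>
    #include <string>

    // Recursive-descent (LL(1)) sketch: one function per non-terminal,
    // one character of look-ahead via peek().
    class Parser {
        std::string in;
        std::size_t pos = 0;
        char peek() {
            while (pos < in.size() && in[pos] == ' ') ++pos;   // skip blanks
            return pos < in.size() ? in[pos] : '\0';
        }
        void term() {                                  // term -> number | id
            char c = peek();
            if (std::isdigit((unsigned char)c))
                while (std::isdigit((unsigned char)peek())) ++pos;
            else if (std::isalpha((unsigned char)c))
                while (std::isalnum((unsigned char)peek())) ++pos;
            else
                throw std::runtime_error("expected number or id");
        }
    public:
        explicit Parser(std::string s) : in(std::move(s)) {}
        void expr() {                                  // expr -> term { op term }
            term();
            while (peek() == '+' || peek() == '-') { ++pos; term(); }
        }
        bool accept() { expr(); return peek() == '\0'; }
    };

    int main() {
        Parser p("x + 2 - y");
        std::cout << (p.accept() ? "valid" : "invalid") << "\n";   // valid
    }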
In a formal approach, without going into too much detail, BNF (Backus Normal Form) is a
notation for describing the formation of syntactic entities in a PL. For example, the BNF
representation of the syntax of a simplified PRINT statement in a simple PL is:
<PRINT statement> ::= PRINT | PRINT <Expr>
<Expr> ::= <ldn> | <ldn><Del><Expr>
<ldn> ::= <Letter> | <Letter><Var>
<Var> ::= <Letter> | <Digit>
<Del> ::= , | ;
<Letter> ::= A | B | … | Y | Z
In BNF, items enclosed in angle brackets (i.e. “< >”) are called syntactic entities, while
those not so enclosed are the tokens of the language. Thus, <Expr>, <Var>, <PRINT
statement>, etc. are syntactic entities, while the word PRINT, A, B, 1, 2, etc. are tokens in
the language.
In simple English, the first line of the BNF notation means that the syntactic entity
<PRINT statement> can be formed by either the word PRINT or PRINT followed by
<Expr>. The second line of the notation means that the syntactic entity <Expr> is formed
either by <ldn>, or by <ldn> followed by <Del> and then followed by <Expr>. In the same
vein, <Letter> is composed of any of the letters A or B or … or Z.
Through a number of techniques, the syntax analyser is able to detect whether or not
syntactic entities are formed by appropriate combinations of smaller syntactic entities
and/or tokens.
Symbol Table
At the syntax analysis stage, information about identifiers, literals, loops, and so on is
kept in tables of information. A particularly important table is the symbol table or
identifier table. It is a table containing the identifiers used in the source program along
with their attributes. The attributes of an identifier include its type (e.g. integer, real,
string), form (e.g. constant, simple variable, subscripted variable), block number for a
structured language like ALGOL, location in memory, and other attributes relevant to the
language.
Identifiers Attributes
Entry 1 AREA
Entry 2 BREADTH
Entry 3 LENGTH
Since the number of characters in identifier names varies, pointers to the identifiers are
often stored in the column for the identifier, while the identifiers themselves and their
lengths are stored in a special string list. This approach keeps the size of the identifier
field fixed and thus facilitates searching for information about a particular identifier.
Figure 5.2 shows this approach to structuring a symbol table. Some simple methods of
organising symbol tables are given in the next Modules.
[Figure 5.2: a symbol table with 3 entries whose actual identifiers (AREA, BREADTH, LENGTH) and their lengths are stored in a separate string list]
Each time an identifier is encountered, the symbol table is searched to see whether the
name has been previously entered. If the identifier is new, it is entered into the table. The
information collected in the symbol table is used in semantic analysis, that is, checking
that uses of variable names are consistent with their implicit and explicit declarations.
Here we need to know how much, and what kind of, run-time storage must be allocated to
a variable name.
The whole translation process revolves around the symbol table. Every occurrence (use or
declaration) of an identifier will call for a search of the symbol table. As a result, it is
necessary that a symbol table be structured in a manner which allows efficient extraction
and insertion of information.
The syntax analyser also reports syntactic errors such as multiple declaration of identifiers,
and other violations of the syntax of the language.
SEMANTIC ANALYSIS
The semantic analysis phase deals with the interpretation of the meanings associated with
the syntactic units of the source program. When the parser has recognized a source
language construct, the semantic analyser will check it for semantic correctness, and store
necessary information about it in the symbol table or create an intermediate form of the
source program for it. For instance, when an assignment statement of the form
<variable name> = <expression>
is recognized, the semantic analyser will check to ensure that there is type compatibility
between the <variable name> and the <expression> and will create an intermediate form
of the source program for the assignment statement. For a simple declarative statement,
the semantic analyser will check to ensure that no variable has been doubly declared, and
necessary additional information about the declared variable is entered into the symbol
table.
Other semantic checks include ensuring that every “BEGIN” statement has a matching
“END”, every “DO” statement has a matching “CONTINUE” in FORTRAN, and every
“ELSE” statement has a matching “IF”. In addition, complex or compound statements,
such as a DO loop, are broken down into simpler, easier-to-follow statements at the
semantic analysis stage. For example, the FORTRAN loop
DO 10 I = 1, 50, 2
   <body of loop>
10 CONTINUE

is broken down into:

   I = 1
   J = 50
   K = 2
11 ……………..
   ……………..        (body of loop)
10 CONTINUE
   I = I + K
   IF (I .LE. J) GO TO 11
   ……………..
The semantic analyser also reports semantic errors discovered at this stage. Examples of
such errors include type incompatibility, transfer of control to non-existing statement,
duplicate labels on statements, etc.
[Figure: syntax trees for assignments over Y, A, B and C built with + and *]
In Polish notation, the operators come immediately after their operands. It is therefore
sometimes called suffix or postfix notation. For instance, the infix expression A + B has
the Polish form AB+, A + B * C becomes ABC*+, and (A + B) * C becomes AB+C*.
Polish notation is the usual intermediate form of source program used by languages that
are commonly interpreted, e.g. SNOBOL, BASIC, etc.
It is easy to extend Polish notation to other operators, as long as operands are immediately
followed by their operators. For instance, an assignment statement with the usual syntax
can be written with the assignment operator last. In the quadruple form, T1 and T2 are
temporary variables created by the compiler. Note that the quadruples appear in the order
in which they are to be executed. Furthermore, the operands in the above example would
not be the symbolic names but pointers (or indexes) to the symbol table elements which
describe the operands.
It is important to realize that compiler-directed instructions like PL/1’s DECLARE
statement or FORTRAN’s DIMENSION statement do not have an intermediate form.
They are not machine instructions but instructions that provide information that will help
the compiler in doing its work of compilation. All information within these statements is
normally entered into the symbol table, and no machine code is required to be generated
for them. They are sometimes called compiler-directed statements. Extension of this
intermediate form of source program to other constructs is provided in the next modules.
CODE GENERATION
The last step in compilation is the code generation. This is the actual translation of the
intermediate form of source program into machine language.
Types of Code
There are 3 types of code into which a source program can be translated. Each of them
has its own pros and cons; a compiler designer normally makes use of one of the 3 forms.
These codes are:
i. Absolute ML instructions, placed directly in fixed locations in memory, which the
compiler then executes immediately (load-and-go compiler).
ii. Assembly language, which must then be assembled.
iii. A machine language program placed on a secondary storage device, which must be
linked with other sub-programs and then loaded in order to be executed.
a. ABSOLUTE MACHINE LANGUAGE CODE
Disadvantages
Subprograms cannot be pre-compiled separately; all subprograms must be compiled
at the same time.
Storage is wasted, because the compiler occupies a large part of it.
Compilation will be done every time the program is to be executed.
This form should be generated by a compiler in an environment where programs tend
to be small and where much debugging is done, e.g. a training environment.
b. CODE IN ASSEMBLY LANGUAGE
Advantages
Easiest, since AL is easier to deal with than binary digits.
Calls on macros, whose definitions have been previously written in AL, can be easily
generated. For example, on some machines it takes 3 or 4 instructions to convert
a floating point number to fixed point. Instead of generating these 3 or 4 instructions,
the compiler generates a call on a macro FIX and lets the assembler produce the
instructions later. This also helps to reduce the size of the compiler.
Disadvantage
It adds one extra step (assembler) to the process of translating a program and
this extra step takes as long as the compilation itself.
c. CODE IN MACHINE LANGUAGE
Disadvantage
It requires linking and loading, and this can be time consuming.
In IBM terminology, the linking of several object modules to form a load module is
performed by a linkage editor. The final loading of the load module is done by a loader.
Storage Allocation
Prior to the actual generation of code in either assembly language or machine language,
storage has to be assigned to all identifiers (both user-defined and compiler-generated,
such as those needed for temporarily storing an intermediate result) and literals. For
instance, consider the FORTRAN declarative statement below:
INTEGER AREA, BREADTH, LENGTH
For this statement, a portion of the symbol table similar to the one in figure 5.3 may be
built. It is the responsibility of the storage allocation routine to assign the required
location to each identifier.
As shown in the table, four bytes are reserved for each integer variable. The first variable
will be assigned to relative location zero, the second to 4, and so on. Of course, the
absolute addresses cannot be known until load time. These relative assigned addresses are
used by the compiler in later phases for proper accessing.
Code Optimization
During the intermediate code generation, it is possible to generate some redundant codes.
Redundant in the sense that their elimination would not have any adverse effect on the
expected result. The removal of these redundant codes is called code optimization. It is a
step designed to improve the efficiency of the object code in terms of both storage
utilization and execution time.
To illustrate this, it will be observed that eight codes are generated for the statement W =
X * Y + Z. However, it is possible to generate a more efficient code for this statement. For
instance, the third and fourth codes are not really necessary. Their elimination will not
affect the result provided by the rest codes. This is because the product of the contents of
memory location X and Y are already in the accumulator; all that is needed is to just add
the content of memory location Z to it.
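A plausible unoptimized sequence for W = X * Y + Z, assuming the same single-
accumulator instructions as the optimized version below and two compiler temporaries
T1 and T2, would be:

    LDA X        load X into the accumulator
    MUL Y        accumulator = X * Y
    STA T1       3rd code - store the product (redundant)
    LDA T1       4th code - reload it (redundant)
    ADD Z        accumulator = X * Y + Z
    STA T2       6th code - store the sum (redundant)
    LDA T2       7th code - reload it (redundant)
    STA W        store the result in W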
The same principle applies to the sixth and seventh codes. At the code optimization stage,
the 3rd, 4th, 6th and 7th codes would be eliminated to give:
LDA X
MUL Y
ADD Z
STA W
i. Folding of constants, i.e., performing at compile time operations whose operands are
known. Consider:
A = 7 * 22 / 7 * R ** 2
The compiler can perform the multiplication and division involving the constants and
substitute 22 * R ** 2 for the original expression. This is especially useful if such an
expression occurs within a loop.
ii. Moving operations whose operands do not change within a DO-loop out of the
loop. Consider the program segment below as an example:
KSUM = 0
DO 10 K = 1, 100
KSUM = KSUM + K
IX = 25
10 CONTINUE
It should not be difficult to see that the statement IX = 25 within the DO-loop will
always produce the same result. It could be moved out of the loop.
There is also machine-dependent optimization, done while actually generating code. This
is achieved by exploiting features of the machine, such as better utilization of the
available registers and the instruction set of the machine.
LEXICAL ERRORS:
Lexical errors are the errors thrown by your lexer when it is unable to continue, i.e., when
there is no way to recognise a lexeme as a valid token for your lexer. Syntax errors, on the
other hand, are thrown by your parser when a given set of already recognised valid tokens
does not match any of the right sides of your grammar rules. A simple panic-mode error-
handling system requires that we return to a high-level parsing function when a parsing or
lexical error is detected.
Exercise
1 (a) What is a compiler?
(b) Describe the four basic steps involved in the compilation of a program written in a
high level language.
2 (a) What do you understand by
i. Polish notation  ii. quadruple form?
(b) Represent the arithmetic expression
a - b(c/d * e/f)
in (i) Polish notation (ii) Quadruple form (iii) Syntax tree
DO CASE
   CASE CHAR = Digit (0-9)
      TOKEN = TOKEN + CHAR
      Scan next character (CHAR)
   ENDDO
   TOKEN = Numeric literal
   OTHERWISE
      CALL Error handling routine
ENDCASE
ENDDO
ENDDO
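A C++ rendering of the digit-collecting case above (a sketch; the helper name scanNumber
is illustrative):

    #include <cctype>
    #include <iostream>
    #include <string>

    // Accumulate digits into the token until a non-digit separator appears.
    std::string scanNumber(const std::string& in, std::size_t& pos) {
        std::string token;
        while (pos < in.size() && std::isdigit((unsigned char)in[pos]))
            token += in[pos++];          // TOKEN = TOKEN + CHAR
        return token;                    // TOKEN is a numeric literal
    }

    int main() {
        std::string src = "123+45";
        std::size_t pos = 0;
        std::cout << scanNumber(src, pos) << "\n";   // prints 123
    }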
ii. Reserved words consist of BEGIN, ABS, INT, and END.
iii. Identifiers (which may not be a reserved word and must start with a letter,
followed by zero or more letters or digits).
iv. Integers.
B. At least one blank character must separate adjacent identifiers, integers and/or
reserved words; and no blank may separate adjacent characters of a token.
C. The double character /* begins a comment, and the comment ends at the first
occurrence of the double character */.
For the above language, figure 6 below is a simplified version of the state diagram for the
scanner. In the diagram, D stands for the digits 0-9, L for the letters A-Z, and DELIM
represents single-character delimiters: +, ( and ), excluding / and *.
For example, beginning from state START, if the first character scanned makes us
proceed to state INT, we stay in this state as long as we can scan a digit; if a non-digit is
scanned, we proceed along the arc to output the token and then proceed to START again.
[Figure 6: scanner state diagram with states START, INT, STAR, DELIM and ERROR; on completing a token, each accepting state returns to START]
Argument    Value
Entry 1
Entry 2
…
Entry n

In the table, ‘argument’ refers to the identifiers, while ‘value’ refers to their attributes. As
already described in the preceding chapter, it is pointers to the identifiers that are usually
stored in the argument, while the identifiers themselves and their lengths are stored in a
special string list.
There are many techniques of organising tables. Three simple methods are hereby
considered briefly.
Symbol Table
Entry 0
ADE    → entry 11000001
BELLO  → entry 11000010
J      → entry 11001010

In this technique, the cost of a search is only the cost of doing the hashing, as long as 2
symbols do not hash into the same index.
When 2 symbols hash to the same index, a collision is said to occur (for example, the
symbols ADE and AJAO will collide). A good hash function usually minimizes the occurrence
of collisions. Two methods of resolving collisions, details of which will not be discussed
here, are re-hashing and chaining.
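A small C++ sketch of the chaining approach (the hash function and the table size are
illustrative choices, not prescribed ones):

    #include <iostream>
    #include <list>
    #include <string>
    #include <vector>

    // Hashed symbol table with chaining: colliding identifiers share a bucket.
    struct SymbolTable {
        std::vector<std::list<std::string>> bucket{16};
        std::size_t hash(const std::string& id) const {
            std::size_t h = 0;
            for (char c : id) h = h * 31 + (unsigned char)c;
            return h % bucket.size();
        }
        void insert(const std::string& id) { bucket[hash(id)].push_front(id); }
        bool lookup(const std::string& id) const {
            for (const std::string& s : bucket[hash(id)])
                if (s == id) return true;     // walk the chain on collision
            return false;
        }
    };

    int main() {
        SymbolTable t;
        t.insert("ADE");
        t.insert("AJAO");                     // may collide; chaining keeps both
        std::cout << t.lookup("ADE") << t.lookup("BELLO") << "\n";   // prints 10
    }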
a) Assignment Operator
b) Unary Operator
One simple way of handling a unary operator like unary minus is to write it as a binary
operator, for example, 0-B for -B. Thus
Z + (-W + Y*X)
in Polish notation now becomes
Z 0 W - Y X * + +
An alternative approach would be to introduce a special symbol for the operator, e.g.
@ for unary minus. Thus Z + (-W + Y*X) in Polish notation becomes Z W @ Y X * + +
c) Unconditional Jump
d) Conditional Jump
Similarly, operations such as branch if negative (BN), branch if positive (BP), etc would
be allowable. Note that BZ, BN, BP, etc. are binary operators unlike BR which is a unary
operator.
e) Conditional Statement
Where <expr> has a boolean value (1 = true, 0 = false), L1 is the label of the statement
beginning <stmt 2>, and L2 the label of the statement or symbol following <stmt 2> (i.e.
the statement that follows the last statement in <stmt 2>). The operators BZ and BR are as
defined previously. For example, in the statement
IF A = B THEN C = B + A ELSE D = A - B
BNZ means branch if not zero, L1 represents the label for the statement D = A - B, and L2
represents the number of the statement or symbol following the complete IF statement
given above.
Here the number in parentheses represents the position, location or number of the symbol
in the Polish string. For example, the first A, BNZ, C and the last B respectively occupy
the 1st, 5th, 6th and 15th positions in the string.
f) For-Loops
This construct would need to be broken down into its primitive forms with
appropriate initialization, testing and branching.
N.B.
20 is assumed to be the statement number of the statement following the end of the DO
LOOP body.
(1) I 1 =
(4) S S I + =
(9) I I 1 + =
(19) ----------------
Extending Quadruples
Extension of quadruples to other operators is not difficult either; some examples are
shown below.
(a) Unary Operator
In quadruple notation, the common convention for unary operators is that they do not
make use of the second operand. For example, -X in quadruple form would be
-, X, , T1
and the expression (Y + Z) * (-X) would be the sequence
+, Y, Z, T1
-, X, , T2
*, T1, T2, T3
(b) Assignment Operator
W = X + Y * Z
is the sequence
*, Y, Z, T1
+, X, T1, T2
=, T2, , W
(c) Unconditional Jump
(d) Conditional Jump
BZ, i, OP
This notation means: if the operand OP is zero, then branch to quadruple i. BN (if OP is
negative) and BP (if OP is positive) are similarly defined.
Another quadruple for conditional jump (with a relational operator) can be
BE, i, OP1, OP2
The quadruples defined for the jump instruction, can be applied to conditional statement as
shown in the following examples.
Example 1
The quadruple for
IF I > J THEN K = K * G
could be
(1) -, I, J, T1
(2) BP, (4), T1
(3) BR, (6)
(4) *, K, G, T2
(5) =, T2, , K
(6) ---------
Alternatively, it could be written as
(1) BG, (3), I, J
(2) BR, (5)
(3) *, K, G, T1
(4) =, T1, , K
(5) ….....
Note
The number in parentheses indicates the quadruple sequence number.
Example 2
The quadruple for
IF A = B THEN A = C + D ELSE B = C - D
could be either of the following:

1) -, A, B, T1           (1) BNE, (5), A, B
2) BZ, (6), T1           (2) +, C, D, T1
3) -, C, D, T2           (3) =, T1, , A
4) =, T2, , B            (4) BR, (7)
5) BR, (8)               (5) -, C, D, T2
6) +, C, D, T3           (6) =, T2, , B
7) =, T3, , A            (7) ----------------
8) ------------------
where BNE means branch if the 2 specified operands are not equal, and BR is as
previously defined.
(f) For-Loops
The approach to writing quadruples for this construct is to break it down into its primitive
form, with the appropriate initialization, testing and branching. For example, consider a
loop that runs a variable I from an initial value N1 to a final value N2 in steps of N3.
This primitive form assumes that the for-loop given is a pretest iteration, meaning that
the test for the stopping point is carried out before executing the body of the loop.
Furthermore, it is assumed that the final value (N2) is greater than or equal to the initial
value (N1). The quadruples for the for-loop are as shown below.
1) =, N1, , I
2) -, I, N2, T1
3) BP, (99), T1
-------------
-------------
95) ------------
96) +, I, N3, T2
97) =, T2, , I
98) BR, (2)
99) ----------
In the above, quadruples number 4 and 97 respectively begin and end the body of the loop.
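The control flow of these quadruples can be read off directly as C with gotos. The sketch below is illustrative only; the parameter values and the one-statement body are assumptions:

#include <stdio.h>

int main(void) {
    int N1 = 1, N2 = 10, N3 = 2;     /* assumed initial, final and step values */
    int I, T1, T2, sum = 0;
    I = N1;                          /* 1)  =, N1, , I   */
test:
    T1 = I - N2;                     /* 2)  -, I, N2, T1 */
    if (T1 > 0) goto done;           /* 3)  BP, (99), T1 */
    sum = sum + I;                   /* the loop body    */
    T2 = I + N3;                     /* 96) +, I, N3, T2 */
    I = T2;                          /* 97) =, T2, , I   */
    goto test;                       /* 98) BR, (2)      */
done:
    printf("%d\n", sum);             /* 99) first statement after the loop; prints 25 */
    return 0;
}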
Extending Syntax Trees to Other Operators
Syntax trees can similarly be extended, as graphically illustrated in the examples below.
Example 1
(figure: a syntax tree with leaves A and B)
Example 2
The syntax tree for
IF e THEN S1
Could be
     IF-THEN
     /     \
    e       S1
Example 3
The syntax tree for
IF e THEN S1 ELSE S2
Could be
     IF-THEN-ELSE
     /     |     \
    e      S1     S2
Example 4
The syntax tree for
IF A = B THEN A = C + D ELSE B = C - D
Could be
            IF-THEN-ELSE
           /      |      \
          =       =       =
         / \     / \     / \
        A   B   A   +   B   -
                   / \     / \
                  C   D   C   D
Example 5
The syntax tree for
DO <Stmtno> V1 = V2, V3, V4 (Fortran DO loop)
could be

        DO-FROM-TO-INCREASE-BY
        /      |      |       \
  <Stmtno>     =      =        V4
              / \    / \
            V1   V2 V1   V3
Exercise
Draw the syntax tree for: i + i + i
Single-Pass and Multi-Pass Compilers
Single-pass compilers perform the four steps of compilation (lexical, syntax and semantic analysis, and code generation) in succession for each source statement. In a typical single-pass compiler, the parser calls the scanner to obtain tokens and, when a construct is recognized, calls a routine that does semantic checking, storage allocation and code generation for the construct before returning to the parser.
In a multi-pass compiler these steps are performed as separate passes; that is, each step may be wholly or partially completed for all the source statements before the next step is begun.
By contrast, a single-pass compiler will take each source statement and generate object code for it immediately. An example of a multi-pass compiler is the IBM 360 FORTRAN IV compiler, which has four passes.
Single-pass compilers are usually faster than multi-pass compilers; however, they may produce less efficient object code, since they are unlikely to perform extensive code optimization. They also require that all identifiers be declared before they are used. A single-pass compiler is useful in a training environment; a production environment will require a compiler that does more optimization.
The number of passes to use is determined by factors such as:
i. Amount of available main memory - more passes where memory is very limited, since less of the compiler need be resident at once.
ii. How fast the compiler should be - the fewer the passes, the faster the compiler.
iii. How efficient the object program should be - fewer passes imply less efficient object code.
iv. Environment of use - fewer passes for a training environment, many passes for a production environment.
Compilers versus Interpreters
This section briefly compares and contrasts the two methods of implementing a high-level language (HLL). A HLL is generally implemented either by a compiler or by an interpreter: a compiler translates the HLL program into its machine-language equivalent for later execution, while an interpreter translates and executes the program statement by statement.
A compiler processes the program statements in their physical input sequence, while an interpreter follows the logical flow of control through the program. An interpreter may process some statements repeatedly (if they are part of a loop) and may ignore others completely (if control never reaches them).
With compilation, object code is produced and stored for subsequent executions; with interpretation, no object code is produced, and re-translation is required every time execution is needed.
Again, interpreters are very useful for debugging purposes, since partial output can be obtained as the program runs; with a compiler, the whole program must be error free before it can be executed to obtain any result.
Moreover, compilation requires that both the source program and the object program be stored, while for interpretation only the source program has to be stored, since no object program is produced. This feature often makes interpreters preferable to compilers for very small computers.
On the other hand, a compiled program may execute 20 to 100 times faster than an interpreted one. This is because certain portions of a program may need to be executed repeatedly, perhaps hundreds of times. An interpreter has to analyse each statement every time it is to be executed, whereas a compiler translates each statement only once, regardless of how many times the translated statement will eventually be executed.
Another significant difference between a compiler and an interpreter is that an interpreter detects only one error at a time, and this error will cause program execution to halt until the error is corrected. A compiler, by contrast, normally detects as many syntax errors as possible in one translation.
Moreover, compilation guarantees the detection of all syntax errors; interpretation offers no such guarantee, because control might never reach some statements that contain syntax errors.
In choosing between a compiler and an interpreter, the factors to weigh include:
i. Execution speed:- Compiler if execution speed is critical; otherwise interpreter.
ii. Environment:- Compiler for a production environment, interpreter for a training environment.
iii. Available main memory:- Compiler for unlimited memory, interpreter for limited memory.
iv. Debugging and diagnostic support:- Interpreter if debugging support is critical; otherwise compiler.
v. Detection of all translation errors:- Compiler if this guarantee is of prime concern; otherwise interpreter.
Exercise
1. What factors will inform your implementing a language as a compiler rather than an interpreter? Why would you implement the language as a single-pass rather than a multi-pass compiler?
2. Develop a suitable algorithm for the 'binary search' technique.
3. Put the following into polish notation.
CHAPTER ONE
What is a compiler?
A compiler is a program that takes as input a program written in one language (the source
language) and translates it into a functionally equivalent program in another language (the
target language). The source language is usually a high-level language like Pascal or C,
and the target language is usually a low-level language like assembly or machine
language. As it translates, a compiler also reports errors and warnings to help the
programmer make corrections to the source, so the translation can be completed.
Theoretically, the source and target can be any languages, but the most common use of a compiler is translating an ASCII source program written in a language such as C into a machine-specific result like SPARC assembly that can execute on the designated hardware.
Although we will focus on writing a compiler for a programming language, the techniques you learn can be valuable and useful for a wide variety of parsing and translating tasks, e.g. converting javadoc comments to HTML, generating a table from the results of an SQL query, collating responses from an e-mail survey, implementing a server that responds to a network protocol like http or imap, or “screen-scraping” information from an on-line source. Your printer uses parsing to render PostScript files. Hardware engineers use a full-blown compiler to translate from a hardware description language to the schematic of a circuit. Your email spam filter quite possibly uses scanning and parsing to detect unwanted email. And the list goes on…
How does a compiler work?
From the previous diagram, you can see there are two main stages in the compiling
process: analysis and synthesis. The analysis stage breaks up the source program into
pieces, and creates a generic (language-independent) intermediate representation of the
program. Then, the synthesis stage constructs the desired target program from the
intermediate representation. Typically, a compiler’s analysis stage is called its front-end
and the synthesis stage its back-end. Each of the stages is broken down into a set of
“phases” that handle different parts of the task.
What advantage do you see to structuring the compiler this way (separating the front and
back ends)?
1) Lexical Analysis or Scanning: The source program is read one character at a time and carved into a stream of tokens: identifiers (variable names), reserved words, integers, doubles or floats, delimiters, operators, and special symbols. For example, consider this fragment:
int a;
a = a + 2;
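Scanning it would produce a token stream along the following lines (the token names here are illustrative, echoing the T_ names used in the grammar below):
T_INT T_IDENTIFIER(a) ';'
T_IDENTIFIER(a) '=' T_IDENTIFIER(a) '+' T_INTCONSTANT(2) ';'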
2) Syntax Analysis or Parsing: The tokens found during scanning are grouped together
using a context-free grammar. A grammar is a set of rules that define valid
structures in the programming language. Each token is associated with a specific
rule, and grouped together accordingly. This process is called parsing. The output of
this phase is called a parse tree or a derivation, i.e., a record of which grammar
rules were used to create the source program.
Expression -> Expression + Expression |
              Expression - Expression |
              ...
              Variable |
              Constant |
              ...
Variable -> T_IDENTIFIER
Constant -> T_INTCONSTANT |
            T_DOUBLECONSTANT
The symbol on the left side of the “->” in each rule can be replaced by the symbols on the
right. To parse a + 2, we would apply the following rules:
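A sketch of such a derivation, using the rules above (reading "->" as "derives"):
Expression
-> Expression + Expression      (Expression -> Expression + Expression)
-> Variable + Expression        (Expression -> Variable)
-> T_IDENTIFIER + Expression    (Variable -> T_IDENTIFIER)
-> T_IDENTIFIER + Constant      (Expression -> Constant)
-> T_IDENTIFIER + T_INTCONSTANT (Constant -> T_INTCONSTANT)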
When we reach a point in the parse where we have only tokens, we have finished. By
knowing which rules are used to parse, we can determine the structures present in the
source program.
3) Semantic Analysis: The parse tree or derivation is checked for semantic errors, i.e., a statement that is syntactically correct (associates with a grammar rule correctly) but disobeys the semantic rules of the source language. Semantic analysis is the phase
where we detect such things as use of an undeclared variable, a function called with
improper arguments, access violations, and incompatible operands and type
mismatches, e.g. an array variable added to a function name.
int arr[2], c;
c = arr * 10;
Most semantic analysis pertains to the checking of types. Although the C fragment above
will scan into valid tokens and successfully match the rules for a valid expression, it isn't
semantically valid. In the semantic analysis phase, the compiler checks the types and
reports that you cannot use an array variable in a multiplication expression and that the
type of the right hand-side of the assignment is not compatible with the left:
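The report might read something like this (illustrative wording only, not any particular compiler's output):
*** Incompatible operand to '*': 'arr' is an array, not an arithmetic value
*** Incompatible types in assignment: 'int' = 'int[2] * int'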
Example of intermediate code generation: the single C statement
a = b * c + b * d
is translated into a sequence of four instructions in three-address code:
_t1 = b * c
_t2 = b * d
_t3 = _t1 + _t2
a = _t3
Note the use of temp variables that are created by the compiler as needed to keep the number of operands down to three.
Of course, it's a little more complicated than this, because we have to translate branching
and looping instructions, as well as function calls. Here is some TAC for a branching
translation:
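A sketch of what such a translation might look like, assuming IfZ/Goto/Label mnemonics for the jump instructions (the exact notation varies between TAC dialects): the statement
if (a < b) c = 1; else c = 2;
could become
    _t1 = a < b
    IfZ _t1 Goto _L0
    c = 1
    Goto _L1
_L0:
    c = 2
_L1: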
The optimisation phase can really slow down a compiler, so typically it is an optional
phase. The compiler may even have fine-grain controls that allow the developer to make
tradeoffs between time spent compiling versus optimisation quality.
Example of code optimisation:
Before optimisation:       After optimisation:
_t1 = b * c                _t1 = b * c
_t2 = _t1 + 0              _t2 = _t1 + _t1
_t3 = b * c                a = _t2
_t4 = _t2 + _t3
a = _t4
In the example shown above, the optimiser was able to eliminate an addition of zero and a re-evaluation of the same expression, allowing the original five TAC statements to be re-written in just three statements using two fewer temporary variables.
2) Object Code Generation: This is where the target program is generated. The output
of this phase is usually machine code or assembly code. Memory locations are
selected for each variable. Instructions are chosen for each operation. The three-
address code is translated into a sequence of assembly or machine language
instructions that perform the same tasks.
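As a sketch (register choices and the treatment of named memory locations are assumptions), the optimised TAC statement _t2 = _t1 + _t1 might be translated into MIPS assembly along these lines:
    lw   $t0, _t1        # load _t1 from its memory location
    add  $t1, $t0, $t0   # compute _t1 + _t1
    sw   $t1, _t2        # store the result into _t2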
In the example above, the code generator translated the TAC input into MIPS assembly
output.
3) Object Code Optimisation: There may also be another optimisation pass that
follows code generation, this time transforming the object code into tighter, more
efficient object code. This is where we consider features of the hardware itself to
make efficient usage of the processor(s) and registers. The compiler can take
advantage of machine-specific idioms (specialised instructions, pipelining, branch
prediction, and other peephole optimisations) in reorganising and streamlining the
object code itself. As with IR optimisation, this phase of the compiler is usually
configurable or can be skipped entirely.
The symbol table
There are a few activities that interact with various phases across both stages. One is
symbol table management; a symbol table contains information about all the identifiers in
the program along with important attributes such as type and scope. Identifiers can be
found in the lexical analysis phase and added to the symbol table. During the two phases
that follow (syntax and semantic analysis), the compiler updates the identifier entry in the
table to include information about its type and scope.
When generating intermediate code, the type of the variable is used to determine which
instructions to emit. During optimisation, the “live range” of each variable may be placed
in the table to aid in register allocation. The memory location determined in the code
generation phase might also be kept in the symbol table.
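A symbol-table entry can be pictured as a record that accumulates these attributes phase by phase. A minimal C sketch (the fields simply mirror the attributes mentioned above; they are not any particular compiler's layout):

#include <stdio.h>
#include <string.h>

struct symbol {
    char name[32];     /* identifier, entered during lexical analysis */
    int  type;         /* filled in during syntax/semantic analysis   */
    int  scope_level;  /* likewise                                    */
    int  live_from;    /* "live range", noted during optimisation     */
    int  live_to;
    int  mem_location; /* assigned during code generation             */
};

int main(void) {
    struct symbol s;
    strcpy(s.name, "rate");
    s.type = 2;            /* e.g. 2 meaning double, by some encoding */
    s.scope_level = 0;
    s.live_from = 1;
    s.live_to = 4;
    s.mem_location = 8;
    printf("%s: type=%d scope=%d location=%d\n",
           s.name, s.type, s.scope_level, s.mem_location);
    return 0;
}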
Error-handling
Another activity that occurs across several phases is error handling. Most error handling occurs in the first three phases of the analysis stage. The scanner keeps an eye out for stray tokens, the syntax analysis phase reports invalid combinations of tokens, and the semantic analysis phase reports type errors and the like. Sometimes these are fatal errors that stop the entire process, while at other times the compiler can recover and continue.
Lexical Analysis:-
The lexical analyser (LA), or scanner, reads the source program one character at a time, converting the source program into a sequence of atomic units called tokens.
Syntax Analysis:-
The second stage of translation is called syntax analysis or parsing. In this phase expressions, statements, declarations etc. are identified by using the results of lexical analysis. Syntax analysis is aided by using techniques based on the formal grammar of the programming language.
Intermediate Code Generation:-
An intermediate representation of the final machine language code is produced. This
phase bridges the analysis and synthesis phases of translation.
Code Optimization:-
This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space.
Code Generation:-
The last phase of translation is code generation. A number of optimizations to reduce the length of the machine-language program are carried out during this phase. The output of the code generator is the machine-language program for the specified computer.
Table Management (or) Book-keeping:-
This is the portion that keeps track of the names used by the program and records essential information about each. The data structure used to record this information is called a 'Symbol Table'.
Error Handlers:-
An error handler is invoked when a flaw in the source program is detected.
The output of the LA is a stream of tokens, which is passed to the next phase, the syntax analyzer or parser. The SA groups the tokens together into syntactic structures called expressions. Expressions may further be combined to form statements. The syntactic structure can be regarded as a tree whose leaves are tokens; such trees are called parse trees.
The parser has two functions. It checks that the tokens from the lexical analyzer occur in patterns that are permitted by the specification for the source language. It also imposes on the tokens a tree-like structure that is used by the subsequent phases of the compiler.
For example, if a program contains the expression A+/B, then after lexical analysis this expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer should detect an error situation, because the presence of these two adjacent binary operators violates the formation rules of an expression.
The task of syntax analysis is to make explicit the hierarchical structure of the incoming token stream by identifying which parts of the token stream should be grouped together.
Intermediate Code Generation:-
The intermediate code generation uses the structure produced by the syntax analyzer to
create a stream of simple instructions. Many styles of intermediate code are possible. One
common style uses instructions with one operator and a small number of operands.
The output of the syntax analyzer is some representation of a parse tree. The intermediate
code generation phase transforms this parse tree into an intermediate language
representation of the source program.
Code Optimization
This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space. Its output is another intermediate-code program that does the same job as the original, but in a way that saves time and/or space.
1. Local Optimization:-
There are local transformations that can be applied to a program to make an improvement. For example,
If A > B goto L2
Goto L3
L2:
can be replaced by the single statement
If A <= B goto L3
Another important local optimization is the elimination of common sub-expressions. For example,
A := B + C + D
E := B + C + F
might be evaluated as
T1 := B + C
A := T1 + D
E := T1 + F
This takes advantage of the common sub-expression B + C.
2. Loop Optimization:-
Another important source of optimization concerns increasing the speed of loops. A typical loop improvement is to move a computation that produces the same result each time around the loop to a point in the program just before the loop is entered, as in the sketch below.
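A small illustrative C sketch of this transformation (the example is our own):

#include <stdio.h>

int main(void) {
    int a[10], i, n = 10, x = 3, y = 4, t;

    /* before: x * y is recomputed on every trip around the loop */
    for (i = 0; i < n; i++)
        a[i] = x * y + i;

    /* after: the loop-invariant computation is moved out of the loop */
    t = x * y;
    for (i = 0; i < n; i++)
        a[i] = t + i;

    printf("%d\n", a[9]);   /* prints 21 in both versions */
    return 0;
}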
Code Generator:-
The code generator produces the object code by deciding on the memory locations for data, selecting the code to access each datum, and selecting the registers in which each computation is to be done.
Many computers have only a few high speed registers in which computations can be
performed quickly. A good code generator would attempt to utilize registers as efficiently
as possible.
Table Management or Book-keeping:-
A compiler needs to collect information about all the data objects that appear in the source program. This information is collected by the early phases of the compiler, the lexical and syntactic analyzers. The data structure used to record it is called a Symbol Table.
Error Handling:-
One of the most important functions of a compiler is the detection and reporting of errors in the source program. The error messages should allow the programmer to determine exactly where the errors have occurred. Errors may occur in any of the phases of a compiler.
Whenever a phase of the compiler discovers an error, it must report the error to the error handler, which issues an appropriate diagnostic message. Both the table-management and error-handling routines interact with all phases of the compiler.
Example:
Position := initial + rate * 60

Lexical Analyzer output:
id1 := id2 + id3 * 60

Syntax Analyzer output (parse tree):
          :=
         /  \
      id1     +
             / \
          id2   *
               / \
            id3   60

Semantic Analyzer output (the integer 60 is converted to real):
          :=
         /  \
      id1     +
             / \
          id2   *
               / \
            id3   inttoreal
                     |
                     60
Intermediate Code Generator output:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

Code Optimizer output:
temp1 := id3 * 60.0
id1 := id2 + temp1
Code Generator output:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
TOKEN
The LA reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens. If the symbols conform to the standard formats of the language, the LA accepts them and produces the corresponding tokens.
2.2 An Exercise in Recognition
Given a Type 1 grammar G = <T, N, P, S> and a string α over T, where |α| = k, it is strongly rumoured that the following algorithm will determine whether α ∈ L(G) or not. (Because G is Type 1, no derivation step shrinks a sentential form, so only sentential forms of length at most k need be kept; the sets Qi are therefore finite and the algorithm terminates.)
Algorithm 2.2.1
1. i ← 1; Q1 ← {S}
2. i ← i + 1
3. Qi ← Qi-1 ∪ {γ | β ⇒ γ for some β ∈ Qi-1 and |γ| ≤ k}
4. If Qi = Qi-1, stop. String α cannot be generated by G.
5. If α ∉ Qi, go to step 2.
6. Stop. String α has been generated by G.
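As a small worked example (our own; the grammar is assumed): let G have productions S -> aSb | ab and take α = aabb, so k = 4. Then
Q1 = {S}
Q2 = Q1 ∪ {ab, aSb} = {S, ab, aSb}
Q3 = Q2 ∪ {aabb} (aaSbb is also derivable from aSb, but its length 5 exceeds k and it is discarded)
Since α = aabb ∈ Q3, the algorithm stops at step 6: aabb ∈ L(G).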