LECTURE 04
LEXICAL AND SYNTAX
ANALYSIS
CSE 325/CSE 425:
CONCEPTS OF
PROGRAMMING LANGUAGE
INSTRUCTOR: DR. F. A. FAISAL
OUTLINE
• Introduction
• Lexical Analysis
• The Parsing Problem
• Recursive-Descent Parsing
• Bottom-up Parsing
INTRODUCTION
• Language implementation systems must analyze source
code, regardless of the specific implementation approach.
• Nearly all syntax analysis is based on a formal description
of the syntax of the source language (BNF).
SYNTAX ANALYSIS
• The syntax analysis portion of a language processor
nearly always consists of two parts;
• A low-level part called a lexical analyzer (mathematically, a
finite automaton based on a regular grammar)
• A high-level part called a syntax analyzer, or parser
(mathematically, a push-down automaton based on a
context-free grammar or BNF)
ADVANTAGES OF USING
BNF TO DESCRIBE SYNTAX
• Provides a clear and concise syntax description
• The parser can be based directly on the BNF
• Parsers based on BNF are easy to maintain.
REASONS TO SEPARATE LEXICAL
AND SYNTAX ANALYSIS
• Simplicity- less complex approaches can be used for
lexical analysis; separating them simplifies the parser.
• Efficiency- separation allows optimization of the lexical
analyzer.
• Portability- Parts of the lexical analyzer may not be
portable, but the parser always is portable.
LEXICAL ANALYSIS
• A lexical analyzer is a pattern matcher for character
strings.
• A lexical analyzer is a “front-end” for the parser
• Identifies substrings of the source program that belong
together - lexemes
• Lexemes match a character pattern, which is associated
with a lexical category called as token.
• sum is a lexeme; its token may be IDENT.
(CONT.)
• The lexical analyzer is usually a function that is called by
the parser when it needs the next token.
• Three approaches to building a lexical analyzer:
• Write a formal description of the tokens and use a software
tool that constructs a table-driven lexical analyzer from
such a description.
• Design a state diagram that describe the tokens and write
a program that implements the state diagram
• Design a state diagram that describes the tokens and
hand-construct a table-driven implementation of the state
diagram.
STATE DIAGRAM
DESIGN
• Helps to describe the behavior.
• A naïve state diagram would have a transition from every
state on every character in the source language- such a
diagram would be vary large!
LEXICAL ANALYSIS
(CONT.)
• In many cases, transitions can be combined to simplify
the state diagram.
• When recognizing an identifier, all uppercase and
lowercase letters are equivalent.
• Use a character class that includes all letters.
• When recognizing an integer literal, all digits are
equivalent – use a digit class
• Reserved words and identifiers can be recognized
together (rather than having a part of the diagram for each
reserved word).
• Use a table lookup to determine whether a possible
identifier is in fact a reserved word.
LEXICAL ANALYSIS
(CONT.)
• Conveninent utility subprograms:
• getChar – gets the next character of input, puts it in
nextChar, determines its class and puts the class in
charClass
• addChar- puts the character from nextChar into the place
the lexeme is being accumulated, lexeme
• lookup- determines whether the string in lexemes is a
reserved word (returns a code)
PROGRAM
PROGRAM (CONT.)
SAMPLE OUTPUT
STATE DIAGRAM
THE PARSING
PROBLEM
• Goals of the Parser, given an input program:
• Find all syntax errors; for each, product an appropriate
diagnostic message and recover quickly
• Produce the parse tree, or at least a trace of the parse tree,
for the program
• Two categories of parsers
• Top down- Produce the parse tree, beginning at the root
• Order is that of a leftmost derivation
• Traces or builds the parse tree in preorder
• Bottom up- Produce the parse tree, beginning at the leaves
• Order is that of the reverse of a rightmost derivation
• Useful parsers look only one token ahead in the input
(CONT.)
• Top-down Parsers
• Given a sentential form, xAα, the parser must choose the
correct A-rule to get the next sentential form in the leftmost
derivation, using only the first token produced by A.
• The most common top-down parsing algorithms:
• Recursive descent- a coded implementation
• LL parsers- table driven implementation
• Bottom-up Parsers
• Given a right sentential form, α, determine what substring
α is the right-hand side of the rule in the grammar that
must be reduced to produce the previous sentential form in
the right derivation.
• The most common bottom-up parsing algorithms are in the
LR family.
COMPLEXITY OF
PARSING
• Parsers that work for any unambiguous grammar are
complex and inefficient (O(n3), where n is the length of the
input)
• Commercial Compilers use parsers that only work for a
subset of all unambiguous grammars, but do it in linear
time (O(n))
RECURSIVE-DESCENT
PARSING
• There is a subprogram for each nonterminal in the
grammar, which can parse sentences that can be
generated by that nonterminal
• EBNF is ideally suited for being the basis for a recursive-
descent parser, because EBNF minimizes the number of
nonterminals.
• A grammar for simple expressions:
(CONT.)
• Assume we have a lexical analyzer named lex, which puts
the next token code in nextToken
• The coding process when there is only one RHS:
• For each terminal symbol in the RHS, compare it with the
next input token; if they match, continue, else there is an
error.
• For each nonterminal symbol in the RHS, call its
associated paring subprogram
PROGRAM
N.B.- This particular routine does
not detect errors.
Every parsing routine leaves the
next token in nextToken.
(CONT.)
• A nonterminal that has more than one RHS requires an
initial process to determine which RHS it is to parse
• The correct RHS is chosen on the basis of the next token
of input
• The next token is compared with the first token that can be
generated by each RHS until a match is found
• If no match is found, it is a syntax error
PROGRAM
PROGRAM
OUTPUT
PARSE TREE
THE LL GRAMMAR
CLASS
A -> A + B (this requires the activation of the A parser subprogram and calls
itself again and again)
ε specifies as the empty string
EXAMPLE
PAIRWISE
DISJOINTNESS TEST
Read by yourself
BOTTOM-UP PARSING
• The parsing problem is finding the correct RHS in a right-
sentential form to reduce to get the previous right-
sentential form in the derivation
PARSE TREE
PHRASE, SIMPLE
PHRASE AND HANDLE
PHRASE, SIMPLE
PHRASE AND HANDLE
HANDLES
EXAMPLE
SHIFT REDUCE PARSING IN COMPILERS
(WWW.GEEKSFORGEEKS.ORG)
BASIC OPERATIONS
EXAMPLE
THANKS