Module 3 Notes
Assembler design options- Algorithm for Single Pass assembler, Multi pass assembler,
Implementation example of MASM Assembler
Here we consider the design and implementation of an assembler for the more complex SIC/XE
machine, so that it is easy to examine the effect of the extended hardware on the structure and
functions of the assembler. Many real machines have certain architectural features that are similar
to those we consider here.
Indirect addressing mode is indicated by adding the prefix @ to the operand. Immediate
addressing mode is denoted by adding the prefix # to the operand. Instructions that refer to
memory are normally assembled using either the program-counter relative or base relative mode.
BASE (line 70) is an assembler directive used in conjunction with base relative addressing. If the
displacements required for both program-counter relative and base relative addressing are too
large to fit into a 3-byte instruction, then the 4-byte extended format (format 4) must be used. The
extended instruction format is specified with the prefix + added to the operation code in the
source statement (lines 15, 35, 65). The programmer must specify this format when it is required.
COMPR A,S
TIXR T
– Register-to-register instructions are faster than the corresponding register-to-memory
instructions because they are shorter and, more importantly, they do not require
another memory reference
– Fetching an operand from a register is much faster than retrieving it from main
memory
– When using immediate addressing, the operand is already present as part of the
instruction and need not be fetched from anywhere (lines 55, 70)
– Changing COMP to COMPR (line 150) forces the use of CLEAR (line 132)
If you calculate the program-counter relative displacement that would be required for the
statement on line 175, you will see that it is too large to fit into the 12-bit displacement field.
Line 20 could also be assembled using base relative mode. In our assembler, however, we have
arbitrarily chosen to attempt program-counter relative assembly first.
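The choice between the two relative modes can be sketched as follows. This is a minimal Python
sketch, not the assembler's actual code: the function name choose_displacement and the sample
addresses are assumptions made for illustration. It assumes the usual SIC/XE ranges for the 12-bit
displacement field, -2048 to 2047 for program-counter relative and 0 to 4095 for base relative.

```python
def choose_displacement(target, pc, base=None):
    """Return (mode, disp) for a format 3 instruction, or None if format 4 is needed.

    target: address of the operand
    pc:     address of the NEXT instruction (the PC value at execution time)
    base:   value declared with BASE, or None if no BASE is in effect
    """
    # Program-counter relative is attempted first (an arbitrary choice).
    disp = target - pc
    if -2048 <= disp <= 2047:
        return ("pc", disp & 0xFFF)      # negative values become 12-bit two's complement
    # Otherwise try base relative, which only allows a non-negative displacement.
    if base is not None:
        disp = target - base
        if 0 <= disp <= 4095:
            return ("base", disp)
    return None                          # neither fits: extended format is required


# Hypothetical addresses: operand at 0030, next instruction at 0003.
mode, disp = choose_displacement(0x0030, 0x0003)
print(mode, format(disp, "03X"))
```

When the function returns None, the programmer would have to request extended format explicitly
with the + prefix, as described above.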
On line 12, the immediate operand is the symbol LENGTH. Since the value of a symbol is the
address assigned to it, the immediate instruction has the effect of loading register B with the
address of LENGTH. Note that here we have combined program-counter relative addressing with
immediate addressing. In general, the target address calculation is performed first; then, if
immediate mode is specified, the target address (not the contents stored at that address) becomes
the operand.
Program Relocation
If we knew in advance exactly which programs were to be executed concurrently in this way,
we could assign addresses when the programs were assembled so that they would fit together
without overlap or wasted space. It is impractical, however, to plan program execution this
closely: we do not know exactly when jobs will be submitted, exactly how long they will run, etc.
It is therefore desirable to load a program into memory wherever there is room for it. In such a
situation, the actual starting address of the program is not known until load time.
If we do this, the address 102D will not contain the value that we expect – in fact, it will
probably be part of some other user’s program. We need to make some change in the address
portion of this instruction so we can load and execute the program at address 2000.
The only parts of the program that require modification at load time are those that specify
direct addresses. The rest of the instructions need not be modified, for example:
– instructions assembled using PC-relative or base relative addressing
Looking at the object program alone, it is not possible to distinguish the values that represent
addresses from those that represent constant data items. Because the assembler does not know
the actual location where the program will be loaded, it cannot make the necessary changes to the
addresses used by the program. However, the assembler can identify for the loader those parts of
the object program that need modification.
Note that no matter where the program is loaded, RDREC is always 1036 (hexadecimal) bytes past
the starting address of the program.
Please see slides 48 and 49 of the slide show for a better understanding of program
relocation in SIC and SIC/XE.
Relocatable Program
An object program that contains the information needed to perform address modification at load
time is called a relocatable program. Consider the JSUB instruction: when the assembler
generates the object code for JSUB, it inserts the address of RDREC relative to the start of the
program (this is the reason we initialized the location counter to 0 for the assembly). The
assembler also produces a command for the loader, instructing it to add the beginning address of
the program to the address field in the JSUB instruction at load time. The command for the loader
must also be a part of the object program.
The half-byte approach is closely tied to SIC/XE; on other machines it may not be appropriate.
The record specifies that the beginning address of the program is to be added to a field that
begins at address 000007 (relative to the start of the program) and is 5 half-bytes in length. Thus
in the assembled instruction 4B101036, the first 12 bits (4B1) will remain unchanged. The
program load address will be added to the last 20 bits (01036) to produce the correct operand
address.
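What the loader does with such a record can be sketched in Python as below. This is an
illustrative sketch only: the function name apply_modification, the bytearray model of memory, the
position of the instruction at relative address 0006, and the load address 5000 are all assumptions,
not part of the notes.

```python
def apply_modification(memory, field_addr, half_bytes, load_addr):
    """Add load_addr to a field of `half_bytes` half-bytes that starts at field_addr.

    memory holds the program exactly as it was assembled for address 0.
    When half_bytes is odd, the field begins in the low-order half of the first
    byte, so the high-order half-byte of that byte is left unchanged.
    """
    nbytes = (half_bytes + 1) // 2
    raw = int.from_bytes(memory[field_addr:field_addr + nbytes], "big")
    mask = (1 << (4 * half_bytes)) - 1          # selects only the address field
    field = ((raw & mask) + load_addr) & mask   # relocate the address
    raw = (raw & ~mask) | field                 # keep the bits outside the field
    memory[field_addr:field_addr + nbytes] = raw.to_bytes(nbytes, "big")


# Hypothetical layout: the instruction 4B101036 occupies relative addresses 0006-0009.
program = bytearray(0x10)
program[6:10] = bytes.fromhex("4B101036")
apply_modification(program, 0x000007, 5, 0x5000)   # hypothetical load address 5000
print(program[6:10].hex().upper())
```

The first 12 bits (4B1) survive because the mask covers only the last 20 bits of the instruction.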
Exactly the same kind of relocation must be performed for the instructions on lines 35 and
65. The rest of the instructions in the program, however, need not be modified when the
program is loaded:
• In some cases the operand is not a memory address at all (eg: CLEAR S or LDA #3)
• In some cases the operand is specified using program-counter relative or base relative
addressing
• On line 10, STL RETADR is assembled using program-counter relative addressing
with displacement 02D
• No matter where the program is loaded in memory, the word labelled RETADR will
always be 2D bytes away from the STL instruction → thus no instruction
modification is needed
• When STL is executed, the program counter will contain the (actual) address of the
next instruction
• The target address calculation process will then produce the correct (actual) operand
address corresponding to RETADR
• Similarly, the distance between LENGTH and BUFFER will always be 3 bytes. Thus the
displacement in the base relative instruction on line 160 will be correct without
modification. (The contents of the base register will depend upon where the program is
loaded. However, this will be taken care of automatically when the program-counter relative
instruction LDB #LENGTH is executed.)
It is convenient if a programmer can write the value of a constant operand as a part of the
instruction that uses it. This avoids having to define the constant elsewhere in the program
and make up a label for it. Such an operand is called a literal because the value is stated
literally in the instruction. In our assembler language notation, a literal is identified with the
prefix =, which is followed by a specification of the literal value, using the same notation as
in the BYTE statement.
• Eg 1 :
The literal =C'EOF' in a statement specifies a 3-byte operand whose value is the character string
EOF.
• Eg 2 :
215 1062 WLOOP TD =X'05' E32011
This statement specifies a 1-byte literal with the hexadecimal value 05.
The notation used for literals varies from assembler to assembler; however, most assemblers use
some symbol (as we have used =) to make literal identification easier.
Literal Pools
All of the literal operands used in the program are gathered together into one or more literal
pools. Normally literals are placed into a pool at the end of the program. The assembly listing of
the program containing literals usually includes a listing of this literal pool, which shows the
assigned addresses and the generated data values. Such a literal pool listing is shown in the figure
immediately following the END statement. In this case the pool consists of the single literal
=X'05'.
In some cases, however, it is desirable to place literals into a pool at some other location in the
object program. For this purpose the assembler directive LTORG is provided (line 93).
LTORG
When the assembler encounters an LTORG statement, it creates a literal pool that contains all of
the literal operands used since the previous LTORG (or the beginning of the program). This
literal pool is placed in the object program at the location where the LTORG directive was
encountered. Note that the literals placed in a pool by LTORG will not be repeated in the pool at
the end of the program. If we had not used the LTORG statement on line 93, the literal =C'EOF'
would be placed in the pool at the end of the program. This literal pool would begin at address
1073. This means that the literal operand would be placed too far away from the instruction
referencing it to allow program-counter relative addressing.
Placing the literal pool before this buffer avoids the need to use extended format instructions
when referring to literals. The need for an assembler directive such as LTORG arises when it is
desirable to keep the literal operand close to the instruction that uses it.
Duplicate Literals
A duplicate literal is the same literal used in more than one place in the program; the assembler
should store only one copy of the specified data value. Most assemblers recognize duplicate
literals. For eg: the literal =X'05' is used in the program on lines 215 and 230. However, only one
data area with this value is generated. Both instructions refer to the same address in the literal
pool for their operand.
The easiest way to recognize duplicate literals is by comparing the character strings that define
them (eg: =X'05'). A more intelligent way is to look at the generated value instead of the defining
expression. Eg: =C'EOF' and =X'454F46' would specify identical operand values. The assembler
might avoid storing both literals if it recognised this equivalence, but the complexity of the
assembler increases. If we use the character string defining a literal to recognize duplicates, we
must be very careful of literals whose value depends upon their location in the program. For
example, there are literals that refer to the current value of the location counter (often denoted
by the symbol *). Such literals are sometimes useful for loading base registers.
This will load the beginning address of the program into register B, so that this value is
available for base relative addressing. In detecting duplicate literals, such a notation causes a
problem. For eg:
– If the same literal appeared on line 55, it would specify an operand with value
0020.
In such a case, the literal operands have identical names; however, they have different values and
both must appear in the pool. Another problem arises if a literal refers to any other item whose
value changes between one point in the program and another.
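Both ways of recognizing duplicates, and the problem with location-dependent literals, can be
illustrated with a small Python sketch. The helper names literal_value and pool_key are
hypothetical, only the C'...' and X'...' forms are handled, and the location 0003 used in the
usage lines is an assumed value (0020 is the one mentioned above).

```python
def literal_value(literal):
    """Return the bytes generated for a literal such as =C'EOF' or =X'454F46'."""
    kind, body = literal[1], literal[3:-1]   # skip '=', the type letter and the quotes
    if kind == "C":
        return body.encode("ascii")
    if kind == "X":
        return bytes.fromhex(body)
    raise ValueError(f"unsupported literal: {literal}")


def pool_key(literal, locctr):
    """Key used to decide whether two literal operands can share one pool entry."""
    if literal == "=*":
        # The value of =* is the location where it is used, so two uses at
        # different locations are different literals despite the same name.
        return ("*", locctr)
    return ("value", literal_value(literal))


# Same generated value, different defining strings: one pool entry is enough.
print(pool_key("=C'EOF'", 0x001A) == pool_key("=X'454F46'", 0x0100))
# Same name, different locations (0003 is assumed): two pool entries are needed.
print(pool_key("=*", 0x0003) == pool_key("=*", 0x0020))
```

Comparing by value costs an extra evaluation step per literal, which is the added assembler
complexity mentioned above.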
The basic data structure needed is a literal table (LITTAB). For each literal used, LITTAB
contains the literal name, the operand value and length, and the address assigned to the operand
when it is placed in a literal pool. LITTAB is often organised as a hash table, using the literal
name or value as the key.
• In Pass 1, as each literal operand is recognized:
– The assembler searches LITTAB for the specified literal name (or value)
– If the literal is not present, it is added to LITTAB (leaving the address
unassigned)
– When Pass 1 encounters an LTORG statement or the end of the program, the
assembler makes a scan of the literal table
– At this time each literal currently in the table is assigned an address (unless such an
address has already been filled in)
– As these addresses are assigned, the location counter is updated to reflect the number
of bytes occupied by each literal
• In Pass 2, the operand address used in generating object code is obtained by searching
LITTAB for each literal operand encountered
– The data values specified by the literals in each literal pool are inserted at the
appropriate places in the object program exactly as if these values had been generated
by BYTE or WORD statements
– If a literal value represents an address in the program (eg: a location counter value), the
assembler must also generate the appropriate Modification record
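The Pass 1 and Pass 2 steps listed above can be sketched with a small Python class. This is a
minimal sketch under stated assumptions: the class name LitTab, its method names, and the
addresses in the usage lines are hypothetical, and a Python dict stands in for the hash table.

```python
class LitTab:
    """Literal table: literal name -> [value bytes, assigned address or None]."""

    def __init__(self):
        self.entries = {}          # a dict keeps literals in order of first use

    def reference(self, name, value):
        # Pass 1: add the literal on first use, leaving its address unassigned.
        if name not in self.entries:
            self.entries[name] = [value, None]

    def place_pool(self, locctr):
        # At LTORG or at the end of the program: assign an address to every
        # literal that has not been placed yet and advance the location counter.
        for entry in self.entries.values():
            if entry[1] is None:
                entry[1] = locctr
                locctr += len(entry[0])
        return locctr

    def address(self, name):
        # Pass 2: look up the operand address of a literal.
        return self.entries[name][1]


littab = LitTab()
littab.reference("=C'EOF'", b"EOF")
locctr = littab.place_pool(0x002D)      # hypothetical LTORG location
littab.reference("=X'05'", b"\x05")
littab.reference("=X'05'", b"\x05")     # duplicate use, no new entry
locctr = littab.place_pool(0x1076)      # hypothetical end-of-program location
```

Because place_pool skips entries that already have an address, a literal placed by LTORG is not
repeated in the pool at the end of the program.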
Until now we have dealt only with user-defined symbols that appear in the assembly program as
labels on instructions or data areas. The value of such a label is the address assigned to the
statement on which it appears. Most assemblers provide an assembler directive that allows the
programmer to define symbols and specify their values.
EQU
EQU (for “equate”) is the assembler directive generally used to define symbols and specify their
values. General form:
symbol EQU value
This statement defines the given symbol (ie., it enters it into SYMTAB) and assigns to it the value
specified. The value may be given as a constant or as any expression involving constants and
previously defined symbols. One common use of EQU is to establish symbolic names that can
be used for improved readability in place of numeric values. For eg: on line 133 of program
2.5 (as per the text)
+LDT #4096
This statement loads the value 4096 into register T. This value represents the maximum length
record we could read with subroutine RDREC.
If we include the statement
MAXLEN EQU 4096
in the program, then line 133 can be written as in program 2.9 (as per the text book)
+LDT #MAXLEN
When the assembler encounters the EQU statement, it enters MAXLEN into SYMTAB (with
value 4096). During assembly of the LDT instruction, the assembler searches SYMTAB for the
symbol MAXLEN, using its value as the operand in the instruction. Another common use of EQU
is defining mnemonic names for registers. Our assembler recognizes standard mnemonics for
registers (A, X, L, etc.). Suppose, however, that the assembler expected register numbers instead
of names in an instruction like RMO. This would require the programmer to write RMO 0,1 instead
of RMO A,X.
In such a case the programmer could include a sequence of EQU statements like:
A EQU 0
X EQU 1
L EQU 2
These statements cause the symbols A, X, L, ... to be entered into SYMTAB with their
corresponding values 0, 1, 2, ... The programmer can also establish and use names that reflect the
logical function of the registers in the program.
ORG
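ORG (for “origin”) is an assembler directive used to indirectly assign values to symbols. It is of
the form:
ORG value
When this statement is encountered, the assembler resets its location counter (LOCCTR) to the
specified value, which affects the values of the labels defined in the rest of the program. In most
cases, altering the location counter value in this way would result in incorrect assembly.
However, ORG is useful in label definition. Suppose that we are defining a symbol table with
the following structure: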
• In this table,
– the SYMBOL field contains a 6-byte user-defined symbol
– VALUE is a one-word representation of the value assigned to the symbol
– FLAGS is a 2-byte field that specifies symbol type and other information
We want to refer to entries in the table using indexed addressing (placing in the index register the
offset of the desired entry from the beginning of the table). We also want to refer to the fields
SYMBOL, VALUE, and FLAGS individually, so we must define these labels. With EQU, given the
field sizes above, the definitions would be:
SYMBOL EQU STAB
VALUE EQU STAB+6
FLAGS EQU STAB+9
An LDA instruction that names VALUE with indexed addressing then fetches the VALUE field from
the table entry indicated by the contents of register X. The same symbol definitions can be made
using the ORG directive.
The first ORG statement resets the location counter to the value of STAB (ie., the
beginning address of the table).
The label on the following RESB statement defines SYMBOL to have the current value in
LOCCTR; this is the same address assigned to STAB. LOCCTR is then advanced, so the
label on the RESW statement assigns to VALUE the address (STAB+6), and so on. The result is
a set of labels with the same values as those defined with the EQU statements above. With ORG,
the definition makes it clear that each entry in STAB consists of a 6-byte SYMBOL, followed by
a one-word VALUE, followed by a 2-byte FLAGS. The last ORG statement is very important
because:
• it sets LOCCTR back to its previous value, the address of the next unassigned byte
of memory after the table STAB
• any labels on subsequent statements, which are not part of STAB, are then
assigned proper addresses
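The effect of ORG on the location counter can be traced with a short Python sketch. It is only an
illustration: the function name assign_addresses, the tuple encoding of statements, the table size
of 1100 bytes (100 entries of 11 bytes), and the starting address 1000 are all assumptions.

```python
def assign_addresses(statements, locctr=0):
    """Pass 1 style address assignment for RESB, RESW and ORG statements only."""
    symtab = {}
    for label, op, operand in statements:
        if op == "ORG":
            # operand is (symbol, offset); the symbol must already be defined
            symbol, offset = operand
            locctr = symtab[symbol] + offset
            continue
        if label:
            symtab[label] = locctr               # label gets the current LOCCTR
        locctr += operand * (3 if op == "RESW" else 1)
    return symtab, locctr


# Hypothetical source: a table of 100 entries, 11 bytes each.
source = [
    ("STAB",   "RESB", 1100),
    (None,     "ORG",  ("STAB", 0)),      # back to the start of the table
    ("SYMBOL", "RESB", 6),
    ("VALUE",  "RESW", 1),
    ("FLAGS",  "RESB", 2),
    (None,     "ORG",  ("STAB", 1100)),   # restore LOCCTR to just past the table
]
symtab, locctr = assign_addresses(source, locctr=0x1000)
```

Without the final ORG, the location counter would be left at STAB+11, and later labels would
overlap the table.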
In some assemblers, the previous value of LOCCTR is automatically remembered, so we can
simply write
ORG
The reason for this restriction is the symbol definition process. In the second example, BETA
cannot be assigned a value when it is encountered during Pass 1 of the assembly, because ALPHA
does not yet have a value. However, our two-pass assembler design requires that all symbols be
defined during Pass 1. In the case of ORG, all symbols used to specify the new location counter
value must have been previously defined. Consider the sequence:
The sequence cannot be processed. In this case, the assembler would not know (during
Pass 1) what value to assign to the location counter in response to the first ORG statement. As a
result, the symbols BYTE1, BYTE2, and BYTE3 could not be assigned addresses during
Pass 1.
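The Pass 1 restriction on EQU can be sketched in Python as follows. This is a simplified
illustration, not the notes' algorithm: the function name define_equ is hypothetical, only
decimal constants, symbols, + and - are handled, and the address given to ALPHA is assumed.

```python
import re


def define_equ(symtab, symbol, expression):
    """Handle `symbol EQU expression` during Pass 1; forward references are rejected."""
    total, sign = 0, 1
    for token in re.findall(r"[A-Za-z][A-Za-z0-9]*|\d+|[+-]", expression):
        if token in ("+", "-"):
            sign = 1 if token == "+" else -1
        elif token.isdigit():
            total += sign * int(token)
        elif token in symtab:
            total += sign * symtab[token]
        else:
            # The operand has no value yet, so the new symbol cannot be defined.
            raise ValueError(f"{token} is not yet defined (forward reference)")
    symtab[symbol] = total


symtab = {"ALPHA": 0x0030}                # hypothetical address for ALPHA
define_equ(symtab, "BETA", "ALPHA")       # allowed: ALPHA is already defined
define_equ(symtab, "MAXLEN", "4096")      # a constant is always allowed
```

Calling define_equ for BETA with an empty symbol table raises the error, which corresponds to the
case where the definition of ALPHA comes after BETA in the source.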
This restriction is a result of the particular way in which we defined the two passes of our
assembler. However, some forward reference problems cannot be resolved by an ordinary two-pass
assembler regardless of how the work is divided between the passes. Consider the sequence:
Expressions
So far the assembly language statements have used single terms (labels, literals, etc.) as
instruction operands. Most assemblers allow the use of expressions as instruction operands. Each
such expression must be evaluated by the assembler to produce a single operand address or value.
• The most common special term is the current value of the location counter (designated by *).
This term represents the value of the next unassigned memory location.
Such a statement gives BUFEND a value that is the address of the next byte after the buffer area.
– Some values in the object program are relative to the beginning of the program; some
are absolute (independent of program location).
Similarly, the values of terms and expressions are either relative or absolute. A constant is an
absolute term. Labels on instructions and data areas, and references to the location counter value,
are relative terms. A symbol whose value is given by EQU (or any other similar directive) may
be either an absolute term or a relative term, depending upon the expression used to define its
value.
Expressions are classified as either absolute expressions or relative expressions:
• Absolute expression: contains only absolute terms. It may also contain relative terms, provided
that the relative terms occur in pairs and the terms in each pair have opposite signs
• Relative expression: all of the relative terms except one can be paired as above, and the
remaining unpaired relative term must have a positive sign
No relative term may enter into a multiplication or division operation, in either an absolute or a
relative expression.
The assembler must determine the type of each expression. To do so, it keeps track of the types
of all symbols defined in the program in the symbol table; we need a “flag” in SYMTAB for this
indication. It generates Modification records in the object program for relative values.
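The pairing rule for relative terms can be checked by counting signs, as in this Python sketch.
The function name expression_type, the (sign, name) encoding of terms, and the "R"/"A" flag
values are assumptions for illustration; only + and - are considered, since relative terms may not
be multiplied or divided.

```python
def expression_type(terms, flags):
    """Classify an expression of added and subtracted terms.

    terms: list of (sign, name) pairs, sign being +1 or -1
    flags: SYMTAB type flags, symbol -> "R" (relative) or "A" (absolute);
           names missing from flags are treated as constants (absolute)
    """
    net = 0
    for sign, name in terms:
        if flags.get(name, "A") == "R":
            net += sign           # a +R and a -R term cancel, forming one pair
    if net == 0:
        return "absolute"         # every relative term is paired
    if net == 1:
        return "relative"         # one unpaired relative term with positive sign
    raise ValueError("illegal expression: relative terms cannot be paired")


flags = {"BUFEND": "R", "BUFFER": "R", "MAXLEN": "A"}
print(expression_type([(+1, "BUFEND"), (-1, "BUFFER")], flags))   # a difference of addresses
print(expression_type([(+1, "BUFFER"), (-1, "1")], flags))        # an address minus a constant
```

An expression such as BUFEND+BUFFER leaves two unpaired relative terms and is rejected.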
Program Blocks
So far, the program being assembled has been treated as a unit. Even though the source program
logically contains subroutines, data areas, etc., the assembler handles it as one entity, resulting
in a single block of object code. Within this object program, the generated machine instructions
and data appear in the same order as they were written in the source program. Many assemblers
provide features that allow more flexible handling of the source and object programs:
• Allow the generated machine instructions and data to appear in the object program in a
different order from the corresponding source statements
• Allow the creation of several independent parts of the object program. These parts
maintain their identity and are handled separately by the loader
Program Block: segments of code that are rearranged within a single object program unit
Control Section: segments that are translated into independent object program units
Consider the program in the figure, which is written using program blocks. Three blocks are used:
1. (unnamed, the default block) contains the executable instructions of the program
2. (named CDATA) contains all data areas that are a few words or less in length
3. (named CBLKS) contains all data areas that consist of larger blocks of memory
The assembler directive USE indicates which portions of the source program belong to the various
blocks.
At the beginning, statements are assumed to be part of the unnamed (default) block. If no USE
statements are included, the entire program belongs to this single block. The USE statement on
line 92 signals the beginning of the block named CDATA. Source statements are associated with
this block until the USE statement on line 103, which begins the block named CBLKS. The USE
statement may also indicate the continuation of a previously begun block. Thus the statement on
line 123 resumes the default block, and the statement on line 183 resumes the block named CDATA.
Each program block may actually contain several separate segments of the source program.
The assembler rearranges these segments to gather together the pieces of each block. These blocks
are then assigned addresses in the object program, with the blocks appearing in the same
order in which they were first begun in the source program.
During Pass 1:
• The assembler maintains a separate location counter for each program block, so each label is
assigned an address relative to the start of the block that contains it
During Pass 2:
• For code generation, the assembler needs the address of each symbol relative to the start of
the object program (not the start of an individual program block); this is easily found from
the information in SYMTAB
• The assembler simply adds the location of the symbol, relative to the start of its block, to the
assigned block starting address
Consider the program: the column headed Loc/Block shows the relative address within a
program block assigned to each source line, and a block number indicating which program block
is involved. This information is stored in SYMTAB for each symbol. The value of
MAXLEN on line 107 is shown without a block number, which indicates that it is an absolute
symbol whose value is not relative to the start of any program block. At the end of Pass 1 the
assembler constructs a table that contains the starting addresses and lengths of all blocks.
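The block table and the Pass 2 address calculation can be sketched in Python as below. The
function names block_table and symbol_address are hypothetical, and the block lengths in the
usage lines are assumed values chosen only to make the arithmetic visible.

```python
def block_table(block_lengths):
    """Assign starting addresses to blocks in the order they were first begun.

    block_lengths: list of (name, length) pairs known at the end of Pass 1;
    the length of a block is the final value of its own location counter.
    Returns {name: (start, length)}.
    """
    table, start = {}, 0
    for name, length in block_lengths:
        table[name] = (start, length)
        start += length
    return table


def symbol_address(table, block, offset):
    # Pass 2: block starting address plus the symbol's location within its block.
    return table[block][0] + offset


# Hypothetical lengths for the default block, CDATA and CBLKS.
table = block_table([("(default)", 0x0066), ("CDATA", 0x000B), ("CBLKS", 0x1000)])
address = symbol_address(table, "CDATA", 0x0003)
```

With these assumed lengths the large CBLKS block starts last, so everything in the first two
blocks stays within a short distance of the instructions that refer to it.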
Separation of the program into blocks has considerably reduced the addressing problems. Because
the large buffer area is moved to the end of the object program, we no longer need to use
extended format instructions on lines 15, 35, and 65. The base register is no longer necessary.
The problem of placing literals is also solved: an LTORG statement is placed in the CDATA block
to be sure that the literals are placed ahead of any large data areas. It is not necessary to
physically rearrange the generated code in the object program to place the pieces of each program
block together. The assembler simply writes the object code as it is generated during Pass 2 and
inserts the proper load address in each Text record. These load addresses reflect the starting
address of the block as well as the relative location of the code within the block. For example, in
the figure, the first two Text records are generated from lines 5 through 70. When the USE
statement is recognized, the assembler writes out the current Text record, even if there is still
room left in it, and begins a new Text record for the new program block.
Lines 95 through 105 generate no code, so no Text records are created for them. The next Text
records are generated from lines 125 through 180. The statements of the following program block
do generate object code: the fifth Text record contains the single byte of data from line 185. The
sixth record resumes the default program block. It does not matter that the Text records of the
object program are not in sequence by address; the loader will load the object code from each
record at the indicated address.
A control section is a part of the program that maintains its identity after the assembly. It can be
loaded and relocated independently of the other control sections. Different control sections are
most often used for subroutines or other logical subdivisions of a program. The programmer can
assemble, load, and manipulate each of these control sections separately. The resulting flexibility
is the major benefit of using control sections. When control sections form logically related parts
of the program, there should be some means for linking control sections together. For example,
the instructions in one control section might need to refer to instructions or data located in
another section. The control sections are independently loaded and relocated, so the assembler is
unable to process these references in the usual way. The assembler has no idea where any other
control section will be located at execution time. The references between control sections are
called external references. The assembler generates information for each external reference that
will allow the loader to perform the required linking. Consider an example program in the next
slide that might be written using multiple control sections. In this case there are three control
sections: COPY, RDREC, and WRREC.
The START statement identifies the beginning of the assembly and gives the name COPY to the
first control section. The first section continues until the CSECT statement on line 109. CSECT is
an assembler directive that signals the start of a new control section, here named RDREC.
Similarly, the CSECT statement on line 193 begins the control section named WRREC. The
assembler establishes a separate location counter for each control section, as is done for program
blocks.
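The difference from program blocks is that control sections are handled separately by the
assembler: it is not necessary for all control sections in a program to be assembled at the same
time. Symbols that are defined in one control section may not be used directly by another control
section. They must be identified as external references for the loader to handle. There are two
assembler directives to identify such references:
• EXTDEF (external definition): names symbols that are defined in this control section and may be
used by other sections
• EXTREF (external reference): names symbols that are used in this control section and are
defined elsewhere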
Control section names (in this case COPY, RDREC, and WRREC) do not need to be named in an
EXTDEF statement because they are automatically considered to be external symbols. The order
in which symbols are listed in EXTDEF and EXTREF statements is not significant.
The operand RDREC is named in the EXTREF statement for the control section, so this is an
external reference. The assembler has no idea where the control section containing RDREC will be
loaded, so it cannot assemble the address for this instruction. Instead the assembler inserts an
address of zero and passes information to the loader, which will cause the proper address to be
inserted at load time. The address of RDREC will have no predictable relationship to
anything in this control section; therefore, relative addressing is not possible. Thus an extended
format instruction must be used to provide room for the actual address to be inserted. This is true
of any instruction whose operand involves an external reference.
Here the value of the data word to be generated is specified by an expression involving two
external references: BUFEND and BUFFER. The assembler stores this value as zero. When the
program is loaded, the loader will add to this data area the address of BUFEND and subtract from
it the address of BUFFER, which results in the desired value. Note the difference between the
handling of the expression on line 190 and the similar expression on line 107. On line 107, the
symbols BUFEND and BUFFER are defined in the same control section as the EQU statement. Thus
the value of the expression can be calculated immediately by the assembler. This could not be done
for line 190: there, BUFEND and BUFFER are defined in another control section, so their values are
unknown at assembly time.
This makes an external reference to BUFFER. The instruction is assembled using extended
format with an address of zero. The x bit is set to 1 to indicate indexed addressing, as specified by
the instruction. The assembler must remember (via entries in SYMTAB) in which control
section a symbol is defined. Any attempt to refer to a symbol in another control section must be
flagged as an error unless the symbol is identified (using EXTREF) as an external reference. The
assembler must also allow the same symbol to be used in different control sections. For example,
the conflicting definitions of MAXLEN on lines 107 and 190 should cause no problem: a reference
to MAXLEN in the control section COPY would use the definition on line 107, whereas a reference
to MAXLEN in RDREC would use the definition on line 190. The assembler leaves room in the
object code for the values of external symbols. It must also include information in the object
program that will cause the loader to insert the proper values where they are required. For this
purpose, two new record types are needed in the object program:
• Define Record
• Refer Record
Define Record: gives information about the external symbols that are defined in this control
section, ie., the symbols named by EXTDEF.
Refer Record: lists the symbols that are used as external references by the control section, ie.,
the symbols named by EXTREF.
The other information needed for program linking is added to the Modification record type:
• Col. 1 M
• Col. 2-7 Starting address of the field to be modified, relative to the beginning of the
control section (hex)
• Col. 8-9 Length of the field to be modified, in half-bytes (hex)
• Col. 10 Modification flag (+ or -)
• Col. 11-16 External symbol whose value is to be added to or subtracted from the
indicated field
In the Modification record, the first three items are the same as we studied earlier. The two new
items specify the modification to be performed: adding or subtracting the value of an external
symbol.
• The symbol used for modification may be defined either in this control section or in
another one
The figure shows the object program corresponding to the source program written using control
sections. Note that there is a separate set of object program records (from Header through End)
for each control section. The records for each control section are exactly the same as they would
be if the sections were assembled separately. The Define and Refer records for each control
section include the symbols named in the EXTDEF and EXTREF statements. In the case of Define,
the record also indicates the relative address of each external symbol within the control section.
For EXTREF symbols, no address information is available. These symbols are simply named in the
Refer record.
Modification record :
• M00000405+RDREC
• M00000705+COPY
• M00001405+COPY
• M00002705+COPY
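How a loader might apply records in this revised format can be sketched in Python. This is an
illustration under stated assumptions: the function name apply_mod_record, the estab dictionary
of external symbol addresses, the bytearray memory model, and all load addresses and field
positions in the usage lines are hypothetical.

```python
def apply_mod_record(record, memory, section_start, estab):
    """Apply one revised Modification record to a loaded control section.

    record:        text such as 'M00000405+RDREC', laid out in the columns listed above
    memory:        bytearray indexed by actual address
    section_start: load address of the control section the record belongs to
    estab:         external symbol -> actual address, as known to the loader
    """
    addr = section_start + int(record[1:7], 16)      # cols 2-7
    half_bytes = int(record[7:9], 16)                # cols 8-9
    sign = 1 if record[9] == "+" else -1             # col 10
    symbol = record[10:16].strip()                   # cols 11-16

    nbytes = (half_bytes + 1) // 2
    mask = (1 << (4 * half_bytes)) - 1
    raw = int.from_bytes(memory[addr:addr + nbytes], "big")
    field = ((raw & mask) + sign * estab[symbol]) & mask
    memory[addr:addr + nbytes] = ((raw & ~mask) | field).to_bytes(nbytes, "big")


# Hypothetical load map and memory contents.
estab = {"RDREC": 0x4063, "BUFEND": 0x5033, "BUFFER": 0x4033}
memory = bytearray(0x4100)
memory[0x4003:0x4007] = bytes.fromhex("4B100000")    # +JSUB with address zero
apply_mod_record("M00000405+RDREC", memory, 0x4000, estab)

# A data word defined by a difference of two external symbols needs two records.
apply_mod_record("M00002806+BUFEND", memory, 0x4000, estab)
apply_mod_record("M00002806-BUFFER", memory, 0x4000, estab)
```

The two records for the data word can be applied in either order, because the field is kept
modulo its width after each step.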
The existence of multiple control sections that can be relocated independently of one another
makes the handling of expressions slightly more complicated. If two terms represent relative
locations in the same control section, their difference is an absolute value regardless of where
the control section is loaded. On the other hand, if they are in different control sections, their
difference has a value that is unpredictable.
A one-pass assembler is used when it is necessary or desirable to avoid a second pass over the
source program. The main problem in trying to assemble a program in one pass involves forward
references: the operand parts of instructions are often symbols that have not yet been defined in
the source program.
One type of one-pass assembler generates its object code in memory for immediate execution. No
object program is written out, and no loader is needed. This kind of load-and-go assembler is
useful in a system oriented toward program development and testing.
Multi-Pass Assembler
In symbol-defining statements that use the EQU and ORG directives, any symbol used in the
expression on the right-hand side that gives the new value must be defined previously in the
source program.
As a result, ALPHA cannot be evaluated during the second pass. This means that any assembler
that makes only two sequential passes over the source program cannot resolve such a sequence
of definitions. The general solution is a multi-pass assembler that can make as many passes as
are needed to process the definitions of symbols. It is not necessary for such an assembler to make
more than two passes over the entire program. Instead, the portions of the program that involve
forward references in symbol definitions are saved during Pass 1. Additional passes through these
stored definitions are made as the assembly progresses. This process is followed by a normal
Pass 2. There are several ways of accomplishing these tasks. One method involves storing the
symbol definitions that involve forward references in the symbol table. This table also indicates
which symbols are dependent on the values of others, to facilitate symbol evaluation.
Consider the sequence of symbol defining statements that involve forward references:
MAXLEN has not yet been defined, so no value for HALFSZ can be computed. The
defining expression for HALFSZ is stored in the symbol table in place of its value. The entry
&1 indicates that one symbol in the defining expression is undefined. In an actual
implementation, this definition might be stored at some other location, and SYMTAB would then
simply contain a pointer to the defining expression. The symbol MAXLEN is also entered in
the symbol table, with the flag * identifying it as undefined. Associated with this entry is a
list of the symbols whose values depend on MAXLEN (here HALFSZ), similar to the lists of
forward references kept by the one-pass assembler seen earlier.
In the definition of MAXLEN there are two undefined symbols: BUFFEND and BUFFER. Both are
entered into SYMTAB with lists indicating the dependence of MAXLEN upon them.
Similarly, the definition of PREVBT causes this symbol to be added to the list of dependencies
on BUFFER.
So far we are simply saving symbol definitions for later processing. The definition of BUFFER
on line 4 begins the evaluation. Let us assume that when line 4 is read, the location counter
contains the hexadecimal value 1034. This address is stored as the value of BUFFER. The
assembler then examines the list of symbols that are dependent on BUFFER. The symbol table
entry for the first symbol in this list (MAXLEN) shows that it depends on two currently
undefined symbols, so MAXLEN cannot be calculated immediately. Instead, &2 is changed to &1 to
show that only one symbol in the definition (BUFFEND) remains undefined. The other symbol in
the list (PREVBT) depends only on BUFFER, so it can be calculated and stored in SYMTAB.
When BUFFEND is defined in line 5, its value is entered into the symbol table. The list
associated with BUFFEND then directs the assembler to evaluate MAXLEN, and entering a
value for MAXLEN causes the evaluation of the symbol in its list (HALFSZ). This completes
the symbol definition process. If any symbols remained undefined at the end of the program, the
assembler would flag them as errors.
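The bookkeeping described above (the &n counts and the dependency lists) can be sketched roughly as below. Several things here are assumptions, not taken from these notes: the defining expressions MAXLEN/2, BUFFEND-BUFFER and BUFFER-1, the value 2034 (hex) for BUFFEND, the restriction to one binary operator per expression, and all class and method names.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "/": operator.floordiv}


class SymTab:
    def __init__(self) -> None:
        self.value = {}       # defined symbol -> value
        self.expr = {}        # symbol -> (left, op, right) still awaiting evaluation
        self.missing = {}     # symbol -> number of undefined operands (the "&n" entry)
        self.dependents = {}  # undefined symbol -> symbols whose values depend on it

    def define_value(self, name, value):
        """Store a value, then revisit every symbol that was waiting on `name`."""
        self.value[name] = value
        for dep in self.dependents.pop(name, []):
            self.missing[dep] -= 1          # e.g. &2 becomes &1
            if self.missing[dep] == 0:
                self._evaluate(dep)

    def define_expr(self, name, left, op, right):
        """Handle `name EQU left op right`; operands are symbol names or integers."""
        self.expr[name] = (left, op, right)
        unknown = {t for t in (left, right)
                   if isinstance(t, str) and t not in self.value}
        if not unknown:
            self._evaluate(name)
            return
        self.missing[name] = len(unknown)
        for sym in unknown:
            self.dependents.setdefault(sym, []).append(name)

    def _evaluate(self, name):
        left, op, right = self.expr.pop(name)
        self.missing.pop(name, None)

        def operand(term):
            return self.value[term] if isinstance(term, str) else term

        self.define_value(name, OPS[op](operand(left), operand(right)))

    def undefined(self):
        """Symbols that would be flagged as errors at the end of the program."""
        return sorted(set(self.dependents) | set(self.expr))


tab = SymTab()
tab.define_expr("HALFSZ", "MAXLEN", "/", 2)          # assumed definition
tab.define_expr("MAXLEN", "BUFFEND", "-", "BUFFER")  # assumed definition
tab.define_expr("PREVBT", "BUFFER", "-", 1)          # assumed definition
tab.define_value("BUFFER", 0x1034)   # line 4: PREVBT resolves, MAXLEN still waits
tab.define_value("BUFFEND", 0x2034)  # line 5 (assumed value): MAXLEN, then HALFSZ
print({name: hex(v) for name, v in sorted(tab.value.items())}, tab.undefined())
```

Defining BUFFEND triggers MAXLEN, and storing MAXLEN in turn triggers HALFSZ, which mirrors the chain of evaluations in the text.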
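The x86 architecture is described here in terms of its memory, registers, data formats,
instruction formats, addressing modes and instruction set.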
Memory
Memory can be described at two levels. At the physical level, memory consists of 8-bit bytes,
and all addresses used are byte addresses. Two consecutive bytes form a word; four bytes form a
double word (also called a dword). At the programmer's level, x86 memory is viewed as a
collection of segments, so an address consists of two parts: a segment number and an offset that
points to a byte within the segment. Segments can be of different sizes and are used for
different purposes. Some segments contain executable instructions, and other segments may be
used to store data. Some data segments may be treated as stacks that can be used to save
register contents, pass parameters to subroutines, and for other purposes. It is not necessary
for all of the segments used by a program to be in physical memory. In some cases a segment can
also be divided into pages; some pages of the segment may be in physical memory, while others
may be stored on disk. When an x86 instruction is executed, the hardware and the operating
system make sure that the needed byte of the segment is loaded into physical memory. The
segment/offset address specified by the programmer is automatically translated into a physical
byte address by the x86 Memory Management Unit (MMU).
Registers
There are eight general-purpose registers, each 32 bits long (i.e. one double word). Registers
EAX, EBX, ECX and EDX are generally used for data manipulation, and it is possible to access
individual words or bytes from these registers. The other four general-purpose registers (ESI,
EDI, EBP and ESP) can also be used for data, but are more commonly used to hold addresses.
The floating-point unit (FPU) contains eight 80-bit data registers and several other control
and status registers.
Data Formats
Integers are stored as 8-, 16-, or 32-bit binary numbers. Both signed and unsigned numbers
(ordinals) are supported, with 2’s complement representation for negative values. Integers can
also be stored in binary coded decimal (BCD) form:
• Packed: each byte represents two decimal digits, with each digit encoded (in binary) using 4
bits of the byte
• Unpacked: each byte represents one decimal digit. The value of this digit is encoded (in
binary) in the low-order 4 bits of the byte; the high-order bits are normally zero
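The two BCD layouts can be illustrated with a short sketch. The function names are hypothetical, and the bytes are produced most significant digit first purely for readability; the sketch makes no claim about the byte order the x86 uses in memory.

```python
def to_unpacked_bcd(n: int) -> bytes:
    """One decimal digit per byte, held in the low-order 4 bits."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    return bytes(int(d) for d in str(n))


def to_packed_bcd(n: int) -> bytes:
    """Two decimal digits per byte, 4 bits each."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    digits = str(n)
    if len(digits) % 2:
        digits = "0" + digits  # pad to an even number of digits
    return bytes((int(digits[i]) << 4) | int(digits[i + 1])
                 for i in range(0, len(digits), 2))


print(to_unpacked_bcd(1234).hex())  # four bytes: 01 02 03 04
print(to_packed_bcd(1234).hex())    # two bytes: 12 34
```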
Characters are stored one per byte, and represented using 8-bit ASCII codes.
Strings may consist of bits, bytes, words or doublewords; there are also special instructions to
handle each type of string. The FPU can also handle 64-bit signed integers. Floating-point
values are stored in the following formats:
– single-precision format (32 bits long: 24 significant bits, of which 23 are stored, an 8-bit
exponent field, and 1 bit for storing the sign)
– double-precision format (64 bits long: 53 significant bits, of which 52 are stored, an 11-bit
exponent field, and 1 bit for storing the sign)
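The field layout of the two formats can be inspected with Python's standard `struct` module. This sketch assumes the IEEE 754 layout used by the x86 FPU. Single precision has 1 sign bit, an 8-bit biased exponent field and 23 stored fraction bits (24 significant bits counting the implicit leading 1). Double precision has 1 sign bit, an 11-bit exponent field and 52 stored fraction bits. The function names are hypothetical.

```python
import struct


def single_fields(x: float):
    """Return (sign, biased exponent, stored fraction) of x as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def double_fields(x: float):
    """Return (sign, biased exponent, stored fraction) of x as a 64-bit float."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


print(single_fields(1.0))   # exponent field holds the bias, 127
print(single_fields(-2.5))  # sign bit set; 2.5 = 1.25 * 2**1
print(double_fields(1.0))   # exponent field holds the bias, 1023
```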
Instruction Formats
An instruction begins with optional prefixes containing flags that modify the operation of the
instruction:
• Some specify a repetition count for an instruction
• Others specify a segment register that is to be used for addressing an operand (overriding
the normal default assumptions made by the hardware)
Following the prefixes (if any) is an opcode (1 or 2 bytes). Some operations have several
different opcodes, each specifying a different variant of the operation. Following the opcode
are a number of bytes that specify the operands and addressing modes to be used. The opcode is
the only element that is always present in every instruction. Other elements may or may not be
present, and may be of different lengths depending on the operation and operands involved. Thus
there is a large number of potential instruction formats, varying in length from 1 byte to 10
bytes or more.
Addressing modes
• Immediate mode: the operand value is specified as part of the instruction itself
• Operands stored in memory are often specified using variations of the general target address
calculation: TA = (base register) + (index register) × (scale factor) + displacement
• Any general-purpose register may be used as a base register. Any general-purpose register
except ESP can be used as an index register. The scale factor may have the value 1, 2, 4, or 8.
The displacement may be an 8-, 16-, or 32-bit value
• The base and index register numbers, scale and displacement are encoded as parts of the
operand specifiers in the instruction
• Various combinations of these items may be omitted, resulting in eight different addressing
modes
• Direct mode: the address of an operand in memory may also be specified as an absolute
location
• Relative mode: the address of an operand in memory may also be specified as a location
relative to the EIP register
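The general target address calculation for memory operands (base plus scaled index plus displacement) can be sketched as below. The function name and the 32-bit wraparound are assumptions made for the illustration; omitting an item corresponds to leaving its argument at the default.

```python
def target_address(base: int = 0, index: int = 0, scale: int = 1,
                   displacement: int = 0) -> int:
    """TA = base + index * scale + displacement, wrapped to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale factor must be 1, 2, 4, or 8")
    return (base + index * scale + displacement) & 0xFFFFFFFF


# Fourth doubleword (index 3, scale 4) of an array that starts 8 bytes
# past the address held in the base register.
print(hex(target_address(base=0x1000, index=3, scale=4, displacement=8)))

# Base and displacement only; index and scale omitted.
print(hex(target_address(base=0x1000, displacement=-4)))
```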
Instruction Set
• There are more than 400 different instructions. An instruction may have zero, one, two or
three operands
• In some cases, operands may also be specified in the instruction as immediate values
• Most data movement and integer arithmetic instructions can use operands that are 1,2 or 4
bytes long
• String manipulation instructions, which use repetition prefixes, can deal directly with
variable-length strings of bytes, words or double words
• There are many instructions that perform logical and bit manipulations, and support control
of the processor and memory management systems
• The x86 architecture also includes special-purpose instructions to perform operations
frequently required in high-level programming languages, for example entering and leaving
procedures and checking subscript values against the bounds of an array
• Input is performed by instructions that transfer one byte, word or double word at a time
from an I/O port into register EAX
• Output instructions transfer one byte, word or double word from EAX to an I/O port
Implementation-MASM Assembler
ASSUME ES:DATASEG2
tells the assembler to assume that register ES indicates the segment DATASEG2. Thus, any
reference to labels that are defined in DATASEG2 will be assembled using register ES. It is
possible to collect several segments into a group and use ASSUME to associate a segment register
with the group. Registers DS, ES, FS and GS must be loaded by the program before they can be
used to address data segments. For example, the instructions
MOV AX, DATASEG2
MOV ES, AX
would set ES to indicate the data segment DATASEG2. ASSUME is similar to the BASE directive in
SIC/XE. BASE tells a SIC/XE assembler the contents of register B; the programmer must provide
the executable instructions to load this value into the register. Similarly, ASSUME tells MASM
the contents of a segment register; the programmer must provide instructions to load this
register when the program is executed. Jump instructions are assembled in two ways, depending on
whether the target of the jump is in the same code segment as the jump instruction:
• Near jump: jump to a target address in the same code segment as the jump instruction
• Far jump: jump to a target address in a different code segment
A near jump is assembled using the current code segment register CS. The assembled machine
instruction for a near jump occupies 2 or 3 bytes. A far jump is assembled using a different
segment register, which is specified in the instruction prefix. The assembled machine
instruction for a far jump occupies 5 bytes. Forward references to labels in the source program
can cause problems. For example, consider a jump instruction
JMP TARGET
If the definition of the label TARGET occurs in the program before the JMP instruction, the
assembler can tell whether this is a far jump or a near jump. If the jump is a forward reference
to TARGET, the assembler does not know how many bytes to reserve for the instruction. By
default, MASM assumes that a forward jump is a near jump. If the target is in another code
segment, the programmer must warn the assembler by writing
JMP FAR PTR TARGET
If the programmer does not specify FAR PTR, a problem occurs. During Pass 1, the assembler
reserves 3 bytes for the jump instruction, but the actual assembled instruction requires 5
bytes. In earlier versions of MASM, this caused a phase error. In later versions, the assembler
can repeat Pass 1 to generate the correct location counter values. The far jump problem is
similar to forward references in SIC/XE that require the use of extended format instructions.
There are other situations in which the length of an assembled instruction depends on the
operands that are used. For example, the operands of an ADD instruction may be registers, memory
locations or immediate operands. Immediate operands may occupy from 1 to 4 bytes in the
instruction. An operand that specifies a memory location may take varying amounts of space in
the instruction, depending upon the location of the operand. For these reasons, Pass 1 of an
x86 assembler is more complex than Pass 1 of a SIC assembler.
Segments in a MASM source program can be written in more than one part. If a SEGMENT directive
specifies the same name as a previously defined segment, it is considered to be a continuation
of that segment.
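The effect of a wrongly sized forward jump, and the idea of repeating Pass 1 until the location counter values are consistent, can be sketched as follows. This is a deliberately simplified model, not MASM's actual algorithm. Every jump is first assumed to be near (3 bytes), any jump whose target turns out to be in another segment is then marked far (5 bytes), and the pass is repeated. The statement format and all names are hypothetical.

```python
NEAR_SIZE, FAR_SIZE = 3, 5


def pass1(program, far_jumps):
    """Assign addresses; returns label -> (segment, offset)."""
    labels, locctr, segment = {}, {}, None
    for i, (kind, arg) in enumerate(program):
        if kind == "segment":
            segment = arg
            locctr.setdefault(segment, 0)  # a repeated name continues the segment
        elif kind == "label":
            labels[arg] = (segment, locctr[segment])
        elif kind == "jmp":
            locctr[segment] += FAR_SIZE if i in far_jumps else NEAR_SIZE
        else:  # "bytes": any other statement, of known size
            locctr[segment] += arg
    return labels


def assemble_sizes(program):
    """Repeat pass 1 until every cross-segment jump is sized as a far jump."""
    far_jumps, passes = set(), 0
    while True:
        passes += 1
        labels = pass1(program, far_jumps)
        segment, wrong = None, set()
        for i, (kind, arg) in enumerate(program):
            if kind == "segment":
                segment = arg
            elif kind == "jmp" and i not in far_jumps and labels[arg][0] != segment:
                wrong.add(i)  # reserved 3 bytes, but a far jump needs 5
        if not wrong:
            return labels, far_jumps, passes
        far_jumps |= wrong


program = [
    ("segment", "CODE1"),
    ("jmp", "TARGET"),   # forward reference into another code segment
    ("label", "NEXT"),
    ("bytes", 4),
    ("segment", "CODE2"),
    ("label", "TARGET"),
    ("bytes", 2),
]
labels, far_jumps, passes = assemble_sizes(program)
print(labels["NEXT"], sorted(far_jumps), passes)
```

In the first pass NEXT is placed at offset 3; after the jump is recognised as far, the repeated pass moves it to offset 5, which is the kind of mismatch a phase error reports.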
All the parts of a segment are gathered together by the assembly process. References between
segments that are assembled together are automatically handled by the assembler. External
references between separately assembled modules must be handled by the linker. The MASM
directive PUBLIC has the same function as EXTDEF in SIC/XE, and the MASM directive EXTRN has the
same function as EXTREF in SIC/XE. The object program from the MASM assembler may be in several
different formats. MASM can also produce an instruction timing listing that shows the number of
clock cycles required to execute each machine instruction. This allows the programmer to
exercise a great deal of control in optimizing time-critical sections of code.