Assembly Language

Table of Contents

Chapter 1 and 2 -- Some Basics
Chapter 3 -- SASM
Extra SASM programs
Chapter 4 -- Number Systems
Chapter 5 -- Data Representation
Chapter 6 -- Integer Arithmetic
Chapter 7 -- Floating Point Arithmetic
Chapter 8 -- Data Structures
Chapter 9 -- Using Registers
Chapter 10 -- Pentium Assembly Language
Chapter 11 -- Implementing Procedures
Chapter 12 -- The Assembly Process
Chapter 13 -- I/O
Chapter 14 -- Exception Handling
Chapter 15 -- Features for Performance
Chapter 16 -- Case Studies
Chapter 17 -- Virtual Memory
Details of the MIPS RISC architecture

Chapter 1 -- Some basics


Some introductory material
--------------------------

a basic concept within computer science and engineering:


Levels of Abstraction
hierarchy
models

used (for our purposes) to design programs, computers

when a problem is large, it needs to be broken down --
we "divide and conquer"
one way is by introducing a hierarchy (level), and solving
the problem at each level
example: design of a computer
1. transistors available
2. gates, flip flops
3. components, like registers and adders
4. CPU, memory system, I/O devices
5. computer system
| ------- CPU
|
computer system ------| ------- memory
|
| ------- I/O stuff

components <----> gates <--------> transistors

TOP-DOWN vs. BOTTOM-UP design

another example: software levels of abstraction


writing a large program -- >10,000 lines of code
TOP-DOWN:
divide into modules, design each module separately.
define procedures (functions) that accomplish a task.
specify interfaces (parameters) for the procedures.

the implementation of a function is independent of
its interface specification! it is a different level
in the abstraction of program design.

the "big picture" -- a computer with software running on it.

HLL computer
Pascal, C, Fortran . . . . . . . . . hardware

how do we get from one to the other?


wanted: write in nice abstract HLL.
have: stupid computer that only knows how to
execute machine language

what is machine language?


binary sequences (lots of 1's and 0's in a very specific order)
interpreted by computer as instructions.
not very human readable.

to help the situation, introduce assembly language --
a more human readable form of machine language.
uses MNEMONICS for the instruction type, and operands
BUT, now we need something to translate assembly lang.
to machine lang.: an ASSEMBLER
an example might be something like:
add AA, BB

"add" is the mneumonic or opcode (operation code)


AA and CC are the operands, the variables used in
the instruction.

lastly, if we had a program that translated HLL programs
to assembly language, then we'd have it made.
a COMPILER does this.

complete picture:

----------- ------------
HLL ---> | compiler|---> assembly --->| assembler|--->machine
----------- language ------------ language

(least detailed)                     (most detailed)
(top level)                          (bottom level)

this course deals with the software aspects of assembly language,
assemblers and machine language. It also deals with the
hardware aspects of what the computer does to execute programs.
It is an introduction to the study of COMPUTER ARCHITECTURE:
the interface between hardware and software.

about COMPUTER ARCHITECTURE

the relationship between hardware (stuff you can touch)
and software (programs, code)

I can design a computer that has hardware which executes
programs in any programming language.
For example, a computer that directly executes Pascal.

So, why don't we do just that?


1. From experience (in the engineering community), we know
that hardware that executes HLL programs directly
is slower than hardware that executes a more simple, basic
set of instructions.
2. Usability of the machine. Not everyone wants a Pascal
machine. ANY high level language can be translated into
assembly language.

In this class, in whatever language you are writing programs,
it will look like you have a machine that executes
those programs directly.

What we will do:

hll ---> SASM ---> Pentium

we assume that you know a hll (high level language, like C++, C,
Pascal, Fortran). From that, we can give you
SASM. Later in the semester, you will learn Pentium.
Programs will be written in both SASM and Pentium.

hll and SASM are each abstractions.
Each defines a computer architecture.
Pentium happens to be a real (manufactured) architecture.

basic computer operation
------------------------

simplified diagram of a computer system (hardware!)

------- ----------
| CPU | <---------> | memory |
------- | ----------
|
|
-------
| I/O |
-------

CPU -- controls the running of programs
executes instructions
makes requests of the memory
CPU stands for central processing unit
CPU and processor are synonyms (book uses the term processor)

NOTE: Many PC users incorrectly identify the term CPU
with whatever is in the box that their display sits
on top of. Chances are the real CPU is inside that
box, but there will be many more things in there as
well.

memory -- where programs and program variables are stored
handles requests from the CPU

(STORED PROGRAM COMPUTER concept)

interaction between processor and memory.
to execute an instruction, the processor must be able to
request 3 things from memory:

1. instruction FETCH
2. operand (variable) load LOAD
3. operand (variable) store STORE

the memory really only needs to be able to do 2 operations
1. read (fetch or load)
2. write (store)

where? a label specifies a unique place (a location) in memory.
a label is often identified as an address.

read: CPU specifies an address and a read operation
memory responds with the contents of the given address

write: CPU specifies an address, data to be stored, and
a write operation
memory responds by overwriting the data at the
address specified

for discussion: how (most) processors operate WRT the execution
of instructions.

discussion by a generic assembly language instruction example:


mult aa, bb, cc

instructions and operands are stored in memory.
before they can be used by the processor, they must
be fetched/loaded

processor steps involved:


1. fetch the instruction
questions for later: which instruction? at what address?
2. figure out what the instruction is -- DECODE
IT IS A MULT INSTRUCTION
this also reveals how many operands there are, since
the number of operands is fixed for any given instruction
THERE ARE 3 OPERANDS
3. load operand(s)
OPERANDS ARE bb AND cc
4. do the operation specified by the instruction
MULTIPLY bb AND cc TOGETHER
5. store result(s) (if any)
RESULT GOES INTO VARIABLE aa

next step:
suppose we want to execute multiple instructions, like
a program

except for control instructions, execute instructions in their
(given) sequential storage order.

the CPU must keep track of which instruction is to be
executed

it does this by the use of an extra variable contained within
and maintained by the processor, called a PROGRAM COUNTER, or PC
the contents of the variable are the address of the next
instruction to be executed.

Note: Intel calls the PC an Instruction Pointer or IP. The rest
of the world uses PC.

modify the above CPU steps:


1. fetch the instruction at the address given by the PC

added step: modify the PC such that it contains the address of
the next instruction to execute

2-5. the same as above

The added step could come at any time after step 1. It is convenient
to think of it as step 2.

This set of steps works fine for all instructions EXCEPT
control instructions.

Control Instructions example


beq x, y, label

1. fetch instruction -- address given by PC
2. update PC
3. decode
(it's a BEQ instruction, and there are 3 operands)
4. fetch operands (x and y)
5. compare operands (for equality)
6. if equal, overwrite PC with address implied by 3rd operand (label)
The processor steps involved:
1. fetch the instruction
2. update PC
3. decode
4. load operand(s)
5. do the operation specified by the instruction
6. store result(s) (if any)

notice that this series of steps gets repeated constantly --
to make the computer useful, all that is needed is a way
to give the PC an initial value (the first instruction of a program),
and to have a way of knowing when the program is done, so the
PC can be given the starting address of another program.

the cycle of steps is very important -- it forms the basis for
understanding how a computer operates. The cycle of steps
is termed the INSTRUCTION FETCH and EXECUTE CYCLE.
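
To make the cycle concrete, here is a sketch in C of a fetch/decode/execute
loop for a made-up machine (the opcodes, memory size, and 4-word instruction
layout are invented for illustration; this is not SASM or the Pentium):

    #include <stdio.h>

    /* made-up machine: an instruction is 4 words in memory:
       opcode, then three operand addresses */
    enum { OP_HALT = 0, OP_MULT = 1, OP_BEQ = 2 };

    int memory[256];   /* instructions AND variables live here */
    int pc = 0;        /* PROGRAM COUNTER: address of next instruction */

    void run(void) {
        for (;;) {
            int opcode = memory[pc];    /* 1. fetch the instruction */
            int a = memory[pc + 1];
            int b = memory[pc + 2];
            int c = memory[pc + 3];
            pc += 4;                    /* 2. update the PC */
            switch (opcode) {           /* 3. decode */
            case OP_MULT:               /* mult aa, bb, cc */
                memory[a] = memory[b] * memory[c];  /* 4,5. load, op, store */
                break;
            case OP_BEQ:                /* beq x, y, label */
                if (memory[a] == memory[b])
                    pc = c;             /* 6. overwrite PC with the label */
                break;
            case OP_HALT:
                return;
            }
        }
    }

    int main(void) {
        /* program: mult m(100), m(101), m(102) then halt */
        memory[0] = OP_MULT; memory[1] = 100;
        memory[2] = 101;     memory[3] = 102;
        memory[4] = OP_HALT;
        memory[101] = 6; memory[102] = 7;
        run();
        printf("%d\n", memory[100]);    /* prints 42 */
        return 0;
    }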

Chapter 3 -- SASM
ABOUT SASM
----------

MOTIVATION for SASM:
hiding the details of Pentium asm. lang. (on purpose!)
SASM code will look more like HLL code -- to make the student's transition
easier.
Introducing one more level of abstraction in order to postpone
discussion of several topics.

HLL SASM assembly machine code

each HLL statement maps into 1 or MORE SASM instructions
each SASM instruction maps into 1 or MORE Pentium instructions

SASM -- the language
--------------------

A subset of the functionality of most high level languages --
no records/structures
no formal arrays
no procedures/functions

What is required by a programming language?
declarations
arithmetic operations
conditional execution (if then else)
looping control structures
communication w/user . . . (write statement)

About SASM:
-- one instruction, declaration per line
-- comments are anything on a line following `;'
(comments may not span lines)
-- given the Intel architecture and its history, there are an enormous
number of RESERVED WORDS. Consult APPENDIX A always!

DECLARATIONS
------------

- they give information about how much memory space is needed
- they assign a name to the memory space

SASM has 3 basic types: integer, float (real), character
can build other types out of these,
for example, boolean is really an integer with only 2 defined values.

Pascal:
var variablename: type;

C or C++:
type variablename;

SASM:
variablename type value

type is dd if integer
db if character
dd if floating point

value is required -- it gives the variable an initial value
-- to explicitly leave value undefined, use
the '?' character

examples:
bool_flag dd 0

counter dd 0

variable3 dd ?

constant_e dd 2.71828

uservalue db ?

letter_a db 'a'
string1 db 'This is a string.', 0
; null terminated string example, VERY USEFUL!

string2 db 'Another string', 0ah, 0
; that 0ah is the newline character, AND this string is
; null terminated.

remember:
-- one declaration per line.

DIRECTIVES
----------
a way to give information to the assembler.

- some directives start with `.' (period)

examples:

dd # tells the assembler to allocate 32 bits
db # tells the assembler to allocate 8 bits

.data # identifies the start of the declaration section
# there can be more than 1 .data section in
# a program

.code # identifies where instructions are
# there can be more than 1 .code section in
# a program

.stack # You get this set of memory, called a stack.
# Don't worry about it for now, just use it.

.model # Gives the assembler information about how to
# place stuff in memory, and how to call stuff
# outside the program (like library calls)

ARITHMETIC instructions
----------------------

SASM           Pascal           C or C++      NOTES

move x, y      x := y;          x = y;        x and y are ints or floats
moveb x, y     x := y;          x = y;        x and y are chars
movezx x, y    NO EQUIV         NO EQUIV      x is int, y is char (SIZE)
movesx x, y    x := y;          x = y;        x is int, y is char (SIZE)
ineg x         x := -x;         x = -x;
iadd x, y      x := x + y;      x = x + y;    integer addition
isub x, y      x := x - y;      x = x - y;    integer subtraction
imult x, y     x := x * y;      x = x * y;    integer multiplication
idivi x, y     x := x div y;    x = x / y;    integer division (quotient)
irem x, y      x := x mod y;    x = x % y;    integer division (remainder)
fpadd x, y     x := x + y;      x = x + y;    floating point addition
fpsub x, y     x := x - y;      x = x - y;    floating point subtraction
fpmul x, y     x := x * y;      x = x * y;    floating point multiplication
fpdiv x, y     x := x / y;      x = x / y;    floating point division

NOTES: - cannot increase the number of operands.
- y can be an IMMEDIATE for all except the floating point
instructions

examples:

move count, 0

imult product, multiplier

iadd sum, 1

NOTE: there are other instructions that implement boolean functions,
but we don't cover them yet.

The move instructions must be carefully chosen to match the type
of the data being moved. The operation and difference between
movezx and movesx will be covered after we talk about representations.

CONDITIONAL EXECUTION
---------------------
sometimes an instruction (or a set of instructions) should
be executed, and sometimes it (they) shouldn't.

HLL -- simplest form is a go-to. (Always discouraged.)

Pascal if-then-else (a conditional go-to!)

if (condition) then
statement
else
statement;

C if-then-else

if (condition)
statement;
else
statement;
SASM 'ifs' and 'gotos'
(a better name is CONTROL INSTRUCTIONS)
---------------------------------------
SASM             effect of instruction

br label         goto label;
blz label        if SF=1 then goto label;
bgz label        if SF=0 and ZF=0 then goto label;
blez label       if SF=1 or ZF=1 then goto label;
bgez label       if SF=0 or ZF=1 then goto label;
bez label        if ZF=1 then goto label;
bnz label        if ZF=0 then goto label;
compare x, y     result of x-y sets condition codes
compareb x, y    result of x-y sets condition codes

This is different than many other modern machines. There are
two CONDITION CODES that we must think about.

condition code      contents

zero flag (ZF)      ZF=1 if result is 0
sign flag (SF)      SF=1 if result is negative
                    SF=0 if result is zero or positive

These condition codes get changed (set) according to the result of certain
instructions (iadd, isub, ineg, and some logical instructions).
The condition codes are used by the control instructions.

To explicitly set the condition codes, use a compare instruction.
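
As a rough C model of what compare does (a sketch only; real hardware sets
the flags as a side effect of a subtraction):

    #include <stdbool.h>

    bool ZF, SF;   /* zero flag, sign flag */

    /* model of: compare x, y -- the result of x - y sets the codes */
    void compare(int x, int y) {
        int result = x - y;    /* (ignoring overflow in this sketch) */
        ZF = (result == 0);    /* ZF=1 if result is 0 */
        SF = (result < 0);     /* SF=1 if result is negative */
    }
    /* then e.g. blz is taken when SF, bgz when !SF && !ZF */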

EXAMPLE:
--------
Pascal if-then-else:

if (count < 0) then
begin
count := count + 1;
end;

C equivalent:

if (count < 0)
count = count + 1;

SASM equiv to if-then-else:

compare count, 0
blz ifstuff
br end_if
ifstuff: iadd count, 1
end_if: # next program instruction goes here

-- OR --

compare count, 0
bgez end_if
iadd count, 1
end_if: # next program instruction goes here

WHICH ONE OF THESE IS BETTER?

NOTE: Be careful not to use RESERVED WORDS for your variable names
or label names.
(some reserved words: end endif if else elseif for while repeat)

Structured loops can be built out of IF's and GOTO's


(test and branch)

EXAMPLES:
---------

while loop example

Pascal:

while ( count > 0 ) do
begin
a := a mod count;
count := count - 1;
end;

BAD STYLE Pascal:

while: if (count <= 0) then goto endwhile;
a := a mod count;
count := count - 1;
goto while;
endwhile:

C or C++:

while (count > 0) {
a = a % count;
count--;
}

SASM:

while_loop:
compare count, 0
blez end_while
irem a, count
isub count, 1
br while_loop
end_while: # next program instruction goes here

while loop example (compound conditional)
------------------------------------------
Pascal:

while (count < limit) and (c = d) do
begin
/* loop's code goes here */
end;

C or C++:

while ( (count < limit) && (c==d) )
{
/* loop's code goes here */
}
SASM:

while_loop:
compare count, limit
bgez end_while
compare c, d
bnz end_while

# loop's code goes here

br while_loop
end_while:

for loop example
----------------

Pascal:

for i:= 3 to 8 do
begin
a := a + i;
end;

C:

for ( i = 3; i <= 8; i++)
{
a = a + i;
}

SASM:

move i, 3
for_loop:
compare i, 8
bgz end_for
iadd a, i
iadd i, 1
br for_loop
end_for:

COMMUNICATION WITH THE USER (I/O operations)
--------------------------------------------
SASM            effect of instruction

get_ch x        read character from input, place into x
put_ch x        send character in x to output
put_i x         send integer in x to output
put_fp x        send floating point value in x to output
put_str x       send (NULL TERMINATED) string at x to output

SASM doesn't have any oddities about
testing for eoln or eof. The newline character (0ah, or '\n' in C)
is just another character to be read or written.

NOTE: There are times when you will want to 'get' something that
isn't a character (like an integer or floating point value input
by user). In SASM, you can't, since the instruction doesn't exist.
At the end of Chapter 5, you will know enough about data representation
to be able to read an integer (or floating point value) character
by character and translate it to an integer.

It is done this way because input from a keyboard are only characters.
Output to a simple display (which we are assuming) are only characters.
The C library (that we utilize) gives easy implementation of
output for other types, so you get that benefit in this language.

EXAMPLES:

; this is a code FRAGMENT, not a whole program

.data
msg1 db 'The integer is ', 0
int1 dd 285
newline db 0ah
msg2 db 'The second string.', 0ah, 0

.code
put_str msg1
put_i int1
put_ch newline
put_str msg2

prints:

The integer is 285
The second string.

SASM program examples

; SASM program to print out the alphabet

.486
.model flat, stdcall
.stack 1000h
include sasmacros.inc

title pralphabet program

.data

end_letter db 'z'
one_ch db 'a'
newline db 0ah

.code
main:
put_ch one_ch
compareb end_letter, one_ch
bez stop_printing    ; when one_ch becomes 'z', done printing
iadd one_ch, 1
br main              ; label main is the top of the loop
stop_printing:
put_ch newline

done

end

Chapter 4 -- Number Systems


about number systems.
---------------------

here's the decimal number system as an example:
digits (or symbols) allowed: 0-9
base (or radix): 10
the order of the digits is significant

345 is really
3 x 100 + 4 x 10 + 5 x 1
3 x 10**2 + 4 x 10**1 + 5 x 10**0
3 is the most significant symbol (it carries the most weight)
5 is the least significant symbol (it carries the least weight)

here's a binary number system:
digits (symbols) allowed: 0, 1
base (radix): 2

each binary digit is called a BIT

the order of the digits is significant

numbering of the digits

msb            lsb
n-1              0

where n is the number of digits in the number

1001 (base 2) is really
1 x 2**3 + 0 x 2**2 + 0 x 2**1 + 1 x 2**0
9 (base 10)

11000 (base 2) is really
1 x 2**4 + 1 x 2**3 + 0 x 2**2 + 0 x 2**1 + 0 x 2**0
24 (base 10)

here's an octal number system:
digits (symbols) allowed: 0-7
base (radix): 8

the order of the digits is significant

345 (base 8) is really
3 x 8**2 + 4 x 8**1 + 5 x 8**0
192 + 32 + 5
229 (base 10)

1001 (base 8) is really
1 x 8**3 + 0 x 8**2 + 0 x 8**1 + 1 x 8**0
512 + 0 + 0 + 1
513 (base 10)
here's a hexadecimal number system:
digits (symbols) allowed: 0-9, a-f
base (radix): 16

the order of the digits is significant

hex decimal
0 0
1 1
.
.
.
9 9
a 10
b 11
c 12
d 13
e 14
f 15

a3 (base 16) is really
a x 16**1 + 3 x 16**0
160 + 3
163 (base 10)

given all these examples, here's a set of formulas for the
general case.

here's an n-bit number (in weighted positional notation):

S(n-1) S(n-2) . . . S(2) S(1) S(0)

given a base b, this is the decimal value:

the summation (from i=0 to i=n-1) of S(i) x b**i

TRANSFORMATIONS BETWEEN BASES
-----------------------------

any base --> decimal
just use the definition given above.
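
As a sketch in C of "plug into the summation" (Horner's rule gives the same
value as the weighted sum; the digits are assumed to already be small
integers, most significant first):

    /* value of digits[0..n-1] (most significant digit first) in base b:
       the summation of S(i) x b**i, computed by Horner's rule */
    long base_to_decimal(const int *digits, int n, int base) {
        long value = 0;
        for (int i = 0; i < n; i++)
            value = value * base + digits[i];
        return value;
    }
    /* example: digits {3,4,5} in base 8 --> 229 */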

decimal --> binary
divide decimal value by 2 (the base) until the value is 0

example:
36/2 = 18 r=0 <-- lsb
18/2 = 9 r=0
9/2 = 4 r=1
4/2 = 2 r=0
2/2 = 1 r=0
1/2 = 0 r=1 <-- msb

36 (base 10) == 100100 (base 2)

14/2 = 7 r=0 <-- lsb
 7/2 = 3 r=1
 3/2 = 1 r=1
 1/2 = 0 r=1 <-- msb

14 (base 10) == 1110 (base 2)
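
The same repeated-division algorithm in C (a sketch; remainders come out
lsb first, so they are saved and printed in reverse):

    #include <stdio.h>

    void decimal_to_binary(unsigned value) {
        char bits[32];
        int n = 0;
        do {
            bits[n++] = '0' + (value % 2); /* remainder is the next bit */
            value /= 2;                    /* divide until the value is 0 */
        } while (value != 0);
        while (n > 0)
            putchar(bits[--n]);            /* msb was produced last */
        putchar('\n');
    }
    /* decimal_to_binary(36) prints 100100 */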

binary --> octal
1. group into 3's starting at least significant symbol
(if the number of bits is not evenly divisible by 3, then
add 0's at the most significant end)
2. write 1 octal digit for each group

example:

100 010 111 (binary)
  4   2   7 (octal)

 10 101 110 (binary)
  2   5   6 (octal)

binary --> hex
(just like binary to octal!)
1. group into 4's starting at least significant symbol
(if the number of bits is not evenly divisible by 4, then
add 0's at the most significant end)
2. write 1 hex digit for each group

example:

1001 1110 0111 0000 (binary)
   9    e    7    0 (hex)

   1 1111 1010 0011 (binary)
   1    f    a    3 (hex)

hex --> binary
(trivial!) just write down the 4 bit binary code for
each hexadecimal digit

example:

3 9 c 8
0011 1001 1100 1000
octal --> binary
(just like hex to binary!)
(trivial!) just write down the 3 bit binary code for
each octal digit

hex --> octal
do it in 2 steps, hex --> binary --> octal

decimal --> hex
do it in 2 steps, decimal --> binary --> hex

on representing nonintegers
---------------------------

what range of values is needed for calculations?
very large: Avogadro's number 6.022 x 10 ** 23 atoms/mole
mass of the earth 5.98 x 10 ** 24 kilograms
speed of light 3.0 x 10 ** 8 meters/sec
very small: charge on an electron -1.60 x 10 ** (-19) coulombs

scientific notation
a way of representing rational numbers using integers
(used commonly to represent nonintegers in computers)

number = mantissa x base ** exponent

mantissa == fraction == significand
base == radix
the point is really called a radix point; for a number with
a decimal base, it's called a decimal point.

all the constants given above are in scientific notation

normalization
to keep a unique form for every representable noninteger, they
are kept in NORMALIZED form. A normalized number will follow the
following rule:

1 <= mantissa < base

In this form, the radix point is always placed one place to
the right of the first significant symbol (as above).
on precision, accuracy, and significant digits

These terms are often used incorrectly or ignored. They are
important!

A measurement (in a scientific experiment) implies a certain
amount of error, depending on equipment used. Significant
digits tell about that error.
For example, a number given as
3.6 really implies that this number is in the range of
3.6 +- .05, which is 3.55 to 3.65
This is 2 significant digits.
3.60 really implies that this number is in the range of
3.6 +- .005, which is 3.595 to 3.605
This is 3 significant digits.

So, the number of significant digits given in a number tells
about how accurately the number is known. The larger the number
of significant digits, the better the accuracy.

Computers (or calculators, a more familiar machine) have a fixed
precision. No matter what accuracy of a number is known, they
give lots of digits in a number. They ignore how many significant
digits are involved.
For example, if you do the operation 1.2 x 2.2, given that
each number has 2 significant digits, a correct answer is

1.2
x 2.2
-----
24
+ 24
-----
264 --> 2.64 --> 2.6 or 2.6 +- .05

a calculator will most likely give an answer of 2.640000000,
which implies an accuracy much higher than possible. The
result given is just the highest precision that the calculator
has. It has no knowledge of accuracy -- only precision.

BINARY FRACTIONS
----------------

f f . . . f f f . f f f . . .
n-1 n-2 2 1 0 -1 -2 -3
|
|
binary point

The decimal value is calculated in the same way as for
non-fractional numbers; the exponents are now negative.
example:
101.001 (binary)
1 x 2**2 + 1 x 2**0 + 1 x 2**-3
4 + 1 + 1/8
5 1/8 = 5.125 (decimal)

2**-1 = .5
2**-2 = .25
2**-3 = .125
2**-4 = .0625 etc.

converting decimal to binary fractions

Consider left and right of the decimal point separately.
The stuff to the left can be converted to binary as before.
Use the following algorithm to convert the fraction:

fraction    fraction x 2    digit right of point

.8          1.6             1   <-- most significant (f-1)
.6          1.2             1
.2          0.4             0
.4          0.8             0
.8          (it must repeat from here!)

.8 is .1100 1100 1100 . . . (the pattern 1100 repeats)
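
The multiply-by-2 algorithm in C (a sketch; it prints at most a fixed
number of fraction bits, since the pattern may repeat forever):

    #include <stdio.h>

    void fraction_to_binary(double frac, int limit) { /* 0 <= frac < 1 */
        putchar('.');
        for (int i = 0; i < limit && frac != 0.0; i++) {
            frac *= 2.0;
            if (frac >= 1.0) {     /* digit right of the point is 1 */
                putchar('1');
                frac -= 1.0;
            } else {
                putchar('0');
            }
        }
        putchar('\n');
    }
    /* fraction_to_binary(0.8, 8) prints .11001100 */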

NON-BINARY FRACTIONS
--------------------

same as with binary, only the base changes!

f f . . . f f f . f f f . . .
n-1 n-2 2 1 0 -1 -2 -3
|
|
radix point

The decimal value is calculated in the same way as for
non-fractional numbers; the exponents are now negative.

example:
101.001 (octal)
1 x 8**2 + 1 x 8**0 + 1 x 8**-3
64 + 1 + 1/512
65 1/512 = 65.0019 (approx)

13.a6 (hexadecimal)
1 x 16**1 + 3 x 16**0 + a x 16**-1 + 6 x 16**-2
16 + 3 + 10/16 + 6/256
19 166/256 = 19.64 (approx)

CONVERSION WITH OTHER BASES
---------------------------

another base to decimal:
Just plug into the summation.

134 (base 5)
1 x 5**2 + 3 x 5**1 + 4 x 5**0
25 + 15 + 4
44 (base 10)

decimal to another base:
Keep dividing by the base, same algorithm.

100 (base 10) to base 5

rem
100/5 20 0
20/5 4 0
4/5 0 4

100 (base 10) = 400 (base 5)

Chapter 5 -- representations
ALL ABOUT INTEGER REPRESENTATION.
---------------------------------

Computers operate on binary values (as a result of being built from
transistors).

there are different binary representations for integers
possible qualifications:
1. positive numbers only
2. positive and negative numbers
3. ease of human readability
4. speed of computer operations

there are 4 commonly known (1 not common) integer representations.
All have been used at various times for various reasons.
1. unsigned
2. sign magnitude
3. one's complement
4. two's complement
5. biased (not commonly known)

virtually all modern computers operate based on 2's complement
representation. why?
1. hardware is faster
2. hardware is simpler (which makes it faster)

UNSIGNED
--------
the standard binary encoding already given

only positive values

range: 0 to 2**n - 1, for n bits

example: 4 bits, values 0 to 15
n=4, 2**4 - 1 is 15

binary  decimal  hex        binary  decimal  hex

0000    0        0          1000    8        8
0001    1        1          1001    9        9
0010    2        2          1010    10       a
0011    3        3          1011    11       b
0100    4        4          1100    12       c
0101    5        5          1101    13       d
0110    6        6          1110    14       e
0111    7        7          1111    15       f

SIGN MAGNITUDE
--------------

a human readable way of getting both positive and negative integers.
The hw that does arithmetic on sign magnitude integers
is not fast, and it is more complex than the hw that does arithmetic
on 1's comp. and 2's comp. integers.

use 1 bit of integer to represent the sign of the integer

let sign bit of 0 be positive,
1 be negative.

the rest of the integer is a magnitude, using same encoding
as unsigned integers

example: 4 bits
0101 is 5

1101 is -5

to get the additive inverse of a number, just flip
(not, invert, complement, negate) the sign bit.

range: -(2**(n-1)) + 1 to 2**(n-1) - 1
where n is the number of bits

4 bits, -7 to +7
n=4, - 2**3 + 1 to 2**3 - 1
-8 + 1 to 8 - 1

because of the sign bit, there are 2 representations for 0.
This is a problem for hardware. . .

0000 is 0, 1000 is 0
the computer must do all calculations such that they
come out correctly and the same whichever representation is used.

ONE's COMPLEMENT
----------------

historically important, and we use this representation to get
2's complement integers.

Now, nobody builds machines that are based on 1's comp. integers.
In the past, early computers built by Seymour Cray (while at CDC)
were based on 1's comp. integers.

positive integers use the same representation as unsigned.

00000 is 0
00111 is 7, etc.

negation (finding an additive inverse) is done by taking a bitwise
complement of the positive representation.

COMPLEMENT. INVERT. NOT. FLIP. NEGATE.
a logical operation done on a single bit
the complement of 1 is 0.
the complement of 0 is 1.

-1 --> take +1, 00001
complement each bit: 11110

that is -1.

don't add or take away any bits.

EXAMPLES: 11100 this must be a negative number.
to find out which, find the additive inverse!
00011 is +3 by sight,
so 11100 must be -3

things to notice: 1. any negative number will have a 1 in the MSB.
2. there are 2 representations for 0,
00000 and 11111.
TWO's COMPLEMENT
----------------

a variation on 1's complement that does NOT have 2 representations for
0. This makes the hardware that does arithmetic faster than for the
other representations.

a 3 bit example:
bit pattern:   100  101  110  111  000  001  010  011
1's comp:       -3   -2   -1    0    0    1    2    3
2's comp.:      -4   -3   -2   -1    0    1    2    3

the negative values are all "slid" down by one, eliminating the
extra zero representation.

how to get an integer in 2's comp. representation:

positive values are the same as for sign/mag and 1's comp.
they will have a 0 in the MSB (but it is NOT a sign bit!)

positive: just write down the value as before
negative:
take the positive value    00101  (+5)
take the 1's comp.         11010  (-5 in 1's comp)
add 1                     +    1
                          ------
                           11011  (-5 in 2's comp)

to get the additive inverse of a 2's comp integer,
1. take the 1's comp.
2. add 1

to add 1 without really knowing how to add:
start at LSB, for each bit (working right to left)
while the bit is a 1, change it to a 0.
when a 0 is encountered, change it to a 1 and stop.
the rest of the bits remain the same as they were.
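
In C, the machine's own negation shows the same identity (a small sketch;
~x is the 1's complement of x):

    #include <stdio.h>

    int main(void) {
        int x = 5;
        int neg = ~x + 1;          /* take the 1's comp., add 1 */
        printf("%d\n", neg);       /* prints -5 */
        printf("%d\n", neg == -x); /* prints 1 */
        return 0;
    }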

EXAMPLE:
what decimal value does the two's complement 110011 represent?

It must be a negative number, since the most significant bit
is a 1. Therefore, find the additive inverse:

110011 (2's comp. ?)

001100 (after taking the 1's complement)
+    1
------
001101 (2's comp. +13)

Therefore, its additive inverse (110011) must be -13.

A LITTLE BIT ON ADDING
----------------------
we'll see how to really do this in the next chapter, but here's
a brief overview.

carry in   a   b   sum   carry out

0          0   0   0     0
0          0   1   1     0
0          1   0   1     0
0          1   1   0     1
1          0   0   1     0
1          0   1   0     1
1          1   0   0     1
1          1   1   1     1

a 0011
+b +0001
-- -----
sum 0100

it's really just like we do for decimal!
0 + 0 = 0
1 + 0 = 1
1 + 1 = 2 which is 10 in binary, sum is 0 and carry the 1.
1 + 1 + 1 = 3 sum is 1, and carry a 1.

BIASED REPRESENTATION
---------------------

an integer representation that skews the bit patterns so as to
look just like unsigned but actually represent negative numbers.

examples: given 4 bits, we BIAS values by 2**3 (8)

TRUE VALUE to be represented     3
add in the bias                 +8
                               ----
unsigned value                  11

so the bit pattern of 3 in 4-bit biased-8 representation
will be 1011

going the other way, suppose we were given a
biased-8 representation as 0110
unsigned 0110 represents         6
subtract out the bias          - 8
                              ----
TRUE VALUE represented          -2

this representation allows operations on the biased numbers
to be the same as for unsigned integers, but actually represents
both positive and negative values.

choosing a bias:
the bias chosen is most often based on the number of bits
available for representing an integer. To get an approx.
equal distribution of true values above and below 0,
the bias should be 2 ** (n-1) or (2**(n-1)) - 1

SIGN EXTENSION
--------------
how to change an integer with a smaller number of bits into the
same integer (same representation) with a larger number of bits.

this must be done a lot by arithmetic units, so it is best to
go over it.

by representation:
unsigned: xxxxx --> 000xxxxx
copy the original integer into the LSBs, and put 0's elsewhere

sign magnitude: sxxxx --> s000xxxx
copy the original integer's magnitude into the LSBs,
put the original sign into the MSB, and put 0's elsewhere

1's and 2's complement: called SIGN EXTENSION.
copy the original integer into the LSBs,
take the MSB of original integer and copy it elsewhere.

example: 0010101
000 0010101

11110000
11111111 11110000
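
The same two extensions in C (a sketch; converting through a signed type
is how C spells sign extension):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint8_t small = 0xF0;    /* 11110000 */

        /* unsigned: copy into the LSBs, put 0's elsewhere */
        uint16_t zero_ext = (uint16_t)small;       /* 00000000 11110000 */

        /* 2's complement SIGN EXTENSION: copy the msb everywhere above */
        int16_t sign_ext = (int16_t)(int8_t)small; /* 11111111 11110000 */

        printf("%04x %04x\n", zero_ext, (uint16_t)sign_ext); /* 00f0 fff0 */
        return 0;
    }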

OVERFLOW
--------

sometimes a value cannot be represented in the limited number
of bits allowed. Examples:
unsigned, 3 bits: 8 would require at least 4 bits (1000)
sign mag., 4 bits: 8 would require at least 5 bits (01000)

when a value cannot be represented in the number of bits allowed,
we say that overflow has occurred. Overflow occurs when doing
arithmetic operations.

example: 3 bit unsigned representation

011 (3)
+ 110 (6)
---------
? (9) it would require 4 bits (1001) to represent
the value 9 in unsigned rep.

CHARACTER REPRESENTATION
------------------------

everything represented by a computer is represented by binary sequences.

A common non-integer that needs to be represented is characters.
We use standard encodings (binary sequences) to represent characters.

REMEMBER: bit patterns do NOT imply a representation

Many I/O devices work with 8 bit quantities.
A standard code, ASCII (American Standard Code for Information
Interchange), defines what character is represented by each sequence.

examples:
0100 0001 is 41 (hex) or 65 (decimal). It represents `A'
0100 0010 is 42 (hex) or 66 (decimal). It represents `B'

Different bit patterns are used for each different character
that needs to be represented.

The code has some nice properties. If the bit patterns are compared,
(pretending they represent integers), then `A' < `B'
This is good, because it helps with sorting things into alphabetical
order.

Notes: `a' (61 hex) is different than `A' (41 hex)
`8' (38 hex) is different than the integer 8

the digits:
`0' is 48 (decimal) or 30 (hex)
`9' is 57 (decimal) or 39 (hex)

Because of this, code must deal with both characters and integers
correctly.

Consider the following example:

in1 db ?
result db ?
get_ch in1
moveb result, in1
iadd result, in1
put_ch result

suppose the user types `3'
result <- 51 + 51 = 102 (decimal)

put_ch prints out `f', since the ASCII code for 102(decimal) is `f'

What if we wanted the value 6 to be printed?

What we need is an algorithm for translating character strings
to the integers they represent, and vice versa.

Then, do the arithmetic on integers, and print out the integer
result.

ALGORITHM: character string --> integer
the steps:

for `3' `5' `4'

read `3'
translate `3' to 3

read `5'
translate `5' to 5
integer = 3 * 10 + 5 = 35

read `4'
translate `4' to 4
integer = 35 * 10 + 4 = 354

the algorithm:

integer = 0
while there are more characters
get character
digit <- character - 48
integer <- integer * 10 + digit
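
The same algorithm in C (a sketch; getchar stands in for get_ch, and the
loop stops at the first non-digit character):

    #include <stdio.h>

    int main(void) {
        int integer = 0;
        int c;
        while ((c = getchar()) >= '0' && c <= '9') { /* more characters */
            int digit = c - 48;             /* `3' (51) - 48 gives 3 */
            integer = integer * 10 + digit;
        }
        printf("%d\n", integer);            /* typing 354 prints 354 */
        return 0;
    }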

In the real world, we also have the equivalent problem, only
dealing with printing out integers.

ALGORITHM: integer --> character string
the steps:
for 354 (an integer), figure out how many characters there are (3)
354 div 100 gives 3
translate 3 to `3' and print it out
354 mod 100 gives 54

54 div 10 gives 5
translate 5 to `5' and print it out
54 mod 10 gives 4

4 div 1 gives 4
translate 4 to `4' and print it out
4 mod 1 gives 0, so you're done

NOTE: These algorithms work the same in bases other than 10.
Just change the 10 to the desired base. (There is some extra
work in bases > 10, since have to translate the digits > 9
into the characters. For example, 12 (decimal) is c (hex).)
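
The div/mod steps in C (a sketch for nonnegative values; putchar stands in
for put_ch):

    #include <stdio.h>

    void put_integer(int value) {
        int divisor = 1;
        while (value / divisor >= 10)    /* find the top digit's place */
            divisor *= 10;
        while (divisor > 0) {
            int digit = value / divisor; /* 354 div 100 gives 3 */
            putchar('0' + digit);        /* translate 3 to `3' */
            value %= divisor;            /* 354 mod 100 gives 54 */
            divisor /= 10;
        }
    }
    /* put_integer(354) prints 354 */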

REPRESENTATION OF FLOATING POINT NUMBERS
----------------------------------------

Computers represent real values in a form similar to that of
scientific notation. There are standards which define what the
representation means so that across computers there will be consistency.

Note that this is not the only way to represent floating point
numbers, it is just the IEEE standard way of doing it.

Here's what we do:

the representation

-------------------
| S | E | F |
-------------------

S is one bit representing the sign of the number
E is an 8 bit biased integer representing the exponent
F is an unsigned integer

the value represented is:

(-1)**S x f x 2**e

where
e = E - bias
f = F/(2**n) + 1

for single precision numbers (the emphasis in this class)
n = 23
bias = 127
Now, what does all this mean? (Everything here is for
IEEE single precision floating point representation.)
--> S, E, F all represent fields within a representation. Each
is just a bunch of bits.

--> S is just a sign bit. 0 for positive, 1 for negative.

--> E is an exponent field. The E field is a biased-127 representation.
So, the true exponent represented is (E - bias). The radix for
the number is ALWAYS 2.

Note: Computers that did not use this representation, like those
built before the standard, did not always use a radix of 2.
Example: some IBM machines had radix of 16.

--> F is the mantissa. It is in a somewhat modified form. There are
23 bits available for the mantissa. It turns out that if fl. pt.
numbers are always stored in a normalized form, then the leading
bit (the one on the left, or MSB) is always a 1. So, why store
it at all? It gets put back into the number (giving 24 bits
of precision for the mantissa) for any calculation, but we only have
to store 23 bits.

This MSB is called the HIDDEN BIT.

An example: put the decimal number 64.2 into the IEEE single
precision representation.

first step:
get a binary representation for 64.2
to do this, get binary reps. for the stuff to the left
and right of the decimal point separately.

64 is 1000000

.2 can be gotten using the algorithm:

.2 x 2 = 0.4 0 (msb)
.4 x 2 = 0.8 0
.8 x 2 = 1.6 1
.6 x 2 = 1.2 1

.2 x 2 = 0.4 0 now this whole pattern (0011) repeats.


.4 x 2 = 0.8 0
.8 x 2 = 1.6 1
.6 x 2 = 1.2 1

so a binary representation for .2 is .001100110011. . .

Putting the halves back together again:
64.2 is 1000000.0011001100110011. . .
second step:
normalize the binary representation. (make it look like
scientific notation)

1.000000 00110011. . . x 2 ** 6

third step:
6 is the true exponent. For the standard form, it needs to
be in biased-127 form.

6
+ 127
-----
133 (value of biased exponent)

133 in 8 bit, unsigned representation is 1000 0101

this is bit pattern used for E in the standard form.

fourth step:
the mantissa stored (F) is the stuff to the right of the radix point
in the normal form. We need 23 bits of it.

000000 00110011001100110

put it all together (and include the correct sign bit):

S E F
0 10000101 00000000110011001100110

the values are often given in hex, so here it is

0100 0010 1000 0000 0110 0110 0110 0110

0x 4 2 8 0 6 6 6 6

or

42806666h
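
You can check this worked example with C (a sketch; memcpy reinterprets
the 32 bits of the float without converting the value):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float f = 64.2f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        printf("%08x\n", (unsigned)bits);  /* prints 42806666 */

        unsigned S = bits >> 31;
        unsigned E = (bits >> 23) & 0xff;  /* biased exponent */
        unsigned F = bits & 0x7fffff;      /* mantissa, hidden bit removed */
        printf("S=%u E=%u (true exp %d) F=%06x\n", S, E, (int)E - 127, F);
        return 0;
    }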

Some extra details:

--> Since floating point numbers are always stored in normal
form, how do we represent 0?

(What does the fl. pt. number 0x3f80 0000 represent?)
(What does the fl. pt. number 0x0080 0000 represent?)

We take the bit patterns 0x0000 0000 and 0x8000 0000
to represent the value 0.

(What fl. pt. numbers cannot be represented because of this?)


--> Other special values:

+infinity           0 11111111 00000...  (0x7f80 0000)
-infinity           1 11111111 00000...  (0xff80 0000)
NaN (Not a Number)  ? 11111111 ?????...
(S is either 0 or 1, E=0xff, and F is anything but all zeros)

Chapter 6 -- integer arithmetic


all about integer arithmetic.
-------------------------------------

operations we'll get to know (and love):
addition
subtraction
multiplication
division
logical operations (not, and, or, nand, nor, xor, xnor)
shifting

the rules for doing the arithmetic operations vary depending
on what representation is implied.

A LITTLE BIT ON ADDING
----------------------
an overview.

carry in   a   b  |  sum   carry out
------------------+-----------------
0          0   0  |  0     0
0          0   1  |  1     0
0          1   0  |  1     0
0          1   1  |  0     1
1          0   0  |  1     0
1          0   1  |  0     1
1          1   0  |  0     1
1          1   1  |  1     1

a 0011
+b +0001
-- -----
sum 0100

it's really just like we do for decimal!
0 + 0 = 0
1 + 0 = 1
1 + 1 = 2 which is 10 in binary, sum is 0 and carry the 1.
1 + 1 + 1 = 3 sum is 1, and carry a 1.

ADDITION
--------

unsigned:
just like the simple addition given.

examples:

 100001       00001010  (10)
+011101      +00001110  (14)
-------      ---------
 111110       00011000  (24)

ignore (throw away) carry out of the msb.
Why? Because computers ALWAYS work with a fixed precision.

sign magnitude:

rules:
- add magnitudes only (do not carry into the sign bit)
- throw away any carry out of the msb of the magnitude
(Due, again, to the fixed precision constraints.)
- add only integers of like sign ( + to + or - to -)
- sign of result is same as sign of the addends

examples:

  0 0101  (5)        1 1010  (-10)
+ 0 0011  (3)      + 1 0011   (-3)
---------          ---------
  0 1000  (8)        1 1101  (-13)

0 01011 (11)
+ 1 01110 (-14)
----------------
don't add! must do subtraction!

one's complement:

by example

  00111  (7)       111110 (-1)           11110 (-1)
+ 00101  (5)     + 000010  (2)         + 11100 (-3)
-----------      ------------          ------------
  01100 (12)    1 000000  (0) wrong!  1 11010 (-5) wrong!
                +       1             +     1
                ---------             -------
                  000001  (1) right!    11011 (-4) right!

so it seems that if there is a carry out (of 1) from the msb, then
the result will be off by 1, so add 1 again to get the correct
result. (Implementation in HW called an "end around carry.")

two's complement:

rules:
- just add all the bits
- throw away any carry out of the msb
- (same as for unsigned!)

examples

  00011   (3)      101000        111111 (-1)
+ 11100  (-4)    + 010000      + 001000  (8)
------------     --------      ---------
  11111  (-1)     111000      1 000111  (7)

after seeing examples for all these representations, you may see
why 2's complement addition requires simpler hardware than
sign mag. or one's complement addition.

SUBTRACTION
-----------
general rules:
1 - 1 = 0
0 - 0 = 0
1 - 0 = 1
10 - 1 = 1
0 - 1 = borrow!

unsigned:

- it only makes sense to subtract a smaller number from a larger one

examples

  1011 (11)    must borrow
- 0111  (7)
-----------
  0100  (4)

sign magnitude:

- if the signs are the same, then do subtraction
- if signs are different, then change the problem to addition
- compare magnitudes, then subtract smaller from larger
- if the order is switched, then switch the sign too.

- when the integers are of the opposite sign, then do
a - b becomes a + (-b)
a + b becomes a - (-b)

examples

  0 00111   (7)        1 11000  (-24)
- 0 11000  (24)      - 1 00010   (-2)
---------------      ----------------
                       1 10110  (-22)
do
  0 11000  (24)
- 0 00111   (7)
---------------
  1 10001  (-17)
(switch sign since the order of the subtraction was reversed)

one's complement:

figure it out on your own

two's complement:

- change the problem to addition!

a - b becomes a + (-b)

- so, get the additive inverse of b, and do addition.

examples

10110 (-10)
- 00011 (3) --> 00011
------------ |
\|/
11100
+ 1
-------
11101 (-3)
so do

10110 (-10)
+ 11101 (-3)
------------
1 10011 (-13)
(throw away carry out)

OVERFLOW DETECTION IN ADDITION

unsigned -- when there is a carry out of the msb

1000 (8)
+1001 (9)
-----
1 0001 (1)

sign magnitude -- when there is a carry out of the msb of the magnitude

1 1000 (-8)
+ 1 1001 (-9)
-----
1 0001 (-1) (carry out of msb of magnitude)

2's complement -- when the signs of the addends are the same, and the
sign of the result is different

0011 (3)
+ 0110 (6)
----------
1001 (-7) (note that a correct answer would be 9, but
9 cannot be represented in 4 bit 2's complement)

a detail -- you will never get overflow when adding 2 numbers of
opposite signs
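
The 2's complement rule in C (a sketch using 8-bit values; out-of-range
conversion to int8_t is assumed to wrap, as it does on common compilers):

    #include <stdio.h>
    #include <stdint.h>

    /* overflow: addends have the same sign and the sum's sign differs */
    int add_overflows(int8_t a, int8_t b) {
        int8_t sum = (int8_t)(a + b);    /* fixed 8-bit precision */
        return (a >= 0) == (b >= 0) && (sum >= 0) != (a >= 0);
    }

    int main(void) {
        printf("%d\n", add_overflows(3, 6));     /* 0: 9 fits in 8 bits */
        printf("%d\n", add_overflows(100, 29));  /* 1: 129 > 127 */
        printf("%d\n", add_overflows(100, -29)); /* 0: opposite signs */
        return 0;
    }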

OVERFLOW DETECTION IN SUBTRACTION

unsigned -- never
sign magnitude -- never happens when doing subtraction
2's complement -- we never do subtraction, so use the addition rule
on the addition operation done.

MULTIPLICATION of integers

0 x 0 = 0
0 x 1 = 0
1 x 0 = 0
1 x 1 = 1
-- longhand, it looks just like decimal

-- the result can require 2x as many bits as the larger multiplicand

-- in 2's complement, to always get the right answer without
thinking about the problem, sign extend both multiplicands to
2x as many bits (as the larger). Then take the correct number
of result bits from the least significant portion of the result.

2's complement example:

      1111 1111        -1
    x 1111 1001      x -7
    -----------      ----
                        7
            11111111
           00000000
          00000000
         11111111
        11111111
       11111111
      11111111
   + 11111111
   -----------------
    1111100000000111

take the least significant 8 bits: 00000111 = 7 (the correct answer)

   0011  (3)              0000 0011   (3)
 x 1011 (-5)            x 1111 1011  (-5)
 ------                 -----------
   0011                    00000011
  0011                    00000011
 0000                    00000000
+0011                   00000011
-------                00000011
0100001               00000011
not -15 in any       00000011
representation!   + 00000011
                  ------------------
                      1011110001

take the least significant 8 bits   11110001 = -15

DIVISION of integers
unsigned only!

example of 15 / 3 1111 / 011

To do this longhand, use the same algorithm as for decimal integers.


LOGICAL OPERATIONS
done bitwise

X = 0011
Y = 1010

X AND Y is 0010
X OR Y is 1011
X NOR Y is 0100
X XOR Y is 1001
etc.

SHIFT OPERATIONS
a way of moving bits around within a word

3 types: logical, arithmetic and rotate
(each type can go either left or right)

logical left - move bits to the left, same order
- throw away the bit that pops off the msb
- introduce a 0 into the lsb

00110101
01101010 (logically left shifted by 1 bit)

logical right - move bits to the right, same order
- throw away the bit that pops off the lsb
- introduce a 0 into the msb

00110101
00011010 (logically right shifted by 1 bit)

arithmetic left - move bits to the left, same order
- throw away the bit that pops off the msb
- introduce a 0 into the lsb
- SAME AS LOGICAL LEFT SHIFT!

arithmetic right - move bits to the right, same order
- throw away the bit that pops off the lsb
- reproduce the original msb into the new msb
- another way of thinking about it: shift the
bits, and then do sign extension

00110101 -> 00011010 (arithmetically right shifted by 1 bit)
1100 -> 1110 (arithmetically right shifted by 1 bit)


rotate left - move bits to the left, same order
- put the bit that pops off the msb into the lsb,
so no bits are thrown away or lost.

00110101 -> 01101010 (rotated left by 1 place)
1100 -> 1001 (rotated left by 1 place)

rotate right - move bits to the right, same order
- put the bit that pops off the lsb into the msb,
so no bits are thrown away or lost.

00110101 -> 10011010 (rotated right by 1 place)
1100 -> 0110 (rotated right by 1 place)
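
The shifts in C, on the example pattern 00110101 (a sketch; the arithmetic
right shift of a signed value behaves this way on common compilers, though
strict C leaves it implementation-defined):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint8_t x = 0x35;                 /* 00110101 */

        uint8_t lls = (uint8_t)(x << 1);  /* logical left:  01101010 */
        uint8_t lrs = x >> 1;             /* logical right: 00011010 */

        /* arithmetic right: reproduce the original msb in the new msb */
        uint8_t ars = (uint8_t)((int8_t)0xC0 >> 1); /* 11000000 -> 11100000 */

        /* rotate right: the bit off the lsb goes into the msb */
        uint8_t ror = (uint8_t)((x >> 1) | (x << 7)); /* 10011010 */

        printf("%02x %02x %02x %02x\n", lls, lrs, ars, ror);
        return 0;
    }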

SASM INSTRUCTIONS FOR LOGICAL AND SHIFT OPERATIONS

SASM has instructions that do bitwise logical operations and
shifting operations.

lnot x       x <- NOT (x)
land x, y    x <- (x) AND (y)
lor x, y     x <- (x) OR (y)
lxor x, y    x <- (x) XOR (y)

llsh x       x <- (x), logically left shifted by 1 bit
rlsh x       x <- (x), logically right shifted by 1 bit
rash x       x <- (x), arithmetically right shifted by 1 bit
rror x       x <- (x), rotated right by 1 bit

Chapter 7 -- floating point arithmetic

about FLOATING POINT ARITHMETIC
-------------------------------

arithmetic operations on floating point numbers consist of
addition, subtraction, multiplication and division

the operations are done with algorithms similar to those used
on sign magnitude integers (because of the similarity of
representation) -- example, only add numbers of the same
sign. If the numbers are of opposite sign, must do subtraction.

ADDITION

example on decimal value given in scientific notation:

  3.25 x 10 ** 3
+ 2.63 x 10 ** -1
-----------------

first step: align decimal points
second step: add

3.25 x 10 ** 3
+ 0.000263 x 10 ** 3
--------------------
3.250263 x 10 ** 3
(presumes use of infinite precision, without regard for accuracy)

third step: normalize the result (already normalized!)

example on fl pt. value given in binary:

.25 = 0 01111101 00000000000000000000000

100 = 0 10000101 10010000000000000000000

to add these fl. pt. representations,
step 1: align radix points

shifting the mantissa LEFT by 1 bit DECREASES THE EXPONENT by 1
shifting the mantissa RIGHT by 1 bit INCREASES THE EXPONENT by 1

we want to shift the mantissa right, because the bits that
fall off the end should come from the least significant end
of the mantissa

-> choose to shift the .25, since we want to increase its exponent.
-> shift by   10000101
             -01111101
             ---------
              00001000  (8) places.

0 01111101 00000000000000000000000 (original value)
0 01111110 10000000000000000000000 (shifted 1 place)
(note that hidden bit is shifted into msb of mantissa)
0 01111111 01000000000000000000000 (shifted 2 places)
0 10000000 00100000000000000000000 (shifted 3 places)
0 10000001 00010000000000000000000 (shifted 4 places)
0 10000010 00001000000000000000000 (shifted 5 places)
0 10000011 00000100000000000000000 (shifted 6 places)
0 10000100 00000010000000000000000 (shifted 7 places)
0 10000101 00000001000000000000000 (shifted 8 places)

step 2: add (don't forget the hidden bit for the 100)
0 10000101 1.10010000000000000000000 (100)
+ 0 10000101 0.00000001000000000000000 (.25)
---------------------------------------
0 10000101 1.10010001000000000000000

step 3: normalize the result (get the "hidden bit" to be a 1)

it already is for this example.

result is
0 10000101 10010001000000000000000

SUBTRACTION

like addition as far as alignment of radix points

then the algorithm for subtraction of sign mag. numbers takes over.

before subtracting,
compare magnitudes (don't forget the hidden bit!)
change sign bit if order of operands is changed.

don't forget to normalize number afterward.

MULTIPLICATION

example on decimal values given in scientific notation:

  3.0 x 10 ** 1
x 0.5 x 10 ** 2
---------------

algorithm: multiply mantissas
           add exponents

  3.0 x 10 ** 1
x 0.5 x 10 ** 2
---------------
 1.50 x 10 ** 3

example in binary: use a mantissa that is only 4 bits so that
I don't spend all day just doing the multiplication
part.

0 10000100 0100
x 1 00111100 1100
-----------------
mantissa multiplication: 1.0100
(don't forget hidden bit) x 1.1100
------
00000
00000
10100
10100
10100
---------
1000110000
becomes 10.00110000

add exponents: always add true exponents
(otherwise the bias gets added in twice)

biased:
10000100
+ 00111100
----------

10000100 01111111 (switch the order of the subtraction,


- 01111111 - 00111100 so that we can get a negative value)
---------- ----------
00000101 01000011
true exp true exp
is 5. is -67

add true exponents 5 + (-67) is -62.

re-bias exponent: -62 + 127 is 65.
unsigned representation for 65 is 01000001.

put the result back together (and add sign bit).

1 01000001 10.00110000

normalize the result:
(moving the radix point one place to the left increases
the exponent by 1.)

1 01000001 10.00110000
becomes
1 01000010 1.000110000

this is the value stored (without the hidden bit!):
1 01000010 000110000
DIVISION

similar to multiplication.

true division:
do unsigned division on the mantissas (don't forget the hidden bit)
subtract TRUE exponents

The IEEE standard is very specific about how all this is done.
Unfortunately, the hardware to do all this is pretty slow.

Some comparisons of approximate times:

2's complement integer add    1 time unit
fl. pt. add                   4 time units
fl. pt. multiply              6 time units
fl. pt. divide               13 time units

There is a faster way to do division. It's called
division by reciprocal approximation. It takes about the same
time as a fl. pt. multiply. Unfortunately, the results are
not always the same as with true division.

Division by reciprocal approximation:

instead of doing a / b

they do a x 1/b.

figure out a reciprocal for b, and then use the fl. pt.
multiplication hardware.

example of a result that isn't the same as with true division.

true division: 3/3 = 1 (exactly)

reciprocal approx: 1/3 = .33333333

3 x .33333333 = .99999999, not 1

It is not always possible to get a perfectly accurate reciprocal.

ISSUES in floating point

note: this discussion only touches the surface of some issues that
people deal with. Entire courses could probably be taught on each
of the issues.

rounding
--------
arithmetic operations on fl. pt. values compute results that cannot
be represented in the given amount of precision. So, we must round
results.

There are MANY ways of rounding. They each have "correct" uses, and
exist for different reasons. The goal in a computation is to have the
computer round such that the end result is as "correct" as possible.
There are even arguments as to what is really correct.

3 methods of rounding:
round toward 0 -- also called truncation.
figure out how many bits (digits) are available. Take that many
bits (digits) for the result and throw away the rest.
This has the effect of making the value represented closer
to 0.

example:
.7783 if 3 decimal places available, .778
if 2 decimal places available, .77

round toward + infinity --
regardless of the value, round towards +infinity.

example:
1.23 if 2 decimal places, 1.3
-2.86 if 2 decimal places, -2.8

round toward - infinity --
regardless of the value, round towards -infinity.

example:
1.23 if 2 decimal places, 1.2
-2.86 if 2 decimal places, -2.9

in binary -- rounding to 2 digits after radix point
----------------------------------------------------
round toward + infinity --

1.1101
|
1.11 | 10.00
------

1.001
|
1.00 | 1.01
-----

round toward - infinity --

1.1101
|
1.11 | 10.00
------

1.001
|
1.00 | 1.01
-----
round toward zero (TRUNCATE) --

1.1101
|
1.11 | 10.00
------

1.001
|
1.00 | 1.01
-----

-1.1101
|
-10.00 | -1.11
------

-1.001
|
-1.01 | -1.00
-----

round toward nearest --
if there is anything other than 1000... to the right
of the number of digits to be kept, round to the closer result.
ODD CASE: if what remains is exactly 1000... (a tie), the IEEE
standard rounds such that the least significant bit (to be kept)
is a zero.

1.1111
|
1.11 | 10.00
------

1.1101
|
1.11 | 10.00
------

1.001 (ODD CASE)


|
1.00 | 1.01
-----
-1.1101 (1/4 of the way between)
|
-10.00 | -1.11
------

-1.001 (ODD CASE)


|
-1.01 | -1.00
-----

NOTE: this is a bit different than the "round to nearest" algorithm
(for the "tie" case, .5) learned in elementary school for decimal numbers.

use of standards
----------------
--> allows all machines following the standard to exchange data
and to calculate the exact same results.

--> IEEE fl. pt. standard sets
parameters of data representation (# bits for mantissa vs. exponent)

--> Pentium architecture follows the standard

overflow and underflow
----------------------
Just as with integer arithmetic, floating point arithmetic operations
can cause overflow. Detection of overflow in fl. pt. comes by checking
exponents before/during normalization.

Once overflow has occurred, an infinity value can be represented and
propagated through a calculation.

Underflow occurs in fl. pt. representations when a number is
too small (close to 0) to be represented. (show number line!)

if a fl. pt. value cannot be normalized
(getting a 1 just to the left of the radix point would cause
the exponent field to be all 0's)
then underflow occurs.

HW vs. SW computing
-------------------
floating point operations can be done by hardware (circuitry)
or by software (program code).

-> a programmer won't know which is occurring, without prior knowledge
of the HW.

-> SW is much slower than HW, by approx. 1000 times.

A difficult (but good) exercise for students would be to design
a SW algorithm for doing fl. pt. addition using only integer
operations.

SW to do fl. pt. operations is tedious. It takes lots of shifting
and masking to get the data in the right form to use integer arithmetic
operations to get a result -- and then more shifting and masking to put
the number back into fl. pt. format.

A common thing that manufacturers used to do is to offer 2 versions of the
same architecture, one with HW, and the other with SW fl. pt. ops.

Chapter 8 -- data structures

DATA STRUCTURES
---------------

common theme in programming:
SPACE vs. TIME tradeoff
space is memory space
time is time to execute program

it is often possible to write a program such that it
1. executes very fast, but wastes/utilizes more memory
or
2. utilizes little memory, but executes quite slowly

data structures can make memory usage efficient or inefficient

data structures we will discuss:
arrays, stacks, and queues (in that order)

ARRAYS
------

array implementation is important
1. most assembly languages have no concept of arrays
2. from an array, any other data structure we might want can be built

Properties of arrays:
1. each element is the same size (char = 1 byte, integer = 4 bytes)
2. elements are stored contiguously, with the first element
stored at the smallest memory address

so, the whole trick in assembly language is
1. allocate correct amount of space for an array
2. an ADDRESS tells the location of an element
3. USE address to get at array element

memory can be thought of as an array --
a giant array of bits or bytes or words (picture needed here!)
the element numbering starts at 0
the element number is an address

if each byte has a unique address, we have BYTE ADDRESSING.
(the Pentium has this, as do many machines)

if each doubleword has a unique address, we have WORD ADDRESSING.

SASM access of the memory array

we can access any element (byte) of memory

m(address) refers to the element (contents) at address

m(20) is the byte (or doubleword) at address 20,
the 21st byte of memory.

important: 20 is the address
           m(20) is the contents at address 20

SASM declarations of arrays within memory

to allocate a portion of memory (more than a single variable's worth)
(allocating an array within memory):

variablename type numelements dup(initvalue)

type is just like before -- db or dd
numelements is just that, and numbering ALWAYS starts at 0
initvalue is a value given to each element of the array

example:

my_array db 8 dup(0)
8 character elements, numbered 0 - 7, initialized to 0

an example of how to calculate the address of an element:

byte (character) elements --

array1: array[6..12] of char; /* PASCAL */

6 7 8 9 10 11 12 <---- element index


-----------------------------
| | | | |XXX| | |
| | | | |XXX| | |
-----------------------------
25 26 27 28 29 30 31 <---- address

byte address of array1[10] = 25 + (10 - 6)
                           = 29

this same example (or close) only in SASM:

array1 db 7 dup(0)

0 1 2 3 4 5 6
-----------------------------
| | | | |XXX| | |
| | | | |XXX| | |
-----------------------------
25 26 27 28 29 30 31 <---- address

want the 5th element, array1[4]; it is at address array1 + 4

If element[0] is at address 25,
byte address of array1[4] = 25 + 4

how do you get the address array1?
SASM la (load address) instruction

la ar_addr, array1

takes the address array1 (25 in the example) and puts it
into the variable called ar_addr.

this is where it is extremely important to understand and keep
clear the difference between an address and the contents of
an address.

to reference array1[4] in SASM, write the code,

la my_addr, array1
iadd my_addr, 4

; then if we wanted to decrement element number 4
isub m(my_addr), 1

remember, my_addr is an address (declared as dd)
m(my_addr) is the byte at the address my_addr

integer elements --
array2: array[0..5] of integer; /* PASCAL */

in SASM,
array2 dd 6 dup(0)

0 1 2 3 4 5 <-- index
-------------------------
| | | |XXX| | |
| | | |XXX| | |
-------------------------
80 84 88 92 96 100 <-- memory address

byte address of array2[3] = 80 + 4(3 - 0)
                          = 92

SO, we need to know
1. where the array starts (called base address)
2. size of an element in bytes (to get a byte address)
3. what the first element is numbered

byte address of element[x] = base + size(x - first index)

Note: if we always start the index numbering at 0, then
this formula becomes

byte address of element[x] = base + size(x)
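
The formula translates directly into code. A minimal C sketch
(the function name is made up):

    /* byte address of element[x] = base + size * (x - first_index) */
    unsigned element_addr(unsigned base, int size, int first_index, int x)
    {
        return base + (unsigned)(size * (x - first_index));
    }

    /* element_addr(25, 1, 6, 10) == 29   the Pascal char example   */
    /* element_addr(80, 4, 0,  3) == 92   the integer array example */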

2 DIMENSIONAL ARRAYS
--------------------

There are more issues here than for 1 dimensional arrays.

First, how to map a 2 dimensional array onto a 1 dimensional memory?

TERMINOLOGY:

r x c array -- r rows
c columns

element[y, x] -- y is row number
                 x is column number

example: 4 x 2 array

mapping this 4 x 2 array into memory.
2 possibilities:

row major order:


rows are all together

| |
-------
| 0,0 |
-------
| 0,1 |
-------
| 1,0 |
-------
| 1,1 |
-------
| 2,0 |
-------
| 2,1 |
-------
| 3,0 |
-------
| 3,1 |
-------
| |

column major order:

columns are all together

| |
-------
| 0,0 |
-------
| 1,0 |
-------
| 2,0 |
-------
| 3,0 |
-------
| 0,1 |
-------
| 1,1 |
-------
| 2,1 |
-------
| 3,1 |
-------
| |

2-D Arrays
----------

The goal: come up with a formula for calculating the address of
an element of a 2-D array.

Row Major:

addr. of [y, x] = base + offset to correct row + offset within row

                = base + (size)(# columns)(y - first_row)
                       + (size)(x - first_col)

Column Major:

addr. of [y, x] = base + offset to correct column + offset within column

                = base + (size)(# rows)(x - first_col)
                       + (size)(y - first_row)

Need to know:
1. row/column major (storage order)
2. base address
3. size of elements
4. dimensions of the array

HINTS toward getting this correct:
Draw pictures.
Don't forget to account for size.
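
A minimal C sketch of both formulas (the names are mine; it assumes
the index numbering conventions above):

    /* row major: offset to correct row, then offset within row */
    unsigned addr_row_major(unsigned base, int size, int ncols,
                            int first_row, int first_col, int y, int x)
    {
        return base + (unsigned)(size * (ncols * (y - first_row)
                                         + (x - first_col)));
    }

    /* column major: offset to correct column, then offset within it */
    unsigned addr_col_major(unsigned base, int size, int nrows,
                            int first_row, int first_col, int y, int x)
    {
        return base + (unsigned)(size * (nrows * (x - first_col)
                                         + (y - first_row)));
    }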

BOUNDS CHECKING
---------------

Many HLL's offer some form of bounds checking. Your program crashes,
or you get an error message, if an array index is out of bounds.

Pascal example: x: array[1..6] of integer;
code. . .

y := x[8];

Assembly languages offer no implied bounds checking.
After all, if your program calculates an address of an element,
and then loads that element (by the use of the address), there is
no checking to see that the address calculated was actually within
the array!

example (to motivate some thought as to how to do bounds checking):

given -- a 5 x 3 array
byte size elements
row major order
first_row = 1
first_col = 1

what is the address of element[2, 5]

a program probably just plugs the numbers into the formula:

addr of [2, 5] = base + 1(2 - 1)(3) + 1(5 - 1)

               = base + 7

this actually gives the address of element [3, 2],
still a valid element of the array, but not what was really
required. . .
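
The fix is to test each index against its declared bounds BEFORE
forming the address. A C sketch for this 5 x 3 example:

    /* 5 x 3 array, rows numbered 1..5, columns numbered 1..3 */
    int in_bounds(int y, int x)
    {
        return y >= 1 && y <= 5 && x >= 1 && x <= 3;
    }

    /* in_bounds(2, 5) returns 0 -- the access must be rejected,
       even though the formula produces an address inside the array. */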

STACKS
------

We often need a data structure that stores data in the reverse
order that it is used. Along with this is the concept that the
data is not known until the program is executed (RUN TIME).
A STACK allows both properties.

Abstractly, here is a stack. Analogy to stack of dishes.
Also dubbed Last In First Out, LIFO.

| |
| |
|-------|
| |
|-------|
| |
|-------|
| |
|-------|
| |
-------

Data put into the stack is said to be PUSHED onto the stack.
Data taken out of the stack is said to be POPPED off the stack.

Algorithm Example:
printing out a positive integer, character by character
(integer to character string conversion)
integer = 1024

if integer == 0 then
push '0'
else
while integer <> 0
digit <- integer mod base
char <- digit + 48
push char onto stack
integer <- integer div base

while stack is not empty


pop char
put char
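
Here is the same algorithm as runnable C (base 10, using an array
as the stack of characters):

    #include <stdio.h>

    /* print a non-negative integer, character by character */
    void print_uint(unsigned integer)
    {
        char stack[16];
        int  sp = 0;                  /* index of next empty slot */

        if (integer == 0)
            stack[sp++] = '0';        /* push '0'                 */
        while (integer != 0) {
            unsigned digit = integer % 10;     /* integer mod base */
            stack[sp++] = (char)(digit + 48);  /* push char        */
            integer = integer / 10;            /* integer div base */
        }
        while (sp != 0)               /* while stack is not empty */
            putchar(stack[--sp]);     /* pop char, put char       */
    }

    int main(void)
    {
        print_uint(1024);             /* prints 1024 */
        putchar('\n');
        return 0;
    }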

IMPLEMENTATION OF A STACK
-------------------------
One implementation of a stack out of an array.

Need to know:
index of TOP OF STACK (tos), often called a stack pointer (sp).
In most implementations, tos is not an index at all, it is
an ADDRESS (pointer).

(initial state)
sp
|
\ /
-----------------------------
| | | | | | | |
-----------------------------

sp is a variable that contains the address of the empty location


at the top of the stack.

for an array declared (in SASM) as

my_stack dd 50 dup(0)
my_sp dd ?

OR (could be, since amount of space is important)

my_stack db 200 dup(0)
my_sp dd ? ; doesn't change in these declarations

a PUSH operation:

move m(my_sp), data
iadd my_sp, 4

a POP operation:

isub my_sp, 4
move data, m(my_sp)

A stack could instead be implemented such that the stack pointer
points to a FULL location at the top of the stack.

(initial state)
sp
|
\ /
-----------------------------
| | | | | | | |
-----------------------------

a PUSH operation:

iadd my_sp, 4
move m(my_sp), data

a POP operation:

move data, m(my_sp)
isub my_sp, 4

ANOTHER ALTERNATIVE:
The stack could "grow" from the end of the array towards
the beginning. (Note that which end of the array the stack
grows toward is independent of what stack pointer points to.)

For the student to figure out:
How do you initialize the stack pointer?
How do you know when there are no more items in the stack?
How do you know when the stack is full?
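
One possible set of answers, sketched in C for the first variant
above (the pointer marks the empty slot at the top, and the stack
grows toward larger addresses; the names are mine):

    #define STACK_WORDS 50

    int  stack[STACK_WORDS];
    int *sp;    /* points at the empty location at the top */

    void init(void)     { sp = &stack[0]; }
    int  is_empty(void) { return sp == &stack[0]; }
    int  is_full(void)  { return sp == &stack[STACK_WORDS]; }

    void push(int v) { *sp = v; sp = sp + 1; }    /* store, then advance */
    int  pop(void)   { sp = sp - 1; return *sp; } /* back up, then load  */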

QUEUES
-------

whereas a stack is LIFO,
a queue is FIFO (First In, First Out).

a real life analogy is a line (called a queue in British English).
Person gets on the end of the line (the TAIL),
waits,
and gets off at the front of the line (the HEAD).
getting into the queue is an operation called ENQUEUE.
taking something off the queue is an operation called DEQUEUE.

it takes 2 pointers to keep track of the data structure,
the head and the tail.

initial state:
--------------------------------------------
| | | | | |
--------------------------------------------
^
|
head, and tail

after 1 enqueue operation:


--------------------------------------------
| | x | | | |
--------------------------------------------
^ ^
| |
| tail
head

after another enqueue operation:


--------------------------------------------
| | x | y | | |
--------------------------------------------
^ ^
| |
| tail
head

after a dequeue operation (the head pointer moves past x):


--------------------------------------------
| | x | y | | |
--------------------------------------------
^ ^
| |
| tail
head

Note that (like stacks) when an item is removed from
the data structure, it is physically still present,
but correct use of the structure cannot access it.

If enough items are enqueued (and possibly dequeued) from
the queue, the pointers will eventually run off the end of the
array! This leads to implementations that "wrap" the
beginning of the array to the end, and forms a CIRCULAR QUEUE.
The implementation of the circular queue is a bit more complex.
The conditions to test for an empty queue and full queue
are more difficult. They can be eased by implementing a
queue with one element that is a DUMMY. It is never used
for data storage. The DUMMY element works its way around the
memory allocated for the queue as items are enqueued and dequeued.

This is an example of the space vs. time trade-off.
An extra piece of memory is used in an inefficient
manner, in order to make the test for full/empty queues
more efficient.
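
A C sketch of such a circular queue (the size and names are mine;
the dummy is the one slot that is always kept empty between the
tail and the head):

    #define QSIZE 8        /* 7 usable elements + 1 dummy element */

    int q[QSIZE];
    int head = 0;          /* index of the oldest item            */
    int tail = 0;          /* index of the next empty slot        */

    int q_empty(void) { return head == tail; }
    int q_full(void)  { return (tail + 1) % QSIZE == head; }

    int enqueue(int v)
    {
        if (q_full()) return 0;      /* would overwrite the dummy */
        q[tail] = v;
        tail = (tail + 1) % QSIZE;   /* wrap at the array's end   */
        return 1;
    }

    int dequeue(int *v)
    {
        if (q_empty()) return 0;
        *v = q[head];                /* item stays in memory, but */
        head = (head + 1) % QSIZE;   /* can no longer be reached  */
        return 1;
    }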

Chapter 9 -- registers
REGISTERS
---------

An introduction to the subject of registers -- from a motivational
point of view.

This lecture is an attempt to explain a bit about why computers
are designed (currently) the way they are. Try to remember that
speed of program execution is an important goal. Desire for increased
speed drives the design of computer hardware.

The impediment to speed (currently): transferring data to and from
memory.

look at a SASM instruction:

    iadd x, y

- x and y must both be addresses of data in memory.
- each address is 32 bits.
- so, this instruction requires MORE than 64 bits.

if each read from memory delivers 32 bits of data,
then it takes a lot of reads before this instruction can
be completed.
3 for instruction fetch
1 to load x
1 to load y
1 to store x

that's 6 transactions with memory for 1 instruction!

How bad is the problem?
Assume that a 32-bit 2's complement addition takes 1 time unit.
A read/write from/to memory takes about 10 time units.

So we get
fetch instruction: 30 time units
decode 1 time unit
load x 10 time units
load y 10 time units
add 1 time unit
store x 10 time units
---------------------------------
total time: 62 time units

60/62 = 96.7 % of the time is spent doing memory operations.

what do we do to reduce this number?

1. transfer more data at one time
   if we transfer 2 words at one time, then it only takes 2 reads
   to get the instruction. There is no savings in loading/storing
   the operands. And, an extra word worth of data is transferred
   for each load, a waste of resources.
   So, this idea would give a saving of 1 memory transaction.

2. modify instructions such that they are smaller.
   The Pentium ALREADY has done this! It only has 2 operands
   for each instruction.

Most modern machines allow 3 operands, to give instructions
like:
    add x, y, z ; x <- (y) + (z)
Note that this instruction makes the problem worse!
Add up the memory accesses for this one!
They call a machine like this a 3-address machine. Or, it has
a 3-address instruction set.

The differences between 2-address and 3-address instruction sets:

1. the 2-address instruction set can require more instructions
   to do the same operation as the 3-address instruction set.

   example: add x, y, z   ; 3-address instruction set

            move x, y     ; 2-address instruction set
            add x, z

memory accesses for this:

3-address instruction set: 4 instruction fetch
                           1 to load y
                           1 to load z
                           1 to store x
                           Total = 7

2-address instruction set: 3 instruction fetch (move)
                           1 to load y (move)
                           1 to store x (move)
                           3 instruction fetch (add)
                           1 to load z (add)
                           1 to load x (add)
                           1 to store x (add)
                           Total = 11

So, allow only 1 operand -- called a 1-address format.

now, the instruction add x, y, z will be accomplished
by something like

load z
add y
store x

to facilitate this, there is an implied integer of storage
associated with the ALU. All results of instructions
are placed into this integer -- called an ACCUMULATOR.

the operation of the sequence:
load z  -- place the contents at address z into the accumulator
           (sort of like if you did move accumulator, z in SASM)
add y   -- implied operation is to add the contents of the
           accumulator with the operand, and place the result
           back into the accumulator.
store x -- place the contents of the accumulator into the location
           specified by the operand.

Notice that this 1-address instruction format implies the use
of a variable (the accumulator).

How many memory transactions does it take?
3 -- (load)  2 for instruction fetch, 1 for read of z
3 -- (add)   2 for instruction fetch, 1 for read of y
3 -- (store) 2 for instruction fetch, 1 for write of x
---
9    Not better than the 3 address machine.

BUT, what if the operation following the add was something like
div x, x, 3
then, the value for x is already in the accumulator, and the
code on the 1 address machine could be
load z
add y
div 3
store x
there is only 1 extra instruction (3 memory transactions) for this
whole sequence!
On the 3-address machine: 13 transactions
On the 1-address machine: 11 transactions

REMEMBER this: the 1 address machine uses an extra word of storage
that is located in the CPU.

the example shows a savings in memory transactions
when a value is re-used.
3. shorten addresses. This restricts where variables can be placed.
First, make each address be 16 bits (instead of 32). Then
add x, y, z
requires 2 32-bit words for instruction fetch.

Shorten addresses even more . . . make them each 5 bits long.
Problem: that leaves only 32 words of data for operand storage.
So, use extra move instructions that allow moving data from
a 32 bit address to one of these special 32 words.

Then, the add can fit into 1 instruction.

NOW, put a couple of these ideas together.

Use of storage in CPU (accumulator) allowed re-use of data.
It's easy to design -- put a bunch of storage in the CPU --
call them REGISTERS. How about 32 of them? Then, restrict
arithmetic instructions to only use registers as operands.

add x, y, z

becomes something more like

load reg10, y
load reg11, z
add reg12, reg11, reg10
store x, reg12

presuming that the values for x, y, and z can/will be used again,
the load operations take relatively less time.

A set up like this where arith/logical instr. use only registers
for operands is called a LOAD/STORE architecture.

A computer that allows operands to come from main memory is often
called a MEMORY TO MEMORY architecture, although that term is not
universal.

Load/store architectures are common today. They have the advantages
1. instructions can be fixed length (and short)
2. their design allows (easily permits) pipelining, making load/store
   architectures faster
   (More about pipelining at the end of the semester)

IMPORTANT NOTE: The Pentium architecture (and also SASM) is NOT
a load/store architecture! It was designed (and propagated through
time) with different goals.

a discussion of addressing modes
--------------------------------

Once a computer has registers (and they ALL do!), then there
can be lots of interesting uses of these registers.

Many computers (including the Pentium) offer more ways of
getting at operands. These methods come under the classification
of addressing modes.

load/store architectures usually have a VERY limited set
of addressing modes available

memory to memory architectures (like Pentium) often offer LOTS
of modes. This flexibility often forces these machines to have
variable length instructions (like Pentium). Variable length
instructions can make for all sorts of difficulties in making
a processor go fast!

How to give an addressing mode? It requires extra bits for each
operand to specify which addressing mode is used.

We would likely see an instruction something like:

opcode addr.mode1 extra.stuff.for.operand1
       addr.mode2 extra.stuff.for.operand2

Here are some addressing modes.
An addressing mode really gives the information of where
an operand is (its address). An instruction decides how
to use the address. This address is better termed an
EFFECTIVE ADDRESS.

The processor generates an effective address for each
operand. Depending on the instruction, that effective
address may be used directly, OR it may be used to get
the operand.

Register. The operand is in the register. The term effective
address is not really appropriate here, since there
is no address, just the designation for a register.

Imagine a computer that implemented SASM, but had
3 registers, called reg1, reg2, and reg3.
An addition instruction example that used a register
addressing mode for one of its operands could be

iadd reg2, 1

The contents of reg2 is added to the value 1, and the
result is placed back into reg2. The difference between
this imaginary instruction and a real one is in the
number of required bits for instruction encoding, and
in the number of memory accesses required.

Immediate. The operand is contained within the instruction itself.
So the effective address generated will be within the
instruction.

example: iadd count, 3 ; a SASM example

Often, no effective address is generated at all. When
the instruction is fetched, it contains an encoding of
the immediate operand. Decoding the addressing mode for
the operand leads to taking the operand from the instruction.

Direct. The effective address for an operand is in the
instruction. Note that this is what SASM implies for
most operands.

example: iadd count, 3 ; a SASM example

Register Direct. The effective address for an operand is in
a register.

example: add [reg3], 3

The contents of reg3 is the effective address. For the
add instruction, the contents at that address are loaded
and then added to the immediate value 3. The result
goes back to that same effective address.

Base Displacement. Also called indexed or relative.
The effective address is the sum of the contents of a
register plus a small constant.

Indirect. Adds a level of indirection to direct mode. An address
is specified within the instruction. The contents
at that address is the effective address.

A variation might be Register Indirect. The initial
address is located in a register (instead of in the
instruction).

PC Relative. The effective address is calculated relative to the
current value of the program counter.

As a real life example of this, virtually every architecture
has conditional control instructions that work this way.

An unnamed addressing mode for thought. The addressing mode specifies
2 registers. The effective address is calculated by adding
the contents of the 2 registers together.
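
A loose C analogy for several of these modes (only an analogy --
C variables are not registers -- but the levels of indirection
match up):

    void modes(int a[], int *p, int **pp)
    {
        int x;
        x = 5;        /* immediate: constant is in the instruction   */
        x = a[3];     /* direct-ish: a location named by the program */
        x = *p;       /* register direct: p holds the eff. address   */
        x = *(p + 2); /* base displacement: pointer + small constant */
        x = **pp;     /* indirect: one extra level to the operand    */
        (void)x;      /* silence the unused-value warning            */
    }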
Chapter 10 -- Pentium Architecture
It is time to stop programming in SASM, and get to the real stuff:
Pentium assembly language.

Notes ahead of time:

-- The assembly language details presented are not specific to just
   the Pentium architecture. They are applicable to the 486 and
   to the Pentium II and the Pentium Pro. Many details also work
   for the earlier Intel architectures.

-- I'm only presenting a subset of the Pentium instruction set.
   The real instruction set is REALLY large, and contains many
   instructions that no one really uses anymore. We don't cover
   those "obsolete" instructions. We also don't cover the
   instructions that only the operating system would use.

About the Pentium Architecture
------------------------------

-- It is not a load/store architecture.

-- The instruction set is huge! We go over only a fraction of
   the instruction set. The text only presents a fraction.

-- There are lots of restrictions on how instructions/operands are
   put together, but there is also an amazing amount of flexibility.

Registers
---------

The Intel architectures as a set just do not have enough registers
to satisfy most assembly language programmers. Still, the processors
have been around for a LONG time, and they have a sufficient number
of registers to do whatever is necessary.

For our (mostly) general purpose use, we get

32-bit    16-bit    8-bit                 8-bit
                    (high part of 16)     (low part of 16)

EAX       AX        AH                    AL
EBX       BX        BH                    BL
ECX       CX        CH                    CL
EDX       DX        DH                    DL

and

EBP       BP
ESI       SI
EDI       DI
ESP       SP

There are a few more, but we won't use or discuss them. They
are only used for memory accessibility in the segmented memory
model.

Using the registers:


As an operand, just use the name (upper case and lower case both
work interchangeably).

EBP is a frame pointer (see Chapter 11).


ESP is a stack pointer (see Chapter 11).

Oddities:
This is the only architecture that I know of where the programmer
can designate part of a register as an operand. On ALL other
machines, the whole register is designated and used.

ONE MORE REGISTER:
Many bits used for controlling the action of the processor and
setting state are in the register called EFLAGS. This register
contains the condition codes:

OF Overflow flag
SF Sign flag
ZF Zero flag
PF Parity flag
CF Carry flag

The settings of these flags are checked in conditional control
instructions. Many instructions set one or more of the flags.
(Note that we only utilized SF and ZF in SASM.)

There are many other bits in the EFLAGS register: TO BE DISCUSSED


LATER.

The use of the EFLAGS register is implied (rather than explicit)


in instructions.

Accessing Memory
----------------

There are 2 memory models supported in the Pentium architecture.
(Actually it is the 486 and more recent models that support 2 models.)

In both models, memory is accessed using an address. It is the
way that addresses are formed (within the processor) that differs
in the 2 models.

FLAT MEMORY MODEL

-- The memory model that we use. AND, the memory model that every
   other manufacturers' processors also use.

SEGMENTED MEMORY MODEL

-- Different parts of a program are assumed to be in their own,
   set-aside portions of memory. These portions are called
   segments.

-- An address is formed from 2 pieces: a segment location and
   an offset within a segment.

   Note that each of these pieces can be shorter (contain fewer
   bits) than a whole address. This is much of the reason that
   Intel chose this form of memory model for its earliest
   single-chip processors.

-- There are segments for:

code
data
stack
other

-- Which segment something is in can be implied by the memory
   access involved. An instruction fetch will always be looking
   in the code segment. A push instruction (we'll talk about this
   with chapter 11) always accesses the stack segment. Etc.

Addressing Modes
----------------

Some would say that the Intel architectures only support 1 addressing
mode. It looks (something like) this:

effective address = base reg + (index reg x scaling factor) + displacement

where
base reg is EAX, EBX, ECX, EDX or ESP or EBP
index reg is EDI or ESI
scaling factor is 1, 2, 4, or 8

The syntax of using this (very general) addressing mode will
vary from system to system. It depends on the preprocessor
and the syntax accepted by the assembler.

For our implementation, an operand within an instruction that
uses this addressing mode could look like

    [EAX][EDI*2 + 80]

The effective address calculated will be the contents of
register EDI multiplied times 2, added to the constant 80,
added to the contents of register EAX.

There are extremely few times where a high-level language
compiler can utilize such a complex addressing mode. It is
much more likely that simplified versions of this mode
will be used.

SOME ADDRESSING MODES

-- register mode --
The operand is in a register. The effective address is the
register (weird).

Example instruction:

mov eax, ecx

Both operands use register mode. The contents of register ecx


is copied to register eax.

-- immediate mode --
The operand is in the instruction. The effective address is within
the instruction.

Example instruction:

mov eax, 26

The second operand uses immediate mode. Within the instruction


is the operand. It is copied to register eax.

-- register direct mode --


The effective address is in a register.

Example instruction:

mov eax, [esp]

The second operand uses register direct mode. The contents of


register esp is the effective address. The contents of memory
at the effective address are copied into register eax.

-- direct mode --
The effective address is in the instruction.
Example instruction:

mov eax, var_name

The second operand uses direct mode. The instruction contains


the effective address. The contents of memory
at the effective address are copied into register eax.

-- base displacement mode --


The effective address is the sum of a constant and the contents
of a register.

Example instruction:

mov eax, [esp + 4]

The second operand uses base displacement mode. The instruction


contains a constant. That constant is added to the contents
of register esp to form an effective address. The contents
of memory at the effective address are copied into register eax.

-- base-indexed mode -- (Intel's name)


The effective address is the sum of the contents of two registers.

Example instruction:

mov eax, [esp][esi]

The contents of registers esp and esi are added to form an


effective address. The contents of memory at the effective
address are copied into register eax.

Note that there are restrictions on the combinations of registers


that can be used in this addressing mode.

-- PC relative mode --
The effective address is the sum of the contents of the PC and
a constant contained within the instruction.

Example instruction:

jmp a_label

The contents of the program counter is added to an offset that


is within the machine code for the instruction. The resulting
sum is placed back into the program counter. Note that from the
assembly language it is not clear that a PC relative addressing
mode is used. It is the assembler that generates the offset
to place in the instruction.

Instruction Set
----------------
Generalities:
-- Many (most?) of the instructions have exactly 2 operands.
If there are 2 operands, then one of them will be required
to use register mode, and the other will have no restrictions
on its addressing mode.

-- There are most often ways of specifying the same instruction
   for 8-, 16-, or 32-bit operands. I left out the 16-bit ones
   to reduce presentation of the instruction set. Note that
   on a 32-bit machine, with newly written code, the 16-bit form
   will never be used.

Meanings of the operand specifications:


reg - register mode operand, 32-bit register
reg8 - register mode operand, 8-bit register
r/m - general addressing mode, 32-bit
r/m8 - general addressing mode, 8-bit
immed - 32-bit immediate is in the instruction
immed8 - 8-bit immediate is in the instruction
m - symbol (label) in the instruction is the effective address

Data Movement
-------------

mov reg, r/m ; copy data


r/m, reg
reg, immed
r/m, immed

movsx reg, r/m8 ; sign extend and copy data

movzx reg, r/m8 ; zero extend and copy data

lea reg, m ; get effective address


(A newer instruction, so its format is much restricted
over the other ones.)

EXAMPLES:

mov EAX, 23 ; places 32-bit 2's complement immediate 23


; into register EAX
movsx ECX, AL ; sign extends the 8-bit quantity in register
; AL to 32 bits, and places it in ECX
mov [esp], -1 ; places value -1 into memory, address given
; by contents of esp
lea EBX, loop_top ; put the address assigned (by the assembler)
; to label loop_top into register EBX

Integer Arithmetic
------------------

add reg, r/m ; two's complement addition


r/m, reg
reg, immed
r/m, immed

inc reg ; add 1 to operand


r/m

sub reg, r/m ; two's complement subtraction


r/m, reg
reg, immed
r/m, immed

dec reg ; subtract 1 from operand


r/m

neg r/m ; get additive inverse of operand

mul eax, r/m ; unsigned multiplication


; edx||eax <- eax * r/m

imul r/m ; 2's comp. multiplication


; edx||eax <- eax * r/m
reg, r/m ; reg <- reg * r/m
reg, immed ; reg <- reg * immed

div r/m ; unsigned division


; does edx||eax / r/m
; eax <- quotient
; edx <- remainder

idiv r/m ; 2's complement division


; does edx||eax / r/m
; eax <- quotient
; edx <- remainder

cmp reg, r/m ; sets EFLAGS based on


r/m, immed ; second operand - first operand
r/m8, immed8
r/m, immed8 ; sign extends immed8 before subtract

EXAMPLES:

neg [eax + 4] ; takes doubleword at address eax+4


; and finds its additive inverse, then places
; the additive inverse back at that address
; the instruction should probably be
; neg dword ptr [eax + 4]

inc ecx ; adds one to contents of register ecx, and


; result goes back to ecx

Logical
-------
not r/m ; logical not

and reg, r/m ; logical and


reg8, r/m8
r/m, reg
r/m8, reg8
r/m, immed
r/m8, immed8

or reg, r/m ; logical or


reg8, r/m8
r/m, reg
r/m8, reg8
r/m, immed
r/m8, immed8

xor reg, r/m ; logical exclusive or


reg8, r/m8
r/m, reg
r/m8, reg8
r/m, immed
r/m8, immed8

test r/m, reg ; logical and to set EFLAGS


r/m8, reg8
r/m, immed
r/m8, immed8

EXAMPLES:

and edx, 00330000h ; logical and of contents of register


; edx (bitwise) with 0x00330000,
; result goes back to edx

Floating Point Arithmetic
-------------------------
Since the newer architectures have room for floating point
hardware on chip, Intel defined a simple-to-implement
extension to the architecture to do floating point arithmetic.
In their usual zeal, they have included MANY instructions to
do floating point operations.

The mechanism is simple. A set of 8 registers are organized


and maintained (by hardware) as a stack of floating point
values. ST refers to the stack top. ST(1) refers to the
register within the stack that is next to ST. ST and ST(0)
are synonyms.

There are separate instructions to test and compare the values


of floating point variables.
finit ; initialize the FPU

fld m32 ; load floating point value


m64
ST(i)

fldz ; load floating point value 0.0

fst m32 ; store floating point value


m64
ST(i)

fstp m32 ; store floating point value


m64 ; and pop ST
ST(i)

fadd m32 ; floating point addition


m64
ST, ST(i)
ST(i), ST

faddp ST(i), ST ; floating point addition


; and pop ST

ETC. (see p.201-202)

I/O
---
The only instructions which actually allow the reading and
writing of I/O devices are priviledged. The OS must handle
these things. But, in writing programs that do something
useful, we need input and output. Therefore, there are some
simple macros defined to help us do I/O.

These are used just like instructions.

put_ch r/m ; print character in the least significant


; byte of 32-bit operand

get_ch r/m ; character will be in AL

put_str m ; print null terminated string given


; by label m

Control Instructions
--------------------
These are the same control instructions that all started with
the character 'b' in SASM.

jmp m ; unconditional jump


jg m ; jump if greater than 0
jge m ; jump if greater than or equal to 0
jl m ; jump if less than 0
jle m ; jump if less than or equal to 0

ETC. (see p. 205)

Chapter 11 -- Procedures
All about Procedures
--------------------

an introduction to procedures

why have procedures?
-- reuse of code simplifies program writing
-- modular code facilitates modification
-- allows different programmers to write different parts of the
   same program
-- etc.

Assembly languages typically provide little or no support for
procedure implementation.

So, we get to build a mechanism for implementing procedures out
of what we already know.

First, some terms and what we need.

In Pascal:

begin
.
.
.
x := larger(a, b); CALL
.
.
.
end.
HEADER PARAMETERS
function larger (one, two: integer): integer;
begin
if ( one > two ) then
larger := one BODY
else
larger := two
end;

In C:

{
.
.
.
x = larger(a, b); CALL
.
.
.
}
HEADER PARAMETERS
int larger (int one, int two)
{
    if ( one > two )
        return one;               BODY
    else
        return two;
}

Steps in the execution of the procedure:
1. save return address
2. procedure call
3. execute procedure
4. return

what is return address?   instruction following call

what is procedure call?   jump or branch to first instruction
                          in the procedure

what is return?           jump or branch to return address

A possible Pentium implementation of procedure call:

; one call
lea EAX, rtn_point
jmp proc1
rtn_point:

; another call
lea EAX, rtn_point2
jmp proc1
rtn_point2:

proc1: ; 1st instruction of procedure here
    .
    .
    .
    jmp [EAX]
This really doesn't work well if there are nested calls,
since we need a location to store the return address.
We CANNOT just use register EAX, as in the following
BAD code:

; one call
lea EAX, rtn_point
jmp proc1
rtn_point:

proc1: ; 1st instruction of procedure here


.
.
.
lea EAX, rtn_point_nested
jmp proc2
rtn_point_nested:
.
.
.
jmp [EAX] ; EAX will have been overwritten by the
; lea instruction in proc1.

proc2: ; 1st instruction of procedure here


.
.
.
jmp [EAX]

What is needed to handle this problem is to have a way to
save return addresses as they are generated. For a recursive
subroutine, it is not known ahead of time how many times
the subroutine will be called. This data is generated
dynamically; while the program is running.

These return addresses will need to be used in the reverse
order that they are saved.

The best way to save dynamically generated data is on a STACK.

Here is the code rewritten to use a stack:

.data
addr_stack dd 100 dup(0) ; hope that 100 addresses is enough!
stack_ptr dd ?

; stack initialization code


lea EDX, addr_stack
mov stack_ptr, EDX

; one call
lea EAX, rtn_point
jmp proc1
rtn_point:

proc1: ; 1st instruction of procedure here


; push return address on stack
mov [EDX], EAX
add EDX, 4
.
.
.
lea EAX, rtn_point_nested
jmp proc2
rtn_point_nested:
.
.
.
; pop retn address off stack
sub EDX, 4
mov EAX, [EDX]
jmp [EAX]

proc2: ; 1st instruction of procedure here


.
.
.
jmp [EAX]

SYSTEM STACK
------------
A stack is so frequently used in implementing procedure call/return,
that many computer systems predefine a stack, the SYSTEM STACK.

A stack handles dynamic data well.

STATIC -- can be defined when program is written (compile time)
DYNAMIC -- is defined when a program is executed (run time)

In this case, it is the amount of storage that cannot be
determined until run time.

The size of the system stack is very large. In theory, it should


be infinitely large. In practice, it must have a size limit.

In memory, we have:

address | |
0 | your |
| program |
| here |
| |
| |
| |
| |
| |
| system | / \
very | stack | | grows towards smaller addresses
large | here | |
addresses

terminology:

Some people say that this stack grows DOWN in memory.
This means that the stack grows towards smaller memory
addresses. Their picture would show address 0 at the
bottom (unlike my picture).

DOWN and UP are vague terms, unless you know what the
picture looks like.

The Pentium stack is defined to grow towards smaller


addresses, and the stack pointer points to the full location
at the top of the stack. The stack pointer is register ESP,
and it is defined before program execution begins (by the OS).

push, in Pentium:

sub esp, 4
mov [esp], ?

OR

mov [esp - 4], ? ; not a good implementation, since it


sub esp, 4 ; uses the space before allocation

pop, in Pentium:

mov ?, [esp]
add esp, 4

NOTE: If you use register esp for any use other than for
a stack pointer, then the location of the stack is lost.

The use of the stack is SO common, that there are explicit
instructions to do the push and pop operations:

push r/m
     reg
     immed

pop reg

So, we would never use the 2-instruction sequence above,
only the push and pop instructions!

An example of using the system stack to save return addresses:

    lea EAX, rtn1
    jmp proc1
rtn1:
.
.
.
lea EAX, rtn2
jmp proc1
rtn2:
.
.
.

proc1: push EAX ; save return address

.
.
.
lea EAX, rtn3 ; this would overwrite the return
jmp proc2 ; address if it had not been saved.
rtn3:
.
.
.

pop EAX ; restore return address


jmp [EAX]

proc2:
.
.
.
jmp [EAX]

It is presumed that the code to call/return from procedures
will be well-used. And, it is.

To make the compiler or assembly language programmer's job
easier, there are 2 instructions that do procedure call and
return.

call r/m  ; Push the return address onto the stack
          ; and jump to effective address given by
          ; the operand.
          ; The return address is the address of the
          ; instruction following the call.

ret immed ; Pop the return address off the stack
          ; and jump to that address. Stack pointer
          ; (esp) is adjusted by the number of bytes
          ; given in the immediate operand.
          ; This instruction can also be used with no
          ; operands. It then defaults to only popping
          ; the return address off the stack.

The example again, using call and return instead of push/pop/lea/jmp.

call proc1
.
.
.
call proc1
.
.
.

proc1:
.
.
.
call proc2
.
.
.
ret

proc2:
.
.
.
ret

about STACK FRAMES (ACTIVATION RECORDS)
---------------------------------------

From a compiler's point of view, there are a bunch of things


that should go on the stack relating to procedure call/return.
They include:
return address
parameters
other various registers

Each procedure has different requirements for numbers of


parameters, their size, and how many registers (which ones)
will need to be saved on the stack. So, we compose a
STACK FRAME or ACTIVATION RECORD that is specific to a
procedure.

Space for a stack frame gets placed on the stack each time
a procedure is called, and taken off the stack each time a
return occurs. These stack frames are pushed/popped
DYNAMICALLY (while the program is running).

example:
call A
call B
.
.

A: call C
call D
ret

B: call D
ret

C: call E
ret

D: ret

E: ret

show stack for a trace through this calling sequence

The code (skeleton) for one of these procedures:


A: call C
call D
ret

becomes . . .

A: sub esp, 20   ; allocate frame for A
                 ; A's return address is at [esp+20]
   call C
   call D

   add esp, 20   ; remove A's frame from stack
   ret

Some notes on this:


-- the allocation and removal of a frame should be done within
the body of the procedure. That way, the compiler does not
need to know the size of a procedure's frame.
-- Accesses to A's frame can be done via offsets from esp.
about frame pointers
--------------

The stack gets used for more than just pushing/popping stack frames.
During the execution of a procedure, there may be a need for temporary
storage of variables. The common example of this is in expression
evaluation.
Example: high level language statement
Z = (X * Y) + (A/2) - 100
The intermediate values of X*Y and A/2 must be stored somewhere.
On older machines, register space was at a premium. There just weren't
enough registers to be used for this sort of thing. So, intermediate
results (local variables) were stored on the stack.

They don't go in the stack frame of the executing procedure; they


are pushed/popped onto the stack as needed.

So, at one point in a procedure, return address might be at [esp+16]

| |
---------
| |
---------
| | --- <- esp
--------- |
| | |
--------- |
| | |--- procedure's frame
--------- |
| | |
--------- |
|rtnaddr| |
--------- ---
| |
---------

and, at another point within the same procedure, it might be


at [esp+24]

---------
| |
---------
| temp2 | <- esp
---------
| temp1 |
---------
| | ---
--------- |
| | |
--------- |
| | |--- procedure's frame
--------- |
| | |
--------- |
|rtnaddr| |
--------- ---
| |
---------

All this is motivation for keeping an extra pointer around that does
not move with respect to the current stack frame.

Call it a FRAME POINTER. Make it point to the base of the current
frame:

---------
| |
---------
| temp2 | <- esp
---------
| temp1 |
---------
| | ---
--------- |
| | |
--------- |
| | |--- procedure's frame
--------- |
| | |
--------- |
|rtnaddr| | <- frame pointer
--------- ---
| |
---------

Now items within the frame can be accessed with offsets from the
frame pointer, AND the offsets do not change within the procedure.

The return address will now be at frame pointer.

Pentium architecture has a register dedicated as a frame
pointer. It is EBP. The last 2 letters stand for "base pointer".

Items within the frame are accessed using negative
(but fixed) offsets from EBP.

NOTE:
-- EBP must be initialized at the start of every procedure,
and restored at the end of every procedure.

The skeleton implementation of a procedure that uses a frame pointer:


A: push ebp ; save caller's frame pointer
mov ebp, esp ; set up A's frame pointer
sub esp, 16 ; allocate remainder of frame for A
; A's return address is at [ebp+4]
call C
call D

mov esp, ebp ; remove A's frame from stack


pop ebp ; restore caller's frame pointer
ret

parameter passing.
------------------

parameter = argument

Just as there is little/no support for implementing procedures


in many assembly languages, there is little/no support for passing
parameters to those procedures.

Remember, when it comes to the implementation,
-- convention
-- it's up to the programmer to follow the conventions

Passing parameters means getting data into a place set aside


for the parameters. Both the calling program and the procedure
need to know where the parameters are.

The calling program places them there, and possibly uses


values returned by the procedure. The procedure uses
the parameters.

a note on parameter passing --


a HLL specifies rules for passing parameters. There are basically
2 types of parameters.

Note that a language can offer 1 or both types.

call by value -- what C has. In Pascal, these are parameters


declared without the var in front of the variable name.
Fortran doesn't have this type of parameter.

The parameter passed may not be modified by the procedure.


This can be implemented by passing a copy of the value.
What call by value really implies is that the procedure can
modify the value (copy) passed to it, but that the value
is not changed outside the scope of the procedure.

call by reference -- what Fortran has. In Pascal, these are
"var type" parameters.
The parameter passed to the subroutine can be modified,
and the modification is seen outside the scope of the
subroutine. It is sort of like having access to a global
variable.

There are many ways of implementing these 2 variable types.


If call by value is the only parameter type allowed, how
can we implement a reference type parameter?
Pass the address of the variable as the parameter. Then
access to the variable is made through its address. This
is what is done in C.

Simplest parameter passing mechanism -- Use registers
-----------------------------------------------------

the calling program puts the parameter(s) into specific registers,
and the procedure uses them.

example:

.
.
.
mov EAX, [EDX] ; put parameter in EAX
call decrement
mov [EDX], EAX ; recopy parameter to its correct place.
.
.
.
decrement:
sub EAX, 1
ret

Notes: -- This is a trivial example, since the procedure is 1 line


long.
-- Why not just use [EDX] within the procedure?
1. convention -- parameters are passed in specific registers.
2. same procedure could be used to decrement the value
in other parameters -- just copy the value in before
the call, and copy it out afterwards.

-- The Intel architectures suffer from not having enough


registers. With only a few to play around with (as
general purpose registers), we VERY soon run out of
registers to use. So, this method of passing parameters
is really not used for this architecture. It IS used
on all the more modern architectures. They even go so
far as to set aside a subset of their registers dedicated
as a location for passing parameters.
historically more significant mechanism: parameters on stack
---------------------
(The method of choice for this architecture.)

place the parameters to a procedure (function) in the activation


record (AR) for the procedure.

push P1 ; place parameter 1 into AR of proc


push P2 ; place parameter 2 into AR of proc
call proc
.
.
.
proc: push ebp ; save caller's frame pointer
mov ebp, esp ; set up A's frame pointer
sub esp, 16 ; allocate remainder of frame for A
; A's return address is at [ebp+4]

; use parameters in procedure calculations


; P1 is at [ebp+12]
; P2 is at [ebp+8]

mov esp, ebp ; remove A's frame from stack


pop ebp ; restore caller's frame pointer
ret 8 ; pop return address, return, and
; remove the parameters!

calling program: pushes parameters onto the stack
                 calls procedure

procedure: allocates AR (or remainder of AR)
           deallocates AR of procedure

The activation record (frame) for a procedure so far:

^ smaller addresses up here


|----------------|
| |
|----------------|
| |
|----------------|
| caller's ebp |<- ebp (AND possibly esp)
|----------------|
| return address |
|----------------|
| P2 |
|----------------|
| P1 |
|----------------|
| |
|----------------|
| |
New problem:
What happens if a procedure has lots of variables, and it
runs out of registers to put them in. These are really the
local variables of the procedure.

Most common solution: store local variables temporarily on the


stack in AR.

Two ways of implementing this:

CALLEE SAVED
A procedure clears out some registers for its own use.

The called procedure saves register values in its AR.

CALLER SAVED
The calling program saves the registers and local variables
that it does not want a called procedure to overwrite.

REVISITING PROCEDURES.

What needs to be done depends on HLL.


The order is fairly consistent.
What is done by caller/callee varies from implementation to implementation.

Needed:
--> items to go in activation record.

return address
frame pointer (if used)
parameters
local variables --| may be some overlap here
saved registers --|

--> mechanism

before ANY procedure CALL
1. caller gets parameters into correct location
then
2. control is transferred to procedure

before procedure RETURN
1. put return values into correct location
2. restore anything that needs to be restored (return address, callee
   saved registers, frame pointer)
3. remove activation record
then
4. jump to return location

Chapter 12 -- The Assembly Process

THE ASSEMBLY PROCESS


--------------------

-- a computer understands machine code
-- people (and compilers) write assembly language

assembly      -----------------       machine
source   -->  |   assembler   |  -->  code
code          -----------------

an assembler is a program -- a very deterministic program --
it translates each instruction to its machine code.

in the past, there was a one-to-one correspondence between
assembly language instructions and machine language instructions.

this is no longer the case. Assemblers are nowadays made more
powerful, and can "rework" code.

The Pentium (being based on the 8086) has a one-to-one
correspondence between assembly language instructions and
machine language instructions.

ASSEMBLY
---------
the assembler's job is to
1. assign addresses
2. generate machine code

an assembler will

-- assign addresses

-- generate machine code

-- generate an image of what memory must look like for the


program to be executed.

a simple assembler will make 2 complete passes over the data


(source code) to complete this task.
pass 1: create complete SYMBOL TABLE
generate machine code for instructions other than
branches, jumps, call, lea, etc. (those instructions
that rely on an address for their machine code).
pass 2: complete machine code for instructions that didn't get
finished in pass 1.
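
A sketch (in C, names mine) of the symbol table that the two
passes share:

    #include <string.h>

    struct symbol {
        char     name[32];
        unsigned address;
    };

    struct symbol table[1000];
    int nsymbols = 0;

    /* pass 1: a label is defined at the current location counter */
    void define(const char *name, unsigned addr)
    {
        strcpy(table[nsymbols].name, name);
        table[nsymbols].address = addr;
        nsymbols++;
    }

    /* pass 2: look up a label used by a branch, jump, call, lea */
    int lookup(const char *name, unsigned *addr)
    {
        for (int i = 0; i < nsymbols; i++)
            if (strcmp(table[i].name, name) == 0) {
                *addr = table[i].address;
                return 1;
            }
        return 0;    /* never defined: an error */
    }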

assembler starts at the top of the source code program,


and SCANS. It looks for
-- directives (.data .code .stack .486, etc. )
-- instructions

IMPORTANT:
there are separate memory spaces for data and instructions.
the assembler allocates them IN SEQUENTIAL ORDER as it scans
through the source code program.
the starting addresses are fixed -- ANY program will be assembled
to have data and instructions that start at the same address.

Generating Machine Code for an Instruction


------------------------------------------

This is complex due to the large variety of addressing


modes combined with the large number of instructions.

Most often, the machine code for an instruction consists of


1) an 8-bit opcode
(the choice of opcode will depend somewhat on the addressing
modes used for the operands)
followed by
2) one or more bytes describing the addressing modes for
the operands.

EXAMPLE INSTRUCTION:
add eax, 24

Find Appendix C. That is where all this machine code


stuff is specified.

For the add instruction, the table lists:

add reg, r/m     03 /r
    r/m, reg     01 /r
    r/m, immed   81 /0 id

The only one that would match the operand types is the
third one in the list:
add r/m, immed 81 /0 id

So, this is the one we choose.

The 81 is the 8-bit opcode.

What follows the opcode is information about the addressing


mode of the 2 operands (add always has exactly 2 operands
and the addressing mode of each must be explicitly specified)

Commonly, both operands are described (exactly) by the


encoding of a single byte that Intel calls the ModR/M byte.

Within the machine code description (81 /0 id), the


/0 symbol describes part of this ModR/M byte.
/0 is found in the explanations table on page 352.
It says the reg field of the ModR/M byte is 000.

The ModR/M byte:

This byte describes the addressing mode of operands.

It is divided up into 3 fields as follow

BITS 7 6 5 4 3 2 1 0
mod reg/opcode r/m

For this example instruction, bits 5, 4, 3 are set to be 000,


giving
BITS 7 6 5 4 3 2 1 0
mod reg/opcode r/m
0 0 0

This tells that the second operand is an immediate.


The description of the first operand will be done with
the mode and r/m fields of the ModR/M byte.

Look in the table (page 353) to find register mode,


using register EAX (since that is what the example instruction has).
Table says that Mod is 11, R/M is 000 giving

BITS 7 6 5 4 3 2 1 0
mod reg/opcode r/m
1 1 0 0 0 0 0 0

The last step is to get the id part. From the explanation


table (page 352), id is described as 32-bit immediate.
Therefore id corresponds to a 32-bit two's complement representation
of the value 24.

This is 0000 0000 0000 0000 0000 0000 0001 1000


In hex, this is 0x00000018.

Putting all this stuff together, we get machine code for


the example instruction ( add eax, 24 )

Note that everything is in hexadecimal.


AND, immediate values are listed least significant byte first!
opcode
81
ModR/M byte
c0
immediate
18
00
00
00

Written out left to right:

    81 c0 18 00 00 00
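
The same steps, written out as a little C program (my illustration)
that builds the 6 bytes:

    #include <stdio.h>

    /* assemble:  add eax, 24   using the form  81 /0 id */
    int main(void)
    {
        unsigned char code[6];
        unsigned imm = 24;

        code[0] = 0x81;                     /* opcode                   */
        code[1] = (unsigned char)((3 << 6)  /* mod = 11 (register mode) */
                                | (0 << 3)  /* reg = 000 (the /0)       */
                                |  0);      /* r/m = 000 (EAX)          */
        for (int i = 0; i < 4; i++)         /* immediate, LSB first     */
            code[2 + i] = (unsigned char)((imm >> (8 * i)) & 0xff);

        for (int i = 0; i < 6; i++)
            printf("%02x ", code[i]);       /* 81 c0 18 00 00 00        */
        printf("\n");
        return 0;
    }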

One more example of generating machine code.

Machine code for the Pentium instruction

dec dword ptr [EDX]

From page 349, we want the form of the decrement instruction

dec r/m ff /1

The opcode is ff, and it describes that there will be one operand,
and it is of the general form. The /1 says that the
Reg field of the ModR/M byte will be 001.

BITS 7 6 5 4 3 2 1 0
Mod Reg/Opcode R/M
0 0 1

The table on page 353 describes the Mod and R/M fields.
Find the register direct addressing mode, using register EDX
in the table. It gives Mod 00 and R/M 010.

BITS 7 6 5 4 3 2 1 0
Mod Reg/Opcode R/M
0 0 0 0 1 0 1 0

In hex, this is 0a.

The machine code is now complete: ff 0a.

A BIG EXAMPLE:

.data
a1 dd 4
a2 dd ?
a3 dd 5 dup(0)
.code
main: mov ecx, 20
mov eax, 15
mov edx, 0
jz target_label
loop1: add edx, eax
imul [ebp + 8]
dec ecx
jg loop1
target_label:
done

First step: putting the data section together.

Upon scanning the source code, the token .data is read.


This is the directive that tells the assembler that what follows
gets allocated within the data section of the program.

Remember that a directive is a "command" to the assembler


about how to assemble the source code.

The next token encountered is the label a1.

This symbol (label) is not yet in its symbol table, so the


assembler assigns an address, and places it in the symbol
table.

Remember, the assembler assigns the first available address


within the data section.

Symbol table
symbol address
---------------------
a1 0040 0000 (I made up this address, 'cuz we need
a starting address for the data section.)

The next token is dd. It lets the assembler know to allocate


one doubleword of space at the current address.

The next token is 4. It tells the assembler that the value of


the allocated space is to be the value 4.

The following line does much the same,


placing a2 in the symbol table at the next available address
(0x0040 0004)
allocating 1 doubleword
not putting something specific in the allocated space

When finished with the data section, we will have the


symbol table
symbol address
---------------------
a1 0040 0000
a2 0040 0004
a3 0040 0008
0040 000c
0040 0010
0040 0014
0040 0018

0040 001c (the next available address within the
           data section. NOT PART OF THE TABLE.)

and,

memory map of data section


address contents notes
hex
0040 0000 0000 0004 for a1
0040 0004 0000 0000 for a2 (defaults to 0)
0040 0008 0000 0000 5 double words for a3
0040 000c 0000 0000
0040 0010 0000 0000
0040 0014 0000 0000
0040 0018 0000 0000

Upon encountering the .code directive, the assembler knows that
the next addresses it assigns will be within the code section
of the program (separate from the data).

Assume that the code will be assembled such that the first
instruction is placed at address 0x0000 0000.

The code (repeated):


.code
main: mov ecx, 20
mov eax, 15
mov edx, 0
jz target_label
loop1: add edx, eax
imul [ebp + 8]
dec ecx
jg loop1
target_label:
done

The first token picked up after the .code directive


is the label main. (As with ALL symbols,) the assembler
looks to see if this symbol is already in the symbol table.
It is not, so the assembler assigns the first available
address, and places it in the symbol table.
symbol table
symbol address
---------------------
a1 0040 0000
a2 0040 0004
a3 0040 0008
0040 000c
0040 0010
0040 0014
0040 0018

0040 001c (the next available address within the
           data section. NOT PART OF THE TABLE.)
main 0000 0000

Next, the assembler picks up the token mov. It knows that


this is an instruction, and reads the rest of the instruction
in order to generate the machine code for this instruction.

mov ecx, 20

mov reg, immed b8 + rd

No ModR/M byte needed, since the register is incorporated


into the opcode byte, and the immediate must follow.

rd (from table on page 352) is 1, b8+1=b9

The immediate is 0x00000014.

So, the machine code will be


b9 14 00 00 00

These 5 bytes are placed at address 0x0000 0000, and


the next available address for an instruction becomes
0x0000 0005.

The assembler is ready for the next token. It will be the


second mov instruction in the program. It knows that
this is an instruction, and reads the rest of the instruction
in order to generate the machine code for this instruction.

mov eax, 15

mov reg, immed b8 + rd

No ModR/M byte needed, since the register is incorporated


into the opcode byte, and the immediate must follow.

rd (from table on page 352) is 0, b8+0=b8

The immediate is 0x0000000f.

So, the machine code will be
b8 0f 00 00 00

These 5 bytes are placed at address 0x0000 0005, and


the next available address for an instruction becomes
0x0000 000a.

The assembler is ready for the next token. It will be the


third mov instruction in the program. It knows that
this is an instruction, and reads the rest of the instruction
in order to generate the machine code for this instruction.

mov edx, 0

mov reg, immed b8 + rd

No ModR/M byte needed, since the register is incorporated


into the opcode byte, and the immediate must follow.

rd (from table on page 352) is 2, b8+2=ba

The immediate is 0x00000000.

So, the machine code will be


ba 00 00 00 00

These 5 bytes are placed at address 0x0000 000a, and


the next available address for an instruction becomes
0x0000 000f.

The assembler is ready for the next token. It will be the


jz instruction in the program. It knows that
this is an instruction, and reads the rest of the instruction
in order to generate the machine code for this instruction.

jz target_label

jz rel32 0f 84 "cd"

The "cd" is a 32-bit code offset. It needs to be the


difference between what the PC will be when executing this
code and the address assigned for label target_label.

The problem with this is that target_label has not yet
been assigned an address. So, the assembler will need to
wait on figuring out the 32-bit code offset portion of
this instruction until the second pass of the assembler.

The assembler does know that this instruction will be
exactly 6 bytes long, so it can continue with assembly
at location 0x0000 0015.

A memory map of text section so far is:


memory map of text section
address contents

0000 0000 b9 14 00 00 00
0000 0005 b8 0f 00 00 00
0000 000a ba 00 00 00 00
0000 000f 0f 84 ?? ?? ?? ??
0000 0015

The assembler is ready for the next token. It will be the
label loop1. The assembler checks if this symbol is in the
symbol table. It is not, so the assembler assigns an address
and places the symbol in the table.

symbol table
symbol address
---------------------
a1 0040 0000
a2 0040 0004
a3 0040 0008
0040 000c
0040 0010
0040 0014
0040 0018

0040 001c  (the next available address within the
           data section. NOT PART OF THE TABLE.)
main 0000 0000
loop1 0000 0015

The assembler is ready for the next token. It will be the
add instruction in the program. It knows that
this is an instruction, and reads the rest of the instruction
in order to generate the machine code for this instruction.

add edx, eax

add reg, r/m    03 /r
or
add r/m, reg    01 /r

It doesn't matter which one is chosen. They are the
same length. Choose the first one.

/r means that the ModR/M byte has both a register
operand and an R/M operand.

BITS 7 6 5 4 3 2 1 0
Mod Reg/Opcode R/M
1 1 0 1 0 0 0 0

In hex, this is d0.


So, the machine code for the instruction is 03 d0.
These 2 bytes are placed at address 0x0000 0015.
The next available address for code will be 0x0000 0017.

A memory map of text section so far is:

memory map of text section


address contents

0000 0000 b9 14 00 00 00
0000 0005 b8 0f 00 00 00
0000 000a ba 00 00 00 00
0000 000f 0f 84 ?? ?? ?? ??
0000 0015 03 d0
0000 0017

On to the next instruction.

imul [ebp + 8]

imul r/m f7 /5

/5 means that the ModR/M byte has a register
field of 101

The addressing mode for [ebp + 8] is under disp32[EBP]
in the table on page 353.

BITS 7 6 5 4 3 2 1 0
Mod Reg/Opcode R/M
1 0 1 0 1 1 0 1

In hex, this is ad.

The 32-bit displacement follows the ModR/M byte. It contains
a 32-bit 2's complement encoding of the value 8.
0x 00 00 00 08

The machine code for this instruction is f7 ad 08 00 00 00.
These 6 bytes are placed at address 0x0000 0017.
The next available address for code will be 0x0000 001d.

A memory map of text section so far is:

memory map of code section


address contents

0000 0000 b9 14 00 00 00
0000 0005 b8 0f 00 00 00
0000 000a ba 00 00 00 00
0000 000f 0f 84 ?? ?? ?? ??
0000 0015 03 d0
0000 0017 f7 ad 08 00 00 00
0000 001d

The next instruction is easy.

dec ecx

dec reg 48 + rd

rd is 1 for ecx. So the machine code is the single byte 49.

memory map of code section


address contents

0000 0000 b9 14 00 00 00
0000 0005 b8 0f 00 00 00
0000 000a ba 00 00 00 00
0000 000f 0f 84 ?? ?? ?? ??
0000 0015 03 d0
0000 0017 f7 ad 08 00 00 00
0000 001d 49
0000 001e

The decrement instruction is followed by

jg loop1

jg rel32 0f 8f "cd"

Like the other control instruction:
the "cd" is a 32-bit code offset. It needs to be the
difference between what the PC will be when executing this
code and the address assigned for label loop1.

The assembler does know that this instruction will be
exactly 6 bytes long.

To calculate "cd",

at execution time (for taken control instruction):
contents of PC + offset field --> PC

PC points to instruction after the control instruction
when offset is added.

at assembly time:

byte offset = target addr - (addr of instruction after the
                             conditional control instr)
= addr loop1 - (6 + 0x0000 001e)
(taken from symbol table)

= 0x0000 0015 - 0x0000 0024

Notice that this would be a negative number. That is
OK -- generate a 32-bit 2's complement value.

  0000 0000 0000 0000 0000 0000 0001 0101
- 0000 0000 0000 0000 0000 0000 0010 0100
------------------------------------------
becomes

  0000 0000 0000 0000 0000 0000 0001 0101
+ 1111 1111 1111 1111 1111 1111 1101 1100
------------------------------------------
1111 1111 1111 1111 1111 1111 1111 0001

in hex 0x ff ff ff f1

this value is "cd",
giving the machine code 0f 8f f1 ff ff ff
(Remember that the least significant byte comes first.)
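
As a cross-check, here is a small C sketch (mine, not part of the
original example) that does the same offset arithmetic; unsigned
32-bit subtraction wraps around and yields exactly the 2's
complement bit pattern computed above:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t target = 0x00000015;      /* address of loop1 (symbol table) */
        uint32_t next   = 0x0000001e + 6;  /* address after the 6-byte jg     */

        /* 2's complement subtraction; wraps naturally on uint32_t */
        uint32_t rel32 = target - next;

        printf("rel32 = %08x\n", (unsigned) rel32);   /* prints fffffff1 */
        return 0;
    }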

Again,
memory map of code section (so far)
address contents

0000 0000 b9 14 00 00 00
0000 0005 b8 10 00 00 00
0000 000a ba 00 00 00 00
0000 000f 0f 84 ?? ?? ?? ??
0000 0015 03 d0
0000 0017 f7 ad 08 00 00 00
0000 001d 49
0000 001e 0f 8f f1 ff ff ff
0000 0024

The last thing we'll worry about in the table is
the next label: target_label

It gets placed in the symbol table at the next available
address, 0x0000 0024.

We now have a completed symbol table:


symbol address
---------------------
a1 0040 0000
a2 0040 0004
a3 0040 0008
0040 000c
0040 0010
0040 0014
0040 0018

0040 001c  (the next available address within the
           data section. NOT PART OF THE TABLE.)
main 0000 0000
loop1 0000 0015
target_label  0000 0024

After this first pass of the assembler is done, ALL the
labels have been given addresses.

During this second pass of the assembler, any remaining
code left to be completed is completed. For this example
code fragment, that is the jz instruction at address 0x0000 000f.

All that remains is the offset calculation. It works just like
the calculation for the other control instruction.

byte offset = target addr - (addr of instruction after the
                             conditional control instr)

= addr target_label - (6 + 0x0000 000f)
(taken from symbol table)

= 0x0000 0024 - 0x0000 0015

  0000 0000 0000 0000 0000 0000 0010 0100
- 0000 0000 0000 0000 0000 0000 0001 0101
------------------------------------------
0000 0000 0000 0000 0000 0000 0000 1111

Notice that this offset is a positive number. This is ok.
It corresponds to a branch/jump forward in the code.
A negative offset would correspond to a branch/jump backward
within the code.

It is the offset of 0x0000000f that gets placed into the
machine code.

The completed machine code is

memory map of text section


address contents

0000 0000 b9 14 00 00 00
0000 0005 b8 0f 00 00 00
0000 000a ba 00 00 00 00
0000 000f 0f 84 0f 00 00 00
0000 0015 03 d0
0000 0017 f7 ad 08 00 00 00
0000 001d 49
0000 001e 0f 8f f1 ff ff ff
0000 0024

Chapter 13 -- I/O
all about I/O
-------------

computers aren't useful unless we can put data into them,
and get results out.

input  -- data to computer
output -- data from computer

computer model:

----- --------
|CPU| <----> |memory|
----- ^ --------
|
|
\ /

-----
|i/o|
-----

examples of input devices:
keyboard, mouse, network, disk, ??

examples of output devices:
printer, (terminal) display, network, ??

simulator has only 2 I/O devices,
keyboard (for input)
display (for output)

ISSUES THAT MUST BE SOLVED:

programmer interface -- tools give
get_ch, put_ch, put_str

these are actually OS implemented procedures.
(The OS is the program that interfaces between the programmer
or application user and the actual hardware. For us, it is NT.)

protection issues --
in a real system, there could be more than one terminal
(terminal is a keyboard and display together)
Should one user be able to display characters on another's
display? Lock up another's keyboard? Send a file of
infinite length to the printer, effectively shutting
all others out?

In practice, the OS's job is "resource management,"
allocating all portions of the computer system. Examples
of resources are the CPU and all I/O devices.

physical issues --
A computer today (1998) can complete an instruction at
the rate of about 1 each nsec.
Unfortunately, typical I/O devices are much slower, often
requiring 10s of milliseconds to deal with a single
character. That is approx. 1 million times slower!
This situation is dubbed the "access gap."

disk - a real, live, physical device
------------------------------------

Vocabulary, to form a picture of a disk (ch 13, p260)

PLATTER -- sort of like a phonograph record or CD.

data is stored on a SURFACE of a platter.

all platters are tied together and rotate around the SPINDLE
at a fixed speed.

each surface has one or more READ/WRITE HEADS.

Platters are broken down into TRACKS. A single track is
one of many concentric circles on the platter.

All the corresponding tracks on all surfaces, taken together,
form a CYLINDER.

Each track is broken down into SECTORS.

How we read/write to a sector.
Given: the sector position on the cylinder. (looked up in a
table, or calculated from the disk address).

-- the disk is spinning.

-- the read/write head is moving to the correct cylinder (track).
   THIS TAKES A LONG TIME RELATIVE TO THE OTHER STUFF. The
   physical movement, acceleration, etc. come into play.
   This is SEEK time.

-- once the read/write head is over the correct cylinder, there
is bound to be some time to wait until the correct sector
is under the head. This is ROTATIONAL LATENCY.

-- Even at the correct sector, it still takes some time for
the data to be read/written. This is the READ or WRITE
time.

time to read a sector = seek time + rotate time + read time.
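
To make the formula concrete, a tiny C sketch follows; all of
the device numbers in it are made up for illustration (a
typical-looking disk), not taken from the text:

    #include <stdio.h>

    int main(void)
    {
        double seek_ms   = 9.0;                  /* average seek time       */
        double rpm       = 7200.0;
        double rotate_ms = 0.5 * 60000.0 / rpm;  /* half a revolution, avg  */
        double read_ms   = 0.1;                  /* one sector transfer     */

        printf("time to read a sector = %.2f ms\n",
               seek_ms + rotate_ms + read_ms);
        return 0;
    }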

So, the nitty gritty issue is: how does the OS accomplish
I/O requests? There are 2 possibilities.

1. have special I/O instructions
-- input
need to know which device, how much data, where the
data is to go
-- output
need to know which device, how much data, where the
data currently is

How does the processor know that the instruction has completed?
(Is there any need to wait?)
What happens if the device encounters an error?
(Does this halt the computer?)

2. the solution of choice

overload memory locations to use as communication channels.

for example,
address
0x0000 0000 -|
. | real memory
. |
. |
0xffff 0000 -|

0xffff 0008 - data from keyboard (Keyboard_Data)

0xffff 0010 - data to display (Display_Data)

then, by reading (loading) from location 0xffff0008, data
is requested from the keyboard
then, by writing (storing) to location 0xffff0010, data
is sent to the display

the syscall code in the OS must be (in essence)

mov eax, Keyboard_Data        ; get_ch syscall
return from syscall

and

mov Display_Data, eax         ; put_ch syscall
return from syscall
This method of I/O is called MEMORY-MAPPED I/O.

Problems with memory-mapped I/O as currently given:
-- get_ch presumably returns once a character has been typed.
What happens if the user does not type a character?
Types it on the wrong keyboard? Goes to get a drink
of water?

What happens to the data if the user types 2 characters
before get_ch has been called?

How does the computer know if a character has been typed?

-- put_ch and put_str: how does the computer know that the device
is ready to print out a second character? What if the
printer jams? (printers and terminals are SLOW!)

What is needed is a way to convey information about the
STATUS of I/O devices. This status information is used
to coordinate and SYNCHRONIZE the usage of devices.

address
0x0000 0000 -|
. | real memory
. |
. |
0xffff 0000 -|

0xffff 0008 - data from keyboard (Keyboard_Data)
0xffff 000c - STATUS from keyboard (Keyboard_Status)
0xffff 0010 - data to display (Display_Data)
0xffff 0014 - STATUS from display (Display_Status)

assume that the MSB is used to tell the status of a device.
MSB = 1 means device ready
MSB = 0 means device is busy
note that we can check for device ready/busy by looking to see
if the Status word is negative (2's comp) or not.

for the keyboard, a 1 means that a character has been typed
a 0 means that no character is available

for the display, a 1 means that a new character may be sent
a 0 means that the device is still disposing
of a previous character

Then, the code in the OS must be more like

keyboard_wait:                        ; for get_ch
test Keyboard_Status, 80000000h
jz keyboard_wait
mov eax, Keyboard_Data

and

display_wait:                         ; for put_ch
test Display_Status, 80000000h
jz display_wait
mov Display_Data, eax

This scheme is known as BUSY WAITING, or SPIN WAITING.
The little loop is called a SPIN WAIT LOOP.

Something that is not well explained (at this level) is how
these status bits get set and cleared. The spin wait loop
reads the status word, but does not change it.

The device (its CONTROLLER) sets and clears the bit.
An implied function is that the device sets the bit
when it becomes ready to work on another character.

AND, a mov from Keyboard_Data also clears the MSB of Keyboard_Status
AND, a mov to Display_Data also clears the MSB of Display_Status
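
The same OS code can be sketched in C. This is a minimal sketch,
assuming the status/data addresses given above; the pointers must
be volatile so every access really goes out to the device. (It
compiles anywhere, but only behaves meaningfully on hardware or a
simulator that has these devices at these addresses.)

    #include <stdint.h>

    /* the device registers from the memory map, viewed as C pointers */
    #define KEYBOARD_DATA   ((volatile uint32_t *) 0xffff0008)
    #define KEYBOARD_STATUS ((volatile uint32_t *) 0xffff000c)
    #define DISPLAY_DATA    ((volatile uint32_t *) 0xffff0010)
    #define DISPLAY_STATUS  ((volatile uint32_t *) 0xffff0014)

    #define READY 0x80000000u        /* MSB set means device ready */

    uint32_t get_ch(void)
    {
        while ((*KEYBOARD_STATUS & READY) == 0)
            ;                        /* spin wait */
        return *KEYBOARD_DATA;       /* read also clears the status MSB */
    }

    void put_ch(uint32_t c)
    {
        while ((*DISPLAY_STATUS & READY) == 0)
            ;                        /* spin wait */
        *DISPLAY_DATA = c;           /* write also clears the status MSB */
    }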

PROBLEMS with this programmed I/O approach:

-- much time is wasted spin waiting.

if it takes 100 instructions to program this, and each
instruction takes 20ns to execute, then it takes

100 * 20nsec = 2000nsec = 2 usec to execute this code

if a device takes 2msec (=2000usec) to deal with one character,
then the percent of time spent waiting is

time waiting        2000us
------------  =  ----------------  =  .999 = 99.9%
 total time      2000us + 2usec

We'd like a solution that spent less time "doing nothing"

-- if (somehow) a second key is pressed before the program does
a get_ch, the first key pressed is lost. There is only one
character's worth of storage.

This problem is actually a "Catch-22." The keyboard_wait code has
to be run often enough that no characters are lost, but
executing this code spin waits until a character is pressed.
The system could do nothing but wait around for characters!

Some problems are solved by the use of queues (buffers).
The check for device ready is separated from the sending
and receiving of characters. Code for this is in the
text, pages 265 and 266.

putnextchar: print a character if there is one in the
queue, and the device is ready
(done by OS periodically)
printstring: put character(s) in queue and return
(called by user program)

getnextchar: get a character and place in a queue, if one
is waiting at the keyboard
(done by OS periodically)
getstring: get character from queue (if available) and return;
if queue is empty, spin wait until there is
a character.
(called by user program)
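
A minimal C sketch of the queue idea follows (my names and queue
size are illustrative; the text's code on pages 265 and 266 is the
real reference). It reuses the keyboard register conventions from
the sketch above:

    #include <stdint.h>

    #define KEYBOARD_DATA   ((volatile uint32_t *) 0xffff0008)
    #define KEYBOARD_STATUS ((volatile uint32_t *) 0xffff000c)
    #define READY 0x80000000u

    #define QSIZE 64                   /* arbitrary capacity           */
    static uint32_t queue[QSIZE];
    static int head = 0, tail = 0;     /* remove at head, add at tail  */

    /* run by the OS periodically: drain the keyboard into the queue */
    void getnextchar(void)
    {
        if ((*KEYBOARD_STATUS & READY) && (tail + 1) % QSIZE != head) {
            queue[tail] = *KEYBOARD_DATA;    /* read also clears READY */
            tail = (tail + 1) % QSIZE;
        }
        /* if the queue is full the character is simply lost --
           one of the difficulties listed below */
    }

    /* called by the user program: spin only while the queue is empty */
    uint32_t get_queued_char(void)
    {
        while (head == tail)
            ;                                /* spin wait */
        uint32_t c = queue[head];
        head = (head + 1) % QSIZE;
        return c;
    }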

Some difficulties are caused by this situation:

-- OS must call getnextchar regularly and often so as not
to lose characters.

-- What happens if the queue(s) become full? Are characters
lost?

-- OS must call putnextchar regularly to empty out the queue.

Chapter 14 -- Exception Handling

EXCEPTION HANDLERS
------------------

The trouble with programmed I/O is that it both wastes CPU
resources and it has potential for "incorrect" operation.

What we really want:
(Since most I/O devices are slow), have I/O devices signal
the CPU when they have a change in status.

The I/O devices tell the CPU that they are "ready."

In order to do this we need:
Hardware (wires) from devices to the CPU.
A way for special software to be invoked when a device
signals on the wire.

The modern solution bundles the software to deal with
these signals (interrupts) and other situations into
an EXCEPTION HANDLER. (Effectively part of the OS.)

EXCEPTIONS
----------
1. interrupts
--initiated outside the instruction stream
--arrive asynchronously (at no specific time)

examples:
I/O device status change
I/O device error condition
thermal override shutdown
internal error detection

when should the interrupt be dealt with?
as soon as possible

2. traps
--occur due to something in instruction stream
--arrive synchronously (while instruction is executing)
good test: if program was re-run, the trap would
occur in precisely the same place in the code.

examples:
bad opcode
arithmetic overflow
I/O functionality, like put_ch
attempt to access privileged or unavailable memory

when should the trap be dealt with?
right now! The user program cannot continue until
whatever caused the trap is dealt with.

exception handling
------------------

the mechanism for dealing with exceptions is simple; its
implementation can get complex. The implementation varies
among computers (manufacturers).

situation: a user program is running (executing), and
a device generates an interrupt request.
mechanism to respond:
the hardware temporarily "suspends" the user
program, and instead runs code called
an EXCEPTION HANDLER. After the handler
is finished doing whatever it needs to,
the hardware returns control to the user program.
limitations of exception handler:
since it is being invoked (potentially) in the middle
of a user program, the handler must take extra care
not to change the state of the user program.
-- it can't change register values
-- it can't change the stack
So, how can it do anything at all?
The key to this answer is that any portion of the
state that it wants to change, it must save the
state and also restore it before returning to the
user program.

The handler often uses a stack to temporarily
store register values.

WHEN to handle an interrupt -- 2 possibilities:
1. right now! Note that this could be in the middle of
an instruction. In order to do this, the hardware
must be able to know where the instruction is in
its execution and be able to "take up where it left off"

This is very difficult to do.
But, it has been done in simpler forms on a few machines.
Example: IBM 360 arbitrary memory to memory copy

2. wait until the currently executing instruction finishes,
then handle. THIS IS THE METHOD OF CHOICE.

The instruction fetch/execute cycle must be expanded to

1. handle pending interrupts
2. instruction fetch
3. PC update
4. decode
5. get operand(s)
6. operation
7. store result

some terms
----------

interrupt request -- the activation of hardware somewhere that
signals the initial request for an interrupt.
pending interrupt -- an interrupt that hasn't been handled yet,
but needs to be
kernel-- the exception handler
In most minds, when people think of a kernel, they think
of critical portions of an operating system. The exception
handler IS a critical portion of an operating system!
handler -- the code of the exception handler.

Pentium exception handling mechanism
------------------------------------
hardware does the following:
1. signals an interrupt
on the external pins of the chip, comes a signal that
means an interrupt request has come in. Or, placed on
the pins by the circuitry from within the chip,
comes a signal that means an trap has come.

With that signal comes a vector.
A vector is an encoded value that categorizes the type
of exception.

examples: vector
0 divide by zero (a trap, really)
2 non-maskable interrupt
4 overflow (a trap, really)
6 invalid opcode (how could we get one of these?)
7 device not available
there can be up to 256 unique vectors
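
In effect, the vector selects a handler out of a table. A C sketch
of that dispatch idea follows; note this is only an analogy -- the
real Pentium indexes a table of gate descriptors (described next),
not C function pointers, and it saves state before entering the
handler:

    #include <stdio.h>
    #include <stdint.h>

    typedef void (*handler_t)(void);
    static handler_t vector_table[256];   /* up to 256 unique vectors */

    static void divide_by_zero_handler(void)
    {
        printf("divide by zero trap\n");
    }

    /* roughly what the hardware's dispatch step accomplishes */
    static void dispatch(uint8_t v)
    {
        if (vector_table[v] != 0)
            vector_table[v]();
    }

    int main(void)
    {
        vector_table[0] = divide_by_zero_handler;  /* OS fills table at boot */
        dispatch(0);
        return 0;
    }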

2. Uses the vector to get at the code for the exception handler.

The vector is used as an array index. The array element
desired is a Descriptor for the exception handler code.

a Descriptor: (Intel calls this a "gate".)

segment selector offset bits 15..0

offset bits 31..16    P DPL 0111 I/T 000 reserved

P -- whether the exception handler is in memory or not

I/T -- distinguishes between trap and interrupt
(1 for trap, 0 for interrupt)

DPL -- a privilege level (see PRIVILEGE, below)

segment selector -- an index into yet another array. This
array element will contain a base address within memory
to the segment where the exception handler code is.

offset bits 31..0 -- offset from base address to location
of the exception handler code.

3. Save current program's state.

This stuff logically belongs on a stack. Intel (like many
other machines) has several stacks. There is one set
aside just for use when an exception occurs.

Yet another table keeps the values of the stack pointers
within the stacks.

The stack for exceptions will have (after saving state)

| |
|--------------|
| PC (EIP) |
|--------------|
| | CS |
|--------------|
| EFLAGS |
|--------------|
| ESP |
|--------------|
| | SS |
|--------------|
| |
|--------------|

Note that we now have saved away the return address for
when the exception handling is finished.

Then, the code within the exception handler is run.
It does whatever it needed to do.

When the exception has been handled, it is time to restore
the state of the previously running code, and then go back
to executing that code.

What has been done by the hardware (saving state, setting
new values for registers), must be undone by the hardware!
On this architecture, this is accomplished by a single
instruction called iret.
iret pops stuff off the stack (the one for exception handlers),
and puts the stuff back in the right place.
The PC is restored.
The EFLAGS register is restored.
The previous value of ESP is restored.

some advanced topics
--------------------

PRIVILEGE

The operating system (OS) needs to be able to control access
to ALL computer system resources.

Some resources:
the processor (for executing code!)
main memory
all I/O devices
programs (both applications and the OS code)

As a simplification, there are 2 important ways that
most computer systems (architectures, really) use to achieve
access control of resources:
1. memory access restriction.
Each program is allocated a portion of memory. This portion
will contain program code and data, and space for whatever
is needed by the program.

At EACH memory access, the address is checked (by hardware)
to see if the address falls within the boundaries set for the
program. If the address is OK, the memory access continues.
If the address is not within the program's allowable memory,
then the hardware generates an exception (an addressing error
exception).

2. privileged instructions.
Specific instructions within the instruction set can only be
executed by the OS code. To make this work, there must be
one or more bits in a register somewhere that identify the
privilege level of the currently running program. This bit
is checked each time an instruction is decoded. Each instruction
has its own required privilege level.

If the current privilege level isn't enough to execute a
decoded instruction, then the hardware must generate an
exception (trap).

Intel's implementation of this:

There are 4 levels of privilege
0 The most privileged, allowed to do everything/anything.
The OS will be given this level of privilege.
1
2
3 The lowest privilege, restricted.
Most applications (user programs) will have this level.

Associated with each program is a descriptor. In that
descriptor is the Current Privilege Level (CPL).

Associated with each segment of memory is a descriptor. In that
descriptor is the Descriptor Privilege Level (DPL). The DPL
is the privilege level needed to access memory within the boundaries
of the segment.

Associated with every instruction in the instruction set is
a required privilege level. (Intel determines this.)
An example of an instruction that requires privilege level 0
for execution is the instruction that loads the IDTR.
At each instruction decoding:
The CPL is compared with the instruction's required privilege
level. If CPL is not privileged enough (when CPL > required
privilege level), an exception is generated.
At each memory access:
The CPL is compared with the DPL. If CPL is not
privileged enough (when CPL > DPL), an exception is generated.
Note that these are memory accesses for EITHER instruction
fetches OR data.
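
The two checks can be sketched in C; this is just the comparison
logic, assuming the convention above that a smaller number means
more privilege:

    #include <stdio.h>

    /* instruction decode check: exception when CPL > required level */
    static int can_execute(int cpl, int required) { return cpl <= required; }

    /* memory access check: exception when CPL > DPL */
    static int can_access(int cpl, int dpl)        { return cpl <= dpl; }

    int main(void)
    {
        /* user program (CPL 3) tries a level-0 instruction like lidt */
        printf("user runs lidt:  %s\n", can_execute(3, 0) ? "ok" : "exception");
        /* OS (CPL 0) touches a user segment (DPL 3) */
        printf("OS touches user: %s\n", can_access(0, 3) ? "ok" : "exception");
        return 0;
    }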

PRIORITIES

problem: Multiple interrupt requests can arrive simultaneously.
Which one should get handled first?

general solutions:
FCFS -- the first one to arrive gets handled first.

difficulty 1) This might allow a malicious/recalcitrant
device or program to gain control of the processor.

difficulty 2) There must be hardware that maintains
an ordering of pending exceptions.

prioritize all exceptions -- the one with the highest priority
gets handled first. This is a common method for solving
the problem.

Priorities for various exceptions are assigned either by
the manufacturer, or by a system manager through software.
The priorities are normally set when a machine is
booted (the OS is started up).

difficulty 1) Exceptions with the same priority must
still be handled in some order. Example of same priority
exceptions might be all keyboard interrupts. Consider
a machine with many terminals hooked up.

The instruction fetch/execute cycle becomes:
1. any interrupts with a higher priority than whatever
is currently running pending?
2. fetch
3. PC update
4. decode
5. operands
6. operation
7. result

NOTE: This implies that there is some hardware
notion of the priority for whatever is running
(user program, keyboard interrupts, clock interrupt, etc.)

Intel's solution:
Hardware is placed in a chip separate from the processor.
This separate chip is called the Programmable Interrupt
Controller (PIC), and it makes the decisions about what interrupts
are given to the processor, and which come first.

In general, what should get given the highest priority?
clock? power failure? thermal shutdown? arithmetic overflow?
keyboard? I/O device ready?

priorities are a matter of which is most urgent,
and therefore cannot wait, and how long it takes
to process the interrupt.
-- clock is urgent, and takes little processing,
maybe only a variable increment.
-- power failure is very urgent, but takes a lot
of processing, because the machine will be stopped.
-- overflow is urgent to the program which caused it,
because it cannot continue.
-- keyboard is urgent because we don't want to lose
a second key press before the first is handled.

REENTRANT EXCEPTION HANDLERS

Can an exception handler itself be interrupted?
If the answer is NO, then we have what is called a nonreentrant
exception handler.
Why would we want to do this?

There are many details to get right to make this possible.
The instruction fetch/execute cycle remains the same. At
the beginning of EVERY instruction (even those within
the exception handler), a check is made if there are
pending interrupts.

The instruction fetch/execute cycle must be expanded to

1. Handle a pending interrupt that is of a higher priority
than the currently executing code.
2. instruction fetch
3. PC update
4. decode
5. get operand(s)
6. operation
7. store result

The exception handler must be modified so that it can
be interrupted. Its own state must be saved (safely).
Nothing can interrupt while this state is being saved.

The Intel implementation of this allocates a bit within the
EFLAGS register, called the Interrupt Enable Flag (bit #9).
When this bit is 0, interrupts (maskable ones) are disabled.
When this bit is 1, interrupts are enabled.
On the way to executing the code of an exception handler,
IF is cleared.
The iret instruction restores interrupts to their enabled
state.

With the IF bit, we automatically have nonreentrant exception
handlers.

To allow reentrant handlers (ones that can be interrupted),
the exception handler must reenable interrupts.

Chapter 15 -- performance features

Architectural Features used to enhance performance
--------------------------------------------------

What is a "better" computer? What is the "best" computer?
The factors involved are generally cost and performance.

COST FACTORS: cost of hardware design
cost of software design (OS, applications)
cost of manufacture
cost to end purchaser
PERFORMANCE FACTORS:
what programs will be run?
how frequently will they be run?
how big are the programs?
how many users?
how sophisticated are the users?
what I/O devices are necessary?

(this chapter discusses ways of increasing performance)

There are two ways to make computers go faster.

1. Wait a year. Implement in a faster/better/newer technology.
More transistors will fit on a single chip.
More pins can be placed around the IC.
The process used will have electronic devices (transistors)
that switch faster.

2. new/innovative architectures and architectural features, and
clever implementations of existing architectures.

MEMORY HIERARCHIES
------------------
Known in current technologies: the time to access data
from memory is an order of magnitude greater than a
processor operation. (And note that this has been true
for well more than a decade.)

For example: if a 32-bit 2's complement addition takes 1 time unit,
then a load of a 32-bit word takes about 10 time units.

Since every instruction takes at least one memory access (for
the instruction fetch), the performance of a computer is dominated
by its memory access time.

(to try to help this difficulty, we have load/store architectures,
where most instructions take operands only from registers. We also
try to have fixed size, SMALL size, instructions.)

what we really want:
very fast memory -- of the same speed as the CPU
very large capacity -- 512 Mbytes
low cost -- $50

these are mutually incompatible. The faster the memory,
the more expensive it becomes. The larger the amount of
memory, the slower it becomes.

What we can do is to compromise. Take advantage of the fact
(fact, by looking at many real programs) that memory accesses
are not random. They tend to exhibit LOCALITY.
LOCALITY -- nearby.
2 kinds:

Locality in time (temporal locality)
if data has been referenced recently, it is likely to
be referenced again (soon!).

example: the instructions within a loop. The loop is
likely to be executed more than once. Therefore, each
instruction gets referenced repeatedly in a short period
of time.

example: The top of stack is repeatedly referenced within
a program.

Locality in space (spatial locality)
if data has been referenced recently, then data nearby
(in memory) is likely to be referenced soon.

example: array access. The elements of an array are
neighbors in memory, and are likely to be referenced
one after the other.

example: instruction streams. Instructions are located
in memory next to each other. Our model for program
execution says that unless the PC is explicitly changed
(like a control instruction) sequential instructions
are fetched and executed.
We can use these tendencies to advantage by keeping likely
to be referenced (soon) data in a faster memory than main
memory. This faster memory is called a CACHE.

processor-cache <----------------> memory

It is located very close to the processor. It contains COPIES of
PARTS of memory.

A standard way of accessing memory, for a system with a cache:
(The programmer doesn't see or know about any of this)

memory access (for an instruction fetch or to get operands
or to write results) goes to the cache.
If the data is in the cache, then we have a HIT.
The data is returned to the processor (from the cache),
and the memory access is completed.
If the data is not in the cache, then we have a MISS.
The memory access is then sent on to main memory.

On average, the time to do a memory access is

= cache access time + (% misses * memory access time)

This average (mean) access time will change for each program.
It depends on the program, and its reference pattern, and how
that pattern interracts with the cache parameters.

cache is managed by hardware

Keep recently-accessed blocks of memory in the cache,
this exploits temporal locality

Break memory into aligned blocks (lines), this
exploits spatial locality

transfer data to/from the cache in blocks

put block into a predefined location, its frame

Each block has
state (valid) -- is there anything in this frame?
address tag -- a way to distinguish which block from
memory is currently in this frame.
data -- the block of data, a copy of what is in memory.

>>>> simple CACHE DIAGRAM here <<<<


A Simplified Example:

Addresses are 5 bits.
Blocks are 4 bytes.
Memory is byte addressable.
There are 4 blocks in the cache.

Assume the cache is empty at the start of the example.

(line number)  valid  tag  data (in hex)
00 0 ? 0x?? ?? ?? ??
01 0 ? 0x?? ?? ?? ??
10 0 ? 0x?? ?? ?? ??
11 0 ? 0x?? ?? ?? ??

Memory is small enough that we can make up a complete
example. Assume little endian byte numbering.

address contents
(binary) (hex)
00000 aa bb cc dd
00100 00 11 22 33
01000 ff ee 01 23
01100 45 67 89 0a
10000 bc de f0 1a
10100 2a 3a 4a 5a
11000 6a 7a 8a 9a
11100 1b 2b 3b 4b

(1)
First memory reference is to the byte at address 01101.

The address is broken into 3 fields:
tag line number byte within block
0 11 01

On line 11, the block is marked as invalid, therefore
we have a cache MISS.

The block that address 01101 belongs to (4 bytes starting
at address 01100) is brought into the cache, and the
valid bit is set.

(line number)  valid  tag  data (in hex)
00 0 ? 0x?? ?? ?? ??
01 0 ? 0x?? ?? ?? ??
10 0 ? 0x?? ?? ?? ??
11 1 0 0x45 67 89 0a

And, now the data requested can be supplied to the processor.
It is the value 0x89.

(2)
Second memory reference is to the byte at address 01010.

The address is broken into 3 fields:
tag line number byte within block
0 10 10

On line 10, the block is marked as invalid, therefore
we have a cache MISS.

The block that address 01010 belongs to (4 bytes starting
at address 01000) is brought into the cache, and the
valid bit is set.

(line number)  valid  tag  data (in hex)
00 0 ? 0x?? ?? ?? ??
01 0 ? 0x?? ?? ?? ??
10 1 0 0xff ee 01 23
11 1 0 0x45 67 89 0a

And, now the data requested can be supplied to the processor.
It is the value 0xee.

(3)
Third memory reference is to the byte at address 01111.

The address is broken into 3 fields:
tag line number byte within block
0 11 11

This line within the cache has its valid bit set, so there
is a block (from memory) in the cache. BUT, is it the
block that we want? The tag of the desired byte is checked
against the tag of the block currently in the cache. They
match, and therefore we have a HIT.

The value 0x45 (byte 11 within the block) is supplied to
the processor.

(4)
Fourth memory reference is to the byte at address 11010.

The address is broken into 3 fields:
tag line number byte within block
1 10 10

This line within the cache has its valid bit set, so there
is a block (from memory) in the cache. BUT, is it the
block that we want? The tag of the desired byte is checked
against the tag of the block currently in the cache. They
do NOT match. Therefore, the block currently in the cache
is the wrong one. It will be overwritten with the block
(from memory) that we now do want.

(line number)  valid  tag  data (in hex)
00 0 ? 0x?? ?? ?? ??
01 0 ? 0x?? ?? ?? ??
10 1 1 0x6a 7a 8a 9a
11 1 0 0x45 67 89 0a

The value 0x7a (byte 10 within the block) is supplied to
the processor.

(5)
Fifth memory reference is to the byte at address 11011.

The address is broken into 3 fields:
tag line number byte within block
1 10 11

This line within the cache has its valid bit set, so there
is a block (from memory) in the cache. BUT, is it the
block that we want? The tag of the desired byte is checked
against the tag of the block currently in the cache. They
match, and therefore we have a HIT.

The value 0x6a (byte 11 within the block) is supplied to
the processor.
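
Here is a C sketch of this direct-mapped lookup, using the field
widths of the example (1 tag bit, 2 line-number bits, 2
byte-within-block bits); miss handling (fetching the block from
memory) is omitted:

    #include <stdio.h>
    #include <stdint.h>

    #define BYTE(a)  ((a) & 0x3)          /* bits 1..0 */
    #define LINE(a)  (((a) >> 2) & 0x3)   /* bits 3..2 */
    #define TAG(a)   (((a) >> 4) & 0x1)   /* bit  4    */

    struct frame { int valid; uint32_t tag; uint8_t data[4]; };
    static struct frame cache[4];

    /* returns 1 on HIT, 0 on MISS */
    int lookup(uint32_t addr, uint8_t *byte_out)
    {
        struct frame *f = &cache[LINE(addr)];
        if (f->valid && f->tag == TAG(addr)) {
            *byte_out = f->data[BYTE(addr)];
            return 1;                     /* HIT  */
        }
        return 0;                         /* MISS: fetch block from memory */
    }

    int main(void)
    {
        uint8_t b;
        /* the first reference of the example, address 01101 */
        printf("01101 -> tag %d, line %d, byte %d\n",
               TAG(0x0d), LINE(0x0d), BYTE(0x0d));   /* 0, 3, 1 */
        (void) lookup(0x0d, &b);          /* misses on a cold cache */
        return 0;
    }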

Often
cache: instruction cache 1 cycle
data cache 1 cycle
main memory 20 cycles

Performance for data references w/ miss ratio 0.02 (2% misses)

mean access time = cache-access + miss-ratio * memory-access
= 1 + 0.02 * 20
= 1.4
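
In C, the calculation is one line; the numbers below are the ones
from this example:

    #include <stdio.h>

    double mean_access(double cache_time, double miss_ratio, double mem_time)
    {
        return cache_time + miss_ratio * mem_time;
    }

    int main(void)
    {
        /* 1-cycle cache, 2% misses, 20-cycle memory */
        printf("%.1f cycles\n", mean_access(1.0, 0.02, 20.0));  /* 1.4 */
        return 0;
    }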

Typical cache size is 64K byte given a 64Mbyte memory
20 times faster
1/1000 the capacity
often contains 98% of the references

Remember:

recently accessed blocks are in the cache (temporal locality)

the cache is smaller than main memory, so not all blocks are in the cache.
blocks are larger than 1 word (spatial locality)

This idea of exploiting locality is (can be) done at many
levels. Implement a hierarchical memory system:

smallest, fastest, most expensive memory (registers)
relatively small, fast, expensive memory (CACHE)
large, fast as possible, cheaper memory (main memory)
largest, slowest, cheapest (per bit) memory (disk)

registers are managed/assigned by compiler or asm. lang programmer

cache is managed/assigned by hardware or partially by OS

main memory is managed/assigned by OS

disk managed by OS

Programmer's model: one instruction is fetched and executed at
a time.

Computer architect's model: The effects of a program's execution are
given by the programmer's model. But, implementation may be
different.

To make execution of programs faster, we attempt to exploit
PARALLELISM: doing more than one thing at one time.

program level parallelism: Have one program run parts of itself
on more than one computer. The different parts occasionally
synch up (if needed), but they run at the same time.
instruction level parallelism (ILP): Have more than one instruction
within a single program executing at the same time.

PIPELINING (ILP)
-----------------
concept
-------
A task is broken down into steps.
Assume that there are N steps, each takes the same amount of time.

(Mark Hill's) EXAMPLE: car wash

steps: P -- prep
W -- wash
R -- rinse
D -- dry
X -- wax
assume each step takes 1 time unit

time to wash 1 car (red) = 5 time units
time to wash 3 cars (red, green, blue) = 15 time units

which car          time units
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
red P W R D X
green P W R D X
blue P W R D X

a PIPELINE overlaps the steps

which car          time units
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
red P W R D X
green P W R D X
blue P W R D X
yellow P W R D X
etc.

IT STILL TAKES 5 TIME UNITS TO WASH 1 CAR,
BUT THE RATE OF CAR WASHES GOES UP!

Pipelining can be done in computer hardware.

2-stage pipeline
----------------
steps:
F -- instruction fetch
E -- instruction execute (everything else)

which instruction          time units
1 2 3 4 5 6 7 8 . . .
1 F E
2 F E
3 F E
4 F E

time for 1 instruction = 2 time units
(INSTRUCTION LATENCY)

rate of instruction execution = pipeline depth * (1 / time for 1 instruction)
(INSTRUCTION THROUGHPUT)      = 2 * (1 / 2)
                              = 1 per time unit

5-stage pipeline
----------------
a popular pipelined implementation, that works really well for
teaching about pipelines and also for load/store architectures

Its application to the Pentium would be problematic.

steps:
IF -- instruction fetch
D -- instruction decode
OA -- operand access
OP -- ALU operation (can be effective address calculation)
R -- store results

which              time units
instruction 1 2 3 4 5 6 7 8 . . .
1 IF D OA OP R
2 IF D OA OP R
3 IF D OA OP R

INSTRUCTION LATENCY = 5 time units
INSTRUCTION THROUGHPUT = 5 * (1 / 5) = 1 instruction per time unit
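
The same arithmetic in a small C sketch, for the 5-stage pipe:

    #include <stdio.h>

    int main(void)
    {
        int    depth      = 5;      /* stages in the pipe   */
        double stage_time = 1.0;    /* time units per stage */

        double latency    = depth * stage_time;       /* 5 time units */
        double throughput = depth * (1.0 / latency);  /* 1 per unit   */

        printf("latency %.0f, throughput %.0f instruction(s) per time unit\n",
               latency, throughput);
        return 0;
    }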

unfortunately, pipelining introduces other difficulties. . .

data dependencies
-----------------

suppose we have the following code:
mov EAX, data1
add EBX, EAX

the data moved (loaded into a register) doesn't get written
to EAX until R,
but the add instruction wants to get the data out of EAX
in its D stage. . .

which              time units
instruction 1 2 3 4 5 6 7 8 . . .
mov IF D OA OP R
^
add IF D OA OP R
^

the simplest solution is to STALL the pipeline.
(Also called HOLES, HICCOUGHS or BUBBLES in the pipe.)

which              time units
instruction 1 2 3 4 5 6 7 8 . . .
mov IF D OA OP R
^
add IF D D D OA OP R
^ ^ ^ (pipeline stalling)
A DATA DEPENDENCY (also called a HAZARD) causes performance to
decrease.

more on data dependencies
-------------------------

Read After Write (RAW) --
(example given), a read of data is needed before it has been written

Given for completeness, not a difficulty to current pipelines in
practice, since the only writing occurs as the last stage.

Write After Read (WAR) --
Write After Write (WAW) --

NOTE: there is no difficulty implementing a 2-stage pipeline
due to DATA dependencies!

control dependencies
--------------------

what happens to a pipeline in the case of control instructions?

PENTIUM CODE SEQUENCE:

jmp label1
inc eax
label1: mul ecx

which              time units
instruction 1 2 3 4 5 6 7 8 . . .
jmp IF D OA OP R
^ (PC changed here)
inc IF D OA OP R
^^ (WRONG instruction fetched here!)

whenever the PC changes (except for the PC update step)
we have a CONTROL DEPENDENCY.

CONTROL DEPENDENCIES break pipelines. They cause
performance to plummet.

So, lots of (partial) solutions have been implemented
to try to help the situation.
Worst case, the pipeline must be stalled such that
instructions are going through sequentially.

Note that just stalling doesn't really help, since
the (potentially) wrong instruction is fetched
before it is determined that the previous instruction
is a branch.

BRANCHES and PIPELINING
-----------------------
(or, how to minimize the effect of control dependencies on pipelines.)

easiest solution (poor performance)
Cancel anything (later) in the pipe when a jump is decoded.
This works as long as nothing changes the program's state
before the cancellation. Then let the branch instruction
finish (flush the pipe), and start up again.

which              time units
instruction 1 2 3 4 5 6 7 8 . . .
jmp IF D OA OP R
^ (PC changed here)
inc IF
^^ (cancelled)

branch prediction (static or dynamic)
add lots of extra hardware to try to help.

a) (static) assume that the branch/jump will not be taken
When the decision is made, the hw "knows" if the correct
instruction has been partially executed.

If the correct instruction is currently in the pipe,
let it (and all those after it) continue. Then,
there will be NO holes in the pipe.
If the incorrect instruction is currently in the pipe,
(meaning that the branch/jump was taken), then all instructions
currently in the pipe subsequent to the branch must
be BACKED OUT.

b) (dynamic) A variation of (a).
Have some extra hw that keeps track of which branches have
been taken in the recent past. Design the hw to presume that
a branch will be taken the same way it was previously.
If the guess is wrong, back out as in (a).

Question for the advanced student: Which is better, (a) or (b)? Why?

NOTE: solution (a) works quite well with currently popular
pipeline solutions, because no state information is changed
until the very last stage of an instruction. As long as
the last stage hasn't started, backing out is a matter
of stopping the last stage from occurring and getting the
PC right.

separate test from branch
make the conditional test and address calculation
separate instructions from the one that changes the PC.

This reduces the number of holes in the pipe.

squashing
A fancy name for branch prediction that always presumes the
branch will be taken, and keeps a copy of the PC that will
be needed in the case of backing out.

Amdahl's Law
------------

(Or why the common case matters most)

speedup = new rate / old rate

= old execution time / new execution time

We program in some enhancement to part of our program.
The fraction of time spent in that part of the code is f.
The speedup of that part of the code (f) is S.

( Let an enhancement speedup f fraction of the time by speedup S)

speedup =         [(1-f) + f] * old time
          -------------------------------------
          (1-f) * old time  +  (f/S) * old time

                      1
        =        -----------
                 1 - f + f/S

Examples

f S speedup
--- --- -------
95% 1.10 1.094
5% 10 1.047
5% inf 1.052

 lim           1
S->inf    -----------  =  1 / (1-f)
          1 - f + f/S

f speedup
--- -------
1% 1.01
2% 1.02
5% 1.05
10% 1.11
20% 1.25
50% 2.00
This says that we should concentrate on the common case!
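
Amdahl's law is a one-line function in C. The sketch below
reproduces rows of the tables above:

    #include <stdio.h>

    /* f: fraction of time enhanced, S: speedup of that fraction */
    double speedup(double f, double S)
    {
        return 1.0 / ((1.0 - f) + f / S);
    }

    int main(void)
    {
        printf("%.3f\n", speedup(0.95, 1.10));  /* f=95%, S=1.10          */
        printf("%.3f\n", speedup(0.05, 10.0));  /* f=5%,  S=10            */
        printf("%.3f\n", speedup(0.50, 1e9));   /* f=50%, S->inf: 1/(1-f) */
        return 0;
    }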

Chapter 16 -- Architectures
PERSPECTIVE ON ARCHITECTURE DESIGN
----------------------------------

factors in computer design:
speed
as fast as possible, of course
dependent on technology and cost
cost/price
profit, non-profit, mass market, single use
usability
shared/single user, size of machine, OS/software issues,
power requirements
depends on intended use!
intended market
mass market, scientific research, home use, multiple users,
instructional, application specific
technology

price/performance curve

^ |
perf. | | x
| x Want to be to the "left" of these,
| x to have higher performance for the
| price. "More bang for the buck."
| x x
_________________

price ->

technology -- a perspective
----------

electromechanical (1930) -- used mechanical relays

vacuum tubes (1945)
space requirement: room
Manchester Mark 1 (late 1940s)
transistors
discrete (late 1950s)
space requirement: a large cabinet to a room
Examples:
CDC 6600
B 5000
Atlas (?)
PDP 11/10

SSI, MSI (mid-late 1960s)
1-10 10-100 transistors
space requirement: a cabinet
Examples:
Cray 1
VAX 11/780

LSI (early 1970s)
100-10,000 transistors
space requirement: a board
Examples:

VLSI (late 1970s - today)
>10,000 transistors
space requirement: a chip, or chip set, board
Examples:
MIPS R2000
Intel 386 (~275,000 transistors)
Pentium (~4 million transistors?)
Sparc
PowerPC (millions of transistors)

RISC vs. CISC
-------------

RISC - Reduced Instruction Set Computer
The term was first used to name a research architecture at
Berkeley: the RISC microprocessor. It has come to (loosely) mean a
single chip processor that has the following qualities:
1. load/store architecture
2. very few addressing modes
3. simple instructions
4. pipelined implementation
5. small instruction set -- easily decoded instructions
6. fixed-size instructions

CISC - Complex Instruction Set Computer
This term was coined to distinguish computers that were not RISC.
It generally is applied to computers that have the following
qualities:
1. complex instructions
2. large instruction set
3. many addressing modes
difficulties with these terms

- not precisely defined

- term introduced/applied to earlier machines

- "RISC" has become a marketing tool

single chip constraint
----------------------

As technologies advanced, it became possible to put a processor
on a single VLSI chip. Designs became driven by how much
(how many transistors) could go on the 1 chip.

Why? 1. The time it takes for an electrical signal to cross a
chip is significantly less than the time for the signal to
get driven off the chip to somewhere else.
2. The number of pins available was limited.

So, the desire is to have as little interaction of the chip
with the outside world as possible. It cannot be eliminated,
but it can be minimized.

The earliest of single processors on a chip had to carefully
pick and choose what went on the chip. Cutting-edge designs
today can fit everything but main memory on the chip.

how the world has changed
-------------------------
earliest computers had their greatest difficulties in getting
the hardware to work --
technology difficulties
space requirements
cooling requirements

given a working computer, scientists would jump through whatever
hoops necessary to use it.

as hardware has gotten (much) faster and cheaper, attention has been
diverted to software.
OS
compilers
optimizers
IPC (inter-process communication)
1 instruction at a time isn't enough. The technology
isn't "keeping up." So, do more than one instruction
at a time:

parallelism
-----------
instruction level (ILP) -- pipelining
superscalar -- more than one instruction at a time

multis
VLIW
supercomputer

WHICH OF THESE IS "BEST" and "FASTEST" DEPENDS ON WHAT PROGRAM
IS BEING RUN -- THE INTENDED USAGE.

on the 68000 Family
-------------------

- released in the late 1970's
- an early "processor on a chip"
- a lot of its limitations have to do with what could fit on a VLSI
chip in the late 1970's
- big early competitor of Intel

INSTRUCTIONS
- a relatively simple set, but NOT a load/store arch.
- a two-address architecture
- most instructions are specified in 16 bits -- fixed size.
- tight encoding, it is difficult to distinguish opcode from operands,
but the m.s. 4 bits are always part of the opcode.

integer arithmetic
different opcode for varying size data
(add.b add.w add.l)
8-bit 16-bit 32-bit
logical
different opcode for varying size data
control instructions
conditional branches, jumps
(condition code mechanism used -- where most instructions
had the effect of setting the condition codes)
procedure mechanisms
call and return instructions
floating point ??
(I guess not!)
decimal string
arithmetic presuming representation of binary coded decimal

REGISTERS
16 32-bit general purpose registers,
only one is not general purpose (it is a stack pointer)
the PC is not part of the general purpose registers

the registers are divided up into two register files of 8,
one is called the D (data) registers, and the other
is called the A (address) registers. This is a distinction
similar to the CRAY 1.

A7 is the stack pointer.

DATA TYPES
byte
word (16 bits)
longword (32 bits)

addresses are really their own data type.
arithmetic on A registers is 32 bit arithmetic. However,
pin limitations on the VLSI chip required a reduced
size of address. Addresses that travel on/off chip are
24 bits -- and the memory is byte-addressable. So
a 24-bit address specifies one of 16Mbyte memory locations.

each instruction operates on a fixed data type

OPERAND ACCESS

the number of operands for each individual instruction
is fixed

like the VAX, the addressing mode of an operand does not
depend on the instruction. To simplify things, one of the
operands (of a 2 operand instruction) must usually come from
the registers (like the Pentium).

the number/type of addressing modes is much larger than
the MIPS, but fewer than the VAX, similar to Pentium.

PERFORMANCE
?, they got faster as new technologies got faster.

ORIGINAL SIZE
1 64-pin VLSI chip (a huge number of pins at that time)

all about the Cray 1
--------------------
There has always been a drive to design the best, fastest computer in
the world. Whatever computer is the fastest has generally been called
a supercomputer.

The Cray 1 earned this honor, and was the fastest for a relatively long
period of time.

The man who designed the machine, Seymour Cray, was a bit of an eccentric,
but he can get away with it because he was so good. The Cray 1 has an
exceptionally "clean" design, and that makes it fast. (This is probably
a bit exaggerated due to my bias -- the Cray 1 is probably my favorite
computer.)

Mostly my opinion:
To make the circuitry as fast as possible, the Cray 1 took 2 paths
1. Physical -- a relatively "time-tested" technology was used, but
much attention was paid to making circuits physically close
(Seymour was aware of the limits imposed by the speed of light.)
and the technology was pushed to its limits.
2. Include only what was necessary, but on that, a "don't spare the
horses" philosophy was used.
This means that extra hardware was used (not paying attention to
the cost) wherever it could to make the machine faster. And, at
the same time, any functionality that wasn't necessary (in Seymour's
opinion) was left out.
Just remember:
if something seems out of place to you, or some functionality of a
computer that you think is essential and was not included in the Cray 1,
it wasn't necessary! And, leaving something out made the machine
faster.

What the Cray 1 was good for:
lots and lots of floating point manipulations. It wouldn't make
a good instructional machine (don't want to hook lotsa terminals up
to it!), and it wouldn't be much fun to try to implement a modern
operating system on.

How it is used:
most often, a separate (not as fast/powerful) computer was hooked up
as what was commonly called a host computer. The host is where you do
all your editing and debugging of programs. The host also maintains a
queue of jobs to be run on the Cray. One by one the jobs are run, so
the only thing that the Cray is doing is running the final jobs -- often
with LOTS of data. Although its operating system would allow it,
the "multi-tasking" (more than 1 program running simultaneously)
ability was not often used.

instruction set
fixed length instructions
either 16 or 32 bit (no variability that depends on
the number of operands)
number of operands possible for an instruction
0-3 (it is a 3-address instruction set)
number and kind of instructions
op codes are 7 bits long -- giving 128 instructions
This includes complete integer and floating point
instructions.

Notice that missing from the instruction set are:
character (byte) manipulation, duplicates
of anything (!), integer divide

Data representation is vastly simplified from what we've
seen so far! There are ONLY 64-bit 2's complement integers,
24-bit addresses, and floating point numbers!

ALL accesses to memory are done in WORD chunks. A word
on the Cray 1 is 64 bits. All instructions operate on
a single size of data -- Either a 64 bit word, or on an
address (24 bits).

addressing modes (strikingly similar to MIPS)
Register Mode. An instruction (op code) specifies exactly where
the data is.
Base Displacement Mode. Used only for load and store instructions.

REGISTERS:
There are an ENORMOUS number of registers.
There are 5 types of registers.

S registers -- 'S' stands for scalar. These are 64
bit regs. They are used for all sorts of data, but
not addresses. There's 8 of them.

T registers --
These are 64 64-bit backup registers for the S registers.
If you were to do some heavy programming on the Cray 1,
you'd find these registers very useful. This is partially
because you run out of S registers quickly, so you
need temporary storage, but don't want your program
to store to main memory (slow!). There's also an
instruction that allows you to load a block of memory
to the T registers. That's 1 instruction to do a bunch
of loads.

A registers -- 'A' stands for address. These are
24 bit regs. They are used for addresses, and to a
rather limited extent, integer counters.

B registers --
These are backups to the A regs and are used in the
same manner as the T regs.

V registers -- 'V' stands for vector.
There are 8 sets of V regs. Each set has 64 64-bit
registers! That is a lot! They are used mainly for
processing large quantities of "array" data. Their use
makes the Cray 1 very fast. A single instruction that uses
a vector register (1 set) will cause something to happen
to each of the 64 registers within that set.
(SIMD)

hardware stack
no support for stack accesses at all! There is no
special stack pointer register.
cache
none. There's so many registers that there isn't
really a need for one.

size of machine
A bit bigger than 2 refrigerators.
speed of machine
Significantly faster than the VAX and 68000.
For a while, it was the fastest machine around.

price of machine
As an analogy to some very pricey restaurants:
If you need to see the prices on the menu, you can't
afford to eat there.

Probably about $3 million for the basic machine when
they first came out.

A Cray 1 came with a full time hardware engineer,
(a field service person). Why? Down time on a Cray
is very expensive due to the way they are expected to
be used. Waiting for field service to come was
considered too expensive.
how many instructions get executed at one time
it's debatable. There can be more than 1 instruction
at some point in its execution at 1 time. It is a
pipelined machine. This can only go so far (only
1 new instruction can be started each clock cycle).
complexity of ALU
There are actually quite a few alu's in the machine.
Cray calls them functional units. Each one is a specialized
piece of hardware that does its own job as fast as can
be done. Each of them could conceivably be working at
the same time.

on the VAX
----------

The VAX was a popular and commercially successful computer
put out in the late 1970's by DEC (Digital Equipment Corp).

It might be characterized by the term CISC.
RISC (Reduced Instruction Set Computer)
CISC (Complex Instruction Set Computer)

A CISC computer is often characterized by
1. many instructions
2. lots of addressing modes
3. (this one is debatable) variable length instructions
4. memory-to-memory architecture

Some details:

LOTS OF INSTRUCTIONS (like Pentium)
integer arithmetic
different opcode for varying size data
logical
different opcode for varying size data
address manipulations
bit manipulations
control instructions
conditional branches, jumps, looping instructions
procedure mechanisms
call and return instructions (there were more than 1!)
floating point (on more than one representation)
character string manipulations
crc (Cyclic Redundancy Check)
decimal string
arithmetic presuming representation of binary coded decimal
string edit

overall: more than 200 instructions

opcodes were of variable length, but always a multiple
of 8 -- most opcodes were specified in the first 8 bits
of an instruction.

REGISTERS
16 32-bit general purpose registers,
except that they really weren't all general purpose
R15 is the PC -- note that the programmer can change the
PC at will!
R14 is a stack pointer
R13 is a frame pointer
R12 is an argument pointer (address of where a procedure's
parameters are stored -- sometimes on the stack,
and sometimes in main memory)

DATA TYPES
byte
word (16 bits)
longword (32 bits)
quadword (64 bits)
octaword (128 bits)

F floating point (32 bits -- 8 bits of exponent)
D floating point (64 bits -- 8 bits of exponent)
G floating point (64 bits -- 11 bits of exponent)
H floating point (128 bits -- 15 bits of exponent)
character string (consecutive bytes in memory, specified always
by a starting address and the length in bytes)
numeric string (the ASCII codes that represent an integer)
packed decimal string (consecutive sequence of bytes in memory
that represent a BCD integer; BCD digits are each in
4-bit quantities (a "nibble"))

example: the integer +123 is represented by

          a<7-4>   a<3-0>   a+1<7-4>   a+1<3-0>
           0001     0010      0011       1100
            (1)      (2)       (3)        (+)
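
As a sanity check on this layout, here is a minimal C sketch that
packs +123 the same way (one BCD digit per nibble, with the sign
nibble -- 1100, i.e. 0xC, for plus -- in the last position):

    #include <stdio.h>

    int main(void)
    {
        unsigned char a[2];
        a[0] = (1 << 4) | 2;        /* digits 1 and 2         */
        a[1] = (3 << 4) | 0xC;      /* digit 3, then sign (+) */
        printf("%02x %02x\n", a[0], a[1]);   /* prints: 12 3c */
        return 0;
    }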

each instruction operates on a fixed data type

OPERAND ACCESS

the number of operands for each individual instruction
is fixed

a 3-address instruction set

the location of operands is definitely not fixed --
they can be in memory or registers, and the variety
of addressing modes that specify the location of an
operand is large!

equivalent of Pentium    mov EAX, ECX
                         add EAX, EDX    ; EAX <- ECX + EDX

addl3 R3, R4, R2
   ^^
   ||-- 3 operands
   |
   |--- operate on a 32-bit quantity
( there is also addb3, addw3, addb2, addw2, addl2)

(and this is just for 2's complement addition!)

This instruction does R3 + R4 -> R2

This is a VERY simple use of addressing modes.
The syntax of operand specification allows MANY possible
addressing modes -- every one presented this semester,
plus more!

for example
addl3 (R3), R4, R2
uses register deferred (register indirect) addressing mode for
the first operand --
operation
the address of the first operand is in R3,
load the operand at the address, add to the
contents of R4, and place the result into R2
The addressing mode for each operand can be (and often is)
different! NO restrictions (unlike Pentium and Motorola 68000)

One type of addressing mode sticks out --
auto-increment and auto-decrement
They have the side effect of changing the address used to get
an operand, as well as specifying an address. (Motorola 68000
has this addressing mode.)

addl3 (R3)+, R4, R2

operation
the address of the first operand is in R3, load
the operand at the address, then increment the contents
of R3 (the address), then add data loaded from memory
to the contents of R4 and place the result into R2

the amount added to the contents of R3 depends on the
size of the data being operated on. In this case, it
will be 4 (longwords are 4 bytes)
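
The side effect is the same one C programmers get from *p++. A
rough C analogy (the names are just stand-ins for the registers):

    #include <stdint.h>

    /* analogy for  addl3 (R3)+, R4, R2  */
    int32_t autoinc_add(int32_t **r3, int32_t r4)
    {
        /* load through the address, then bump the address by the
           operand size (4 bytes for a longword) */
        return *(*r3)++ + r4;
    }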

MACHINE CODE

Together with each operand is an addressing mode specification.
Each operand specification requires (at least) 1 byte.

Format for the simple addl3 R3, R4, R2

8-bit opcode   0101 0011   0101 0100   0101 0010
               \__/ \__/   \__/ \__/   \__/ \__/
               mode reg    mode reg    mode reg
               (5)  (R3)   (5)  (R4)   (5)  (R2)    mode 5 = register mode

Format for the addl3 (R3), R4, R2

same 8-bit opcode   0110 0011   0101 0100   0101 0010
                    \__/ \__/   \__/ \__/   \__/ \__/
                    mode reg    mode reg    mode reg
                    (6)  (R3)   (5)  (R4)   (5)  (R2)    mode 6 = register deferred

Each instruction has an 8-bit opcode.
There will be one operand specifier (at least 1 byte) for each
operand that the instruction specifies.

Because of the large number and variety of addressing modes,
an operand specification can be much more than 1 byte.
Example: Immediates are placed directly after their specification
within the code.
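
Since the first byte of a specifier always splits into a mode
nibble and a register nibble, decoding that first byte is cheap.
A small C sketch of the split (any extra bytes for immediates,
displacements, etc. would follow in the instruction stream):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint8_t spec = 0x53;          /* 0101 0011: register mode, R3 */
        unsigned mode = spec >> 4;    /* high nibble: addressing mode */
        unsigned reg  = spec & 0x0F;  /* low nibble: register number  */
        printf("mode %u, R%u\n", mode, reg);  /* prints: mode 5, R3 */
        return 0;
    }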

PERFORMANCE

the term MIPS (millions of instructions per second) really came
from the VAX --
the VAX-11/780 ran at just about 1 MIPS
note that this term is misleading --
Instructions take variable times to fetch and execute,
so the performance depends on the program

SIZE
one version: the VAX-11/750 was about the size of a large-capacity
washing machine
another version: the VAX-11/780 was about the size of 2 refrigerators,
standing side by side

the SPARC architecture
----------------------

Scalable Processor ARChitecture

developed in the late 1980s by Sun Microsystems

some details:

-- single chip processor
-- load store architecture
-- machine code instructions are fixed size: 32 bits
-- control instructions based on condition code bits (like Pentium)
-- design goal: make procedure call and return efficient

So, the big difference between this architecture and others
is that it uses/assigns registers differently.

Think about the way that we write programs. There are
LOTS of procedures, and therefore lots of procedure calls
within a program. For each procedure call, parameters
have to be placed on the stack. Within the procedure,
the parameters are copied back out of the stack into
registers, in order to be used. Also, return values from
the procedure (function, really) must be placed somewhere.
This somewhere was usually on the stack.

What if . . . the current activation record at the top of
the stack was in registers, instead of memory? It would
make the program go faster, since accesses to the activation
record would really be to registers, NOT to memory.

This is what the SPARC processor does. There are a very
large number of registers. They are divided into overlapping
sets called WINDOWS. One window contains a procedure's
activation record.

See manuscript, page 320 for a diagram of this.

A program places parameters to a procedure in the
current window's OUTs. The procedure call switches
windows, such that the new window receives the
parameters in its INs.
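
A toy model of the overlap, in C (this is only a sketch of the
idea -- the register counts, names, and indexing here are
illustrative, not the real SPARC layout):

    #include <stdio.h>

    #define NWINDOWS 8
    #define STEP 16                  /* 8 INs + 8 LOCALs per window */
    static int regs[NWINDOWS * STEP + 8];
    static int cwp = 0;              /* current window pointer */

    /* a window's OUTs are the same storage as the next window's INs */
    int *in_reg(int i)  { return &regs[cwp * STEP + i]; }
    int *out_reg(int i) { return &regs[(cwp + 1) * STEP + i]; }

    int main(void)
    {
        *out_reg(0) = 42;            /* caller: parameter into OUT0 */
        cwp++;                       /* "call" slides the window    */
        printf("%d\n", *in_reg(0));  /* callee: 42 appears in IN0   */
        return 0;
    }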

MIPS architecture
These are details of the MIPS R2000 architecture.

The purpose of this is to give the flavor of how all architectures
have been designed/specified since the early 1980s. It is
different from Pentium.

load/store architecture
-----------------------

Memory accesses slow a processor down. Sure, we add caches,
but that only brings the average access time down for memory
accesses. There are still times when the processor is doing
nothing, while waiting for memory accesses to complete.

So, we define a new architecture. In this new load/store
architecture, the addressing mode for every operand is fixed.
(So, there are no bytes for addressing mode information.)
And, for arithmetic/logical type instructions, the addressing
mode for all operands will be register mode. We make sure
that there are enough registers, because everything ends up
in registers. To get stuff to/from memory and into/out of registers,
we have explicit instructions that move data.

Load instructions read data from memory and copy it to a register.
Store instructions write data from a register to memory.

The MIPS R2000 is a load/store architecture.

Registers
---------

There are 32 32-bit registers (for non-floating point operands).

(Integers are 32-bit two's complement; addresses are 32-bit unsigned.)

In code, the syntax used is $x, where x is a value
in the range of 0 to 31.

Example: $5 is register 5.

Some are special purpose:

$0 always contains the value 0. This has the effect of
reducing the number of instructions required in the
instruction set. (For example, a register-to-register
move can be done as  add $d, $s, $0.)
$1 used by the assembler
$2-3 location for function return values
$4-7 location for parameters
$26-27 used exclusively by OS
$31 return address from procedures placed here

There are also 16 64-bit registers used for floating point operands.

In code, the syntax used is $fx, where x is an even value
in the range of 0 to 30.

Example: $f6 is register 6.

Some of the floating point registers also have specific uses.

three-address instruction set
-----------------------------

The instruction set is considerably smaller than the Intel
instruction sets. This is typical of a RISC processor.
There is generally one way to do a required operation, not
2 or 3 as on the Pentium.

Arithmetic and logical instructions have 3 operands.

Example: add $8, $9, $10

The contents of registers $9 and $10 are added,
and the result is placed in register $8.

To do the equivalent of this on the Pentium, you
would end up with something more like:

mov eax, var1
add eax, var2    ; eax <- var1 + var2

or, if one addend was in ebx, and the other was in ecx,
then the code could be
mov eax, ebx
add eax, ecx

Note that it can take fewer instructions to get the same
calculations done in the 3-address instruction set.

Thought question: What reason could there be for the Pentium
(and earlier Intel processors) to have a 2-address
instruction set, when the 3-address one is more
powerful?

load and store instructions
---------------------------
The only instructions that access memory to get operands are
load and store instructions.

Examples: lw $14, 4($15)

Effective address given by base displacement addressing
mode (second operand). 4 + the contents of register $15
is the address. Memory reference is to 1 word (32 bits)
at that address. The word read is placed into register
$14.

sb $10, 0($18)

Effective address given by base displacement addressing
mode (second operand). 0 + the contents of register $18
is the address. The byte in the least significant
byte of register $10 is written to that address.

Notes: The first operand is always register mode.
The second operand is always base displacement mode.
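
A rough C analogy for the two examples above (the variables stand
in for registers; base displacement mode is just
pointer-plus-offset):

    #include <stdint.h>

    void loadstore(uint32_t *r14, uintptr_t r15,
                   uint32_t r10,  uintptr_t r18)
    {
        *r14 = *(uint32_t *)(r15 + 4);        /* lw $14, 4($15) */
        *(uint8_t *)(r18 + 0) = r10 & 0xFF;   /* sb $10, 0($18) */
    }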

control instructions
--------------------

There are 2 types of control instructions.

branch -- (equivalent of Pentium jumps)
conditional or unconditional
Branch decision (to take the branch or not) is based
on a comparison between the values in 2 registers.
(NOT condition codes!)
An offset from the instruction following the branch
is placed into the instruction.

jump -- unconditional
A portion of an absolute address is placed into the
instruction.

Examples: beq $5, $13, label1

Branch if EQual. If the contents of registers $5
and $13 are the same (equal), then go to label1.

jal proc4

Jump And Link. A procedure call instruction very
much like the Pentium call instruction. The return
address is placed into register $31. (Pentium
pushes the return address onto the stack.) Then
address proc4 is placed into the PC.

machine code
------------

All machine code instructions are exactly 32 bits long.
NO EXCEPTIONS!

There are just a few basic formats:

arithmetic and logical instructions

bits   31      26  25      21  20      16  15      11  10         0
       6-bit       5-bit       5-bit       5-bit
       opcode      reg         reg         reg         more opcode
                   desig.      desig.      desig.

Since there is no addressing mode information (all operands
are register mode), an operand specification just takes 5
bits to tell which register is used.

Examples: add $8, $10, $12
          sub $12, $8, $15
          or  $10, $11, $12
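
As a sketch of the encoding, this C fragment packs add $8, $9, $10
into its 32-bit word (opcode 0 and funct 0x20 -- the "more opcode"
bits -- are the standard values for add):

    #include <stdint.h>
    #include <stdio.h>

    uint32_t encode_add(unsigned rd, unsigned rs, unsigned rt)
    {
        return (0u << 26)   /* 6-bit opcode (0 for this format)  */
             | (rs << 21)   /* first source register             */
             | (rt << 16)   /* second source register            */
             | (rd << 11)   /* destination register              */
             | 0x20u;       /* "more opcode" bits: funct for add */
    }

    int main(void)
    {
        /* add $8, $9, $10  ->  prints 012a4020 */
        printf("%08x\n", encode_add(8, 9, 10));
        return 0;
    }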

load/store, arithmetic/logicals that have one operand as an immediate
value, and all branch instructions

31      26  25      21  20      16  15               0
6-bit       5-bit       5-bit       16-bit immediate
opcode      reg         reg
            desig.      desig.

Note that the opcode and the register designations are in the
same locations within this 32-bit instruction as in the previous
format. That makes decoding the instruction
easier (meaning faster, with less circuitry).

Examples: lw   $20, 12($16)
          addi $12, $8, -6
          bne  $13, $14, somewhere
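
The same kind of sketch for an I-format instruction; note how a
negative immediate such as -6 simply occupies the low 16 bits
(opcode 8 is the standard value for addi):

    #include <stdint.h>
    #include <stdio.h>

    uint32_t encode_addi(unsigned rt, unsigned rs, int16_t imm)
    {
        return (8u << 26)      /* opcode for addi      */
             | (rs << 21)      /* source register      */
             | (rt << 16)      /* destination register */
             | (uint16_t)imm;  /* 16-bit immediate     */
    }

    int main(void)
    {
        /* addi $12, $8, -6  ->  prints 210cfffa */
        printf("%08x\n", encode_addi(12, 8, -6));
        return 0;
    }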

sample code
------------

Here is a code fragment of MIPS R2000 code.

# a MIPS code fragment to increment each element of
# an array of 100 integers.
.text

addi $8, $0, 0       # value 0 goes into $8 (a counter)
addi $9, $0, 100     # $9 is ending value of counter
la $11, array # $11 is pointer into array of integers
loop_top:
beq $8, $9, done_loop
lw $10, 0($11) # get array element
addi $10, $10, 1 # add one to it
sw $10, 0($11) # put it back
addi $11, $11, 4 # update pointer
addi $8, $8, 1 # update counter
beq $0, $0, loop_top # unconditional branch

Some notes:
la is not really a MIPS R2000 instruction. It is an example of
an instruction that is allowed in assembly language code, but
is translated into MIPS R2000 code by the assembler. This makes
the job of writing assembly language code easier.

For this la instruction, the assembler produces

(if symbol array is assigned address 0x00aa0bb0)

lui $11, 0x00aa       # $11 gets value 0x 00aa 0000
ori $11, $11, 0x0bb0  # $11 gets value 0x 00aa 0bb0

The MIPS architecture does this with LOTS of instructions.
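
For la, the assembler's job is just to split the 32-bit address
into halves. A C sketch of that split:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t addr = 0x00aa0bb0;   /* address of symbol array */
        printf("lui $11, 0x%04x\n", addr >> 16);         /* 0x00aa */
        printf("ori $11, $11, 0x%04x\n", addr & 0xFFFF); /* 0x0bb0 */
        return 0;
    }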

Virtual Memory

Much of this subject straddles the areas of architecture and operating
systems. Must understand both to get it right.

Problems:

-- Physical memory is too small.
(Smaller than the address space of the machine,
for example, 32-bit address = 4K Mbytes, or 4 Gbytes,
but main memory is only 16 Mbytes.)

We want to have just parts of a program in memory, not the
entire (potentially very large) program.

-- Want ability to have multiple processes somewhere in their
execution. Want to increase throughput of computer by allowing
multiprogramming, multitasking.
The programs (processes) need to SHARE main memory.
need RELOCATION, OS-controlled SWAPPING

-- Large, inexpensive memory (disk) is slow.

Solution:

Make main memory a cache for the larger disk.

Give OS some "hooks" to ease its job.

Introduce notion of an ADDRESS SPACE.

Address Space
-------------
Associated with every program (process) is a set of addresses
that it may reference.

PROCESS -- a program, together with some processor state.

For a 32-bit address machine, want 32 bits of address space available
to the program. And, every process wants access to the full
address space.

To do this, introduce the notion of a VIRTUAL ADDRESS SPACE.

virtual address      easily generated by SW
        |
        |
       \ /

   translation

        |
        |
       \ /

physical address     used by HW to access memory
                     (also called a real address)

TO REMEMBER: need the translation to be fast, since it must
be done for every memory access.
That means that it must be done by HW.

Base and Bounds
---------------
Given 2 HW registers, we can implement a simple virtual memory.

base   -- base address for a process -- a physical (real) address
bounds -- last valid virtual address the process may access

virtual addresses generated by the process are offsets from the base

-- for each reference, must check if the address is within bounds.
-- each process has its own space
-- each process must be allocated contiguously in memory
-- cheap to implement in HW. Addition and comparison can be
done in parallel.
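
The whole translation fits in a few lines. A minimal C sketch
(names are illustrative; in HW the compare and add happen in
parallel):

    #include <stdint.h>

    typedef struct {
        uint32_t base;    /* physical address where process starts */
        uint32_t bounds;  /* last valid virtual address            */
    } bb_regs;

    int bb_translate(bb_regs r, uint32_t vaddr, uint32_t *paddr)
    {
        if (vaddr > r.bounds)
            return -1;             /* out of bounds: trap to OS */
        *paddr = r.base + vaddr;   /* offset from the base      */
        return 0;
    }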

Problems:
-- impossible to share code between 2 programs (while keeping
other code/data private to a program). This is because there
is only 1 segment for each process.

Segmentation
------------

Permit portions of a process to be split into more than 1 area of
memory (SEGMENTS)
one for code
another for heap
another for stack

Use separate base and bounds for each segment.
Could add some sort of protection bit for each segment, to allow
sharing.

Problems:

-- Need table for segments/base/bounds info. Could get big.
-- External fragmentation of memory.

Paging
------
Break up main memory (and each process's virtual address space)
into evenly sized PAGES.

The PAGE TABLE gives the address of either
1) the base of the physical page in main memory
or 2) the page on disk

each entry also has PRESENT bit (called VALID bit in book,
to draw an analogy to cache's valid bit). Present bit
identifies whether physical page is resident in main memory
or not.

Have one page table for each process. Its base address is kept in
a register in the CPU.
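
A minimal C sketch of the lookup (the 4 KB page size and field
names are assumptions for illustration, not a particular
machine's format):

    #include <stdint.h>

    #define PAGE_SHIFT 12                       /* 4 KB pages */
    #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

    typedef struct {
        uint32_t frame;    /* physical page number, if present */
        int      present;  /* present (valid) bit              */
    } pte;

    int pt_translate(const pte *page_table, uint32_t vaddr,
                     uint32_t *paddr)
    {
        uint32_t vpn    = vaddr >> PAGE_SHIFT;  /* virtual page number */
        uint32_t offset = vaddr & PAGE_MASK;    /* offset within page  */

        if (!page_table[vpn].present)
            return -1;                          /* page fault          */
        *paddr = (page_table[vpn].frame << PAGE_SHIFT) | offset;
        return 0;
    }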

Placement of pages in memory is easy. OS keeps a free list, and
just takes one from the list.

Problems:

-- It takes 2 memory accesses to get one piece of data.
One to access the page table entry.
One to get the data.
Solution:
TLB (Translation Lookaside Buffer) as a cache for page table.
-- For reasonable size memory, page table is huge.
Solutions:
Keep base/bounds for page table, so only have pages needed
in table
-- Internal fragmentation. (Page size is always wrong for some
processes.)

PAGING AND SEGMENTATION
-----------------------

To reduce table sizes, use 2 levels of mapping.
Each segment contains an integral number of pages. The number of
pages can vary for each segment.

Keep a page table for each segment.

Both external and internal fragmentation are kept at bay.
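
A sketch of the two-level lookup in C (field widths and the 4 KB
page size are illustrative assumptions):

    #include <stdint.h>

    typedef struct { uint32_t frame; int present; } pte;
    typedef struct { pte *page_table; uint32_t npages; } seg_entry;

    int sp_translate(const seg_entry *seg_table, uint32_t nsegs,
                     uint32_t seg, uint32_t page, uint32_t offset,
                     uint32_t *paddr)
    {
        if (seg >= nsegs || page >= seg_table[seg].npages)
            return -1;                      /* segment bounds fault */
        pte e = seg_table[seg].page_table[page];
        if (!e.present)
            return -1;                      /* page fault           */
        *paddr = (e.frame << 12) | offset;  /* 4 KB pages assumed   */
        return 0;
    }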

Problems:

-- If tables are in memory, can take more than one memory access
to get at a piece of data.
Solution: TLB

TLB (Translation Lookaside Buffer)
----------------------------------

A process typically accesses only a few of its pages at a time, and
accesses them repetitively. (Showing both spatial and temporal
locality.)

The TLB contains a mapping of virtual page number (the TAG) to
physical page number.

Also included,
Present bit
Referenced bit(s) - to help choose a victim page when there are no free pages.
Protection - to facilitate sharing
Dirty bit - done at page level, not cache level.
Process ID - sometimes added so that the TLB does not have
to be flushed at each context switch.
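
One TLB entry, then, might be modeled like this in C (the exact
field widths vary from machine to machine; these are
illustrative):

    #include <stdint.h>

    typedef struct {
        uint32_t vpn;              /* virtual page number: the TAG  */
        uint32_t frame;            /* physical page number          */
        unsigned present    : 1;
        unsigned referenced : 1;   /* helps choose a victim page    */
        unsigned dirty      : 1;   /* page needs write-back to disk */
        unsigned prot       : 3;   /* protection, to allow sharing  */
        unsigned pid        : 8;   /* process ID: avoids a flush on
                                      each context switch           */
    } tlb_entry;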

The TLB typically hits for 98% or more of the pages accessed.

Typical size is 64-128 entries. A TLB this small can be made
fully associative.

There is a difference between a miss in the TLB and a page fault.

PAGE FAULTS
-----------
What happens if the page is not resident in main memory?

- In translation, present bit is off.

- Many machines trap (take an exception).

- OS invoked to allocate free physical page from free list,
or choose a "victim," the physical page to be thrown out
of main memory to make room for faulted page. Have to
write current victim page to disk if dirty.

- OS initiates disk access (really slow, 1 msec or more),
brings page into memory, updates page table and TLB,
sets present bit

- The instruction that caused the page fault is restarted.

Note that restarting instructions can be a tricky business.
The IBM 370 ran "long" instructions twice:
First time just to access each memory location (read),
to generate any potential page faults.
Second time to do the actual instruction, guaranteed not
to page fault.
The Z8000 solution: have 2 processors, one just for handling page
faults.
