Computer Organization
(based on old class notes)
(619) 443-6528
dsalomon@csun.edu
http://www.davidsalomon.name/
These notes are based on experience gained from teaching computer organization over many years.
Much of the material found here was originally included in response to questions and requests from students.
The exercises constitute an important part of the notes and should be worked out! The answers are
provided, but should be consulted only as a last resort. Remember what the Dodo said:
3. Input/Output
1 The I/O Processor
2 Polled I/O
3 Interrupt I/O
4 DMA
5 I/O Channels
6 I/O Codes
7 ASCII and Other Codes
8 Information Theory and Algebraic Coding
9 Error-Detecting and Error-Correcting Codes
10 Data Compression
11 Variable-Size Codes
12 Huffman Codes
13 Facsimile Compression
14 Dictionary-Based Methods
15 Approaches to Image Compression
16 Secure Codes
17 Transposition Ciphers
18 Transposition by Turning Template
19 Columnar Transposition Cipher
20 Steganography
21 Computer Communications
22 Serial I/O
23 Modern Modems
24 ISDN and DSL
25 T-1, DS-1 and Their Relatives
26 Computer Networks
27 Internet Organization
28 Internet: Physical Layout
29 CSUN in the Internet
30 ICANN and IANA
31 The World Wide Web
4. Microprogramming
1 Basic Principles
2 A Short History of Microprogramming
3 The Computer Clock
4 An Example Microarchitecture
5 The Microinstructions
6 Microinstruction Timing
7 The Control Path
8 The Machine Instructions
9 The Microcode
10 Final Notes
11 A Horizontal Example
Contents v
are fast storage units, where a few pieces of information can be stored temporarily until they are needed
by the program. The ALU and the registers are sometimes combined and are called the RALU. Figure 1.1
illustrates the relations between these components.
[Figure 1.1: a computer installation, showing the control unit, the ALU, and the registers]
next instruction is fetched, it is stored in the instruction register (IR), another special-purpose register, that
always contains the current instruction. Note that as soon as the next instruction is fetched and is stored in
the IR it becomes the current instruction (Figure 1.2).
Exercise 1.2: Can the PC be one of the general-purpose registers (GPRs)?
The fetch-execute cycle can now be rewritten as:
1. Read the instruction that’s pointed to by the PC from memory and move it into the IR.
2. Increment the PC.
3. Decode the instruction in the IR.
4. If the instruction has to read an operand from memory, calculate the operand’s address (this is called
the effective address, or EA) and read the operand from memory.
5. Execute the current instruction from the IR.
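The five steps above can be sketched as a short simulation. This is only an illustration under simplifying assumptions: every instruction occupies exactly one memory word, the effective address is simply the address field, and the HLT instruction, the accumulator, and the tuple encoding are hypothetical devices that do not appear in these notes.

```python
def run(memory, pc=0, max_steps=100):
    """Run the five-step cycle until HLT and return the accumulator."""
    acc = 0
    for _ in range(max_steps):
        ir = memory[pc]               # Step 1: fetch the word the PC points to into the IR
        pc += 1                       # Step 2: increment the PC (one word per instruction here)
        opcode, address = ir          # Step 3: decode the instruction held in the IR
        operand = None
        if opcode in ("LOD", "ADD"):  # Step 4: the EA is the address field; read the operand
            operand = memory[address]
        if opcode == "LOD":           # Step 5: execute the instruction
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "JMP":
            pc = address              # a jump simply resets the PC
        elif opcode == "HLT":         # hypothetical instruction that ends the simulation
            return acc
        else:
            raise ValueError(f"invalid opcode {opcode!r}")
    raise RuntimeError("step limit reached")


program = {
    0: ("LOD", 10),
    1: ("JMP", 3),
    2: ("ADD", 11),   # skipped by the jump
    3: ("ADD", 12),
    4: ("HLT", 0),
    10: 5, 11: 100, 12: 7,
}
print(run(program))  # 12
```

A real control unit does not stop by itself, as discussed later in this section; the HLT instruction is used here only so that the sketch terminates.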
1.2 The Control Unit 3
[Figure 1.2: the processor (control unit, ALU, registers, PC, and IR) and the memory (instructions and data)]
Step 2 is important since the PC should always point to the next instruction, not to the current one
(the reason for this is discussed below). The PC is incremented by the size of the instruction (measured in
words). In modern computers, instructions have different sizes. The control unit therefore has to determine
the size of the instruction, make sure that all parts of the instruction have been fetched, then increment the
PC. Step 2 is, therefore, not as simple as it sounds, and it is combined with Step 3. Instruction sizes are
discussed in Section 2.1.
In Step 3, the instruction is examined (decoded) to determine what it is and how it is executed. Various
errors can be detected at this step, and if an error is detected, the control unit does not proceed to Step 4
but instead generates an interrupt (Section 1.7).
Exercise 1.3: What errors can be detected when an instruction is decoded?
Step 4 is executed only if the instruction needs to read an operand from memory or if it uses an
addressing mode. Addressing modes are discussed in detail in Section 2.3.
In Step 5, the control unit executes the instruction by sending signals (control signals) to the ALU,
to the memory, and to other parts of the computer. Step 5 is the most complex part of the control unit,
since the instructions are different, and each should be executed differently. The control unit has a different
circuit for executing each instruction and, since a modern computer might have more than 100 different
instructions, the control unit is a large, complex circuit.
On considering the steps above, the reader may wonder why they are arranged in this order. Specifically,
why does the control unit perform fetch, increment PC, execute, rather than fetch, execute,
increment PC? In fact, the latter sequence may seem more natural to most.
It turns out that for most instructions, the two sequences above are equivalent and the control unit can
use either of them. However, there is one group of instructions, those affecting the flow of control, where the
fetch-increment-execute sequence is the only one that works.
Such instructions are the various jumps and branches. They are used to change the normal flow of control,
the way execution proceeds within a program, and they themselves are very easy to execute. Executing
an instruction such as JMP 846 is done simply by resetting the PC to 846 (to the jump address). Since the
PC is supposed to point to the next instruction, resetting the PC guarantees that the next instruction will
be fetched from location 846. Figure 1.3 compares the operations of the control unit in both the
fetch-increment-execute (part a) and the fetch-execute-increment (part b) sequences. It is easy to see that the
latter sequence ends up with the PC being reset to address 847, which is obviously wrong.
Exercise 1.4: What other instructions change the flow of control?
The upper part of Figure 1.3 shows memory locations 189–191 with the JUMP instruction stored in
location 190. The PC contains 190, so the instruction at memory location 190 is the next one, implying that
the current instruction is the one at location 189. In part (a2) the instruction from 190 has been fetched
into the IR. In (a3), the PC is incremented to point to the next instruction, the one in location 191. In (a4),
the instruction in the IR is executed by resetting the PC. It is reset to the jump address, 846.
In parts (b2), (b3), and (b4), the JUMP instruction is fetched, it is executed by resetting the PC to 846,
and, finally, the PC is incremented to 847, which is wrong.
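The two sequences can also be compared in a few lines of code. The function below is a hypothetical helper written for this comparison; it models only the PC and uses the addresses of the example (a JMP 846 stored in location 190).

```python
def pc_after_jump(pc, jump_target, increment_first):
    """Return the PC after one JMP has been fetched and executed."""
    # The JMP is fetched from location `pc`; fetching does not change the PC.
    if increment_first:
        pc += 1            # (a3) the PC now points past the JMP
        pc = jump_target   # (a4) executing the JMP resets the PC
    else:
        pc = jump_target   # (b3) executing the JMP resets the PC
        pc += 1            # (b4) the late increment overshoots the target
    return pc


print(pc_after_jump(190, 846, increment_first=True))   # 846
print(pc_after_jump(190, 846, increment_first=False))  # 847
```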
4 1. Introduction
Figure 1.3: Two alternative fetch-execute sequences: (a) fetch-increment-execute, (b) fetch-execute-increment
control unit starts its cycle, so it fetches and executes instructions from the start of the resident operating
system (i.e., the bootstrap loader). This loads the rest of the operating system from disk. The operating
system then goes into a loop where it waits for the first user command or action.
Exercise 1.5: What would happen if the PC does not get set when the computer starts?
Exercise 1.6: Why not store the entire operating system permanently in memory?
The control unit stops when power to the computer is turned off. This seems a “rough” way to stop
the computer, but it makes sense when we consider the way the user, the operating system, and the control
unit interact. The user turns the computer on, which starts the resident operating system, which in turn
loads the rest of the operating system. The operating system then executes user actions and commands,
such as “open a window,” “delete a file,” “execute a file,” “move the cursor,” “pull down a menu,” “insert a
floppy disk into the disk drive,” etc. At the end of the session, the user has no more commands/actions, so
he turns off the computer. At this point the control unit is still running at full speed, fetching and executing
instructions, but we don’t mind shutting it off abruptly since those instructions are executed by the operating
system while looking for user actions. From the point of view of the user those instructions are effectively
useless.
The most important task of the control unit is to execute instructions. However, before this can be
discussed in detail, the reader must know something about memory and its operations. The next section
discusses memory, and Section 1.4 returns to the control unit and instruction execution.
Certain familiar values for M are 8K, 64K, 256K, 1M (pronounced ‘one mega’), and 32M words. The
quantity K (the letter K stands for “Kilo”) is normally defined as $10^3 = 1000$ but, since computers use
binary numbers, the quantity K that’s used in computers is defined as the power of 2 that’s nearest 1000.
This turns out to be $2^{10}$ or 1024. Thus, 8K is not 8000 but $8 \times 2^{10} = 8192$, and 64K equals $2^6 \times 2^{10} = 65536$
rather than 64000. Similarly M (Mega) is not defined as a million ($10^6$) but as $2^{20} = 1024$K, which is a little
more than a million. Notice that M specifies the number of memory words, where a word is N bits long.
When N = 8, the word is called a byte. However, many modern computers have different values of N.
Exercise 1.7: Express a petabyte (PB) in terms of terabytes (TB), gigabytes (GB), megabytes (MB),
kilobytes (KB), bytes, and bits.
Why are such values used for M instead of nice round numbers like 1000, 8000, or 64000? The answer is
that 8192 is a nice round number, but in binary, not in decimal. It equals $2^{13}$. Some of the values mentioned
earlier are
$$64\text{K} = 64 \cdot \text{K} = 2^6 \cdot 2^{10} = 2^{16},$$
$$256\text{K} = 2^8 \cdot 2^{10} = 2^{18},$$
$$32\text{M} = 2^5 \cdot 2^{20} = 2^{25}.$$
In general, M is always assigned values of the form $2^j$ for some positive integer j. The reason is that such
values make best use of the range of addresses. It turns out that in a memory with $M = 2^j$ words, every
address is a j-bit number, and every j-bit number is a valid address.
To see why this is true, consider the familiar decimal numbers. In a memory with 1000 words, decimal
addresses range from 0 to 999. These addresses are therefore 3-(decimal) digit numbers. What’s more, every
3-digit number is an address in such a memory.
As a simple binary example, consider $M = 8 = 2^3$. Binary addresses range from 0 to $7 = 111_2$ and are
therefore 3-bit numbers. Also, every 3-bit number is an address, since $111_2$ is the largest 3-bit number.
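The same check can be done for the memory sizes mentioned earlier. The sketch below is an illustration only; the helper name address_bits is an assumption made for this example. It computes j for a memory of $2^j$ words and prints the highest address in binary, which consists of j 1's.

```python
K = 2**10
M = 2**20


def address_bits(words):
    """Return j for a memory of 2**j words; reject sizes that are not powers of 2."""
    j = words.bit_length() - 1
    if words < 1 or 2**j != words:
        raise ValueError("memory size is not a power of two")
    return j


for name, size in [("8", 8), ("8K", 8 * K), ("64K", 64 * K), ("256K", 256 * K), ("32M", 32 * M)]:
    j = address_bits(size)
    # The highest address is size - 1, which is j 1's in binary.
    print(name, size, j, bin(size - 1))
```

A size such as 1000 is rejected by address_bits, since 10 bits would be needed for its addresses but the 10-bit numbers 1000 through 1023 would not be valid addresses.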
In general, if $M = 2^j$, addresses vary from 0 to $M - 1$. But
$$M - 1 = 2^j - 1 = \underbrace{11\ldots11_2}_{j},$$
1.3 The Memory 7
located at the highest address). In little endian, the bytes are arranged such that the last byte (the end) is
the “littlest” (least significant). This convention is used on Intel processors. Thus, the little endian order is
. . . ,byte1, byte0 (the least-significant byte is located at the lowest address).
Exercise 1.9: Given the four-byte word 00001111|00110101|01011010|00000001, show how it is laid out in
memory in big endian and in little endian.
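The two byte orders can be seen by converting an integer to bytes. The value 0x0A0B0C0D below is a hypothetical 32-bit word chosen for this illustration (it is not the word of Exercise 1.9), and the base address 0 is likewise an assumption.

```python
word = 0x0A0B0C0D  # hypothetical 32-bit word; 0x0A is its most-significant byte

big = word.to_bytes(4, byteorder="big")
little = word.to_bytes(4, byteorder="little")


def layout(data, base=0):
    """Pair each byte with the address it would occupy, starting at `base`."""
    return [(base + i, f"{b:02X}") for i, b in enumerate(data)]


print(layout(big))     # [(0, '0A'), (1, '0B'), (2, '0C'), (3, '0D')]
print(layout(little))  # [(0, '0D'), (1, '0C'), (2, '0B'), (3, '0A')]
```

In the little endian layout the least-significant byte (0D) occupies the lowest address, in agreement with the description above.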
1.3.1 Other Features of Memories
First-generation computers had memories made of mercury delay lines or of electromechanical relays. They
were bulky, slow, and unreliable. Second-generation computers had magnetic core memories which were
smaller, faster, and more reliable. Modern computers use semiconductor memories which are fabricated on
integrated circuits, use very little electrical power, and are as reliable as the processor itself (and almost as
fast). One important feature of these memories, namely volatility, is discussed in this section.
The two main types of semiconductor memories are ROM and RAM. The acronym ROM stands for
“read only memory.” The computer can read from this type of memory but cannot write into it. Data can be
written in ROM only during its manufacturing process. Some types of ROM can be written to (programmed)
using a special device located outside the computer. RAM is Read/Write memory and is of more general
use. Incidentally, the name “RAM” stands for “random access memory,” but this is an unfortunate name
since both ROM and RAM are random access. A better name for RAM would have been RWM (read write
memory).
The term “random access” is defined as the case where it takes the same amount of time to read any
memory location, regardless of its address. It also takes the same time to write into any location. However,
the read and write times are not necessarily the same.
If ROM cannot be written to by the computer, why is it used at all? It turns out that ROM has at
least two important applications. It is used for dedicated applications and also for the bootstrap loader.
We are used to computers where different programs are compiled, loaded, and executed all the time.
However, many computers spend their “lives” executing the same program over and over. A computer
controlling traffic lights, a computer in a microwave oven, and a computer controlling certain operations in a
car are typical examples of computers that execute dedicated applications. In such computers, the programs
should be stored in ROM, since ROM is nonvolatile, meaning it does not lose its content when power is
turned off. RAM, on the other hand, is volatile.
When a computer is first turned on, there should be a program in memory for the computer to execute.
Without such a program it would be impossible to use the computer, since there would be no way to load
any programs into memory. This is why computers have a small program, the bootstrap loader, in permanent
ROM to start things up by loading the rest of the operating system. It should be pointed out that there is
another way to bootstrap a computer. Some old computers had switches that made it possible to manually
load a small bootstrap loader into memory. With volatile semiconductor memories, however, it is more
convenient to have a bootstrap ROM.
To increase memory reliability, some computers use a parity bit in each memory word. Each time new
information is written to the word, the parity bit is updated. Each time the word is read, parity is checked.
Since no memory is completely reliable, bits may get corrupted while stored in memory, and it is important
to detect such errors.
The idea of a parity bit is to complete the number of 1’s in the word to an even number (even parity) or
to an odd number (odd parity). For example, suppose the number 010 is written in a 3-bit word and even
parity is used. The parity bit should be set by the memory unit to 1, to complete the number of 1’s to an
even number. When the word is read, the memory unit checks parity by counting the number of 1’s. If the
number of 1’s is odd, memory notifies the processor of the problem by means of an interrupt (Section 1.7).
A parity bit is thus a simple device for error detection. However, a single parity bit cannot tell which bit
(or bits) is bad.
A single parity bit also cannot detect every possible error. The case where two bits change their values
in a memory word cannot be detected by simple parity. There are ways of adding more parity bits to make
data more reliable, some of which are discussed in Section 3.9.
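The behavior of a single parity bit can be tried out with a small sketch. Words are represented here as strings of 0's and 1's with the parity bit appended at the right; this representation and the function names are assumptions made for the illustration.

```python
def parity_bit(bits, even=True):
    """Return the parity bit ('0' or '1') that completes the count of 1's."""
    odd_count = bits.count("1") % 2          # 1 if the word has an odd number of 1's
    return str(odd_count if even else 1 - odd_count)


def parity_ok(stored, even=True):
    """Check a stored word (data bits followed by the parity bit)."""
    ones = stored.count("1")
    return ones % 2 == (0 if even else 1)


def flip(bits, i):
    """Return a copy of `bits` with bit i inverted (a simulated memory error)."""
    return bits[:i] + ("1" if bits[i] == "0" else "0") + bits[i + 1:]


stored = "010" + parity_bit("010")   # the example from the text: parity bit is 1
print(stored)                        # 0101
print(parity_ok(stored))             # True
print(parity_ok(flip(stored, 0)))    # False: a single bad bit is detected
print(parity_ok(flip(flip(stored, 0), 1)))  # True: two bad bits go undetected
```

Note that when the check fails, nothing in the stored word indicates which bit was flipped.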
Exercise 1.10: What if the parity bit itself goes bad?
1.3.2 Memory Operations
This section treats memory as a black box and does not go into how memory operates internally. We only
discuss how memory is accessed by the control unit (later we see that the DMA device can also access
memory). As mentioned earlier, there are only two memory operations, namely memory read and memory
write. Each operation accesses one word, so at any given time it is only possible to read a word from memory,
or to write into a word in memory. When reading from a word, only a copy of the content is read and the
original content of the word is not affected. This means that successive reads will read the same data. This
behavior is a special case of a general rule that says that when data is moved inside the computer, only a
copy is moved and the original is unaffected.
Exercise 1.11: What are exceptions to this rule?
The “memory write” operation, on the other hand, writes new information in the word, erasing its old
content. This is the only way to erase information in memory, there being no “erase” operation. After all,
a memory word may contain only bits, so there is no such thing as a blank word. As a result, even words
that have not been initialized with any information, have something in them, either remnants from an old
program, or random bits.
The first step in either memory operation is to send an address to memory. The address identifies the
memory word that should be accessed (read from or written into). From that point, the “read” and “write”
operations differ. The “memory read” operation consists of the following four steps:
example, suppose that the control unit has initiated a memory write. During Step 4, the control unit can
perform a register shift, which is an internal operation and does not involve memory.
Figure 1.6 summarizes the steps of a memory write operation.
With this knowledge of the memory unit and its two operations, we are now ready to look at the details
of how the control unit fetches and executes instructions.
1.4 Instruction Execution
Fetching an instruction is a “memory read” operation. All instructions are fetched from memory in the same
way, the only difference being the size of the instruction. In modern computers, some instructions are long.
They occupy more than one memory word, and therefore have to be fetched in two or even three steps. In
supercomputers, memory words are long (typically 64 bits each) and instructions are packed several to a
word. In such a machine, reading a word from memory fetches several instructions, of which only the first
can be immediately moved to the IR and executed. The rest are stored in an instruction buffer, waiting to
be executed.
Decoding the current instruction, in Step 3, is done by passing the opcode through a decoder (Figure 1.7).
The decoder outputs are connected to the various execution circuits of the control unit and each output
activates “its” execution circuit to execute an instruction. An execution circuit executes an instruction by
generating a sequence of control signals that are sent to various parts of the computer. It should also be
mentioned that in modern computers, opcodes have different lengths, so the control unit has to figure out
the length of the opcode before it can decode it. Variable size opcodes are discussed in Section 2.2.
The opcode decoder is also the point where invalid opcodes are detected. Most computers have several
unused opcodes (to be used in future versions of the machine), and as a result, some of the decoder outputs
do not go to any execution units. Those outputs are connected together and, when any of them becomes
active, that signal is used to generate an interrupt.
Figure 1.7: Decoding the opcode
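The decoder can be modeled as a circuit with one output line per possible opcode. In the sketch below, the 3-bit opcode size and the assignment of opcodes 0 through 3 to instructions are hypothetical; only the idea of tying the unused outputs together to generate an interrupt is taken from the text.

```python
class InvalidOpcodeInterrupt(Exception):
    """Stands in for the interrupt generated by an unused opcode."""


OPCODE_BITS = 3                                              # assumed opcode size
EXECUTION_CIRCUITS = {0: "JMP", 1: "LOD", 2: "ADD", 3: "LSHFT"}  # hypothetical encoding


def decode(opcode):
    """Activate exactly one decoder output and return the selected execution circuit."""
    if not 0 <= opcode < 2**OPCODE_BITS:
        raise ValueError("opcode does not fit in the opcode field")
    outputs = [int(i == opcode) for i in range(2**OPCODE_BITS)]  # one-hot outputs
    # The outputs of the unused opcodes are connected together (a logical OR).
    unused = [i for i in range(2**OPCODE_BITS) if i not in EXECUTION_CIRCUITS]
    if any(outputs[i] for i in unused):
        raise InvalidOpcodeInterrupt(f"invalid opcode {opcode}")
    return EXECUTION_CIRCUITS[opcode]


print(decode(2))  # ADD
```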
Instruction execution, however, is different from fetching and decoding and is much more complex than
the operations in Steps 1–4 of the control unit cycle. Modern computers can have a large instruction set
with as many as 200–300 instructions. Some instructions are similar and are executed in very similar ways.
Others are very different and are executed in completely different ways. In general, the control unit has
to treat each instruction individually, and therefore has to contain a description of the execution of each
instruction. The part of the control unit containing those descriptions is called the sequencer. This is the
collection of all the execution circuits.
The sequencer executes an instruction by sending a sequence of control signals to the different parts
of the computer. Most of those signals move data from one location to another. They perform register
transfers. Other signals trigger parts of the computer to perform certain tasks. For example, a “read” signal
sent by the sequencer to memory, triggers memory to start a “read” operation. A “shift” signal sent to the
ALU triggers it to start a shift operation. The examples below list the control signals sent by the control
unit for the execution of various instructions.
Example 1. JMP 158. The programmer writes the mnemonic JMP and the jump address 158. The
assembler assembles this as the two-field instruction [JMP | 158]. The control unit executes a jump instruction
by resetting the PC to the jump address (address 158). Figure 1.8 shows the components involved in
executing the instruction.
Figure 1.8: Executing a JMP instruction
It seems that the only operation necessary in this case is PC←158. This notation seems familiar since
it is similar to that used in higher-level languages. In our case, however, this notation indicates operations
performed by the control signals, and such operations are simple. A control signal can cause the transfer of
register A to register B, but cannot move a constant to a register. This is why we need to have the jump
address in a register. That register is, of course, the IR, where the instruction is located during execution.
Since our example instruction has two parts, we assume that the IR is partitioned into two fields a and b such
that $\mathrm{IR}_a$ contains the opcode and $\mathrm{IR}_b$ contains the address. The only control signal necessary to execute the
instruction can now be written as: $\mathrm{PC} \leftarrow \mathrm{IR}_b$.
A useful rule
When an instruction is executed by the control unit, it is located in the IR.
Execution follows the fetch and decode steps, so when an instruction is executed,
it has already been fetched into the IR and been decoded. When the control unit
needs information for executing the instruction, it gets it from the IR, because this
is where the instruction is located during execution (some information is obtained
from other sources, such as the PC, the SP, and the general-purpose registers,
GPRs).
As a result, many of the control signals used by the control unit access the IR.
Example 2. LOD R1,158. The programmer writes the mnemonic LOD, followed by the two operands
(register and address) separated by a comma. The assembler generates the 3-field machine instruction
[LOD | 1 | 158]. We label these fields a, b, and c (where field $\mathrm{IR}_b$ is the register number and field $\mathrm{IR}_c$
contains the address). This instruction loads the content of memory location 158 into R1 (one of the GPRs,
in the processor). Before executing the instruction, the control unit has to calculate the effective address.
The way this is done in practice depends on the addressing mode used. In this example we assume that
the instruction contains the effective address and that no special addressing mode is used. Executing the
instruction involves reading the operand from memory and moving it to the destination register. The steps
are:
1. $\mathrm{MAR} \leftarrow \mathrm{IR}_c$
2. “R” (the “read” signal)
3. wait
4. $\mathrm{R1} \leftarrow \mathrm{MBR}$
However, Step 4 cannot really be written in this form since these steps should be valid for any LOD instruction.
Step 4 should therefore be general. Instead of using the explicit name of the destination register, it should
get that name from field b of the IR. We can write this step $R_{\mathrm{IR}_b} \leftarrow \mathrm{MBR}$. The four steps are illustrated in
Figure 1.9a.
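The four steps, with Step 4 in its general form, can be mimicked in software. In the sketch below, the instruction is a tuple of its fields a, b, and c, the registers are a list, and memory is a dictionary; all of these representations, and the value 42 stored in location 158, are assumptions made for the illustration.

```python
def execute_lod(ir, regs, memory):
    """Execute a LOD instruction whose fields (a, b, c) are held in `ir`."""
    _, b, c = ir        # field b is the register number, field c is the address
    mar = c             # 1. MAR <- IR_c
    mbr = memory[mar]   # 2. and 3. "R" (read), then wait: the word arrives in the MBR
    regs[b] = mbr       # 4. R[IR_b] <- MBR, general for any destination register
    return regs


regs = [0] * 8
memory = {158: 42}
execute_lod(("LOD", 1, 158), regs, memory)
print(regs[1])  # 42
execute_lod(("LOD", 5, 158), regs, memory)
print(regs[5])  # 42
```

Because the destination is taken from field b rather than being written explicitly as R1, the same steps serve both calls.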
Example 3. ADD R1,158. This is similar to the previous example, and is illustrated in Figure 1.9b.
Executing this instruction involves the steps:
Figure 1.9: Executing LOD (a) and ADD (b)
[Figure: the IR holding the instruction LSHFT 3 4, the A bus, and the ALU]
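1. $\text{left-ALU} \leftarrow R_{\mathrm{IR}_b}$
2. $\text{right-ALU} \leftarrow \mathrm{IR}_c$
3. $\text{ALU-function} \leftarrow \text{‘left shift’}$
4. wait for ALU
5. $R_{\mathrm{IR}_b} \leftarrow \text{ALU-output}$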
Note that the details of this process are transparent to the programmer. The programmer thinks of
the shift as taking place in R3. Notice also the similarity between this example and the previous one. The
difference between the execution of an add and a shift is in the ALU operation; the signals sent by the control
unit are very similar.
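The five control signals of the shift can be traced with a small model of the ALU. The 16-bit word size, the tuple form of the instruction, and the initial content of R3 are assumptions made for this sketch; the "add" function is included only to show that the ALU function is the one thing that changes.

```python
WORD_BITS = 16                 # assumed word size
MASK = 2**WORD_BITS - 1


def alu(function, left, right):
    """A two-function ALU; results are truncated to the word size."""
    if function == "left shift":
        return (left << right) & MASK
    if function == "add":
        return (left + right) & MASK
    raise ValueError(f"unknown ALU function {function!r}")


def execute_lshft(ir, regs):
    """Execute LSHFT using the instruction fields (a, b, c) held in `ir`."""
    _, b, c = ir
    left_alu = regs[b]                # 1. left-ALU <- R[IR_b]
    right_alu = c                     # 2. right-ALU <- IR_c
    function = "left shift"           # 3. ALU-function <- 'left shift'
    output = alu(function, left_alu, right_alu)  # 4. wait for ALU
    regs[b] = output                  # 5. R[IR_b] <- ALU-output
    return regs


regs = [0] * 8
regs[3] = 0b101
execute_lshft(("LSHFT", 3, 4), regs)
print(bin(regs[3]))  # 0b1010000
```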
Example 5. COMPR R4,288. This is a “compare” instruction, comparing its two operands. Writing
and assembling this instruction are similar to the previous example. The comparison is done in the ALU
and the results (see the discussion of comparisons in Section 2.13.4) are stored by the ALU in the status
flags. We can consider the flags part of the ALU, with the control unit having access to them.
The control signals are:
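1. $\mathrm{MAR} \leftarrow \mathrm{IR}_c$
2. ‘read’
3. wait
4. $\text{left-ALU} \leftarrow \mathrm{MBR}$
5. $\text{right-ALU} \leftarrow R_{\mathrm{IR}_b}$
6. $\text{ALU-function} \leftarrow \text{‘compare’}$
7. wait for ALU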
This should be compared to the control signals issued for executing the ADD instruction above. Note that
Step 7 is normally unnecessary. In principle, the control unit should wait for the ALU only if it has to
use it again immediately. In most cases, however, the next step of the control unit is to fetch the next
instruction, so it does not need to use the ALU for a while. As a result, a sophisticated control unit—of the
kind found in modern, fast computers—looks ahead, after Step 6, to see what the next operation is. If the
next operation does not use the ALU, Step 7 is skipped. On the other hand, the control unit in a cheap,
slow microprocessor—designed for simplicity, not speed—always executes Step 7, regardless of what follows.
Example 6. CALL 248. Calling a procedure is an important programming tool, and every computer
includes a CALL instruction in its instruction set. The instruction is executed like a JMP, but the return
address is saved. The return address is the address of the instruction following the CALL, so it is found in
the PC at the time the CALL is executed. Thus, executing the CALL involves two steps:
1. Save the PC.
2. $\mathrm{PC} \leftarrow \mathrm{IR}_b$.
Where is the PC saved? Most computers use one of three techniques. They save the PC either in a register,
in the stack, or in memory.
Saving the PC in a register is the simplest method. The user should specify an available register by
writing an instruction such as CALL R5,248, and the control unit saves the PC in R5. The downside of this
method is that procedure calls may be nested and there may not be enough available registers to hold all
the return addresses at the same time.
Saving the PC in the stack (stacks are discussed in Section 2.13.5) is a good method since procedure
calls are often nested and the last procedure to be called is the first one to return. This means that the last
return address to be saved in the stack is the first one to be used, which conforms to the way a stack is used.
Modern computers have stacks and use them extensively. The few old mainframes still in use generally do
not have stacks. Also, some special-purpose computers that have just ROM cannot have a stack.
The third method is to save the PC in memory. This makes sense since there is an ideal memory location
for saving each return address, namely the first location of the procedure. It is ideal since (1) it is different
for different procedures, (2) it is known at the time the CALL is executed, and (3) it can easily be accessed
by the procedure itself when the time comes for it to return to the calling program. An instruction such as
CALL 248 is therefore executed by (1) saving the PC in location 248, and (2) resetting the PC to 249. The
procedure starts at location 248, but its first executable instruction is located at 249. When writing a procedure (in
assembler) on a computer that uses this method, the programmer should reserve the first location for the
return address. If the program is written in a higher-level language, the compiler automatically takes care
of this detail. When the procedure has to return, it executes an indirect JMP to location 248. The use of
the indirect mode means that the control unit reads location 248 and resets the PC to the content of that
location. The indirect mode, as well as other addressing modes, is discussed in Section 2.3.
1.5 CPU Bus Structure 17
The problem with this method is recursive calls. Imagine a procedure A being called by the main
program. The first location of A contains the return address, say 1271. If A calls itself before it returns, the
new return address gets stored in the first location of the procedure, thereby erasing the 1271.
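The loss of the return address is easy to reproduce. The sketch below models the memory method with two hypothetical helper functions; the procedure address 248 and the return address 1271 are those of the text, while the second return address, 260, is an assumed value for the recursive call.

```python
def call(memory, pc, target):
    """CALL target: save the PC in the first word of the procedure, continue at target + 1."""
    memory[target] = pc
    return target + 1


def ret(memory, target):
    """Return by an indirect jump through the first word of the procedure."""
    return memory[target]


memory = {}
pc = call(memory, pc=1271, target=248)   # the main program calls A
print(pc, memory[248])                   # 249 1271
pc = call(memory, pc=260, target=248)    # A calls itself (assumed return address 260)
print(pc, memory[248])                   # 249 260: the 1271 has been erased
print(ret(memory, 248))                  # 260
print(ret(memory, 248))                  # 260 again; A can never return to 1271
```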
As a result, the PC is normally saved in the stack. The control signals are:
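1. $\mathrm{MAR} \leftarrow \mathrm{SP}$.
2. $\mathrm{MBR} \leftarrow \mathrm{PC}$.
3. write.
4. wait. $\mathrm{SP} \leftarrow \mathrm{SP} + 1$.
5. $\mathrm{PC} \leftarrow \mathrm{IR}_b$.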
Notice that the control unit uses the wait period to increment the stack pointer.
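The following sketch traces these five signals and shows that nested (even recursive) calls return correctly. The dictionary holding the PC, SP, and IR, the initial stack pointer of 1000, and the return address 260 are assumptions. The return operation is not described in this example; it is written here simply as the inverse of the call.

```python
def execute_call(state, memory):
    """Execute a CALL whose fields (a, b) are held in state['IR']."""
    mar = state["SP"]              # 1. MAR <- SP
    mbr = state["PC"]              # 2. MBR <- PC
    memory[mar] = mbr              # 3. write
    state["SP"] += 1               # 4. wait; SP <- SP + 1 during the wait
    state["PC"] = state["IR"][1]   # 5. PC <- IR_b


def execute_return(state, memory):
    """Assumed inverse of the call: pop the most recent return address into the PC."""
    state["SP"] -= 1
    state["PC"] = memory[state["SP"]]


memory = {}
state = {"PC": 1271, "SP": 1000, "IR": ("CALL", 248)}
execute_call(state, memory)        # the main program calls the procedure
state["PC"] = 260                  # assumed: the procedure reaches a recursive CALL
execute_call(state, memory)
execute_return(state, memory)
print(state["PC"])                 # 260
execute_return(state, memory)
print(state["PC"])                 # 1271
```

The last return address saved is the first one used, which is why the stack handles the recursive case that defeats the memory method.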
Example 7. BPL 158. This is a “Branch on PLus” instruction. It is a conditional branch that branches
to address 158 if the N status flag (sometimes called the S flag) corresponds to a positive sign (i.e., has a
value of zero). More information on the status flags can be found in Sections 2.13.5 and 2.13.5. The only
control signal necessary is “if N=0 then $\mathrm{PC} \leftarrow \mathrm{IR}_b$”.
Initially, it seems that this conditional signal is neither a straightforward signal nor a register transfer,
but a new type of control signal. However, Figure 1.11 shows how the control unit can execute this instruction
by sending a normal, unconditional control signal “Execute BPL” that passes conditionally through the AND
gate depending on the state of the N status flag. Thus, the control unit can execute any conditional branch
by sending a normal control signal; the only special feature needed for this type of instruction is an AND
gate.
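[Figure 1.11: the “Execute BPL” control signal and the N status flag are inputs to an AND gate that controls the transfer from $\mathrm{IR}_b$ to the PC]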
This example is important since it is the basis of all the decisions and loops performed by the various
programming languages. Any if, for, while or similar statements are compiled to machine instructions
that include either a BPL or another conditional branch instruction. To execute such an instruction, the
control unit has to be able to test any of the flags and perform an operation only if the flag has a certain
value.
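The role of the AND gate can be expressed in a few lines. In this sketch the N flag is inverted before it enters the gate, because the branch is taken when N=0; the inversion, the function name, and the PC value 191 are assumptions made for the illustration.

```python
def execute_bpl(pc, ir_b, n_flag, execute_bpl_signal=1):
    """Return the new PC after a BPL; the transfer PC <- IR_b is gated by the N flag."""
    gate = execute_bpl_signal & (1 - n_flag)   # AND of the control signal and NOT N
    return ir_b if gate else pc                # the transfer happens only if the gate is open


print(execute_bpl(pc=191, ir_b=158, n_flag=0))  # 158: sign is positive, branch taken
print(execute_bpl(pc=191, ir_b=158, n_flag=1))  # 191: sign is negative, PC unchanged
```

The control unit sends the same unconditional signal in both cases; only the state of the flag decides whether it reaches the PC.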
[Figure 1.12: two CPU bus organizations, (a) with auxiliary and result registers and (b) the triple-bus design, each connecting the control unit, PC, IR, registers, and ALU]
The advantage of this CPU organization is that only one bus is needed. However, some auxiliary
registers are required and some operations require more than one step.
The triple-bus CPU design is shown in Figure 1.12b (compare this with Figure 4.1). The control unit
and the registers can send data to, and receive data from, any of the three data buses. The ALU receives
its two operands from data buses A and B, and sends its output to data bus C. There is no need for any
auxiliary registers but three buses are needed.
1.6 Microprogramming
The control unit has a separate circuit for the execution of each instruction. Each circuit sends a sequence
of control signals, and the total of all the circuits is called the sequencer. The advantage of this arrangement
is that all the information about instruction execution is hardwired into the sequencer, which is the fastest
way to execute instructions. There is, however, another way of designing the control unit. It is called
microprogramming and it results in a microprogrammed control unit. This section is a short introduction to
microprogramming. Chapter 4 discusses this topic in detail.
The basic idea of microprogramming is to have all the information on instruction execution stored
in a special memory instead of being hardwired. This special memory is called the control store and it
is not part of the main memory of the computer. Clearly, microprogramming is slower than hardwired
execution because information has to be fetched constantly from the control store. The advantages of
microprogramming, however, far outweigh its low speed, and most current computers are (fully or partly)
microprogrammed. Most computer users, even assembler language programmers, do not know that the
computer is microprogrammed or even what microprogramming is. Programs are written and compiled
normally, and the only difference is in the way the control unit executes the instructions. This is why
microprogramming is considered a low-level feature. The term “low level” means closer to the hardware and
microprogramming is as close to the hardware as one can get.
What exactly is stored in the control store? Given a typical sequence for the execution of an instruction:
a←b
“read”
c←d
wait for memory feedback
e←f
we can number each signal and store the numbers in the control store. The sequence above could be stored
as, say 122, 47, 84, 115, 7; the sequence below could perhaps be stored as 84, 48, 55, 115, 7.
c←d
“write”
x←y
wait for memory feedback
e←f
Note that signals 84, 115, and 7 mean the same in the two sequences.
To design a microprogrammed control unit, the designers have to number every signal that the control
unit can generate, decide on the execution sequence for each instruction, and store each sequence as a string
of numbers in the control store.
Executing an instruction thus involves fetching its execution sequence, number by number, from the
control store and executing each number in the sequence by sending the signal that corresponds to that
number. The number is sent to a decoder that generates a unique output for each number that it receives.
The output is the control signal that corresponds to the number. The sequence of numbers for the execution
of an instruction S is called the microprogram of S.
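The numbering scheme can be sketched in a few lines of Python. The signal numbers are those of the example above; the instruction names INSTR_1 and INSTR_2 are hypothetical and serve only to label the two sequences. This is an illustration of the idea, not a description of any real control unit.

```python
# Decoder: maps each signal number to the control signal it stands for.
# The numbers are the ones used in the example in the text.
DECODER = {
    122: "a<-b",
    47: "read",
    84: "c<-d",
    115: "wait for memory feedback",
    7: "e<-f",
    48: "write",
    55: "x<-y",
}

# Control store: one microprogram (a sequence of signal numbers) per
# instruction. The instruction names are hypothetical.
CONTROL_STORE = {
    "INSTR_1": [122, 47, 84, 115, 7],
    "INSTR_2": [84, 48, 55, 115, 7],
}


def execute(instruction):
    """Fetch the microprogram number by number and decode each number."""
    return [DECODER[number] for number in CONTROL_STORE[instruction]]


print(execute("INSTR_2"))
```

Changing the instruction set in this sketch means editing the lists in CONTROL_STORE; the decoder stays the same.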
We therefore conclude that a hardwired control unit knows how to execute all the instructions in
the instruction set of the computer. This knowledge is built into the sequencer and is permanent. In
contrast, a microprogrammed control unit does not know how to execute instructions. It knows only how
to execute microinstructions. It has hardware that fetches microinstructions from the control store and
executes them. Such a control unit has a micro program counter (MPC) and a micro instruction register
(MIR). The execution sequences are now stored in the control store, each as a sequence of numbers. This
makes it easy to modify the instruction set of the computer and it constitutes one of the advantages of
microprogramming. The numbers in the control store are much simpler to execute than machine instructions,
so a microprogrammed control unit is simpler than a hardwired control unit. Figure 1.13 shows the main
components and paths of a microprogrammed control unit.
Figure 1.13: A microprogrammed control unit (labels in the figure: opcode field of the IR, PLA, MPC, incrementer, decoder, control signals)
It turns out that more than just signal numbers can be stored in the control store. In practice, each
control store location contains a microinstruction with several fields. The fields can specify several control
signals that should be generated, and can also indicate a jump to another microinstruction.
To fully understand microprogramming, two more points need to be clarified: (1) How does the control
unit find the start of the sequence of microinstructions for a given instruction, and (2) what does it do when
it has finished executing such a sequence?
The answer to the first question is: When an instruction is fetched into the IR, the control unit uses its
opcode to determine the address in the control store where the microinstruction sequence for that instruction
starts. The address may be the opcode itself, it may be a function of the opcode or, what is most common,
the address may be obtained from the opcode by means of a small ROM. In this method, the opcode is
used as an address to this ROM and the content of that address is the start location of the microinstruction
sequence in the control store. In practice, such a small ROM is constructed as a programmable logic array
(PLA) (Section 8.4).
The answer to the second question is: When the control unit has executed a complete sequence of
microinstructions, it should go back to Step 1 of the fetch-execute cycle and fetch the next machine instruction
as fast as possible. Here we see one of the advantages of microprogramming. There is no need for any special
circuit to fetch the next instruction. A proper sequence of microinstructions in the control store can do this
job. As a result, there should be at the end of each sequence a special microinstruction or a special number,
say 99, indicating the register transfer MPC ← 0. This register transfer moves zero into the MPC, causing
the next microinstruction to be fetched from location 0 of the control store. Location 0 should be the start
of the following sequence:
1. MAR ← PC
2. “read”
3. wait for memory feedback
4. IR ← MBR
5. Increment PC
that fetches the next instruction.
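The two answers can be combined in a small simulation. It is only a sketch under several assumptions: the opcodes LOAD and STORE, the control store addresses, and the DISPATCH entry (which stands for "load the MPC from the small ROM addressed by the opcode") are all hypothetical, and the program is a non-empty list of opcodes that stands in for main memory.

```python
END = 99  # special number meaning MPC <- 0

# Control store. Location 0 starts the fetch sequence; each instruction
# sequence ends with END. Addresses and instruction names are hypothetical.
CONTROL_STORE = [
    "MAR<-PC", "read", "wait", "IR<-MBR", "PC<-PC+1", "DISPATCH",  # 0..5
    "a<-b", "read", "c<-d", END,                                   # 6..9  LOAD
    "c<-d", "write", "x<-y", END,                                  # 10..13 STORE
]

# Small ROM (the PLA): opcode -> start address in the control store.
MAPPING_ROM = {"LOAD": 6, "STORE": 10}


def run(program):
    """Return the (address, microinstruction) pairs executed for a program."""
    trace = []
    mpc = 0      # micro program counter
    pc = 0       # program counter, an index into the list of opcodes
    ir = None    # instruction register
    while True:
        micro = CONTROL_STORE[mpc]
        if micro == END:
            if pc == len(program):   # no more instructions: stop the sketch
                return trace
            mpc = 0                  # MPC <- 0, back to the fetch sequence
        elif micro == "DISPATCH":
            mpc = MAPPING_ROM[ir]    # start of the sequence for this opcode
        else:
            if micro == "IR<-MBR":
                ir = program[pc]
            elif micro == "PC<-PC+1":
                pc += 1
            trace.append((mpc, micro))
            mpc += 1                 # the incrementer


for address, micro in run(["LOAD", "STORE"]):
    print(address, micro)
```

No separate fetch circuit appears anywhere in the sketch; fetching is just another sequence in the control store.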
More details, as well as a complete example of a simple, microprogrammed computer, can be found in
Chapter 4.
1.7 Interrupts
Interrupts are a very important feature of the computer, so much so that even the smallest, most primitive
microprocessors can handle at least one interrupt. Unlike many other computer features that are nebulous
and hard to define, interrupts are easy to define.
Definition: Interrupts are the way the computer responds to urgent or unusual conditions.
Examples of such conditions are:
1. Invalid opcode. When the control unit decodes the current instruction (in Step 3 of the fetch-execute
cycle), it may find that the opcode is invalid (i.e., unused). On many computers, not all opcodes are used.
Some opcodes do not correspond to any instruction and are considered invalid. When the control unit
identifies such an opcode, it generates an interrupt, since this is both an unusual and an urgent condition.
It has to be handled immediately and the program cannot continue as usual.
2. Zero-divide. When the ALU starts executing a “divide” instruction, it first tests the divisor. If the
divisor is zero, the ALU generates an interrupt since this is an urgent condition; the division cannot take
place.
3. Timer. On many computers, there is a timer that can be set, by software, to any time interval. When
execution time reaches this interval, the timer senses it and generates an interrupt.
4. Software interrupt. When the program wants to communicate with the operating system, it can
artificially generate an interrupt. This interrupt is called a break, a software interrupt, or a supervisor call.
It is an important feature of the computer and is discussed later in this section.
5. Many hardware problems, such as a parity error (in memory or during I/O) or a drop in the voltage
can cause interrupts, if the computer is equipped with the appropriate sensors to sense those conditions and
to respond accordingly.
6. I/O interrupts. Those are generated by an I/O device or by the I/O processor. Such an interrupt
indicates that an I/O process has completed or that an I/O problem has been detected. Those interrupts
are discussed in Section 1.8.
A good way to understand interrupts is to visualize a number of interrupt request lines (IRQs) arriving
at the control unit from various sources in the computer (Figure 1.14). When any interrupt source decides
to interrupt the control unit, it sends a signal on its IRQ line. The control unit should test those lines from
time to time to discover all pending interrupts, and respond to them.
[Figure 1.14: IRQ lines from various sources arrive at the control unit, which senses them in Step 6]
Testing the IRQ lines is done in a new step, Step 6 of the fetch-execute cycle. The cycle becomes:
1. Fetch.
2. Increment PC.
3. Decode.
4. Calculate EA and fetch operand.
5. Execute.
6. Test all IRQ lines. If any of them is high, perform the following:
6.1 Save the PC.
6.2 Set the PC to a memory address, depending on the interrupt.
6.3 Disable all interrupts.
6.4 Set processor mode to ‘system’.
After Step 6.4, the control unit continues its cycle and goes back to step 1 to fetch the next instruction.
However, if any IRQ line was tested high in step 6, the next instruction will be fetched from the address sent
to the PC in Step 6.2. In fact, Steps 6.1 and 6.2 are equivalent to a procedure call (compare them with the
example of executing a procedure call instruction). The conclusion is that the control unit, upon sensing an
interrupt, performs a forced procedure call. All interrupts are treated by the control unit in the same way,
the only difference being the address used in Step 6.2. Different addresses are used for different interrupts,
resulting in different procedures being called.
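Step 6 can be sketched as a function that the control unit would apply after every instruction. The state dictionary, the handler addresses, and the rule that the lowest-numbered high line is served first are assumptions made for this illustration only.

```python
def step6(pc, irq_lines, handler_addresses, state):
    """Step 6 of the cycle: test the IRQ lines and force a procedure call."""
    if not state["interrupts_enabled"]:
        return pc                                # nothing is sensed
    for i, line_is_high in enumerate(irq_lines):
        if line_is_high:
            state["saved_pc"] = pc               # 6.1 save the PC
            pc = handler_addresses[i]            # 6.2 address depends on the interrupt
            state["interrupts_enabled"] = False  # 6.3 disable all interrupts
            state["mode"] = "system"             # 6.4 set processor mode
            break
    return pc                                    # Step 1 fetches from this address


state = {"interrupts_enabled": True, "mode": "user", "saved_pc": None}
next_pc = step6(104, [False, True, False], [1000, 2000, 3000], state)
print(next_pc, state)
```

The only thing that distinguishes one interrupt from another here is the entry of handler_addresses that is moved into the PC.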
It is now clear that the computer handles an interrupt by artificially calling a procedure. This mechanism
is also called a forced procedure call. The control unit forces a call to a procedure each time an interrupt
occurs. Figure 1.15 shows how control is switched from the interrupted program to the procedure and back.
We now have to have a number of interrupt handling procedures, one for each interrupt. Where do those
procedures come from? Obviously, they have to be written by someone, loaded into memory, and wait there.
Those procedures are never explicitly called by any program. In fact, most computer users are not even
aware of their existence. They lie dormant, perhaps for a long time, and are invoked only when something
unusual happens in the computer and an interrupt signal is issued and is sensed by the control unit in Step 6
of its cycle. We say that the interrupt handling routines are called automatically, which means the calls are
forced by the control unit.
[Figure 1.15: the interrupted program passes control to the handling routine (1. interrupt) and is resumed when the routine returns (2. return)]
An interrupt handling routine is a piece of software and can do anything that a program can do. For
example, the handling routine for an invalid opcode should print an appropriate error message, should
also print the saved value of the PC (giving the user an indication of where the error occurred), and should
probably terminate the program, returning control to the operating system in order to start another program.
The handling routine for a zero-divide interrupt should also print an error message, print the saved value
of the PC, then try to communicate with the user. In some cases the user should have a good idea of what
happened and may be able to suggest a value for the divisor, or tell the routine to skip the division altogether.
In such cases the interrupted program may be resumed. To resume a program, the handling routine should
use the saved value of the PC.
In summary, a good computer design calls for various sensors placed in the computer to sense urgent and
unusual conditions. Upon sensing such a condition, the sensor generates an interrupt signal. The control unit
tests the IRQ lines in Step 6, after it has finished executing the current instruction (or after it has decided
not to execute the current instruction because of a problem). If any IRQ line is high, the control unit forces
a procedure call to one of the interrupt handling routines. The routine must be prewritten (normally by
a systems programmer) and is only invoked if a particular interrupt has occurred. Each handling routine
should only handle one interrupt and it can do that in any way. It can print an error message, change any
values in memory or in the registers, terminate the user program or resume it, or do anything else.
Note that the IRQ lines are only tested in Step 6, i.e., between instructions. It is important to service
an interrupt only between instructions and not immediately when it occurs. If the control unit stops in the
midst of an instruction to service an interrupt, it may not be able later to complete the execution of the
instruction, and thus may not be able to resume the program.
If each interrupt routine is devoted to one interrupt, we say that the interrupts are vectored. In many
old microprocessors, there is only one IRQ line and, as a result, only one interrupt handling routine. All the
interrupt sources must be tied to the single IRQ line and the first task of the interrupt handling routine is to
identify the source of the interrupt. This is time consuming and is the main reason why vectored interrupts
are considered important.
Most computers support several interrupts and have to deal with two simple problems. The first is
simultaneous interrupts, a problem easily solved by implementing interrupt priorities. The second is nested
interrupts, solved by adding instructions (i.e., hardware) for enabling and disabling interrupts.
Simultaneous interrupts do not have to be issued at the same time. If several IRQ lines go high at
different times during Steps 1–5, then in Step 6, when the IRQ lines are tested, the control unit considers
them simultaneous. Simultaneous interrupts are a problem because the computer can handle only one
interrupt at a time. The solution is to assign a priority to each interrupt. If Step 6 finds several IRQ lines
high, the control unit responds to the one with the highest priority and lets the others wait.
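A minimal sketch of this selection follows. The priority table is hypothetical, and the convention that a larger number means a higher priority is an assumption.

```python
def select_interrupt(pending, priority):
    """Return the pending interrupt with the highest priority, or None."""
    if not pending:
        return None
    return max(pending, key=lambda irq: priority[irq])


# Interrupts 2, 5, and 7 went high at different times during Steps 1-5,
# so Step 6 sees them as simultaneous.
priority = {2: 1, 5: 9, 7: 4}
print(select_interrupt({2, 5, 7}, priority))
```

The interrupts that are not selected stay pending and are considered again the next time Step 6 tests the lines.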
The problem of nested interrupts has to do with the interrupt handling routine. The routine is a piece
of software, so it can operate only by executing instructions. While the routine is running, many kinds of
interrupts may be issued. The problem is: Should a given interrupt be allowed to nest the handling routine
or should it wait? The most general answer is: It depends. For example, the interrupt handling routine for
interrupt 6 may not want interrupts 1, 4, and 7 to nest it. It may decide, however, that interrupts 2, 3, and
5 are important and deserve fast response. In such a case, it may let those interrupts nest it.
The general solution is therefore to add instructions (i.e., hardware) to disable and enable interrupts.
A computer may have two such instructions, one to enable and another to disable individual interrupts.
Examples are ‘EI 5’ (Enable Interrupt 5), ‘DI 3’ (Disable Interrupt 3), EA (Enable All interrupts). Step 6.3,
where all interrupts are disabled, is also part of the extra hardware. After this step, the control unit goes to
step 1, where the next instruction is fetched. It is fetched from the handling routine, which means that when
this routine is invoked, all interrupts have been disabled and the routine cannot be nested by any interrupts.
If the handling routine decides that, for example, interrupts 2, 3, and 5 are important and should be allowed
to nest it, it can start by executing ‘EI 2’, ‘EI 3’, and ‘EI 5’.
Exercise 1.12: This discussion suggests that there is no need for instructions that disable interrupts. Is
that true?
When the handling routine completes its task, it should enable all the interrupts and return. This is a
simple task but it involves a subtle point. We want to enable all interrupts and return, but if we execute
the two instructions EA and RET, the control unit will sense the freshly enabled interrupts at Step 6 of the
EA instruction, and will invoke a new handling routine before it gets to the RET. Here are the details.
The control unit fetches and executes the EA instruction in Steps 1–5. However, when the EA is executed,
in Step 5, all interrupts become enabled. When the control unit checks the IRQs, in Step 6, it may find
some IRQ lines high. In such a case, the control unit executes substeps 6.1–6.4, then goes to fetch the next
instruction. This next instruction comes from the new interrupt handling routine, and the routine starts its
execution. All the while, the RET instruction hasn’t been executed, which means that the original interrupt
handling routine did not return. It has effectively been nested at the last moment.
To prevent this situation, the computer has to have a special instruction that enables all interrupts and
also returns (by popping the stack and moving this data to the PC).
We now turn to the details of Step 6.2 above. How does the control unit know the start addresses of all
the interrupt service routines? Here are three approaches to this problem:
1. The start addresses of the service routines are permanent and are built into the control unit. This
is a bad choice, since the operating system is updated from time to time, and each update may move many
routines to new locations.
2. Certain memory locations are reserved for these addresses. Imagine a 16-bit computer with 32-bit
addresses. An address in such a computer is therefore two words long. If the computer has, say, 10 interrupts,
the first 20 memory locations are reserved. Each time the operating system is loaded into memory, it loads
the start addresses of the ten interrupt service routines into these 20 locations. Whenever the operating
system is modified, the routines move to new start addresses, and the new operating system stores their new
addresses in the same 20 locations. If the control unit senses, in Step 6, that IRQi is high, it fetches the
two-word address stored in locations 2i and 2i + 1 (numbering the interrupts from 0) and moves it into the PC
in Step 6.2. This method of obtaining the start addresses is called vectored interrupts.
3. The control unit does not know the start addresses of the service routines and expects to receive each
address from the interrupting device. In step 6.2, the control unit sends an interrupt acknowledge (IACK)
signal to all the interrupting devices. The device that has sent the original IRQ signal responds to the IACK
by dropping its original interrupt request and placing an address on the data bus. (Normally, of course,
addresses are placed on the address bus, not on the data bus. They are also placed there by the control
unit, not by any device, so this is an unusual situation. Also, in the case of a 16-bit computer with 32-bit
addresses, the address should be placed on the 16-bit data bus in two parts.) After sending the IACK signal,
the control unit inputs the data bus and moves its content to the PC. This method too is referred to as
vectored interrupts.
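The second approach (reserved memory locations) can be sketched for the 16-bit computer with 32-bit addresses. So that the ten two-word entries do not overlap, the sketch places the address of interrupt i (numbered from 0) in words 2i and 2i + 1, high word first; both the layout and the sample address are assumptions.

```python
WORD_MASK = 0xFFFF  # one 16-bit memory word


def store_vector(memory, i, address):
    """Done by the operating system each time it is loaded."""
    memory[2 * i] = (address >> 16) & WORD_MASK   # high word
    memory[2 * i + 1] = address & WORD_MASK       # low word


def fetch_vector(memory, i):
    """Done by the control unit in Step 6.2 when IRQi is high."""
    return (memory[2 * i] << 16) | memory[2 * i + 1]


memory = [0] * 20                     # the 20 reserved locations
store_vector(memory, 3, 0x0001F2A0)   # hypothetical start address
print(hex(fetch_vector(memory, 3)))
```

When the operating system is updated, only the calls to store_vector change; the control unit keeps reading the same 20 locations.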
Regardless of what method is used to identify the start address in step 6.2, the interrupt acknowledge
signal is always important. This is because the interrupting device does not know how long it will take the
control unit to respond to the interrupt, and thus does not know when precisely to drop its IRQ signal and
to place the address on the data bus. The control unit may take a long time to respond to an interrupt if
the interrupt is disabled or if it has low priority.
Ideally, each interrupt should have its own line to the control unit, but this may not always be possible
for the following two reasons:
1. If the CPU is a microprocessor (a processor on a chip), there may not be enough available pins on
the chip to accommodate many interrupts. A typical microprocessor has just one pin for interrupt request
lines.
2. A computer may be designed with, say, 16 interrupt request lines but may be used in an application
where there are many interrupting devices. An example is a computer controlling traffic lights in many
intersections along several streets. At each intersection, there are four sensors buried under the street
to detect incoming cars. When a sensor detects a car moving over it, it sends a pulse to the computer.
Since the computer is much faster than any car, we can expect the computer to execute many thousands of
instructions between receiving consecutive pulses. It therefore makes sense to send the pulses to the computer
as interrupts. Such a computer can easily have hundreds of sensors sending it interrupts. It is possible to
design a special-purpose computer for this application, but it makes more sense to use a general-purpose
computer. A general-purpose computer is readily available, cheaper, and already has an operating system
with a compiler and useful utilities.
We therefore consider the extreme case of a computer with just one interrupt request line. All the
interrupt sources send their interrupt signals to this single line, and the interrupt system (i.e., the interrupt
hardware in the control unit and the interrupt service routines) should be able to identify the source of any
interrupt. We discuss two ways to design such an interrupt system, one using a daisy chain and the other
using a priority encoder.
In a daisy chain configuration, interrupt signals from all the interrupt sources are ORed and are sent to
the control unit. The acknowledge signal (IACK) sent back by the control unit arrives at the first device. If
this device has sent the interrupt signal, then it drops its IRQ and places the address of its service routine
on the data bus; otherwise, it forwards the IACK to the next device on the chain (Figure 1.16).
[Figure 1.16: a daisy chain of interrupting devices, showing the data bus and the IRQ line]
An advantage of daisy chaining is that it implicitly assigns priorities to the various devices. Imagine a
case where device 3 sends an interrupt signal at a certain point in time, followed a little later by an interrupt
signal from device 2. When the IACK arrives from the control unit, device 2 sees it first and responds to
it by (1) dropping its interrupt request signal and (2) placing the address of its service routine on the data
bus. The control unit moves this address to the PC in step 6.2, and disables all the interrupts in step 6.3.
The service routine for device 2 executes, then enables the interrupts and returns. At that point, the control
unit starts checking interrupts again in step 6.1 (recall that they had been disabled while the service routine
executed) and discovers that there is an interrupt request pending (the original IRQ from device 3). The
control unit sends an IACK that is now forwarded all the way to device 3, which responds to it.
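The way the IACK travels down the chain can be sketched as follows. The device records and the service routine addresses are hypothetical; the list order stands for the position of each device on the chain.

```python
def daisy_chain_iack(devices):
    """Pass the IACK down the chain; the first requesting device answers."""
    for device in devices:             # listed in chain order
        if device["irq"]:
            device["irq"] = False      # the device drops its request
            return device["address"]   # address placed on the data bus
        # otherwise the device forwards the IACK to the next one
    return None                        # no device was requesting


chain = [
    {"name": "device 1", "irq": False, "address": 0x1000},
    {"name": "device 2", "irq": True, "address": 0x2000},
    {"name": "device 3", "irq": True, "address": 0x3000},
]
print(hex(daisy_chain_iack(chain)))   # device 2 answers first
print(hex(daisy_chain_iack(chain)))   # then device 3
```

Device 2 is served before device 3 even though device 3 interrupted earlier, which is the implicit priority described above.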
In a priority encoder configuration (Figure 1.17), interrupt signals from the currently-enabled interrupt
sources are sent to a priority encoder (Section 7.7.1) and are also ORed and sent to the control unit. The
IACK signal from the control unit is used to enable the priority encoder which then places the start address
of the interrupt with the highest priority on the data bus. There is still the problem of making the highest-
priority device drop its interrupt request signal. This can be done either by the priority encoder sending
this device a fake IACK signal, or by the service routine. If the device is sophisticated enough, its interrupt
service routine can send it a command to drop its interrupt request signal.
[Figure 1.17: devices 1 through n send IRQ1 through IRQn, gated by the mask register, to the priority encoder, which drives the data bus when enabled; the requests also go to the control unit]
While the service routine is executing, all interrupts are disabled, so the control unit does not check the
(single) IRQ in step 6. As soon as the routine enables interrupts and returns (these two tasks must be done
by one instruction), the control unit checks the IRQ line in step 6. If it is high, the control unit sends an
IACK, which enables the priority encoder, and the entire process repeats.
Notice how the interrupt request signals from the various devices can be disabled by the mask register.
Each bit in this register corresponds to one interrupting device. Any bit can be set or cleared by special
instructions. If bit i is zero, an interrupt request signal from device i will not reach the priority encoder.
Device i will either drop its request signal eventually or wait until the program sets bit i of the mask
register.
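The combination of mask register and priority encoder can be sketched with integers used as bit patterns. The rule that a higher bit position means a higher priority, and the addresses in the table, are assumptions made for the example.

```python
def priority_encoder(irq, mask, addresses):
    """Return the start address for the highest-priority unmasked request.

    irq and mask are bit patterns; bit i belongs to device i.
    """
    enabled = irq & mask              # a zero mask bit blocks the request
    if enabled == 0:
        return None                   # nothing reaches the encoder
    i = enabled.bit_length() - 1      # assumption: higher bit, higher priority
    return addresses[i]


addresses = [0x100, 0x200, 0x300, 0x400]   # hypothetical service routines
print(hex(priority_encoder(0b0110, 0b1111, addresses)))   # device 2 wins
print(hex(priority_encoder(0b0110, 0b0011, addresses)))   # device 2 masked
```

In the second call the request from device 2 is still pending; it is simply not seen until the program sets bit 2 of the mask.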
Software interrupts have already been mentioned. This is a standard way for a user program to request
something from, or to transfer control to, the operating system. The user program may need to know the
time or the date, it may need more memory, or it may want to access a shared I/O device, such as a large
disk. At the end of its execution, a user program should notify the operating system that it has finished
(a normal termination). When the program discovers an error, it may decide to terminate abnormally. All
these cases are handled by generating an artificial interrupt and a special code that describes the case. This
interrupt is called a break or a software interrupt. It is generated by the program by executing a special
instruction (such as BRK) with an operand that’s a code. The interrupt is handled in the usual way and a
handling routine (which is part of the operating system) is invoked. The routine handles the interrupt by
examining the code and calling other parts of the operating system to perform the requested services.
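The handling routine for a software interrupt is essentially a dispatcher on the code. The sketch below assumes three made-up request codes and three stand-in services; a real operating system defines its own codes and services.

```python
# Stand-ins for parts of the operating system (hypothetical).
def get_time():
    return "time"


def allocate_memory():
    return "memory"


def terminate_normally():
    return "terminated"


# Hypothetical request codes, the operand of the BRK instruction.
SERVICES = {1: get_time, 2: allocate_memory, 3: terminate_normally}


def break_handler(code):
    """Examine the code and call the part of the system that serves it."""
    service = SERVICES.get(code)
    if service is None:
        return "invalid request code"
    return service()


print(break_handler(1))
```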
Another example of a privileged instruction is any I/O instruction on a multiuser computer. Such a
computer normally has large disk units that should be accessed only by the operating system, since they
are shared by all users. A user program that needs I/O can only place a request to the operating system to
perform the I/O operation, and the operating system is the only program in the computer that can access
the shared I/O devices.
Exercise 1.13: The operating system sets the mode to “user” by means of an instruction. Does this
instruction have to be privileged?
Figure 1.18: Time slices
The only problem is: Once a user program takes control, the operating system cannot stop it, because
the operating system itself is a program, and it needs the use of the processor in order to do anything. The
solution is to use interrupts. The hardware of a multiuser computer must include a timer, which is a special
memory location whose content is decremented by special hardware typically once every microsecond. When
the timer reaches zero, it issues an interrupt.
When the operating system decides to start a time slice, it loads the timer with the value 1000, and
executes a jump to the user program. After 1000 microseconds (one millisecond) the timer reaches zero,
a timer interrupt occurs, and control is switched to the interrupt handling routine, which is part of the
operating system. The system then saves the return address (the address where the user program should
be restarted in its next time slice) and all the registers, and allocates the next time slice to another user
program.
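The role of the timer can be sketched as follows. The text says only that the next slice goes to another user program; the round-robin order used here, and the program numbers, are assumptions.

```python
from collections import deque


def run_time_slices(programs, slice_us=1000, total_slices=7):
    """Return the order in which user programs receive time slices."""
    ready = deque(programs)
    schedule = []
    for _ in range(total_slices):
        program = ready.popleft()
        timer = slice_us           # the operating system loads the timer
        while timer > 0:           # hardware decrements it every microsecond
            timer -= 1
        # The timer reached zero: a timer interrupt returns control to the
        # operating system, which saves the program's state and moves on.
        schedule.append(program)
        ready.append(program)
    return schedule


print(run_time_slices([1, 2, 3], total_slices=5))
```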
Processor modes and time slices are just two examples of important features that are implemented by
means of interrupts. They are two reasons why interrupts are so important. Interrupts provide a uniform
mechanism for handling many different situations.
1.8 I/O Interrupts
that such a loop can iterate 10000 times or more, which explains why register I/O is so slow. It should be
emphasized that—during such a loop—the processor has to execute the BNR instruction and is, therefore,
tied up and cannot do anything else.
Exercise 1.14: Laser printers are much more common today than dot matrix printers. A laser printer does
not print individual characters but rather rows of small dots [typically 600 dots per inch (dpi)]. In such a
case, is there still a need to check printer status?
A similar example is keyboard input. A keyboard is a slow device since few people can type more than
10 characters per second. It is also used for small quantities of input. Before issuing an IN instruction to the
keyboard, the program has to check keyboard status to make sure that there is something in the keyboard
to input. After all, the user may be sitting at the keyboard daydreaming, thinking, or hesitating. Even
assuming the user to type at maximum speed (about 10 characters/second), the computer can easily execute
100000 instructions between the typing of two consecutive characters. Again the result is a tight loop where
the processor has to spend a lot of time for each character being input:
LP: BNR 5,LP
IN 5,R6
Figure 1.19: Connecting interrupt request lines. Panels (a) and (b) show IRQ lines connected to a control unit; panel (c) shows a microprocessor with an IRQ pin.
package. The number of pins is limited and the typical microprocessor is limited to just one or two interrupt
request pins (Figure 1.19c).
The process of interrupting a microprocessor is summarized in Figure 1.20a. The process starts when
an interrupt request (IRQ) signal is sent from an interrupting source to the IRQ pin in the microprocessor
(1). The microprocessor completes the current instruction and then, if interrupts are enabled inside the
microprocessor, sends an acknowledge signal through the IACK pin. The acknowledge signal arrives at all
the interrupting devices (2) and it causes the one that has interrupted to drop the IRQ signal (3).
Figure 1.20: Timing diagrams for interrupting a microprocessor. Panel (a) shows the IRQ and IACK signals with the numbered events 1, 2, and 3; panel (b) shows the signals IRQi, IRQ, IACK, IACKi, and Data.
A PIC on a chip
In order to implement vectored interrupts in a microprocessor-based computer, an additional chip, a support
chip, is used. It is called an interrupt controller or, if it can be programmed, a Programmable Interrupt
Controller (a PIC). The name PIC sometimes stands for a Priority Interrupt Controller.
Only a programmable PIC is discussed here. Such a device has four main functions:
1. It has enough pins to accept vectored interrupts from n sources.
2. Once it is interrupted, it interrupts the microprocessor and causes the right handling routine to be
invoked.
3. If two or more sources interrupt the PIC at the same time, it decides which source has the highest
priority, starts the microprocessor on that interrupt, and saves the other ones, to be serviced as soon as
the microprocessor is done with the current interrupt. The interrupt priorities are not built-in and can be
changed under program control.
4. If the program decides to ignore (disable) certain interrupts, it can program the PIC to do so.
The main advantage of the PIC is point 2 above. The PIC identifies the specific interrupt source
and decides which handling routine should be invoked. To actually invoke the routine, the PIC sends its
start address to the microprocessor. Since the handling routines may be written by the user, they can
start anywhere in memory, so the PIC cannot know their start addresses in advance. The program should,
therefore, send the start addresses of all the handling routines to the PIC. This is done by means of commands
sent to the PIC early in the program.
The PIC commands are sent by the microprocessor as regular output. The PIC therefore should have
a device select number, much like any other I/O device. Since the commands represent a low-volume I/O,
they are sent to the PIC using register I/O.
Figure 1.21 shows a typical PIC connected in a microcomputer system. The address lines carry the
device select number to the PIC in the same way they carry it to all other I/O devices. The commands
are sent to the PIC on the data bus and the start addresses are sent, by the PIC, on the same bus, in
the opposite direction. The PIC interrupts the microprocessor through the IRQ pin and then waits for an
acknowledge from the microprocessor on the IACK pin. At that point, the PIC sends an acknowledge pulse
to the interrupting device. Figure 1.20b is a timing diagram summarizing the individual steps involved in a
typical interrupt.
Figure 1.21: Interfacing a PIC to a microprocessor (the memory, the microprocessor, and the PIC are connected to the address bus and the data bus; the interrupt sources are connected to the PIC)
A good example of a common, simple PIC is the Intel 8259. In addition to the features described here,
the 8259 has a master-slave feature, allowing up to nine identical PICs to be connected to one microprocessor
in a configuration that can service up to 64 interrupts. For more information on the 8259, see [Intel 87] and
[Osborne and Kane 81].
Figure 1.22: ALU inputs and outputs (left operand, right operand, auxiliary input, output, status flags)
The first example is addition, which is one of the most important and common operations. To add two
binary numbers, we first have to know how to add two bits, which involves the following steps:
1. The truth table should be written (Figure 1.23b). It shows the output for every possible combination
of the inputs.
2. The truth table is analyzed and the outputs are expressed as logical functions of the inputs.
3. The circuit is designed out of logical elements such as gates and flip-flops. Such a circuit is called a
half adder.
The half-adder is a very simple circuit but can only add a pair of bits. It cannot be used for adding
numbers. Adding binary numbers involves two operations, adding pairs of bits and propagating the carry.
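Figure 1.23: A half adder. Truth table (inputs A and B, outputs S and C):
A B S C
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1
Logical functions: $S = \bar{A}B + A\bar{B}$, $C = AB$. The remaining panels show the circuit and the block symbol of the half adder.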
To propagate the carry we need a circuit that can add two bits and a carry. Unlike the half-adder, such a
circuit has three inputs and two outputs. It is called a full adder, and its truth table and logical design are
shown in figure 1.24a–d. Figure 1.24e,f shows how a full-adder can be built by combining two half-adders.
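Truth table (Figure 1.24a):
  A B T   S C
  0 0 0   0 0
  0 1 0   1 0
  1 0 0   1 0
  1 1 0   0 1
  0 0 1   1 0
  0 1 1   0 1
  1 0 1   0 1
  1 1 1   1 1
Logic equations:
$S = \bar{A}B\bar{T} + A\bar{B}\bar{T} + \bar{A}\bar{B}T + ABT$
$C = AB\bar{T} + \bar{A}BT + A\bar{B}T + ABT$
Panel (g) labels: inputs $A_n B_n, \ldots, A_2 B_2, A_1 B_1, A_0 B_0$ and carry $C_{n-1}$.
Figure 1.24: A full adder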
The full adder can serve as the basis for an adding device that can add n-bit numbers. Figure 1.24g
shows the so-called parallel adder. This device is a simple combination of one half-adder and n−1 full-adders,
where the carry that is output by adder i is propagated to become the third input of full-adder i + 1. In a
truly parallel adder, the execution time is fixed, but the execution time of our parallel adder is proportional
to n, so this device is not truly parallel. To understand how it works, let’s concentrate on the two rightmost
circuits, the half-adder 0 and the full-adder 1. They receive their inputs simultaneously and their outputs
are therefore ready at the same time. However, the outputs of full-adder 1 are initially wrong since
this device hasn’t received the correct input C0. Once the right C0 is generated by half-adder 0 and is
propagated to full-adder 1, the outputs of that full-adder become correct, and its C1 output propagates to
full-adder 2, which has, up until now, produced bad outputs.
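The carry propagation described above can be imitated in software. The following Python sketch is only an illustration, not a hardware description: the function names are made up for this example, and bits are stored least-significant first in ordinary lists. It builds a full adder out of two half adders and then chains one half adder and several full adders into the parallel adder.

```python
def half_adder(a, b):
    """Add two bits; return (sum, carry) as in the half-adder truth table."""
    s = (a & (1 - b)) | ((1 - a) & b)   # S = A'B + AB'
    c = a & b                           # C = AB
    return s, c

def full_adder(a, b, t):
    """Add two bits and an incoming carry t by combining two half adders."""
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, t)
    return s, c1 | c2

def ripple_add(a_bits, b_bits):
    """Add two equal-length bit lists (least-significant bit first).

    Stage 0 is a half adder; every later stage is a full adder whose
    third input is the carry produced by the stage before it.
    """
    s0, carry = half_adder(a_bits[0], b_bits[0])
    out = [s0]
    for a, b in zip(a_bits[1:], b_bits[1:]):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

# 6 + 7 = 13 with 4-bit operands (bits listed least-significant first)
print(ripple_add([0, 1, 1, 0], [1, 1, 1, 0]))   # ([1, 0, 1, 1], 0)
```

The loop visits the stages one after the other, which mirrors the reason the execution time of this adder grows with n: stage i+1 cannot produce a correct result before stage i has delivered its carry.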
Figure 2.1 illustrates typical instruction formats. They range from the simplest to the very complex
and they demonstrate the following properties of machine instructions:
34 2. Machine Instructions
Instructions can have different sizes. The size depends on the number and nature of the individual
fields.
The opcode can have different sizes. This property is discussed in Section 2.2. The opcode can also
be broken up into a number of separate fields.
A field containing an address is much larger than a register field. This is because a register number is
small, typically 3–4 bits, while an address is large (typically 25–35 bits).
Fields containing immediate operands (numbers used as data) can have different sizes. Experience
shows that most data items used by programs are small. Thus, a well-designed instruction set should allow
for both small and large data items, resulting in short instructions whenever possible.
The last point, about short instructions, is important and leads us to the next topic of discussion,
namely the properties of a good instruction set. When a new computer is designed, one of the first features
that has to be determined and fully specified is the instruction set. First, the instruction set has to be
designed as a whole, then the designers decide what instructions should be included in the set. The general
form of the instruction set depends on features such as the intended applications of the computer, the word
size, address size, register set, memory organization, and bus organization. Instruction sets vary greatly, but
a good design for an instruction set is based on a small number of principles that are independent of the
particular computer in question. These principles demand that an instruction set should satisfy the following
requirements:
1. Short instructions.
2. An instruction size that is both compatible with the word size, and variable.
1. Short instructions: Why should machine instructions be short? Not because short instructions
occupy less memory. Memory isn’t expensive and modern computers support large memories. Also, not
because short instructions execute faster. The execution time of an instruction has nothing to do with its
size. A “register divide” instruction, to divide the contents of two floating-point registers, is short, since it
only has to specify the registers involved. It takes, however, a long time to execute.
Exercise 2.1: Find an example of a long instruction with fast execution.
The reason why short instructions are preferable is that it takes less time to fetch them from memory.
An instruction that is longer than one memory word has to be stored in at least two words. It therefore
takes two memory cycles to fetch it from memory. Cutting the size of an instruction so that it fits in one
word cuts down the fetch time. Even though the fetch time is short (it is measured in nanoseconds), many
instructions are located inside loops and have to be executed (and therefore fetched) many times. Also, since
memory is slower than the processor, speeding up the instruction fetch can speed up the entire computer.
2. Instruction size: The instruction size should also be compatible with the computer’s word size.
The best design results in instruction sizes of N , 2N and 3N where N is the word size. Instruction sizes such
as 1.2N or 0.7N do not make any sense, since each memory read brings in exactly one word. In a computer
with long words, several instructions can be packed in one word and, as a result, instruction sizes of N/2,
N/3, and N/4 also make sense.
The two requirements above are satisfied by the use of variable-size opcodes and addressing modes.
These two important concepts are discussed next.
2.2 The Opcode Size
If the instruction is to be short, individual fields in the instruction should be short. In this section we
look at the opcode field. Obviously, the opcode cannot be too short or there would not be enough codes
for all the instructions. An opcode size of 6–8 bits, allowing for 64–256 instructions, is common. Most
modern computers, however, use variable-size opcodes, for two good reasons. One reason has to do with the
instruction size in relation to the word size, and the other has to do with future extensions of the instruction
set.
The first reason is easy to understand. We want our instructions to be short, but some instructions
have to contain more information than others and are naturally longer. If the opcode size varies, longer
instructions can be assigned short opcodes. Other instructions, with short operands, can be assigned longer
opcodes. This way the instruction size can be fine-tuned to fit in precisely N or 2N bits.
The second advantage of variable-size opcodes has to do with extensions to an original instruction set.
When a computer becomes successful and sells well, the manufacturer may decide to come up with a new,
extended, and upward compatible version of the original computer. The 68000, 80x86, and Pentium families
are familiar examples.
Upward compatibility means that any program running on the original computer should also run on the
new one. The instruction set of the new computer must therefore be an extension of the original instruction
set. With variable-size opcodes, such an extension is easy.
When an instruction is fetched, the hardware does not know what its opcode is. It has to decode the
instruction first. In a computer with variable-size opcodes, when an instruction is fetched, the control unit
does not even know how long the opcode is. It has to start by identifying the opcode size, following which
it can decode the instruction. Thus, with variable-size opcodes, the control unit has to work harder.
Three methods to implement variable-size opcodes are described here.
Prefix codes. We start by simply choosing several opcodes with the sizes that we need. For example,
if we need a 3-bit opcode in order to adjust the size of instruction A to one word, a 6-bit opcode to adjust
the size of instruction B to one word, and other opcodes with sizes 2, 4, and 7 bits, we can select the set
of opcodes 000, 101010, 11, 0001, and 1110001. A little thinking, however, shows that our opcodes cannot
be decoded uniquely. Imagine an instruction with opcode 0001 being fetched in Step 1 of the fetch-execute
cycle and decoded in Step 3. The control unit examines the first few bits of the instruction and finds them
to be 0001. . . . It does not know whether the opcode is the four bits 0001 or the three bits 000 (with the 1
being the start of the next field). The problem exists because we selected the bit pattern 000 as both a code
and the start of another code. The solution is to avoid such a situation. We have to design our opcodes
following the simple rule:
Once a bit pattern has been assigned as an opcode, no other opcode should start with that pattern
The five opcodes above can now be redesigned as, for example, 000, 101010, 11, 100, and 0100100. We
say that these opcodes satisfy the prefix property, or that they are prefix codes. These codes are discussed in
more detail in Section 3.11. It is easy to see, by checking each of the five codes, that they can be uniquely
decoded. When the control unit finds, for example, the pattern 000 it knows that this must be the first code,
since no other codes start with 000. When it finds 10. . . , it knows that this must be either code 2 or code
4. It checks the next bit to find out which code it is.
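The prefix rule is easy to test mechanically. The short Python sketch below is an illustration only (the function names are invented here, and opcodes are represented as strings of '0' and '1'). It checks both sets of opcodes quoted above and decodes the opcode at the start of a fetched bit string.

```python
def has_prefix_property(codes):
    """Return True if no code in the set is the start of another code."""
    return not any(
        x != y and y.startswith(x) for x in codes for y in codes
    )

def decode_opcode(bits, codes):
    """Return the opcode found at the start of bits, or None if there is none.

    For a set with the prefix property at most one code can match.
    """
    for code in codes:
        if bits.startswith(code):
            return code
    return None

first_try = ["000", "101010", "11", "0001", "1110001"]
redesigned = ["000", "101010", "11", "100", "0100100"]

print(has_prefix_property(first_try))        # False
print(has_prefix_property(redesigned))       # True
print(decode_opcode("1000111", redesigned))  # 100
```

The first set fails because 000 is the start of 0001 (and 11 is the start of 1110001), which is exactly the ambiguity described in the text.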
Fixed prefix. The first 2–4 bits of the instruction tell what the opcode size is (and even what the
instruction format is). Figure 2.2 shows four instruction formats, denoted by 0–3 and identified by the first
two opcode bits. When the control unit fetches an instruction that starts with 00, it knows that the next
three bits are the opcode, so there can be eight instructions with a 5-bit opcode (including the identification
bits). Similarly, when the control unit fetches an instruction that starts with 01, it knows that the opcode
consists of the next two bits plus three more bits elsewhere in the instruction, so there can be 32 instructions
with a 7-bit opcode (including the two identification bits).
Exercise 2.2: How many instructions can there be with the other two opcode sizes?
0 00xxx· · · ···
1 01xx· · ·xxx· · ·
2 10xxxx· · · ···
3 11xxxxxx· · · · · ·
The average opcode size in this example is
$$\frac{8\times 5 + 32\times 7 + 16\times 6 + 64\times 8}{8 + 32 + 16 + 64} = \frac{872}{120} \approx 7.27 \text{ bits}.$$
With fixed-size opcodes, the opcode size for a set of 120 instructions would be seven bits. The advantage
of this method is that several opcode sizes are available and the opcode size can easily be determined. The
disadvantages are: (1) the instruction set cannot be extended; there is no room for new opcodes and (2)
because of the prefix, the average opcode is longer than in the case of fixed-size opcodes.
Reserved patterns. We can start with short, 2-bit opcodes. There are four of them, but if we assign
all four, there would be no way to distinguish between say, the 2-bit opcode ‘01’ and a longer opcode of the
form ‘01xxx’ (review the prefix rule above). Therefore, we reserve one of the four 2-bit opcodes to be the
prefix of all the longer opcodes. If we reserve ‘00’, then all the long opcodes have to start with ‘00’.
Suppose, as an example, that we need opcodes of sizes 2, 5, 8, and 13 bits. We reserve the 2-bit pattern
00 as the prefix of longer opcodes. This leaves the three 2-bit opcodes 01, 10, and 11. The 5-bit opcodes
must, therefore, have the form 00xxx, and one of them, say 00000, has to be reserved, so we end up with
seven such opcodes. Similarly, the 8-bit opcodes must have the form 00000xxx with one pattern reserved,
leaving seven 8-bit opcodes. Finally, the 13-bit opcodes must all start with eight zeros, so only five of the
13 bits vary. This allows for 32 opcodes, of which one should be reserved for future extensions.
The average size of these 48 opcodes is
$$\frac{2\times 3 + 5\times 7 + 8\times 7 + 13\times 31}{3 + 7 + 7 + 31} = \frac{500}{48} \approx 10.42 \text{ bits}.$$
This is much longer than the six bits needed for a similar set of 48 fixed-size opcodes.
The advantages of this method are (1) several opcode sizes are available and (2) the instruction set can
be extended, since more (longer) opcodes can be created and assigned to the new instructions. The only
disadvantage is the large average size of the opcodes.
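The average opcode sizes quoted for the fixed-prefix and reserved-patterns examples can be recomputed with a few lines of Python. This is just a sketch of the arithmetic; the function name and the list layout (pairs of opcode size and number of opcodes) are choices made for this illustration.

```python
def average_opcode_size(groups):
    """groups is a list of (opcode_size_in_bits, number_of_opcodes) pairs.

    Returns (average size in bits, total number of opcodes).
    """
    total_bits = sum(size * count for size, count in groups)
    total_codes = sum(count for _, count in groups)
    return total_bits / total_codes, total_codes

def fixed_opcode_size(n_codes):
    """Bits needed by a fixed-size opcode that distinguishes n_codes instructions."""
    return (n_codes - 1).bit_length()

fixed_prefix = [(5, 8), (7, 32), (6, 16), (8, 64)]
reserved_patterns = [(2, 3), (5, 7), (8, 7), (13, 31)]

for groups in (fixed_prefix, reserved_patterns):
    avg, n = average_opcode_size(groups)
    print(n, round(avg, 2), fixed_opcode_size(n))
# 120 7.27 7
# 48 10.42 6
```

The last number on each line is the fixed-size alternative, which shows how much the variable-size schemes pay on average for their flexibility.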
2.3 Addressing Modes
This is an important feature of computers. We start with the known fact that many instructions have to
include addresses; the instructions should be short, but addresses tend to be long. Addressing modes are a
solution to this problem. Using addressing modes, a short instruction may specify a long address.
The idea is that the instruction no longer includes the full address (also called the effective address, or
EA), but instead contains a number, normally called a displacement, that’s closely related to the address.
Another field, an addressing mode, is also added to the instruction. The addressing mode is a code that tells
the control unit how to obtain the effective address from the displacement. It is more accurate to say that the
addressing mode is a rule of calculation, or a function, that uses the displacement as its main argument, and
other hardware components—such as the PC, the registers, and memory locations—as secondary arguments,
and produces the EA as a result.
The notation EA = $f_M$(Disp, PC, Regs, Mem), where the subscript M indicates the mode, is convenient.
For different modes there are different functions, but the idea is the same.
Before any individual modes are discussed, it may be useful to look at some of the numbers involved
when computers, memories, and instructions are designed.
Up until the mid 1970s, memories were expensive, and computers supported small memories. A typical
memory size in a second generation computer was 16K–32K words (24–48 bits/word), and in an early third
generation computer, 32K–64K words (48–60 bits/word). Today, with much lower hardware prices, modern
computers can access much larger memories. Most of the early microprocessors could address 64K bytes,
and most of today’s microprocessors can address between 32M words and a few tera words (usually 8-bit
words, bytes). As a result, those computers must handle long addresses. Since 1M (1 mega) is defined as
1024 K = 1024 × $2^{10}$ = $2^{20}$, an address in a 1M memory is 20 bits long. In a 32M memory, an address is
25 bits long. The Motorola 68000 microprocessor [Kane 81] and the Intel 80386 [Intel 85] generate 32-bit
addresses, and can thus physically address 4G (giga) bytes. (Their virtual address space, though, is 64 tera
bytes, or $2^{46}$ bytes.) Computers on the drawing boards right now could require much longer addresses.
Let’s therefore assume a range of 25–35 bits for a typical address size, which results in the following
representative instruction formats (field sizes in bits):
  Opcode   Reg   Operand
   6–8     3–6    25–35

  Opcode   Reg1   Reg2   Operand
   6–8     3–6    3–6     25–35
The operand field in those formats makes up about 63–73% of the total instruction size, and is therefore the
main contributor to long instructions. A good instruction design should result in much shorter operands,
and this is achieved by the use of addressing modes.
When an addressing mode is used, the instruction contains a displacement and a mode instead of an
EA. The mode field is 3–4 bits long, and the result is the following typical format:
Opcode Reg Mode Displacement
6–8 3–6 3–4 8–12
The operand (the mode and displacement fields) is now 11–16 bits long. It is still the largest field in
the instruction, making up 50–55% of the instruction size, but is considerably shorter than before. Many
instructions do not need an address operand and therefore do not have an addressing mode. They do not
have a mode field and are therefore short.
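The numbers in this discussion are easy to check. The Python sketch below is an illustration only; the two helper functions are invented for it. It computes the address size needed for a given memory size and the percentage of the instruction taken up by the operand, for the extreme field sizes of the formats shown above.

```python
def address_bits(memory_size):
    """Number of bits needed to address memory_size locations."""
    return (memory_size - 1).bit_length()

def operand_percent(other_fields, operand_bits):
    """Percentage (rounded down) of the instruction occupied by the operand."""
    total = sum(other_fields) + operand_bits
    return 100 * operand_bits // total

M = 2 ** 20
print(address_bits(M), address_bits(32 * M), address_bits(4096 * M))  # 20 25 32

# Full address in the instruction: shortest one-register format and
# longest two-register format.
print(operand_percent([6, 3], 25), operand_percent([8, 6, 6], 35))    # 73 63

# With a mode field: the operand is the mode plus the displacement.
print(operand_percent([6, 3], 3 + 8), operand_percent([8, 6], 4 + 12))  # 55 53
```

Only the extreme combinations are tried here, so the percentages printed are the ends of a range, not a complete survey of all field sizes.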
There is, of course, a tradeoff. If an instruction uses a mode, the EA has to be calculated, in Step 4 of
the fetch-execute cycle, before the instruction can be executed (in Step 5), which takes time. However, this
calculation is done by the control unit, so it is fast.
Before looking at specific modes, another important property of addressing modes should be mentioned.
They serve to make the machine instructions more powerful. We will see examples (such as those in
Section 2.10.9) where a single instruction, using a powerful mode, does the work of two instructions.
The five main addressing modes, found on all modern computers and on many old ones, are direct,
relative, immediate, index, and indirect. They are discussed here in detail.
2.4 The Direct Mode
This is the case where the displacement field is large enough to contain the EA. There is no need for any
calculations, and the EA is simply the displacement. The definition of this simple mode is EA←Displacement,
but this is not a very useful mode because it does not result in a short instruction (it is like not having an
addressing mode at all). Nevertheless, if the assembler cannot assemble an instruction using any other mode,
it has to use the direct mode.
2.5 The Relative Mode
In this mode, the displacement field is set by the assembler to the distance between the instruction and its
operand. This is a useful mode that is often selected by the assembler as a first choice when the user does
not explicitly specify any other mode. It also results in an instruction that does not require relocation by
the loader.
Using the concept of a function mentioned earlier, the definition of this mode is EA = Disp + PC. This
means that, at run time, before the instruction can be executed, the control unit has to add the displacement
and the PC to obtain the EA. The control unit can easily do that, and the only problem is to figure out the
displacement in the first place. This is done by the compiler (or assembler) in a simple way. The compiler
maintains a variable called the location counter (LC) that points to the instruction currently being compiled.
Also, at run time, the PC always contains the address of the next instruction. Therefore the expression
above can be written as Disp = EA − PC, which implies
Disp = EA − (LC + size of current instr.) = EA − LC − size of current instr.
Example: The simple JUMP instruction “JMP B”. We assume that the JMP instruction is assembled and
loaded into memory location 57, and that the value of symbol B is address 19. The expression above implies
the displacement should be
Disp = 19 − 57 − 1 = −39.
Notice that the displacement is negative. A little thinking shows that the displacement is negative whenever
the operand precedes the instruction. Thus, in this mode, the displacement should be a signed number.
Since the sign uses one of the displacement bits, the maximum value of a signed displacement is only half
that of an unsigned one. An 8-bit displacement, for instance, is in the range [0, 255] if it is unsigned, but in
the shifted interval [−128, +127] if it is signed. The ranges have the same size, but the maximum values are
different.
This example also illustrates the absolute nature of this mode. The displacement is essentially the
distance between the instruction and its operand, and this distance does not depend on the start address of
the program. We say that this mode generates position independent code and an instruction using it does
not have to be relocated.
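The displacement calculation, the range check for a signed displacement, and the position independence of the result can be sketched in Python. The function names, the default instruction size of one word, and the 8-bit displacement field are assumptions made for this illustration.

```python
def relative_displacement(ea, lc, instr_size=1, disp_bits=8):
    """Displacement computed at assembly time: Disp = EA - LC - size.

    Raises ValueError if the result does not fit in a signed field of
    disp_bits bits (the field width is an assumption of this sketch).
    """
    disp = ea - lc - instr_size
    lowest = -(1 << (disp_bits - 1))
    highest = (1 << (disp_bits - 1)) - 1
    if not lowest <= disp <= highest:
        raise ValueError("displacement does not fit in the field")
    return disp

def effective_address(disp, pc):
    """EA computed at run time by the control unit: EA = Disp + PC."""
    return disp + pc

# "JMP B" assembled at location 57, with B at address 19
disp = relative_displacement(ea=19, lc=57)
print(disp)                              # -39
print(effective_address(disp, pc=58))    # 19 (the PC points to the next instruction)

# Loading the same program 1000 locations higher leaves the displacement unchanged
print(relative_displacement(ea=1019, lc=1057))   # -39
```

The last line shows why no relocation is needed: both the instruction and its operand move by the same amount, so their distance stays the same.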
LEN EQU 25 LEN is the array size
ARY RES LEN ARY is the array itself
M DAT ARY Location M contains the value of symbol ARY
Exercise 2.3: What instruction can be used instead of “LOD R5,M” above, to load the value of ARY?
The index mode can also be used in a different way, as in “LOD R1,ARY(R5)” where ARY is the start
address of the array and R5 is initialized to 0. In this case, the index register really contains the index of the
current array element used. This form can be used if the start address of array ARY fits in the displacement
field.
2.8 The Indirect Mode
 LC                      Obj. Code
  .                      Opc  m  disp
 24        JMP  @TO      xxx  3  124
  .
  .
124   TO   DC   11387
  .
The at-sign “@” instructs the assembler to use the indirect mode (other symbols, such as a dollar sign, may
be used by different assemblers). The value of TO (address 124) is the indirect address and, in the simplest
version of the indirect mode, it has to be small enough to fit in the displacement field. This is one reason
why several versions of this mode exist (see below). The EA in our example is 11387, the content of memory
location 124.
What’s the use of this mode? As this is a discussion of computer organization, and not of assembly
language programming, a complete answer cannot be provided here. However, we can mention a common
case where this mode is used, namely a return from a procedure. When a procedure is called, the return
address has to be saved. Most computers save the return address in the stack, but some old computers save
it in the first word of the procedure (in which case the first executable instruction of the procedure should
be stored in the second word, see page 16). If the latter method is used, a return from the procedure is done
by a jump to the memory location whose address is contained in the first word of the procedure. This is
therefore a classical, simple application of the indirect mode. If the procedure name is P, then an instruction
such as “JMP @P” (where “@” specifies the indirect mode) can easily accomplish the job.
Incidentally, if the return address is saved in the stack, a special instruction, such as RET, is necessary
to return from the procedure. Such an instruction should jump to the memory location whose address is
contained at the top of the stack, and also remove that address from the stack. Thus, a RET instruction uses
a combination of the indirect and stack modes.
Common extensions of the indirect mode combine it with either the relative or the index modes. The
JMP instruction above could be assembled into “xxx 3 99”, since the value of TO relative to the JMP instruction
is 124 − 24 − 1 = 99. We assume that the size of the JMP instruction is one word and that mode 3 is the
combination indirect-relative. A combination indirect-index can also be used and, in fact, the (now obsolete)
6502 microprocessor used two such combinations, a pre-indexed indirect (that can only be used with index
register X), and a post-indexed one (that can only be used with index register Y). Their definitions are
Pre-indexed indirect: EA = Mem[disp + X],
Post-indexed indirect: EA = Mem[disp] + Y.
In the former version, the indexing (disp + X) is done first, followed by the indirect operation (the memory
read). In the latter version, the indirect is done first and the indexing (· · · +Y) follows. [Leventhal 79]
illustrates the use of those modes. The two instructions “LDA ($80,X)” and “LDA ($80),Y” are typical
examples. The dollar sign “$” stands for hexadecimal and the parentheses imply the indirect mode. It is
interesting to note that in order to keep the instructions short, there is no mode field in the 6502 instructions,
and the mode is implied in the opcode. Thus, an instruction may have several opcodes, one for each valid
mode. The two instructions above have opcodes of A1 and B1 (hexadecimal), respectively.
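The definitions of the indirect mode and its combinations can be written as small Python functions. This is a simplified sketch: memory is modeled as a dictionary that maps an address to the full contents of that location (so the byte-level details of a real 6502 are ignored), and all function names are invented for the illustration.

```python
def indirect(mem, disp):
    """Simple indirect: the displacement is the address of the EA."""
    return mem[disp]

def indirect_relative(mem, disp, pc):
    """Indirect-relative: the indirect address is disp + PC."""
    return mem[disp + pc]

def pre_indexed_indirect(mem, disp, x):
    """Index first, then read memory: EA = Mem[disp + X]."""
    return mem[disp + x]

def post_indexed_indirect(mem, disp, y):
    """Read memory first, then index: EA = Mem[disp] + Y."""
    return mem[disp] + y

# The "JMP @TO" example: location 124 contains 11387
mem = {124: 11387}
print(indirect(mem, 124))                    # 11387
print(indirect_relative(mem, 99, pc=25))     # 11387 (99 = 124 - 24 - 1)

# Hypothetical memory contents for the two indexed versions
mem = {0x80: 0x2000, 0x84: 0x3000}
print(hex(pre_indexed_indirect(mem, 0x80, x=4)))    # 0x3000
print(hex(post_indexed_indirect(mem, 0x80, y=4)))   # 0x2004
```

With the same displacement and the same index value, the two indexed versions produce different effective addresses, because the order of the indexing and the memory read is reversed.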
1 2178 0 12345
loc. 1045 loc. 15
Figure 2.3: Cascaded indirect
Exercise 2.4: How can the programmer generate both an address and a flag in a memory word?
2.10 Other Addressing Modes
Computers support many other modes, some simpler and others more powerful than the basic modes de-
scribed so far. This section describes a few other modes, not trying to provide a complete list but only to
illustrate what can be found on different computers.
2.10.1 Zero Page Mode
This is a version of the direct mode. If the displacement field is N bits long, the (unsigned) displacement can
have $2^N$ values. Memory may be divided into pages, each $2^N$ words long, and page zero, the first memory
page, is special in the sense that any instruction accessing it uses a small address that fits in the displacement
field. Hence the name zero page mode. Good examples are the 6502 microprocessor [Leventhal 79] and the
DEC PDP-8 [Digital Equipment 68].
2.10.2 Current Page Direct Mode
The DEC PDP-8 was a 12-bit minicomputer. It had 12-bit addresses and thus a 4K-word memory. Memory
is divided into pages, each 128 words long. The first memory page, addresses 0–127, was called the base page.
Many instructions have the format:
opcode m disp
4 1 7 = 12 bits
If m=0, the direct mode is used and the EA is the displacement; this is zero page addressing. However, if
m=1, the EA becomes the logical OR of the five most significant bits of the PC and the seven bits of the
displacement, pppppddddddd . The displacement in this case is the address in the current page, and the mode
is called current page direct mode.
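The two cases of the m bit can be sketched in Python, following the description above (a 12-bit PC, a 7-bit displacement, and an EA formed from the five most significant bits of the PC). The function name and the sample values are made up for this illustration.

```python
def page_mode_ea(m, disp, pc):
    """EA for the zero page (m = 0) and current page (m = 1) direct modes.

    disp is a 7-bit displacement and pc a 12-bit program counter.
    """
    disp &= 0x7F                       # keep the seven displacement bits
    if m == 0:
        return disp                    # zero page: EA is the displacement
    return (pc & 0xF80) | disp         # pppppddddddd

pc = 0x29C                             # binary 0010 1001 1100, located in page 5
print(hex(page_mode_ea(0, 0x15, pc)))  # 0x15
print(hex(page_mode_ea(1, 0x15, pc)))  # 0x295
```

Since the five page bits and the seven displacement bits do not overlap, the OR simply places the displacement inside the page that currently contains the PC.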
2.10.3 Implicit or Implied Mode
This is the simple case where the instruction has no operands and is not using any mode. Instructions such
as HLT or EI (enable all interrupts) are good examples. Those instructions do not use any modes, but the
manufacturer’s literature refers to them often as using the implicit or implied mode.
of the stack is implied in the opcode. Examples are POP and PUSH. The former pops a data item from the
stack, then updates the stack pointer to point to the new top-of-stack. The latter updates the stack pointer
to point to the next available stack location, then pushes the new data item into the stack.
Exercise 2.5: A stack is a LIFO data structure, which implies using the top element. What could be a
reason for accessing a stack element other than the top?
2.10.6 Stack Relative Mode
This is a combination of the relative mode and the stack mode. In this mode, EA = disp + SP where SP is
the stack pointer, a register that always points to the top of the stack. This mode allows access to any stack
element, not just the one at the top.
2.10.7 Register Mode
This is the case where the instruction specifies a register that contains the operand. This mode does not use
an EA and there is no memory access.
2.10.8 Register Indirect Mode
In this mode, the register contains the EA; it points to the operand in memory.
2.10.9 Auto Increment/Decrement Mode
This is a powerful version of the index mode. The control unit uses the index register to calculate the EA,
then increments or decrements the register. This is an example of a powerful addressing mode, because an
instruction using this mode can do the work of two instructions. The PDP-11 instruction “CLR (R5)+” is
executed by (1) use R5 as an index (it contains the EA, so it points to the operand in memory), (2) execute
the instruction (clear the operand, a memory word), and (3) finally, increment the index register so that it
points to the next word in memory. This is clearly a powerful mode, since without it, another instruction
would be needed, to increment the register. Similarly, the instruction “INC -(R5)” starts by decrementing
the index register, then using it as an index, pointing to the memory word that is to be INCremented.
In the DEC PDP-11, memory is physically divided into bytes, and a word consists of two consecutive
bytes. The instructions above operate on a word and, therefore, the index is updated by 2, not by 1, so it
points to the next/previous word. In an instruction such as CLRB (clear a byte), the register is updated by
1.
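The order of the steps in the two PDP-11 examples can be imitated in Python. This is a rough sketch, not a PDP-11 simulator: memory is a dictionary keyed by byte address and holding whole words, registers are a dictionary keyed by register number, and the function names are invented here.

```python
def clr_autoincrement(mem, regs, r, size=2):
    """Imitates CLR (Rr)+ : use the register as the EA, clear, then increment."""
    ea = regs[r]
    mem[ea] = 0
    regs[r] += size          # 2 for a word operand, 1 for a byte operand

def inc_autodecrement(mem, regs, r, size=2):
    """Imitates INC -(Rr) : decrement the register first, then use it as the EA."""
    regs[r] -= size
    ea = regs[r]
    mem[ea] = mem.get(ea, 0) + 1

mem = {100: 7, 102: 9}
regs = {5: 100}

clr_autoincrement(mem, regs, 5)
print(mem[100], regs[5])     # 0 102

inc_autodecrement(mem, regs, 5)
print(mem[100], regs[5])     # 1 100
```

Each function updates both a memory word and the register, which is the sense in which one instruction using this mode does the work of two.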
In the Nova minicomputer [Data General 69] [Data General 70], memory locations 16–31 are special.
Locations 16–23 are auto increment and locations 24–31 are auto decrement. When any of those locations
is accessed by an indirect instruction, the computer first increments/decrements that location, then uses it
to calculate the effective address.
Figure 2.4 is a graphic representation of 11 of the modes discussed here.
2.11 Instruction Types
Modern computers support many instructions, sometimes about 300 different instructions, with features such
as variable-size opcodes and complex addressing modes. At the same time, the instructions in a typical
instruction set can be classified into a small number of classes. The instruction types discussed here are
general and are not those of any particular computer. Some—such as data movement instructions—can be
found on every computer while others—such as three-address instructions—are rare.
[Figure 2.4: diagrams of the addressing modes. Surviving panel titles: Direct, Relative, Indirect, Immediate, Indexed, Indirect Pre-Indexed, Indirect Post-Indexed, Register, Register Indirect. Two more panels add a displacement to the stack pointer (SP) and to a base (the start of the program).]
[Figure: data-movement diagram with the labels Registers, Memory, Constant, TRA, and MOV.]
2.13 Operations
These instructions specify operations on data items. They are all executed in a similar way. The control
unit fetches the two operands (or the single operand), sends them to the ALU, sends a function code to the
ALU, and waits for the ALU to perform the operation. The control unit then picks up the result at the ALU
                        A      1 1 0 0
                        B      1 0 1 0
negation                ¬A      0   1
conjunction (AND)       A∧B    1 0 0 0
disjunction (OR)        A∨B    1 1 1 0
exclusive OR (XOR)      A⊕B    0 1 1 0
equivalence             A≡B    1 0 0 1
Table 2.6: Truth tables of some logic operations
A look at the truth tables in table 2.6 shows two important facts about the logic operations:
1. The negation operation has just one operand. The truth table is fully defined by two bits, and there
can therefore be only four logic operations with one input.
Exercise 2.6: Show a simple example of another logic operation with one input.
2. Each of the other tables has two inputs and is fully defined by specifying four bits. This implies that
there can only be $2^4 = 16$ such tables. This is not a limitation, since only four or five of the 16 operations
are useful in practice and are implemented as ALU circuits. Most ALUs have circuits to perform the AND,
OR, and NOT operations. The XOR operation can also be performed by the ALUs of many computers, but
if any of the other operations is needed, it has to be implemented in software.
It should also be noted that the NOT operation can be achieved by XORing a number with all 1’s.
Thus, 1010 ⊕ 1111 = 0101.
The importance of arithmetic operations in a computer is obvious, but what is the use of the logical
operations? Perhaps their most important use is in bit manipulations. They make it easy to isolate or to change
an individual bit or a field of bits in a word. The operation xxxyxxxx AND 00100000 results in 000y0000
which, when compared to zero, reveals the value of bit y. Similarly, the operation xxxxx000 OR 00000yyy
results in xxxxxyyy, making it easy to combine individual pieces of data into larger segments.
The XOR operation is especially interesting. Its main properties are:
1. It is commutative a ⊕ b = b ⊕ a.
2. It is associative (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c).
3. It is reversible. If a = b ⊕ c, then b = a ⊕ c and c = a ⊕ b.
One important application of the XOR is the stream ciphers of Section 3.19.4. Another, less familiar
application is hiding an n-bit file F in k cover files of the same size. Given k random files $C_0, C_1, \ldots, C_{k-1}$,
we can hide a file F in them in the following way:
A k-bit password P is chosen. It is scanned, and for each bit $P_j = 1$ in P, cover file $C_j$ is selected. All
the selected $C_j$ files are XORed, the result is XORed with file F, and that result, in turn, is XORed with
any of the $C_j$'s, say $C_a$. The process is summarized by
$$C_a = \Bigl(\bigoplus_{P_j=1} C_j\Bigr) \oplus F \oplus C_a.$$
Only one of the cover files, $C_a$, is modified in the process and there is no need to memorize which one it is.
Once F is hidden this way, it can be retrieved by
$$F = \bigoplus_{P_j=1} C_j.$$
Exercise 2.7: We choose n = 4, k = 5, the password P = 01110, and the five cover files C0 = 1010,
C1 = 0011, C2 = 1110, C3 = 0111, and C4 = 0100. Show how file F = 1001 is hidden using P .
This clever application of the XOR hides a file F in such a way that anyone with the password can easily
retrieve it, but someone who does not have the password cannot even verify that the file exists, even if they
search every bit of the entire file system. See Section 3.20 for a discussion of other data hiding techniques.
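The hiding and retrieval steps can be tried out in a few lines of Python. This is a sketch under several assumptions made only for the illustration: files are small integers, the password is a list of bits with element j playing the role of bit j, the function names are invented, and the sample values are deliberately different from those of the exercise.

```python
from functools import reduce
from operator import xor

def selected(password, covers):
    """Cover files whose password bit is 1."""
    return [c for p, c in zip(password, covers) if p]

def hide(f, password, covers, a):
    """Hide file f by replacing covers[a], which must be a selected cover file."""
    if not password[a]:
        raise ValueError("cover file a must be one of the selected files")
    mix = reduce(xor, selected(password, covers), 0)
    covers[a] = mix ^ f ^ covers[a]

def retrieve(password, covers):
    """XOR all the selected cover files to get the hidden file back."""
    return reduce(xor, selected(password, covers), 0)

covers = [0b1100, 0b0101, 0b1001, 0b0011]
password = [1, 0, 1, 1]

hide(0b1010, password, covers, a=2)
print(format(covers[2], "04b"))                  # 0101
print(format(retrieve(password, covers), "04b")) # 1010
```

Retrieval works because every selected file other than the modified one appears twice in the final XOR and cancels, and the old contents of the modified file cancel as well, leaving only F.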
2.13.3 Shifts
These are simple, important operations, that are used on both numeric and nonnumeric data, and come in
several varieties.
[Figure 2.7: Shift types. (a) logical, (b) arithmetic, (c) circular, (d) double, (e) circular through the carry.]
The simplest type of shift is the logical shift (Figure 2.7a). In a left logical shift, the leftmost bit is
shifted out and is lost, and a zero is entered on the right. In a right logical shift, this operation is reversed.
The arithmetic shift (Figure 2.7b) is designed for multiplying and dividing numbers by powers of 2. This
is easy and fast with a shift, and the only problem is to preserve the sign. The ALU circuit for arithmetic
shift knows that the leftmost bit contains the sign, and is therefore special and should be preserved. An
arithmetic left shift shifts bits out of the second bit position from the left, entering a zero on the right. The
leftmost bit (the sign) is not affected. An arithmetic right shift shifts out the rightmost bit, entering copies
of the sign bit into the second position from the left. This always results in the original number being either
multiplied or divided by powers of 2, with the original sign preserved.
Also common is the circular shift (or rotation, Figure 2.7c). The number is rotated left by shifting out
the leftmost bit and entering it from the right. Similarly for a circular right shift. A typical use of such a
shift is the case where every bit of a certain number has to be examined. The number is repeatedly shifted,
in a loop, and each bit is examined when it gets to the leftmost position. In that position, the bit temporarily
becomes the sign of the number, making it easy to examine.
A double shift is depicted in Figure 2.7d. A pair of registers is shifted as one unit. This is useful in a
computer with a short word size, where double precision operations are common. The double shift itself can
be of any of the varieties mentioned here.
Another variation is a shift through the carry. The carry status flag (C) can participate in a shift, making
it easy to isolate and test individual bits. In a right shift through the carry, the rightmost bit is moved to
the C flag and can later be examined by testing that flag. If such a shift is also circular (Figure 2.7e), it is
equivalent to shifting N + 1 bits.
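The shift varieties can be imitated on 8-bit words in Python. This is a sketch under assumptions of our own: the word size, the function names, and the choice of which directions to show are not taken from any particular computer.

```python
BITS = 8
MASK = (1 << BITS) - 1          # 11111111
SIGN = 1 << (BITS - 1)          # 10000000


def logical_left(x: int) -> int:
    return (x << 1) & MASK                   # leftmost bit lost, 0 enters on the right


def logical_right(x: int) -> int:
    return (x & MASK) >> 1                   # rightmost bit lost, 0 enters on the left


def arithmetic_left(x: int) -> int:
    # The sign bit stays; the remaining bits move left and a 0 enters on the right.
    return (x & SIGN) | ((x << 1) & (MASK >> 1))


def arithmetic_right(x: int) -> int:
    # The rightmost bit is lost and a copy of the sign enters next to the sign bit.
    return ((x & MASK) >> 1) | (x & SIGN)


def rotate_left(x: int) -> int:
    return ((x << 1) & MASK) | ((x & MASK) >> (BITS - 1))


def rotate_right_through_carry(x: int, carry: int) -> tuple[int, int]:
    """Return (new word, new C flag); the flag acts as an extra bit of the word."""
    new_carry = x & 1
    return ((x & MASK) >> 1) | (carry << (BITS - 1)), new_carry


x = 0b11110100                                   # -12 in two's complement
print(format(logical_right(x), "08b"))           # 01111010
print(format(arithmetic_right(x), "08b"))        # 11111010, i.e. -6
print(format(arithmetic_left(x), "08b"))         # 11101000, i.e. -24
print(format(rotate_left(0b10010001), "08b"))    # 00100011
print(rotate_right_through_carry(0b00000101, 0)) # (2, 1)
```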
2.13.4 Comparisons
In a typical comparison operation, two numbers are compared and a result is generated. The result, however,
is not a number, which is why comparisons are not considered operations on data, but are instead classified as
a separate class of instructions. If the two numbers compared are a and b, then the result of the comparison
is one of the three relations:
a < b, a = b, a > b.
The result is not a number and should therefore be stored in a special place, not in a register or in memory.
The best place for the result is the Z and S status flags (Section 1.4).
By definition, the result of a comparison is the status of the difference a − b. The computer does not
have to actually subtract the numbers, but it sets the flags according to the status of the difference. Thus,
if a = b, the difference is zero and the Z flag is set to 1. If a < b, the difference is negative and the S flag is
set (the Z flag is reset, since the difference is not zero). If a > b, both flags are cleared.
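The flag settings just described can be expressed as a small Python function. It is only an illustration: the name `compare` is ours, and Python integers are used so that the difference can never overflow.

```python
def compare(a: int, b: int) -> tuple[int, int]:
    """Return the (Z, S) flags that describe the difference a - b."""
    diff = a - b
    z = 1 if diff == 0 else 0     # Z is set when the numbers are equal
    s = 1 if diff < 0 else 0      # S is set when a < b
    return z, s


print(compare(3, 3))   # (1, 0)  a = b
print(compare(2, 7))   # (0, 1)  a < b
print(compare(7, 2))   # (0, 0)  a > b
```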
One way to compare numbers is to subtract them and check the difference. The subtract circuit in the
ALU can be used, which automatically updates the Z and S flags. This method is simple but may result in
a carry or overflow (see discussion of carry and overflow in Section 2.26).
Another technique to compare numbers is to compare pairs of bits from left (the most significant pair)
to right (least significant one). As long as the pairs are equal, the comparison continues. When the first
unequal pair is encountered, the result is immediately known and the process terminates. For example, given
the numbers a = 00010110 and b = 00100110, b is clearly the greater. A left to right comparison will stop
at the third pair, where b has a 1 while a has a 0. Note that the next (fourth) pair has the opposite values
but that cannot change the relative sizes of the two numbers.
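The left-to-right technique can be sketched as follows, assuming unsigned numbers given as bit strings of equal length (the function name `compare_bits` is invented for this example).

```python
def compare_bits(a: str, b: str) -> str:
    """Compare two unsigned binary strings, most significant pair first."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of bits")
    for bit_a, bit_b in zip(a, b):
        if bit_a != bit_b:                    # the first unequal pair decides
            return "a > b" if bit_a == "1" else "a < b"
    return "a = b"


print(compare_bits("00010110", "00100110"))   # a < b (decided at the third pair)
```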
The test instruction TST, found on some computers, has one operand, and it is compared to zero. Thus,
the Z and S flags are set according to the status of the number being tested.
A conditional branch instruction works by examining one or more of the status flags. It causes a branch
(by resetting the PC) if the flag has a certain value (0 or 1). For example, the BCC instruction (Branch Carry
Clear) branches if the C flag is 0. The BZE instruction (Branch on ZEro) branches if the Z flag is 1.
A computer may have two types of conditional branches, jumps and branches. The only difference
between them is the addressing mode. A jump uses the direct mode, and can jump anywhere in memory. A
branch uses the relative mode, and can only jump over short distances (Sections 2.4 and 2.5).
A computer may also have a SKIP instruction. This is a conditional branch that increments the PC
instead of resetting it. For example, the SKPN (skip on Negative) instruction increments the PC by 2 if S = 1.
This effectively skips the next instruction. In the sequence
SKPN
INCR
ADD
the INCR instruction is skipped if S=1 (negative).
The various procedure call instructions (page 16) are also included in this type. The difference between
a branch and a procedure call is that the latter saves the return address before branching. Note that a
procedure call instruction can also be conditional.
2.13.6 I/O Instructions
These instructions are divided into two groups, instructions that move data, and control instructions. The
first group includes just two instructions, namely IN and OUT. Their general format is “IN R,DEV” where R
is any register and DEV is a device select number (Section 1.8). Such an instruction moves one byte of data
between the I/O device and the register.
The second group consists of a number of instructions that send commands to the I/O devices and also
receive device status. On older computers, these instructions are very specific. A control instruction may
only send commands to a disk unit, or may only receive status from a keyboard. On
modern computers, an I/O control instruction can be used to control several devices. An instruction such as
“BRDY DEV,ADDR”, for example, reads status from device DEV and branches to location ADDR when a “ready”
status is read.
Single-bus computers, mentioned in Chapter 3, do not support any I/O instructions. On such a
machine the I/O is memory mapped and data is moved between the I/O devices and the processor by
means of memory reference instructions such as LOD and STO. When an I/O device is interfaced with such a
computer, it is assigned several select numbers, one (or two) for the data and others for reading status and
sending commands. A printer, for instance, may be assigned two select numbers, one to send data to be
printed, and the other for reading printer status (ready, busy, out of paper, jammed) and also for sending
printer commands (backspace, carriage return, line feed, form feed, ring bell, switch color).
An advantage of single-bus computer organization is that many different devices can be added to the
computer without the need for special instructions.
2.13.7 Miscellaneous Instructions
Various instructions may be included in this category, ranging from the very simple (instructions with no
operands) to very complex (instructions that are as powerful as statements found in a higher-level language).
Following are a few examples.
1. HLT. An obvious instruction. Sometimes privileged (Section 1.7.1).
2. NOP. Does nothing. A very useful instruction, used to achieve precise delays between consecutive
operations.
3. PSW updating. The PSW (program status word, sometimes referred to as PS) contains the various
status flags and also hardware components such as the state of interrupts, processor mode, and memory
protection information. Some of the PSW bits are updated only by the hardware, some can be updated only
by privileged instructions, others can easily be updated by the user, using nonprivileged instructions. As a
result, the computer has at least one instruction to access the PSW or parts of it.
4. Interrupt control. On a computer with more than one interrupt request line, instructions are needed
to enable and disable individual interrupts.
5. ED. The EDit instruction is a good example of a high-level machine instruction found in older
mainframes. It has two operands, a pattern and an array in memory. It scans the array and updates its
contents according to the pattern. Thus, for example, all the “(” characters in an array can be replaced by
“[”, all the periods can be replaced by commas, and all the spaces changed to an “x”, all in one instruction.
This is an example of a complex instruction, common in CISC (complex instruction set computers).
2.14 N-Operand Instructions
Many of the instructions discussed here come in several varieties that differ by the number of their operands.
The ADD instruction, for example, is supposed to add two numbers, so it seems that it should have two
operands. On many computers it is indeed a two-operand instruction, written as “ADD R,ADDR” (add a
memory location to a register) or “ADDR R1,R2” (add two registers). On some computers, however, the same
instruction has only one operand and is therefore written “ADD ADDR”. Such an instruction adds its operand
(a register or a memory location) to the accumulator (the main functional register).
It is also possible to have a 3-operand ADD instruction. “ADD A,B,C” can perform either A←B+C or
A+B→C. Such an instruction is long (especially when A, B, C are memory addresses) and therefore rarely
used.
Extending the same idea in the other direction, is it possible to design an ADD instruction with zero
operands? The interesting answer is yes. On such a computer, the two numbers being added are the two
top operands of the stack. They are also removed and replaced by the sum.
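A zero-operand ADD is easy to imitate with a Python list acting as the stack. In this sketch the PUSH instruction and the `run` function are hypothetical; they exist only to get operands onto the stack.

```python
def run(program: list[str]) -> list[int]:
    """Execute a tiny stack program and return the final stack."""
    stack: list[int] = []
    for instr in program:
        op, *args = instr.split()
        if op == "PUSH":                  # hypothetical one-operand instruction
            stack.append(int(args[0]))
        elif op == "ADD":                 # zero operands: both come from the stack
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)           # the two operands are replaced by their sum
        else:
            raise ValueError(f"unknown instruction: {instr}")
    return stack


print(run(["PUSH 2", "PUSH 3", "ADD"]))   # [5]
```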
It is possible to design many other instructions as 0-, 1-, 2-, or 3-operand instructions, which introduces
the concept of an N-operand computer. In an N-operand computer many instructions (including most of the
operations) have N operands. Of course, not every instruction can have N operands. The LOD instruction,
for instance, can have either one operand (LOD ADDR) or two operands (LOD R,ADDR). A LOD with zero or
with three operands cannot be designed in a reasonable way. An instruction that moves a block of words
in memory is a natural candidate for two operands. Thus “MOV SIZE,SOURCE,DESTIN” (that moves SIZE
words from SOURCE to DESTIN) cannot be reasonably designed as a 1-operand or a 3-operand instruction. Note that the
SIZE operand is not really a third operand, since it is a short number and not an address. Note also that it
is possible to design this instruction as “MOV SIZE,R1,R2” where the two registers point to the two memory
areas.
2.15 Actual Instruction Sets
The instruction sets summarized here are those of the Intel 8080 microprocessor, the VAX, and the Berkeley
RISC I computer. The first set is representative of the early 8-bit microprocessors; it is simple, neatly
organized, and not large. The second set is typical of large, old mainframes. It is very large, contains many
different types and sizes of instructions, and uses many addressing modes. It is an example of a complex
instruction set. The third set is an example of the opposite approach, that of a reduced instruction set. It
is very small and very simple. It uses single-size instructions and very few addressing modes. The concept
of RISC (reduced instruction set computers) is discussed in many texts on computer architecture.
The Intel 8080 was one of the first 8-bit microprocessors and was extensively used in the late 1970s,
until better microprocessors were developed. Its instruction set is simple and reflects its architecture. It has
a relatively small number of instructions, three addressing modes, three different instruction sizes, and seven
types of instructions.
The 8080 has eight 8-bit registers and two 16-bit registers. Their organization is summarized in
Figure 2.8. The PSW (program status word) contains the status flags. The A register (accumulator) is the main
functional register. Registers B–L are the general-purpose registers. The PC (program counter) contains
the address of the next instruction, and the SP (stack pointer) always points to the top item in the stack.
Certain instructions use the BC, DE, and HL pairs of registers.
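[Figure 2.8: The 8080 registers]
PSW | A      (8 bits each)
B   | C      (8 bits each)
D   | E      (8 bits each)
H   | L      (8 bits each)
SP           (16 bits)
PC           (16 bits)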
The addressing modes are direct, implied and register indirect. In the direct mode, the 16-bit EA is
stored in two bytes in the instruction and the instruction itself is 3-bytes long. In the implied mode, the EA
is stored in the HL register pair, and the instruction is either one or two bytes long. In the register indirect
mode, the 2-byte EA is stored in memory, in the location pointed to by one of the register pairs.
Most instructions are one byte long. The move-immediate, arithmetic-immediate, logical-immediate,
and I/O instructions are two bytes long, with the second byte containing either the immediate quantity or
an I/O device number. Branches and some memory-reference instructions are three bytes long (they use the
direct mode mentioned earlier).
The first byte of an instruction contains the opcode. This allows for 256 instructions but only 244
opcodes are used. However, many of those instructions are very similar and can be considered identical. For
example, the “MOV A,B” and “MOV C,B” instructions have different opcodes (the opcodes contain the register
codes) but are really the same instruction or, at least, the same type of instruction. As a result, the total
number of different instructions is around 50–60; the precise number depends on what one considers to be
different instructions.
Table 2.9 summarizes the seven instruction formats and eight register codes. It is followed by examples
of instructions. The notation p in this table refers to an opcode bit; rrr, ddd, and sss are register codes; Y is
either an immediate number or an I/O device number; and EA is an effective address.
Table 2.10 is a very compact summary of the entire instruction set. Each asterisk “*” in the table
represents an additional byte. The row numbers on the left (00–37) are in octal and represent the leftmost
five bits of the opcode. The column numbers on the top row (0–7) represent the rightmost three bits. Thus,
for example, the CNZ instruction has an opcode of $304_8 = 11000100_2$.
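The way a row label and a column number combine into an opcode can be checked with a short Python sketch (the function name `opcode` is ours):

```python
def opcode(row_octal: str, column: int) -> int:
    """Build an 8-bit opcode from a Table 2.10 row label (octal) and a column (0-7)."""
    row = int(row_octal, 8)          # the leftmost five bits
    return (row << 3) | column       # the rightmost three bits


cnz = opcode("30", 4)
print(oct(cnz), format(cnz, "08b"), hex(cnz))   # 0o304 11000100 0xc4
```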
The table is divided into four parts, each containing certain groups of instructions. The first part
contains the increment, decrement, index, and move immediate instructions. The second part contains
just the move instructions. The third part, the arithmetic operations, and the fourth part, the jump, call,
arithmetic immediate, and a few special instructions. The table demonstrates the neat organization of this
instruction set.
    0         1         2         3         4         5         6         7
00  NOP       LXI B**   STAX B    INX B     INR B     DCR B     MVI B*    RLC
01  -         DAD B     LDAX B    DCX B     INR C     DCR C     MVI C*    RRC
02  -         LXI D**   STAX D    INX D     INR D     DCR D     MVI D*    RAL
03  -         DAD D     LDAX D    DCX D     INR E     DCR E     MVI E*    RAR
04  -         LXI H**   SHLD**    INX H     INR H     DCR H     MVI H*    DAA
05  -         DAD H     LHLD**    DCX H     INR L     DCR L     MVI L*    CMA
06  -         LXI SP**  STA**     INX SP    INR M     DCR M     MVI M*    STC
07  -         DAD SP    LDA**     DCX SP    INR A     DCR A     MVI A*    CMC

10  MOV B,B   MOV B,C   MOV B,D   MOV B,E   MOV B,H   MOV B,L   MOV B,M   MOV B,A
11  MOV C,B   MOV C,C   MOV C,D   MOV C,E   MOV C,H   MOV C,L   MOV C,M   MOV C,A
12  MOV D,B   MOV D,C   MOV D,D   MOV D,E   MOV D,H   MOV D,L   MOV D,M   MOV D,A
13  MOV E,B   MOV E,C   MOV E,D   MOV E,E   MOV E,H   MOV E,L   MOV E,M   MOV E,A
14  MOV H,B   MOV H,C   MOV H,D   MOV H,E   MOV H,H   MOV H,L   MOV H,M   MOV H,A
15  MOV L,B   MOV L,C   MOV L,D   MOV L,E   MOV L,H   MOV L,L   MOV L,M   MOV L,A
16  MOV M,B   MOV M,C   MOV M,D   MOV M,E   MOV M,H   MOV M,L   HLT       MOV M,A
17  MOV A,B   MOV A,C   MOV A,D   MOV A,E   MOV A,H   MOV A,L   MOV A,M   MOV A,A

20  ADD B     ADD C     ADD D     ADD E     ADD H     ADD L     ADD M     ADD A
21  ADC B     ADC C     ADC D     ADC E     ADC H     ADC L     ADC M     ADC A
22  SUB B     SUB C     SUB D     SUB E     SUB H     SUB L     SUB M     SUB A
23  SBB B     SBB C     SBB D     SBB E     SBB H     SBB L     SBB M     SBB A
24  ANA B     ANA C     ANA D     ANA E     ANA H     ANA L     ANA M     ANA A
25  XRA B     XRA C     XRA D     XRA E     XRA H     XRA L     XRA M     XRA A
26  ORA B     ORA C     ORA D     ORA E     ORA H     ORA L     ORA M     ORA A
27  CMP B     CMP C     CMP D     CMP E     CMP H     CMP L     CMP M     CMP A

30  RNZ       POP B     JNZ**     JMP**     CNZ**     PUSH B    ADI*      RST 0
31  RZ        RET       JZ**      -         CZ**      CALL**    ACI*      RST 1
32  RNC       POP D     JNC**     OUT*      CNC**     PUSH D    SUI*      RST 2
33  RC        -         JC**      IN*       CC**      -         SBI*      RST 3
34  RPO       POP H     JPO**     XTHL      CPO**     PUSH H    ANI*      RST 4
35  RPE       PCHL      JPE**     XCHG      CPE**     -         XRI*      RST 5
36  RP        POP PSW   JP**      DI        CP**      PUSH PSW  ORI*      RST 6
37  RM        SPHL      JM**      EI        CM**      -         CPI*      RST 7
(A dash marks an unused opcode.)
The addressing mode is a 4-bit field in a byte which also includes a 4-bit register number. This is the
operand specifier byte, which is a part of most instructions. Since register 15 is also the PC, it is different
from the other registers. This is why certain mode bits, when used with register 15, mean something different
than usual.
Mode bits 1010(=A), for example, mean byte displacement when used with any of the registers 0–14;
when used with register 15, the same bits mean byte relative deferred mode. Table 2.12 summarizes all the
24 different addressing modes.
An example of a complex VAX instruction is MOVTC (Move Translated Characters). It accepts a string
of characters, scans it, and replaces each character in the string with another one, taken from a special
256-byte table of replacement codes. As an alternative, it can leave the input string intact, and generate an
output string. This instruction is considered complex because it can only be replaced by a loop, not just by
a few instructions.
The MOVTC instruction thus requires the following operands: source string length, source string address,
fill character*, address of table, destination string length, destination string address.
More information on the VAX and its instruction set can be found in [Digital Equipment 81] as well as
in many books on the VAX assembler language.
* If the destination string is longer than the source string, the fill character is used to fill the rest of the
destination string.

[Figure: RISC I instruction formats. The general format consists of the fields opcode (7 bits), SCC (1 bit), destin (5 bits), source1 (5 bits), IMM (1 bit), and source2 (13 bits). With IMM = 1 the last field holds a 13-bit immediate. Other formats carry a 19-bit immediate, or a 4-bit cond field followed by a 19-bit relative address.]
The 1-bit SCC (set condition code) field specifies whether an instruction should update the status flags.
The source1 and destination fields select two register operands (out of 32 registers). The IMM field defines
the meaning of the source2 field. If IMM is zero, the five low-order bits of source2 specify a register number.
If IMM is one, then source2 specifies either an immediate number, or a 13-bit relative address, depending
on the opcode. Some instructions use the COND field (four bits) instead of source1. Finally, the branch
relative instructions JMPR and CALLR use the PC-relative mode, where the source1, IMM, and source2 fields
are combined to form a 19-bit relative address. Some examples are:
In those examples, Rs and Rd stand for the source1 and destination registers, S2 is the source2 field (a source
address), Rx is the source1 register, used as an index register, and Y is the 19-bit relative address mentioned
earlier. M[S2] is the contents of location S2 in memory, M[Rx+S2] specifies indexing (an EA is obtained by
adding Rx+S2 followed by a memory access M[EA]).
This is a simple instruction set that tries to show, in the spirit of RISC, that a complex instruction set
is not a necessary condition for a fast, efficient computer.
2.18 Non-Numeric Data Types
The rest of this chapter is devoted to the different types of data supported by current computers. The
following important types are described: character, boolean, signed integer, floating-point, fixed-point, and
BCD. Also discussed are the concepts of carry and overflow.
The nonnumeric data types dealt with in this section are character and boolean. First, however, let’s
try to understand the meaning of the term “data type.” A data type is something like “integer”, “real”, or
“boolean.” Those words are used all the time in higher level programming and most users have a vague,
intuitive understanding of them. It is possible, however, to assign them a precise meaning. Recall, for
example, that a boolean quantity can take one of two values, namely “true” or “false.” We say that the set
of all boolean values is of size 2, which leads to the definition
A data type is a set of values.
2.18.1 Characters
Characters constitute a simple data type, since there are relatively few of them, and since relatively few
operations can be performed on them. Characters are important, because they are used in word processing,
databases, the compilation of programs, and other important applications. In fact, there are many computer
users who rarely, if ever, use numbers in their programs. The fact that computers are useful for nonnumeric
applications was recognized early in their history, when programmers realized that the computer could be
used to compile its own programs.
Character sets typically contain 100 to 200 characters, and the most important character codes are
described in Chapter 3. This section concentrates on the few operations that can be performed on characters,
namely input, comparisons, and output. When dealing with strings of characters, two more operations,
namely concatenation and substring, are useful.
Most of the time, characters are input, stored in memory, and eventually output as text. Comparing
characters is a simple, important operation that involves the character codes. It is easy, of course, to compare
characters for equality and nonequality. It is also possible to compare them for the other four relationals <,
>, ≤, and ≥. The comparison a > b is done by comparing the numeric codes of the two characters. The
character with the smaller code would thus be the “smaller” character for the purpose of comparisons. The
code, in fact, defines an order on the characters, an order called the collating sequence of the computer.
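Python exposes the numeric code of a character through the built-in `ord`, which makes the idea of a collating sequence easy to demonstrate. The helper name `compare_chars` is ours, and the codes shown are those of ASCII, which Python's character codes match for these characters.

```python
def compare_chars(a: str, b: str) -> str:
    """Compare two characters by comparing their numeric codes."""
    code_a, code_b = ord(a), ord(b)
    if code_a < code_b:
        return "<"
    if code_a > code_b:
        return ">"
    return "="


print(ord("A"), ord("a"))          # 65 97
print(compare_chars("A", "a"))     # <
print(sorted("bZa9"))              # ['9', 'Z', 'a', 'b']
```

In this collating sequence the digits precede the uppercase letters, which in turn precede the lowercase letters.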
The term concatenating means to place two objects side by side. The substring operation extracts a
substring from a given string. These are two important operations when dealing with strings of text.
2.18.2 Boolean
Boolean is a word derived from the name of George Boole, an English mathematician and philosopher who
founded the field of modern symbolic logic. He was interested in operations on numbers that result in
nonnumeric quantities—such as a > b, which results in either true or false—and came up with a formalism
of such operations, which we now call boolean algebra. His work has been widely used in the design of
computers, since all the arithmetic operations in the computer are performed in terms of logical operations.
His main work is An Investigation of the Laws of Thought (1854).
The two boolean values are represented in the computer using either the sign bit
0x . . . x for true, and 1x . . . x for false (or vice versa).
(where x stands for ‘don’t care’), or using all the bits in a word
00 . . . 0 for true, and 11 . . . 1 for false (or vice versa).
2.19 Numeric Data Types
Most computers include two data types in this category, namely integer and real. However, this section
shows that it is possible to define other numeric data types that can be useful in special situations.
2.20 Signed Integers
The binary number system was discovered by the great mathematician and philosopher Gottfried Wilhelm
von Leibniz on March 15, 1679. This representation of the integers is familiar and is the most “natural”
number representation on computers in the sense that it simply uses the binary values of the integers being
represented. The only feature that needs to be discussed, concerning integers, is the way signed integers
are represented internally. Unsigned numbers are certainly not enough for practical calculations, and any
number representation should allow for both positive and negative numbers. Three methods have been
used, throughout the history of computing, to represent signed integers. They are sign-magnitude, one’s
complement, and two’s complement. All three methods reserve the leftmost bit for the sign of the number,
and all three use the same sign convention, namely 1 represents a negative sign and 0 represents a positive
sign.
2.20.1 Sign-Magnitude
In this method the sign of a number is changed by simply complementing the sign bit. The magnitude bits
are not changed. To illustrate this method (and also the following ones) we use 4-bit words, where one bit is
reserved for the sign, leaving three magnitude bits. Thus 0|111 is the representation of +7 and 1|111, that
of −7. The number 0|010 is +2 and 1|010 is −2. In general, 1xxx is the negative value of 0xxx. The largest
number in our example is +7 and the (algebraically) smallest one is −7. In general, given N-bit words,
where the leftmost bit is reserved for the sign, leaving N − 1 magnitude bits, the largest possible number is
$2^{N-1} - 1$ and the smallest one is $-(2^{N-1} - 1)$.
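A sign-magnitude word can be built and decoded with a few lines of Python. The sketch uses the 4-bit words of the examples and writes the words as strings with the same sign|magnitude notation; the function names are invented.

```python
def to_sign_magnitude(value: int, bits: int = 4) -> str:
    """Encode an integer as a sign bit followed by bits-1 magnitude bits."""
    limit = (1 << (bits - 1)) - 1            # 2^(N-1) - 1
    if abs(value) > limit:
        raise ValueError(f"magnitude must not exceed {limit}")
    sign = "1" if value < 0 else "0"
    return sign + "|" + format(abs(value), f"0{bits - 1}b")


def from_sign_magnitude(word: str) -> int:
    """Decode a word written as sign|magnitude."""
    sign, magnitude = word.split("|")
    value = int(magnitude, 2)
    return -value if sign == "1" else value


print(to_sign_magnitude(7), to_sign_magnitude(-7))   # 0|111 1|111
print(to_sign_magnitude(-2))                         # 1|010
print(from_sign_magnitude("1|011"))                  # -3
```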
This representation has the advantage that the negative numbers are easy to read but, since we rarely
have to read binary numbers, this is not really an advantage. The disadvantage of this representation is that
the rules for the arithmetic operations are not easy to implement in hardware. Before the ALU can add or
subtract such numbers, it has to compare them, in order to decide what the sign of the result should be.
When the result is obtained, the ALU has to append the right sign to it explicitly.
The sign-magnitude method was used on some old, first-generation computers, but is no longer being
used.
2.20.2 One’s Complement
This method is based on the simple concept of complement. It is more suitable than the sign-magnitude
method for hardware implementation. The idea is to represent a negative number by complementing the bits
of the original, positive number. This way, we hope to eliminate the need for a separate subtraction circuit
in the ALU and to subtract numbers by adding the complement. Perhaps the best way to understand this
method is to consider the complement of a decimal number.
Example: Instead of subtracting the decimal numbers 12845 − 3806, we try to add 12845 to the decimal
complement of 3806. The first step in this process is to append sign digits to both numbers, which results
in 0|12845 and 0|03806. The second step is to complement the second number. The most natural way to
complement a decimal number is to complement each digit with respect to 9 (the largest decimal digit).
Thus the complement of 0|03806 would be 9|96193 (this suggests that 9 represents a negative sign). The
third step is to add the two numbers, which yields the sum 1009038. This is truncated to six digits (one
sign digit plus five magnitude digits), to produce 0|09038. The correct result, however, is +9039. We end up
with a result that differs from the correct one by 1. It is, of course, easy to artificially add a 1 to our result
to get the correct sum, but before doing so, the reader should try some more decimal examples to be
convinced that this result happens often, but not always, when numbers with different signs are added.

Number Bases

Decimal numbers use base 10. The number $2037_{10}$, e.g., is worth
$2 \times 10^3 + 0 \times 10^2 + 3 \times 10^1 + 7 \times 10^0$. We can say that 2037 is the sum of the digits 2, 0, 3 and 7,
each weighted by a power of 10. Fractions are represented in the same way, using negative powers
of 10. Thus $0.82 = 8 \times 10^{-1} + 2 \times 10^{-2}$ and $300.7 = 3 \times 10^2 + 7 \times 10^{-1}$.
Binary numbers use base 2. Such a number is represented as a sum of its digits, each weighted
by a power of 2. Thus $101.11_2 = 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2}$.
Since there is nothing special about 10 or 2*, it should be easy to convince yourself that any
positive integer n > 1 can serve as the basis for representing numbers. Such a representation requires
n ‘digits’ (if n > 10 we use the ten digits and the letters ‘A’, ‘B’, ‘C’. . . ) and represents the number
$d_3 d_2 d_1 d_0 . d_{-1}$ as the sum of the ‘digits’ $d_i$, each multiplied by a power of n, thus
$d_3 n^3 + d_2 n^2 + d_1 n^1 + d_0 n^0 + d_{-1} n^{-1}$. The base for a number system does not have to consist of
powers of an integer, but can be any superadditive sequence that starts with 1.
Definition: A superadditive sequence $a_0, a_1, a_2, \ldots$ is one where any element $a_i$ is greater than
the sum of all its predecessors. An example is 1, 2, 4, 8, 16, 32, 64, . . . where each element equals
one plus the sum of all its predecessors. This sequence consists of the familiar powers of 2, so we
know that any integer can be expressed by it using just the digits 0 and 1 (the two bits). Another
example is 1, 3, 6, 12, 24, 48, . . . where each element equals 2 plus the sum of all its predecessors. It
is easy to see that any integer can be expressed by it using just the digits 0, 1 and 2 (the 3 trits).
Given a positive integer k, the sequence 1, 1 + k, 2 + 2k, 4 + 4k, . . . , $2^i(1 + k)$ is superadditive,
since each element equals the sum of all its predecessors plus k. Any non-negative integer can be
uniquely represented in such a system as a number x . . . xxy, where x are bits and y is in the range
[0, k].
In contrast, a general superadditive sequence, such as 1, 8, 50, 3102 can be used to represent
integers, but not uniquely. The number 50, e.g., equals 8 × 6 + 1 + 1, so it can be represented as
0062 = 0 × 3102 + 0 × 50 + 6 × 8 + 2 × 1, but also as 0100 = 0 × 3102 + 1 × 50 + 0 × 8 + 0 × 1.
It can be shown that $1 + r + r^2 + \cdots + r^k < r^{k+1}$ for any real number r > 1, which implies that
the powers of any real number r > 1 can serve as the base of a number system using the digits 0, 1,
2, . . . , d for some d.
The number $\phi = \frac{1}{2}(1 + \sqrt{5}) \approx 1.618$ is the well known golden ratio. It can serve as the base of
a number system using the two binary digits. Thus, e.g., $100.1_\phi = \phi^2 + \phi^{-1} \approx 3.24_{10}$.
Some real bases have special properties. For example, any positive integer R can be expressed as
$R = b_1 F_1 + b_2 F_2 + b_3 F_3 + b_4 F_5 + \cdots$ (that’s $b_4 F_5$, not $b_4 F_4$) where the $b_i$ are either 0 or 1, and the $F_i$ are
the Fibonacci numbers 1, 2, 3, 5, 8, 13, . . . This representation has the interesting property that the
string $b_1 b_2 \ldots$ does not contain any adjacent 1’s (this property is used by certain data compression
methods). As an example, the integer 33 equals the sum 1 + 3 + 8 + 21, so it is expressed in the
Fibonacci base as the 7-bit number 1010101.
A non-negative integer can be represented as a finite sum of binomials
$$n = \binom{a}{1} + \binom{b}{2} + \binom{c}{3} + \binom{d}{4} + \cdots, \qquad \text{where } 0 \le a < b < c < d < \cdots$$
are integers and $\binom{i}{n}$ is the binomial $\frac{i!}{n!(i-n)!}$. This is the binomial number system.

* Actually, there is. Two is the smallest integer that can be a base for a number system. Ten is the number of our
fingers.
The binary case is similar. The one’s complement of the number 0|101 is 1|010 but, when adding
numbers with different signs, a 1 has sometimes to be added in order to obtain the correct result. A
complete discussion of one’s complement arithmetic is outside the scope of this text and can be found in
texts on computer arithmetic. However, the general rule is easy to state.
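The rule usually given in texts on computer arithmetic is the end-around carry: when the addition produces a carry out of the sign position, the carry is removed and a 1 is added to the result. We assume here that this is the rule meant, and sketch it in Python for both the decimal (9's complement) and the binary (one's complement) cases. The function names and the word sizes are our own choices.

```python
def nines_complement(x: int, digits: int = 6) -> int:
    """Complement every decimal digit with respect to 9."""
    return (10 ** digits - 1) - x


def add_nines_complement(a: int, b: int, digits: int = 6) -> int:
    modulus = 10 ** digits
    total = a + b
    if total >= modulus:              # a carry out of the sign digit
        total = total - modulus + 1   # end-around carry
    return total


def ones_complement(x: int, bits: int = 4) -> int:
    """Complement every bit of a word."""
    return ~x & ((1 << bits) - 1)


def add_ones_complement(a: int, b: int, bits: int = 4) -> int:
    mask = (1 << bits) - 1
    total = a + b
    if total > mask:                  # a carry out of the sign bit
        total = (total & mask) + 1    # end-around carry
    return total & mask


# The decimal example: 12845 - 3806
print(nines_complement(3806))                                   # 996193
print(add_nines_complement(12845, nines_complement(3806)))      # 9039

# Binary: 6 + (-3) needs the correction, 2 + (-5) does not
print(format(add_ones_complement(0b0110, ones_complement(0b0011)), "04b"))  # 0011
print(format(add_ones_complement(0b0010, ones_complement(0b0101)), "04b"))  # 1100
```

The last result, 1100, is the one's complement of 0011 and therefore stands for −3.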
                                        Binary
Decimal             Sign-Magnitude   1's Complement   2's Complement
$2^{N-1} - 1$       01...111         01...111         01...111
...                 ...              ...              ...
7                   00...111         00...111         00...111
...                 ...              ...              ...
2                   00...010         00...010         00...010
1                   00...001         00...001         00...001
0                   00...000         00...000         00...000
−0                  10...000         11...111         NA
−1                  10...001         11...110         11...111
−2                  10...010         11...101         11...110
−3                  10...011         11...100         11...101
...                 ...              ...              ...
−7                  10...111         11...000         11...001
...                 ...              ...              ...
$-(2^{N-1} - 1)$    11...111         10...000         10...001
$-2^{N-1}$          NA               NA               10...000
(For nonnegative numbers, the 1's complement and 2's complement representations are the same as sign-magnitude.)
Table 2.14: Three methods for representing signed integers
2.20.3 Two’s Complement
Two’s complement is the method used to represent signed integers on all modern computers. It is based
on the idea that, since we sometimes have to artificially add a 1 to our one’s complement results, we may
as well add this 1 when the signed number is generated in the first place. The principle of this method is to
complement a number in two steps: (1) every bit is complemented (we get the one’s complement) and (2) a
1 is added. Thus, the two’s complement of +5 = 0|101 is obtained by first complementing every bit, which
gives 1|010, and then adding 1, which gives 1|011 = −5.
This is a general rule. It can be used to obtain the two’s complement of a positive number as well as that
of a negative one. The reader can easily verify this by applying the rule to the −5 above.
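The two-step rule can be sketched in a few lines of Python. This is only an illustration; the function name `twos_complement` and the default width of four bits are assumptions made here, not part of the notes.

```python
def twos_complement(pattern: int, bits: int = 4) -> int:
    """Return the two's complement of a bit pattern of the given width."""
    mask = (1 << bits) - 1        # 0b1111 for 4 bits
    ones = pattern ^ mask         # step 1: complement every bit
    return (ones + 1) & mask      # step 2: add 1 and drop any carry out


minus_five = twos_complement(0b0101)
print(format(minus_five, "04b"))                    # 1011
print(format(twos_complement(minus_five), "04b"))   # 0101
```

Applying the function twice returns the original pattern, which is the sense in which the rule works for negative numbers as well as positive ones.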
The downside of the two’s complement method is easy to see; the negative numbers are hard to read.
However, as pointed out before, this is not a serious disadvantage. The main advantage of this method is
not immediately apparent but is important. The two’s complement method simplifies the arithmetic rules
for adding and subtracting signed numbers. In fact, there are only two simple rules:
1. To subtract a − b, first obtain the two’s complement $\bar{b}^{2}$ of b, then add $a + \bar{b}^{2}$.
2. To add a + b, add corresponding pairs of bits, from right to left, including the pair of sign bits, while
propagating the carry in the usual way.
A simple example is 6 + (−3) = 0|110 + 1|101. A direct addition from right to left gives 10|011 which should
be read: carry = 1, sign = 0, magnitude = 011 = 3. The carry should be ignored, since we can only deal
with fixed-size, 4-bit numbers. Note how adding the third pair (1 + 1) generates an intermediate carry that
is propagated to the fourth pair. The fourth pair is added as (0 + 1) + carry = 1 + 1 = 10, that is, a sign
bit of 0 (the right sign) and a carry of 1 that is ignored.
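The two rules can be tried out with a short Python sketch that works on 4-bit patterns. The names `add`, `subtract`, `negate`, and `to_signed` are illustrative assumptions; a real ALU does this in hardware.

```python
BITS = 4
MASK = (1 << BITS) - 1            # keeps results to 4 bits (ignores the carry)


def negate(b: int) -> int:
    """Two's complement of a 4-bit pattern."""
    return (~b + 1) & MASK


def add(a: int, b: int) -> int:
    """Rule 2: add all bits, sign bits included, and ignore the carry out."""
    return (a + b) & MASK


def subtract(a: int, b: int) -> int:
    """Rule 1: a - b is a plus the two's complement of b."""
    return add(a, negate(b))


def to_signed(pattern: int) -> int:
    """Read a 4-bit pattern as a signed two's complement integer."""
    return pattern - (1 << BITS) if pattern & (1 << (BITS - 1)) else pattern


print(format(add(0b0110, 0b1101), "04b"))     # 6 + (-3) gives 0011
print(to_signed(subtract(0b0011, 0b0110)))    # 3 - 6 gives -3
```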
The interested reader is referred to any text on computer arithmetic for a more thorough discussion of
the properties of the two’s complement method. Perhaps the only point worth mentioning here is the absence
of negative zero. The two’s complement of 0|000 is 1|111 + 1 = 10|000 = (after ignoring the carry) 0|000.
This has an interesting side effect. With 4 bits we can have 16 combinations. If one combination is used
for zero, 15 combinations are left. Since 15 is an odd number, we cannot have the same amounts of positive
and negative numbers. It turns out (Table 2.14) that with four bits, we can represent the positive numbers
1, 2, . . . , 7 and the negative numbers −1, −2, . . . , −8. In the two’s complement method, the range of negative
numbers is always greater, by one number, than the range of positive numbers.
Exercise 2.9: What are the two ranges for n-bit numbers?
Since integers are not sufficient for all calculations, computer designers have developed other representations
where nonintegers can be represented and operated on. The most common of these is the floating point
representation (f.p. for short), normally called real. In addition to representing nonintegers, the floating
point method can also represent very large and very small (fractions very close to zero) numbers. The
method is based on the common scientific notation of numbers (for example $56 \times 10^{9}$) and represents the
real number x in terms of two signed integers a (the mantissa) and b (the exponent) such that $x = a \times 2^{b}$
(Figure 2.15).
Figure 2.15: Floating-point numbers, (a) as two integers (fields b and a), (b) with a sign bit (fields s, exp, and mantissa)
The word mantissa is a Latin term of Etruscan origin, meaning a makeweight, a small
weight added to a scale to calibrate it [Smith 23].
Table 2.16 shows some simple examples of real numbers that consist of two integers. Note that none of
the integers is very large or very small, yet some of the floating point numbers obtained are extremely large
or are very small fractions. Also some floating point numbers are integers, although in general they are not.
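       a      b     $a \times 2^{b}$          value
 1.    1      1     $1 \times 2^{1}$          2
 2.    1      10    $1 \times 2^{10}$         1024
 3.    1      20    $1 \times 2^{20}$         $\approx 10^{6}$
 4.    1      21    $1 \times 2^{21}$         $\approx 2 \times 10^{6}$
 5.    15     −3    $15 \times 2^{-3}$        15/8 = 1.875
 6.    1      −10   $1 \times 2^{-10}$        $\approx 0.001$
 7.    10     −20   $10 \times 2^{-20}$       $\approx 10 \times 10^{-6} = 10^{-5}$
 8.    100    −1    $100 \times 2^{-1}$       50
 9.    1234   −100  $1234 \times 2^{-100}$    $\approx 1234 \times 10^{-30}$
10.    1235   −100  $1235 \times 2^{-100}$    $\approx 1235 \times 10^{-30}$
Table 2.16: Examples of floating-point numbers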
Two important properties of floating point numbers are clearly illustrated in the table. One is that
the magnitude of the number is sensitive to the size of the exponent b; the other is that the sign of b is an
indication of whether the number is large or small. Values of b in the range of 20 result in floating point
numbers in the range of a million, and a small change in b (as, for example, from 20 to 21) doubles the value
of the floating point number. Also, floating point numbers with positive b tend to be large, while those with
negative b tend to be small (usually less than 1). This is not always true, as example 8 shows, but that
example is atypical, as is illustrated later.
The function of the mantissa a is not immediately clear from the table, but is not hard to see. The
mantissa contributes the significant digits to the floating point number. The numbers in examples 9 and 10
are both very small fractions; they are not much different (the difference between them is
$2^{-100} \approx 10^{-30}$), but they are not the same.
More insight into the nature of the mantissa is gained when we consider how floating point numbers are
multiplied. Given the two floating point numbers $x = a \times 2^{b}$ and $y = c \times 2^{d}$, their product is $x \times y = (a \times c) \times 2^{b+d}$.
Thus, to multiply two floating point numbers, their exponents have to be added, and their mantissas should
be multiplied. This is easy since the mantissas and exponents are integers, but it involves two problems:
1. The sum b + d may overflow. This happens when both exponents are very large. The floating point
product is, in such a case, too big to fit in one word (or one register), and the multiplication results in
overflow. In such a case, the ALU should set the V flag, and the program should test the flag after the
multiplication, before it uses the result.
2. The product a×c is too big. This may happen often because the product of two n-bit integers can
be up to 2n bits long. When this happens, the least-significant bits of the product a×c should be cut off,
resulting in an approximate floating point product x×y. Such truncation, however, is not easy to perform
when the mantissas are integers, as the next paragraph illustrates.
Suppose that the computer has 32-bit words, and that a floating-point number consists of an 8-bit
signed exponent and a 24-bit mantissa (we ignore the sign of the mantissa). Multiplying two 24-bit integer
mantissas produces a result that’s up to 48 bits long. The result is stored in a 48-bit temporary register,
and its most-significant 24 bits are extracted, to become the mantissa of the product x×y. Figure 2.17a,b,c
illustrates how those 24 bits can be located anywhere in the 48-bit register, thereby complicating the task
of extracting them.
Computer designers solve this problem by considering the mantissa of a floating-point number a fraction,
rather than an integer. Thus, a mantissa of 10110...0 equals $0.1011_2 = 2^{-1} + 2^{-3} + 2^{-4} = 11/16$. The
mantissa is stored in the floating point number as 10110...0, and the (binary) point is assumed to be to
the left of the mantissa. This explains the term mantissa and also implies that the examples in Table 2.16
are wrong. In practice, the mantissa a is always less than 1; it is a fraction. The examples of Table 2.18
have such mantissas and are therefore more realistic. (They are still wrong because they are not normalized.
Normalization of floating-point numbers is introduced below.) The values of the exponents and the mantissas
are shown in Table 2.18 both in binary and in decimal. It is clear that the smallest m-bit mantissa is
$00\ldots01 = 0.00\ldots01_2 = 2^{-m}$ (a small fraction), and the largest one is $11\ldots1 = 0.11\ldots1_2 \approx 0.99\ldots9_{10}$
(very close to 1).
These four numbers have the same value (what is it?), since each was obtained from its predecessor by two
steps. First the exponent was incremented by 1 (which multiplied the number by 2), then the
mantissa was shifted to the right (which divided the number by 2), effectively cancelling the change in
the exponent!
Which of those different representations is the best? The answer is, the first one, since this is where the
mantissa is shifted to the left as much as possible, allowing for the maximum number of significant digits.
This representation is called the normalized representation of the floating point number, and every floating
point number, except zero, has a unique normalized representation. Zero, by the way, can be represented
as a floating-point number, with a mantissa of 0 and any exponent. However, the best representation of
floating-point zero is a number of all zeros, since this makes floating-point zero identical to integer zero,
simplifying a comparison between them.
We therefore conclude that the smallest normalized mantissa is $10\ldots0 = 0.1_2 = 0.5$, regardless of the
size of the mantissa.
The ALU circuits are therefore designed to normalize every floating point result when it is generated,
and also to test the mantissa of every new floating-point result. If the mantissa is zero, the entire number is
cleared to make it a true floating-point zero.
Since the mantissa is a fraction, multiplying two mantissas results in a fraction. If the mantissas are
large (i.e., close to 1), their product will be close to 1. If they are small (i.e., 0.5 or a little larger), their
product will be $0.25 = 0.01_2$ or a little larger. The point is that it is now easy to identify the region containing
the 24 most-significant bits of the product. As Figure 2.17d,e,f shows, this is normally the left half of the
48-bit register, except when the product is less than 0.5, in which case there is one zero on the left, and the
product should be left-shifted one position in the 48-bit register before the leftmost 24 bits are extracted.
Exercise 2.10: Can there be more than one zero on the left of the product of two normalized mantissas?
It is important to realize that even though mantissas are fractions, they can be multiplied as integers.
The ALU does not need a special circuit to multiply them.
The rules for multiplying the two floating point numbers $x = a \times 2^{b}$ and $y = c \times 2^{d}$ are thus (we assume
8-bit exponents and 24-bit mantissas):
1. Add b + d. This becomes the exponent E of the product.
2. Multiply a×c as integers and store the result on the left side of a temporary, 48-bit register. If the
leftmost bit of the product is zero, shift the product one position to the left, and compensate by decrementing
the exponent E by 1. If E is too big to fit in eight bits, set the V flag and terminate the multiplication.
3. Extract the leftmost 24 bits of the 48-bit register and make them the mantissa M of the product.
This mantissa is already normalized.
4. Pack the new exponent E and mantissa M to form the floating-point product of x and y.
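The four steps can be sketched in Python as follows. This is a rough model under stated assumptions, not the circuit of any particular ALU: mantissas are held as 24-bit integers that stand for fractions, the exponent range −128 to 127 is assumed for an 8-bit signed exponent, a Python exception stands in for setting the V flag, and the name `fp_multiply` is hypothetical.

```python
M = 24                        # mantissa width in bits
E_MIN, E_MAX = -128, 127      # assumed range of an 8-bit signed exponent


def fp_multiply(a: int, b: int, c: int, d: int):
    """Multiply (a, b) by (c, d); each pair is (mantissa, exponent).

    A mantissa is an M-bit integer standing for the fraction a / 2**M,
    and is assumed to be normalized (leftmost bit set) or zero.
    """
    e = b + d                                   # step 1: add the exponents
    product = a * c                             # step 2: up to 48 bits
    if product and not (product >> (2 * M - 1)) & 1:
        product <<= 1                           # leftmost bit is 0: shift left
        e -= 1                                  # and compensate in the exponent
    if not E_MIN <= e <= E_MAX:
        raise OverflowError("exponent does not fit in eight bits")
    m = product >> M                            # step 3: leftmost 24 bits
    return m, e                                 # step 4: the packed pair


half = 1 << (M - 1)                             # mantissa 0.1000... = 0.5
print(fp_multiply(half, 3, half, 4) == (half, 6))   # True: 0.25 * 2**7 = 0.5 * 2**6
```

Truncating with `>> M` simply drops the low 24 bits of the product; rounding is a separate design choice that this sketch does not model.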
As can be seen from Table 2.18, the binary representation of the exponent is special in that it uses a sign
convention that’s the opposite of the normal. A sign bit of 0 indicates a negative exponent, so an exponent
of $1001_2$ has a value of +1 and the bit pattern 0110 means an exponent of −2. Another way of looking at
the exponent is to visualize it as biased. We can imagine the exponent as if a certain number, a bias, is
always added to it and, in order to get the true value of the exponent, the bias should be subtracted. In
our example, with a 4-bit exponent, the bias is $2^{4-1} = 8$, and we say that the exponent is represented in
excess-8. All the 4-bit exponents are summarized in Table 2.19. A biased exponent is sometimes called an
exrad, meaning, it is an exponent expressed in a special radix.
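The excess-8 convention amounts to adding and subtracting a constant, as this small Python sketch shows. The function names are illustrative assumptions.

```python
BIAS = 8                      # 2**(4 - 1) for a 4-bit exponent field


def encode_exponent(e: int) -> int:
    """Return the 4-bit excess-8 pattern of a true exponent in -8..7."""
    if not -BIAS <= e < BIAS:
        raise ValueError("exponent does not fit in four bits")
    return e + BIAS


def decode_exponent(pattern: int) -> int:
    """Return the true exponent stored in a 4-bit excess-8 pattern."""
    return pattern - BIAS


print(decode_exponent(0b1001))                 # 1
print(decode_exponent(0b0110))                 # -2
print(format(encode_exponent(-8), "04b"))      # 0000
```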
normalizing the sum and packing it with the common exponent. In the general case, where the exponents
are different, the first step in adding the numbers is to equate the exponents. This can be done either by
decreasing the larger of the two (and compensating by shifting its mantissa to the left), or by increasing the
smaller of the two and shifting its mantissa to the right. The ALU uses the latter method.
A simple example is the sum $x + y = 0.1100_2 \times 2^{3} + 0.1010_2 \times 2^{-3}$. The first step is to increase the
smaller exponent from −3 to 3 (6 units) and compensate by shifting the corresponding mantissa 6 positions
to the right, changing it from $0.1010_2$ to $0.0000001010_2$. The two numbers can now be easily added by adding
the mantissas (as if they were integers) and packing the sum with the common exponent. The result in our
example is $0.1100001010_2 \times 2^{3}$, and it serves to illustrate an important point in floating point arithmetic. The
result has a 9-bit mantissa, compared to the 2-bit and 3-bit mantissas of the original numbers x and y. If
we are limited to, say, 4-bit mantissas, we have to truncate the result by cutting off the least significant bits.
This, however, leaves us with $0.1100_2 \times 2^{3}$, which is the original value of x. Adding x + y has resulted in x
because of the limited capacity of our computer. We say that this operation has resulted in a complete loss
of significance, and many ALUs generate an interrupt in such a case, to let the user know that an arithmetic
operation has lost its meaning. This is just one example of the differences between computer arithmetic and
mathematics.
Adding two floating-point numbers is done in the following steps:
1. Compare the exponents. If they are equal, go to Step 3.
2. Select the number with the smaller exponent and shift its mantissa to the right by the difference of
the exponents.
3. Add the two mantissas.
4. Pack the sum with the larger of the two exponents.
5. Normalize the result if necessary.
Normalization is simple. Each mantissa is in the range [0.5, 1). If the exponents are equal, no mantissa
is shifted, and the sum of two mantissas is in the range [1, 2). If one mantissa is shifted to the right, the
sum may be less than 1, but is always greater than or equal to 0.5. Normalization is therefore needed only if
the sum is 1 or greater. Since it is always less than 2, the sum of the two mantissas can be normalized by
shifting it to the right one position and incrementing the exponent by 1.
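The five steps, together with the loss of significance in the example above, can be reproduced with a short Python sketch. It handles positive, normalized numbers only and truncates the shifted mantissa; the name `fp_add` and the 4-bit default width are assumptions chosen to match the example.

```python
def fp_add(m1: int, e1: int, m2: int, e2: int, bits: int = 4):
    """Add two positive floating-point numbers given as (mantissa, exponent).

    Each mantissa is a `bits`-wide integer standing for the fraction
    mantissa / 2**bits, so 0b1100 with bits=4 means 0.1100 in binary.
    """
    if e1 < e2:                        # make (m1, e1) the larger exponent
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= e1 - e2                     # steps 1-2: align, dropping low bits
    total = m1 + m2                    # step 3: add the mantissas as integers
    exponent = e1                      # step 4: keep the larger exponent
    if total >= 1 << bits:             # step 5: the sum is 1 or greater
        total >>= 1
        exponent += 1
    return total, exponent


# x = 0.1100 * 2**3 and y = 0.1010 * 2**-3 with 4-bit mantissas
mantissa, exponent = fp_add(0b1100, 3, 0b1010, -3)
print(format(mantissa, "04b"), exponent)       # 1100 3, which is x again
```

With only four mantissa bits, every bit of y is shifted out during alignment, so the sum equals x, as described above.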
Exercise 2.11: Add the two floating-point numbers 0|0111|10 . . . 0 and 0|0111|11 . . . 0. Notice that the size
of the mantissa is unspecified, but the exponent consists of one sign bit and three bits of magnitude.
Even though floating-point addition is a complex process involving several steps, it can be speeded up
by using the idea of pipelining. Several pairs of numbers may simultaneously be in different stages of being
added. Pipelining numbers in an ALU circuit is called vector processing (Section 5.12).
[Juffa 00] is a detailed bibliography of about 1600 references for floating-point operations.
How many floating-point numbers can be represented in a given computer? In our simple example, the
mantissa and exponent occupy four bits each, and there is also the sign bit of the entire number. Each of our
floating point numbers thus occupies nine bits which, consequently, allows for up to $2^{9} = 512$ numbers to be
represented. In general, if the word size is N, then $2^{N}$ floating-point numbers can be represented. Since the
sign is a single bit, about half of those numbers are positive and half are negative. Also, since the exponent’s
sign requires one bit, about half the $2^{N}$ numbers are less than 1 (in absolute value) and half are greater than
1. The result is (Figure 2.20) that half the floating-point numbers are concentrated in the range (−1, 1) and
the other half are in the ranges (−∞, −1) and (1, ∞).
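[Figure 2.20: the floating-point numbers on the number line; one-fourth of them lie in each of the ranges (−∞, −1), (−1, 0), (0, 1), and (1, ∞)]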
More insight into the behavior of floating-point numbers is gained by considering the distance between
consecutive floating-point numbers. High-school mathematics teaches that every integer I has a successor,
denoted by I + 1, but that a real number R has no immediate successor. This is proved by arguing that if R
has a successor S, then we could always find another number, namely (R + S)/2, that’s located between R
and S. The real numbers are therefore said to form a continuum. In the computer, in contrast, everything
is finite and discrete. Floating-point is normally used as the representation of the real numbers, but there
is only a finite quantity of floating-point numbers available ($2^{N}$, where N is the word size), so they do not
form a continuum. As a result, each floating-point number has a successor.
To find the successor of a floating-point number we have to increment its mantissa by the smallest
amount possible (we already know that incrementing the exponent, even by 1, doubles the size of the
number). The smallest increment is obtained by considering the mantissa an integer and incrementing that
integer by one. As an example consider a hypothetical computer with 32 bits per word. If we assume that
the exponent field is eight bits wide (one bit for the sign and seven for the magnitude), then the mantissa
gets the remaining 24 bits. Again, one bit is reserved for the sign and 23 bits are left for the magnitude.
Since the mantissa is a fraction, its 23 bits have values ranging from $2^{-1}$ (for the leftmost bit) to $2^{-23}$ (for
the rightmost bit). Incrementing the mantissa by 1 therefore increases its value by $2^{-23}$ and increases the
value of the floating-point number by $2^{-23} \times 2^{e} = 2^{-23+e}$, where e is the exponent of the number.
Since the smallest 8-bit signed exponent is −128, we conclude that the smallest possible increment of a
floating-point number on our example computer is $2^{-23-128} = 2^{-151} \approx 3.5 \cdot 10^{-46}$. Adding this increment
to the smallest floating-point number results in the successor of that number. The quantity $2^{-151}$ is called
the resolution of the floating-point numbers on the particular computer.
On the other hand, the largest exponent is 127, so the distance between the largest floating-point number
and its predecessor is $2^{-23+127} = 2^{104} \approx 2 \cdot 10^{31}$. The conclusion is that as we move, on this particular
computer, from the smallest floating-point number to the largest one, the distance between consecutive
numbers varies from the very small quantity $3.5 \cdot 10^{-46}$ to the very large quantity $2 \cdot 10^{31}$; a wide range.
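The two extreme distances are easy to check numerically. The following Python lines are a quick sketch for the hypothetical 32-bit format described above (23 magnitude bits in the mantissa, exponents from −128 to 127); the function name `spacing` is an assumption.

```python
MANTISSA_BITS = 23            # magnitude bits of the mantissa in the example


def spacing(e: int) -> float:
    """Distance between consecutive floating-point numbers with exponent e."""
    return 2.0 ** (e - MANTISSA_BITS)


print(spacing(-128))          # about 3.5e-46, i.e. 2**-151
print(spacing(127))           # about 2.0e+31, i.e. 2**104
```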
The range of floating-point numbers is not hard to estimate. It depends mainly on the size of the
exponent and is surprisingly sensitive to that size. Assuming a floating-point representation with an e-bit
exponent and an m-bit mantissa, the smallest positive floating-point value is obtained when the exponent
has the largest negative value and the mantissa, the smallest (normalized) value. The format of this number
is therefore:
$$0\ \underbrace{00\ldots0}_{e}\ \underbrace{10\ldots0}_{m} = \frac{1}{2} \times 2^{-2^{e-1}} = 2^{-2^{e-1}-1}.$$
Since the mantissa is normalized, the smallest mantissa has a value of 1/2.
The largest floating-point number is obtained when both the exponent and the mantissa have their
largest values:
$$0\ \underbrace{11\ldots1}_{e}\ \underbrace{11\ldots1}_{m} = 0.11\ldots1_2 \times 2^{2^{e-1}-1} = (1 - 2^{-m}) \times 2^{2^{e-1}-1} \approx 2^{2^{e-1}-1}.$$
Assuming that $2^{-m}$ is a very small fraction, it can be ignored and $(1 - 2^{-m})$ can be replaced by 1. This
is another way of saying that the mantissa affects only the precision, and not the size, of the floating-point
number. Both extreme sizes depend only on e. On the negative side, the largest and smallest negative
floating-point numbers are about the same size as their positive counterparts. Table 2.21 summarizes the
properties of the floating-point numbers found on some historically-important computers.
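The two formulas can be evaluated exactly with Python's `fractions` module. This is a small sketch that assumes the format described above (an e-bit exponent that ranges from $-2^{e-1}$ to $2^{e-1}-1$ and an m-bit fractional mantissa); the function name `fp_range` is hypothetical.

```python
from fractions import Fraction


def fp_range(e_bits: int, m_bits: int):
    """Return (smallest positive normalized value, largest value), exactly."""
    half = 2 ** (e_bits - 1)
    smallest = Fraction(1, 2) * Fraction(2) ** (-half)
    largest = (1 - Fraction(1, 2 ** m_bits)) * Fraction(2) ** (half - 1)
    return smallest, largest


smallest, largest = fp_range(8, 23)
print(smallest == Fraction(1, 2 ** 129))       # True
print(float(largest))                          # roughly 1.7e+38
```

Changing `m_bits` barely moves either result, while adding one bit to `e_bits` squares the order of magnitude of the range.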
Since floating-point numbers are so popular, attempts have been made to standardize their representation.
The most important of these is the IEEE floating-point standard [Stevenson 81]. The acronym IEEE
stands for “Institute of Electrical and Electronics Engineers.” Among its many activities, this organization
also develops standards. The full title of the IEEE floating-point standard is “ANSI/IEEE Standard
754-1985, Standard for Binary Floating Point Arithmetic.” It defines three representations:
1. A Short Real. This has one sign bit, eight exponent bits (the exponent is biased by 127), and 23
significand (mantissa) bits, for a total of 32 bits. The smallest value is
$$2^{-2^{e-1}-1} = 2^{-128-1} = 2^{-129} \approx 10^{-38}.$$
The table shows that the largest fixed-point number that can be represented in this way is 4095.875 (it
becomes the integer 32,767) and the smallest fixed-point numbers are any real numbers close to and less than
−4096.125 (they are represented by the smallest 16-bit 2’s-complement integer, −32,768). The real numbers
in the interval (−4096.125, 4095.875] are thus represented by $2^{16} = 65{,}536$ different fixed-point numbers.
Dividing 65,536 by the length 8,192 of the interval yields the resolution $8 = 2^{3} = 2^{p}$.
Different applications require different fixed-point resolutions, which is why the fixed-point representa-
tion is implemented by software. Another point to consider is that floating-point operations are different
from integer operations and require hardware implementation for fast execution. In contrast, fixed-point
operations are operations on integers and can be performed by the same ALU circuits that perform integer
arithmetic.
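A software implementation of fixed-point numbers can be as small as the Python sketch below. It assumes p = 3 fraction bits in a 16-bit two's complement integer, and it assumes that conversion rounds down and clamps out-of-range values; these choices, like the function names, are made here for illustration only.

```python
import math

P = 3                                  # assumed number of fraction bits
INT_MIN, INT_MAX = -32768, 32767       # 16-bit two's complement range


def to_fixed(x: float) -> int:
    """Scale x by 2**P, round down, and clamp to 16 bits."""
    return max(INT_MIN, min(INT_MAX, math.floor(x * 2 ** P)))


def from_fixed(n: int) -> float:
    """Return the real value that the integer n stands for."""
    return n / 2 ** P


print(to_fixed(4095.875))                              # 32767
print(from_fixed(to_fixed(1.5) + to_fixed(2.25)))      # 3.75
```

The addition in the last line is an ordinary integer addition, which is why no special hardware is needed.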
2.23 Decimal (BCD) Numbers
We are used to decimal numbers; computers find it easy to deal with binary numbers. Conversions are
therefore necessary, and can easily be done by the computer. However, sometimes it is preferable to avoid
number conversions. This is true in an application where many numbers have to be input, stored in memory,
and output, with very little processing done between the input and the output. Perhaps a good example is
inventory control.
Imagine a large warehouse where thousands or even tens of thousands of items are stored. In a typical
inventory control application, a record has to be input from a master file for each item, stored in memory,
updated, perhaps printed, and finally written on a new master file. The updating usually involves a few
simple operations such as incrementing or decrementing the number of units on hand. In such a case it may
be better not to convert all the input to binary, which also saves conversion of the output from binary.
Such a situation is typical in data processing applications, which is why computers designed specifically
for such applications support BCD numbers in hardware. The idea in a BCD number is to store the decimal
digits of a number in memory, rather than converting the entire number to binary (integer or floating point).
Since memory can only contain bits, each decimal digit has to be converted to bits, but this is a very
simple process. As an example, consider the decimal number −8190. Converting this number to binary
integer is time consuming (try it!) and the result is 13 bits long (14, including the sign bit). On the other
hand, converting each decimal digit is quick and easy. It yields 1000 0001 1001 0000. Each decimal digit is
represented as a group of four bits, since the largest digit (=9) requires four bits.
The principle of this number representation is to code each decimal digit in binary, hence the name
binary coded decimal (BCD). It turns out that the sign of such a number can also be represented as a group
of four bits, and any of the six unused groups 1010, 1011, . . . , 1111 (representing the numbers 10–15) can be
used. Assuming that 1101 represents a negative sign, our number can be represented as 1101 1000 0001 1001
0000. This is very different from the integer representation of $-8190 = 1|0000000000010_2$ and requires 20
bits instead of 14. It also requires special ALU circuits for the arithmetic operations. The only advantage of
BCD numbers is the easy conversion. In many data processing applications, this advantage translates into
considerable increase in speed, justifying hardware support of this representation.
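The digit-by-digit conversion can be sketched in Python. The sign group 1101 is the one assumed in the text; the function name `to_bcd` and the choice to print the groups separated by spaces are illustrative assumptions.

```python
NEGATIVE_SIGN = "1101"        # one of the six unused 4-bit groups


def to_bcd(n: int) -> str:
    """Return the BCD groups of n, with a leading sign group if n < 0."""
    groups = [format(int(digit), "04b") for digit in str(abs(n))]
    if n < 0:
        groups.insert(0, NEGATIVE_SIGN)
    return " ".join(groups)


print(to_bcd(-8190))          # 1101 1000 0001 1001 0000
```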
2.24 Other BCD Codes
A BCD code is any binary code assigned to the ten decimal digits. Many such codes are possible, and some
have special properties that make them useful for certain applications. Since there are 10 decimal digits, any
BCD code must be at least four bits long, but only 10 of the 16 4-bit combinations are used.
Exercise 2.13: How many BCD codes are possible?
The Excess-3 Code. This code (also referred to as XS3) is derived from the familiar BCD code by
adding 3 to each of the 10 codes (Table 2.23). Thus, 0 is represented in XS3 by 0011 (3), 1 is represented
by 0100 (4), and so on up to 9, which is represented by 1100 (12). This code simplifies the addition of BCD
digits. Adding two such digits results in a sum that has an excess of 6. Suppose that we add the XS3 codes
of the digits a and b and obtain the sum c. If
a + b ≥ 10, then c ≥ 16 and therefore has the form 1xxxx. The correct result is obtained if we consider the
1 on the left the tens digit of the sum, delete it, and add 3 to the remaining four bits of c. An example is
4 + 8 = 12. In excess-3 this produces 0111 + 1011 = 10010 = 18. Removing the tens digit leaves 0010, and
adding 3 produces 0101. The final result is 12, where the most-significant digit is 1 and the least-significant
digit is 0101 (2 in XS3). On the other hand, if a + b < 10, then c < 16 and the correct result is obtained if 3
is subtracted from c. As an example, consider 2 + 6 = 0101 + 1001 = 1110 = 14. When 3 is subtracted, the
result is 14 − 3 = 11, which is 8 in XS3.
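The correction rule for adding two XS3 digits can be written as a short Python sketch. The function names are assumptions; the function returns the tens digit as a plain 0 or 1 and the units digit as an XS3 code.

```python
def xs3(digit: int) -> int:
    """Return the excess-3 code of a decimal digit."""
    return digit + 3


def xs3_add(a: int, b: int):
    """Add decimal digits a and b; return (tens digit, XS3 code of units)."""
    c = xs3(a) + xs3(b)               # the raw sum has an excess of 6
    if c >= 16:                       # form 1xxxx: a + b is 10 or more
        return 1, (c - 16) + 3        # delete the leading 1, then add 3
    return 0, c - 3                   # otherwise subtract 3


tens, units = xs3_add(4, 8)
print(tens, format(units, "04b"))     # 1 0101
tens, units = xs3_add(2, 6)
print(tens, format(units, "04b"))     # 0 1011
```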
Many BCD codes are weighted. Each bit $b_i$ in such a code is assigned a weight $w_i$, and the decimal digit
represented by bits $b_3 b_2 b_1 b_0$ is the weighted sum $b_3 w_3 + b_2 w_2 + b_1 w_1 + b_0 w_0$. The familiar BCD code is the
weighted 8421 code. Table 2.23 lists the $631\bar{1}$ (weights 6, 3, 1, and −1) and 2421 weighted codes.
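A weighted code is easy to experiment with in Python. The sketch below decodes a code word as a weighted sum and checks by brute force whether a set of weights can produce all ten decimal digits; both function names are assumptions made for this illustration.

```python
from itertools import product


def weighted_value(bits: str, weights) -> int:
    """Return the weighted sum of a code word such as '1001'."""
    return sum(int(b) * w for b, w in zip(bits, weights))


def can_code_all_digits(weights) -> bool:
    """Check whether every digit 0-9 is some weighted sum of the weights."""
    sums = {
        sum(b * w for b, w in zip(pattern, weights))
        for pattern in product((0, 1), repeat=len(weights))
    }
    return all(digit in sums for digit in range(10))


print(weighted_value("1001", (8, 4, 2, 1)))      # 9
print(can_code_all_digits((6, 3, 1, -1)))        # True
print(can_code_all_digits((8, 7, 6, 5)))         # False
```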
Table 2.23: Six BCD codes (four self-complementing, two reflected)
Exercise 2.14: Show the 5421, 5311, and 7421 BCD codes.
The weights must be selected such that they can produce the ten decimal digits. The set 8765, for
example, cannot serve as weights since the digits 1–4 cannot be expressed as a weighted sum of 8, 7, 6, and
5. Weighted BCD codes with more than four bits are sometimes useful when increased reliability is needed.
For example, a parity bit can be added to the 8421 code, to produce the p8421 code. A 2-out-of-5 code
has five bits, two of which are 1’s. This can be used for error detection. The number of ways to select two
objects out of five is
$$\frac{5!}{2!\,(5-2)!} = 10,$$
and these ten codes are assigned to the ten digits such that this code looks like the weighted code 74210,
with the exception of the code for 0.
Code:    00011  00101  00110  01001  01010  01100  10001  10010  10100  11000
Digit:     1      2      3      4      5      6      7      8      9      0
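The 2-out-of-5 assignment can be generated mechanically, as in this Python sketch. It treats the code as weighted 74210 and maps the one word whose weighted sum is 11 to the digit 0; the function names are assumptions.

```python
from itertools import combinations

WEIGHTS = (7, 4, 2, 1, 0)             # weights of the five bits, left to right


def two_out_of_five() -> dict:
    """Return a table that maps each decimal digit to its 5-bit code word."""
    table = {}
    for ones in combinations(range(5), 2):
        bits = "".join("1" if i in ones else "0" for i in range(5))
        value = sum(WEIGHTS[i] for i in ones)
        table[0 if value == 11 else value] = bits   # 11000 stands for 0
    return table


def is_valid(code: str) -> bool:
    """Error check: a legal word has five bits, exactly two of them 1's."""
    return len(code) == 5 and code.count("1") == 2


table = two_out_of_five()
print(table[7], table[0])             # 10001 11000
print(is_valid("00111"))              # False: three 1's signal an error
```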
The biquinary code (2-out-of-7) is a 7-bit code where exactly two bits are 1’s. This is almost twice the
number of bits required to code the ten decimal digits, and the justification for such a long code is that it
offers simple error checking and simple arithmetic rules. The seven code bits are divided into two groups
of two bits and five bits, with a single 1 bit in each group, as shown in Table 2.24.
Digit   Bits 65   Bits 43210
  0       01        00001
  1       01        00010
  2       01        00100
  3       01        01000
  4       01        10000
  5       10        00001
  6       10        00010
  7       10        00100
  8       10        01000
  9       10        10000
Table 2.24: The 2-out-of-7 BCD code
The first four codes of Table 2.23 are self-complementing. The 1’s complement of the code of digit d is
also the 9’s complement of d. Thus, the XS3 code of 3 is 0110 and the 1’s complement 1001 is the XS3 code
of 6. Notice that the last of these codes, code III, is unweighted.
The last two codes of Table 2.23 are reflected BCD codes. In such a code, the 9’s complement of a digit
d is obtained by complementing just one bit. In code II, this bit is the leftmost one and in code I it is the
second bit from the left.
Unit distance codes: The property of unit distance codes is that the codes of two consecutive symbols
$x_i$ and $x_{i+1}$ differ by exactly one bit. The most common unit distance code is the Gray code (sometimes
called reflected Gray code or RGC), developed by Frank Gray in the 1950s. This code is easy to generate
with the following recursive construction:
Start with the two 1-bit codes (0, 1). Construct two sets of 2-bit codes by duplicating (0, 1) and
appending, either on the left or on the right, first a zero, then a one, to the original set. The result is (00, 01)
and (10, 11). We now reverse (reflect) the second set, and concatenate the two. The result is the 2-bit
RGC (00, 01, 11, 10); a binary code of the integers 0 through 3 where consecutive codes differ by exactly one
bit. Applying the rule again produces the two sets (000, 001, 011, 010) and (110, 111, 101, 100), which are
concatenated to form the 3-bit RGC. Note that the first and last codes of any RGC also differ by one bit.
Here are the first three steps for computing the 4-bit RGC:
Add a zero (0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100),
Add a one (1000, 1001, 1011, 1010, 1110, 1111, 1101, 1100),
reflect (1100, 1101, 1111, 1110, 1010, 1011, 1001, 1000).
Exercise 2.15: Write software to calculate the RGC and use it to compute and list the 32 5-bit RGC codes
for the integers 0–31.
The conversion from the Gray code of the integer i back to its binary code is $b_k = g_k \oplus b_{k+1}$, for
$k = n, n-1, \ldots, 1, 0$, where $g_n \ldots g_2 g_1 g_0$ is the RGC of i and $0 b_n \ldots b_2 b_1 b_0$ is the binary code of i (bit $b_{n+1}$
is zero).
It is also possible to generate the reflected Gray code of an integer n with the following nonrecursive rule:
Exclusive-OR n with a copy of itself that’s logically shifted one position to the right. In the C programming
language this is denoted by n^(n>>1).
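Both directions of the conversion fit in a few lines of Python. The sketch uses the exclusive-OR rule for encoding and the bit-by-bit rule $b_k = g_k \oplus b_{k+1}$ for decoding; the function names are assumptions.

```python
def to_gray(n: int) -> int:
    """Reflected Gray code of n: exclusive-OR n with itself shifted right."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Recover the binary code, from the leftmost bit down to bit 0."""
    b = 0
    for k in range(g.bit_length() - 1, -1, -1):
        above = (b >> (k + 1)) & 1             # binary bit k+1 (0 at the top)
        bit = ((g >> k) & 1) ^ above           # b_k = g_k XOR b_(k+1)
        b |= bit << k
    return b


print([format(to_gray(i), "02b") for i in range(4)])   # ['00', '01', '11', '10']
print(from_gray(to_gray(25)))                           # 25
```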
Figure 2.25: Rational numbers, (a) fixed-slash, (b) floating-slash
In a fixed-slash rational, the value of M is fixed. All such numbers allocate M of the N bits to b, and
N − M bits to a. In a floating-slash rational, the relative sizes of a and b may vary and the N bits of a word
are divided into three fields: a count field C (of fixed size, usually 3–4 bits) specifying the size of a, followed
by the C bits of the a field, followed by the N − C − 3 bits of the b field (assuming a 3-bit count field).
The advantages of rational numbers are as follows:
Many numbers that can be only approximately represented as floating point numbers, can be repre-
sented as rational numbers exactly.
Certain mathematical operations produce very large intermediate results (although the final results
may be reasonably sized). Using floating point arithmetic, every time a very large result is generated,
precision (i.e., significance) may be lost. With rational arithmetic, it is easier to represent large results in
double precision (using a pair of words for each number) without losing accuracy. The final results may be
represented in single precision.
It is easy to obtain the inverse of a rational number, and the inverse is always exact. The inverse of
a/b is, of course, b/a. The inverse is easy to generate and store as a rational number. This is not
always true for floating point numbers.
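The first and last of these advantages can be seen with Python's `fractions.Fraction` type. Note the assumption: `Fraction` uses integers of unlimited size, so it illustrates exact rational arithmetic in general and not the fixed-slash or floating-slash word formats described above.

```python
from fractions import Fraction

# Exact representation: 1/10 + 2/10 is exactly 3/10 as a rational,
# but only approximately so in binary floating point.
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))   # True
print(0.1 + 0.2 == 0.3)                                       # False

# The inverse of a/b is b/a, and it is exact.
print(1 / Fraction(3, 7))                                     # 7/3
```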
The main questions that have to be answered before implementing such a representation on a computer
are:
How complex are the arithmetic operations on such numbers? If it is very difficult to add or to
multiply such numbers, the rational representation may not be practical.
How accurate can those numbers be? Given an N-bit word, what are the smallest and largest numbers
that can fit in the word?
Compatibility. Performing the same calculations on computers with different word size may not yield
the same results. How does the accuracy of the result depend on the word size?
A good reference for fixed-slash and floating-slash representations is [Matula and Kornerup 78].
2.26 Carry and Overflow
These two concepts are associated with arithmetic operations. Certain operations may result in a carry,
in an overflow, in neither, or in both. There is much confusion and misunderstanding among computer users
concerning carry and overflow, and this section attempts to clear up this confusion by defining and illustrating
these concepts.
Definitions: Given two N -bit numbers used in an addition or subtraction, we would like, of course, to
end up with a result that has the same size, and fits in the same registers or memory words where the original
operands came from. We consequently say that the operation results in a carry if the sum (or difference)
has N + 1 bits. Similarly, the operation results in an overflow if the result is too large to fit in a register or
a computer word.
These definitions look suspiciously similar, and the next example only reinforces this suspicion. Imagine
a computer with 3-bit words (so the largest number is 7) and we first try to add 5+7. In binary, the operation
is: 101 + 111 = 1100. The addition is done in the usual way, moving from right to left and propagating the
carry. The result, naturally, is 12.
Did we get a carry? Yes, since we started with 3-bit numbers and ended up with a 4-bit result. Did we
get an overflow? Yes, since the maximum number that can be represented on our computer is 7, and the
result is 12, a number too large to fit in a 3-bit register.
It seems that although our definitions of carry and overflow look different, they are actually two aspects
of the same phenomenon, and always occur together, which should make them identical.
The explanation is that yes, in the above example, carry and overflow always go together, but in a real
computer, they are different. This is because real computers use signed numbers, while our example uses
unsigned numbers. When signed numbers are used, with the leftmost bit reserved for the sign, carry and
overflow are indeed different.
To illustrate this, we extend our computer’s capacity to four bits per word, a sign bit plus three
magnitude bits. Working with 2’s complement numbers, the range of numbers that can be represented
is from −8 to +7. We use the two numbers 5 = 0 101₂ (with −5 = 1 011₂) and 6 = 0 110₂ (with −6 = 1 010₂)
for the examples in Table 2.26.
The table summarizes four different operations on these numbers. It is clear that carry and overflow do
not always occur together; they are different aspects of the arithmetic operations. The carries in the table
are generated when the results are 5 bits long. The overflows occur when the results do not fit in a register,
i.e., they are less than −8 or greater than 7.
decimal      5      −5      6      −6
operands     6      −6     −5       5
binary       0101   1011    0110    1010
operands     0110   1010    1011    0101
result:      1011   10101   10001   1111
carry:       no     yes     yes     no
overflow:    yes    yes     no      no
Table 2.26: Carry and overflow
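Note that a carry does not necessarily indicate a bad result. The third example, 6 + (−5), generates a
carry, yet its 4-bit result 0001 = 1 is correct and the carry can simply be ignored. (The second example
also generates a carry, but its result is wrong because of the overflow, not because of the carry.)
The reason the ALU still has to detect a carry has to do with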
double-precision arithmetic and is explained below. Overflow, on the other hand, is always bad and should
be detected by the program by means of a conditional branch (most computers can optionally generate an
interrupt when overflow is detected).
An interesting question is, how does the ALU detect overflow? Carry is easy for the ALU to detect,
since it involves an additional bit. Overflow, however, causes a wrong result and is easy for us to detect, since
we know in advance what the result should be. The ALU, however, does not know the result in advance,
and a simple test is needed to let the ALU detect overflow.
This test involves the sign bits. A look at Table 2.26 verifies that every time overflow occurs, a bit has
overflowed into the sign position, corrupting the sign. Overflowed results always have the wrong sign. Since
the sign is easy to determine in advance, the ALU uses it as an indication of overflow. The test uses the
simple fact that, when adding numbers of the same sign, the result should have that sign. Thus, when two
positive numbers are added, the result should be positive. The ALU therefore performs the following test.
If the two numbers being added have the same sign and the result has the opposite sign, the overflow flag
should be set.
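The sign test can be sketched in a few lines. The function below is only an illustration, not the circuit of any particular ALU; the name add_signed and the default width of four bits are choices made here to match Table 2.26.

```python
def add_signed(a, b, bits=4):
    """Add two bit patterns of width `bits`; return (result, carry, overflow)."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    total = (a & mask) + (b & mask)
    result = total & mask
    carry = total > mask                       # an extra (bits+1)-th bit appeared
    same_sign = (a & sign) == (b & sign)
    # Overflow: operands agree in sign but the result has the opposite sign.
    overflow = same_sign and (result & sign) != (a & sign)
    return result, carry, overflow

# The four columns of Table 2.26: 5+6, (-5)+(-6), 6+(-5), (-6)+5.
for a, b in [(0b0101, 0b0110), (0b1011, 0b1010), (0b0110, 0b1011), (0b1010, 0b0101)]:
    result, carry, overflow = add_signed(a, b)
    print(f"{a:04b} + {b:04b} = {result:04b}  carry={carry}  overflow={overflow}")
```

Running the loop reproduces the carry and overflow rows of the table.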
Exercise 2.16: What is the overflow test when adding numbers with different signs?
The last point discussed here is the use of the carry. Normally, the presence of a carry does not indicate
a bad result, and carries are ignored. However, the carry is important in double-precision arithmetic. On
a computer with a short word size, the largest number that can be stored in a single word may sometimes
be too small and, in order to handle larger numbers, double precision is necessary. In double-precision
arithmetic, a number is stored in two consecutive words (Figure 2.27), where only one word has a sign bit.
Adding such numbers involves two steps: (1) the least significant parts are added (producing, perhaps, a
carry and/or overflow), (2) the most significant parts are added, together with any carry from the first step.
Example: Add 25 and 46 on a computer with four-bit words. Each number is represented in two parts
25 = 0001 1001 and 46 = 0010 1110. Adding the least significant parts 1001 + 1110 produces 0111 plus both
carry and overflow. The overflow simply indicates that the leftmost bit of the sum (=0) is different from
the leftmost bits of the original operands (=1). Since we know that those bits are magnitude, and not sign
bits, the overflow indication can be ignored. The carry, however, should not be ignored; it should be carried
over and added to the most significant parts. Adding these parts 0001 + 0010 produces 0011, and adding the
carry yields the final result 0100 which, combined with 0111, produces the correct sum, 0100 0111 = 71.
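The two-step addition of 25 and 46 can be checked with the following sketch. The helper names add_word and add_double are hypothetical, and the words are treated as plain unsigned 4-bit quantities, which is a simplification of the sign-bit layout described above.

```python
def add_word(a, b, carry_in=0, bits=4):
    """Add two unsigned `bits`-wide words plus a carry; return (sum, carry_out)."""
    total = a + b + carry_in
    return total & ((1 << bits) - 1), total >> bits

def add_double(a_hi, a_lo, b_hi, b_lo, bits=4):
    """Double-precision addition: low words first, then high words plus the carry."""
    lo, carry = add_word(a_lo, b_lo, 0, bits)
    hi, _ = add_word(a_hi, b_hi, carry, bits)
    return hi, lo

hi, lo = add_double(0b0001, 0b1001, 0b0010, 0b1110)   # 25 + 46
print(f"{hi:04b} {lo:04b} = {(hi << 4) | lo}")         # 0100 0111 = 71
```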
computers can be networked.
The input/output devices themselves are not described; this chapter concentrates on how data is moved
in and out of the computer.
Exercise 3.1: Is it possible to have a meaningful program without input and output?
Figure 3.1: Organization of a mainframe around the address and data buses
72 3. Input/Output
The example has to do with the way the control unit executes two different but similar instructions, a LOD
and an IN. The execution of the instruction LOD R1,17 has already been discussed in Section 1.4; it involves
the steps
1. Address bus←IRc
2. ‘read’
3. wait
4. R1←Data bus
A typical input instruction is IN R1,17. It instructs the computer to input one piece of data (normally one
byte) from input device 17 and store it in register 1. This instruction is executed by the following, similar
steps:
1. Address bus←IRc
2. ‘input’
3. wait
4. R1←Data bus
In either case the number 17 is sent, from IRc , to the address lines. In the former case it should go to
memory, to select location 17. In the latter case it should go to the I/O devices, to select device 17. How is
the number 17 sent to the right place? The answer has to do with the bus organization of the computer.
There are two main ways to organize the buses of a computer. They are called a single-bus organization
and a double-bus organization. There are also combinations and variations of these two methods.
A double bus organization is shown in Figure 3.2a. The principle is to have two buses, each with its
own address, data, and control lines. One bus connects the CPU to the memory and the other connects it
to the I/O devices.
Figure 3.2: Two bus organizations of a computer: (a) double bus, (b) single bus
The control unit of such a computer knows about the two buses, and always uses the right one. When
it executes a LOD or any other memory reference instruction, it sends and receives address and data signals
on the memory bus. Similarly, the I/O bus is used to route signals when an I/O instruction (such as IN or
OUT) is executed. In such a computer, address 17 can be sent to either bus. When sent on the memory bus it
identifies memory location 17. When sent on the I/O bus, the same number selects device 17. We say that a
double-bus computer has two address spaces. In a typical double-bus computer, the memory bus may have
25 address lines and the I/O bus may have 12 address (or rather, device select) lines. The memory address
space in this case has 2^25 = 32M addresses, while the I/O address space has 2^12 = 4K select numbers.
3.1 The I/O Processor
Exercise 3.2: Does that mean that the computer can have 4K I/O devices?
Just having two separate buses does not mean that they are used simultaneously. The control unit
executes only one operation at a time, and therefore only one bus is used at any given time. However, when
the computer has a DMA device or an I/O channel (Section 3.4), the two buses may be used at the same
time.
Figure 3.2b shows a single bus organization of a computer. There is only one bus, with one set of address
lines. These lines carry signals to the memory and to the I/O devices. The problem with this organization
is that if an instruction sends the number 17 on the address lines, then memory and I/O devices have to
have a way to figure out whether this 17 is a memory address or is a device select number. There are three
solutions to this problem.
1. The single-bus computer has only one address space, i.e., each address appears only once, and is
either the address of a memory location or the select number of an I/O device. If the single bus contains
16 address lines—for an address space of 64K—then the address space may be divided, for example, into a
62K partition and a 2K partition. The lowest partition contains 62K memory addresses, while the upper
partition has 2K I/O device numbers. The 16 address lines allow for only 62K (instead of 64K) memory
words, since part of the address space must be reserved for I/O device numbers. Often, it is easy to modify
the way the address space is partitioned. A user with modest memory requirements and many I/O devices
may want a division of 60K + 4K or a different partition. As a result, a single-bus organization offers more
flexibility than a double bus organization.
One feature of the single-bus organization is that the control unit executes the LOD and IN instructions
in an identical way. If we compare the sequences for the two instructions LOD R1,17 and IN R1,63017 we
find that in either case the address is sent from the IRc to the same address lines and the only difference
between the two instructions is the value of the address.
As a result, a single-bus computer has no need for separate I/O instructions, and the input/output is
done by means of memory reference instructions. This is called memory-mapped I/O.
2. The single bus gets one more line added to it, to distinguish between memory addresses and I/O
device select numbers. When a memory-reference instruction, such as LOAD or STORE, is executed, the control
unit sends a zero on this line. When an I/O instruction, such as IN or OUT, is executed, the control unit sends
a 1 on the same line. The memory unit examines this line every time it senses something new on the address
lines. If it sees a zero, it knows that the address lines carry a valid memory address. The I/O devices do
likewise, responding only when the line carries a 1.
3. Not every computer is a pure single-bus or double-bus machine. It is possible to design computers
whose bus organization is a combination of the two. The Intel 8080, one of the first successful 8-bit micro-
processors, uses a single bus with 16 address lines. It is possible to use the 8080 as a single-bus machine with
memory-mapped I/O and, in such a case, some of the 64K addresses in the address space must be devoted
to I/O device select numbers.
It is, however, also possible to use the 8080 as a double-bus computer. It has two instructions, IN and
OUT, that when executed, send a device select number on 8 of the 16 address lines. The particular organization
that’s being used at any time, single bus or double bus, is determined by the state of an additional control
line, the M/I line. When the control unit executes an IN or an OUT instruction, it sends the device number
on the lower 8 bits of the address bus and sets the M/I line to 0, to indicate an I/O operation. When a
memory reference instruction is executed, the control unit sends a 16-bit address on the address lines and
sets the M/I line to 1, to indicate a memory operation. When the 8080 is used as the CPU of a computer,
the computer should be designed in such a way that memory would be disabled when the M/I line is low.
The description above is a simplification of the actual operation of the 8080. For more information on
how the 8080 handles memory and I/O operations, see [Intel 75].
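The first two solutions amount to a small decoding rule, sketched below. This is an illustration only: the 62K + 2K split is the example partition from the text, while the function names, the numbering of devices from the start of the upper partition, and the sample address 65000 are assumptions made for the sketch.

```python
K = 1024
MEMORY_TOP = 62 * K          # assumed split: addresses below 62K are memory

def decode_memory_mapped(address):
    """Solution 1: one 64K address space, partitioned into memory and I/O."""
    if not 0 <= address < 64 * K:
        raise ValueError("address does not fit in 16 lines")
    if address < MEMORY_TOP:
        return ("memory", address)
    return ("device", address - MEMORY_TOP)   # device numbering is hypothetical

def decode_with_select_line(address, io_line):
    """Solution 2: an extra line tells memory (0) and I/O devices (1) apart."""
    return ("device" if io_line else "memory", address)

print(decode_memory_mapped(17))          # ('memory', 17)
print(decode_memory_mapped(65000))       # ('device', 1512)
print(decode_with_select_line(17, 0))    # ('memory', 17)
print(decode_with_select_line(17, 1))    # ('device', 17)
```

With the extra line, the same number 17 can name both a memory location and a device; with memory-mapped I/O, each number names exactly one of them.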
Following this introduction, the I/O processor can be described. Five types of I/O processors are
discussed in Sections 3.2 through 3.5 and are summarized in Table 3.3.
number, then the instruction LOD R1,65000 reads data from an input device instead of loading data from
memory. When register I/O is performed on such a computer, it is called memory-mapped I/O.
WAIT: IN R0,14
BMI WAIT
OUT R2,15
After the first character has been printed, the main program can forget about the printer output for a
while and can continue its main task. After about 10,000 instructions, the printer will finish with the first
character and will interrupt the computer. The interrupt handling routine is invoked and should execute the
following:
Test buffer B to see whether more characters remain to be printed.
If yes, execute:
LOD R2,from current position in buffer B
OUT R2,15
Update current position in buffer.
Else (buffer is empty) set a flag in memory.
Return.
This arrangement frees the program from having to test device status and wait for a “ready” status. Once
the program verifies that the printer is ready and will take the first character, it can switch to other tasks.
The entire output is performed by the interrupt handling routine. When there is no more output to be sent
to the printer, the handling routine sets a special flag in memory. The main program should test this flag
when it wants to use the printer again. If the flag is set, the printer is ready for the next stream of output.
Otherwise, the flag should be tested a while later.
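The division of labor between the main program and the interrupt handling routine can be simulated as follows. This is a sketch under stated assumptions: the classes Printer and InterruptOutput are hypothetical, and each call to handler() in the loop stands for one interrupt raised by the printer after it finishes a character.

```python
class Printer:
    """Hypothetical stand-in for output device 15; it records what it receives."""
    def __init__(self):
        self.printed = []

    def out(self, char):
        self.printed.append(char)

class InterruptOutput:
    def __init__(self, printer, buffer):
        self.printer = printer
        self.buffer = buffer          # buffer B from the text
        self.position = 0             # current position in the buffer
        self.done = False             # the flag in memory

    def start(self):
        """Main program: send the first character, then go do other work."""
        self.handler()

    def handler(self):
        """Interrupt handling routine, invoked each time the printer is ready."""
        if self.position < len(self.buffer):
            self.printer.out(self.buffer[self.position])
            self.position += 1
        else:
            self.done = True          # buffer is empty; tell the main program

printer = Printer()
job = InterruptOutput(printer, "HELLO")
job.start()
interrupts = 0
while not job.done:                   # each pass stands for one printer interrupt
    job.handler()
    interrupts += 1
print("".join(printer.printed), interrupts)   # HELLO 5
```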
Interrupt I/O can also be used for input. However, a program often needs the complete input in order to
proceed and, in such cases, using interrupt I/O does not save any time. Consider, for example, a command-
driven program. The next command is entered, character by character, from the keyboard, the last character
being a carriage return (cr ). While the command is being entered, the individual characters can be displayed.
However, the program cannot execute the command until it is fully entered (i.e., until the cr is sent from
the keyboard). In such a case, the program cannot do anything useful while the command is being entered,
and it therefore does not matter if the input is executed by register I/O (which occupies the main program)
or by interrupt I/O (which only occupies the interrupt handling routine).
Interrupt I/O saves computer time, especially if the I/O device is slow. This method is used for slow
devices and for low to medium volumes of information.
3.4 DMA
The acronym DMA stands for direct memory access. In this method, the I/O processor, called a DMA
device, handles the entire input or output process without help from the CPU. All that the CPU has to
do is initialize the DMA device by means of a few instructions, following which the DMA device starts,
and it performs the entire input or output operation independently of the CPU. After initializing the DMA
device, the CPU continues its fetch-execute cycle and executes the program without having to worry about
the details of the I/O. This makes DMA suitable for large quantities of I/O, and also makes it fast. In
fact, using DMA is like having two processors running simultaneously in the computer. One is the CPU,
executing programs, and the other is the DMA device, controlling the I/O process.
The DMA device causes data to be moved between the I/O device and memory directly. The word
“directly” is important; it means that the data moves between the two components in the shortest possible
way, without going through the processor or through the DMA device.
To start a DMA process, the program (in a higher-level language) executes a statement (or a command)
such as read(vol,filename,buf,10000). This command is compiled into a machine instruction (normally
called BRK or SVC) that creates an artificial interrupt (or a software interrupt, Section 1.7). The
interrupt tells the operating system that the program is asking for service (in this case, a DMA transfer),
and provides the operating system with the following pieces of information:
1. The direction of the data. It can be either input or output.
2. The volume name (the volume is normally a disk or a CD-ROM) and the filename.
3. The start address of the memory buffer reserved for the file.
4. The size of the buffer.
The operating system uses the volume’s name to select one of possibly several DMA devices. It uses the
volume’s directory to find the volume address of the file (and its size, if known), and prepares the following
items:
1. The direction of the DMA transfer (one bit).
2. The volume address (on a disk, this is the track and sector numbers) of the file.
3. The start address of the memory buffer.
4. The size of the buffer.
5. The file size (if known).
6. Other information, such as a start/stop bit, and a bit which indicates whether the DMA should issue
an interrupt when it is done.
The DMA device has several internal registers where this information is stored. One register stores
the buffer start address, another stores the buffer size, a third one saves the disk address, and a fourth one
contains all the other items. The operating system starts a DMA operation by sending the six items above
to the registers. This information is called the DMA command. The last part of the command contains a
bit that actually starts the DMA device. Figure 3.4 shows a typical arrangement of the registers.
Figure 3.4: Typical DMA registers (start address, disk address, buffer size, file size, direction,
start/stop bit, interrupt at end, status bit, others)
How is the command sent to the DMA device? This is done by the operating system but without
the need for any special instructions. Each of the DMA registers is assigned a device select (or device id)
number, and the operating system uses register I/O or memory-mapped I/O to send the individual parts of
the command to the DMA registers. Thus, assuming that the individual parts of a DMA command have
already been generated by the program and stored by it in the four registers R1–R4, and also assuming that
the DMA registers have id numbers 6–9, the operating system should execute the following instructions:
OUT R1,6
OUT R2,7
OUT R3,8
OUT R4,9
The last part of the DMA command also contains the bit that starts the DMA device. When the
device starts, it enters a loop where, in each iteration, it causes the transfer of one byte between the I/O
device (the volume) and memory. Note that the start/stop bit acts as an on/off switch. In order to start
the DMA device, the operating system should set the bit. When the DMA device finishes, it clears the bit
automatically. Also the operating system can clear the start/stop bit at any time if it decides to stop the
DMA prematurely.
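The command registers and the start/stop bit can be modeled with a small sketch. Everything specific in it is an assumption made for illustration: the assignment of device ids 6 through 9 to particular registers, the bit positions inside the fourth register, and the sample values loaded into R1 through R4.

```python
from dataclasses import dataclass

START_BIT = 0x01          # hypothetical bit positions inside the fourth register
INTERRUPT_AT_END = 0x02
DIRECTION_INPUT = 0x04

@dataclass
class DMADevice:
    start_address: int = 0
    buffer_size: int = 0
    disk_address: int = 0
    control: int = 0

    def out(self, device_id, value):
        """Model of an OUT instruction addressed to one of the DMA registers."""
        names = {6: "start_address", 7: "buffer_size", 8: "disk_address", 9: "control"}
        setattr(self, names[device_id], value)

    @property
    def running(self):
        return bool(self.control & START_BIT)

    def stop(self):
        """Clear the start/stop bit, as the device (or the OS) does to end a transfer."""
        self.control &= ~START_BIT

dma = DMADevice()
r1, r2, r3, r4 = 0x4000, 10000, 1234, START_BIT | INTERRUPT_AT_END | DIRECTION_INPUT
for device_id, register in zip((6, 7, 8, 9), (r1, r2, r3, r4)):
    dma.out(device_id, register)
print(dma.running)        # True
dma.stop()
print(dma.running)        # False
```

Sending the control register last matters in this model, since writing the start bit is what sets the device running.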
Figure 3.5: A DMA device (the processor, memory, the I/O device, and the DMA device, connected by the
address bus, the data bus, and the R/W line; control lines are numbered 1 through 6)
A complete process of DMA input is shown here as an example. Figure 3.6 shows the details. Data bytes
should be read from the I/O device and written in memory, in consecutive locations. Each time through the
loop, the following steps are executed:
1. The DMA sends the disk address of the current byte to the disk, followed by a “read” signal. It then
waits for the disk to read the byte. The disk may take a relatively long time (measured in milliseconds) until
the byte moves under the disk’s read/write head, so it can be read. When the disk drive has read the byte,
it sends a status of “ready” to DMA, ending the wait. In practice, the disk drive reads an entire block (a
few hundred to a few thousand bytes) and keeps it in a buffer. In the following iteration, the next byte is
simply read from the buffer.
2. DMA checks the status. If the status is “ready”, the DMA stops the processor temporarily by sending
a “hold” signal on line 5 (Figure 3.5). It then proceeds to Step 3. However, if the status is “eof” (the end
of the file has just been sensed), then the entire file has been transferred, the DMA clears its start/stop bit
and interrupts the processor by sending a signal on line 6.
3. DMA starts the memory write operation by sending the current buffer address to the address bus.
Note that the DMA device acts as a memory requestor.
4. The DMA sends a “write” signal to memory and another signal to the I/O device, to place the byte
on the data bus. The DMA enters another wait period (much shorter than the previous one) waiting for
memory to complete the write.
5. Memory now has all the information that it needs. It performs its internal operations and, when
done, sends a feedback signal that is detected by DMA, ending the wait.
6. DMA drops the “hold” signal. It increments the current buffer address and the current disk address.
If the memory buffer has not overflowed, DMA goes to Step 1 and repeats the loop. If the buffer is full, an
overflow situation has been reached. DMA clears its start/stop bit and interrupts the processor by sending
a signal on line 6.
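The six steps can be condensed into a simulation of the loop. The sketch is a simplification under stated assumptions: the disk is modeled as a bytes object, memory as a bytearray, the "hold" and feedback signals are reduced to comments, and the function name dma_input and its return values are hypothetical.

```python
def dma_input(disk, disk_address, memory, buffer_start, buffer_size):
    """Copy bytes from `disk` into `memory`, one per iteration; return (status, count)."""
    buffer_address = buffer_start
    while True:
        # Step 1: ask the disk for the current byte and wait for it.
        byte = disk[disk_address] if disk_address < len(disk) else None
        # Step 2: on end of file, stop and interrupt the processor.
        if byte is None:
            return "eof", buffer_address - buffer_start
        # Steps 3-5: hold the processor, address memory, write the byte.
        memory[buffer_address] = byte
        # Step 6: release the processor and advance both addresses.
        buffer_address += 1
        disk_address += 1
        if buffer_address - buffer_start >= buffer_size:
            return "buffer full", buffer_size

memory = bytearray(16)
print(dma_input(b"DATA", 0, memory, 4, 8), bytes(memory[4:8]))
print(dma_input(b"0123456789", 0, memory, 0, 4), bytes(memory[0:4]))
```

The first call ends with an "eof" status after four bytes; the second ends because the 4-byte buffer fills up before the file is exhausted.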
Note that one of the bits in one of the registers (interrupt on end) can be set to instruct the DMA not
to interrupt. In such a case the processor is not told when the DMA has finished. To find out the status of
the DMA process, the processor should poll the start/stop bit of the DMA often. Also, the processor should
check the status bit of the DMA from time to time to discover any unusual status.
[Figure: DMA signals Device ready, HOLD, and HLDA]
The “done” field is usually 0 and is set to 1 by the channel after it has finished executing the command.
At that point, the channel goes to memory, sets the “done” field of the command just executed, and fetches
the next one. This way, the operating system can test the channel commands in memory at any time, find
out which ones have been executed, and restart the user programs that have originally requested those I/O
transfers.
The “interrupt on end” field tells the channel whether it is supposed to interrupt the processor at the
end of the current channel command.
The “goto” field is used to indicate a jump in the channel program. If there is no more room in the
current buffer for more channel commands, the last command should have just the “goto” field set, and an
address in the “start address” field. Such a command does not indicate any I/O transfer, just a jump in the
channel program to an area where more commands can be found.
The remaining fields have the same meaning as in a DMA command.
Types of channels.
There are two main types of I/O channels, a selector and a multiplexor. A selector channel is connected
to one I/O device and works as explained earlier. A multiplexor channel is connected to several devices
and works by looping over them. It may transfer a byte between device A and memory, then another byte
between device B and memory, etc.
A selector channel is suitable for a high-speed I/O device such as a disk drive. The high speed of
the device may require the entire attention of the channel. A multiplexor channel, on the other hand, is
connected to several slow devices, where it should be fast enough to match the total speed of all the devices.
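The looping behavior of a multiplexor channel can be sketched as a round-robin over its devices. The function multiplexor_channel and the two sample devices A and B are hypothetical; a real channel would also track device status and buffer addresses.

```python
def multiplexor_channel(devices):
    """Loop over the devices, moving one byte per device per pass, until all are drained."""
    memory = {name: bytearray() for name in devices}
    pending = {name: list(data) for name, data in devices.items()}
    order = []                        # which device was served at each transfer
    while any(pending.values()):
        for name, data in pending.items():
            if data:
                memory[name].append(data.pop(0))
                order.append(name)
    return memory, order

memory, order = multiplexor_channel({"A": b"abc", "B": b"xy"})
print(order)                                      # ['A', 'B', 'A', 'B', 'A']
print(bytes(memory["A"]), bytes(memory["B"]))     # b'abc' b'xy'
```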
3.6 I/O Codes
One of the main features of computers, a feature that makes computers extremely useful, is their ability to
deal with nonnumeric quantities. Compiling a program is an example of such a task. The compiler reads a
source program which is made up of strings of text, analyzes each string and translates it. Another example
is word processing. A word processor deals mostly with text and performs relatively few calculations.
Since the computer can only handle binary quantities, any nonnumeric items must first be coded into
binary numbers. This is why I/O codes are so important. Any symbol that we want the computer to input,
process, and output, should have a binary code.
There are many different codes. Some are suitable for general use, others have been developed for
special applications. Some are very common, while others are rarely used. Since standards are useful in any
field, there have been several attempts to standardize I/O codes. Today, most computers use the ASCII
code (Section 3.7), more and more new computers use the new Unicode (Section 3.7), and some old IBM
computers still use the EBCDIC code. Older, obsolete second and third generation computers used other
codes.
An interesting question is: How long should the code of a character be (how many bits per character)?
The answer depends, of course, on the number of characters to be coded (the size of the character set). In
the past, computer printers were limited to just digits, upper case letters, and a few punctuation marks.
Since there are 10 digits and 26 letters (blank space is considered a punctuation mark), a character set of
size 64 was considered sufficient. Such a set can include 28 punctuation marks, in addition to the letters and
digits, and the code size is therefore six bits per character.
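The relation between the size of the character set and the code size is a base-2 logarithm rounded up. A minimal sketch (the function name code_size is chosen here for illustration):

```python
def code_size(characters):
    """Smallest number of bits that gives each character a distinct code."""
    return max(1, (characters - 1).bit_length())

print(code_size(10 + 26 + 28))   # 6
print(code_size(128))            # 7
print(2 ** 6 - (10 + 26))        # 28 codes left for punctuation marks
```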
In the last three decades, however, printers have become much more sophisticated. In addition to the
traditional character set, common laser and inkjet printers can print lower case letters, any symbols, and
artwork. Also, advances in communications have created a need for a special group of characters, called
control characters. As a result, modern computers can handle a larger set of characters. Increasing the
code size to seven bits/character doubles the size of the character set, from 64 to 128. This is a good size,
providing codes for the 10 decimal digits, the upper- and lowercase letters, and up to 66 more characters
octal and in hexadecimal, but not in decimal. Octal numbers are set in italics and are preceded by a quote,
while hexadecimal numbers are set in typewriter type and are preceded by a double-quote. Thus, the code
of ‘A’ is '101₈ or "41₁₆ but it takes some work to convert it to decimal.
Exercise 3.3: Why is it unimportant to know the decimal value of a character code?
The following should be noted about the ASCII codes:
1. The first 32 codes are control characters. These are commands used in input/output and
communications, and have no corresponding graphics, i.e., they cannot be printed out. Note that codes
"20₁₆ and "7F₁₆ are also control characters.
2. The particular codes are arbitrary. The code of A is "41₁₆, but there was no special reason for
assigning that particular value, and almost any other value would have served as well. About the only rule
for assigning codes is that the code of B should follow, numerically, the code of A. Thus, B has the code
"42₁₆, C has "43₁₆, etc. The same is true for the lowercase letters and for the 10 digits.
There is also a simple relationship between the codes of the uppercase and lowercase letters. The code
of a is obtained from the code of A by setting bit 5 (the sixth bit from the right, with weight
"20₁₆ = 32) to 1; thus "41₁₆ becomes "61₁₆.
3. The parity bit in Table 3.8 is always 0. The ASCII code does not specify the value of the parity bit,
and any value can be used. Different computers may therefore use the ASCII code with even parity, odd
parity, or no parity.
4. The code of the control character DEL is all ones (except the parity which is, as usual, unspecified).
This is a tradition from the old days of computing (and also from telegraphy), when paper tape was an
important medium for input/output. When punching information on a paper tape, whenever the user
noticed an error, they would delete the bad character by pressing the DEL key on the keyboard. This worked
by backspacing the tape and punching a frame of all 1’s on top of the bad character. When reading the
tape, the reader would simply skip any frame of all 1’s.
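Points 2 through 4 can be verified with a few lines. The sketch relies on Python's built-in ord and chr, which return the standard ASCII values for these characters; the helper names to_lower and with_parity, and the choice to place the parity bit to the left of the 7-bit code, are assumptions made for illustration.

```python
CASE_BIT = 0x20   # the single bit that differs between 'A' (hex 41) and 'a' (hex 61)

def to_lower(code):
    """Set the case bit of an uppercase letter's 7-bit code."""
    return code | CASE_BIT

def with_parity(code, even=True):
    """Return an 8-bit code: the 7-bit ASCII code plus a parity bit on the left."""
    ones = bin(code & 0x7F).count("1")
    parity = ones % 2 if even else 1 - ones % 2
    return (parity << 7) | (code & 0x7F)

print(hex(ord("A")), hex(to_lower(ord("A"))), chr(to_lower(ord("A"))))  # 0x41 0x61 a
print(ord("B") - ord("A"), ord("C") - ord("B"))                         # 1 1
print(f"{with_parity(ord('A')):08b}")              # 01000001
print(f"{with_parity(ord('A'), even=False):08b}")  # 11000001
print(f"{with_parity(0x7F):08b}")                  # 11111111 (DEL, even parity)
```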
        '0    '1    '2    '3    '4    '5    '6    '7
'00x    NUL   SOH   STX   ETX   EOT   ENQ   ACK   BEL    "0x
'01x    BS    HT    LF    VT    FF    CR    SO    SI     "0x
'02x    DLE   DC1   DC2   DC3   DC4   NAK   SYN   ETB    "1x
'03x    CAN   EM    SUB   ESC   FS    GS    RS    US     "1x
'04x    SP    !     "     #     $     %     &     '      "2x
'05x    (     )     *     +     ,     -     .     /      "2x
'06x    0     1     2     3     4     5     6     7      "3x
'07x    8     9     :     ;     <     =     >     ?      "3x
'10x    @     A     B     C     D     E     F     G      "4x
'11x    H     I     J     K     L     M     N     O      "4x
'12x    P     Q     R     S     T     U     V     W      "5x
'13x    X     Y     Z     [     \     ]     ^     _      "5x
'14x    `     a     b     c     d     e     f     g      "6x
'15x    h     i     j     k     l     m     n     o      "6x
'16x    p     q     r     s     t     u     v     w      "7x
'17x    x     y     z     {     |     }     ~     DEL    "7x
        "8    "9    "A    "B    "C    "D    "E    "F
A related code is the EBCDIC (Extended BCD Interchange Code), shown in Table 3.10. This code
was used on IBM computers and may still be used (as an optional alternative to ASCII) by some old IBM
personal computers. EBCDIC is an 8-bit code, with room for up to 256 characters. However, it assigns codes
to only 107 characters, and there are quite a few unassigned codes. The term BCD (binary coded decimal)
refers to the binary codes of the 10 decimal digits.
EBCDIC was developed by IBM, in the early 1960s, for its 360 computers. However, to increase
compatibility, the 360 later received hardware that enabled it to also use the ASCII code. Because of the
influence of IBM, some of the computers designed in the 1960s and 70s also use the EBCDIC code. Today,
however, the ASCII code is a de facto standard (with Unicode catching on).
A quick look at the EBCDIC control characters shows that they were developed to support punched card
equipment and simple line printers. Missing are all the ASCII control characters used for telecommunications
and for driving high-speed disk drives.
Another important design flaw is the gaps in the middle of the letters (between i and j, r and s) and
the small number of punctuation marks.
Unicode
A new international standard code, the Unicode, has been proposed, and is being developed
by an international Unicode organization (www.unicode.org). Unicode uses 16-bit codes for its
characters, so it provides for 2^16 = 64K = 65,536 codes. (Doubling the size of a code much more
than doubles the number of possible codes.) Unicode includes all the ASCII codes plus codes for
characters in foreign languages (including complete sets of Korean, Japanese and Chinese characters)
and many mathematical and other symbols. Currently about 39,000 out of the 65,536 possible codes
have been assigned, so there is room for adding more symbols in the future.
The Microsoft Windows NT operating system has adopted Unicode, as have also AT&T Plan 9
and Lucent Inferno. See Appendix D for more information.
NUL (Null): No character. Used for filling in space in an I/O device when there are no characters.
SOH (Start of heading): Indicates the start of a heading on an I/O device. The heading may include information pertaining
to the entire record that follows it.
STX (Start of text): Indicates the start of the text block in serial I/O.
ETX (End of text): Indicates the end of a block in serial I/O. Matches a STX.
EOT (End of transmission): Indicates the end of the entire transmission in serial I/O.
ENQ (Enquiry): An enquiry signal typically sent from a computer to an I/O device before the start of an I/O transfer, to verify
that the device is there and is ready to accept or to send data.
ACK (Acknowledge): An affirmative response to an ENQ.
BEL (Bell): Causes the I/O device to ring a bell or to sound a buzzer or an alarm in order to call the operator’s attention.
BS (Backspace): A command to the I/O device to backspace one character. Not every I/O device can respond to BS. A
keyboard is a simple example of an input device that cannot go back to the previous character. Once a new key is pressed, the
keyboard loses the previous one.
HT (Horizontal tab): Sent to an output device to indicate a horizontal movement to the next tab stop.
LF (Line feed): An important control code. Indicates to the output device to move vertically, to the beginning of the next line.
VT (Vertical tab): Commands an output device to move vertically to the next vertical tab stop.
FF (Form feed): Commands the output device to move the output medium vertically to the start of the next page. Some output
devices, such as a tape or a plotter, do not have any pages and for them the FF character is meaningless.
CR (Carriage return): Commands an output device to move horizontally, to the start of the line.
SO (Shift out): Indicates that the character codes that follow (until an SI is sensed), are not in the standard character set.
SI (Shift in): Terminates a non-standard string of text.
DLE (Data link escape): Changes the meaning of the character immediately following it.
DC1–DC4 (Device controls): Special characters for sending commands to I/O devices. Their meaning is not predefined.
NAK (Negative acknowledge): A negative response to an enquiry.
SYN (Synchronous idle): Sent by a synchronous serial transmitter when there is no data to send.
ETB (End transmission block): Indicates the end of a block of data in serial transmission. Is used to divide the data into
blocks.
CAN (Cancel): Tells the receiving device to cancel (disregard) the previously received block because of a transmission error.
EM (End of medium): Sent by an I/O device when it has sensed the end of its medium. The medium can be a tape, paper,
card, or anything else used to record and store information.
SUB (Substitute): This character is substituted by the receiving device, under certain conditions, for a character that has been
received incorrectly (had a bad parity bit).
ESC (Escape): Alters the meaning of the immediately following character. This is used to extend the character set. Thus ESC
followed by an ‘X’ may mean something special to a certain program.
FS (File separator), GS (Group separator), RS (Record separator), US (Unit separator): These four separators have no
predefined meaning in ASCII, except that FS is the most general separator (separates large groups) and US, the least general.
SP (Space): This is the familiar blank or space between words. It is non-printing and is therefore considered a control character
rather than a punctuation mark.
DEL (Delete): This is sent immediately after a bad character has been sent. DEL indicates deleting the preceding character
(see note 4).
and was last used on the Cyber mainframes (that became obsolete in the 1980s) which had a word size of
60 bits.
3.7.2 The Baudot Code
Another old code worth mentioning is the Baudot code (Table 3.12). This is a 5-bit code developed by Emile
Baudot around 1880 for telegraph communication. It became popular and, by 1950, was designated the
International Telegraph Code No. 1. It was used by many first- and second-generation computers. The
code uses 5 bits per character, but encodes more than 32 characters. Each 5-bit code can be the code of
two characters, a letter and a figure. The “letter shift” (LS) and “figure shift” (FS) codes are used to shift
between letters and figures.
Using this technique, the Baudot code can represent 32 × 2 − 2 = 62 characters (each code can have two
meanings except the LS and FS codes). The actual number of characters is, however, less than that since
five of the codes have one meaning each, and some codes are not assigned.
The code does not employ any parity bits and is therefore unreliable. A bad bit can transform a
character into another. In particular, a corrupted bit in a shift character causes a wrong interpretation of
3.8 Information Theory and Algebraic Coding 83
control coding).
Efficiency—efficient encoding of the information (source coding or data compression).
Security—protection against eavesdropping, intrusion, or tampering (cryptography).
The main principles and algorithms of these aspects are dealt with in the sections that follow.
Figure 3.13 shows the stages that a piece of computer data may go through when it is created, trans-
mitted, received, and used at the receiving end.
[Figure 3.13: A Communication System. Transmitter: information source, source encoder (compression), channel
encoder (error-control), modulator. Channel, subject to noise. Receiver: demodulator, channel decoder (correct
errors), source decoder (decompression), use data.]
The information source provides the original data to be transmitted. If this is in analog form, it has
to be digitized before it proceeds to the next step. The source encoder translates the data to an efficient
form by compressing it. The channel encoder adds an error-control (i.e., detection or correction) code to the
data, to make it more robust before it is sent on the noisy channel. The modulator (Section translates the
digital data to a form that can be sent on the channel (usually an electromagnetic wave). The channel itself
may be a wire, a microwave beam, a satellite link, or any other type of hardware that can transmit signals.
After demodulation, channel decoding (error check and correction) and source decoding (decompression),
the received data finally arrives at the user on the receiving end. The source encoder, channel encoder, and
modulator are sometimes called the transmitter.
We denote by p the probability that a 1 bit will change to a 0 during transmission. The probability
that a 1 will remain uncorrupted is, of course, 1 − p. If the probability that a 0 bit will be degraded during
transmission and recognized as 1 is the same p, the channel is called binary symmetric. In such a channel,
errors occur randomly. In cases where errors occur in large groups (bursts), the channel is a burst-error
channel.
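The binary symmetric channel described above can be imitated in a few lines. The following is only a minimal sketch: the function name `binary_symmetric_channel`, the test pattern, and the value p = 0.01 are illustrative assumptions, not part of the notes.

```python
import random

def binary_symmetric_channel(bits, p, rng=None):
    """Return a copy of bits where each bit is flipped independently with probability p."""
    rng = rng or random.Random()
    # A 1 becomes 0 and a 0 becomes 1 with the same probability p (symmetric channel).
    return [b ^ 1 if rng.random() < p else b for b in bits]

if __name__ == "__main__":
    sent = [1, 0, 1, 1, 0, 0, 1, 0] * 1000          # 8000 known test bits (assumed pattern)
    received = binary_symmetric_channel(sent, p=0.01, rng=random.Random(1))
    bad = sum(s != r for s, r in zip(sent, received))
    print(f"{bad} of {len(sent)} bits were corrupted")
```

Counting the bad bits in a long, known test pattern, as done here, gives a rough estimate of p for a channel whose error rate is not known in advance.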
3.9 Error-Detecting and Error-Correcting Codes
Every time information is transmitted, on any channel, it may get corrupted by noise. In fact, even when
information is stored in a storage device, it may become bad, because no piece of hardware is absolutely
reliable. This also applies to non-computer information. Speech sent on the air gets corrupted by noise, wind,
variations in temperature, etc. Speech, in fact, is a good starting point for understanding the principles of
channel coding (error-detecting and error-correcting codes). Imagine a noisy cocktail party where everybody
talks simultaneously, on top of blaring music. We know that even in such a situation, it is possible to carry
on a conversation, except that more attention than usual is needed.
What makes our language so robust, so immune to errors? There are two properties, redundancy and
context.
Our language is redundant because only a very small fraction of all possible words are valid. A huge
number of words can be constructed with the 26 letters of the Latin alphabet. Just the number of 7-letter
words, e.g., is $26^7$ ≈ 8.031 billion. Yet only about 50,000 words are commonly used, and even the Oxford
English Dictionary lists “only” about 500,000 words. When we hear a garbled word, our brain searches
through many similar words for the “closest” valid word. Computers are very good at such searches, which
is why redundancy is the basis of error-detecting and error-correcting codes.
Our brain works by associations. This is why we humans excel at using the context of a message to
repair errors in the message. In receiving a sentence with a garbled word or a word that doesn’t belong, such
as “pass the thustard please”, we first use our memory to find words that are associated with “thustard,”
then we use our accumulated life experience to select, among many possible candidates, the word that best
fits in the present context. If we are driving on the highway, we pass the bastard in front of us; if we are at
dinner, we pass the mustard (or custard). Another example is the (corrupted) written sentence “a∗l n∗tu∗al
l∗∗gua∗es a∗e red∗∗∗ant”, which we can easily complete. Computers don’t have much life experience and are
notoriously bad at such tasks, which is why context is not used in computer codes. In extreme cases, where
much of the sentence is bad, even we may not be able to correct it, and we have to ask for a retransmission
“say it again, Sam.”
The idea of using redundancy to add reliability to information is due to Claude Shannon, the founder of
information theory. It is not an obvious idea, since we are conditioned against it. Most of the time, we try
to eliminate redundancy in computer information, in order to save space. In fact, all the data-compression
methods do just that.
We discuss two approaches to reliable codes. The first one is to duplicate the code, an approach that
leads to the idea of voting codes; the second approach uses check bits and is based on the concept of Hamming
distance.
3.9.1 Voting Codes
Perhaps the first idea that comes to mind, when thinking about redundancy, is to duplicate every bit of
the message. Thus, if the data 1101 has to be transmitted, the bits 11|11|00|11 are sent instead. A little
thinking shows that this results in error detection, but not in error correction. If the receiver receives a
pair of different bits, it cannot tell which bit is correct. This is an example of a receiver failure. A little
thinking may convince the reader that sending each bit in triplicate can lead to error-correction (although
not absolute). We can transmit 111|111|000|111 and tell the receiver to compare the three bits of each triplet.
If all three are identical, the receiver assumes that they are correct. Moreover, if only two are identical and
the third one is different, the receiver assumes that the two identical bits are correct. This is the principle
of voting codes. The receiver (decoder) makes the correct decision when either (1) two of the three bits are
identical and the third one is different and the two identical bits are correct or (2) all three bits are identical
and correct. Similarly, the decoder makes the wrong decision when (1) two of the three bits are identical
and the third one is different and the two identical bits are bad or (2) all three bits are identical and are
bad. Before deciding to use such a code, it is important to compute the probabilities of these cases and try
to estimate their values in practical applications.
If we duplicate each bit an odd number of times, the receiver may sometimes make the wrong decision,
but it can always make a decision. If each bit is duplicated an even number of times, the receiver will fail
(i.e., will not be able to make any decision) in cases where half the copies are 0s and the other half are 1s.
In practice, errors may occur in bursts, so it is preferable to separate the copies of each bit. Instead
of transmitting 111|111|000|111, it is better to transmit 1101 . . . 1101 . . . 1101 . . .. The receiver has first to
identify the three bits of each triplet, then compare them.
A voting code where each bit is duplicated n times is called an (n, 1) voting code.
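The encoding and the majority decision of an (n, 1) voting code can be sketched as follows. This is an illustration under stated assumptions: the names `vote_encode` and `vote_decode` are hypothetical, the copies are sent as n repetitions of the whole message (the separated arrangement 1101 . . . 1101 . . . 1101), and a receiver failure is reported as `None`.

```python
def vote_encode(bits, n=3):
    """Send the whole message n times, so the copies of each bit are spread apart."""
    return list(bits) * n

def vote_decode(received, n=3):
    """Majority vote over the n copies of each bit; None marks a receiver failure (a tie)."""
    m = len(received) // n                      # number of data bits
    decoded = []
    for i in range(m):
        ones = sum(received[i + c * m] for c in range(n))
        if 2 * ones > n:
            decoded.append(1)
        elif 2 * ones < n:
            decoded.append(0)
        else:                                   # possible only for an even n
            decoded.append(None)
    return decoded

if __name__ == "__main__":
    data = [1, 1, 0, 1]
    received = vote_encode(data)                # 1101 1101 1101
    received[1] ^= 1                            # corrupt one copy of the second bit
    print(vote_decode(received))                # the single bad copy is outvoted
```

With n = 2 the same decoder returns `None` for any bit whose two copies disagree, which is the receiver failure mentioned above.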
It is intuitively clear that the reliability of the voting code depends on n and on the quality of the
transmission channel. The latter can be estimated by measuring the probability p that any individual bit
will be corrupted. This can be done by transmitting large quantities of known data through the channel and
counting the number of bad bits received. Such an experiment should be carried out with many millions
of bits and over long time periods, to account for differences in channel reliability between day and night,
summer and winter, intense heat, high humidity, lightning, and so on. Actual values of p for typical channels
used today are in the range $10^{-7}$ to $10^{-9}$, meaning that on average one bit in ten million to one bit in a
billion bits transmitted gets corrupted.
Once p is known, we compute the probability that j bits out of the n bits transmitted will go bad.
Given any j bits of the n bits, the probability of those j bits going bad is $p^j(1-p)^{n-j}$, because we have
to take into account the probability that the remaining n − j bits did not go bad. However, it is possible
to select j bits out of n bits in several ways, and this also has to be taken into account in computing the
probability. The number of ways to select j objects out of any n objects without selecting the same object
more than once is denoted by ${}^nC_j$ and is
$${}^nC_j = \binom{n}{j} = \frac{n!}{j!\,(n-j)!}.$$
We therefore conclude that the probability of any group of j bits out of the n bits getting bad is
$P_j = {}^nC_j\,p^j(1-p)^{n-j}$. Based on this, we can analyze the behavior of voting codes by computing three
basic probabilities. (1) The probability $p_c$ that the receiver will make the correct decision (i.e., will find and
correct an error), (2) the probability $p_e$ that the receiver will make the wrong decision (i.e., will “detect”
and correct a nonexistent error), and (3) the probability $p_f$ that the receiver will fail (i.e., will not be able
to make any decision). We start with the simple case n = 3 where each bit is transmitted three times.
When n = 3, the receiver will make a correct decision when either all three bits remain good or when
one bit got bad. Thus, $p_c$ is the sum $P_0 + P_1$, which equals
${}^3C_0\,p^0(1-p)^{3-0} + {}^3C_1\,p^1(1-p)^{3-1} = (1-p)^3 + 3p(1-p)^2$. Similarly, the receiver will make
the wrong decision when either two of the three bits get bad or when all three have been corrupted.
The probability $p_e$ is therefore the sum $P_2 + P_3$, which equals
${}^3C_2\,p^2(1-p)^{3-2} + {}^3C_3\,p^3(1-p)^{3-3} = 3p^2(1-p) + p^3$. Since n is odd, the receiver will always be able
to make a decision, implying that $p_f = 0$. Any code where $p_f = 0$ is a complete decoding code. Notice that
the sum $p_c + p_e + p_f$ is 1.
As a simple example, we compute $p_c$ and $p_e$ for the (3, 1) voting code for p = 0.01 and for p = 0.001. The
former yields $p_c = (1-0.01)^3 + 3\cdot 0.01(1-0.01)^2 = 0.999702$ and $p_e = 3\cdot 0.01^2(1-0.01) + 0.01^3 = 0.000298$. The
latter yields $p_c = (1-0.001)^3 + 3\cdot 0.001(1-0.001)^2 = 0.999997$ and $p_e = 3\cdot 0.001^2(1-0.001) + 0.001^3 = 0.000003$.
This shows that the simple (3, 1) voting code features excellent behavior even for large bit failure rates.
The (3, 1) voting code can correct up to one bad bit. Similarly, the (5, 1) voting code can correct up
to two bad bits. In general, the (n, 1) voting code can correct $\lfloor (n-1)/2 \rfloor$ bits (in future, the “$\lfloor$” and “$\rfloor$”
will be omitted). Such a code is simple to generate and to decode, and it provides high reliability, but at
a price; an (n, 1) code is very long and can be used only in applications where the length of the code (and
consequently, the transmission time) is unimportant. It is easy to compute the probabilities of correct and
incorrect decisions made by the general (n, 1) voting code. The code will make a correct decision if the
number of bad bits is 0, 1, 2,. . . ,(n − 1)/2. The probability $p_c$ is therefore
$$p_c = P_0 + P_1 + P_2 + \cdots + P_{(n-1)/2} = \sum_{j=0}^{(n-1)/2} {}^nC_j\,p^j(1-p)^{n-j}.$$
Similarly, the (n, 1) voting code makes the wrong decision when the number of bad bits is (n + 1)/2 or
greater. Thus, the value of $p_e$ is
$$p_e = P_{(n+1)/2} + P_{(n+3)/2} + P_{(n+5)/2} + \cdots + P_n = \sum_{j=(n+1)/2}^{n} {}^nC_j\,p^j(1-p)^{n-j}.$$
Notice that $p_f = 0$ for an odd n because the only case where the receiver cannot make a decision is when
half the n bits are bad, and this requires an even n. Table 3.14 lists the values of $p_e$ for five odd values of n.
The listed values agree with a bit failure rate of p = 0.01.
Values of the redundancy (or the code rate) R are also included. This measure is the ratio k/n, i.e., the
number of data bits in the (n, 1) code (i.e., 1) divided by the total number n of bits.
 n     $p_e$                     R
 3     $3.0 \times 10^{-4}$      0.33
 5     $9.9 \times 10^{-6}$      0.20
 7     $3.4 \times 10^{-7}$      0.14
 9     $1.2 \times 10^{-8}$      0.11
11     $4.4 \times 10^{-10}$     0.09
Table 3.14: Probabilities of Wrong Decisions for Voting Codes.
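The three probabilities can be evaluated directly from the binomial terms. The sketch below is an illustration only: the function names are hypothetical, and the value p = 0.01 is an assumption, chosen because it appears to reproduce the values of Table 3.14.

```python
from math import comb

def p_bad(n, j, p):
    """Probability that exactly j of the n transmitted copies go bad."""
    return comb(n, j) * p**j * (1 - p)**(n - j)

def voting_probabilities(n, p):
    """Return (pc, pe, pf) for the (n, 1) voting code with bit failure rate p."""
    # Correct decision: fewer than half of the copies are bad.
    pc = sum(p_bad(n, j, p) for j in range(0, (n + 1) // 2))
    # Wrong decision: more than half of the copies are bad.
    pe = sum(p_bad(n, j, p) for j in range(n // 2 + 1, n + 1))
    # Failure: exactly half are bad, which requires an even n.
    pf = p_bad(n, n // 2, p) if n % 2 == 0 else 0.0
    return pc, pe, pf

if __name__ == "__main__":
    p = 0.01                                    # assumed bit failure rate
    for n in (3, 5, 7, 9, 11):
        pc, pe, pf = voting_probabilities(n, p)
        print(f"{n:2d}  pe = {pe:.1e}  R = {1 / n:.2f}")
```

The same function handles an even n, in which case the third value returned is the nonzero probability of a receiver failure.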
Exercise 3.4: Compute pe for the (7, 1) voting code assuming a bit failure rate of p = 0.01.
Exercise 3.5: Calculate the probability pf of failure for the (6, 1) voting code assuming that p = 0.01.
3.9.2 Check Bits
In practice, error detection and correction is done by means of check bits that are added to the original
information bits of each word of the message. In general, k check bits are appended to the original m
information bits, to produce a codeword of n = m + k bits. Such a code is referred to as an (n, m) code. The
codeword is then transmitted to the receiver. Only certain combinations of the information bits and check
bits are valid, in analogy with a natural language. The receiver knows what the valid codewords are. If a
nonvalid codeword is received, the receiver considers it an error. Section 3.9.7 shows that by adding more
check bits, the receiver can also correct certain errors, not just detect them. The principle of error correction
is that, on receiving a bad codeword, the receiver selects the valid codeword that is the “closest” to it.
Example: A set of 128 symbols needs to be coded. This implies m = 7. If we select k = 4, we end up
with 128 valid codewords, each 11 bits long. This is an (11, 7) code. The valid codewords are selected from
a total of $2^{11}$ = 2048 possible codewords, so there remain 2048 − 128 = 1920 nonvalid codewords. The big
difference between the number of valid (128) and nonvalid (1920) codewords means that, if a codeword gets
corrupted, chances are that it will change to a nonvalid one.
It may, of course, happen that a valid codeword gets changed, during transmission, to another valid
codeword. Thus, our codes are not completely reliable, but can be made more and more reliable by adding
more check bits and by selecting the valid codewords carefully. One of the basic theorems of information
theory says that codes can be made as reliable as desired by adding check bits, as long as n (the size of a
codeword) does not exceed the channel’s capacity.
It is important to understand the meaning of the word “error” in data transmission. When an n-bit
codeword is sent and received, the receiver always receives n bits, but some of them may be bad. A bad bit
does not disappear, nor does it change into something other than a bit. A bad bit simply changes its value,
either from 0 to 1, or from 1 to 0. This makes it relatively easy to correct the bit. The code should tell the
receiver which bits are bad, and the receiver can then easily correct the bits by inverting them.
In practice, bits may be sent on a wire as voltages. A binary 0 may, e.g., be represented by any voltage
in the range 3–25 volts. A binary 1 may similarly be represented by the voltage range of −25v to −3v.
Such voltages tend to drop over long lines, and have to be amplified periodically. In the telephone network
there is an amplifier (a repeater) every 20 miles or so. It looks at every bit received, decides if it is a 0 or
a 1 by measuring the voltage, and sends it to the next repeater as a clean, fresh pulse. If the voltage has
deteriorated enough in passage, the repeater may make a wrong decision when sensing it, which introduces
an error into the transmission. At present, typical transmission lines have error rates of about one in a
billion but, under extreme conditions—such as in a lightning storm, or when the electric power suddenly
fluctuates—the error rate may suddenly increase, creating a burst of errors.
3.9.3 Parity Bits
A parity bit can be added to a group of m information bits to complete the total number of 1 bits to an odd
number. Thus the (odd) parity of the group 10110 is 0, since the original group plus the parity bit has an
odd number (3) of 1’s. It is also possible to use even parity, and the only difference between odd and even
parity is that, in the case of even parity, a group of all zeros is valid, whereas, with odd parity, any group of
bits with a parity bit added, cannot be all zeros.
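Computing a parity bit amounts to counting 1 bits. A minimal sketch follows; the function name `parity_bit` and its `odd` flag are illustrative assumptions.

```python
def parity_bit(bits, odd=True):
    """Return the bit that makes the total number of 1s odd (or even if odd=False)."""
    ones = sum(bits)
    if odd:
        return 0 if ones % 2 == 1 else 1
    return ones % 2

if __name__ == "__main__":
    group = [1, 0, 1, 1, 0]                     # the group 10110 used in the text
    print(parity_bit(group))                    # odd parity
    print(parity_bit(group, odd=False))         # even parity
```

For the group 10110, which already has three 1s, the odd parity bit is 0 and the even parity bit is 1.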
Parity bits can be used to design simple, but not very efficient, error-correcting codes. To correct 1-bit
errors, the message can be organized as a rectangle of dimensions (r − 1) × (s − 1). A parity bit is added to
each row of s − 1 bits, and to each column of r − 1 bits. The total size of the message (Table 3.15) is now
s × r.
0 1 0 0 1
1 0 1 0 0
0 1 1 1 1
0 0 0 0 0
1 1 0 1 1
0 1 0 0 1
Table 3.15: Rectangular arrangement.

0 1 0 0 1
1 0 1 0
0 1 0
0 0
1
Table 3.16: Triangular arrangement.
If only one bit becomes bad, a check of all s − 1 + r − 1 parity bits will discover it, since only one of the
s − 1 parities and only one of the r − 1 ones will be bad.
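The way a single bad bit is located by its row and its column can be sketched as follows. The function names are hypothetical, even parity is an assumption, and the information bits are those that appear in the rectangular table above.

```python
def add_parities(data):
    """Append an even-parity bit to every row, then an even-parity row for the columns."""
    rows = [row + [sum(row) % 2] for row in data]
    rows.append([sum(col) % 2 for col in zip(*rows)])
    return rows

def correct_single_error(block):
    """Find the one row and the one column with bad parity and invert the bit they share."""
    bad_rows = [i for i, row in enumerate(block) if sum(row) % 2]
    bad_cols = [j for j, col in enumerate(zip(*block)) if sum(col) % 2]
    if not bad_rows and not bad_cols:
        return None                             # all parities are good
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        block[i][j] ^= 1                        # correcting a bit means inverting it
        return (i, j)
    raise ValueError("more than one bad bit; this code cannot correct it")

if __name__ == "__main__":
    data = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 1]]
    block = add_parities(data)
    block[2][1] ^= 1                            # simulate a transmission error
    print(correct_single_error(block))          # row and column of the repaired bit
```

The sketch also checks the parity of the parity row and parity column themselves, so a bad parity bit is repaired in the same way as a bad information bit.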
The overhead of a code is defined as the number of parity bits divided by the number of information
bits. The overhead of the rectangular code is, therefore,
$$\frac{(s-1)+(r-1)}{(s-1)(r-1)} \approx \frac{s+r}{s \times r - (s+r)}.$$
A similar, slightly more efficient, code is a triangular configuration, where the information bits are
arranged in a triangle, with the parity bits placed on the diagonal (Table 3.16). Each parity bit is the parity
of all the bits in its row and column. If the top row contains r information bits, the entire triangle has
r(r + 1)/2 information bits and r parity bits. The overhead is thus
$$\frac{r}{r(r+1)/2} = \frac{2}{r+1}.$$
It is also possible to arrange the information bits in a number of two-dimensional planes, to obtain a
three-dimensional cube, three of whose six outer surfaces consist of parity bits.
It is not obvious how to generalize these methods to more than 1-bit error correction.
[Figure 3.17: Cubes of dimensions 1 through 4 (labeled 1D, 2D, 3D, and 4D) whose corners are numbered with
1-, 2-, 3-, and 4-bit numbers.]
These definitions have a simple geometric interpretation (for the mathematical readers). Imagine a
hypercube in n-dimensional space. Each of its $2^n$ corners can be numbered by an n-bit number (Figure 3.17),
such that each of the n bits corresponds to one of the n dimensions. In such a cube, points that are directly
connected (near neighbors) have a Hamming distance of 1, points with a common neighbor have a Hamming
distance of 2, etc. If a code with a Hamming distance of 2 is desired, only points that are not directly
connected should be selected as valid codewords.
The reason code2 can detect all single-bit errors is that it has a Hamming distance of 2. The distance
between valid codewords is 2, so a one-bit error always changes a valid codeword into a nonvalid one. When
two bits go bad, a valid codeword is moved to another codeword at distance 2. If we want that other
codeword to be nonvalid, the code must have at least distance 3.
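The Hamming distance of a code is the smallest distance between any two of its valid codewords, and it is easy to compute by brute force. The sketch below uses two small illustrative codes that are assumptions of this example (they are not necessarily the codes named code1 through code5 in these notes).

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of bit positions in which two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def code_distance(codewords):
    """Smallest Hamming distance between any two distinct valid codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))

if __name__ == "__main__":
    even_parity_code = ["000", "011", "101", "110"]   # 2 information bits + 1 parity bit
    triplicate_code = ["000", "111"]                  # the (3, 1) voting code
    print(code_distance(even_parity_code))            # distance 2: detects 1-bit errors
    print(code_distance(triplicate_code))             # distance 3: detects 2-bit errors
```

In the cube picture, the four codewords of the parity code are corners of the three-dimensional cube no two of which are directly connected.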
In general, a code with a Hamming distance of d+1 can detect all d-bit errors. In comparison, code3 has
a Hamming distance of 2 and can therefore detect all 1-bit errors even though it is short (n = 3). Similarly,
code4 has a Hamming distance of 4, which is more than enough to detect all 2-bit errors. It is obvious now
that we can increase the reliability of our transmissions, but this feature does not come free. As always,
there is a tradeoff, or a price to pay, in the form of the overhead. Our codes are much longer than m bits
per symbol because of the added check bits. A measure of the price is $n/m = (m+k)/m = 1 + k/m$, where the
quantity k/m is called the overhead of the code. In the case of code1 the overhead is 2, and in the case of
code3 it is 3/2.
Example: A code with a single check bit, that is a parity bit (even or odd). Any single-bit error can
easily be detected since it creates a nonvalid codeword. Such a code therefore has a Hamming distance of 2.
Notice that code3 uses a single, odd, parity bit.
Example: A 2-bit error-detecting code for the same 4 symbols. It must have a Hamming distance of
at least 3, and one way of generating it is to duplicate code3 (this creates code4 with a distance of 4).
$b_{1+2+4}$ and is, therefore, used in determining check bits $b_1$, $b_2$, and $b_4$. The check bits are simply parity bits.
The value of $b_2$, for example, is the parity (odd or even) of $b_3$, $b_6$, $b_7$, $b_{10}$, . . . .
Example: A 1-bit error-correcting code for the set of symbols A, B, C, D. It must have a Hamming
distance of 3. Two information bits are needed to code the four symbols, so they must be $b_3$ and $b_5$. The
parity bits are therefore $b_1$, $b_2$, and $b_4$. Since 3 = 1 + 2 and 5 = 1 + 4, the 3 parity bits are defined as: $b_1$ is
the parity of bits $b_3$, $b_5$; $b_2$ is the parity of $b_3$; and $b_4$ is the parity of $b_5$. This is how code5 was constructed.
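The construction just described (parity bits in positions 1, 2, 4, . . . and information bits in the remaining positions) can be sketched in code. This is an illustration under assumptions: the function names are hypothetical and even parity is used, whereas the text allows either odd or even parity.

```python
def hamming_encode(data_bits):
    """Return the codeword b1..bn with even-parity bits in positions 1, 2, 4, ..."""
    m = len(data_bits)
    k = 0
    while (1 << k) < m + k + 1:                 # smallest k with 1 + n <= 2^k, n = m + k
        k += 1
    n = m + k
    code = [0] * (n + 1)                        # index 0 is unused, so code[i] is b_i
    source = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):                     # position is not a power of 2
            code[pos] = next(source)
    for i in range(k):
        check = 1 << i
        # b_check is the parity of all b_pos whose position contains check in its sum.
        code[check] = sum(code[pos] for pos in range(1, n + 1)
                          if pos & check and pos != check) % 2
    return code[1:]

def hamming_correct(codeword):
    """Return (corrected codeword, position of the inverted bit or 0 if none)."""
    code = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos                     # failing checks add up to the bad position
    if syndrome >= len(code):
        raise ValueError("more than one bad bit")
    if syndrome:
        code[syndrome] ^= 1
    return code[1:], syndrome

if __name__ == "__main__":
    word = hamming_encode([1, 0])               # information bits b3 = 1, b5 = 0
    damaged = word.copy()
    damaged[4] ^= 1                             # corrupt b5
    print(word, hamming_correct(damaged))
```

For two information bits the encoder produces a 5-bit codeword with parity bits at positions 1, 2, and 4, the same layout as in the example above.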
Example: A 1-bit error-correcting code for a set of 256 symbols. It must have a Hamming distance of
3. Eight information bits are required to code the 256 symbols, so they must be $b_3$, $b_5$, $b_6$, $b_7$, $b_9$, $b_{10}$, $b_{11}$,
and $b_{12}$. The parity bits are, therefore, $b_1$, $b_2$, $b_4$, and $b_8$. The total size of the code is 12 bits. The following
relations define the four parity bits:
3 = 1 + 2, 5 = 1 + 4, 6 = 2 + 4, 7 = 1 + 2 + 4, 9 = 1 + 8, 10 = 2 + 8, 11 = 1 + 2 + 8, and 12 = 4 + 8.
This implies that $b_1$ is the parity of $b_3$, $b_5$, $b_7$, $b_9$, and $b_{11}$. The definitions of the other parity bits are
left as an exercise.
Exercise 3.6: Construct a 1-bit error-correcting Hamming code for 16-bit codes (m = 16).
What is the size of a general Hamming code? The case of a 1-bit error-correcting code is easy to analyze.
Given a set of $2^m$ symbols, $2^m$ valid codewords are needed, each n bits long. The $2^m$ valid codewords should,
therefore, be selected from a total of $2^n$ numbers. Each codeword consists of m information bits and k check
bits, where the value of m is given, and we want to know the minimum value of k.
Since we want any single-bit error in a codeword to be corrected, such an error should not take us too
far from the original codeword. A single-bit error takes us to a codeword at distance 1 from the original one.
As a result, all codewords at distance 1 from the original codeword should be nonvalid. Each of the original
$2^m$ codewords should thus have the n codewords at distance 1 from it, declared nonvalid. This means that
the total number of codewords (valid plus nonvalid) is $2^m + n2^m = (1 + n)2^m$. This number has to be
selected from the $2^n$ available numbers, so we end up with the relation $(1 + n)2^m \le 2^n$. Since $2^n = 2^{m+k}$,
we get $1 + n \le 2^k$ or $k \ge \log_2(1 + n)$. The following table illustrates the meaning of this relation for certain
values of m.
n        :  5    7    12   21   38   71
k        :  3    3    4    5    6    7
m = n − k:  2    4    8    16   32   64
k/m      :  1.5  .75  .5   .31  .19  .11
The geometric interpretation provides another way of obtaining the same result. We imagine $2^m$ spheres
of radius one tightly packed in our n-dimensional cube. Each sphere is centered around one of the corners,
and encompasses all its immediate neighbors. The volume of a sphere is defined as the number of corners
it includes, which is 1 + n. The spheres are tightly packed but they don’t overlap, so their total volume is
$(1 + n)2^m$, and this should not exceed the total volume of the cube, which is $2^n$.
The case of a 2-bit error-correcting code is similarly analysed. Each valid codeword should define a set
including itself, the n codewords at distance 1 from it, and the set of $\binom{n}{2}$ codewords at distance 2 from it,
a total of $\binom{n}{0} + \binom{n}{1} + \binom{n}{2} = 1 + n + n(n-1)/2$. Those sets should be non-overlapping, which implies the
relation
$$\bigl(1 + n + n(n-1)/2\bigr)2^m \le 2^n \;\Rightarrow\; 1 + n + n(n-1)/2 \le 2^k \;\Rightarrow\; k \ge \log_2\bigl(1 + n + n(n-1)/2\bigr).$$
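Both relations can be solved for the smallest k by simple trial, since n = m + k appears on both sides. The sketch below is illustrative: the name `min_check_bits` and its parameter `t` (the number of errors to be corrected) are assumptions. Note that the relation is a necessary condition; for t = 1 the Hamming code attains it, but for t = 2 the value returned is only a lower bound on k.

```python
from math import comb

def min_check_bits(m, t=1):
    """Smallest k for which spheres of radius t around 2^m codewords fit in 2^(m+k) corners."""
    k = 0
    while True:
        n = m + k
        volume = sum(comb(n, j) for j in range(t + 1))   # 1 + n (+ n(n-1)/2 for t = 2)
        if volume <= 2 ** k:                    # same as volume * 2^m <= 2^n
            return k
        k += 1

if __name__ == "__main__":
    for m in (2, 4, 8, 16, 32, 64):
        k = min_check_bits(m)
        print(f"m = {m:2d}  k = {k}  n = {m + k:2d}  k/m = {k / m:.2f}")
    print("2-bit correction, m = 8: k >=", min_check_bits(8, t=2))
```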
It may happen, of course, that three or even five bits are bad, but the simple SEC-DED code cannot detect
such errors.
If the single parity is good, then there are either no errors, or two bits are wrong. The receiver proceeds
to step 2, where it uses the other parity bits to distinguish between these two cases. Again, there could be
four or six bad bits, but this code cannot handle them.
The SEC-DED code has a Hamming distance of 4. In general, a code for c-bit error correction and d-bit
error detection, should have a distance of c + d + 1.
The following discussion provides better insight into the behavior of the Hamming code. For short
codewords, the number of parity bits is large compared to the size of the codeword. When a 3-bit code, such
as xyz, is converted to SEC-DED, it becomes the 6-bit codeword $x\,y\,b_4\,z\,b_2\,b_1$, which doubles its size. Doubling
the size of a code seems a high price to pay in order to gain only single-error correction and double-error
detection. On the other hand, when a 500-bit code is converted to SEC-DED, only the 9 parity bits $b_{256}$,
$b_{128}$, $b_{64}$, $b_{32}$, $b_{16}$, $b_8$, $b_4$, $b_2$, and $b_1$, are added, resulting in a 509-bit codeword. The size of the code is increased
by 1.8% only. Such a small increase in size seems a good price to pay for even a small amount of increased
reliability. With even larger codes, the benefits of SEC-DED can be obtained with even a smaller increase
in the code size. There is clearly something wrong here, since we cannot expect to get the same benefits by
paying less and less. There must be some tradeoff. Something must be lost when a long code is converted
to SEC-DED.
What is lost is reliability. The SEC-DED code can detect 2-bit errors and correct 1-bit errors in the
short, 6-bit codeword and also in the long, 509-bit codeword. However, there may be more errors in 509 bits
than in 6 bits. The difference in the corrective power of SEC-DED is easier to understand if we divide the
500-bit code into 100 5-bit segments and convert each individually into a SEC-DED code by adding four
parity bits. The result is a string of 100(5 + 4) = 900 bits. Now assume that 50 bits become bad. In the
case of a single, 509-bit long string, the SEC-DED code may be able to detect an error but will not be able
to correct any bits. In the case of 100 9-bit strings, the 50 bad bits constitute 5.6% of the string size. This
implies that many 9-bit strings will not suffer any damage, many will have just 1 or 2 bad bits (cases which
the SEC-DED code can handle), and only a few will have more than 2 bad bits.
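The comparison can be made quantitative under one simplifying assumption that is not in the text: that the bad bits are independent, each bit being bad with a probability equal to the fraction of bad bits (50/900 for the short strings, 50/509 for the long one). The function name `prob_at_most` is hypothetical.

```python
from math import comb

def prob_at_most(n, p, max_bad):
    """Probability that an n-bit string has at most max_bad bad bits (independent errors)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(max_bad + 1))

if __name__ == "__main__":
    # 100 short strings of 9 bits each, 50 bad bits out of 900 (assumed independent).
    short_ok = prob_at_most(9, 50 / 900, 2)
    # One long 509-bit string with 50 bad bits, modeled with the same kind of assumption.
    long_ok = prob_at_most(509, 50 / 509, 2)
    print(f"9-bit string has at most 2 bad bits:   {short_ok:.4f}")
    print(f"509-bit string has at most 2 bad bits: {long_ok:.2e}")
```

Under this model almost all of the short strings stay within what SEC-DED can handle, while the long string practically never does.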
The conclusion is that the SEC-DED code provides limited reliability and should be applied only to
reasonably short strings.
3.9.8 Periodic Codes
No single error-correcting code provides absolute reliability in every situation. This is why each application
should ideally have a special reliable code designed for it, based on what is known about possible data
corruption. One common example of errors is a burst. Imagine a reliable cable through which data is
constantly transmitted with high reliability. Suddenly a storm approaches and lightning strikes nearby. This
may cause a burst of errors that will corrupt hundreds or thousands of bits in a small region of the data.
Another example is a CD. Data is recorded on a CD in a spiral path that starts on the inside and spirals
outside. When the CD is cleaned carelessly, it may get scratched. If the cleaning is done in a circular motion,
parallel to the spiral, even a small scratch may cover (and therefore corrupt) many consecutive bits (which
is why a CD should be cleaned with radial motions, in a straight line from the inside toward the rim).
The SEC-DED, or any similar code, cannot deal with a burst of errors, so different codes must be
designed for applications where such bursts may occur. One way to design such a code is to embed parity
bits among the data bits, such that each parity bit is the parity of several, nonadjacent data bits. Such a
code is called periodic.
As a simple example of a periodic code, imagine a string of bits divided into 5-bit segments. Each bit
is identified by the segment number and by its number (1 through 5) within the segment. Each segment is
followed by one parity bit. Thus, the data with the parity bits becomes the string
$$b_{1,1}, b_{1,2}, b_{1,3}, b_{1,4}, b_{1,5}, p_1,\; b_{2,1}, b_{2,2}, b_{2,3}, b_{2,4}, b_{2,5}, p_2,\; \ldots,\; b_{i,1}, b_{i,2}, b_{i,3}, b_{i,4}, b_{i,5}, p_i,\; \ldots.$$
We now define each parity bit $p_i$ as the (odd) parity of the five bits $b_{i,1}$, $b_{i-1,2}$, $b_{i-2,3}$, $b_{i-3,4}$, and $b_{i-4,5}$.
Thus, each parity bit protects five bits in five different segments, so a burst confined to one segment affects
up to five parity bits, one in its segment and four in the four segments that follow. If all the data bits of
segment j become corrupted, it will be reflected in the five parity bits $p_j$, $p_{j+1}$, $p_{j+2}$, $p_{j+3}$, and $p_{j+4}$.
In general, errors will be detected if every two bursts are at least five segments apart.
This simple code cannot correct any errors, but the principle of periodic codes can be extended to allow
for as much error correction as necessary.
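The periodic code of this section can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, segments are numbered from 0 rather than from 1, and bits of segments that precede the first one are treated as 0s.

```python
import random

def periodic_parities(segments):
    """p_i = odd parity of bit 1 of segment i, bit 2 of segment i-1, ..., bit 5 of segment i-4."""
    parities = []
    for i in range(len(segments)):
        ones = 0
        for back in range(5):
            if i - back >= 0:                   # segments before the first count as 0s
                ones += segments[i - back][back]
        parities.append(0 if ones % 2 else 1)   # odd parity
    return parities

if __name__ == "__main__":
    rng = random.Random(0)
    segments = [[rng.randint(0, 1) for _ in range(5)] for _ in range(12)]
    good = periodic_parities(segments)
    damaged = [seg[:] for seg in segments]
    damaged[3] = [b ^ 1 for b in damaged[3]]    # a burst wipes out one whole segment
    bad = periodic_parities(damaged)
    print([i for i in range(12) if good[i] != bad[i]])   # [3, 4, 5, 6, 7]
```

Each of the five corrupted bits falls under a different parity bit, which is why the burst shows up in five consecutive parities.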
3.9.9 A Different Approach
This section describes error correction in a CD (see Appendix C).
It is obvious that reading a CD-ROM must be error free, but error correction is also important in an
audio CD, because one bad bit can cause a big difference in the note played. Consider the two 16-bit numbers
0000000000000000 and 1000000000000000. The first represents silence and the second, a loud sound. Yet
they differ by one bit only! The size of a typical dust particle is 40μm, enough to cover more than 20 laps
of the track, and cause several bursts of errors (Figure C.1b). Without extensive error correction, the music
would sound as one long scratch.
Any error correction method used in a CD must be very sophisticated, since the errors may come in
bursts, or may be individual. The use of parity bits makes it possible to correct individual errors, but not
a burst of consecutive errors. This is why interleaving is used, in addition to parity bits. The principle of
interleaving is to rearrange the samples before recording them on the CD, and to reconstruct them after they
have been read. This way, a burst of errors during the read is translated to individual errors (Figure C.1a),
that can then be corrected by their parity bits.
The actual code used in CDs is called the Cross Interleaved Reed-Solomon Code (CIRC). It is based on the codes
developed by Irving S. Reed and Gustave Solomon at MIT Lincoln Laboratory in 1960, and is a powerful code. One version of
this code can correct up to 4000 consecutive bit errors, which means that even a scratch as long as three
millimeters can be tolerated on a CD. The principle of CIRC is to use a geometric pattern that is so familiar
that it can be reconstructed even if large parts of it are missing. It’s like being able to recognize the shape
of a rectangular chunk of cheese after a mouse has nibbled away large parts of it.
Suppose that the data consists of the two numbers 3.6 and 5.9. We consider them the y coordinates
of two-dimensional points and we assign them x coordinates of 1 and 2, respectively. We thus end up with
the points (1, 3.6) and (2, 5.9). We consider those points the endpoints of a line and we calculate four more
points on this line, with x coordinates of 3, 4, 5, and 6. They are (3, 8.2), (4, 10.5), (5, 12.8), and (6, 15.1).
Since the x coordinates are so regular, we only need to store the y coordinates of these points. We thus store
(or write on the CD) the six numbers 3.6, 5.9, 8.2, 10.5, 12.8, and 15.1.
Now suppose that two errors occur among those six numbers. When the new sequence of six numbers
is checked for the straight line property, the remaining four numbers can be identified as being collinear and
can still be used to reconstruct the line. Once this is done, the two bad numbers can be corrected, since their
x coordinates are known. Even three bad numbers out of those six can be corrected since the remaining
three numbers would still be enough to identify the original straight line.
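The straight-line idea can be illustrated with a small Python sketch. It is only a toy model of the principle, not the CIRC algorithm itself: the function name and the tolerance are assumptions, and the line is found by brute force as the one that passes through the largest number of stored points.

```python
from itertools import combinations


def correct_line(ys, tol=1e-9):
    """Repair a list of y values that should lie on one straight line.

    The x coordinates are assumed to be 1, 2, 3, ... as in the text.
    """
    points = list(enumerate(ys, start=1))
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        # Count how many of the stored points agree with this candidate line.
        agree = sum(abs(slope * x + intercept - y) <= tol for x, y in points)
        if best is None or agree > best[0]:
            best = (agree, slope, intercept)
    _, slope, intercept = best
    return [round(slope * x + intercept, 6) for x, _ in points]


stored = [3.6, 5.9, 8.2, 10.5, 12.8, 15.1]
damaged = [3.6, 5.9, 0.0, 10.5, 99.9, 15.1]  # two bad numbers
print(correct_line(damaged))
```

With two bad numbers, the four good ones still agree on one line, so the two bad values are replaced by the values the line predicts at their x coordinates.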
It is even more reliable to start with three numbers a, b, and c, to convert them to the points (1, a),
(2, b), and (3, c), and to calculate the (unique) parabola that passes through these points. Four more points,
with x coordinates of 4, 5, 6, and 7, can then be calculated on this parabola. Once the seven points are
known, they provide a strong pattern. Even if three of the seven get corrupted, the remaining four can be
used to reconstruct the parabola and correct the three bad ones. However, if four of the seven get corrupted,
then no four numbers will be on a parabola (and any group of three will define a different parabola). Such a
code can correct three errors in a group of seven numbers, but it requires high redundancy (seven numbers
instead of three).
3.9.10 Generating Polynomials
There are many approaches to the problem of developing codes for more than 1-bit error correction. They
are, however, more complicated than Hamming’s method, and require a background in group theory and
Galois (finite) fields. One such approach, using the concept of a generating polynomial, is briefly sketched
here.
We consider the case m = 4. Sixteen codewords are needed, that can be used to code any set of 16
symbols. We already know that three parity bits are needed for 1-bit correction, thereby bringing the total
size of the code to n = 7. Here is an example of such a code:
The sum (modulo 2) of any two codewords equals another codeword. This implies that the sum of
any number of codewords is a codeword. Thus, the 16 codewords above form a group under this operation.
(Addition and subtraction modulo-2 is done by 0 + 0 = 1 + 1 = 0, 0 + 1 = 1 + 0 = 1, 0 − 1 = 1. The definition
of a group should be reviewed in any text on algebra.)
Any circular shift of a codeword is another codeword. Thus, this code is cyclic.
It has a Hamming distance of 3, as required for 1-bit correction.
Interesting properties! The sixteen codewords were selected from the 128 possible ones by means of
a generator polynomial. The idea is to look at each codeword as a polynomial, where the bits are the
coefficients. Here are some 7-bit codewords and the polynomials (of degree 6 or less) associated with them.
1001111: $x^6 + x^3 + x^2 + x + 1$
0110010: $x^5 + x^4 + x$
0100111: $x^5 + x^2 + x + 1$
The 16 codewords in the table above were selected by finding the polynomials of degree 6 or less that are evenly
divisible (modulo 2) by the generating polynomial $x^3 + x + 1$. For example, the third codeword ‘0100111’
in the table corresponds to the polynomial $x^5 + x^2 + x + 1$, which is divisible by $x^3 + x + 1$, because
$x^5 + x^2 + x + 1 = (x^3 + x + 1)(x^2 + 1)$.
To understand how such polynomials can be calculated, let’s consider similar operations on numbers.
Suppose that we want to know the largest multiple of 7 that’s less than or equal to 30. We divide 30 by 7,
obtaining a remainder of 2, and then subtract the 2 from the 30, getting 28. Polynomials are divided in a
similar way. Let’s start with the four information bits 0010, and calculate the remaining three parity bits.
We write 0010ppp which gives us the polynomial $x^4$. We divide $x^4$ by the generating polynomial, obtaining
a remainder of $x^2 + x$. Subtracting that remainder from $x^4$ gives us something that will be evenly divisible
by the generating polynomial. The result of the subtraction is $x^4 + x^2 + x$ (modulo 2, subtraction is the same as addition), so the complete codeword is
0010110.
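The division just described is easy to try in Python. The sketch below is an illustration under stated assumptions: polynomials are stored as integers whose bits are the coefficients, and the names `GEN`, `poly_mod`, and `encode` are hypothetical. It computes all 16 codewords for the generating polynomial $x^3 + x + 1$ and checks the three properties listed earlier.

```python
GEN = 0b1011  # x^3 + x + 1, one bit per coefficient


def poly_mod(dividend, divisor):
    """Remainder of polynomial division modulo 2 (subtraction is XOR)."""
    width = divisor.bit_length()
    while dividend.bit_length() >= width:
        dividend ^= divisor << (dividend.bit_length() - width)
    return dividend


def encode(info):
    """Append three parity bits to four information bits."""
    shifted = info << 3  # the pattern iiii000, i.e. iiiippp with p = 0
    return shifted ^ poly_mod(shifted, GEN)


codewords = [encode(i) for i in range(16)]
print(format(encode(0b0010), "07b"))  # 0010110, as computed in the text


def rotate(c):
    """Circular left shift of a 7-bit codeword."""
    return ((c << 1) & 0b1111111) | (c >> 6)


code_set = set(codewords)
is_group = all(a ^ b in code_set for a in codewords for b in codewords)
is_cyclic = all(rotate(c) in code_set for c in codewords)
distance = min(bin(a ^ b).count("1")
               for a in codewords for b in codewords if a != b)
print(is_group, is_cyclic, distance)  # True True 3
```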
Any generating polynomial can get us the first two properties. To get the third property (the necessary
Hamming distance), the right generating polynomial should be used, and it can be selected by examining
its roots (see [Lin 70]).
A common example of a generating polynomial is $\mathrm{CRC}(x) = x^{16} + x^{12} + x^5 + 1$. When dividing a large
polynomial by CRC(x), the remainder is a polynomial of degree 15 or less, which corresponds to a 16-bit codeword.
There are standard polynomials used to calculate the CRC (cyclic redundancy codes) at the end of blocks
of transmitted data (see page 140).
Other generating polynomials are $\mathrm{CRC}_{12}(x) = x^{12} + x^3 + x + 1$ and $\mathrm{CRC}_{16}(x) = x^{16} + x^{15} + x^2 + 1$. They
generate the common CRC-12 and CRC-16 codes, which have lengths of 12 and 16 bits, respectively.
3.10 Data Compression
Data compression (officially called source coding) is a relatively new field, but it is an active one, with many
researchers engaged in developing new approaches, testing new techniques, and implementing software.
Quite a few compression methods currently exist. Some are suitable for the compression of text, while
others have been developed for compressing images, video, or audio data. The main aim of this section is
to introduce the reader to the main techniques and principles used to compress data. It turns out that the
many compression methods in existence are based on just a small number of principles, the most important
of which are variable-size codes, dictionaries, quantization, and transforms. These approaches are illustrated
in the following sections, together with a few examples of specific compression methods.
Before we can look at the details of specific compression methods, we should answer the basic question,
“How can data be compressed? How can we take a piece of data that’s represented by n bits and represent
it by fewer than n bits?” The answer is: Because data that’s of interest to us contains redundancies. Such
data is not random and contains various patterns. By identifying the patterns and removing them, we can
hope to reduce the number of bits required to represent the data. Any data compression algorithm must
therefore examine the data to find the patterns (redundancies) in it and eliminate them. Data that does not
have any redundancies is random and cannot be compressed.
The following simple argument illustrates the essence of the statement “Data compression is achieved
by reducing or removing redundancy in the data.” The argument shows that most data files cannot be
compressed, no matter what compression method is used. This seems strange at first because we compress
our data files all the time. The point is that most files cannot be compressed because they are random or
close to random and therefore have no redundancy. The (relatively) few files that can be compressed are the
ones that we want to compress; they are the files we use all the time. They have redundancy, are nonrandom
and therefore useful and interesting.
Given two different files A and B that are compressed to files C and D, respectively, it is clear that C
and D must be different. If they were identical, there would be no way to decompress them and get back
file A or file B.
Suppose that a file of size n bits is given and we want to compress it efficiently. Any compression
method that can compress this file to, say, 10 bits would be welcome. Even compressing it to 11 bits or 12
bits would be great. We therefore (somewhat arbitrarily) assume that compressing such a file to half its size
or better is considered good compression. There are $2^n$ n-bit files and they would have to be compressed
into $2^n$ different files of sizes less than or equal to n/2. However, the total number of these files is
$$N = 1 + 2 + 4 + \cdots + 2^{n/2} = 2^{1+n/2} - 1 \approx 2^{1+n/2},$$
so only N of the $2^n$ original files have a chance of being compressed efficiently. The problem is that N is
much smaller than $2^n$. Here are two examples of the ratio between these two numbers.
For n = 100 (files with just 100 bits), the total number of files is $2^{100}$ and the number of files that can be
compressed efficiently is $2^{51}$. The ratio of these numbers is the ridiculously small fraction $2^{-49} \approx 1.78 \cdot 10^{-15}$.
For n = 1000 (files with just 1000 bits, about 125 bytes), the total number of files is $2^{1000}$ and the
number of files that can be compressed efficiently is $2^{501}$. The ratio of these numbers is the incredibly small
fraction $2^{-499} \approx 6.11 \cdot 10^{-151}$.
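The counting argument can be checked with exact integer arithmetic. The following Python sketch (the function name is hypothetical) counts the files of at most n/2 bits, including the empty file, and divides by the number of n-bit files.

```python
from fractions import Fraction


def compressible_fraction(n):
    """Fraction of n-bit files that can map to distinct files of at most n/2 bits."""
    half = n // 2
    shorter = 2 ** (half + 1) - 1  # 1 + 2 + 4 + ... + 2**half
    return Fraction(shorter, 2 ** n)


for n in (100, 1000):
    print(n, float(compressible_fraction(n)))

# For a file of a few thousand bytes the exact fraction still exists as a
# rational number, but its floating-point value is expected to underflow.
print(float(compressible_fraction(8 * 4000)))
```

Keeping the result as a `Fraction` avoids the loss of the value in floating point; only the final conversion to `float` is affected by the limited exponent range.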
Exercise 3.7: Assuming that compressing a file down to 90% of its size or less is still considered good
compression, compute the fraction of n-bit files that can be compressed well for n = 100 and n = 1000.
Most files of interest are at least some thousands of bytes long. For such files, the percentage of files
that can be efficiently compressed is so small that it cannot be computed with floating-point numbers even
on a supercomputer (the result is zero).
It is therefore clear that no compression method can hope to compress all files or even a significant
percentage of them. In order to compress a data file, the compression algorithm has to examine the data,
find redundancies in it, and try to remove them. Since the redundancies in data depend on the type of data
(text, images, sound, etc.), any compression method has to be developed for a specific type of data and
works best on this type. There is no such thing as a universal, efficient data compression algorithm.
3.11 Variable-Size Codes
So far we have discussed reliable codes. The principle behind those codes is increased redundancy, and this
always results in codes that are longer than strictly necessary. In this section we are interested in short
codes. We start with the following simple question: Given a set of symbols to be coded, what is the shortest
possible set of codes for these symbols? In the case of four symbols, it seems that the shortest code should
have a size of m = 2 bits per symbol. However, code7 of Table 3.16 is different. It illustrates the principle
of variable-size codes.
Codes of different sizes have been assigned, in code7 , to the four symbols, so now we are concerned with
the average size of the code. The average size of this code is clearly between one and three bits per symbol.
This kind of code makes sense if its average size turns out to be less than 2. To calculate the average size,
we need to examine a typical message that consists of these symbols, such as
BBBCABBCDCCBBABBCBDC.
These 20 symbols are easy to code, and the result is the 35-bit string:
The average size is 2.65 bit/char., and again we may ask, is this the shortest possible size for the given
frequencies?
The Huffman method is illustrated in Figure 3.18 using the five characters A, B, C, D, and E. We
assume that they occur with frequencies of 10%, 15%, 30%, 20% and 25%, respectively.
The two lowest-frequency characters are A and B (10% and 15%). Their codes should therefore be the
longest, and we start by tentatively assigning them the codes 0 . . . and 1 . . . where the “. . .” stands for more
bits to be assigned later. In our character set, we temporarily replace the two characters A and B with the
single character x, and assign x a frequency that’s the sum (25%) of the frequencies of A and B.
Our character set now consists of the four symbols D (20%), E (25%), x (25%), and C (30%), and we
repeat the previous step. We select any two characters with the lowest frequencies, say, D and x, assign
them the codes 0 . . . and 1 . . ., and replace both temporarily with a new character y whose frequency is the
sum 20 + 25 = 45%. Note that we could have selected D and E instead of D and x. The Huffman code, like
any other variable-length code, is not unique.
[Figure 3.18: Huffman code construction, Steps 1 through 4, and the final code tree with root T.]
In the third step we select E and C and replace them with z whose frequency is 25 + 30 = 55%. In the
fourth and final step, we select the only remaining symbols, namely z and y, assign them the codes 0 and
1, respectively, and replace them with the symbol T , whose frequency is 55 + 45 = 100%. Symbol T stands
for the entire character set.
In the general case of a set of N symbols, this loop is repeated N − 1 times. Following the loop, we
construct a binary tree with T as the root, z and y as the two sons of T , E as the left son of z, and so on.
Comparing our tentative code assignments to the final tree shows that each left branch corresponds to an
assignment of a 0, and each right branch, to a 1. The final codes are therefore:
A = 110, B = 111, C = 01, D = 10, E = 00,
and the average code length is 0.10 × 3 + 0.15 × 3 + 0.30 × 2 + 0.20 × 2 + 0.25 × 2 = 2.25 bits/char.
The Huffman algorithm is simple to implement, and it can be shown that it produces a set of codes
with the minimum average size.
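The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production coder: the function name is hypothetical, and ties between equal frequencies are broken by order of creation, so the individual codes may differ from the ones listed above (here D is paired with E rather than with x), while the average length is the same.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Build a Huffman code by repeatedly merging the two rarest entries."""
    serial = count()  # tie-breaker, so the heap never compares dictionaries
    heap = [(f, next(serial), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, zero_side = heapq.heappop(heap)  # lowest frequency
        f1, _, one_side = heapq.heappop(heap)   # second lowest
        merged = {s: "0" + c for s, c in zero_side.items()}
        merged.update({s: "1" + c for s, c in one_side.items()})
        heapq.heappush(heap, (f0 + f1, next(serial), merged))
    return heap[0][2]


freqs = {"A": 10, "B": 15, "C": 30, "D": 20, "E": 25}  # percentages
codes = huffman_codes(freqs)
average = sum(freqs[s] * len(codes[s]) for s in freqs) / sum(freqs.values())
print(codes)
print(average)  # 2.25
```

For N symbols the `while` loop runs N − 1 times, once per merge, as stated in the text.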
The most obvious disadvantage of variable-length codes is their vulnerability to errors. To achieve
minimum size we have omitted parity bits and, even worse, we use the prefix property to decode those codes.
As a result, an error in a single bit can cause the receiver to lose synchronization and be unable to decode
the rest of the message. In the worst case, the receiver may even read, decode, and interpret the rest of the
transmission wrong, without realizing that a problem has occurred.
Example: Using the code above, the string CEDBCE is coded into: “01 00 10 111 01 00” (without the
spaces). Assuming the following error: “01 00 10 111 00 00” the receiver will not notice any problem, but
the fifth character decoded will be E instead of C.
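The example can be replayed with a short prefix-code decoder in Python. The code table below is read off the example (C = 01, E = 00, D = 10, B = 111); the code for A (110) is inferred from the tree and is an assumption of this sketch, as are the function names.

```python
CODES = {"A": "110", "B": "111", "C": "01", "D": "10", "E": "00"}
INVERSE = {bits: sym for sym, bits in CODES.items()}


def prefix_encode(text):
    return "".join(CODES[ch] for ch in text)


def prefix_decode(bits):
    """Decode bit by bit; returns the text and any undecodable leftover bits."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in INVERSE:  # the prefix property makes this unambiguous
            out.append(INVERSE[current])
            current = ""
    return "".join(out), current


sent = prefix_encode("CEDBCE")
received = "0100101110000"  # same as sent, with one bit flipped
print(sent)                     # 0100101110100
print(prefix_decode(sent))      # ('CEDBCE', '')
print(prefix_decode(received))  # ('CEDBEE', '')
```

The decoder reports no leftover bits in the damaged case, so nothing tells the receiver that the fifth character is wrong.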
Exercise 3.8: What will happen in the case ‘01 00 11 111 01 00’ ?
A simple way of adding reliability to variable length codes is to break a transmission into groups of 7
bits and add a parity bit to each group. This way the receiver will at least be able to detect a problem and
ask for a retransmission.
3.12.1 Adaptive Huffman Coding
Up until now we have assumed that the compressor (encoder) knows the frequencies of occurrence of all the
symbols in the message being compressed. In practice, this rarely happens, and practical Huffman coders
can operate in one of three ways as follows:
1. A fixed set of Huffman codes. Both encoder and decoder use a fixed, built-in set of Huffman codes.
This set is prepared once by selecting a set of representative texts and counting the frequencies of all the
symbols in this set. We say that this set is used to “train” the algorithm. For English text, this set of
“training” documents can be the complete works of Shakespeare.
2. A two-pass compression job. The encoder reads the input data twice. In the first pass it counts
symbol frequencies and in the second pass it does the actual compression. In between the passes, the encoder
computes a set of Huffman codes based on the frequencies counted in pass 1. This approach is conceptually
simple and produces good compression, but it is too slow to be practical, because the input data has to be
read from a disk, which is much slower than the processor. An encoder using this method has to write the
set of Huffman codes at the start of the compressed file, since otherwise the decoder would not be able to
decode the data.
3. An adaptive algorithm. The encoder reads the input data once and in this single pass it counts symbol
frequencies and compresses the data. While data is being input and compressed, new Huffman codes are
computed and assigned all the time, based on the symbol frequencies counted so far. An adaptive algorithm
must be designed such that the decoder would be able to mimic the operations of the encoder at any point.
We present here a short description of an adaptive Huffman algorithm. The main idea is for the
compressor and the decompressor to start with an empty Huffman tree, and to modify it as more and
more characters are being read and processed (for the compressor, “processed” means compressed; for the
decompressor, it means decompressed). The compressor and decompressor
should modify the tree in the same way, so at any point in the process they should use the same codes,
although those codes may change from step to step.
Initially, the compressor starts with an empty Huffman tree. No characters have any codes assigned.
The first character read is simply written on the compressed file in its uncompressed form. The character is
then added to the tree, and a code assigned to it. The next time this character is encountered, its current
code is written on the file, and its frequency incremented by one. Since the tree has been modified, it is
checked to see if it is still a Huffman tree (best codes). If not, it is rearranged, thereby changing the codes.
The decompressor follows the same steps. When it reads the uncompressed form of a character, it adds
it to the tree and assigns it a code. When it reads a compressed (variable-size) code, it uses the current tree
to determine what character it is, and it updates the tree in the same way as the compressor.
The only subtle point is that the decompressor needs to know whether it is reading an uncompressed
character (normally an 8-bit ASCII code) or a variable-size code. To remove any ambiguity, each uncom-
pressed character is preceded by a special variable-size code. When the decompressor reads this code, it
knows that the next 8 bits are the ASCII code of a character which appears in the file for the first time.
The trouble is that the special code should not be any of the variable-size codes used for the characters.
Since these codes are being changed all the time, the special code should also be changed. A natural way
to do this is to add an empty leaf to the tree, a leaf with a zero frequency of occurrence, which is always
assigned to the 0-branch of the tree. Since the leaf is in the tree it gets a variable-size code assigned. This
code is the special code preceding every uncompressed character. As the tree is being modified, the position
of the empty leaf (and also its code) change, but the code is always used to identify uncompressed characters
in the compressed file.
This method is used to compress/decompress data in the V.32 protocol for 14400 baud modems (Sec-
tion 3.22.7).
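The adaptive scheme can be sketched in Python under a strong simplification: instead of rearranging the tree incrementally, this sketch rebuilds the whole set of codes from the current counts before every symbol, which is slow but keeps the encoder and decoder in step. All names are hypothetical, characters are assumed to fit in 8 bits, and `ESC` stands for the empty zero-frequency leaf whose code precedes every uncompressed character.

```python
import heapq

ESC = None  # the empty leaf (frequency 0)


def build_codes(counts):
    """Deterministic Huffman codes for the symbols seen so far, plus ESC."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    serial = len(heap)  # tie-breaker for merged nodes
    while len(heap) > 1:
        f0, _, low = heapq.heappop(heap)
        f1, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (f0 + f1, serial, merged))
        serial += 1
    return heap[0][2]


def adaptive_encode(text):
    counts, out = {ESC: 0}, []
    for ch in text:
        codes = build_codes(counts)
        if ch in counts:
            out.append(codes[ch])  # known character: current code
            counts[ch] += 1
        else:
            # New character: escape code, then the raw 8-bit value.
            out.append(codes[ESC] + format(ord(ch), "08b"))
            counts[ch] = 1
    return "".join(out)


def adaptive_decode(bits):
    counts, out, pos = {ESC: 0}, [], 0
    while pos < len(bits):
        inverse = {c: s for s, c in build_codes(counts).items()}
        current = ""
        while current not in inverse:
            current += bits[pos]
            pos += 1
        sym = inverse[current]
        if sym is ESC:  # the next 8 bits are an uncompressed character
            sym = chr(int(bits[pos:pos + 8], 2))
            pos += 8
            counts[sym] = 1
        else:
            counts[sym] += 1
        out.append(sym)
    return "".join(out)


message = "sir sid eastman easily teases sea sick seals"
packed = adaptive_encode(message)
print(len(packed), "bits instead of", 8 * len(message))
print(adaptive_decode(packed) == message)  # True
```

When the tree holds only the empty leaf, its code is the empty string, so the very first character is written in raw form with nothing in front of it, as described above.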
3.13 Facsimile Compression
So far, we have assumed that the data to be compressed consists of a set of N symbols, which is true for text.
This section shows how Huffman codes can be used to compress simple images, images sent between facsimile
(fax) machines. Those machines are made by many manufacturers, so a standard compression method was
needed when they became popular. Several such methods were developed and proposed by the ITU-T.
The ITU-T is one of four permanent parts of the International Telecommunications Union (ITU),
based in Geneva, Switzerland (http://www.itu.ch/). It issues recommendations for standards applying
to modems, packet switched interfaces, V.24 connectors, etc. Although it has no power of enforcement, the
standards it recommends are generally accepted and adopted by industry. Until March 1993, the ITU-T was
known as the Consultative Committee for International Telephone and Telegraph (CCITT).
The first data compression standards developed by the ITU-T were T2 (also known as Group 1) and
T3 (group 2). They are now obsolete and have been replaced by T4 (group 3) and T6 (group 4). Group 3 is
currently used by all fax machines designed to operate with the Public Switched Telephone Network (PSTN).
These are the machines we have at home, and they operate at maximum speeds of 9600 baud. Group 4 is
used by fax machines designed to operate on a digital network, such as ISDN. They have typical speeds of
64K baud. Both methods can produce compression ratios of 10:1 or better, reducing the transmission time
of a typical page to about a minute with the former, and a few seconds with the latter.
A fax machine scans a document line by line, converting each line to small black and white dots called
pels (from Picture ELement). The horizontal resolution is always 8.05 pels per millimeter (about 205 pels
per inch). An 8.5-inch wide scan line is thus converted to 1728 pels. The T4 standard, though, recommends
scanning only about 8.2 inches, thus producing 1664 pels per scan line (these numbers, as well as the ones
in the next paragraph, are all to within ±1% accuracy).
The vertical resolution is either 3.85 scan lines per millimeter (standard mode) or 7.7 lines/mm (fine
mode). Many fax machines have also a very-fine mode, where they scan 15.4 lines/mm. Table 3.19 assumes
a 10-inch high page (254 mm) and shows the total number of pels per page, and typical transmission times
for the three modes without compression. The times are long, which shows how important data compression
is in fax transmissions.
Ten inches equal 254 mm. The number of pels is in the millions and the transmission times, at 9600 baud without
compression, are between 3 and 11 minutes, depending on the mode. However, if the page is shorter than 10 inches,
or if most of it is white, the compression ratio can be 10:1 or better, resulting in transmission times of between 17
and 68 seconds.
scan lines   pels per line   pels per page   time (sec)   time (min)
   978           1664           1.627M          170          2.82
  1956           1664           3.255M          339          5.65
  3912           1664           6.510M          678         11.3
Table 3.20: Fax transmission times
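The entries of the table follow from the resolutions quoted above, and can be recomputed with a few lines of Python. The sketch assumes one bit per pel and one bit per baud; the constant and function names are hypothetical.

```python
BAUD = 9600           # bits per second (assuming one bit per baud)
PELS_PER_LINE = 1664  # the scan-line width recommended by T4
PAGE_MM = 254         # a 10-inch-high page
MODES = {"standard": 3.85, "fine": 7.7, "very fine": 15.4}  # scan lines per mm


def page_stats(lines_per_mm):
    """Return (scan lines, pels per page, uncompressed seconds) for one page."""
    scan_lines = round(lines_per_mm * PAGE_MM)
    pels = scan_lines * PELS_PER_LINE
    return scan_lines, pels, pels / BAUD


for mode, resolution in MODES.items():
    lines, pels, seconds = page_stats(resolution)
    print(f"{mode:10s} {lines:5d} {pels / 1e6:6.3f}M "
          f"{seconds:4.0f} s {seconds / 60:5.2f} min")
```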
The group 3 compression method is based on the simple concept of run lengths. When scanning a row
of pels in an image, we can expect to find runs of pels of the same color. In general, if we point to a pel
at random and find that it is white, chances are that its immediate neighbors will also be white, and the
same is true for black pels. The group 3 method starts by counting the lengths of runs of identical pels in
each scan line of the image. An image with long runs yields a small number of run lengths, so writing them
on a file produces compression of the original image. The problem is that a run length can be any number
between 1 and the length of a scan line (1664 pels), so the algorithm should specify a compact way to write
numbers of various sizes on a file.
The group 3 method solves this problem by assigning Huffman codes to the various run lengths. The
codes are written on the compressed file, and they should be selected such that the shortest codes should be
assigned to the most common run lengths.
To develop the group 3 code, the ITU-T selected a set of eight representative “training” documents and
analysed the run lengths of white and black pels on these documents. The Huffman algorithm was then used
100 3. Input/Output
to assign a variable-size code to each run length. The most common run lengths were found to be 2, 3, and
4 black pixels, so they were assigned the shortest codes (Table 3.20). Next came run lengths of 2 to 7 white
pixels. They were assigned slightly longer codes. Most run lengths were rare, and were assigned long, 12-bit
codes.
Table 3.21: Some Group 3 and 4 fax codes, (a) termination codes, (b) make-up codes
Exercise 3.9: A run length of 1664 white pels was assigned the short code 011000. Why is this length so
common?
Since run lengths can be long, the Huffman algorithm was modified. Codes were assigned to run lengths
of 1 to 63 pels (they are the termination codes of Table 3.20a) and to run lengths that are multiples of 64
pels (the make-up codes of Table 3.20b). Group 3 is thus a modified Huffman code (also called MH). The
code of a run length is either a single termination code (if the run length is short) or one or more make-up
codes, followed by one termination code (if the run length is long). Here are some examples:
1. A run length of 12 white pels is coded as 001000.
2. A run length of 76 white pels (=64+12) is coded as 11011|001000 (without the vertical bar).
3. A run length of 140 white pels (=128+12) is coded as 10010|001000.
4. A run length of 64 black pels (=64+0) is coded as 0000001111|0000110111.
5. A run length of 2561 black pels (2560+1) is coded as 000000011111|010.
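The split of a run length into make-up and termination parts can be sketched in Python. The code tables below contain only the codes quoted in the five examples above, not the full T4 tables, and the function names and the limit of 2560 for the largest make-up code are assumptions of the sketch.

```python
# Only the codes quoted in the examples; the real tables are much larger.
TERMINATION = {
    "white": {12: "001000"},
    "black": {0: "0000110111", 1: "010"},
}
MAKE_UP = {
    "white": {64: "11011", 128: "10010"},
    "black": {64: "0000001111", 2560: "000000011111"},
}


def split_run(length, max_make_up=2560):
    """Split a run length into make-up lengths (multiples of 64) and a remainder."""
    make_ups = []
    while length >= 64:
        part = min(64 * (length // 64), max_make_up)
        make_ups.append(part)
        length -= part
    return make_ups, length  # the remainder (0 to 63) gets a termination code


def encode_run(color, length):
    make_ups, rest = split_run(length)
    bits = [MAKE_UP[color][m] for m in make_ups]
    bits.append(TERMINATION[color][rest])
    return "".join(bits)


print(split_run(76))              # ([64], 12)
print(encode_run("white", 76))    # 11011001000
print(encode_run("black", 2561))  # 000000011111010
```

A run that is an exact multiple of 64, such as the 64 black pels of example 4, still ends with a termination code, the one for a run length of zero.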
Exercise 3.10: An 8.5-inch-wide scan line results in 1728 pels, so how can there be a run of 2561 consecutive
pels?
Each scan line is coded separately, and its code is terminated with the special EOL code 00000000001.
Each line also gets one white pel appended to it on the left when it is scanned. This is done to remove any
ambiguity when the line is decoded on the receiving side. After reading the EOL for the previous line, the
receiver assumes that the new line starts with a run of white pels, and it ignores the first of them. Examples:
1. The 14-pel line is coded as the run lengths 1w 3b 2w 2b 7w EOL,
which becomes the binary string “000111|10|0111|11|1111|0000000001”. The decoder ignores the single white
pel at the start.
2. The line is coded as the run lengths 3w 5b 5w 2b EOL, which
becomes the binary string “1000|0011|1100|11|0000000001”. The decoder starts the line with two white pels.
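The two examples can be reproduced with a small Python sketch. It uses only the codes that appear in the example strings (including the EOL string exactly as written there), represents a scan line as a string of 'w' and 'b' characters, and assumes a 14-pel line for the second example; all names are hypothetical.

```python
from itertools import groupby

# Codes copied from the two example strings above; not the full tables.
WHITE = {1: "000111", 2: "0111", 3: "1000", 5: "1100", 7: "1111"}
BLACK = {2: "11", 3: "10", 5: "0011"}
EOL = "0000000001"  # as it appears in the example strings


def run_lengths(line):
    """Prepend one white pel, then return (color, length) runs."""
    padded = "w" + line
    return [(color, len(list(group))) for color, group in groupby(padded)]


def encode_line(line):
    tables = {"w": WHITE, "b": BLACK}
    codes = [tables[color][length] for color, length in run_lengths(line)]
    return "".join(codes) + EOL


line1 = "bbbwwbbwwwwwww"  # 3b 2w 2b 7w
line2 = "wwbbbbbwwwwwbb"  # 2w 5b 5w 2b (line length assumed)
print(run_lengths(line1))  # [('w', 1), ('b', 3), ('w', 2), ('b', 2), ('w', 7)]
print(encode_line(line1))
print(encode_line(line2))
```

Because of the prepended pel, every coded line starts with a white run, even when the scanned line itself starts with black.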
Exercise 3.11: The group 3 code for a run length of five black pels (0011) is also the prefix of the codes
for run lengths of 61, 62, and 63 white pels. Explain this.
The group 3 code has no error correction, but many errors can be detected. Because of the nature of the
Huffman code, even one bad bit in the transmission can cause the receiver to get out of synchronization, and
to produce a string of wrong pels. This is why each scan line is encoded separately. If the receiver detects
an error, it skips bits, looking for an EOL. This way, at most one scan line can be received incorrectly. If the
receiver does not see an EOL after a certain number of lines, it assumes a high error rate, and it aborts the
process, notifying the transmitter. Since the codes are between 2 and 12 bits long, the receiver can detect
an error if it cannot decode a valid code after reading 12 bits.
Each page of the coded document is preceded by one EOL and is followed by six EOL codes. Because
each line is coded separately, this method is a one-dimensional coding scheme. The compression ratio
depends on the image. Images with large contiguous black or white areas (text or black and white diagrams)
will highly compress. Images with many short runs can sometimes produce negative compression. This is
especially true in the case of images with shades of gray (mostly photographs). Such shades are produced
by halftoning, which covers areas with alternating black and white pels (runs of length 1).
Exercise 3.12: What is the compression ratio for runs of length one (many alternating pels)?
The T4 standard also allows for fill bits to be inserted between the data bits and the EOL. This is done
in cases where a pause is necessary, or where the total number of bits transmitted for a scan line must be a
multiple of 8. The fill bits are zeros.
Example: The binary string “000111|10|0111|11|1111|0000000001” becomes
“000111|10|0111|11|1111|0000|0000000001” after four zeros are added as fill bits, bringing the total length of
the string to 32 bits (= 8 × 4). The decoder sees the four zeros of the fill, followed by the nine zeros of the
EOL, followed by the single 1, so it knows that it has encountered a fill followed by an EOL.
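The padding step of this example can be written as a short Python function. This is a sketch with a hypothetical name; it handles only the multiple-of-8 case, not fill bits inserted to create a pause, and it takes the EOL string as a parameter.

```python
def add_fill_bits(line_bits, eol):
    """Insert zero fill bits before the EOL so the length is a multiple of 8."""
    if not line_bits.endswith(eol):
        raise ValueError("coded line must end with the EOL code")
    data = line_bits[:-len(eol)]
    fill = -len(line_bits) % 8  # number of zeros still missing
    return data + "0" * fill + eol


eol = "0000000001"  # as written in the example
coded = "000111" + "10" + "0111" + "11" + "1111" + eol
padded = add_fill_bits(coded, eol)
print(len(coded), len(padded))  # 28 32
```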
The group 3 method is described in http://www.cis.ohio-state.edu/htbin/rfc/rfc804.html. At
the time of writing, the T.4 and T.6 recommendations can also be found as files 7_3_01.ps.gz and
7_3_02.ps.gz at src.doc.ic.ac.uk/computing/ccitt/ccitt-standards/1988/.
The group 3 method is called one-dimensional because it encodes each scan line of the image separately.
For images with gray areas, this type of coding does not produce good compression, which is why the group
4 method was developed. The group 4 method is a two-dimensional compression method. It compresses a
scan line by recording the differences between it and its predecessor. Obviously, the first scan line has to
be compressed as with group 3. The method is efficient since consecutive scan lines are normally very similar
(remember that there are about 200 of them per inch). However, a transmission error in one scan line will
cause that line and all its successors to be bad. Therefore, two-dimensional compression reverts to one-
dimensional compression from time to time, compresses one scan line independently of its predecessor, then
switches back to two-dimensional coding.
The encoder scans the search buffer backwards (from right to left) for a match to the first symbol ‘e’ in
the look-ahead buffer. It finds it at the ‘e’ of the word ‘easily’. This ‘e’ is at a distance (offset) of 8 from
the end of the search buffer. The encoder then matches as many symbols following the two ‘e’s as possible.
Three symbols ‘eas’ match in this case, so the length of the match is 3. The encoder then continues the
backward scan, trying to find longer matches. In our case, there is one more match, at the word ‘eastman’,
with offset 16, and it has the same length. The encoder selects the longest match or, if they are all the same
length, the last one found, and prepares the token (16, 3, ‘e’).
Selecting the last match, rather than the first one, simplifies the program, since it only has to keep track
of the last match found. It is interesting to note that selecting the first match, while making the program
somewhat more complex, also has an advantage. It selects the smallest offset. It would seem that this is
not an advantage, since a token should have room for the largest possible offset. However, it is possible to
follow LZ77 with Huffman coding of the tokens, where small offsets are assigned shorter codes. This method
is called LZH. Many small offsets implies a smaller output file in LZH.
In general, a token has three parts, offset, length and next symbol in the look-ahead buffer (which, in
our case, is the second ‘e’ of the word ‘teases’). This token is written on the output file, and the window
is shifted to the right (or, alternatively, the input data is moved to the left) four positions, 3 positions for
the matched string and one position for the next symbol.
...sir␣sid␣eastman␣easily␣tease|s␣sea␣sick␣seals......
If the backward search yields no match, a token with zero offset and length, and with the unmatched
symbol is written. This is also the reason why a token has to have a third component. Such tokens are
common at the beginning of the compression job, when the search buffer is empty. The first five steps in
encoding our example are
|sir␣sid␣eastman␣ ⇒ (0,0,‘s’)
s|ir␣sid␣eastman␣e ⇒ (0,0,‘i’)
si|r␣sid␣eastman␣ea ⇒ (0,0,‘r’)
sir|␣sid␣eastman␣eas ⇒ (0,0,‘␣’)
sir␣|sid␣eastman␣easi ⇒ (4,2,‘d’)
Exercise 3.13: What are the next two steps?
Clearly, a token of the form (0,0,. . . ), which encodes a single symbol, does not provide good compression.
It is easy to estimate its length. The size of the offset is ⌈log₂ S⌉ bits, where S is the length of the search buffer.
In practice, the search buffer may be a few thousand bytes long, so the offset size is typically 10–12 bits. The
size of the ‘length’ field is, similarly, ⌈log₂(L − 1)⌉ bits, where L is the length of the look-ahead buffer (see below
for the −1). In practice the look-ahead buffer is only a few tens of bytes long, so the size of the ‘length’ field
is just a few bits. The size of the ‘symbol’ field is typically 8 bits but, in general, it is ⌈log₂ A⌉ bits, where A is the
alphabet size. The total size of the 1-symbol token (0,0,. . . ) may typically be 11 + 5 + 8 = 24 bits, much
longer than the raw 8-bit size of the (single) symbol it encodes.
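The estimate above can be checked with a few lines of Python. This is only a sketch: the function name `token_bits` and the buffer sizes 2048 and 32 are assumptions chosen to reproduce the 11 + 5 + 8 split quoted in the text, and each logarithm is rounded up to a whole number of bits.

```python
import math

def token_bits(search_size, lookahead_size, alphabet_size):
    """Bits needed for one (offset, length, symbol) token."""
    offset_bits = math.ceil(math.log2(search_size))
    # A match can be at most lookahead_size - 1 symbols long.
    length_bits = math.ceil(math.log2(lookahead_size - 1))
    symbol_bits = math.ceil(math.log2(alphabet_size))
    return offset_bits + length_bits + symbol_bits

# Hypothetical sizes: 2048-byte search buffer, 32-byte look-ahead buffer,
# 256-symbol alphabet.
print(token_bits(2048, 32, 256))  # 24
```

With these sizes a token that carries a single unmatched symbol costs three times the 8 bits of the raw symbol.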
Here is an example showing why the length of a match may exceed its offset, so that the match extends from
the search buffer into the look-ahead buffer.
...Mr.␣alf␣eastman␣easily␣grows␣alf|alfa␣in␣his␣garden...
The first letter ‘a’ in the look-ahead buffer matches the five ‘a’s in the search buffer. It seems that the
two extreme ‘a’s match with a length of 3 and the encoder should select the last (leftmost) of them and create
the token (28, 3, ‘a’). In fact it creates the token (3, 4, ‘␣’). The 4-letter string ‘alfa’ in the look-ahead
buffer is matched to the last three letters ‘alf’ in the search buffer and the first letter ‘a’ in the look-ahead
buffer. The reason for this is that the decoder can handle such a token naturally, without any modifications.
It starts at position 3 of its search buffer and copies the next four letters, one by one, extending its buffer
to the right. The first three letters are copies of the old buffer contents, and the fourth one is a copy of the
first of those three. The next example is even more convincing (and only somewhat contrived)
...␣alf␣eastman␣easily␣yells␣A|AAAAAAAAAAAAAAAH....
The encoder creates the token (1, 9, ‘A’), matching the first nine copies of ‘A’ in the look-ahead buffer
and including the tenth ‘A’. This is why, in principle, the length of a match could be up to the size of the
look-ahead buffer minus one.
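The encoder described above can be sketched in Python as follows. This is a minimal illustration, not an efficient implementation: the names `longest_match` and `lz77_encode` and the default buffer sizes are assumptions, the whole input is held in one string, and the search is a plain linear scan. Ties between equally long matches go to the last one found (the largest offset), and a match is allowed to run on into the look-ahead buffer.

```python
def longest_match(data, pos, search_size, lookahead_size):
    """Return (offset, length) of the match chosen for the look-ahead
    buffer that starts at index pos."""
    best_offset, best_length = 0, 0
    # Keep one symbol in reserve for the third field of the token.
    max_length = min(lookahead_size - 1, len(data) - pos - 1)
    # Offset 1 is the symbol just left of the look-ahead buffer; larger
    # offsets lie farther back, so this loop is the backward scan.
    for offset in range(1, min(search_size, pos) + 1):
        length = 0
        while (length < max_length
               and data[pos - offset + length] == data[pos + length]):
            length += 1
        # ">=" keeps the last match found among equally long ones.
        if length > 0 and length >= best_length:
            best_offset, best_length = offset, length
    return best_offset, best_length


def lz77_encode(data, search_size=32, lookahead_size=16):
    """Encode a string as a list of (offset, length, next symbol) tokens."""
    tokens = []
    pos = 0
    while pos < len(data):
        offset, length = longest_match(data, pos, search_size, lookahead_size)
        tokens.append((offset, length, data[pos + length]))
        pos += length + 1  # slide past the match and the next symbol
    return tokens


text = "sir sid eastman easily teases sea sick seals"
print(lz77_encode(text)[:5])
```

The comparison `data[pos - offset + length]` may index past `pos`, which is exactly the overlap of the ‘alfa’ example: the match continues into symbols that the decoder will already have produced by the time it needs them.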
The decoder is much simpler than the encoder. It has to maintain a buffer, equal in size to the encoder’s
window. The decoder inputs a token, finds the match in its buffer, writes the match and the third token field
on its output file, and shifts the matched string and the third field into the buffer. This implies that LZ77,
or any of its variants, is useful in cases where a file is compressed once (or a few times) and is decompressed
often. A rarely-used archive of compressed files is a good example.
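The simplicity of the decoder shows in code. The sketch below assumes tokens are Python tuples (offset, length, symbol) and, for brevity, keeps the entire output in memory instead of a fixed-size buffer; the name `lz77_decode` is hypothetical.

```python
def lz77_decode(tokens):
    """Rebuild the text from (offset, length, symbol) tokens."""
    out = []
    for offset, length, symbol in tokens:
        start = len(out) - offset
        # Copy one symbol at a time so that a match may overlap the
        # symbols being written (length greater than offset).
        for k in range(length):
            out.append(out[start + k])
        out.append(symbol)
    return "".join(out)


print(lz77_decode([(0, 0, 'a'), (0, 0, 'l'), (0, 0, 'f'), (3, 4, ' ')]))
```

Copying symbol by symbol is what lets a token such as (3, 4, ‘␣’) work without any special handling: the fourth symbol copied is one that the same token wrote a moment earlier.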
At first it seems that this method does not make any assumptions about the input data. Specifically,
it does not pay attention to any symbol frequencies. A little thinking, however, shows that, because of the
nature of the sliding window, the LZ77 method always compares the look-ahead buffer to the recently input
text in the search buffer and never to text that was input long ago. Thus, the method implicitly assumes
that patterns in the input data occur close together. Data that satisfies this assumption will compress well.
The basic LZ77 method has been improved in several ways by researchers and programmers during
the 1980s and 1990s. One way to improve it is to use variable-size ‘offset’ and ‘length’ fields in the tokens.
Another way is to increase the sizes of both buffers. Increasing the size of the search buffer makes it possible
to find better matches, but the tradeoff is an increased search time. A large search buffer thus requires a
more sophisticated data structure that allows for fast search. A third improvement has to do with sliding the
window. The simplest approach is to move all the text in the window to the left after each match. A faster
method is to replace the linear window with a circular queue, where sliding the window is done by resetting
two pointers. Yet another improvement is adding an extra bit (a flag) to each token, thereby eliminating
the third field.
3.15 Approaches to Image Compression
A digital image is a rectangular set of dots, or picture elements, arranged in m rows and n columns. The
pair m×n is called the resolution of the image, and the dots are called pixels (except in the case of a fax
image, where they are referred to as pels). For the purpose of image compression it is useful to distinguish
the following types of images:
1. A bi-level (or monochromatic) image. This is an image where the pixels can have one of two values,
normally referred to as black and white. Each pixel can thus be represented by one bit. This is the simplest
type of image.
2. A grayscale image. A pixel in such an image is represented by n bits, indicating one of 2^n shades of
gray (or shades of some other color). The value of n is normally compatible with a byte size, i.e., it is 4, 8,
12, 16, 24, or some other convenient multiple of 4 or of 8.
3. A continuous-tone image. This type of image can have many similar colors (or grayscales). When
adjacent pixels differ by just one unit, it is hard or even impossible for the eye to distinguish between their
colors. As a result, such an image may contain areas with colors that seem to vary continuously as the eye
moves along the area. A pixel in such an image is represented by either a single large number (in the case
of many grayscales) or by three components (in the case of many colors). This type of image can easily be
obtained by taking a photograph with a digital camera, or by scanning a photograph or a painting.
4. A cartoon-like image. This is a color image which consists of uniform areas. Each area has a uniform
color but adjacent areas may have very different colors. This feature may be exploited to obtain better
compression.
This section discusses general approaches to image compression, not specific methods. These approaches
are different, but they remove redundancy from an image by using the following principle: If we select a
pixel in the image at random, there is a good chance that its neighbors will have the same color or very
similar colors. Thus, image compression is based on the fact that neighboring pixels are highly correlated.
3.15.1 Differencing
It is possible to get fairly good compression by simply subtracting from each pixel in a row of an image its
predecessor and writing the differences (properly encoded by variable-size, prefix codes) on the compressed
file. Differencing produces compression because the differences are small numbers and are decorrelated. The
following sequence of values are the intensities of 24 adjacent pixels in a row of a continuous-tone, grayscale
image
12, 17, 14, 19, 21, 26, 23, 29, 41, 38, 31, 44, 46, 57, 53, 50, 60, 58, 55, 54, 52, 51, 56, 60.
Only two of the 24 pixels are identical. Their average value is 40.3. Subtracting pairs of adjacent pixels
results in the sequence
12, 5, −3, 5, 2, 5, −3, 6, 12, −3, −7, 13, 2, 11, −4, −3, 10, −2, −3, −1, −2, −1, 5, 4.
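Differencing and its inverse take only a few lines. The following sketch uses the 24 pixel values listed above; the function names `difference` and `undifference` are assumptions, and the encoding of the differences with prefix codes is left out.

```python
pixels = [12, 17, 14, 19, 21, 26, 23, 29, 41, 38, 31, 44,
          46, 57, 53, 50, 60, 58, 55, 54, 52, 51, 56, 60]

def difference(row):
    """Keep the first pixel, then store each pixel minus its predecessor."""
    return [row[0]] + [row[i] - row[i - 1] for i in range(1, len(row))]

def undifference(diffs):
    """Rebuild the pixels by accumulating the differences."""
    row = [diffs[0]]
    for d in diffs[1:]:
        row.append(row[-1] + d)
    return row

diffs = difference(pixels)
print(diffs)
print(round(sum(pixels) / len(pixels), 1))  # average pixel value
print(undifference(diffs) == pixels)        # the step is lossless
```

Because the first value is kept unchanged, the decoder can rebuild every pixel exactly by a running sum.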
12, −7, −8, 8, −3, 2, −7, 9, 5, −14, −4, 20, −11, 7, −15, 1, 13, −12, −1, 4, −3, 1, 6, 1.
Figure 3.23: Maps of a Random Matrix (a) and Its Inverse (b).
b=inv(a); imagesc(b)
Exercise 3.14: Use mathematical software to illustrate the covariance matrices of (1) a matrix with corre-
lated values and (2) a matrix with decorrelated values.
where the rotation matrix R is orthonormal (i.e., the dot product of a row with itself is 1, the dot product
of different rows is 0, and the same is true for columns). The inverse transformation is
\[ (x, y) = (x^*, y^*)R^{-1} = (x^*, y^*)R^T = (x^*, y^*)\,\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}. \tag{3.3} \]
(The inverse of an orthonormal matrix is its transpose.)
It is obvious that most points end up having y-coordinates that are zero or close to zero, while the
x-coordinates don’t change much. Figure 3.24a,b shows that the distributions of the x and y coordinates
(i.e., the odd-numbered and even-numbered pixels of an image) before the rotation don’t differ substantially.
Figure 3.24c,d shows that the distribution of the x coordinates stays almost the same but the y coordinates
are concentrated around zero.
Once the coordinates of points are known before and after the rotation, it is easy to measure the
reduction in correlation. A simple measure is the sum Σ_i x_i y_i, also called the cross-correlation of points
(x_i, y_i).
Exercise 3.15: Given the five points (5, 5), (6, 7), (12.1, 13.2), (23, 25), and (32, 29) rotate them 45° clockwise
and calculate their cross-correlations before and after the rotation.
We can now compress the image by simply writing the transformed pixels on the compressed stream.
If lossy compression is acceptable, then all the pixels can be quantized, resulting in even smaller numbers.
We can also write all the odd-numbered pixels (those that make up the x coordinates of the pairs) on the
compressed stream, followed by all the even-numbered pixels. These two sequences are called the coefficient
vectors of the transform. The latter sequence consists of small numbers and may, after quantization, have
runs of zeros, resulting in even better compression.
It can be shown that the total variance of the pixels does not change by the rotation, since a rotation
matrix is orthonormal. However, since the variance of the new y coordinates is small, most of the variance
is now concentrated in the x coordinates. The variance is sometimes called the energy of the distribution of
pixels, so we can say that the rotation has concentrated (or compacted) the energy in the x coordinate and
has created compression this way.
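Energy compaction is easy to observe numerically. The sketch below pairs up the 24 pixel values from Section 3.15.1 and rotates each pair. The matrix R = (1/√2)[[1, −1], [1, 1]] is an assumption inferred from the inverse in Equation (3.3) and from the worked result (4, 5)R = (9, 1)/√2; the helper names are hypothetical.

```python
import math

pixels = [12, 17, 14, 19, 21, 26, 23, 29, 41, 38, 31, 44,
          46, 57, 53, 50, 60, 58, 55, 54, 52, 51, 56, 60]

def rotate(point):
    """Multiply the row vector (x, y) by R = (1/sqrt 2) [[1, -1], [1, 1]]."""
    x, y = point
    return ((x + y) / math.sqrt(2), (y - x) / math.sqrt(2))

def energy(values):
    """Sum of squares of a sequence of numbers."""
    return sum(v * v for v in values)

def cross_correlation(points):
    """Sum of x*y over all the points."""
    return sum(x * y for x, y in points)

# Odd-numbered pixels become x coordinates, even-numbered ones y coordinates.
pairs = list(zip(pixels[0::2], pixels[1::2]))
rotated = [rotate(p) for p in pairs]
xs, ys = zip(*rotated)

print(energy(pixels), energy(xs) + energy(ys))  # total energy is preserved
print(energy(ys))                               # small share left in y
print(cross_correlation(pairs), cross_correlation(rotated))
```

For these pixels the y coordinates keep an energy of only 220.5 after the rotation, while the total is unchanged up to rounding, so almost all of it sits in the x coordinates.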
[Figure 3.24, panels (a)–(d): plots not reproduced.]
Concentrating the energy in the x coordinate has another advantage. We know that this coordinate is
important, so we quantize the y coordinate (which is unimportant) coarsely. This increases compression while
losing only unimportant image data. The x coordinate should be quantized only lightly (fine quantization).
The following simple example illustrates the power of this basic transform. We start with the point (4, 5),
whose two coordinates are similar. Using Equation (3.2) the point is transformed to (4, 5)R = (9, 1)/√2 ≈
(6.36396, 0.7071). The energies of the point and its transform are 4² + 5² = 41 = (9² + 1²)/2. If we delete
the smaller coordinate (4) of the point, we end up with an error of 4²/41 = 0.39. If, on the other hand, we
delete the smaller of the two transform coefficients (0.7071), the resulting error is just 0.7071²/41 = 0.012.
Another way to obtain the same error is to consider the reconstructed point. Passing (1/√2)(9, 1) through the
inverse transform [Equation (3.3)] results in the original point (4, 5). Doing the same with (1/√2)(9, 0) results in
the approximate reconstructed point (4.5, 4.5). The energy difference between the original and reconstructed
points is the same small quantity
\[ \frac{(4^2 + 5^2) - (4.5^2 + 4.5^2)}{4^2 + 5^2} = \frac{41 - 40.5}{41} = 0.012. \]
This simple transform can easily be extended to any number of dimensions, with the only difference
that we cannot visualize spaces of more than three dimensions. However, the mathematics can easily be
extended. Instead of selecting pairs of adjacent pixels we can select triplets. Each triplet becomes a point
in three-dimensional space, and these points form a cloud concentrated around the line that forms equal
(although not 45°) angles with the three coordinate axes. When this line is rotated such that it coincides with
the x axis, the y and z coordinates of the transformed points become small numbers. The transformation is
done by multiplying each point by a 3×3 rotation matrix, and such a matrix is, of course, orthonormal. The
transformed points are then separated into three coefficient vectors, of which the last two consist of small
numbers. For maximum compression each coefficient vector should be quantized separately.
Here is a very practical approach to lossy or lossless image compression that uses rotations in 8-
dimensional space.
1. Divide the image into blocks of 8 × 8 pixels each. One such block, with 64 pixels denoted by L (large)
is shown in Figure 3.25a.
2. Rotate each of the eight rows of a block. The result of rotating a row in eight dimensions is eight
numbers, of which the last 7 are small (denoted by S in Figure 3.25b).
3. Rotate each of the eight columns of a block. The result of rotating all the columns is a block
of 8 × 8 transform coefficients, of which the coefficient at the top-left corner is large (it is called the DC
coefficient) and the remaining 63 coefficients (called the AC coefficients) are small (Figure 3.25c). Thus, this
double rotation concentrates the energy in the first of 64 dimensions. This type of transform is called the
two-dimensional discrete cosine transform or DCT.
4. If lossy compression is desired, the 63 AC coefficients can be coarsely quantized. This brings most of
them to zero, which helps in subsequent steps, while losing only unimportant image data.
5. Experience shows that the set of 64 transform coefficients get smaller in value as we move from
the top-left to the bottom-right corner of the 8 × 8 block. We therefore scan the block in zigzag pattern
(Figure 3.25d), resulting in a linear sequence of 64 coefficients that get smaller and smaller.
6. This sequence normally has many zeros. The run lengths of those zeros are determined and the entire
sequence is encoded by replacing each nonzero coefficient and each run length of zeros with a Huffman code.
This method is the basis of the well-known JPEG image compression standard.
L L L L L L L L L S S S S S S S L S S S S S S S
L L L L L L L L L S S S S S S S S ss ss ss ss ss ss ss
L L L L L L L L L S S S S S S S S ss ss ss ss ss ss ss
L L L L L L L L L S S S S S S S S ss ss ss ss ss ss ss
L L L L L L L L L S S S S S S S S ss ss ss ss ss ss ss
L L L L L L L L L S S S S S S S S ss ss ss ss ss ss ss
L L L L L L L L L S S S S S S S S ss ss ss ss ss ss ss
L L L L L L L L L S S S S S S S S ss ss ss ss ss ss ss
(a) (b) (c) (d)
Figure 3.26: A two-dimensional DCT and a zigzag scan pattern
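Steps 2 through 6 above can be sketched in plain Python. This is a simplified illustration, not the JPEG algorithm itself: the orthonormal DCT-II formula, the uniform quantization step, and all function names are assumptions, and the final Huffman coding is omitted.

```python
import math

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix; row k holds the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct2(block):
    """Compute C * block * C^T, which transforms every column and every
    row of the block (the order of the two passes does not matter)."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))

def zigzag(block):
    """Read a square block along its anti-diagonals, alternating direction."""
    n = len(block)
    order = sorted(((i, j) for i in range(n) for j in range(n)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else -p[0]))
    return [block[i][j] for i, j in order]

def quantize(coeffs, step):
    """Uniform quantization; a larger step is coarser."""
    return [round(c / step) for c in coeffs]

def run_lengths(seq):
    """Pair each nonzero value with the number of zeros preceding it."""
    pairs, zeros = [], 0
    for v in seq:
        if v == 0:
            zeros += 1
        else:
            pairs.append((zeros, v))
            zeros = 0
    pairs.append((zeros, None))  # trailing zeros, no value follows
    return pairs

# A smooth 8x8 block: intensities rise slowly from left to right.
block = [[100 + 2 * j for j in range(8)] for _ in range(8)]
sequence = quantize(zigzag(dct2(block)), step=4)
print(run_lengths(sequence))
```

For a smooth block like this one, the quantized zigzag sequence starts with a large DC value and is mostly zeros afterwards, so the run-length list is short.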
3.16 Secure Codes
Secure codes have always been important to kings, tyrants, and generals; and equally important to their
opponents. Messages sent by a government to various parts of the country have to be encoded in case
they fall into the wrong hands, and the same is true of orders issued by generals. However, the “wrong
hands” consider themselves the right side, not the wrong one, and always try to break secure codes. As a
result, the development of secure codes has been a constant race between cryptographers (code makers) and
cryptanalysts (code breakers). New codes have been developed throughout history and were broken, only
for newer, more sophisticated codes to be developed. This race has accelerated in the 20th century, because
of (1) the two World Wars, (2) advances in mathematics, and (3) the development of computers. A general
online reference for secure codes is [Savard 01].
Secure codes are becoming prevalent in modern life because of the fast development of telecommunica-
tions. Our telephone conversations are sometimes transmitted through communications satellites and our
email messages may pass through many computers, thereby making our private conversations vulnerable to
interception. We sometimes wish we had a secure code to use for our private communications. Businesses
and commercial enterprises rely heavily on sending and receiving messages and also feel the need for secure
communications. On the other hand, widespread use of secure codes worries law enforcement agencies, since
criminals (organized or otherwise) and terrorists may also use secure codes if and when they become widely
available.
The field of secure codes is divided into cryptography and steganography, and cryptography is further
divided into codes and ciphers (Figure 3.26). Steganography (Section 3.20) deals with information hiding,
while cryptography—derived from the Greek word kryptos, meaning “hidden”—tries to encode information,
so it is impossible (or at least very hard) to decipher. The term “code” refers to codes for words, phrases,
or for entire messages, while cipher is a code for each individual letter. For example, army units may agree
on the codeword green to mean attack at dawn and on red to imply retreat immediately. The words
green and red are codes. A code may be impossible to break, but the use of codes is limited, because codes
have to be agreed upon for every possible eventuality. A cipher, on the other hand, is a rule that tells how
to encode each letter in a message. Thus, for example, if we agree to replace each letter with the one two
places ahead of it in the alphabet, then the message attack at dawn will be encoded as cvvcem cv fcyp
or, even more securely, as cvvcemcvfcyp. A cipher is general, but can be broken. In practice, however, we
use the terms code and codebreaker instead of the more accurate cipher and cipherbreaker.
Figure 3.27: The Terminology of Secure Codes. [Diagram labels: Secure Code; Steganography (hiding); Cryptography (scrambling); Codes (replace words); Cipher (replace letters); Substitution; Transposition.]
A combination of code and cipher, called nomenclator, is also possible. Parts of a message may be
encrypted by codes, and the rest of the message, the parts for which codes do not exist, is encrypted by a
cipher.
It is a popular saying that the First World War was the chemists’ war, because of the large-scale use
of poison gas for the first time. The Second World War was the physicists’ war because of the use of
the atom bomb. Similarly, the Third World War (that we hope can be avoided) may turn out to be the
mathematicians’ war, because winning that war, if at all possible, may depend on the use of and the breaking
of, secure codes.
Some of the development as well as the breaking of codes is done by researchers at universities and
research institutes all over the world (see [Flannery 00] for an unusual example of this). It is generally
agreed, however, that most of the work in this field is done in secret by government agencies. Two well-
known examples of such agencies are the National Security Agency (NSA) in the United States and the
Government Communications Headquarters (GCHQ) in the United Kingdom.
Encrypting a message involves two ingredients, an algorithm and a key. There are many known encryp-
tion algorithms, but the details of each depend on the choice of a key. Perhaps the simplest example of an
encryption algorithm is letter shifting. The algorithm is: A message is encrypted by replacing each letter
with the letter located n positions ahead of it (cyclically) in the alphabet. The key is the value of n. Here
is an example for n = 3 (note how X is replaced by A).
ABCDEFGHIJKLMNOPQRSTUVWXYZ
DEFGHIJKLMNOPQRSTUVWXYZABC
The top line is the plain alphabet and the bottom line is the cipher alphabet.
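Letter shifting is short enough to write out in full. In the sketch below the function name `caesar` is an assumption; letters are converted to upper case and any other character (such as a space) is passed through unchanged.

```python
import string

ALPHABET = string.ascii_uppercase

def caesar(text, n):
    """Replace each letter by the one n positions ahead, cyclically."""
    shift = n % 26
    cipher_alphabet = ALPHABET[shift:] + ALPHABET[:shift]
    return text.upper().translate(str.maketrans(ALPHABET, cipher_alphabet))

print(caesar(ALPHABET, 3))   # the cipher alphabet for n = 3
secret = caesar("ATTACK AT DAWN", 3)
print(secret)
print(caesar(secret, -3))    # shifting back by n decrypts
```

Decryption is the same operation with the key negated, which is one reason the method offers so little security: trying all shifts takes a moment.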
Such simple shifting is called the Caesar cipher, because it is first described in Julius Caesar’s book
Gallic Wars. It is an example of a substitution algorithm, in which each letter is substituted by a different
letter (or sometimes by itself). Most encryption algorithms are based on some type of substitution. A
simple example of a substitution algorithm that can also be made very secure is the book cipher. A book is
chosen and a page number is selected at random. The words on the page are ¹numbered ²and ³a ⁴table ⁵is
⁶prepared, ⁷with ⁸the ⁹first ¹⁰letter ¹¹of ¹²each ¹³word ¹⁴and ¹⁵the ¹⁶word’s ¹⁷number. ¹⁸This ¹⁹code ²⁰table
²¹is ²²later ²³used ²⁴to ²⁵encrypt ²⁶messages ²⁷by ²⁸replacing ²⁹each ³⁰letter ³¹of ³²the ³³message ³⁴with ³⁵a
³⁶number ³⁷from ³⁸the ³⁹table. For example, the message NOT NOW may be encoded as 36|31|20|17|11|13 (but
may also be encoded differently). If the messages are short, and if a different page is used for each message,
then this simple code is very secure (but the various page numbers have to be agreed upon in advance, which
makes this method impractical in most situations).
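The book cipher can be sketched as follows, using the 39 numbered words of the paragraph above as the "page". The names `PAGE`, `build_table`, `encrypt`, and `decrypt` are assumptions made for this illustration; a letter that starts no word on the page cannot be encoded and raises a `KeyError`.

```python
import random

# The 39 numbered words from the text serve as the chosen page.
PAGE = ("numbered and a table is prepared, with the first letter of each "
        "word and the word's number. This code table is later used to "
        "encrypt messages by replacing each letter of the message with a "
        "number from the table.")

def build_table(page):
    """Map each initial letter to the numbers of the words starting with it."""
    table = {}
    for number, word in enumerate(page.split(), start=1):
        table.setdefault(word[0].upper(), []).append(number)
    return table

def encrypt(message, table, rng=random):
    """Replace each letter by a randomly chosen number from the table."""
    return [rng.choice(table[ch]) for ch in message.upper() if ch.isalpha()]

def decrypt(numbers, page):
    """Look up each numbered word and take its first letter."""
    words = page.split()
    return "".join(words[n - 1][0].upper() for n in numbers)

table = build_table(PAGE)
print(table["N"])                                # all the choices for N
print(decrypt([36, 31, 20, 17, 11, 13], PAGE))   # the encoding from the text
print(decrypt(encrypt("NOT NOW", table), PAGE))
```

The random choice among several numbers for the same letter is what allows a message to "be encoded differently" each time.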
Before we discuss substitution algorithms in detail, it is important to mention another large family
of encryption algorithms, namely transposition algorithms. In such an algorithm, the original message is
replaced by a permutation of itself. If the message is long, then each sentence is replaced by a permutation
of itself. The number of permutations of n objects is n!, a number that grows much faster than n. However,
the permutation being used has to be chosen such that the receiver will be able to decipher the message.
A simple example of such a method is to break a message up into two strings, one with the odd-numbered
letters of the message and the other with the even-numbered letters, then concatenate the two strings. For
example:
WAIT␣FOR␣ME␣AT␣MIDNIGHT
W I ␣ O ␣ E A ␣ I N G T
A T F R M ␣ T M D I H
WI␣O␣EA␣INGTATFRM␣TMDIH
This method can be extended by selecting a key n, breaking the message up into n strings (where the first
string contains letters 1, n + 1, 2n + 1,. . . ), and concatenating them.
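The extension to a key n can be sketched with Python slices. The function names are assumptions, and the example message is assumed to be written with spaces between its words, each space counting as a symbol.

```python
def transpose_encrypt(message, n=2):
    """Concatenate the n strings made of every n-th symbol."""
    return "".join(message[k::n] for k in range(n))

def transpose_decrypt(cipher, n=2):
    """Put the n strings back into their interleaved positions."""
    total = len(cipher)
    plain = [""] * total
    start = 0
    for k in range(n):
        size = len(range(k, total, n))  # length of the k-th string
        plain[k::n] = cipher[start:start + size]
        start += size
    return "".join(plain)

secret = transpose_encrypt("WAIT FOR ME AT MIDNIGHT", 2)
print(secret)
print(transpose_decrypt(secret, 2))
```

The receiver needs only n and the message length to work out where each of the n strings ends.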
3.16.1 Kerckhoffs’ principle
Back to substitution algorithms. An important principle in cryptography, due to the Dutch linguist Auguste
Kerckhoffs von Nieuwenhoff, states that the security of an encrypted message must depend on keeping the
key secret. It must not depend on keeping the encrypting algorithm secret. This principle is widely accepted
and it implies that there must be many possible keys to an algorithm. The Caesar algorithm, for example,
is very insecure because it is limited to the 26 keys 1 to 26. Shifting the 27-symbol sequence (26 letters and
a space) 27 positions returns it to its original position, and shifting it 27 + n positions is identical to shifting
it just n positions.
Kerckhoffs’ principle
One should assume that the method used to encipher data is
known to the opponent, and that security must lie in the choice
of key. This does not necessarily imply that the method should
be public, just that it is considered public during its creation.
—Auguste Kerckhoffs
language, the various letters appear in texts with different probabilities (we say that the distribution of
letters is nonuniform). Some letters are common while others are rare. The most common letters in English,
for example, are E, T, and A (if a blank space is also considered a letter, then it is the most common), and
the least common are Z and Q. If a monoalphabetic substitution code replaces E with, say, D, then D should
be the most common letter in the ciphertext.
The letter distribution in a language is computed by selecting documents that are considered typical
in the language and counting the number of times each letter appears in those documents. For reliable
results, the total number of letters in all the documents should be in the hundreds of thousands. Table 3.27
lists the letter distribution computed from the (approximately) 708,000 letters constituting a previous book
by this author. Any single document may have a letter distribution very different from the standard. In
a mathematical text, the words quarter, quadratic, and quadrature may be common, increasing the
frequencies of Q and U, while a text on the effects of ozone on zebras’ laziness in Zaire may have unusually
many occurrences of Z. Also, any short text may have a letter distribution sufficiently different from the
standard to defy simple analysis. The short story The Gold Bug by Edgar Allan Poe is an early example of
breaking a monoalphabetic substitution code.
Another statistical property of letters in a language is the relation between pairs (and even triplets)
of letters (digrams and trigrams). In English, a T is very often followed by H (as in this, that, then, and
them) and a Q is almost always followed by a U. Thus, digram frequencies can also be used in deciphering a
stubborn monoalphabetic substitution code.
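Counting letters and digrams is the first step of such an analysis. The sketch below uses `collections.Counter` from the Python standard library; the function names and the sample sentence are assumptions, and digrams are counted within words only.

```python
from collections import Counter

def letter_counts(text):
    """Count the letters A-Z, ignoring case and all other characters."""
    return Counter(ch for ch in text.upper() if "A" <= ch <= "Z")

def digram_counts(text):
    """Count pairs of adjacent letters inside each word."""
    counts = Counter()
    for word in text.upper().split():
        letters = [ch for ch in word if "A" <= ch <= "Z"]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts

sample = "this is the thing that they then gave them"
print(letter_counts(sample).most_common(3))
print(digram_counts(sample).most_common(3))
```

Applied to a long ciphertext, the most common symbol is a natural first guess for E, and a symbol pair that is unusually frequent is a candidate for TH.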
Not everything that counts can be counted, and not everything
that can be counted counts.
—Albert Einstein
Alberti’s cipher algorithm was extended by several people, culminating in the work of Blaise de Vi-
genère, who in 1586 published the polyalphabetic substitution cipher named after him. The Vigenère system
uses 26 cipher alphabets (including just letters, not a space), each shifted one position relative to its prede-
cessor. Figure 3.28 shows the entire Vigenère square, generally regarded as the most perfect of the simpler
polyalphabetical substitution ciphers.
Plain ABCDEFGHIJKLMNOPQRSTUVWXYZ
1 BCDEFGHIJKLMNOPQRSTUVWXYZA
2 CDEFGHIJKLMNOPQRSTUVWXYZAB
3 DEFGHIJKLMNOPQRSTUVWXYZABC
4 EFGHIJKLMNOPQRSTUVWXYZABCD
5 FGHIJKLMNOPQRSTUVWXYZABCDE
6 GHIJKLMNOPQRSTUVWXYZABCDEF
7 HIJKLMNOPQRSTUVWXYZABCDEFG
8 IJKLMNOPQRSTUVWXYZABCDEFGH
9 JKLMNOPQRSTUVWXYZABCDEFGHI
10 KLMNOPQRSTUVWXYZABCDEFGHIJ
11 LMNOPQRSTUVWXYZABCDEFGHIJK
12 MNOPQRSTUVWXYZABCDEFGHIJKL
13 NOPQRSTUVWXYZABCDEFGHIJKLM
14 OPQRSTUVWXYZABCDEFGHIJKLMN
15 PQRSTUVWXYZABCDEFGHIJKLMNO
16 QRSTUVWXYZABCDEFGHIJKLMNOP
17 RSTUVWXYZABCDEFGHIJKLMNOPQ
18 STUVWXYZABCDEFGHIJKLMNOPQR
19 TUVWXYZABCDEFGHIJKLMNOPQRS
20 UVWXYZABCDEFGHIJKLMNOPQRST
21 VWXYZABCDEFGHIJKLMNOPQRSTU
22 WXYZABCDEFGHIJKLMNOPQRSTUV
23 XYZABCDEFGHIJKLMNOPQRSTUVW
24 YZABCDEFGHIJKLMNOPQRSTUVWX
25 ZABCDEFGHIJKLMNOPQRSTUVWXY
26 ABCDEFGHIJKLMNOPQRSTUVWXYZ
Figure 3.29: The Vigenère Cipher System.
A message is encrypted by replacing each of its letters using a different row, selecting rows by means of
a key. The key (a string of letters with no spaces) is written above the Vigenère square and is repeated as
many times as necessary to produce a row of 26 letters. For example, the key LAND AND TREE produces the
following
Key LANDANDTREELANDANDTREELAND
Plain alphabet ABCDEFGHIJKLMNOPQRSTUVWXYZ
Cipher alphabet 1 BCDEFGHIJKLMNOPQRSTUVWXYZA
Cipher alphabet 2 CDEFGHIJKLMNOPQRSTUVWXYZAB
24 more lines ... ... ...
Each letter of the original message is replaced using the row indicated by the next letter of the key. Thus,
the first letter of the message is replaced using the row that starts with L (row 11). If the first letter is, say,
E, it is replaced by P. The second letter is replaced by a letter from the row indicated by the A of the key
(row 26), and so on.
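Selecting the row that starts with the key letter is the same as adding the key letter's position (A = 0, B = 1, ...) to the plaintext letter modulo 26. The sketch below relies on that observation; the function name `vigenere` is an assumption, and characters other than letters are dropped, since the square covers letters only.

```python
import string

LETTERS = string.ascii_uppercase

def vigenere(text, key, decrypt=False):
    """Encrypt (or decrypt) text with the Vigenere square and a key."""
    shifts = [LETTERS.index(k) for k in key.upper() if k in LETTERS]
    out = []
    used = 0  # number of key letters consumed so far
    for ch in text.upper():
        if ch not in LETTERS:
            continue
        shift = shifts[used % len(shifts)]  # the key repeats as needed
        if decrypt:
            shift = -shift
        out.append(LETTERS[(LETTERS.index(ch) + shift) % 26])
        used += 1
    return "".join(out)

print(vigenere("E", "LAND AND TREE"))  # row L turns E into P
secret = vigenere("WAIT FOR ME AT MIDNIGHT", "LAND AND TREE")
print(secret)
print(vigenere(secret, "LAND AND TREE", decrypt=True))
```

A key letter A contributes a shift of zero, which corresponds to row 26 of the square, where every letter is replaced by itself.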
It is obvious that the strength of this simple algorithm lies in the key. A sufficiently long key, with
many different letters uses most of the rows in the Vigenère square and results in a secure encrypted string.
The number of possible keys is enormous, since a key can be between 1 and 26 letters long. The knowledge
that a message has been encrypted with this method helps little in breaking it. Simple frequency analysis
does not work, since a letter in the original text, such as S, is generally replaced by different letters at
its different occurrences.
The Vigenère cipher, as all other polyalphabetic substitution methods, is much more secure than any
monoalphabetic substitution method, and was not broken until the 19th century, first in the 1850s by
Charles Babbage who never published this achievement (Babbage is well-known for his attempts to construct
mechanical computers, see Appendix A), then, in the 1860s, independently by Friedrich W. Kasiski, a retired
officer in the Prussian army.
The key to breaking a polyalphabetic substitution cipher is the key (no pun intended) used in the
encryption. The keys commonly used are meaningful words and phrases and they tend to be short. It has
been proved that the use of a random key can lead to absolute security, provided that the key is used just
once. Thus, polyalphabetic substitution ciphers can be the ultimate secret codes provided that (1) random
keys are used and (2) a key is never used more than once. Such a one-time pad cipher is, of course, not
suitable for general use. It is impractical and unsafe to distribute pads of random keys to many army units
or to many corporate offices. However, for certain special applications, a one-time pad cipher is practical
and absolutely secure. An example is private communication between world leaders.
In spite of its obvious advantage, the Vigenère method was not commonly used after its publication
because of its complexity. Encrypting a message with this method (as with any other polyalphabetic substi-
tution method) by hand is slow, tedious, and error prone. It should be done by machine, and such machines
have been developed since the 1500s. A historically notable example is a device with 36 rotating cylinders,
designed and built in the 1790s by Thomas Jefferson, third President of the United States.
A better-known code machine is the German Enigma, developed in the 1910s by the German inventor
Arthur Scherbius. The Enigma was an electric machine with a 26-key keyboard to enter the plaintext and
26 lamps to indicate the ciphertext. The operator would press a key and a lamp would light up, indicating a
letter. The Morse code for that letter would then be sent by the operator on the military telegraph or radio.
Between the keyboard and the lamps, the Enigma had 3 scrambling rotors (with a 4th one added later) that
changed the substitution rule each time a key was pressed. The Enigma was used extensively by the German
army during World War II, but its code was broken by British code breakers working at Bletchley Park. A
detailed description of the Enigma and the story of breaking its code can be found in [Singh 99].
In the absence of encrypting machines, past cryptographers that needed a cipher more secure than simple
monoalphabetic substitution started looking for methods simpler than (but also weaker than) polyalphabetic
substitutions.
a b c d e f g h i j k l m n o p q r s t u v w x y z
90 70 62 18 76 89 61 83 71 44 91 94 52 81 67 43 31 46 13 72 60 87 07 28 79 51
03 84 82 14 32 53 64 37 27 36 96 77 00 22 98 93 17 65 45 38
23 75 54 01 20 12 69 08 97 57 10 80 73
09 95 63 74 56 58 26 29 92 21 66
55 33 25 05 48 11 02 39 59
40 68 15 41 34 85 78 99 86
49 47 16 04
88 06 35
19 50
24
30
42
Table 3.30: A Homophonic Substitution Cipher.
substitution codes. Numbers can be assigned to the individual letters and also to many common syllables.
When a common syllable is encountered in the plaintext, it is replaced by its code number. When a syllable
is found that has no code assigned, its letters are encrypted individually by replacing each letter with its
code. Certain codes may be assigned as traps, to throw any would-be cryptanalyst off the track. Such a
special code may indicate, for example, that the following (or the preceding) number is spurious and should
be ignored.
3.17 Transposition Ciphers
In a substitution cipher, each letter in the plaintext is replaced by another letter. In a transposition cipher,
the entire plaintext is replaced by a permutation of itself. If the plaintext is long, then each sentence is
replaced by a permutation of itself. The number of permutations of n objects is n!, a number that grows
much faster than n. However, the permutation being used has to be chosen such that the receiver will be
able to decipher the message. [Barker 92] describes a single-column transposition cipher.
It is important to understand that in a transposition cipher the plainletters are not replaced. An E in
the plaintext does not become another letter in the ciphertext, it is just moved from its original position to
another place. Frequency analysis of the ciphertext will reveal the normal letter frequencies, thus serving as
a hint to the cryptanalyst that a transposition cipher was used. A digram frequency analysis should then
be performed, to draw detailed conclusions about the cipher. Common digrams, such as th and an, would
be torn apart by a transposition cipher, reinforcing the suspicion that this type of cipher was used.
We start with a few simple transposition ciphers. These are easy to break but they indicate what can
be done with this type of cipher.
1. Break the plaintext into fixed-size blocks and apply the same permutation to each block. As an
example, the plaintext ABCDE. . . can be broken into blocks of four letters each and one of the 4! = 24
permutations applied to each block. The ciphertext can be generated by either collecting the four letters
of each block, or by collecting the first letter of each block, then the second letter, and so on. Figure 3.31
shows a graphic example of some 4-letter blocks. The resulting ciphertext can be either ACBD EGFH IKJL
MONP. . . or AEIM. . . CGKO. . . BFJN. . . and DHLP. . . .
A D E H I L M P
B C F G J K N O
Figure 3.31: 2×2 Blocks For a Transposition Cipher.
2. A square or a rectangle with n cells is filled with the integers from 1 to n in some order. The first n
plainletters are placed in the cells according to these integers and the ciphertext is generated by scanning the
rectangle in some order (possibly rowwise or columnwise) and collecting letters. An alternative is to place
the plainletters in the rectangle in rowwise or columnwise order, and generate the ciphertext by scanning
the rectangle according to the integers, collecting letters. If the plaintext is longer than n letters, the process
is repeated. Some null letters, such as X or Q, may have to be appended to the last block. Figure 3.32a–c
shows three 8×8 squares. The first has its integers arranged in a knight’s tour, the second is a magic square,
and the third is a self-complementary magic square (where if the integer i is replaced by 65 − i, the result is
the same square, rotated).
47 10 23 64 49  2 59  6    52 61  4 13 20 29 36 45     1 62  5 59 58 12 61  2
22 63 48  9 60  5 50  3    14  3 62 51 46 35 30 19    57 14 50 48 19 45 18  9
11 46 61 24  1 52  7 58    53 60  5 12 21 28 37 44    10 27 34 25 41 36 33 54
62 21 12 45  8 57  4 51    11  6 59 54 43 38 27 22    49 30 26 23 37 22 21 52
19 36 25 40 13 44 53 30    55 58  7 10 23 26 39 42    13 44 43 28 42 39 35 16
26 39 20 33 56 29 14 43     9  8 57 56 41 40 25 24    11 32 29 24 40 31 38 55
35 18 37 28 41 16 31 54    50 63  2 15 18 31 34 47    56 47 20 46 17 15 51  8
38 27 34 17 32 55 42 15    16  1 64 49 48 33 32 17    63  4 53  7  6 60  3 64
          (a)                        (b)                        (c)
Figure 3.32: Three 8×8 Squares For Transposition Ciphers.
In addition to row and column scanning, such a square can be scanned in zigzag, by diagonals, or in a
spiral (Figure 3.33a–c, respectively).
3. The plaintext is arranged in a zigzagging rail fence whose height is the key of the cipher. Assuming
the plaintext ABCDEFGHIJK and a key of 3, the rail fence is
A   E   I
 B D F H J
  C   G   K
The fence is scanned row by row to form the ciphertext AEI.BDFHJ.CGK, where the dots are special symbols
needed for decryption. This is an example of an irregular transposition cipher.
4. The plaintext is divided into groups of n×n letters and each group is placed into an n×n square as
shown in Figure 3.34, which illustrates the plaintext PLAINTEXT IN A SQUARE. The ciphertext is obtained
by scanning the square in rows PLNIR IATNE TXEAX AUQSX XXXXX.
5. Start with a rectangle of m rows and n columns. In each column, select half the rows at random.
This creates a template of m×n cells, half of which have been selected. All authorized users of the cipher
must have this template. The plaintext is written in rows and is collected by columns to form the ciphertext.
Figure 3.35 shows an example where the plaintext is the letters A through R and the ciphertext is CFLGK
MADPH NQEIR BJO.
P L N I R
I A T N E
T X E A
A U Q S
. . A . . B
C . D . E .
F G . H I J
. K . . . .
L M . N . O
. . P Q R .
(dots mark the cells that are not selected by the template)
6. Select a key k = k1 k2 . . . kn . Write the plaintext in an n-row rectangle row by row. On row i, start
the text at position ki . The ciphertext is generated by collecting the letters in columns from left to right.
Figure 3.36 is an example. It shows how the plaintext ABCD...X is enciphered into NAOBG PCHQU DIRVE
JLSWF KMTX by means of the key 23614. This type of transposition cipher is irregular and thus somewhat
more secure, but involves a long key.
  A B C D E F
    G H I J K
          L M
N O P Q R S T
      U V W X
Figure 3.36: An Irregular Rectangle With Plaintext.
7. A simple example of a transposition cipher is to reverse every string of three letters, while keeping the
positions of all blank spaces. For example, the message ATTACK AT DAWN becomes TTAKCA DT ANWA. Another
elementary transposition is to break a message up into two strings, one with the odd-numbered characters of
the message and the other with the even-numbered characters (blank spaces are counted as positions and
then dropped), then concatenate the two strings. For example:
WAIT FOR ME AT MIDNIGHT
W I   O   E A   I N G T
 A T F R M   T M D I H
WIOEAINGTATFRMTMDIH
This method can be extended by selecting a key n, breaking the message up into n strings (where the first
string contains letters 1, n + 1, 2n + 1,. . . ), and concatenating them.
8. An anagram is a rearrangement of the letters of a word or a phrase. Thus, for example, red code
is an anagram of decoder and strict union is an anagram of instruction. It is possible to encrypt a
message by scrambling its letters to generate an anagram (meaningful or not). One simple way to create
nonsense anagrams is to write the letters of a message in alphabetical order. Thus I came, I saw, I
conquered becomes AACCDEEEIIIMNOQRSUW. This kind of cipher is practically impossible for a receiver (even
an authorized one) to decipher, but has its applications. In past centuries, scientists sometimes wanted to
keep a discovery secret and at the same time keep their priority claim. Writing a report of the discovery in
anagrams was a way to achieve both aims. No one could decipher the code, yet when someone else claimed
to have made the same discovery, the original discoverer could always produce the original report in plaintext
and prove that this text generates the published anagram.
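Some of the simple ciphers in the list above are easy to try out in a few lines of code. The following Python sketch covers the rail fence of item 3, the irregular rectangle of item 6, and the n-string split of item 7. The function names are hypothetical, the rectangle width of 7 is an assumption read off the figure (the text does not state it), and the split counts blank spaces as positions before dropping them, which is what the printed two-string example implies.

```python
def rail_fence_rows(plaintext: str, height: int) -> list[str]:
    """Item 3: write the text on a zigzag fence and return its rows."""
    rows = [[] for _ in range(height)]
    row, step = 0, 1
    for ch in plaintext:
        rows[row].append(ch)
        if height > 1:
            if row == 0:
                step = 1            # bounce off the top rail
            elif row == height - 1:
                step = -1           # bounce off the bottom rail
            row += step
    return ["".join(r) for r in rows]


def offset_rectangle(plaintext: str, key: str, width: int) -> str:
    """Item 6: row i starts at column key[i]; collect letters by columns.

    Assumptions of this sketch: single-digit key entries, a caller-supplied
    width, and one rectangle only (letters that do not fit are ignored).
    """
    grid = [[""] * width for _ in key]
    letters = iter(plaintext)
    for r, start in enumerate(key):
        for c in range(int(start) - 1, width):
            grid[r][c] = next(letters, "")
    return "".join(grid[r][c] for c in range(width) for r in range(len(key)))


def split_transposition(message: str, n: int = 2) -> str:
    """Item 7: characters 1, n+1, 2n+1, ..., then 2, n+2, ..., and so on."""
    return "".join(message[i::n] for i in range(n)).replace(" ", "")


print(rail_fence_rows("ABCDEFGHIJK", 3))
print(offset_rectangle("ABCDEFGHIJKLMNOPQRSTUVWX", "23614", 7))
print(split_transposition("WAIT FOR ME AT MIDNIGHT"))
```

The rail fence function returns the rows rather than a single string, so that the row boundaries (the special symbols mentioned in item 3) remain visible.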
The fundamental problem of transposition ciphers is how to specify a permutation by means of a short,
easy to remember key. A detailed description of a permutation can be as long as the message itself, or even
longer. It should contain instructions such as “swap positions 54 and 32, move position 27 to position 98,”
and so on. The following sections discuss several ways to implement transposition ciphers.
As he sat on the arm of a leather-covered arm chair, putting his
face through all its permutations of loathing, the whole house-
hold seemed to spring into activity around him.
—Kingsley Amis, Lucky Jim, 1953
3.18 Transposition by Turning Template
A B C D E F
G H I J K L
M N O P Q R
S T U V W X
Y Z 0 1 2 3
4 5 6 7 8 9
(a) (b) (c)
Naturally, the opposite process is also possible. The first nine letters of the plaintext are written through
the template holes, the template is rotated, and the process repeated three more times, for a total of 36
letters arranged in a 6×6 square. The ciphertext is prepared by reading the 36 letters row by row.
The template is normally a square where the number of cells is divisible by 4. However, the template
can also be an equilateral triangle where the number of cells is divisible by 3 (Figure 3.37b) or a square with
an odd number of cells (such as 5×5), in which case the central row and column can simply be ignored.
Decrypting is the reverse of encrypting, but breaking this code, even though not especially hard, is tricky
and requires intuition. The length of the ciphertext provides a clue to the size of the template (in the case of
a 6×6 template, the length of the ciphertext is a multiple of 36). Once the codebreaker knows (or guesses or
suspects) the size of the template, further clues can be found from the individual letters of common digrams.
If it is suspected that the template size is 6×6, the ciphertext is broken up into 36-letter segments, and each
segment written as a 6×6 square. Any occurrences of the letters T and H in a square could be due to the digram
TH or the trigram THE that were broken apart by the encryption. If the codebreaker discovers certain hole
positions that would unite the T and H in one square, then other squares (i.e., other ciphertext segments)
may be tried with these holes, to check whether their T and H also unite. This is a tedious process, but it
can be automated to some degree. A computer program can try many hole combinations and display several
squares of ciphertext partly decrypted with those holes. The user may reject some hole combinations and
ask for further tests (perhaps with another common digram) with other combinations.
It is very easy to determine hole positions in such a square template. Figure 3.38a shows a 6×6 template
divided into four 3×3 templates rotated with respect to each other. The 9 cells of each small template are
numbered in the same way. The rule for selecting cells is: If cell 3 is selected for a hole in one of the 3×3
templates, then cell 3 should not be a hole in any of the other three templates. We start by selecting cell
1 in one of the four templates (there are four choices for this). In the next step, we select cell 2 in one of
the four templates (again four choices). This is repeated nine times, so the total number of possible hole
configurations is 4^9 = 262,144. This, of course, is much smaller than the total number 36! of permutations of
36 letters.
Exercise 3.16: How many hole configurations are possible in an 8×8 template?
We illustrate this method with the plaintext happy families are all alike every unhappy family
is unhappy in its own way. The first 36 letters are arranged in a 6×6 square
happyf
amilie
sareal
lalike
everyu
nhappy
Figure 3.38b shows how the same template is placed over this square four times while rotating it each
time. The resulting ciphertext is hpymeakea afiislieu peaalyhpy alrlevrnp.
[Figure: (a) a 6×6 template divided into four 3×3 quadrants whose cells are numbered 1 through 9, each quadrant rotated with respect to the others; (b) the template placed over the plaintext square in its four orientations; (c) the hole positions derived from the key FAKE CIPHER.]
It is also possible to use a key to determine the hole positions. The advantage is that the users don’t
have to keep the actual template (which may be hard to destroy in a sudden emergency). Consider, for
example, the key FAKE CIPHER. The first key letter, F, is the 6th letter of the alphabet, which implies that
the square should be of size 6×6. Such a square has 36 cells, so 36/4 = 9 hole positions are needed, where
each position is determined by one of the key’s letters. The key should therefore be nine letters long. If it is
shorter, it is repeated as needed, and if it is longer, only the first nine letters are considered. The letters are
numbered according to their alphabetic order. A is numbered 1. The next key letter in alphabetical order
is C, so it is numbered 2. The next one is E, so its two occurrences are numbered 3 and 4. Thus, the key
becomes the nine digits 518327964. These are divided into the four groups 51, 83, 27, and 964 for the four
quadrants. The first group corresponds to holes 5 and 1 in the first quadrant. The second group corresponds
to holes 8 and 3 in the second quadrant, and so on (Figure 3.38c).
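The conversion of a key into hole numbers can be sketched in Python as follows. The names key_order and template_from_key are hypothetical. The rule for dividing the digits among the quadrants (equal groups, with any leftover holes going to the last quadrant) is an assumption inferred from the single example 51, 83, 27, 964.

```python
def key_order(letters: str) -> list[int]:
    """Number letters 1..n alphabetically; equal letters go left to right."""
    order = sorted(range(len(letters)), key=lambda i: (letters[i], i))
    ranks = [0] * len(letters)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def template_from_key(key: str) -> tuple[int, list[list[int]]]:
    """Return the square size and the hole numbers for the four quadrants."""
    letters = "".join(c for c in key.upper() if c.isalpha())
    side = ord(letters[0]) - ord("A") + 1       # F -> 6
    holes = side * side // 4                    # 36 / 4 = 9
    repeats = -(-holes // len(letters))         # ceiling division
    used = (letters * repeats)[:holes]          # repeat or truncate the key
    ranks = key_order(used)
    size = holes // 4
    # Assumption: leftover holes go to the last quadrant (sizes 2, 2, 2, 3).
    groups = [ranks[q * size:(q + 1) * size] for q in range(3)]
    groups.append(ranks[3 * size:])
    return side, groups


print(template_from_key("FAKE CIPHER"))
```

Sorting on the pair (letter, position) is what gives the two occurrences of E consecutive numbers in left-to-right order.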
This method can be made more secure by double encryption. Once the ciphertext is generated, it is
encrypted again by means of a mirror image of the template, or even by a completely different template. However,
even with this enhancement, the turning template method is unsafe, and has the added problem of keeping
the template from falling into the wrong hands.
3.19 Columnar Transposition Cipher
A template requires constant protection, so it is preferable to specify a permutation by means of a key. Here
is a simple way of doing this. To specify a permutation of, say, 15 letters, we start with a key whose length
is at least 15 letters (not counting spaces), eliminate the spaces, and keep the first 15 letters. Thus, if the
key is the phrase TRANSPOSITION CIPHERS, we first generate the 15-letter string TRANSPOSITIONCI. The 15
letters are then numbered as follows: The A is numbered 1. There are no B’s, so the C is numbered 2. The
next letter (alphabetically) is I. There are three I’s, so they are numbered 3, 4, and 5, and so on, to obtain
 T  R  A  N  S  P  O  S  I  T  I  O  N  C  I
14 11  1  6 12 10  8 13  3 15  4  9  7  2  5
The sequence 14, 11, 1, 6, 12, 10, 8, 13, 3, 15, 4, 9, 7, 2, and 5 specifies the permutation. After the
permutation, the third plainletter should be in position 1, the 14th plainletter should be in position 2, and
so on. The plaintext is now arranged as a rectangle with 15 columns, the columns switched according to the
permutation, and the ciphertext collected column by column. Figure 3.39 shows an example.
 T  R  A  N  S  P  O  S  I  T  I  O  N  C  I
14 11  1  6 12 10  8 13  3 15  4  9  7  2  5
 h  a  p  p  y  f  a  m  i  l  i  e  s  a  r
 e  a  l  l  a  l  i  k  e  e  v  e  r  y  u
 n  h  a  p  p  y  f  a  m  i  l  y  i  s  u
 n  h  a  p  p  y  i  n  i  t  s  o  w  n  w
 a  y  e  v  e  r  y  t  h  i  n  g  w  a  s
 i  n  c  o  n  f  u  s  i  o  n  i  n  t  h
 e  o  b  l  o  n  s  k  y  s  h  o  u  s  e
 t  h  e  w  i  f  e  h  a  d  d  i  s  c  o
 v  e  r  e  d  t  h  a  t  t  h  e

 p  a  i  i  r  p  s  a  e  f  a  y  m  h  l
 l  y  e  v  u  l  r  i  e  l  a  a  k  e  e
 a  s  m  l  u  p  i  f  y  y  h  p  a  n  i
 a  n  i  s  w  p  w  i  o  y  h  p  n  n  t
 e  a  h  n  s  v  w  y  g  r  y  e  t  a  i
 c  t  i  n  h  o  n  u  i  f  n  n  s  i  o
 b  s  y  h  e  l  u  s  o  n  o  o  k  e  s
 e  c  a  d  o  w  s  e  i  f  h  i  h  t  d
 r     t  h     e     h  e  t  e  d  a  v  t
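The example above can be checked with a short Python sketch. The function names are hypothetical, and the plaintext string is reconstructed from the rows of the figure. The sketch returns both the rectangle with switched columns and the ciphertext collected column by column.

```python
def key_order(letters: str) -> list[int]:
    """Number letters 1..n alphabetically; equal letters go left to right."""
    order = sorted(range(len(letters)), key=lambda i: (letters[i], i))
    ranks = [0] * len(letters)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def columnar_encrypt(plaintext: str, key_phrase: str, ncols: int = 15):
    """Switch the columns according to the key, then read them top to bottom."""
    key = "".join(c for c in key_phrase.upper() if c.isalpha())[:ncols]
    if len(key) < ncols:
        raise ValueError("the key phrase needs at least ncols letters")
    ranks = key_order(key)
    # source[p] is the original column that ends up in position p + 1.
    source = sorted(range(ncols), key=lambda j: ranks[j])
    letters = [c for c in plaintext.lower() if c.isalpha()]
    rows = [letters[i:i + ncols] for i in range(0, len(letters), ncols)]
    # The last row may be short, so missing cells are simply skipped.
    switched = ["".join(row[j] for j in source if j < len(row)) for row in rows]
    ciphertext = "".join(row[j] for j in source for row in rows if j < len(row))
    return switched, ciphertext


TEXT = ("Happy families are all alike; every unhappy family is unhappy in "
        "its own way. Everything was in confusion in the Oblonskys' house. "
        "The wife had discovered that the")
rows, cipher = columnar_encrypt(TEXT, "TRANSPOSITION CIPHERS")
print(rows[0])      # first row of the switched rectangle
print(cipher)
```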
This transposition cipher does not require a template, and the key is easy to memorize and replace, but
the method isn’t very secure. If the codebreaker can guess the number of columns (or try many guesses),
then the permutation, or parts of it, can be guessed from the positions of the members of common digrams.
Thus, if the codebreaker arranges the ciphertext in a rectangle of the right size, then columns can be moved
around until all or most occurrences of H follow T’s. If the result starts making sense, more digrams can be
tried.
As is true with many encryption methods, this cipher can be strengthened by performing double en-
cryption. After the ciphertext is obtained by reading the 15 columns and collecting letters, it can be written
in rows and reencrypted with the same key. An alternative is to reencrypt the ciphertext before it is read
from the 15 columns. This is done using another key on the rows. There are 9 rows in our example and they
can be swapped according to a secondary key.
Replacing a key, or even two keys, frequently is the key to security. This can easily be done once the authorized
users agree (or are told) how to do this. The users may agree on a book and on a formula. The formula
converts a date to a page number in the book, and the key consists of the last letters of the first 15 lines
on that page. Every day, all users apply the formula to the current date and construct a new key from the
same page. A possible formula may start with a 6-digit date of the form d1 d2 m1 m2 y1 y2 and compute
(50d1 + 51d2 + 52m1 + 53m2 + 54y1 + 55y2 ) mod k + 1
where k is the number of pages in the book. This formula (which is a weighted sum, see the similar formula
for ISBN below) depends on all six digits of the date in such a way that changing any digit almost always
changes the sum. The result of computing any integer modulo k is a number in the range [0, k − 1], so adding
1 to it produces a valid page number.
The international standard book number (ISBN) is a unique number assigned to most published books.
This number has four parts, a country code, a publisher code, a book number assigned by the publisher,
and a check digit, for a total of 10 digits. For example, ISBN 0-387-95045-1 has country code 0, publisher
code 387, book number 95045, and check digit 1. The check digit is computed by multiplying the leftmost
digit by 10, the next digit by 9, and so on, up to the ninth digit from the left, which is multiplied by 2. The
products are then added, and the check digit is determined as the smallest integer that when added to the
sum will make it a multiple of 11. The check digit is therefore in the range [0, 10]. If it happens to be 10, it
is replaced by the Roman numeral X in order to make it a single symbol.
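The check-digit rule just described can be written as a minimal Python sketch. The function name isbn10_check_digit is hypothetical, and the code handles only the 10-digit form described here.

```python
def isbn10_check_digit(first_nine: str) -> str:
    """Return the check digit for the first nine digits of a 10-digit ISBN."""
    digits = [int(c) for c in first_nine if c.isdigit()]   # hyphens are ignored
    if len(digits) != 9:
        raise ValueError("expected exactly nine digits")
    # Weights 10, 9, ..., 2 from left to right.
    total = sum(w * d for w, d in zip(range(10, 1, -1), digits))
    # Smallest non-negative integer that brings the sum to a multiple of 11.
    check = (-total) % 11
    return "X" if check == 10 else str(check)


print(isbn10_check_digit("0-387-95045"))
```

For the ISBN quoted above, the weighted sum is 241, and adding 1 gives 242 = 11 · 22, in agreement with the printed check digit.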
Exercise 3.17: Assuming a 190-page book, compute the page number for today.
3.19.1 The Myszkowsky Cipher
This method, due to E. Myszkowsky, is a variation on the general columnar transposition cipher. The key
is duplicated as many times as needed to bring it to the size of the plaintext. Each key letter is replaced by
a number according to its alphabetic position, and the plainletters are collected by these numbers to form
the ciphertext. The example shown here uses the key QUALITY, which is duplicated four times to cover the
entire 28-letter plaintext Come home immediately. All is lost.
 Q  U  A  L  I  T  Y  Q  U  A  L  I  T  Y  Q  U  A  L  I  T  Y  Q  U  A  L  I  T  Y
13 21  1  9  5 17 25 14 22  2 10  6 18 26 15 23  3 11  7 19 27 16 24  4 12  8 20 28
 C  O  M  E  H  O  M  E  I  M  M  E  D  I  A  T  E  L  Y  A  L  L  I  S  L  O  S  T
The ciphertext is MMESH EYOEM LLCEA LODAS OITIM ILT. Since the extended key contains duplicate letters,
the original key may also contain duplicates, such as in the word INSTINCT.
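A minimal Python sketch of this method, as described above, follows. The function name is hypothetical. Sorting the positions by the pair (key letter, position) yields exactly the numbering shown in the table.

```python
def myszkowsky_encrypt(plaintext: str, key: str) -> str:
    """Extend the key over the plaintext and collect letters by key order."""
    letters = [c for c in plaintext.upper() if c.isalpha()]
    key_letters = [c for c in key.upper() if c.isalpha()]
    # Repeat the key until it covers every plainletter.
    extended = [key_letters[i % len(key_letters)] for i in range(len(letters))]
    # Alphabetic order of the extended key; equal letters go left to right.
    order = sorted(range(len(letters)), key=lambda i: (extended[i], i))
    return "".join(letters[i] for i in order)


print(myszkowsky_encrypt("Come home immediately. All is lost.", "QUALITY"))
```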
3.19.2 The AMSCO Cipher
The AMSCO cipher, due to A. M. Scott, is a columnar transposition cipher where the plaintext is placed in
the columns such that every other column receives a pair of plainletters. The example presented here uses
the same key and plaintext as in Section 3.19.1.
Q   U   A   L   I   T   Y
4   6   1   3   2   5   7
C   OM  E   HO  M   EI  M
ME  D   IA  T   EL  Y   AL
L   IS  L   OS  T
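The layout above can be produced, and the columns read off in key order, by a sketch such as the following. The function name amsco_encrypt is hypothetical. The rule that cell sizes simply alternate 1, 2, 1, 2, ... in row-major order is an assumption; it reproduces the seven-column rectangle shown here, but other column counts may call for a different convention.

```python
def amsco_encrypt(plaintext: str, key: str) -> str:
    """Fill cells with 1 and 2 letters alternately, then read columns by key."""
    letters = "".join(c for c in plaintext.upper() if c.isalpha())
    ncols = len(key)
    columns = [[] for _ in range(ncols)]
    pos, cell = 0, 0
    while pos < len(letters):
        size = 1 if cell % 2 == 0 else 2       # assumed alternation 1, 2, 1, 2, ...
        columns[cell % ncols].append(letters[pos:pos + size])
        pos += size
        cell += 1
    # Columns are read in the alphabetic order of the key letters.
    order = sorted(range(ncols), key=lambda j: (key[j], j))
    return "".join("".join(columns[j]) for j in order)


print(amsco_encrypt("Come home immediately. All is lost.", "QUALITY"))
```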
1001101|1000001|1010011|1010100|1000101|1010010.
It is very easy to come up with simple permutations that will make a transposition cipher for such a string.
Some possible permutations are: (1) Swap every pair of consecutive bits. (2) Reverse the seven bits of each
code, then swap consecutive codes. (3) Perform a perfect shuffle as shown here
100110110000011010011|101010010001011010010
1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 0 0 1 1
1 0 1 0 1 0 0 1 0 0 0 1 0 1 1 0 1 0 0 1 0
110001101100101100000001001111001100001110
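The perfect shuffle is easy to express in code. The sketch below assumes that the bit string is the 7-bit ASCII encoding of MASTER (the same string appears as the plaintext of the stream-cipher example); the function names are hypothetical.

```python
def to_bits(text: str) -> str:
    """Concatenate the 7-bit ASCII codes of the characters."""
    return "".join(format(ord(c), "07b") for c in text)


def perfect_shuffle(bits: str) -> str:
    """Interleave the first half of the string with the second half."""
    half = len(bits) // 2
    return "".join(a + b for a, b in zip(bits[:half], bits[half:]))


def perfect_unshuffle(bits: str) -> str:
    """Undo the shuffle: even-indexed bits first, then odd-indexed bits."""
    return bits[0::2] + bits[1::2]


plain = to_bits("MASTER")
cipher = perfect_shuffle(plain)
print(cipher)
print(perfect_unshuffle(cipher) == plain)
```

The sketch assumes a string with an even number of bits, as is the case for the 42 bits used here.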
3.19.4 Stream Ciphers
The principle of a stream cipher is to create a string of bits, called the keystream, and to exclusive-or (XOR)
the bits of the plaintext with those of the keystream. The resulting bit string is the ciphertext. Decrypting
is done by XORing the ciphertext with the same keystream. Thus, a stream cipher is based on the following
property of the XOR operation (Table 2.6): If B = A ⊕ K, then A = B ⊕ K. As an example, the plaintext
MASTER is enciphered with the keystream KEY (= 1001011|1000101|1011001).
Key 100101110001011011001|100101110001011011001
Text 100110110000011010011101010010001011010010
XOR 000011000001000001010001111100000000001011
The resulting ciphertext can then easily be deciphered by performing an XOR between it and the binary
key.
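The XOR property behind stream ciphers can be demonstrated with a few lines of Python. This is only a sketch: the function names are hypothetical, the characters are encoded as 7-bit ASCII, and the short keystream KEY is simply repeated, as in the example above.

```python
from itertools import cycle


def to_bits(text: str) -> str:
    """Concatenate the 7-bit ASCII codes of the characters."""
    return "".join(format(ord(c), "07b") for c in text)


def xor_bits(bits: str, keystream: str) -> str:
    """XOR a bit string with a keystream that is repeated as needed."""
    return "".join(str(int(b) ^ int(k)) for b, k in zip(bits, cycle(keystream)))


plain = to_bits("MASTER")
key = to_bits("KEY")
cipher = xor_bits(plain, key)
print(cipher)
# Applying the same keystream again restores the plaintext bits.
print(xor_bits(cipher, key) == plain)
```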
The main advantage of stream ciphers is their high speed for both encryption and decryption. A
stream cipher is easily implemented in software, but special hardware can also be constructed to generate
the keystream and perform the encryption and decryption operations. There are two main types of stream
ciphers, synchronous and self-synchronizing. The former uses a keystream that’s independent of the plaintext
and ciphertext; this is the common type. The latter uses a keystream that depends on the plaintext and
may even depend on the ciphertext that has been generated so far.
The simplest (and also most secure) type of stream cipher is the one-time pad, sometimes called the
Vernam cipher. Given the plaintext (a string of n bits), the keystream is selected as a random string of
n bits that’s used just once. The resulting ciphertext is also a random bit string. If a codebreaker knows
(or suspects) that this method was used to encrypt a message, they can try to break it by generating every
random string of n bits and using it as a keystream. The original plaintext would eventually be produced
this way, but the cryptanalyst would have to read every plaintext generated; an impossible task given the
immense number of random strings of n bits, even for modest values of n. There is also the possibility
that many meaningful plaintexts would be generated, making it even harder to decide which one is the real
plaintext.
Even though it offers absolute security, the one-time pad is generally impractical, because the one-time
pads have to be generated and distributed safely to everyone in the organization. This method can be used
in only a limited number of situations, such as exchanging top-secret messages between a government and its
ambassador abroad.
In practice, stream ciphers are used with keystreams that are pseudo-random bit strings. Such a bit
string is generated by repeatedly using a recursive relation, so it is deterministic and therefore not random.
Still, if a sequence of n pseudo-random bits does not repeat itself, it can be used as the keystream for a
stream cipher with relative safety.
One way to generate pseudo-random bits is a shift register sequence, most often used in hardware stream
ciphers. Another is the mixed congruential pseudorandom number generator, usually used in software.
Figure 3.40a shows a simple shift register with 10 stages, each a flip-flop (such as an SR latch, Section 7.2.1).
Bits are shifted from left to right and the new bit entered on the left is the XOR of bits 6 and 10. Such a
shift register is said to correspond to the polynomial x^10 + x^6 + 1. The latches at stages 6 and 10 are said to
be tapped. Before encryption starts, the register is initialized to a certain, agreed-upon 10-bit number (its
initial state, the key). The register is then shifted repeatedly, and the bits output from the rightmost stage
are used to encode the plaintext.
[Figure: (a) a 10-stage shift register, stages numbered 1 to 10, with taps at stages 6 and 10; (b) a 13-stage shift register, stages numbered 1 to 13.]
The bit string output by the register depends, of course, on the initial state and on the corresponding
polynomial. If the initial state of the register is reached after s steps, there will be a cycle, and the same s
states will repeat over and over. Repetition, as is well known, is a major sin in cryptography and should be
avoided. A shift register should therefore be designed to go through the maximum number of states before
it cycles. In a 10-stage shift register, there can be 2^10 = 1024 different 10-bit states. Of those, the special
state of all zeros should be avoided because it degenerates the output to a stream of zeros. Thus, it is clear
that the best that can be achieved in a 10-stage shift register is a sequence of 2^10 − 1 = 1023 different states.
It can be shown that this can be achieved if the polynomial associated with the register is primitive.
[A polynomial P(x) of degree n is primitive if it satisfies (1) P(0) ≠ 0, (2) P(x) is of order 2^n − 1, and (3)
P(x) is irreducible. The order of a polynomial is the smallest e for which P(x) divides x^e + 1. For example,
the polynomial x^2 + x + 1 has degree 2. It has order e = 3 = 2^2 − 1 because, with coefficients computed
modulo 2, (x^2 + x + 1)(x + 1) = x^3 + 1. A
polynomial is irreducible if it cannot be written as the product of lower-degree polynomials. For example,
x^2 + x + 1 is irreducible, but x^2 − 1 is not because it equals the product (x + 1)(x − 1). We therefore conclude
that x^2 + x + 1 is primitive.]
Also, a shift register with the maximum number of states must have an even number of taps and the
rightmost stage (the oldest bit) must be a tap. The latter property is illustrated in Figure 3.40b. It is easy
to see that the output bit string generated by this 13-stage register is identical to that generated by the
similar 10-stage register, except that it is delayed.
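The behavior of such a register is easy to simulate. The following Python sketch models the register described above (stages numbered from the left, bits shifted to the right, the new leftmost bit being the XOR of the tapped stages) and counts how many steps it takes to return to its initial state. The function names and the seed are hypothetical choices made for this illustration.

```python
def lfsr_step(state: list[int], taps: tuple[int, ...]) -> list[int]:
    """Shift right by one; the new leftmost bit is the XOR of the tapped stages."""
    new_bit = 0
    for t in taps:
        new_bit ^= state[t - 1]         # stages are numbered 1..n from the left
    return [new_bit] + state[:-1]


def lfsr_keystream(state: list[int], taps: tuple[int, ...], count: int) -> list[int]:
    """Return `count` output bits taken from the rightmost stage."""
    bits = []
    for _ in range(count):
        bits.append(state[-1])
        state = lfsr_step(state, taps)
    return bits


def lfsr_period(state: list[int], taps: tuple[int, ...]):
    """Number of steps until the initial state recurs (None if it never does)."""
    start = list(state)
    current = lfsr_step(start, taps)
    steps = 1
    while current != start and steps < 2 ** len(start):
        current = lfsr_step(current, taps)
        steps += 1
    return steps if current == start else None


seed = [1] + [0] * 9                    # arbitrary nonzero initial state
print(lfsr_keystream(seed, (6, 10), 21))
print(lfsr_period(seed, (6, 10)))
print(lfsr_period(seed, (3, 10)))
```

With taps (3, 10), which correspond to the polynomial x^10 + x^3 + 1, the register passes through all 1023 nonzero states before it repeats. It is instructive to compare this with the period reported for the taps (6, 10).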
Selecting the right polynomial is important. We never use our initials, our address, or our phone number
as a password. Similarly, certain polynomials have been adopted for use in various international standards,
such as CRC. These polynomials should be avoided in a shift register. Some well-known examples (see also
page 93) are x^16 + x^12 + x^5 + 1, which is used by the CCITT; x^12 + x^3 + x + 1 and x^16 + x^15 + x^2 + 1, which generate
the common CRC-12 and CRC-16 cyclic redundancy codes; and x^10 + x^3 + 1 and x^10 + x^9 + x^8 + x^6 + x^3 + x^2 + 1,
which are used by the GPS satellites.
3.19.5 The Data Encryption Standard
During the 1950s and 1960s, relatively few people used computers and they had to write their own programs.
Thus, the adoption of a standard computer-based cipher had to wait until the 1970s. This standard, known
today as the data encryption standard or DES, was adopted by the US government in November 1976. It is
based, with few modifications, on a cipher system known as Lucifer that was developed in the early 1970s by
Horst Feistel for IBM. Lucifer is, in principle, a simple cipher employing both transposition and substitution.
The plaintext is divided into blocks of 64 bits each and each block is enciphered separately. The 64 bits of
a block are first perfectly shuffled and split into two 32-bit strings denoted by L0 and R0. String R0 is first
saved in T, then the bits of R0 are scrambled by a complex substitution cipher that depends on a key. Two
new strings L1 and R1 are constructed by R1 = L0 + R0 (where R0 is the scrambled string and + denotes
bitwise addition modulo 2, i.e., XOR) and L1 = T. This process is repeated 16 times.
Deciphering is the reverse of enciphering.
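The round structure just described can be sketched in Python. This is not the DES: the round function below is a hypothetical stand-in for the key-dependent scrambling, the 16 subkeys are arbitrary illustrative numbers, and the initial shuffle of the 64 bits is omitted. The sketch reads the + in R1 = L0 + R0 as a bitwise XOR, and it shows why deciphering works even though the scrambling itself is never inverted.

```python
MASK32 = 0xFFFFFFFF


def round_function(half: int, subkey: int) -> int:
    """Hypothetical key-dependent scrambling of a 32-bit half (NOT the DES one)."""
    return ((half * 0x9E3779B1) ^ subkey ^ (half >> 7)) & MASK32


def feistel_encrypt(block: int, subkeys: list[int]) -> int:
    left, right = block >> 32, block & MASK32
    for k in subkeys:
        saved = right                               # T = R0
        right = left ^ round_function(right, k)     # R1 = L0 XOR scrambled R0
        left = saved                                # L1 = T
    return (left << 32) | right


def feistel_decrypt(block: int, subkeys: list[int]) -> int:
    left, right = block >> 32, block & MASK32
    for k in reversed(subkeys):                     # same rounds, reverse order
        saved = left                                # L1 equals the old R0
        left = right ^ round_function(left, k)      # recover L0
        right = saved
    return (left << 32) | right


subkeys = [(0x0F1E2D3C * (i + 1)) & MASK32 for i in range(16)]   # illustrative only
block = 0x0123456789ABCDEF
cipher = feistel_encrypt(block, subkeys)
print(hex(cipher))
print(feistel_decrypt(cipher, subkeys) == block)
```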
This method is a perfect illustration of Kerckhoffs’ principle. The details of the algorithm are known, so
its security depends on the key. The longer the key, the more complex the cipher operations, and the harder
it is to break this code. When the Lucifer cipher was considered for adoption as a standard, the NSA argued
for limiting the size of the key to 56 bits, thereby limiting the total number of keys to 2^56 ≈ 7.2 · 10^16. The
NSA felt that a 56-bit key would guarantee reasonable security (because no organization had at that time
computers fast enough to break such a code in a reasonable amount of time), and at the same time would
enable the NSA to break any enciphered message because it had the biggest, fastest computers.
Many commercial entities implemented the DES on their computers and started using it as soon
as it became a standard. The DES has proved its worth as a secure code, but its users immediately ran
into the same problem that has haunted cryptographers for many years, that of key distribution. How can
a commercial organization, such as a stock broker, distribute keys to its many clients? The safest way to do
this is to meet each client in person. Another safe solution is to hand deliver the key to each client with
a trusted courier. The same problem faces a military organization, with the added complication that keys
can be captured by the enemy, requiring an immediate replacement. The result was that the reliability of
the DES, as of many ciphers before it, was being undermined by the problem of key distribution. Many
experts on cryptography agreed that this was a problem without a solution. Then, completely unexpectedly,
a revolutionary solution was discovered.
3.19.6 Diffie-Hellman-Merkle Keys
The following narrative illustrates the nature of the solution. Suppose that Alice wants to send Bob a secure
message. She places the message in a strong box, locks it with a padlock, and mails it to Bob. Bob receives
the box safely, but then realizes that he does not have the key to the padlock. This is a simplified version of
the key distribution problem, and it has a simple, unexpected solution. Bob simply adds another padlock to
the box and mails it back to Alice. Alice removes her padlock and mails the box to Bob, who removes his
lock, opens the box, and reads the message.
The cryptographic equivalent is similar. Alice encrypts a message to Bob with her private key (a key
unknown to Bob). Bob receives the encrypted message, encrypts it again, with his key, and sends the
doubly-encrypted message back to Alice. Alice now decrypts the message with her key, but the message is
still encrypted with Bob’s key. Alice sends the message again to Bob, who decrypts it with his key and can
read it.
The trouble with this simple approach is that most ciphers must obey the LIFO (last in first out) rule.
The last cipher used to encrypt a doubly-encrypted message must be the first one used to decipher it. This
is easy to see in the case of a monoalphabetic cipher. Suppose that Alice’s key replaces D with P and X
with L and Bob’s key replaces P with L. After encrypting a message twice, first with Alice’s key and then
with Bob’s key, any D in the message becomes an L. However, when Alice’s key is used to decipher the L, it
replaces it with X. When Bob’s key is used to decipher the X, it replaces it with something different from the
original D. The same happens with a polyalphabetic cipher.
However, there is a way out, a discovery made in 1976 by Whitfield Diffie, Martin Hellman and Ralph
Merkle. Their revolutionary Diffie-Hellman-Merkle key exchange method involves the concept of a one-way
function, a function that either does not have an inverse or whose inverse is not unique. Most functions
have simple inverses. The inverse of the exponential function y = e^x, for example, is the natural logarithm
x = log_e y. However, modular arithmetic provides an example of a one-way function. The value of the
modulo function f (x) = x mod p is the remainder of the integer division x ÷ p and is an integer in the range
[0, p − 1]. Table 3.41 illustrates the one-way nature of modular arithmetic by listing values of 3^x mod 7 for
10 values of x. It is easy to see, for example, that the number 3 is the value of 3^x mod 7 for x = 1 and x = 7.
The point is that the same number is the value of this function for infinitely many other values of x, effectively
making it impossible to reverse this simple function.
x            1   2   3   4    5    6     7     8      9     10
3^x          3   9  27  81  243  729  2187  6561  19683  59049
3^x mod 7    3   2   6   4    5    1     3     2      6      4
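The table can be regenerated, and the many-to-one behavior observed, with a few lines of Python. The variable names are hypothetical, and the search limit of 50 is arbitrary.

```python
# Values of 3^x mod 7 for x = 1..10, computed with Python's modular pow().
table = [pow(3, x, 7) for x in range(1, 11)]
print(table)

# Every x below 50 for which 3^x mod 7 equals 3.
preimages = [x for x in range(1, 50) if pow(3, x, 7) == 3]
print(preimages)
```

The second list shows that the value 3 recurs at regular intervals, so the result alone does not determine x.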
Based on this interesting property of modular arithmetic, the three researchers came up with an original
and unusual scheme that’s summarized in Figure 3.42. The process requires the modular function L^x mod P,
whose two parameters L and P should satisfy L < P . The two parties have to select values for L and P ,
but these values don’t have to be secret.
          Alice                                  Bob
Step 1    Selects a secret number a, say, 4      Selects a secret number b, say, 7
Step 2    Computes                               Computes
          α = 5^a mod 13 = 625 mod 13 = 1        β = 5^b mod 13 = 78125 mod 13 = 8
          and sends α to Bob                     and sends β to Alice
Step 3    Computes the key by                    Computes the key by
          β^a mod 13 = 4096 mod 13 = 1           α^b mod 13 = 1 mod 13 = 1
Notice that knowledge of α, β and the function is not enough to compute
the key. Either a or b is needed, but these are kept secret.
Figure 3.42: Three Steps to Compute the Same Key.
Careful study of Figure 3.42 shows that even if the messages exchanged between Alice and Bob are
intercepted and even if the values L = 5 and P = 13 that they use are known, the key still cannot be derived
since the values of either a or b are also needed but they are kept secret by the two parties.
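The three steps of the exchange can be replayed in Python with the same small numbers. The function name dh_exchange is hypothetical, and the tiny values of L and P serve only to mirror the figure; they offer no security.

```python
def dh_exchange(L: int, P: int, a: int, b: int) -> tuple[int, int, int, int]:
    """Replay the three steps for public L, P and secret numbers a, b."""
    alpha = pow(L, a, P)            # Alice computes alpha and sends it to Bob
    beta = pow(L, b, P)             # Bob computes beta and sends it to Alice
    key_alice = pow(beta, a, P)     # Alice combines beta with her secret a
    key_bob = pow(alpha, b, P)      # Bob combines alpha with his secret b
    return alpha, beta, key_alice, key_bob


alpha, beta, key_alice, key_bob = dh_exchange(5, 13, 4, 7)
print(alpha, beta, key_alice, key_bob)
```

Both parties arrive at the same key because β^a and α^b both equal L^(ab) mod P.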
This breakthrough has proved that cryptographic keys can be securely exchanged through unsecure
channels, and users no longer have to meet personally to agree on keys or trust couriers to deliver them.
However, the Diffie-Hellman-Merkle key exchange method described in Figure 3.41 is inefficient. In the ideal
case, where both users are online at the same time, they can go through the process of Figure 3.41 (select
the secret numbers a and b, compute and exchange the values of α and β, and calculate the key) in just
a few minutes. If they cannot be online at the same time (for example, if they live in very different time
zones), then the process of determining the key may take a day or more.
The following analogy may explain why a one-way function is needed to solve the key distribution
problem. Imagine that Bob and Alice want to agree on a certain paint color and keep it secret. Each starts
with a container that has one liter of paint of a certain color, say, R. Each adds one liter of paint of a
secret color. Bob may add a liter of paint of color G and Alice may add a liter of color B. Neither knows
what color was added by the other one. They then exchange the containers (which may be intercepted and
examined). When each gets the other’s container, each adds another liter of his or her secret paint. Thus,
each container ends up having one liter each of paints of colors R, G, and B. Each container contains paint
of the same color. Intercepting and examining the containers on their ways is fruitless, because one cannot
unmix paints. Mixing paints is a one-way operation.
It was clear to Diffie that a cipher based on an asymmetric key would be the ideal solution to the
troublesome problem of key distribution and would completely revolutionize cryptography. Unfortunately,
he was unable to actually come up with such a cipher. The first (and so far, the only) simple, practical,
and secure public key cipher, known today as RSA cryptography, was finally developed in 1977 by Ronald
Rivest, Adi Shamir, and Leonard Adleman. RSA was a triumphal achievement, an achievement based on
the properties of prime numbers.
A prime number, as most of us know, is a number with no divisors. More accurately, it is an integer N
greater than 1 whose only positive divisors are 1 and itself. (Nonprime integers greater than 1 are called composites.) For generations,
prime numbers and their properties (the field of number theory) were of interest to mathematicians only,
and had no practical applications whatsoever. RSA cryptography found an interesting, original, and very
practical application for these numbers. This application relies on the most important property of prime
numbers, the property that justifies the name prime. Any positive integer can be represented as the product
of prime numbers (its prime factors) in one way only. In other words, any integer has a unique prime
factorization. For example, the number 65,535 equals the product of the four primes 3, 5, 17, and 257.
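
For small numbers the prime factors are easy to find. The sketch below uses plain trial division (our choice for illustration; it is far too slow for the large integers used in cryptography) and confirms the factorization of 65,535.

```python
def prime_factors(n):
    """Return the prime factors of n (with repetition) by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:      # divide out each prime factor completely
            factors.append(d)
            n //= d
        d += 1
    if n > 1:                  # whatever remains is itself a prime
        factors.append(n)
    return factors

print(prime_factors(65535))
```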
The main idea behind RSA is to choose two large primes p and q that together constitute the private
key. The public key N is their product N = p × q (which, of course, is composite). The important (and
surprising) point is that multiplying large integers is a one-way function! It is relatively easy to multiply
integers, even very large ones, but it is practically impossible, or at least extremely time consuming, to find
the prime factors of a large integer with hundreds of digits. Today, after millennia of research (primes were
known to the ancients), no efficient method for factoring numbers has been discovered. All existing
factoring algorithms are slow and may take years to factor an integer consisting of a few hundred decimal
digits. The factoring challenge (with prizes) offered by RSA laboratories [RSA 01] testifies to the accuracy
of this statement.
To summarize, we know that the public key N has a unique prime factorization and that its prime
factors are the private key. However, if N is large enough, we will not be able to factor it, even with the
fastest computers, which makes RSA a secure cipher. At the same time, no one has proved that a fast
factorization method does not exist. It is not inconceivable that someone will develop such an algorithm,
which would render RSA (impregnable for more than two decades) useless and would stimulate researchers
to discover a different public key cipher.
(Recent declassifying of secret British documents suggests that a cipher very similar to RSA was developed
by James Ellis and his colleagues starting in 1969. They worked for the British Government
Communications Headquarters, GCHQ, and so had to keep their work secret. See [Singh 99].)
And now, to the details of RSA. They are deceptively simple, but the use of large numbers requires
special arithmetic routines to be implemented and carefully debugged. We assume that Alice has selected
two large primes p and q as her private key. She has to compute and publish two more numbers as her
public key. They are N = p · q and e. The latter can be any integer, but it should be relatively prime to
(p − 1) · (q − 1). Notice that N must be unique (if Joe has selected the same N as his public key, then he
knows the values of p and q), but e does not have to be. To encrypt a message M (an integer) intended
for Alice, Bob gets her public key (N and e), computes $C = M^e \bmod N$, and sends C to Alice through an
open communications channel. To decrypt the message, Alice starts by computing the decryption key d from
$e \times d \equiv 1 \pmod{(p-1)(q-1)}$, then uses d to compute $M = C^d \bmod N$.
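
These steps can be tried with small numbers. The following sketch uses the toy values p = 61, q = 53, and e = 17, which are our own choice for illustration and are far too small to be secure; the function names are also ours. The modular inverse is obtained with `pow(e, -1, phi)`, which requires Python 3.8 or later.

```python
from math import gcd

def make_keys(p, q, e):
    """Return the public key (N, e) and the decryption key d."""
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be relatively prime to (p-1)(q-1)")
    d = pow(e, -1, phi)          # solves e*d = 1 mod (p-1)(q-1)
    return (p * q, e), d

def encrypt(m, public_key):
    """C = M**e mod N."""
    n, e = public_key
    return pow(m, e, n)

def decrypt(c, d, n):
    """M = C**d mod N."""
    return pow(c, d, n)

public, d = make_keys(61, 53, 17)
c = encrypt(65, public)
print(public, d, c, decrypt(c, d, public[0]))
```

The message M must be smaller than N, which is one reason long messages are broken up into segments.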
The security of the encrypted message depends on the one-way nature of the modulo function. Since
the encrypted message C is $M^e \bmod N$, and since both N and e are public, the message could be decrypted
by inverting the modulo function. However, as mentioned earlier, this function is impossible to invert for
large values of N. It is important to understand that polyalphabetic ciphers can be as secure as RSA, are
easier to implement, and are faster to execute, but they are symmetric and therefore suffer from the problem of
key distribution.
The use of large numbers requires special routines for the arithmetic operations. Specifically, the operation
$M^e$ may be problematic, since M may be a large number. One way to simplify this operation is to
break the message M up into small segments. Another option is to break up e into a sum of terms and use
each term separately. For example, if e = 7 = 1 + 2 + 4, then
$M^e \bmod N = M^{1+2+4} \bmod N = \left[(M \bmod N)(M^2 \bmod N)(M^4 \bmod N)\right] \bmod N.$
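
The idea of splitting the exponent into a sum of powers of two can be sketched as follows. The function `mod_pow` is a hypothetical name of ours; it repeatedly squares M modulo N and multiplies into the result only those squares whose power of two appears in e, so no intermediate value ever exceeds $N^2$.

```python
def mod_pow(m, e, n):
    """Compute m**e mod n from the binary expansion of e."""
    result = 1
    square = m % n               # holds m**(2**k) mod n for k = 0, 1, 2, ...
    while e > 0:
        if e & 1:                # this power of two is one of the terms of e
            result = (result * square) % n
        square = (square * square) % n
        e >>= 1
    return result

# e = 7 = 1 + 2 + 4 uses the squares for k = 0, 1, and 2.
print(mod_pow(65, 7, 3233), pow(65, 7, 3233))
```

Python's built-in `pow(m, e, n)` gives the same result and is used here only as a check.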
3.20 Steganography
The word steganography is derived from the Greek στεγανός γράφειν, meaning “covered writing.” The aim
of steganography is to conceal data. This applies to data being stored or to a message being transmitted.
We start with a few elementary and tentative methods most of which are used manually, although some can
benefit from a computer. These are mostly of historical interest.
1. Write a message on a wooden tablet, cover it with a coat of wax, and write innocuous text on the
wax. This method is related by Greek historians.
2. Select a messenger, shave his head, tattoo the message on his head, wait for the hair to grow, and
send him to his destination, where his head is shaved again.
3. Use invisible ink, made of milk, vinegar, fruit juice, or even urine, and hide the message between the
lines of a seemingly-innocent letter. When the paper is heated up, the text of the hidden message slowly
appears.
4. The letters constituting the data are concealed in the second letter of every word of a specially-
constructed cover text, or in the third letter of the first word of each sentence. An example is the data
“coverblown” that can be hidden in the specially-contrived cover text “Accepted you over Neil Brown.
About ill Bob Ewing, encountered difficulties.” A built-in computer dictionary can help in selecting the
words, but such specially-constructed text often looks contrived and may raise suspicion.
5. An ancient method to hịde data uses a lạrge piece of text where sṃall dots are placed under the letters
tḥat are to be hịdḍen. For example, this paragraph has ḍots placẹd under certaiṇ letters that together spell
the message “i am hidden.” A variation of this method slightly perturbs certain letters from their original
positions to indicate the hidden data.
6. A series of lists of alternative phrases from which a paragraph can be built can be used, where
each choice of a phrase from a list conceals one letter of a message. This method was published in 1518 in
Polygraphiae by Johannes Trithemius, and was still used during World War II, as noted in [Marks 99].
7. Check digits. This idea is used mostly to verify the validity of important data, such as bank accounts
and credit card numbers, but can also be considered a method to hide validation information in a number.
A simple example is the check digit used in the well-known international standard book number (ISBN),
assigned to every book published. This number has four parts, a country code, a publisher code, a book
number assigned by the publisher, and a check digit, for a total of 10 digits. For example, ISBN 0-387-
95045-1 has country code 0, publisher code 387, book number 95045, and check digit 1. The check digit is
computed by multiplying the leftmost digit by 10, the next digit by 9, and so on, up to the ninth digit from
the left, which is multiplied by 2. The products are then added, and the check digit is determined as the
smallest nonnegative integer that when added to the sum will make it a multiple of 11. The check digit is therefore in
the range [0, 10]. If it happens to be 10, it is replaced by the Roman numeral X in order to make it a single
symbol.
8. It is possible to assign 3-digit codes to the letters of the alphabet as shown in Table 3.42. Once this
is done, each letter can be converted into three words of cover text according to its 3-digit code. A code
digit of 1 is converted into a word with 1, 4, or 7 syllables, a code digit of 2 is converted into a word with 2
or 5 syllables, and a code digit of 3 is converted into a word with 3 or 6 syllables. This way, there is a large
selection of words for each word in the cover text.
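
The ISBN check digit of method 7 is simple to compute. The sketch below (the function name is ours) applies the weights 10 down to 2 to the first nine digits and returns the check symbol; hyphens in the input are ignored.

```python
def isbn10_check_digit(first_nine):
    """Return the check symbol ('0'-'9' or 'X') for nine ISBN digits."""
    digits = [int(ch) for ch in first_nine if ch.isdigit()]
    if len(digits) != 9:
        raise ValueError("expected exactly nine digits")
    # Leftmost digit times 10, next times 9, ..., ninth digit times 2.
    total = sum(w * d for w, d in zip(range(10, 1, -1), digits))
    check = (11 - total % 11) % 11   # smallest value that completes a multiple of 11
    return "X" if check == 10 else str(check)

print(isbn10_check_digit("0-387-95045"))
```

For the ISBN quoted in the text the weighted sum is 241, and adding 1 gives 242 = 22 × 11.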
Modern steganography methods are more sophisticated and are based on the use of computers and on
the binary nature of computer data. We start with a few simple methods, and follow with two important
and commonly-used approaches to steganography, namely least-significant bit modification (Section 3.20.1),
and BPCS steganography (Section 3.20.2).
A modern personal computer may have tens of thousands of files on one hard disk. Many files are
part of the operating system, and are unfamiliar to the owner or even to an expert user. Each time a large
application is installed on the computer, it may install several system files, such as libraries, extensions, and
preference files. A data file may therefore be hidden by making it look like a library or a system extension.
Its name may be changed to something like MsProLibMod.DLL and its icon modified. When placed in a folder
with hundreds of similar-looking files, it may be hard to identify as something special.
Camouflage is the name of a steganography method that hides a data file D in a cover file A by
scrambling D, then appending it to A. The original file A can be of any type. The camouflaged A looks
and behaves like a normal file, and can be stored or emailed without attracting attention. Camouflage is not
very safe, since the large size of A may raise suspicion.
When files are written on a disk (floppy, zip, or other types), the operating system modifies the disk
directory, which also includes information about the free space on the disk. Special software can write a file
on the disk, then reset the directory to its previous state. The file is now hidden in space that’s declared
free, and only special software can read it. This method is risky, because any data written on the disk may
destroy the hidden file.
3.20.1 LSB Image Steganography
Data can be hidden in a color image (referred to as a cover image), since slight changes to the original colors
of some pixels are often invisible to the eye (after all, changing the original pixels is the basis of all
lossy image compression methods, Section 3.15).
Given an image A with 3 bytes per pixel, each of the three color components, typically red, green, and
blue, is specified by one byte, so it can have one of 256 shades. The data to be hidden—which can be text,
another image, video, or audio—is represented in the computer as a file, or a stream of bits. Each of those
bits is hidden by storing it as the least-significant bit of the next byte of image A, replacing the original bit.
The least-significant bit of the next byte of A is, of course, a 0 or a 1. The next bit to be hidden is also a 0 or
a 1. This means that, on average, only half the least-significant bits are modified by the bits being hidden.
Each pixel is represented by three bytes, so changing half the bytes means changing 1 or 2 least-significant
bits per pixel, on average.
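
The basic method can be sketched in Python as follows. The cover image is represented here simply as a sequence of bytes (three per pixel); reading and writing an actual image file is left out, and all the names are ours.

```python
def to_bits(data):
    """Yield the bits of a bytes object, most-significant bit first."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1

def embed(cover, secret):
    """Return a copy of cover with the bits of secret in its low-order bits."""
    bits = list(to_bits(secret))
    if len(bits) > len(cover):
        raise ValueError("cover is too small for the secret data")
    stego = bytearray(cover)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | bit    # replace the least-significant bit
    return bytes(stego)

def extract(stego, n_bytes):
    """Recover n_bytes of hidden data from the low-order bits of stego."""
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for byte in stego[8 * i: 8 * i + 8]:
            value = (value << 1) | (byte & 1)
        out.append(value)
    return bytes(out)

cover = bytes(range(100, 148))    # stand-in for 16 pixels of 3 bytes each
stego = embed(cover, b"hi")
print(extract(stego, 2))
print(sum(c != s for c, s in zip(cover, stego)), "bytes changed")
```

Each cover byte changes by at most 1, and a byte changes only when its original least-significant bit differs from the bit being hidden.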
Small changes in the color of a few isolated pixels may be noticeable if the cover image A has large
uniform areas, which is why A has to be carefully chosen. It should contain a large number of small details
with many different colors. Such an image is termed busy.
If image A is to be compressed later, the compression should, of course, be lossless. If the owner forgets
this, or if someone else compresses image A with a lossy compression algorithm, some of the hidden data
would be lost during the compression.
The hiding capacity of this method is low. If one data bit is hidden in a byte, then the amount of data
hidden is 1/8 = 12.5% of the cover image size. As an example, only 128K bytes can be hidden in a 1Mbyte
cover image.
An extension of this method hides the bits of the data not in consecutive bytes of the image but only
in certain bytes that are selected by a key. If the key is, say, 1011, then data bits are hidden in the least-
significant bits of bytes 1, 3, and 4 but not of byte 2, then in bytes 5, 7, and 8, but not in byte 6, and so on.
A variation of this method hides one bit in the least-significant position of a byte of A and hides the next
bit in the second least-significant position of the next byte of A.
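
The selection of bytes by a key can be sketched as follows, under the assumption that the key bits are simply repeated along the image bytes (the function name is ours).

```python
from itertools import cycle

def selected_positions(key, n_bytes):
    """Return the 1-based numbers of the bytes chosen by repeating the key."""
    return [i for i, k in zip(range(1, n_bytes + 1), cycle(key)) if k == "1"]

print(selected_positions("1011", 8))
```

With a key of four bits, three of which are 1, only three quarters of the image bytes carry data, so the hiding capacity drops accordingly.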
A more sophisticated technique is to scan the image byte by byte in a complex way, not row by row,
and hide one bit of the data in each byte visited. For example, the image bytes may be visited in the order
defined by a space-filling curve, such as the Peano curve or the Hilbert curve. Such curves visit each point
of a square area exactly once.
A marked improvement is achieved by considering the sensitivity of the eye to various colors. Of the
three colors red, green, and blue, the eye is most sensitive to green and least sensitive to blue. Thus, hiding
bits in the blue part of a pixel results in image modification that’s less noticeable to the eye. The price for
this improvement is reduced hiding capacity from 12.5% to 4.17%, since one bit is hidden in each group of
3 image bytes.
Better security is obtained if the data is encrypted before it is hidden. An even better approach is to
first compress the data, then encrypt it, and finally hide it. This increases the hiding capacity considerably.
An audio file can also serve as a cover. Such a file consists of audio samples, each typically a 16-bit
number, so one bit of data can be hidden in the least-significant bit of each sample. The difference is that an
image is two-dimensional, and so can be scanned in complex ways, whereas an audio file is one-dimensional.
3.20.2 BPCS Steganography
BPCS (bit-plane complexity segmentation) steganography is the brainchild of Eiji Kawaguchi. Developed
in 1997, this method hides data in a cover image (color or grayscale) and its main feature is large hiding
capacity [BPCS 01]. The size of the hidden data is typically 50–70% the size of the cover image, and hiding
the data does not increase the size of the cover image.
Figure 3.43 shows a grayscale image of parrots, together with five of its bitplanes (bitplane 8, the
most-significant one, bitplane 1, the least-significant one, and bitplanes 3, 5, and 7). It is clear that the
most-significant bitplane is somewhat similar to the full image, the least-significant bitplane is random (or
at least seems random) and, in general, as we move from the most-significant bitplanes to least-significant
ones, they become more random. However, each bitplane has some parts that look random. (BPCS converts
the pixels of the cover image from binary to Gray codes, but this feature will not be discussed here, see
Section 2.24.)
The principle of BPCS is to separate the image into individual bitplanes, check every 8×8 bit square
region in each bitplane for randomness, and replace each random region with 64 bits of hidden data. Regions
that are not random (shape-informative regions in BPCS terminology) are not modified. A special complexity
measure α is used to determine whether a region is random. A color image where each pixel is represented
by 24 bits has 24 bitplanes (8 bitplanes per color). A grayscale image with 8 bits/pixel has eight bitplanes.
The complexity measure α of an 8×8 block of bits is defined as the number of adjacent bits that are different.
This measure (see note below) is normalized by dividing it by the maximum number of adjacent bits that
can be different, so α is in the range [0, 1].
An important feature of BPCS is that there is no need to identify those regions of the cover image that
have been replaced with hidden data, because the hidden data itself is transformed, before it is hidden, to a
random representation. The BPCS decoder identifies random regions by performing the same tests as the
encoder, and it extracts the 64 data bits of each random region. The hidden data is made to look random
by first compressing it. Recall (from Section 3.10) that compressed data has little or no redundancy, and
therefore looks random. If a block of 64 bits (after compression) does not pass the BPCS randomness test,
it goes through a conjugation operation that increases its randomness. The conjugation operation computes
the exclusive-OR of the 64-bit block (which is regrouped as an 8×8 square) with an 8×8 block that has a
checkerboard pattern and whose upper-left corner is white. This transforms a simple pattern to a complex
one and changes the complexity measure of the block from α to 1 − α. The encoder has to keep a list of
all the blocks that have gone through the conjugation operation, and this list is also hidden in the cover
image. (The list is not long. In a 1K×1K×24-bit image, each of the 24 bitplanes has $2^{14}$ = 16K regions, and
so contributes 16K bits (2K bytes) to the conjugation list, for a total of 24×2K = 49,152 bytes, or 1.56%
the size of the image. The algorithm does not specify where to hide the conjugation list and any practical
implementation may have a parameter specifying one of several places to hide it.)
Note (for those who want the entire story). Figure 3.44 shows how the BPCS image complexity measure
α is defined. Part (a) of the figure shows a 4×4 image with the 16 pixels (1 bit each) numbered. The complexity
of the image depends on how many adjacent bits differ. The first step is to compare each bit in the top row
(row 1) to its neighbor below and count how many pairs differ. Then the four pairs of bits of rows 2 and 3
are compared, followed by the pairs in rows 3 and 4. The maximum number of different pairs is 4×3. The
next step is to compare adjacent bits horizontally, starting with the four pairs in columns 1 and 2. This can
also yield a maximum of 4×3 different bit pairs. The image complexity measure α is defined as the actual
number of bit pairs that differ, divided by the maximum number, which for a 4×4 block is 2·4(4 − 1) = 24
and for an 8×8 block is 2·8(8 − 1) = 112.
Figure 3.44b shows a checkerboard pattern, where every pair of adjacent bits differ. The value of α for
such a block is 1. Figure 3.44c shows a slightly different block, where the value of α is
$$\alpha = \frac{(3 + 4 + 4) + (3 + 4 + 4)}{2 \cdot 4 \cdot 3} = \frac{22}{24} \approx 0.917.$$
BPCS considers a block random if it satisfies α ≥ 0.5 − 4σ, where σ = 0.047 is a constant. The value of
σ was determined by computing α values for many 8×8 blocks and plotting them. The distribution of the
α values was found to be Gaussian with mean 0.5 and standard deviation 0.047.
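
The complexity measure and the conjugation operation can be sketched in a few lines of Python. A block is represented as a list of rows of bits; we assume that a white square of the checkerboard corresponds to a 0 bit, and all the function names are ours.

```python
def complexity(block):
    """Return alpha: differing adjacent bit pairs divided by the maximum."""
    n = len(block)
    changes = 0
    for r in range(n):
        for c in range(n):
            if r + 1 < n and block[r][c] != block[r + 1][c]:   # vertical pair
                changes += 1
            if c + 1 < n and block[r][c] != block[r][c + 1]:   # horizontal pair
                changes += 1
    return changes / (2 * n * (n - 1))

def checkerboard(n):
    """n x n checkerboard whose upper-left bit is 0 (assumed to mean white)."""
    return [[(r + c) % 2 for c in range(n)] for r in range(n)]

def conjugate(block):
    """XOR the block with the checkerboard, which turns alpha into 1 - alpha."""
    n = len(block)
    board = checkerboard(n)
    return [[block[r][c] ^ board[r][c] for c in range(n)] for r in range(n)]

def is_random(block, sigma=0.047):
    """Threshold test quoted in the text: alpha >= 0.5 - 4*sigma."""
    return complexity(block) >= 0.5 - 4 * sigma

flat = [[0] * 8 for _ in range(8)]      # a uniform block, clearly not random
print(complexity(flat), complexity(checkerboard(8)))
print(is_random(flat), is_random(conjugate(flat)))
```

Conjugation works because every pair of adjacent checkerboard bits differs, so the exclusive-OR turns each equal pair into a differing pair and vice versa.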
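[Figure 3.43: a grayscale image of parrots and five of its bitplanes, labeled (1), (3), (5), (7), and (8).]
[Figure 3.44: (a) a 4×4 image with its 16 pixels numbered 0 through 15, row by row; (b) a checkerboard pattern; (c) a slightly different block.]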
3.20.3 Watermarking
The term watermarking refers to any technique used to embed ownership information in important digital
data. The watermark can be used to identify digital data that has been illegally copied, stolen, or altered
in any way. A typical example is a map. It takes time, money, and effort to create a road map of a city.
Once this map is digitized, it becomes easy for someone to copy it, make slight modifications, and sell it as
an original product. A watermark hidden in the map can help the original developer of the map identify
any attempts to steal the work. The watermark itself should not increase the size of the original data by
much, and must also be robust, so it does not get destroyed by operations such as filtering, compressing,
and cropping.
Watermarking can be done by steganography. The watermark may be a string with, for example, the
name of the owner, repeated as many times as needed. This string is hidden in the original data (image or
audio) and can be extracted by the original owner by means of a secret key.
There is, however, a fundamental difference between steganography and watermarking. In the former,
the hidden data is important, while the cover image is not. The embedding capacity of the algorithm is
important, but the hidden data may be fragile (it may be destroyed by transforming or compressing the
cover image). In watermarking, the cover image is valuable, while the hidden data is not (it can be any
identifying data). The embedding capacity is unimportant, but the hidden data has to be robust. As a
result, watermarking should use special steganographic techniques.
The related concept of fingerprinting should also be mentioned. Fingerprinting refers to embedding a
secret serial number in each copy of some important digital data. A commercially-sold computer program,
for example, is easy to copy illegally. By embedding a secret serial number in each copy of the program sold,
the manufacturer can identify each buyer with a serial number, and so identify pirated copies.
[Figure: a transmitter (computer or disk) and a receiver (computer or disk).]
Figure 3.46: A typical long-distance configuration
The configuration of Figure 3.46 has one disadvantage, namely the distance between the transmitter
and the receiver cannot be too large. The reason for this is that computers and I/O devices are designed
for short distance communications. They generate low voltage signals that cannot penetrate through long
wires. A typical signal generated by a computer can travel, on a wire, a distance of a few hundred yards.
Beyond that, the signal gets too weak to be received reliably. This is why the configuration of Figure 3.46
is used only when the distances involved are short, such as in a local area network (LAN, Section 3.26.1).
For long distances, public lines, such as telephone lines, have to be used. In such a case, the digital
low-voltage signals generated by the computer have to be converted to signals that can travel a long distance
over the phone line. At the receiving end, the opposite conversion must take place.
3.22.1 Modulation
This naturally leads to the question of how the telephone system works. How does the telephone send voice
signals over long distances on phone lines? The telephone cannot produce high-voltage signals (because
they are dangerous), and a weak signal fed into a transmission line is completely absorbed by the internal
resistance of the line after traveling a short distance.
The solution used by the phone system is to generate a signal that varies regularly with time, a wave
(Figure 3.48a), and to feed that wave into the transmission line. This solution uses the fact that the resistance
of a line to electrical signals is not constant but depends on the frequency of the signal. This phenomenon is
known as impedance and it is an important feature of modern telecommunications. The telephone generates
a wave whose amplitude and frequency change according to the voice pattern. The frequency, however, must
always be in the limited range where the impedance of the line is lowest. Even so, signals on phone lines
must be amplified every 20 miles or so. The wires commonly used for voice-grade telephone lines feature low
impedance for frequencies in the range 2–3 KHz, which is why many modem standards specify a frequency
of 2400 Hz or 300 Hz.
The wave sent on the telephone line has to be varied (modulated) to reflect the bits being sent. Two
different wave shapes are needed, one for a binary 0 and the other for a binary 1. This can be done in
a number of ways, the most important of which are outlined below. Figure 3.47 shows a transmitter-
receiver configuration with special hardware to implement modulation. Note that on the transmitter side
we need a modulator, and on the receiver side we need a demodulator. However, during transmission, the
role of transmitter and receiver may be reversed several times, so each side needs both a modulator and
a demodulator. Such devices are therefore made so they can perform either function and they are called
modems (for MOdulator, DEModulator).
Since the distance between the serial interface and the modem is short, they are connected with several
lines, but only one is used for the data; the rest are used for control signals.
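[Figure: transmitter, serial interface, and modem on one side; modem, serial interface, and receiver on the other. The lines are labeled 8 between each computer and its serial interface and 1 for the data elsewhere, with several lines for commands between each serial interface and its modem.]
Figure 3.48: Serial I/O with serial interfaces and modems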
[Figure: panels (a) through (d); the bit sequence shown is 0 1 0 1 1 0.]
Figure 3.49: Three basic modulation techniques
These three basic modulation techniques make it possible to transmit one bit per time unit. In such cases,
the term “baud” indicates both the number of time units per second (the signaling rate) and the number of
bits transmitted per second. Recent advances in telecommunications, however, allow the use of sophisticated
modulation techniques where more than one bit can be transmitted in a time unit. Such techniques involve
phase modulations at various angles, not just 180°, and may also employ different amplitudes. Figure 3.49a
shows a sine wave and the results of shifting it through various angles. This diagram makes it easy to
understand a modulation technique where in each time unit the phase of the wave can be shifted by one
of the four angles 0°, 90°, 180°, and 270°. This is equivalent to transmitting, in each time unit, one of
four symbols. Each of the four symbols can be assigned a 2-bit code, with the result that this modulation
technique transmits two bits in each time unit. It is called phase amplitude modulation or PAM. The term
“baud” now refers to the signaling rate, and the number of bits per second (bps or the bitrate) is now twice
the baud.
A simple extension of PAM uses two distinct amplitudes. The wave transmitted in each time unit can be
in one of four phases and one of two amplitudes. This is equivalent to sending one of eight symbols, or three
bits, in each time unit. The resulting bitrate is three times the baud. Such a method is called quadrature
amplitude modulation, or QAM. The constellation diagram of Figure 3.49b is a graphical representation of
the 4-phase 2-amplitude QAM modulation technique. The four inner points represent the four phases with
small amplitude, and the four outer points correspond to the large amplitude. Table 3.50 lists the three bits
assigned to each of the eight symbols, and Figure 3.49c is an example of the modulated wave required to
transmit the bit sequence 010|111|000|100 (notice the discontinuities in the wave). The symbols sent in the
four time units are 1: 90°, 2: 270°, 1: 0°, and 1: 180°.
Amplitude   Phase   Bits
    2        270°    111
Table 3.51: Eight state QAM.
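
The conversion of a bit sequence into QAM symbols can be sketched as follows. The assignment of bits to symbols used here is an assumption of ours: the first two bits of each group select the phase (in steps of 90°) and the third bit selects the amplitude. It agrees with the four symbols of the example above, but it should not be taken as the complete table.

```python
def qam8_symbols(bits):
    """Split a bit string into 3-bit groups and map each to (amplitude, phase)."""
    if len(bits) % 3 != 0:
        raise ValueError("bit string length must be a multiple of 3")
    symbols = []
    for i in range(0, len(bits), 3):
        group = bits[i:i + 3]
        phase = 90 * int(group[:2], 2)     # assumed: first two bits pick the phase
        amplitude = 1 + int(group[2])      # assumed: last bit picks the amplitude
        symbols.append((amplitude, phase))
    return symbols

print(qam8_symbols("010111000100"))
```

Twelve bits are sent in four time units, so the bitrate is three times the signaling rate.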
Exercise 3.18: QAM extends the three basic modulation techniques by using multiple amplitudes and
phases but not multiple frequencies. What is the reason for that?
The QAM technique can be extended to more than four phases and two amplitudes. Figure 3.51 shows
(a) a 16-QAM scheme with four phases and four amplitudes, (b) an 8-QAM with eight phases and a single
amplitude (this is used in 4800 bps modems), (c) a 16-QAM with 12 phases and three amplitudes (used
in 9600 bps modems), and (d) a 32-QAM consisting of 20 phases and four amplitudes (this modulation is
used by V.32 modems). However, as the constellation diagram gets denser, it becomes easier for even minor
disturbances along the way to corrupt the modulated wave and result in incorrect demodulation. This is
why the smallest phase shift currently used in QAM is 22.5◦ and why QAM modulations don’t exceed nine
bits per baud (512-QAM). Two solutions to the reliability problem are (1) improved modem hardware and
line quality (resulting in higher signal to noise ratio) and (2) the use of error-correcting codes. A further
speed increase is achieved by compressing the data before it is transmitted.
Typical examples of amplitudes are 2, 4, and 6 volts. A typical time unit is 0.833 ms. This translates to
a signaling rate of 1200 time units per second. At two bits per time unit, this results in a bitrate of 2400 bps,
while at four bits per unit, the bitrate is 4800 bps. An example of a higher speed is a signaling rate of 3200
baud combined with nine bits per baud (achieved by 512-QAM). This results in a bitrate of 28800 bps.
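
The relation between the signaling rate and the bitrate is a one-line computation. In the sketch below (the function names are ours) the number of bits per time unit is derived from the number of symbols in the constellation, which is assumed to be a power of 2.

```python
def bits_per_symbol(n_symbols):
    """Bits carried by one symbol of an n-symbol constellation (n a power of 2)."""
    if n_symbols < 2 or n_symbols & (n_symbols - 1):
        raise ValueError("number of symbols must be a power of 2")
    return n_symbols.bit_length() - 1

def bitrate(baud, n_symbols):
    """Bits per second = signaling rate times bits per time unit."""
    return baud * bits_per_symbol(n_symbols)

# 1200 baud with 4 and 16 symbols, and 3200 baud with 512-QAM.
print(bitrate(1200, 4), bitrate(1200, 16), bitrate(3200, 512))
```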
Currently, the fastest telephone modems work at 56Kbps. Such a modem uses a D/A circuit to convert
8-bit digital data from the transmitting computer into a wave with a signaling rate of 8 KHz. A 256-QAM
modulation is used to pack eight bits into each time unit, resulting in a bitrate of 8000×8 = 64000 bps. The
receiving modem, however, has to use an A/D circuit to demodulate the wave, and such circuits are far from
perfect. A typical A/D can resolve the incoming modulated wave to only seven bits per time unit, resulting
in an effective bitrate of 8000×7 = 56000 bps.
[Figure: (a) a sine wave shifted through angles such as 90° (π/2), 180° (π), 270° (3π/2), and 45° (π/4); (b) a constellation diagram; (c) a modulated wave with the symbols 1-90°, 2-270°, 1-0°, and 1-180°.]
Figure 3.50: Quadrature Amplitude Modulation
[Figure: constellation diagrams with phases at multiples of 45° and 4-bit symbol codes.]
[Figure: personal computer, home modem (D/A and A/D), home wiring, phone company modem, 64Kbps digital connection, ISP.]
Figure 3.53: Analog and digital data paths for 56K communications
Figure 3.52 shows a typical data path between a home computer and an ISP’s computer, using the
(relatively new) digital wires. Uploading (from home to ISP) involves sending the data from the home
computer to the D/A circuit in the home modem. From there, the data travels through wires in the home to the A/D in
the telephone company modem (normally located on the outside wall of the house, or on a nearby utility
pole). This A/D circuit is the weakest point in the entire path. From that modem, the data is sent on a
digital wire to the ISP’s computer, which normally has a direct digital connection.
Downloading (from ISP to home) is done in the opposite direction, so the only A/D circuit involved is
the one in the home modem. In principle, we can purchase a high-quality modem with a high-precision A/D
circuit. In practice, most modems have an average performance A/D, which becomes the weak link in the
download path.
3.22.2 Receiving Bytes in Serial I/O
The next step in understanding serial I/O has to do with the packing of bits by the receiver. The serial
interface on the transmitting side has no problem in unpacking 8-bit bytes and sending out a long stream
of bits on the transmission line. The serial interface on the receiving side, however, has to break up the
incoming stream of bits into groups of 8 bits and pack each group so that it can be sent to the receiving
computer.
The problem stems from the fact that there is just one communication line, through which both the data
and synchronization signals have to be sent. Thus, the receiver has no way of knowing when the transmission
is going to start. When the receiver is first turned on, the line is either quiet or the transmission has already
started and part of it may have been lost already.
Only rarely does it happen that the receiver is turned on at the very moment the transmission arrives
from the transmitter. This is why it is important to make sure that the receiver can identify the beginning
of the transmission or, at least, the beginning of some byte in the transmission.
This problem has two solutions, synchronous serial I/O and asynchronous serial I/O.
3.22.3 Synchronous Serial I/O
The main principle of this method is that the transmitter should transmit all the time, at a constant rate.
This is not as simple as it sounds. The transmitter may be a keyboard, operated by a human. Even the
fastest human typist cannot possibly exceed 10 characters/s and maintain it for a while, when using a
keyboard. This rate, however, is an extremely slow one for the hardware. From the point of view of the
serial interface, such a keyboard rarely has anything to send. This is why the synchronous method uses
another rule: When a synchronous transmitter has nothing to send, it sends the ASCII character SYN (code
0001 0110 = 16₁₆).
As an example, suppose the user types SAVE 2,TYPE(cr). The serial interface on the keyboard would
generate a sequence of ASCII codes similar to
← 16 16 16 53 41 16 56 45 16 20 16 32 2C 16 16 54 59 50 45 16 16 0D 16 16 ←
(in hex, see the ASCII code table on page 81). This way the transmission is sent at the same rate, even
when there is nothing to transmit.
The receiver knows the rules of synchronous transmission and its only problem is, therefore, to identify
the first bit of the first data byte (the 53 in our example). Once this is done, the receiver can simply receive
the bits at the same constant rate, count modulo 8, and pack consecutive groups of eight bits into bytes. If
a group turns out to be a SYN character, it is ignored by the receiver.
At worst, the receiver should be able to identify the first bit of a data byte (perhaps not the first data
byte). In such a case, only part of the transmission would be lost. In the above example, if the receiver
identifies the first bit of byte 56, then it has lost the first two data bytes 53 and 41, but the rest of the
transmission can still be received correctly.
To enable the receiver to identify the first data byte, the transmitter starts each transmission with at
least two SYN characters. Whenever it is turned on, the receiver automatically switches into a special mode
called the hunt mode. In this mode the receiver hunts for a SYN character in the transmission. It starts by
receiving the first eight bits from the line and storing them in a shift register. It then compares the contents
of the shift register with the code of SYN (00010110₂). If the two match, the receiver assumes that the current
eight bits are, in fact, a SYN character, and the next bit to be received is thus the first bit of a byte (we show
later that this assumption is not always correct). If the comparison fails, the receiver inputs the next bit, shifts
the register to the left, enters the new bit on the right end of the register, and compares again. The process
is repeated until either a match is found or the transmission is over.
To illustrate this process, we use the example above. It starts with the following bits:
00010110 00010110 00010110 01010011 01000001 . . .
(without the spaces). We assume that the receiver is turned on a little late, at the point marked with the
vertical bar
00010110 00010|110 00010110 01010011 01000001 . . .
The shift register will initially contain 11000010. The first three comparisons will fail and the receiver will
read and shift the next three bits. The shift register will contain, in successive stages: 10000101, 00001011,
and 00010110. After the third shift, the shift register will match the code of SYN and the receiver will switch
to its normal mode. In this mode, the receiver (actually, the serial interface on the receiving side) receives
bits in groups of 8, packs each group, and sends it to the computer or the I/O device on the receiving side.
If a group equals 00010110, it is ignored.
Two things can go wrong in this simple process.
1. If the receiver is turned on too late, it may miss part of the transmission. In the case
00010110 00010110 0|0010110 01010011 01000001 . . .
the receiver will stay in the hunt mode until it finds the next SYN, the 16 that follows 53 41 in
16 16 16 53 41 16 56 45 16 20 . . .
and will therefore miss the first two data bytes 53 and 41.
2. Sometimes the shift register may contain, during a hunt, the pattern 00010110 even though no SYN
is involved. For example, if the receiver is turned on just as the word TYPE starts to arrive, it will see the
stream 54 59 50 45 16 or, in bits, 01010100|01011001 . . .. The shift register will start with 01010100
and, after six shifts, will contain 00|010110. Two bits are left over from the T and the other six came from
the Y. The receiver, however, will conclude that the shift register contains a SYN and will assume that the
next bit (the seventh bit of the Y) is the first bit of the next byte. In such a case, all synchronization is lost
and the rest of the transmission is corrupted.
To help the receiver in such cases, each byte should include a parity bit. When the receiver loses
synchronization, many parity bits will be bad, and the receiver should ask for a retransmission of the entire
block.
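The hunt mode described above can be tried out with a short Python sketch. This is only an illustration, not the logic of any particular serial interface; the function names (to_bits, hunt, receive) are hypothetical, and the line is modeled as a string of '0' and '1' characters.

```python
SYN = "00010110"  # ASCII SYN, hex 16


def to_bits(hex_bytes):
    """Turn a string such as '16 53 41' into one long string of bits."""
    return "".join(format(int(h, 16), "08b") for h in hex_bytes.split())


def hunt(bits):
    """Hunt mode: return the index of the bit that follows the first
    8-bit window equal to SYN, or None if no such window exists."""
    if len(bits) < 8:
        return None
    register, pos = bits[:8], 8
    while register != SYN:
        if pos == len(bits):
            return None
        register = register[1:] + bits[pos]  # shift left, new bit enters on the right
        pos += 1
    return pos


def receive(bits):
    """Hunt for SYN, then pack groups of eight bits, ignoring SYN groups."""
    start = hunt(bits)
    if start is None:
        return []
    groups = [bits[i:i + 8] for i in range(start, len(bits) - 7, 8)]
    return [format(int(g, 2), "02X") for g in groups if g != SYN]


stream = to_bits("16 16 16 53 41 16 56 45")
print(receive(stream[13:]))  # receiver turned on 13 bits late, as in the example
```

Feeding the sketch the bits of 54 59 50 45 16 reproduces the second failure case: hunt reports a match after six shifts, and the bytes packed from that point on are not the ones that were sent.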
3.22.4 Asynchronous Serial I/O
This method is based on a different principle. When the transmitter has nothing to send, it generates an
unmodulated wave. The receiver receives it and interprets it as a uniform sequence (an idle sequence) of all
zeros (or, possibly, of all ones). When the transmitter has a character to send, it first breaks the sequence
of zeros by sending a single 1 (a start bit) followed by the eight data bits of the character.
The task of the receiver is to wait for the start bit (the break in the transmission) and to receive the
data bits that follow. They are packed and sent, after a parity check, to the computer on the receiving side.
To allow the receiver enough time to do the packing and checking, the transmitter sends at least two zeros
(stop bits) following the data bits. After the two stop bits, the transmitter may either send a sequence of
zeros or the next data byte (again sandwiched between a start bit and two stop bits). A typical example is
00 . . . 001dddddddp001dddddddp00 . . . 001dddddddp000 . . .
where ddddddd stands for the seven data bits and p is the parity bit. The idle sequences can be as long as
necessary, but each character is now transmitted as 11 bits: the seven data bits, the parity bit, and the three
start/stop bits.
Since each character now requires more bits, asynchronous transmission is, in principle, slower than
synchronous transmission. On the other hand, the asynchronous method is more reliable since it can never
get out of synchronization. If a character is received with bad parity, the rest of the transmission is not
affected.
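The following Python sketch illustrates the 11-bit framing. It follows the conventions of the text (idle line and stop bits are zeros, the start bit is a 1); the choice of even parity, the most-significant-bit-first order, and the names frame and deframe are assumptions made only for this example.

```python
def frame(ch):
    """Frame one 7-bit character: start bit, 7 data bits, parity bit, 2 stop bits."""
    data = format(ord(ch), "07b")
    parity = str(data.count("1") % 2)  # even parity (an assumption)
    return "1" + data + parity + "00"


def deframe(line):
    """Scan a line of bits; return the characters found (None for bad parity)."""
    chars, i = [], 0
    while i < len(line):
        if line[i] == "0":          # idle bit or stop bit: keep waiting
            i += 1
            continue
        field = line[i + 1:i + 9]   # the 8 bits that follow the start bit
        if len(field) < 8:
            break                   # transmission ended in mid-character
        data, parity = field[:7], field[7]
        ok = data.count("1") % 2 == int(parity)
        chars.append(chr(int(data, 2)) if ok else None)
        i += 9
    return chars


line = "0000" + frame("S") + frame("A") + "000" + frame("V")
print(deframe(line))
```

Flipping a single data bit of the first character makes its parity check fail, but the characters that follow are still received correctly, which is the reliability argument made above.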
3.22.5 Serial I/O Protocols
The last step in understanding serial I/O is the description of a complete transfer, including data and
control characters. Such a transfer follows certain conventions and must obey certain rules so that the
transmitter and receiver can understand each other. This convention, or set of rules, is called the serial I/O
protocol.
There are many different protocols, and the one illustrated here is typical. When first turned on, the
two serial interfaces at both ends of the transmission line switch themselves to the receiving mode. At a
certain point, one of them receives data in parallel, from the computer, to be transmitted serially. It changes
its mode to “transmit” and starts the protocol.
It may happen, in rare cases, that both serial interfaces start transmitting at the same time. In such
a case, none will receive the correct response from the other side, and sooner or later, one side is going to
give up and switch back to receiving mode. (Section 3.26.1 discusses methods for the prevention of such
conflicts.)
The first step in the protocol is to determine whether the receiver is really connected at the other end
and is ready to receive. The receiver may be disconnected, turned off, busy, or out of order. The transmitter
therefore starts by sending an enquire (ENQ) character (ASCII code 05₈). If the receiver is ready, it responds
with an acknowledge (ACK) character (ASCII code 06₈). If the receiver cannot receive, it responds with a
negative acknowledge (NAK, code 25₈ = 15₁₆).
If the transmitter receives a NAK, it has to wait, and perhaps try again later. If the response is positive
(an ACK), the transmitter sends the first block of data, bracketed between the two control characters STX
(start text, ASCII code 02₈) and ETX (end text, code 03₈). Each character in the data block has its own
parity bit, the entire block (in fact, the entire protocol) is transmitted either in the synchronous or the
asynchronous mode, and the receiver has to be preset, when first turned on, to the speed of the transmitter.
Following the ETX there is usually another character that provides a check, similar to a parity check, for
the entire block. This character is called the cyclic redundancy character (CRC) and is discussed below (see
also page 93).
Following the transmission of the first block, the receiver responds with either an ACK or, in case of parity
errors, with a NAK. Upon receiving an ACK, the transmitter sends the next block, using the same format (STX,
data bytes, ETX, CRC character). In response to a NAK, the transmitter retransmits the entire block. If the
same block has to be retransmitted several times, the transmitter may give up and notify the computer or
I/O device on its side by sending an interrupt signal.
After sending the last block, the transmitter waits for a response and, on getting an ACK, sends an
EOT (end of transmission) character. Following the EOT, the transmitter switches its mode to receiving.
Figure 3.53 shows a typical, simple protocol.
Figure 3.54: A typical serial protocol (the transmitter sends ENQ and then each block as STX, data bytes, ETX, CRC; the receiver answers each with ACK or NAK)
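The transmitter's side of this exchange can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the receiver is represented by a function that returns a reply for each message, the CRC character is a placeholder string, and the names transmit, scripted, and max_tries are hypothetical.

```python
ENQ, ACK, NAK, STX, ETX, EOT = "ENQ", "ACK", "NAK", "STX", "ETX", "EOT"


def transmit(blocks, receiver, max_tries=3):
    """Run the protocol; return (log of (message, reply) pairs, success flag)."""
    log = []

    def send(message):
        reply = receiver(message)
        log.append((message, reply))
        return reply

    if send(ENQ) != ACK:            # receiver busy, off, or disconnected
        return log, False
    for block in blocks:
        for _ in range(max_tries):  # retransmit the whole block after a NAK
            if send(f"{STX} {block} {ETX} CRC") == ACK:
                break
        else:
            return log, False       # give up (the text suggests an interrupt here)
    log.append((EOT, None))         # no reply is expected after EOT
    return log, True


def scripted(replies):
    """A stand-in receiver that answers with a fixed list of replies."""
    it = iter(replies)
    return lambda message: next(it)


log, ok = transmit(["block 1", "block 2"], scripted([ACK, ACK, NAK, ACK]))
for message, reply in log:
    print(message, "->", reply)
```

In this run the second block is sent twice, because the scripted receiver answers its first copy with a NAK.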
The CRC character is the output of the cyclic redundancy checking algorithm, a method similar to, but
much more powerful than, parity. All the characters in the message block are treated as one long stream
of bits representing a (large) binary number. That number is divided modulo 2 by another, predetermined
binary number b, and the remainder is the CRC character. If b = 256, then the remainder is always in the
range 0 . . . 255 and is therefore an 8-bit number. More information on the cyclic redundancy check method
can be found in [Press and Flannery 88] and [Ramabadran and Gaitonde 88].
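Division modulo 2 is ordinary long division in which each subtraction is replaced by an exclusive-or. The Python sketch below computes the remainder exactly as described above. The divisor 107₁₆ (a 9-bit number, so the remainder fits in eight bits) and the function names are assumptions chosen for the illustration; practical CRC implementations usually also append zero bits to the message before dividing, a step omitted here.

```python
def mod2_remainder(number, divisor):
    """Remainder of number divided by divisor, using modulo-2 (XOR) arithmetic."""
    dlen = divisor.bit_length()
    while number.bit_length() >= dlen:
        shift = number.bit_length() - dlen
        number ^= divisor << shift   # "subtract" the divisor aligned with the top bit
    return number


def crc_char(block, divisor=0x107):
    """Treat a block of bytes as one long binary number and return its remainder."""
    return mod2_remainder(int.from_bytes(block, "big"), divisor)


block = b"SAVE 2,TYPE\r"
print(format(crc_char(block), "02X"))
```

With b = 256 the remainder is simply the last byte of the block, so a divisor with more than one nonzero bit is needed if errors elsewhere in the block are to affect the check character.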
As an example of a simple protocol, we describe the Simple Mail Transfer Protocol (SMTP, [RFC821 82]),
commonly used for email messages. We assume that the message is sent from a computer located on a network
where there is a central computer (called the sender) that’s responsible for all communications. The message
is sent by the sender to another network (making several stops on the way, if necessary), where a similar
computer (the receiver) receives it and sends it to the recipient computer. The main point is that the same
message may be sent to several recipients served by the same receiver. In such a case, only one copy of the
message data should be sent. The steps of the protocol are as follows:
1. The sender sends a MAIL command, indicating that it wants to send mail. This command contains a
reverse path, specifying the addresses the command went through. The reverse path ends with the sender’s
IP address.
2. If the receiver can accept mail, it responds with an OK reply.
3. The sender sends a RCPT command, identifying one of the recipients of the mail.
4. If the receiver can accept mail for that recipient, it responds with an OK. Otherwise, it sends a reply
rejecting that recipient, but not the entire email transaction.
5. The sender and receiver may negotiate several recipients this way.
6. When the recipients have been identified and accepted by the receiver, the sender sends the mail
data, terminated with a special sequence.
7. If the receiver understands the data, it responds with an OK reply.
The following are the SMTP commands:
HELO <SP> <domain> <CRLF>
MAIL <SP> FROM:<reverse-path> <CRLF>
RCPT <SP> TO:<forward-path> <CRLF>
DATA <CRLF>
RSET <CRLF>
SEND <SP> FROM:<reverse-path> <CRLF>
S: MAIL FROM:<Smith@USC-ISIF.ARPA>
R: 250 OK
S: RCPT TO:<Jones@BBN-UNIX.ARPA>
R: 250 OK
S: RCPT TO:<Green@BBN-UNIX.ARPA>
R: 550 No such user here
S: RCPT TO:<Brown@BBN-UNIX.ARPA>
R: 250 OK
S: DATA
R: 354 Start mail input; end with <CRLF>.<CRLF>
S: Blah blah blah...
S: ...etc. etc. etc.
S: .
R: 250 OK
S: QUIT
R: 221 BBN-UNIX.ARPA Service closing transmission channel
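The receiver's behavior in the exchange above can be modeled by a small state machine. The Python sketch below is a toy model, not a real mail server: it handles only the commands that appear in the example, it omits the host name from the 221 reply, and the class name Receiver and its fields are hypothetical. The 500 reply for unknown commands is taken from RFC 821.

```python
class Receiver:
    """Toy SMTP receiver handling MAIL, RCPT, DATA, and QUIT."""

    def __init__(self, known_users):
        self.known = set(known_users)
        self.recipients = []
        self.body = []
        self.in_data = False

    def handle(self, line):
        """Process one line from the sender; return the reply (None inside the data)."""
        if self.in_data:
            if line == ".":                  # the special terminating sequence
                self.in_data = False
                return "250 OK"
            self.body.append(line)
            return None
        if line.startswith("MAIL FROM:"):
            self.recipients, self.body = [], []
            return "250 OK"
        if line.startswith("RCPT TO:"):
            user = line[len("RCPT TO:"):].strip("<>")
            if user in self.known:
                self.recipients.append(user)
                return "250 OK"
            return "550 No such user here"   # rejects the recipient, not the transaction
        if line == "DATA":
            self.in_data = True
            return "354 Start mail input; end with <CRLF>.<CRLF>"
        if line == "QUIT":
            return "221 Service closing transmission channel"
        return "500 Syntax error, command unrecognized"


r = Receiver(["Jones@BBN-UNIX.ARPA", "Brown@BBN-UNIX.ARPA"])
session = ["MAIL FROM:<Smith@USC-ISIF.ARPA>", "RCPT TO:<Jones@BBN-UNIX.ARPA>",
           "RCPT TO:<Green@BBN-UNIX.ARPA>", "RCPT TO:<Brown@BBN-UNIX.ARPA>",
           "DATA", "Blah blah blah...", "...etc. etc. etc.", ".", "QUIT"]
replies = [r.handle(line) for line in session]
print(replies)
```

Only one copy of the message data is kept, no matter how many recipients were accepted, which is the main point of the protocol.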
3.22.6 Communications Lines
There are three types of communication lines: simplex, half-duplex, and full-duplex. A simplex line is
unidirectional. Information can only be sent from the transmitter to the receiver and nothing, not even an
acknowledge, can go in the other direction. Such a line can be used only in cases where the receiver is not
expected to respond. Radio and television transmissions are common examples of simplex lines.
A half-duplex line is bidirectional but can only carry one transmission at a time. Thus, before the
receiver can send a response, the transmitter has to send a control character (part of the protocol) to signal
a change in the direction of the line. Such a direction change can be time consuming. After sending the
response, the receiver should send another control character to again change the direction of the line, which
slows down the whole transmission.
A full-duplex line is bidirectional and can handle transmissions in both directions simultaneously. Both sides
may transmit data on such a line at the same time without the need to change directions. Such lines are, of
course, faster.
The following sections discuss terminology, protocols, and conventions used by modern modems:
the problem may be resulting from impairments along the lines running to the local telephone company or
within your home or office. Your telephone company or a private consultant may be able to help.
Modem Data Compression. No matter how fast the modem, file transfers will always go faster with
compression, unless the file is already compressed. There are two basic standards for compression: V.42bis
and MNP5. MNP compression is older, but today’s modems support both methods.
V.90: The original standard for 56Kbaud modems. In practice, speeds reached by this protocol are only
about 36Kbaud.
V.92: A new protocol for 56K modems currently (summer 2000) being considered for ratification by the
ITU-T. The four main features of this protocol are as follows:
1. “Modem-on-hold.” This feature allows the modem and the server to enter a sleep mode, restoring
the data without the need to redial the server. This is useful in cases where a single phone line is used for
both data transfers and speech. When the modem uses the line and the user notices an incoming call on the
same line, the user places the modem on hold and accepts the call.
2. A “Quick Connect” feature reduces the time spent on negotiating a connection between a modem
and a server. To implement this feature, the modem has to “learn” the capabilities of the home phone line
and its maximum transmission rates.
3. Data transmission rates (as opposed to receiving) are improved. Generally, modems receive data
faster than they transmit. Current data transmission rates on 56Kbaud modems are about 28.8Kbaud. The
V.92 standard will be able to improve them to about 40Kbaud, depending on line conditions.
4. A new, efficient data compression standard, V.44, has been developed by Hughes Network Systems
in parallel with V.92. It has been specifically designed for the compression of HTML documents, and it is
expected to increase the effective data receiving rates of V.92 modems to over 300Kbaud, about double that
of the standard V.42bis protocol. This will very likely speed up web surfing by 20-60%, and perhaps up to
200% in some extreme cases.
Modem Error Control. Modern modems support the error correction standards V.42 and MNP 2-4.
Any of these standards assures that your data will be received as accurately and as quickly as possible.
Table 3.54 is a summary of the most important modem standards.
3.23 Modern Modems
The story of the modern modem starts with Dennis C. Hayes, a telecommunications engineer formerly
with Financial Data Sciences from Georgia. Hayes started making modems in the 1970s, and founded
Hayes Microcomputer Products, Inc., in 1978. The company makes modems and other products under the
brand names of Hayes and Practical Peripherals. Hayes’ main idea was to make programmable modems.
A programmable device is one that can receive and execute commands. All modern modems are Hayes
compatible and execute commands based on Hayes’ original design. A typical modem can execute upwards
of 100 commands.
In 1981, Hayes started making the Smartmodem 300, the first intelligent modem that executed the Hayes
Standard AT Command Set. In 1982, the Smartmodem 1200 was introduced. In 1987, Hayes introduced
the Express 96, a 9600 baud modem. In 1988, the Hayes Smartmodem 9600 became the first Hayes V.32
modem introduced for use with data communications networks.
A modern modem is a complex machine that performs many tasks. However, since it does not contain
any moving parts, it is easy to manufacture once a prototype is ready. This is why, by the middle 1980s, there
were many modem makers competing with Hayes, developing new ideas, standards and products. Currently,
the main modem makers are AT&T, Global Village, Motorola, Multi-Tech Systems, Racal Electronics, Supra,
and U.S. Robotics.
A modern modem performs the following tasks:
Negotiating features such as baud rate, synchronous/asynchronous mode, and the precise method for
data compression and error correction with the modem on the other side.
Renegotiating the transmission speed “on the run,” depending on how many errors have been received.
Answering the phone is one important feature of modern modems. You can leave your computer running,
start a communications program, send the command “ATS0=2” to the modem, and it will answer the phone
after two rings. This is important for an individual wanting to set up a web server, and also for a computer
center servicing remote users. The character string AT is the modem’s attention command. The string S0 is
the name of an internal register used by the modem to decide when (after how many rings) to answer the
phone.
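To make the structure of such a command concrete, the following Python sketch splits a string like ATS0=2 into the register number and the value assigned to it. It covers only this one form of command; the function name parse_s_register is hypothetical, and the sketch makes no claim about how a real modem parses its input.

```python
import re


def parse_s_register(command):
    """Parse 'ATS<register>=<value>'; return (register, value) or None."""
    match = re.fullmatch(r"AT\s*S(\d+)=(\d+)", command.strip(), re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_s_register("ATS0=2"))  # register S0, answer after two rings
```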
The newest features in modems today are speeds of 56,600 baud (achieved mainly through the use of
efficient compression), cable-tv modems, and simultaneous voice/data modems (DSL). DSL (Section 3.24) is
such a promising technology that many predict that the current generation of modems will be the last one
and modems will be completely phased out in the future.
See also this book’s web site for a very detailed modem glossary and dictionary. Another useful source
of information on modems is http://www.teleport.com/~curt/modems.html.
Modern modems can also send and receive fax. See http://www.faximum.com/faqs/fax for more
information on fax modems.
Asymmetric DSL (can share line with analog telephone)
          Maximum     Maximum
          upstream    downstream   Cable   Maximum
Type      speed       speed        pairs   distance
ADSL      1M          8M           1       18000
RADSL     1M          7M           1       25000
G.lite    512K        1.5M         1       25000
VDSL      1.6M        13M          1       5000
          3.2M        26M          1       3000
          6.4M        52M          1       1000

Symmetric DSL
          Upstream and
          downstream       Cable   Maximum
Type      speed            pairs   distance
HDSL      768K             2       12000
          1.544M           2       12000
          2.048M           3       12000
HDSL-2    1.544M (T1)      1       18000
          2.048M (E1)              18000
SDSL      1.5M             1       9000
          784K             1       15000
          208K             1       20000
          160K             1       22700
Table 3.56: Features of symmetric and asymmetric DSL
For some historical reason, T-2 has been bypassed altogether. The rest of the world sends a bundle of 30
DS-0 signals on an E-1 line and calls this a DS-1/E-1 channel.
In North America, the following terms are used to indicate standards of serial transmissions:
North American Standards: A T-1 is only one way of bundling DS-0s together. In general, control data
has to be transmitted in addition to the voice signals. This data is needed for call connection, caller ID, and
record-keeping/billing information. On a T-1 line, this data is transmitted by robbing 8 Kbaud from the
64 Kbaud of each of the 24 DS-0 channels, effectively reducing them to 56 Kbaud channels. This is called a
“robbed bit DS-1.” The degradation in voice quality is not noticeable.
Most of us still have analog telephone lines used in our homes. Such lines are referred to as analog
Central Office (CO) lines. In the 1970s, in response to the explosive growth of personal computers and
computer communications, AT&T decided to convert the entire telephone network to digital. This was how
ISDN (Integrated Services Digital Network) was developed. Initially, ISDN was not popular, but it started
to grow together with the Internet. There currently are two types of ISDN connections, as follows:
Type 1. This ISDN connection is common in residences and businesses. It is called a Basic Rate Interface
(BRI) and consists of two 64 Kb voice lines and one 16 Kb data line on one pair of regular phone cable. It
is sometimes called 2B + D (two bearer plus data). This is slowly catching on among internet users because
the two bearer lines can be bound together to form one 128 Kb line, but a special modem is needed. It is
also possible to send and receive slow data such as e-mail over the 16 Kb data line without the phone lines
being tied up or even making an actual phone call. ISDN BRI uses the same phone wire as regular phones
but transmits digitally, so a different card is needed at the Central Office or at the company’s PBX (Private
Branch Exchange). Also, a special phone is needed, or a special box to plug regular telephones into.
Type 2. This type connects the customer’s PBX to the telephone company’s central office. It is called a
Primary Rate Interface (PRI) and consists of a DS-1 connection. An ISDN PRI connection comes on a T-1
type circuit, but there are many more services offered to the customer. Also, an ISDN PRI is not a robbed
bit service. It provides 23 DS-0 channels and uses the 24th channel as the data channel for all 23 DS-0s.
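The channel arithmetic behind these services is easy to check. The short Python sketch below uses only the figures quoted in the text (64 Kb DS-0 channels, 8 Kb robbed from each of the 24 channels of a T-1, 2B + D for the BRI, and 23 bearer channels for the PRI); the variable names are hypothetical.

```python
DS0_KB = 64  # one DS-0 voice channel, in Kb per second

robbed_bit_t1 = 24 * (DS0_KB - 8)  # 24 channels reduced to 56 Kb each
bri_total = 2 * DS0_KB + 16        # 2B + D
bri_bonded = 2 * DS0_KB            # the two bearer lines bound together
pri_bearer = 23 * DS0_KB           # 23 DS-0 channels; the 24th carries the data channel

print(robbed_bit_t1, bri_total, bri_bonded, pri_bearer)
```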
SONET (Synchronous Optical NETwork) is a fiber optic network standard that telephone company
central offices use to communicate. It originated in the US but has been adopted by the CCITT as an
international standard, so equipment made by different manufacturers would work together. SONET defines
how different pieces of equipment interface and communicate, but not what they say to each other. As a
result, different types of networks can be used on a SONET ring. With SONET, the term OC-1 connection is used
instead of DS-3/T-3 connection (the term “OC” stands for “optical carrier”). Currently, SONET connections
are very expensive and are consequently used only by large customers. The different OC speed standards
are as follows:
OC-1 52 Mbps 28 DS-1’s or 1 DS-3.
OC-3 155 Mbps 84 DS-1’s or 3 DS-3.
OC-9 466 Mbps 252 DS-1’s or 9 DS-3.
OC-12 622 Mbps 336 DS-1’s or 12 DS-3.
OC-18 933 Mbps 504 DS-1’s or 18 DS-3.
OC-24 1.2 Gbps 672 DS-1’s or 24 DS-3.
OC-36 1.9 Gbps 1008 DS-1’s or 36 DS-3.
OC-48 2.5 Gbps 1344 DS-1’s or 48 DS-3.
OC-96 5 Gbps
OC-192 10 Gbps
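The entries of this table follow a simple rule: an OC-n connection carries n times the OC-1 rate and n times as many DS-3 and DS-1 signals. The Python sketch below regenerates the table; it uses the exact OC-1 rate of 51.84 Mbps, which the table rounds to 52, and the function name oc is hypothetical.

```python
OC1_MBPS = 51.84     # exact OC-1 rate (shown rounded to 52 Mbps in the table)
DS1_PER_DS3 = 28     # one DS-3 bundles 28 DS-1s


def oc(n):
    """Return (rate in Mbps, number of DS-1s, number of DS-3s) for OC-n."""
    return n * OC1_MBPS, n * DS1_PER_DS3, n


for n in (1, 3, 9, 12, 18, 24, 36, 48):
    rate, ds1, ds3 = oc(n)
    print(f"OC-{n:<3} {rate:8.2f} Mbps  {ds1:4} DS-1s or {ds3:2} DS-3s")
```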
ATM (Asynchronous Transfer Mode) is a packet switching network that runs on SONET. It runs at
the OC-3 and OC-12 speed. ATM has been slated by CCITT as the future backbone of Broadband ISDN
or B-ISDN, but it could also be used as a private network. ATM is not even a finished or officially adopted
standard but manufacturers are currently making products promising that they will comply with the finished
standard when it is approved. The products are still very expensive, but higher end customers may be looking
at it. ATM would be their private network and they would connect to other offices through a leased OC-3
or OC-12 connection. A cheaper alternative may be for a company to buy connections to a public ATM
network and just use the ATM network as the company’s WAN backbone.
3.26 Computer Networks
Early computers could only send their output to a printer, a punched card, or a punched paper tape machine.
Magnetic tapes and drums were added in the 1950s, and magnetic disks have been used since the second
generation of computers. In the 1960s, when computers became more popular, computer designers and users
started thinking of ways to send information between computers, to make computers communicate. Two
things are necessary for computer communication, a communication line and programs that can send and
receive information in a standard way.
Today, the Internet (Section 3.27) is an important part of our lives, but local area networks are also
important, so this section starts with a discussion of LANs.
3.26.1 Local Area Networks
When several computers are located within a short distance (a few hundred yards) of each other, it is possible
to connect them with a cable, so that they form a local area network (LAN). Such a network connects
computers in the same room, in the same building, or in several nearby buildings. A typical example is a
multipurpose building on a university campus, with offices, classrooms, and labs, where a cable runs inside
the walls from room to room and from floor to floor, and all the computers are connected in a LAN.
To reduce interference, a special type of cable, such as a coaxial cable or a twisted pair, has to be used.
Each computer on the network sends output on the wire, or receives input from it, as if the other end of the
wire were connected to an I/O device. The only problem is conflicts, the case where several computers try
to send messages on the same wire at the same time. This problem can be solved in a number of ways, two
of which are described in Sections 3.26.2 and 3.26.3.
The discussion of serial I/O in Section 3.22 implies that every computer on a LAN has to have a serial
port installed. The problem of conflicts has to be solved somehow, and this also requires special hardware.
As a result, each computer in a LAN must have a special piece of hardware called a network card or a
communications interface. This hardware acts as a serial port, packing and unpacking bytes as necessary,
and also resolves conflicts in communications over the network.
3.26.2 Ethernet
This type of LAN was developed in 1976 by Bob Metcalfe of Xerox [Metcalfe and Boggs 76], and is currently
very popular. A single cable, called the ether, connects all the computers through an ethernet interface (also
called an ethernet card) in each computer. The ethernet interface is a special piece of hardware that
performs the data transfers. It executes the ethernet protocol and knows how to avoid conflicts.
A computer sends a message to its ethernet interface as if the interface were an I/O device. The interface
listens to the ether to find out if it is in use. If the ether is in use, the interface waits until the ether is clear,
then sends the message immediately. It is possible, of course, that two or more interfaces will listen to the
ether until it becomes clear and then send their messages simultaneously. To solve this problem, an interface
has to monitor the ether while it is sending a message. If the signal on the ether is different from the one
the interface is sending, the interface stops, waits a random time interval, then tries again.
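The collision rule can be illustrated with a small simulation in Python. This is a deliberately simplified model, not the ethernet protocol itself: time is divided into slots, every message takes one slot, and the names simulate, max_wait, and max_slots are hypothetical. When exactly one interface is ready its message goes through; when several are ready they collide and each picks its own random waiting time.

```python
import random


def simulate(stations, max_wait=8, seed=1, max_slots=10_000):
    """Return the order in which the stations manage to send their messages."""
    rng = random.Random(seed)
    wait = {s: 0 for s in stations}   # slots each station still has to wait
    sent = []
    for _ in range(max_slots):        # safety limit on the simulation length
        if not wait:
            break
        ready = [s for s in wait if wait[s] == 0]
        if len(ready) == 1:           # the ether is clear: the message goes through
            sent.append(ready[0])
            del wait[ready[0]]
        elif len(ready) > 1:          # collision: each one backs off a random time
            for s in ready:
                wait[s] = rng.randint(1, max_wait)
        for s in wait:                # one time slot passes
            if wait[s] > 0:
                wait[s] -= 1
    return sent


print(simulate(["A", "B", "C"]))
```

Because the waiting times are chosen independently, it is unlikely that the same interfaces collide again and again, so every message is eventually sent.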
Once a message is sent on the ether, it is received by every other ethernet interface on the network.
The message must therefore start with an address, which each receiving interface compares with its own. If
the two addresses are not equal, the interface ignores the message; otherwise, it receives the message and sends it, as
input, to its computer.
An ethernet network can easily reach speeds of about 10 Mbaud.
Ethernet addresses are 48 bits long, organized in six bytes. They must be unique, so they are assigned
by the IEEE. Each ethernet manufacturer is assigned three bytes of address, and the manufacturer, in turn,
assigns these three bytes, followed by a unique, three-byte number, to each ethernet card they make.
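The structure of an ethernet address, and the comparison made by each receiving interface, can be sketched in Python as follows. The colon-separated hexadecimal notation, the sample addresses, and the function names are assumptions made for the illustration.

```python
def split_address(address):
    """Split a 48-bit address into (manufacturer bytes, card number bytes)."""
    octets = bytes(int(part, 16) for part in address.split(":"))
    if len(octets) != 6:
        raise ValueError("an ethernet address has six bytes")
    return octets[:3], octets[3:]


def accepts(own_address, destination):
    """An interface keeps a message only if it is addressed to that interface."""
    return split_address(own_address) == split_address(destination)


maker, card = split_address("00:A0:C9:12:34:56")  # a made-up address
print(maker.hex(), card.hex())
print(accepts("00:A0:C9:12:34:56", "00:a0:c9:12:34:56"))
print(accepts("00:A0:C9:12:34:56", "00:A0:C9:12:34:57"))
```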
Figure 3.58: (a) Ethernet, (b) ethernet with hub
History of Ethernet
The following are excerpts from the draft version of the upcoming book, Überhacker!, by Carolyn Meinel.
The history of Ethernet is important because this is a networking technology that’s currently used on
the vast majority of all the local area networks on this planet.
May 22, 1973, at the Xerox Palo Alto Research Center (PARC), the world’s first Ethernet LAN transmitted
its first packet (chunk of data). The proud inventors were Bob Metcalfe and David Boggs. For
years they labored in the laboratory to improve their invention. By 1976 their experimental network was
connecting 100 devices.
The turning point came in 1979. That year Gordon Bell of Digital Equipment Corp. (DEC) phoned
Metcalfe to suggest that they work together to make a commercial product out of Ethernet. Metcalfe’s
employer, Xerox, loved the idea. DEC would build Ethernet hardware, and Intel would provide chips for
DEC’s Ethernet network interface cards (NICs). The idea was that this trio of industrial titans would keep
the technology to itself, so that anyone who would want to use Ethernet would have to buy the equipment
from their combine.
There was one problem with this idea—if Ethernet were to become the dominant networking technology
someday, this combine would violate US antitrust laws designed to curb monopolies. Back then, no one used
Ethernet outside the laboratory. So for these people to be thinking about the danger of becoming a monopoly
was either arrogant—or prescient.
Metcalfe, Bell and associates chose to avoid an Ethernet monopoly. They began working with the
Institute of Electrical and Electronics Engineers (IEEE) to create an open industry standard for Ethernet.
That meant that anyone would be free to create and sell Ethernet hardware or design network operating
systems that would use it. Persuading Xerox, DEC and Intel to make Ethernet free for anyone to build,
ensured that Ethernet would become inexpensive and widely available. For this they deserve credit for
creating one of the keystones of today’s Internet. In June of 1979, Metcalfe left Xerox to found 3Com Corp.
By March 1981, 3Com shipped its first Ethernet hardware to the public. Ethernet had finally emerged from
the laboratory.
In 1982, 3Com shipped its first Ethernet adapter for a personal computer—the “Apple Box.” Some 18
months later 3Com introduced its first Ethernet internal card, the Etherlink ISA adapter for the PC. This
card used “Thin Ethernet” cabling, a technique that is still popular today.
In 1983, the IEEE published the Ethernet standard, 802.3. Xerox turned over all its Ethernet patents
to the nonprofit IEEE, which in turn licenses any company to build Ethernet hardware for a fee of $1000.
This was yet another act of corporate generosity that helped make Ethernet the most widely used local area
networking technology. In 1989, the Ethernet standard won international approval with the decision of the
International Organization for Standardization (ISO) to adopt it as standard number 8802-3.
Why all this history? The important thing with Ethernet is that it became a standard recognized
worldwide in 1989. That means if you set up an Ethernet LAN in your home, you can be certain that much
of what you learn from it will work on Ethernet LANs anywhere else on the planet. Also, if you ever invent
something truly wonderful, please remember this story and make your invention freely available to the world,
just as Metcalfe and Boggs did.
The message should be sent in a special code, to guarantee that no message should contain a bit pattern
identical to the token.
and is connected to the NSFNET through gateways in Boulder (at the National Center for Atmospheric
Research, NCAR), and Salt Lake City (at the University of Utah). The NSFNET is a member network of
the Internet.
In addition to IP, three more protocols are currently used in the Internet. They are collectively known
as TCP/IP (Transmission Control Protocol/Internet Protocol) and are fully described in RFC-1140.
Simple Mail Transfer Protocol (SMTP) is the Internet standard protocol for transferring E-mail
messages.
File Transfer Protocol (FTP) is used to transfer files between Internet nodes.
Telnet is the standard protocol for remote terminal connection. It allows a user to interact with a
remote computer as if he were connected to it directly.
The point where a network is connected to another one used to be called a gateway, but the term
network server is now common. This is usually a computer dedicated to message routing, but a server can
also be a custom piece of hardware. It may be connected to one or more nodes in each of the networks, and
it must have some knowledge of the organization of both, so it can route each message in the right direction.
A message is sent from a node to the nearest server. It is then forwarded from server to server, until it
reaches a server connected to the destination node. The message is sent as a packet, and the entire process
is known as packet switching. The term ‘packet,’ however, is loosely used and may describe different things.
On the Internet, a packet is a block of data, also called a datagram, whose format is defined by the IP.
Some servers on the Internet have a total picture of the network. They know how to reach every domain
on the Internet. They are called root servers, and are updated each time a domain is added to or deleted
from the Internet. Every other server has to get up-to-date information from a root server periodically.
An interesting special case is when one of the networks is small, such as a university campus. A network
server in such a network knows all the nodes on the network, so it can easily route each incoming message to
its destination. However, all outgoing messages may be routed by the server to the same node in the larger
network, where there is more information about the outside world, helping to decide where to forward the
message.
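The special case of a small network can be sketched in a few lines. This is only an illustration, not the way any particular server is programmed: the node names and the upstream node are made up, and a real server would work with IP addresses rather than names.

```python
# Hypothetical campus server: it knows its own nodes and sends
# everything else to one node in the larger network.
LOCAL_NODES = {"library", "physics", "dorm-3"}   # made-up node names
UPSTREAM = "regional-gateway"                    # made-up upstream node


def next_hop(destination: str) -> str:
    """Return the node to which the campus server forwards a message."""
    if destination in LOCAL_NODES:
        return destination        # deliver directly inside the campus
    return UPSTREAM               # let the larger network decide


print(next_hop("physics"))           # a local destination
print(next_hop("www.example.org"))   # anything else goes upstream
```

The point of the design is that the small server needs no knowledge of the outside world beyond the single upstream node.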
When all the nodes of a network are physically close (within a radius of up to a few thousand meters),
they can be connected directly, without the need for modems and phone lines, to form a local area network.
Examples are a department located in the same building, or a small facility housed in a few adjacent
structures.
The Internet as a whole is managed by two centers: the Information Sciences Institute (ISI) of the
University of Southern California, located in Los Angeles, and ICANN (Section 3.30), a nonprofit
corporation working under contract with the US government. ISI is in charge of developing standards and
procedures. ICANN is the Internet Corporation for Assigned Names and Numbers. It maintains the Network
Information Center (InterNIC), where users can connect to
get information about available and assigned addresses and about existing and proposed standards.
3.27 Internet Organization
How does a network of computers work? How is information sent from a source computer to a
destination computer? A good way to understand this is to think in terms of layers. A network is made up
of layers, the lowest of which is the hardware. All other layers consist of software. Lower layers consist of
programs that use the hardware directly. Programs in higher layers invoke the ones in the lower layers, and
are easy to use, since they don’t require detailed knowledge of how the network operates.
The lowest layer, the hardware, is, of course, necessary. Information can be sent, in bits, on wires and
can be created and processed by hardware circuits. Hardware, however, is not completely reliable, and is
tedious to use directly (think of the difference between machine language and higher-level languages). This
is where the low software layers come in. Software is necessary to decide how to route the information, and
to check and make sure that all the bits of a message have been received correctly.
Computer networks resemble the phone network in a superficial way because of the following:
They use existing phone lines to send information, and they rent dedicated lines from telephone and
telecommunications companies.
Using a network from a computer is similar to the way we use a phone. A connection has to be opened
first, information is then moved both ways; finally, the connection is closed.
It therefore comes as a surprise to learn that computer networks operate more like the postal service
than the phone network. When A calls B on the phone, the phone network creates a direct connection
between them, and dedicates certain lines to that connection. No one else can use these lines as long as A
and B stay connected. These two users monopolize part of the phone network for a certain duration. On
the other hand, when A sends a letter to B, the postal service does not dedicate any routes or resources to
that letter. It sorts the letter, together with many others, then sends it, by truck or plane, to another post
office, where it is sorted again and routed, until the letter arrives at its destination.
We say that the phone system is a circuit switched network, whereas the postal service uses packet
switching. Computer networks also use packet switching. A user does not monopolize any part of the
network. Rather, any message sent by the user is examined by network software. Depending on its destination
address, it is sent to another part of the network where it is examined again, sent on another trip, and so
on. Eventually, it arrives at its destination, or is sent back, if the destination address does not exist.
[Figure 3.58: a token ring and an Ethernet, each with its own router, connected to a local provider; the
local provider is connected to a national provider, which has links to and from other providers.]
This structure is illustrated by Figure 3.58. It shows two local networks, a token ring and an ethernet,
connected to a local provider. This may be a computer at a local university or a local computer center.
The connections consist of dedicated lines run between computers called routers. A router is a computer
performing all the network operations for a local network. A router is normally dedicated to network
applications and does not do anything else.
The router at the local network provider is connected to a number of local routers, and to at least
one bigger network provider, perhaps a national one. The national network providers are connected with
high-speed dedicated backbone lines, to complete the network.
3.28 Internet: Physical Layout
The key to understanding the physical structure of the Internet is to think of it as a hierarchical structure.
At the bottom of this hierarchy are the internet users (PCs, workstations, mainframes). Each is connected
to a local Internet Service Provider (ISP), which is the next higher level in the hierarchy. The local ISPs
are in turn connected to regional ISPs (or regional backbone providers), which are in turn connected to
national and international backbone providers. These providers are the highest level in the hierarchy and
are connected together at interconnect points. New branches and connections can be added to any level.
An Internet backbone (sometimes called a network) is a chain of high-speed data lines. In a national
or international backbone, these lines span great distances. The national backbones typically have
high-bandwidth transmission links, with bandwidths of 1 Gigabit/s and higher. Each such backbone also has
numerous hubs that interconnect its links and at which regional backbone operators can tap into the national
backbone. In a regional backbone, the lines span a limited region, such as a city, a county, a state, or several
states. In order to provide communications throughout the internet, each backbone must be connected to
some other backbones at several interconnect points. These points are also called Network Access Points
(NAPs), Metropolitan Area Exchange (MAE) points, and Federal Internet Exchange (FIX) points.
The ten major internet backbone providers in the US (as of mid 2002) are UUNET/WorldCom (27.9%),
AT&T (10.0%), Sprint (6.5%), Genuity (6.3%), PSINet (4.1%), Cable & Wireless (3.5%), XO Communica-
tions (2.8%), Verio (2.6%), Qwest (1.5%), and Global Crossing (1.3%).
Several backbones (we’ll call them networks) may be connected at an interconnection point, but not
every network at the point is connected to every other one. The owner of a network A may have agreements
with the owners of networks B and D but not with the owners of network C to exchange data at an
interconnection point. Such agreements are called peering. By peering, networks A and B open their data
lines to one another, while A and C do not communicate even though they are both connected to the
interconnection point.
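Peering can be modeled as a set of unordered pairs. The following sketch uses the network names A through D from the paragraph above; the representation itself is an assumption made for illustration.

```python
# Peering agreements at one interconnection point, as unordered pairs.
# A has agreements with B and D, but not with C (as in the text).
PEERING = {frozenset(("A", "B")), frozenset(("A", "D"))}


def can_exchange(x: str, y: str) -> bool:
    """True if networks x and y have agreed to exchange data here."""
    return frozenset((x, y)) in PEERING


print(can_exchange("A", "B"))   # True
print(can_exchange("A", "C"))   # False, although both are connected
```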
An interconnection point may be owned by a commercial entity or by a government (the San Francisco
interconnect point is owned by Pacific Bell, the New York interconnect point is owned by SprintLink, and
the Maryland interconnect point is owned by the federal government). The point consists of computers and
routers that are needed to receive and transmit data via high-speed data lines. The lines themselves are
owned by local and long-distance telephone carriers. The two types of high-speed lines currently in use are
data service (DS), and optical carrier (OC). The various speeds of these lines are discussed in Section 3.25.
There are more than 40 national backbone operators operating in the United States today (2000). A national
backbone operator in the US is generally defined as one that has operations (points of presence, PoPs) in at
least five states and has at least four peering agreements as access points. Most national backbone operators
in the US also have lines that stretch coast to coast.
Once the main lines of a national backbone have been installed by a national backbone operator, the
operator extends the backbone by placing Points of Presence (PoPs) in a variety of communities. Low-speed
data lines are leased by the operator to users in the community, turning the operator into the ISP of these
users. Physically, the PoP is a router and most users use a telephone to dial to the router in order to connect
themselves to the internet. The PoPs become part of the backbone and the national backbone becomes a
part of the Internet by peering with similar national backbones at interconnect points.
The two magazines [ispworld 2002] and [boardwatch 2002] are good references for ISPs and Internet
backbones.
One step below the national backbones are the regional backbones. Such a backbone extends over a
single state or an area that covers a few states. A regional backbone is connected to the Internet through a
national backbone (or even at an interconnect point, if one is available in its area). The regional backbone
operator does business by setting up local PoPs and leasing low-speed lines to ISPs.
The ISPs, the next lower level in the Internet hierarchy, exist between the backbone providers (national
or regional) and the end users (the dial-up home and small business owners). A typical ISP may offer Internet
connection to some hundreds of users for $15–20 a month. The ISP has modems, a router and a server.
The server is a computer dedicated to communications. It is the domain name server and it is responsible
for assigning dynamic IP numbers to the many customers. The router receives the actual data from the
customers and sends it to the server, to be sent to the router at the backbone. The router also receives data
from the server and sends it to the right customer.
An ISP can be a small, “mom and pop” local company, but it can also be part of a university campus or a
large corporation. Even a backbone operator can serve as an ISP (Pacific Bell is a good example). [List 2000]
is a URL listing many thousands of local, regional, and backbone ISPs. Another source of information is
[Haynal 2000].
Considering how the Internet is organized, it should be obvious that any computer hooked up to the
Internet can serve as an ISP. A small, private computer user who pays an ISP can become a PoP and start
providing Internet service to others by simply purchasing a router, several modems, and server software.
Such a service would likely be too slow, but it is possible in principle and it shows how the Internet is
organized and how it grows.
The lowest level in the Internet hierarchy is the end user. An end-user connects to the Internet through
a dial-up connection or a dedicated line. The dedicated line can be DSL (Section 3.24), a television cable,
or a special leased line. The latter requires a server computer on the user’s end, and can cost thousands of
dollars a month.
Once the physical layout of the Internet becomes clear, we realize that the various ISPs and backbone
operators own part of the Internet, its physical part. This part of the Internet grows and spreads all the
time without the need for approval from a higher authority. The other part of the Internet, the protocols
used by the software to transfer the information, is not owned by anyone, but is subject to approval.
With 32 bits, there can be 2^32 ≈ 4.3 billion IPs. This is a large number, but the number of available IPs
is steadily shrinking. In the future, the IP size may be increased to 128 bits, resulting in 2^128 ≈ 3.4×10^38
IPs, a comfortably large number.
Exercise 3.20: With about six billion people on Earth (mostly poor), why are so many IP numbers needed?
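The two address counts quoted above are easy to verify, since Python integers have unlimited precision. This is just a check of the arithmetic.

```python
ipv4_count = 2 ** 32     # 32-bit addresses
ipv6_count = 2 ** 128    # 128-bit addresses

print(f"{ipv4_count:,}")      # 4,294,967,296 (about 4.3 billion)
print(f"{ipv6_count:.2e}")    # 3.40e+38
```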
Another part of the IP standard is the packet size. A long message is broken, by the IP programs at
the sending router, into packets that are sent independently. Each packet contains the two addresses and
looks like a complete message. Long packets mean fewer packets, and less work for the routers. However,
an error may be discovered in a packet when it is received, requiring a retransmission. This is why packets
should not be too long. Also, when a network is heavily used, users sending fewer long packets would get
better service than users sending many short ones. For fairness, IP limits the packet size. Typically, packets
do not exceed about 1500 bytes.
3.28.2 Transmission Control Protocol (TCP)
Packets sent in a certain order may be sent along different routes, and may arrive out of order. Also, packets
may be lost, or may get damaged along the way. Here again the post office is a good analogy. This is why
IP is enough in principle, but not in practice. Another protocol, the transmission control protocol (TCP),
is used on the Internet to help with packet organization.
TCP software at the sending router breaks a single long message into packets, and adds a serial number
and a checksum to each packet. At the receiving router, similar software combines arriving packets according
to their numbers, and checks for missing or bad packets.
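The following sketch imitates the two steps just described: splitting a message into numbered, checksummed packets, and reassembling them at the other end. It is a simplification made for illustration. The function names are hypothetical, CRC-32 stands in for the checksum actually used by TCP, and the packets are numbered 0, 1, 2,... whereas real TCP numbers the bytes.

```python
import random
import zlib

MAX_PAYLOAD = 1500  # bytes; the "about 1500" figure mentioned earlier


def split_message(message: bytes, size: int = MAX_PAYLOAD):
    """Break a message into (serial, checksum, payload) packets."""
    packets = []
    for serial, start in enumerate(range(0, len(message), size)):
        payload = message[start:start + size]
        packets.append((serial, zlib.crc32(payload), payload))
    return packets


def reassemble(packets, expected_count: int) -> bytes:
    """Reorder packets by serial number; reject bad or missing ones."""
    good = {}
    for serial, checksum, payload in packets:
        if zlib.crc32(payload) == checksum:   # ignore damaged packets
            good[serial] = payload
    missing = [n for n in range(expected_count) if n not in good]
    if missing:
        raise ValueError(f"retransmission needed for packets {missing}")
    return b"".join(good[n] for n in range(expected_count))


message = bytes(range(256)) * 20           # 5120 bytes of sample data
packets = split_message(message)           # four packets
random.shuffle(packets)                    # simulate out-of-order arrival
print(reassemble(packets, len(packets)) == message)   # True
```

A lost or damaged packet shows up as a missing serial number, which is the signal for the receiver to ask for a retransmission.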
Most network literature talks about IP and TCP as if they were on the same level. In fact, TCP ‘exists’
on a layer higher than IP. A very detailed reference for IP and TCP is [Comer 89].
The combined efforts of IP and TCP create the effect of Internet resources being dedicated to each user
when, in reality, all users are treated equally. Note that, in practice, most users don’t know anything about
IP/TCP, and use software on a higher layer. A typical program used for Internet communications, such as
a web browser, starts with a menu of standard Internet operations, such as Email, Telnet, FTP, Finger, and
Whois (see [Krol 94] for details). The user selects an option, types in an Internet address to connect to, or
a file name to send, and the software does the rest.
3.28.3 Domain Names
All software used for communications over the Internet uses IP addresses. Human users, however, rarely have
to memorize and type such numbers. We normally use text-based addresses like http://www.JohnDoe.com/.
Such an address is called a uniform resource locator (URL). The part JohnDoe.com is called the domain
name. The domain name system was developed in 1984 to simplify email. It has grown exponentially and
it is currently used to identify Internet nodes in general, not just email addresses. Here are some reasons for
why domain names are important:
1. People find it more convenient to memorize and use text rather than numbers. We find it more
natural to use http://www.JohnDoe.com/ instead of an IP number such as 123.234.32.21.
2. IP numbers may change, but internet users prefer to stay with the same domain names. An orga-
nization with URL http://www.GeneralCharity.org/ (where the domain name is GeneralCharity.org)
may move its headquarters from town A to city B. Its location on the internet will change, it will switch to
a different internet service provider (ISP), and so will be assigned a different IP address. It prefers, however,
to keep its domain name.
3. Since an IP address is 32 bits, there can be 2^32 IP addresses. This is a large number (about 4.3
billion), but the number of available IP addresses is dwindling fast because of the rapid growth of the
Internet. The number of valid domain names, however, is much bigger, and it is possible to associate several
domain names with one IP address.
This is why domain names are important and why the internet needs two central data bases, one with
all the assigned IP addresses and the other with all the assigned domain names. Internet standards and
software implementors should also make it easy for internet software to find the IP address associated with
any domain name.
URL http://ws.arin.net/cgi-bin/whois.pl can be used to find out who owns a given IP address
or a chunk of addresses. This URL is part of ARIN (see below). Users need to input an IP address or
its most-significant part. Try, for example, 130.166. An alternative is the specialized search engine at
http://www.whonami.com/.
The data bases are part of the InterNic (network information center) and are maintained by Network
Solutions Inc., under a contract to ICANN (Section 3.30). Address translation is provided by numerous
computers around the world. Such a computer is a domain-name server (DNS) and is normally dedicated to
this task. When a user inputs a domain name into any internet software, such as a browser, the software has
to query a domain-name server to find the IP address associated with that domain name. Any individual
domain-name server is familiar with only some of the many existing domain names, typically the names
and numbers of all computer users in its corner of the Internet. As a result, a domain-name server must
sometimes send a query to other name servers. The various DNSs should be able to communicate with each
other and especially with the “root” DNSs, which are part of the InterNic. These root DNSs are:
A.ROOT-SERVERS.NET., B.ROOT-SERVERS.NET., C.ROOT-SERVERS.NET., D.ROOT-SERVERS.NET., and
E.ROOT-SERVERS.NET.
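The idea that a name server knows only part of the name space and must sometimes ask elsewhere can be sketched as follows. This is a toy model, not the real DNS protocol: all the names and addresses are made up (the address ranges are ones reserved for documentation), and a real root server answers with a referral to another server rather than with the final address.

```python
# Toy name servers: each is a table of the names it happens to know.
LOCAL_DNS = {"www.campus.example": "192.0.2.10"}
ROOT_DNS = {"www.faraway.example": "198.51.100.7"}


def resolve(name: str) -> str:
    """Ask the local server first, then fall back to a root server."""
    key = name.lower()              # domain names are case-insensitive
    for server in (LOCAL_DNS, ROOT_DNS):
        if key in server:
            return server[key]
    raise LookupError(f"no server knows {name}")


print(resolve("www.campus.example"))     # answered locally
print(resolve("WWW.Faraway.Example"))    # answered by the fallback
```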
If the DNS used by computer X has a problem and stops working, computer X will still be con-
nected to the internet, but its software won’t be able to resolve any domain names into IP addresses, and
would therefore stop functioning. This is why each domain name should be known by at least two DNSs.
When a new domain name is registered with ICANN, the registrar (a company authorized by ICANN)
must place the name in at least two DNS computers. A list of accredited registrars can be found at
http://www.icann.org/registrars/accredited-list.html.
As an example, the registrar Domains.com maintains the two DNSs NS1.DOMAINS.COM (at IP address
208.169.214.68) and NS2.DOMAINS.COM (at IP address 208.169.214.25).
Notice that the domain name is not owned by the registrar. It has an owner (an individual, a corporation,
an institute, an organization, or the government) that’s responsible for paying the fees to the registrar to
keep the name current. The owner has to interact with ICANN through a registrar. The owner selects an
available domain name through the registrar and pays the registrar to register the name with ICANN.
The owner then looks for an ISP and becomes a client of the ISP. The ISP gets a chunk of IP addresses
from ICANN and assigns them to its clients. The owner sends the registrar its new IP address, and sends
the ISP its domain name. The registrar’s DNS computers associate the new IP address with the owner’s
domain name. When a message is sent to the owner, the sender knows just the domain name. That name
is sent by the software (such as a browser or an email program) to several DNSs until one of them finds the
name and its associated IP address. The message is then sent with the IP address (and also the domain
name) over the Internet. When the message arrives at the ISP, the ISP’s computers route it to the right
client based on the domain name, since the ISP may have allocated the same IP address to several of its
clients. Here are three variations on this simple theme:
1. An internet-savvy individual A has discovered that domain name abc.com is available. The name
is registered by A immediately, even though he does not need it. He figures that there is (or will be) an
individual or an organization B whose initials are abc and they might be interested in this name. Selling the
name is what A hopes to do. Since A does not need to use the name, he only has to park it, so that anyone
interested would be able to find it. Many registrars provide this service for a small fee or for free. Anyone
browsing that domain name will find a “For Sale” sign on it. There are even companies that will conduct
a public auction for a desirable name such as business.com. Notice, however, that it is illegal to register
someone else’s registered trademark (such as IBM) as a domain name (such as ibm.com), then try to sell it
to them. Such behavior is known as cyber-squatting, and it even includes reserving names of celebrities as
domain names.
[For a list of domain names that are currently for sale see http://www.domainnames-forsale.net/.]
2. An individual already has a web page, such as http://www.ibm.com/larry, at his place of work.
He now decides to have a side business. He registers the domain name MonkeyBusiness.com, but wants to
forward anyone surfing there to his original web page at IBM without the surfer’s knowledge. Some registrars
provide a forwarding service for a small fee.
3. It is expensive to get authorized as a registrar by ICANN (see “How to Become an ICANN-Accredited
Registrar” at http://www.icann.org/registrars/accreditation.htm). A small company that wants part
of the act but cannot afford to be a registrar, can become a subregistrar. A company A may attract clients
by offering low prices. When a client wants to register a domain name, A pays a registrar B to actually
register the name. A then charges the client for the registration, parking, forwarding, disk space for a web
page, web-page design, and other services.
As an example of point 3, here is a domain that was registered by its owner at DomainsAreFree.com
(a subregistrar). They, in turn, referred it to Network Solutions Inc., a registrar that actually registered it
with the InterNic. This information is easy to find at http://www.internic.net/whois.html.
Domain Name: BOOKSBYDAVIDSALOMON.COM
Registrar: NETWORK SOLUTIONS, INC.
Whois Server: whois.networksolutions.com
Referral URL: www.networksolutions.com
Name Server: NS1.DOMAINSAREFREE.COM
Name Server: NS2.DOMAINSAREFREE.COM
Updated Date: 29-mar-2000
When a domain name moves to a different IP address, its owner has to notify the registrar to update
their DNS computers. When a domain name moves to another DNS, the owner (through the registrar)
has to notify InterNic. The InterNic data base for domain names is then updated, to associate the domain
name with the new DNSs. The “root” InterNic DNS servers are also updated, by Network Solutions Inc., of
Herndon, Virginia, USA, so they can provide reliable referrals to other DNSs.
A large organization may maintain its own DNS. Local administrators can then modify their DNS
information in order to block access of users from the organization to certain domain names.
It is clear that the quality of internet communications depends heavily on the ISP. This is why Inverse
Internet Technology Inc. regularly provides ratings of some of the big name ISPs, using measures such as
“average time to login,” “call failure rate,” etc. It is possible to get quite a bit of information from their
assessment, which is located at http://www.inversenet.com/products/ims/ratings/.
With so many web pages and domain names, how can anyone find anything? The answer is: Search
engines. Several companies maintain computers that scan the internet and collect the contents of millions
of web pages. Anyone can use those search engines freely. Anyone who owns a domain name, has a
web page, and wants to advertise it, can submit their URL to many of these search engines. Here are
two examples. Altavista is one of the most popular search engines. At the bottom of their initial page
“http://www.altavista.com/” there is a “Submit a site” menu item that leads to information on adding
and removing URLs from their search engines. The Yahoo start web page also has a button labeled “How
to suggest a site” which has a similar function.
A domain name may consist of letters, digits, and hyphens. It may not start or end with a hyphen.
The maximum size is 63 characters (this tells us that the sizes are stored in the InterNic data base as
6-bit numbers, with size zero reserved). The number of valid domain names is thus huge. The number of
strings of one letter is 26. There are 26^2 = 676 two-letter strings, and the number of strings of 63 letters is
26^63 ≈ 1.39×10^89. This number is much greater than our estimate of the number of elementary particles in
the observable universe and is not much smaller than the legendary googol. Here is an example of a really
long domain name
this-is-the-real-domain-name-that-I-have-always-wanted-all-my-life.com
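The rules for a domain name stated above (letters, digits, and hyphens, no hyphen at either end, at most 63 characters) translate directly into a short check. The function name is hypothetical, and the check covers only the name itself, without an extension such as com. The last lines verify the counts quoted in the text.

```python
import re

# Letters, digits, hyphens; 1 to 63 characters; no hyphen at either end.
LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_label(label: str) -> bool:
    """Check one domain name (without extension) against the rules."""
    return LABEL.fullmatch(label) is not None


print(is_valid_label("JohnDoe"))          # True
print(is_valid_label("General-Charity"))  # True
print(is_valid_label("-bad"))             # False, starts with a hyphen
print(is_valid_label("a" * 64))           # False, too long

print(2 ** 6 - 1)            # 63, the largest 6-bit size
print(26 ** 2)               # 676
print(f"{26 ** 63:.2e}")     # 1.39e+89
```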
The domain name is just one of the fields that make up a URL. The general format of a URL is specified
by RFC1738, and it depends on the particular protocol used. As an example, the URL syntax for an http
protocol is http://site.name:port/path#fragment, where:
1. http: (which stands for Hypertext Transfer Protocol) is the protocol that web browsers and web
servers use to communicate with each other.
2. site is any string of letters, digits and periods. Often, it is just www, but long site names such as
hyperarchive.lcs (used in http://hyperarchive.lcs.mit.edu/) and www.irs (used in government URL
http://www.irs.ustreas.gov/), are common.
3. name is a domain name or an IP address.
4. :port is a port number to connect to. The default value of :port for an http URL is 80.
5. path is a directory path, consisting, as usual, of subdirectory names separated by slashes. The last
name may be a file name.
6. If the last name in the path is a file name, then #fragment is the point (in the file) that should be
placed by the browser at the top of the screen (if possible). If #fragment is not used, the browser positions
the top of the file at the top of the screen.
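The fields of an http URL can be pulled apart with the standard Python function urlsplit. In this sketch the function describe and the second sample URL (its port, path, and fragment) are made up for illustration. Note that the software cannot tell where site ends and name begins, so the two are returned together as the host.

```python
from urllib.parse import urlsplit


def describe(url: str) -> dict:
    """Split an http URL into the fields discussed above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,          # site.name, taken together
        "port": parts.port if parts.port is not None else 80,
        "path": parts.path,
        "fragment": parts.fragment,
    }


print(describe("http://www.irs.ustreas.gov/"))
print(describe("http://hyperarchive.lcs.mit.edu:8080/pub/readme.html#top"))
```

The first URL has no explicit port, so the default value 80 is reported.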
The URL syntax for the ftp protocol is ftp://user:<password>@site.name:port/path. It is similar to
http with the following differences:
1. The default value of user is anonymous.
2. :<password>@ is an optional password for sites that have limited access. The default value of
:<password>@ is the email address of the end user accessing the resource.
3. The default port is 21.
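The three defaults of an ftp URL can be filled in by a few lines of code. This is a sketch under stated assumptions: the function name, the sample URLs, and the email address are hypothetical, and only the standard function urlsplit is used.

```python
from urllib.parse import urlsplit


def ftp_fields(url: str, email: str = "user@example.com") -> dict:
    """Split an ftp URL and supply the defaults listed above."""
    parts = urlsplit(url)
    return {
        "user": parts.username or "anonymous",      # default user name
        "password": parts.password or email,        # default: email address
        "host": parts.hostname,
        "port": parts.port if parts.port is not None else 21,
        "path": parts.path,
    }


print(ftp_fields("ftp://ftp.example.org/pub/notes.txt"))
print(ftp_fields("ftp://larry:secret@ftp.example.org:2121/private/a.txt"))
```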
Table 3.59 lists some popular internet protocols.
Name       Description
ftp        File Transfer Protocol
http       Hypertext Transfer Protocol
rtsp       Real-Time Streaming Protocol
gopher     The Gopher protocol
mailto     Electronic mail address
news       USENET news
nntp       USENET news using NNTP access
telnet     Reference to interactive sessions
wais       Wide Area Information Servers
file       Host-specific file names
prospero   Prospero Directory Service
dict       Accessing word definitions
Table 3.60: Various internet protocols
Using water as an organic network between two computers.
StreamingMedia is an interactive data sculpture that employs a new Internet
protocol (H20/IP) I developed that uses water to transmit information between com-
puters. H20/IP functions in a similar way as TCP/IP but focuses on the inherent
viscous properties of water that are not present in traditional packet networks. The
StreamingMedia demonstration of H20/IP exists as an installation of two comput-
ers at different heights where one captures an image and transmits it to the second
computer in the form of modulated water drops. The project attempts to show
how digital information can be encoded and decoded into organic forms to create a
physical network between digital devices.
—Jonah Brucker-Cohen, Media Lab Europe, 2002, jonah@coin-operated.com
IANA (the internet assigned numbers authority, Section 3.30) is the organization responsible for estab-
lishing top-level domain identifiers.
Any extension to the domain name, such as com or edu, is a top-level domain name (TLD). There are
three types of top-level domains:
1. Generic top-level domains (gTLDs). For years, these were just com, net, and org. On 16 November
2000, ICANN approved the three TLDs biz, info, and name. The gTLDs are not affiliated with any country
and are “unrestricted”—anyone from anywhere can register them. In the future there may be more generic
top level domains, such as arts, shop, store, news, office, lib, private, and sex. Notice that gTLDs
are not limited to three letters.
2. Limited top-level domains. The original ones are:
edu for educational institutions.
int for international entities (http://www.iana.org/int.html)
gov for US government agencies (http://www.nic.gov/)
mil for US military bases and units (http://nic.mil/).
On 16 November 2000, ICANN also approved the four limited TLDs museum, aero, coop, and pro
(for a total of seven new TLDs). Only certain entities can register domain names with these TLDs.
3. Country-specific domains. They consist of two letters such as au for Australia and cn for China.
These are operated by separate registry authorities in 184 different countries. About one-half of these
country-specific domains are “unrestricted”—anyone anywhere can register them, just like com, net, and
org (although some are expensive). The rest are “restricted” and require a local presence and/or company
documentation. Table 3.60 lists some of these codes and Table D.1 lists all of them.
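The three types of top-level domains can be told apart by a simple lookup. The sketch below is illustrative only: the function name is hypothetical, the two sets follow the lists given above (with biz, info, and name counted as generic, as in item 1), and any two-letter extension is assumed to be a country code without checking it against the actual list.

```python
GENERIC = {"com", "net", "org", "biz", "info", "name"}
LIMITED = {"edu", "int", "gov", "mil", "museum", "aero", "coop", "pro"}


def tld_type(domain: str) -> str:
    """Classify a domain name by its top-level domain."""
    tld = domain.lower().rstrip(".").rsplit(".", 1)[-1]
    if tld in GENERIC:
        return "generic"
    if tld in LIMITED:
        return "limited"
    if len(tld) == 2 and tld.isalpha():
        return "country"            # assumed, not looked up
    return "unknown"


print(tld_type("JohnDoe.com"))     # generic
print(tld_type("www.csun.edu"))    # limited
print(tld_type("go.to"))           # country
```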
Popular country codes are at (Austria), cc (Cocos Island, formerly Keeling), nu (Niue), to (Tonga),
md (Moldavia), ws (Samoa, stands for “web site”), and tv (Tuvalu). Somewhat less popular are ac,
li, sh, ms, vg, tc, gs, tf, ky, fm, dk, ch, do, and am. Try, e.g., http://go.to/, http://stop.at/,
http://vacation.at/hawaii, and http://www.tv/.
See http://www.iana.com/cctld.html and http://www.iana.com/domain-names.html for more in-
formation on top-level domains.
The US Domain Registry is administered by the Information Sciences Institute of the University of
Southern California (USC-ISI). The main ISI web page is at http://www.isi.edu. Detailed information
about the US domain can be found at http://www.nic.us/.
ARIN (American Registry for Internet Numbers) is a nonprofit organization established for the purpose
of administration and registration of IP addresses for the following geographical areas: North America, South
America, the Caribbean, and sub-Saharan Africa. Their web site can be found at http://www.arin.net/
and a WhoIs search is provided at http://www.arin.net/whois/index.html.
ARIN is one of three Regional Internet Registries (RIRs) worldwide, which collectively provide IP
registration services to all regions around the globe. The others are: RIPE NCC—for Europe, Middle East,
and parts of Africa and APNIC—for Asia Pacific.
California at the California State Hayward campus. That hub connects the Humboldt, Sonoma, Chico,
Sacramento, Stanislaus, San Jose, and Monterey Bay campuses, as well as many community colleges. It is
connected to the rest of the Internet through UUNET (which is owned by MCI Inc.). The southern California
hub connects the San Luis Obispo, Channel Islands, Bakersfield, Northridge, LA, Dominguez Hills, Pomona,
San Bernardino, Fullerton, San Marcos, and San Diego campuses (and many community colleges). This hub
has an OC-48 SONET ring line (2.4 Gbits/sec). The communications lines range from T1 (1.544 Mbits/sec) to
OC-3c. The 4CNET hub in Los Alamitos is, in turn, connected to the rest of the Internet through UUNET
(MCI). There is also a cable connecting the Bakersfield, Fresno, and Stanislaus campuses, and another one
between San Luis Obispo and Monterey Bay. These two complete the two backbones of the 4C network.
Figure 3.62 shows the main parts of the CSUN communications network. The single cable from 4CNET
is hooked up to a router (through an ATM switch made by FORE). Currently, this is a CISCO 4700 router.
The main task of the router is to confirm that the IP addresses of all incoming messages start with 130.166.
The CISCO router is connected to a Cabletron Smart Switch 8600 router (through an FDDI switch). The
8600 router has 16 slots with two 1 Gbit/s ports each. This router can therefore handle bandwidths of
up to 32 Gbits/s. This equipment is located in the MDF (main distribution frame) building. One line goes
from the 8600 router to the campus’ main computer room, where the main campus computers, CSUN1 and
CSUN2, are located (and also other computers dedicated to communications). In addition, the 8600 router
is connected, by underground 1Gbit/s fiber optic cables, to many campus buildings. Each building has its
own local area network(s).
The third octet of all the 130.166 IP numbers designates the campus subnet. Subnet 1 connects the
main campus computers, such as IBM S/3-90, CSUN1, and CSUN2. Each local area network is a subnet.
Each subnet has an ethernet router that sends the incoming message to one of its computers based on the
4th octet.
In the engineering building, for example, there are subnets 2, 12, 40–46, 67, and 68. When a router in
that building receives a message, it routes it to one of these networks, where it finally gets to its destination
computer based on its fourth octet.
Each of the four octets of the IP number consists of 8 bits, so it can have values in the range 0–255.
On our campus, the values are allocated as follows: 0 and 255 are unused. 1–19 are reserved. 20–229 are
available for general use. 230–249 are used for network equipment, and 250–254 are used for test equipment.
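
The octet scheme just described can be sketched in a few lines of Python. This is only an illustration: the function name and the category strings are invented here, and the sketch assumes that the value ranges listed above apply to the fourth octet (the one that identifies a computer within its subnet).

```python
def classify_campus_address(address):
    """Split a dotted IP number into octets and classify its fourth octet.

    Assumption: the campus value ranges (reserved, general use, ...) apply
    to the fourth octet. The category names are illustrative only.
    """
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError("expected four octets in the range 0-255")
    if octets[:2] != [130, 166]:
        raise ValueError("not a 130.166 campus address")
    subnet, host = octets[2], octets[3]   # third octet = subnet, fourth = computer
    if host in (0, 255):
        usage = "unused"
    elif host <= 19:
        usage = "reserved"
    elif host <= 229:
        usage = "general use"
    elif host <= 249:
        usage = "network equipment"
    else:
        usage = "test equipment"
    return subnet, host, usage
```

For example, classify_campus_address("130.166.2.45") places the address on subnet 2 and in the general-use range.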
The computer center on campus allocates dynamic IP numbers. Any computer used for communications
has to have a communications card (such as an ethernet card) installed in it. Such a card has a unique,
48-bit id number (referred to as MAC, for media access control) hardwired in it (see also Section 3.26.2).
The leftmost 24 bits identify the card manufacturer and the rightmost 24 bits are unique to the card. A
subnet must have a computer that stores the id numbers of all the computers connected to the subnet
(these numbers may change since computers and communications cards change). This computer matches
the (permanent or semi-permanent) 48-bit id numbers with the (temporary) 32-bit IP numbers in incoming
and outgoing blocks of data.
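
The split of the 48-bit id number into a manufacturer part and a card part is simple bit manipulation. The following Python sketch shows it; the function name and the sample id number are made up for illustration.

```python
def split_mac(mac):
    """Return (manufacturer_bits, card_bits) of a 48-bit id written in hex.

    The id may be written with ':' or '-' separators, e.g. 00:1A:2B:3C:4D:5E
    (a made-up value used only as an example).
    """
    value = int(mac.replace(":", "").replace("-", ""), 16)
    if not 0 <= value < (1 << 48):
        raise ValueError("a MAC id number must fit in 48 bits")
    manufacturer = value >> 24      # leftmost 24 bits: card manufacturer
    card = value & 0xFFFFFF         # rightmost 24 bits: unique to the card
    return manufacturer, card
```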
[Figure: map of the 4CNET backbone connecting the CSU campuses (Humboldt, Chico, Sacramento, Sonoma, Hayward, Stanislaus, Fresno, Monterey, San Luis Obispo, Bakersfield, CSUN, Channel Islands, Pomona, San Bernardino, Fullerton, San Marcos, SDSU, CSUDH, CSULA, CSULB), with the hub in Los Alamitos and links to Qwest (in Anaheim) and UUNET.]
[Figure 3.62: The main parts of the CSUN communications network: the 4CNET line, the FORE ATM switch, the CISCO 4700 router, the FDDI switch, and the Cabletron Smart Switch 8600 router (16 slots with 2 ports each, 1 Gb/s each, for up to 32 Gb/s nonblocking), which connects by 1 Gb/s lines to CSUN1, CSUN2, mail1, mail2, the S/3-90, the DHCP and DNS servers, and all campus buildings.]
board serve one-year terms, and are succeeded by at-large directors elected by an at-large membership
organization. See http://www.icann.org/general/abouticann/htm for biographies of the 19 directors.
Formed in October 1998, ICANN is a nonprofit, private-sector corporation created by a broad coalition
of the Internet’s business, technical, and academic communities. ICANN has been designated by the U.S.
Government to serve as the global consensus entity to which the U.S. government is transferring the respon-
sibility for coordinating four key functions for the Internet: the management of the domain name system,
the allocation of IP address space, the assignment of protocol parameters, and the management of the root
server system.
Critics of ICANN, some of whom want to change it and others who want to tear it down, met around
the issue of “Governing the Commons: The Future of Global Internet Administration,” at a meeting sponsored
by Computer Professionals for Social Responsibility, a public interest group.
ICANN is the organization to which the U.S. government transferred administration of the technical
functions of the Internet, including the single root system. The nonprofit group is also charged with pro-
moting competition in domain names com, org, and net, that until recently were handled exclusively by
Network Solutions Inc., of Herndon, Virginia, USA.
Several other companies are now registering domain names and more are accredited by ICANN all the
time.
* a blank line *
the html file follows here
The server agrees to use HTTP version 1.0 for communication and sends the status 200 indicating it
has successfully processed the client’s request. It then sends the date and identifies itself as an NCSA HTTP
server. It also indicates it is using MIME version 1.0 to describe the information it is sending, and includes
the MIME-type of the information about to be sent in the “Content-type:” header. Next, it sends the
number of characters it is going to send, followed by a blank line and the html file (the requested data).
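
The structure of such a response (a status line, header lines, a blank line, then the data) can be taken apart with a short Python function. This is a sketch only: the sample response below is invented for illustration (its date, server string, and html content are assumptions, not copied from a real server), and the function handles only well-formed responses.

```python
def parse_http_response(raw):
    """Split a raw HTTP response into (version, status, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")    # the blank line ends the headers
    lines = head.split("\r\n")
    version, status, _reason = lines[0].split(" ", 2)   # e.g. "HTTP/1.0 200 OK"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(status), headers, body


# An invented response in the style described in the text.
SAMPLE_RESPONSE = (
    "HTTP/1.0 200 OK\r\n"
    "Date: Wed, 01 Jan 2003 12:00:00 GMT\r\n"
    "Server: NCSA/1.3\r\n"
    "MIME-version: 1.0\r\n"
    "Content-type: text/html\r\n"
    "Content-length: 28\r\n"
    "\r\n"
    "<html><body>Hi</body></html>"
)
```

Calling parse_http_response(SAMPLE_RESPONSE) returns the version string, the numeric status, a dictionary of headers, and the html text that follows the blank line.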
A web browser can also access files with other protocols such as FTP (for file transfer), NNTP (a
protocol for reading news), SMTP (for email) and others.
Because of the popularity of the web, there is a huge amount of material of all types available for brows-
ing and downloading. The problem is to locate material of interest, and this is done by using a search engine.
Such an engine scans the web all the time, and collects information about available documents. A user spec-
ifies a search topic, such as “introduction to html”, and the search engine lists all the web pages that contain
this string. Two popular search engines are http://www.google.com/ and http://www.infoseek.com/.
A word about cgi. A cgi script is a program that’s executed on the web server in response to a user
request. Most information traffic on the web is from a server to a client. However, because of the importance
and popularity of the web, html and the http protocol have been extended to allow for information to be
sent back. This is important, for instance, when shopping online. A customer should be able to select
merchandise and inform the seller how payment will be made. An html document may display a form that
the user fills out. The form includes a button for the user to click when done. That button specifies the
name of a program that will process the information in the form. When the user sends the form back to the
server, that program is invoked. Such a program is referred to as a “common gateway interface” or a cgi
script.
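
As a rough illustration of what such a program does, here is a minimal Python sketch of the processing step. The field names (item, payment) and the wording of the reply are hypothetical; the sketch assumes the form data arrives as URL-encoded name=value pairs, which is how browsers normally submit forms.

```python
import html
from urllib.parse import parse_qs


def handle_order_form(form_data):
    """Build the reply for a submitted form given as URL-encoded text.

    The field names 'item' and 'payment' are hypothetical examples.
    """
    fields = parse_qs(form_data)          # {"item": ["blue mug"], ...}
    item = fields.get("item", ["(nothing)"])[0]
    payment = fields.get("payment", ["(unspecified)"])[0]
    body = (
        "<html><body>Ordered "
        + html.escape(item)               # never echo user input unescaped
        + ", paying by "
        + html.escape(payment)
        + ".</body></html>"
    )
    # A cgi script's output starts with a header, a blank line, then the data.
    return "Content-type: text/html\r\n\r\n" + body
```

In an actual cgi setting the server would hand the form data to the script (in the QUERY_STRING environment variable or on standard input) and the script would print the returned text.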
3.31.1 Web Search Engines
The world wide web is immense, encompassing millions of pages and billions of words on many topics
ranging from poetry to science and from personal thoughts to government regulations. There is much useful
information, which raises the question of finding items of interest. Much as a library would lose much of its
value without a catalog, the web would lose much of its usefulness without the various search engines. A
search engine is a software system that keeps a huge amount of information about web pages and can locate
any piece of information quickly.
Since there is so much information on the web, it makes sense to search the web itself to find how search
engines work. URL http://www.searchenginewatch.com/webmasters/work.html was found by such a
search.
The two main tasks of a search engine are: (1) To go out to the web, collect information, and update
it often and (2) to quickly respond to any query with a ranked list of URLs. These two tasks are discussed
here.
Going into the web and collecting existing pages is done by programs called spiders, crawlers, or robots.
Such a program finds web pages, inputs their entire texts (sometimes with images), and stores this data in
the search engine’s data base together with the address (URL) of each page. The main problem is to cover
the entire Internet (or at least most of it). The spider has to employ a search strategy that will find all (or
most) of the pages in the world wide web. One component of such a strategy is the many links found in web
pages. Once a spider has found a web page and has input its content, the spider can use the links in that
page to find more pages, collect them, then use links found in them to dig deeper and find more web pages.
Such a strategy is useful, but does not guarantee that all, or even most, of the existing web pages have been
found. It is possible that starting from another web page, new links would be found, extending the coverage.
The amount of material found by a spider therefore depends on the choice of initial URLs.
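
The link-following strategy can be sketched in Python on a toy "web" held in memory. The page names and links below are invented, and a real spider would fetch pages over the network instead of looking them up in a dictionary; the sketch only shows why the choice of initial URLs matters.

```python
from collections import deque


def crawl(seed_urls, links_on_page):
    """Return the set of pages reachable from the seeds by following links.

    links_on_page maps each URL to the list of URLs it links to (a stand-in
    for actually fetching and parsing the page).
    """
    found = set(seed_urls)
    queue = deque(seed_urls)
    while queue:
        url = queue.popleft()
        for target in links_on_page.get(url, []):
            if target not in found:      # visit every page only once
                found.add(target)
                queue.append(target)
    return found


# A made-up five-page web: page "d" links into the rest, but nothing links to "d".
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
    "d": ["a", "e"],
    "e": [],
}
```

Starting from "a" the spider finds only three of the five pages, while starting from "d" it finds all five.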
A spider starts with an initial list of URLs that have many links. Examples of such URLs are
yahoo.com for general reference, the Internet movie data base (http://www.imdb.com/) for movie information,
http://vlib.org/ for literary resources, and http://hometown.aol.com/TeacherNet/ for educational
resources. Such a list can be started by using another search engine to search for web pages containing phrases
such as “many links,” “the most links,” “art resources,” or “all about music.” Once an initial list of URLs
has been established, the spider can follow every link from each of the URLs on the list. When those web
pages are input into the data base, the number of links in each is counted, and those pages with the most
links are added to the initial list.
Such a strategy takes into account the important fact that the web changes continuously. Web pages
appear, are modified, then may disappear. It is therefore important to modify the initial list all the time.
The second important task of a search engine is ranking the results. The aim is to present the user with
a list of web pages containing the term or phrase searched for, such that the items first on this list will be the
most useful to the user. None of the present methods used to rank results is fully satisfactory, since there is
a big difference between how humans think and how computers work. The main methods used for ranking
are the following:
1. Term frequency. The search engine locates all the web pages in its data base containing the term
requested by the user. Each of those pages is checked for multiple occurrences of the term and is ranked
according to how many times the term appears. The assumption is that a page where the term appears
many times is likely to be relevant. On the other hand, just repeating a term does not guarantee that the
page would have important information on the term.
This ranking method is simple and fast, but might generate irrelevant items in cases where the term is
very common or has several meanings.
2. A close variant is a method that considers both the number of occurrences of the term and their
positions in the document. If the term is mentioned in the header or anywhere at the beginning of the web
page, this method considers the page relevant.
3. Another variant asks the user to supply relevancy terms. For example, searching for “cork” produces
web pages about bottles, other cork products, and also about Ireland. When the user specifies “cork” as the
search term and “bottle” as the relevancy term, most web pages about the city of Cork in Ireland will not be
included in the results list. Some web pages may discuss corks and bottles in Ireland, and they can also be
avoided if the search engine allows specifications such as: Avoid pages with the term “Ireland.”
4. A different approach to ranking considers the number of times a web page has been referenced in
other pages. The idea is that a page that’s referenced from many other pages must be relevant to many
readers even if it contains the search term just once. To implement this approach, the search engine must
count the number of times each web page in its data base is referenced by other pages, a time-consuming
task.
5. Imagine a user searching for a common term such as “heart.” This may result in a long list of
matching web pages. The search engine may spy on the user, trying to determine what pages are selected
by the user, and assigning those pages high ranks next time someone searches for the same term.
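
Two of the ranking methods above, term frequency (method 1) and reference counting (method 4), are easy to sketch in Python. The sketch below is a simplification under stated assumptions: pages are plain strings split on white space, ties are broken alphabetically by URL, and the function names are invented here. It is not a description of how any particular search engine works.

```python
def rank_by_term_frequency(pages, term):
    """pages maps URL -> text; return URLs containing term, most occurrences first."""
    term = term.lower()
    counts = {url: text.lower().split().count(term) for url, text in pages.items()}
    hits = [url for url in counts if counts[url] > 0]
    return sorted(hits, key=lambda url: (-counts[url], url))


def rank_by_references(links, candidates):
    """links maps URL -> URLs it links to; rank candidates by inbound references."""
    inbound = {url: 0 for url in candidates}
    for source, targets in links.items():
        for target in set(targets):               # count each referring page once
            if target in inbound and target != source:
                inbound[target] += 1
    return sorted(candidates, key=lambda url: (-inbound[url], url))
```

Note that the two methods can disagree: a page that repeats the term many times may be referenced by no one, which is one reason no single method is fully satisfactory.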
In addition to the spider-based search engines, there are also web directories. Two important ones are
Yahoo and ODP. They use human editors to compile their indexes, which are clean and well-organized but
1. It is smaller. Main memory must have room for programs, which may be very large. The control store
need only have room for the microprograms, and they are normally very short, just a few microinstructions
each.
Exercise 4.1: How many microprograms are there in the control store?
2. It must be faster. Executing a single machine instruction involves fetching several microinstructions
from the control store and executing them. Microprogramming is thus inherently slow, and one obvious way
to speed it up is to have a fast control store. Fortunately, this does not increase the price of the computer
significantly, since the control store is small.
3. It is normally read-only. Different programs are loaded into main memory all the time, but the
microprograms have to be written and loaded into the control store only once. They also should stay in the
control store permanently, so ROM is the natural choice for this storage.
The microprograms for a given, microprogrammed computer are normally written by the manufacturer
when the computer is being developed and are never modified. Recall that the microprograms describe
the machine instructions, so modifying a microprogram is the same as changing the meaning of a machine
instruction. A user modifying the control store may discover that none of their programs executes properly.
Two important features of microprogramming should now be clear to the reader: it is slow (a disadvantage),
and it simplifies the hardware (a small advantage, since computer hardware isn’t expensive). However,
microprogramming has another, more important advantage: it makes the instruction set (in fact, the entire
design of the control unit) more flexible. In a conventional, hardwired computer, changing the way a machine
instruction operates requires changing the hardware. In a microprogrammed computer it only requires a
modification of the microcode. We have already seen that under normal use, the machine instructions are
never changed, but there are two good examples of cases where they need to be changed.
1. Imagine a computer manufacturer making and marketing a computer X. If X proves successful, the
manufacturer may decide to make a better model (perhaps to be called X+ or X pro). The new model
is to be faster than the existing machine X, have bigger memory, and also an extended instruction set.
If X is microprogrammed, it is easy to design the extended instruction set. All that it takes is adding
new microprograms to the control store, and perhaps modifying existing microprograms. In contrast, if X
is hardwired, new control circuits have to be designed and debugged, a much slower and more expensive
process.
2. Traditionally, when a new computer is developed, it has to be designed first and a prototype built,
to test its operation. Using microprogramming, a new computer Y can be developed by simulating its
instruction set on an existing, microprogrammed computer X. All that it takes is to replace the control store
of X with a new control store containing the microprograms for the instruction set of Y. Machine Y can then
be tested on machine X, without having to build a prototype. This process, a simulation of a computer by
microinstructions, is called emulation.
Today, many modern microprocessors are microprogrammed, with the control store as an integral part
of the chip, and usually with this feature being completely transparent to the users.
[Smotherman 99] is a more detailed history of microprogramming.
4.3 The Computer Clock
Electronic components are fast, but not infinitely fast. They still require time to perform their operations.
When, e.g., the inputs of a latch change states, it takes the latch a certain amount of time to sense this,
change its internal state, and update its output. The response time of a simple digital device is measured
in nanoseconds; it is thus very short but it is not zero. [A nanosecond (ns) is defined as 10^-9 of a second,
or one thousandth of a microsecond. See also pages 9 and 359.] As a result it is important to design the
computer such that each operation waits for the previous one to finish before it can start. This is
why every computer has a clock that supplies it with timing information. The clock is an oscillator that
produces an output signal shaped like a square wave. During each cycle, the clock pulse starts low, then goes
up (through a rising edge) then goes down again (through a falling edge). The clock pulse is sent to all the
other parts of the processor, and the rising and falling edges of each cycle trigger the individual components
to perform their next operation.
In the microprogramming example described in this chapter, we assume a clock with four output lines,
all carrying the same signal, but at different phases as follows:
[Figure: waveforms of the four clock output lines.]
These outputs divide each clock cycle into four subcycles, not necessarily of the same size, and the four rising
edges can be used to trigger the processor to start four operations each clock cycle (Section 4.6). Notice how
three of the clock outputs are generated as delayed copies of the first one, using delay lines.
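
The idea of deriving the extra clock lines as delayed copies of the first can be illustrated with a small Python sketch. The numbers used (a period of 8 time steps and delays of 1, 2, and 3 steps) are arbitrary assumptions chosen for the example; the text does not specify them and notes that the subcycles need not be of equal size.

```python
def clock_lines(period, delays, ticks):
    """Sample a square wave and delayed copies of it at integer time steps.

    Each cycle starts low and goes high halfway through. A delayed line is
    held low until its delay has elapsed.
    """
    def level(t):
        return 1 if (t % period) >= period // 2 else 0

    return [[level(t - d) if t >= d else 0 for t in range(ticks)] for d in delays]


def rising_edges(line):
    """Return the time steps at which a sampled line goes from 0 to 1."""
    return [t for t in range(1, len(line)) if line[t - 1] == 0 and line[t] == 1]
```

With clock_lines(8, [0, 1, 2, 3], 16), the four lines have their first rising edges at steps 4, 5, 6, and 7, giving four distinct trigger points in every clock cycle.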
Anagrams
No Senator (nanostore)
No critics mourn it (microinstruction)
Rig poor Mr. Mac (microprogram)
Red code (decoder)
Egg is octal (logic gates)
Referring tasters (register transfer)
Man go corrupter (program counter)
Mr. Dole yeomanry (read-only memory)
Strict union (instruction)
Proton latch (control path)
is very similar to the microprogramming example in Tanenbaum’s book Structured Computer Organization
(3rd edition, p. 170) which, in turn, is based on existing commercial bit slices. The main differences between
our example and Tanenbaum’s are:
1. Tanenbaum’s example uses microinstructions to isolate and check the individual bits of the opcode.
Our example uses extra hardware to move the opcode to the MPC.
2. Tanenbaum’s computer is based on a stack, where all the important operations take place. Our
example uses 4 general-purpose registers.
3. Our example adds hardware to make it possible for the microinstructions to generate and use
constants. Tanenbaum’s example is limited to a few registers which contain permanent constants. No other
constants can be used.
We describe the machine in several steps: the data path, the microinstructions, the microinstruction
timing, the control path, the machine instructions, and the microcode.
The data path is shown in Figure 4.1. The central part is a block of 8 registers, four 16-bit general-
purpose and 4 special-purpose. The latter group includes the 16-bit IR, the 16-bit auxiliary register AX, and
the two 12-bit registers PC and SP. In addition there are: an ALU (with 2 status flags, Z and N), an MAR,
an MBR and a multiplexor (the AMUX). These components are connected by 16-bit data buses (the thick
lines; some are only 12 bits wide) along which data is moved in the computer. The main paths for moving
data are:
1. The A and B buses. Any register can be selected, moved to either the A bus or the B bus, and end
up in the A latch or the B latch. The B latch is permanently connected to the right-hand input of the ALU,
but it can also be sent to the MAR. The left-hand input of the ALU comes from the AMUX, which is fed
by the A latch and the MBR, and can be switched left or right.
2. The C bus. The output of the ALU can be moved to the C bus (and from there, to any of the
registers) or to the MBR (and from there, to main memory). It is important to realize that this output can
also be moved to both the C bus and the MBR, and it can also be ignored (i.e., moved to none of them).
There are therefore 4 ways of dealing with the ALU output.
Exercise 4.2: Why would anyone want to go to the trouble of using the ALU to create a result only to
disregard it?
4.5 The Microinstructions
In addition to the registers and buses, there are many gates, decoders, control lines and other hardware
components which are not shown in Figure 4.1. This diagram is kept simple in order to show the basic
operations of the computer. Here is what can be done by this computer:
1. Two registers can be selected and sent to the two latches.
2. A 3-bit code can be sent to the ALU, requesting an ALU operation. Also, the B latch can be sent to
the MAR.
3. The ALU output can be sent to the C bus, to the MBR, to both places or to none of them.
4. A memory operation (read or write) can be specified.
Since our computer is microprogrammed, the control unit does not have circuits to execute the machine
instructions, and they are executed by the microinstructions. Each microinstruction should therefore be
able to specify any of the four operations above. It has been mentioned that microinstructions are similar to
machine instructions but are simpler, so it takes several microinstructions (a microprogram) to execute one
machine instruction. Writing a microprogram is similar to writing a program in assembler. In particular,
the microinstructions in a microprogram should be able to jump to each other. This is why another, fifth,
operation is needed namely,
5. Jump to another microinstruction in the control store.
Any microinstruction should be able to perform the five operations above. Figure 4.2 shows the format
of the microinstructions. All the fields where no width is indicated are 1-bit wide. Notice that they all have
the same format, specifically, the microinstructions don’t have an opcode, which is one reason why they are
simpler than machine instructions. This means that a microinstruction can be fetched from the control store
and can be executed immediately, without any opcode decoding.
Here is the meaning of the individual fields:
The AXL and AXR fields control the way the AX register is loaded from the ADDR field (Section 4.7).
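[Figure 4.1: The data path. The registers R0, R1, R2, R3, IR, AX, PC, and SP lie between the A, B, and C buses; the A latch, the B latch, the AMUX, the ALU (with a function code input and the N and Z flags), the MAR, and the MBR connect them to main memory.]
[Figure 4.2: The microinstruction format. Fields and widths in bits: AXL, AXR, IRG, AMUX, JUMP (2), ALU (3), MBR, MAR, RD, WR, ENC, C (3), B (3), A (3), ADDR (8).]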
The IRG bit controls loading the opcode from the leftmost 4 bits of the IR to the MPC (Section 4.7).
The AMUX field controls the setting of the A-multiplexor. Any microinstruction that wants to flip
the A-multiplexor to the left (MBR) should have a one in this field.
The 2-bit JUMP field specifies the type of jump the microinstruction requires. There are 2 conditional
jumps, based on the values of the status flags Z and N, one unconditional jump, and one value for no jump.
The jump address is contained in the ADDR field.
The ALU field specifies 1 of 8 possible ALU operations. Code 2 means the ALU will simply pass the
value of the A latch through, and disregard the input from the B latch. Code 3 is similar, except that the
value being passed is 1s complemented while going through the ALU. Code 6 increments the A input by
1. It can be used to increment any register but it has been included mostly to increment the PC. Code 7
decrements the A input by 1. It can be used to decrement any register but it has been included mostly to
decrement the SP.
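
The four ALU codes described above can be modeled in Python as follows. Only codes 2, 3, 6, and 7 are covered, because the other four codes are not described in this passage. The sketch also makes two assumptions that the text does not state explicitly: results wrap around at 16 bits, and the flags are set in the usual way (Z when the result is zero, N when its leftmost bit is 1).

```python
MASK16 = 0xFFFF


def alu(code, a, b):
    """Return (result, z, n) for the four ALU codes described in the text.

    Assumptions: 16-bit wraparound; Z = result is zero; N = sign bit set.
    The b argument is accepted but ignored by all four codes.
    """
    if code == 2:        # pass the A input through, disregard B
        result = a
    elif code == 3:      # pass the 1s complement of the A input
        result = ~a
    elif code == 6:      # increment the A input by 1
        result = a + 1
    elif code == 7:      # decrement the A input by 1
        result = a - 1
    else:
        raise NotImplementedError("ALU code not described in this passage")
    result &= MASK16
    z = int(result == 0)
    n = result >> 15
    return result, z, n
```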
Exercise 4.3: Programs tend to have bugs in them, and microprograms have a similar nature. When the
microprograms are written for a new, microprogrammed computer they always have bugs. Yet when the
computer is sold, the microprograms must be completely debugged, otherwise no programs will run. How
can the manufacturer guarantee that no microinstructions will contain any invalid values?
The MBR and ENC (enable C) fields specify the destinations of the ALU output. Each of these two
bits can be 0 or 1, allowing for four ways to route the output, as discussed earlier.
The MAR field indicates whether the B latch should be moved to the MAR. This is done when a
microinstruction needs to load the MAR with an address.
The RD and WR fields are set to 1 by any microinstruction that needs to read memory or write to
it. Since memory can only do one thing at a time, it is invalid to set both RD and WR to 1 in the same
microinstruction.
The A, B and C fields are used to select registers. Each field can select 1 of the 8 registers, which is
how a microinstruction selects source registers to be sent to the A and B latches, and a destination register
for the ALU output.
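
Because every microinstruction has the same fixed format, it can be treated as a packed bit string. The Python sketch below packs and unpacks such a word. The field widths are the ones given in the field descriptions; the left-to-right field order is an assumption read off Figure 4.2, and the function names are invented for this illustration.

```python
# (name, width in bits), leftmost field first. The order is an assumption.
FIELDS = [
    ("AXL", 1), ("AXR", 1), ("IRG", 1), ("AMUX", 1), ("JUMP", 2), ("ALU", 3),
    ("MBR", 1), ("MAR", 1), ("RD", 1), ("WR", 1), ("ENC", 1),
    ("C", 3), ("B", 3), ("A", 3), ("ADDR", 8),
]
WORD_BITS = sum(width for _, width in FIELDS)


def pack(**values):
    """Build a microinstruction word; fields that are not named default to 0."""
    unknown = set(values) - {name for name, _ in FIELDS}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    if values.get("RD") and values.get("WR"):
        raise ValueError("RD and WR cannot both be 1")   # memory does one thing at a time
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bit(s)")
        word = (word << width) | value
    return word


def unpack(word):
    """Split a microinstruction word back into a dictionary of fields."""
    fields = {}
    for name, width in reversed(FIELDS):     # peel fields off the right end
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields
```

No opcode decoding is involved: unpack simply slices the word at fixed positions, which mirrors the way the hardware wires each field of the MIR directly to the gates it controls.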
Exercise 4.4: Our microinstructions are 31 bits long, but we know that the word size in modern computers
is a multiple of 8. Why weren’t the microinstructions designed to be 32 bits long?
Exercise 4.5: How big is the control store of our example computer?
4.6 Microinstruction Timing
[Figure: the control store (256×31) feeding the 31-bit MIR, which is loaded on clock line 1.]
2. In subcycle 2, two registers are selected, are sent to the A and B buses and, from there, to the two
latches. The voltages at the latches are given the rest of this subcycle to stabilize. Figure 4.5 shows the
details of selecting a register and moving it to the A bus. The 3-bit ‘A’ field in the MIR becomes the input
of the A-decoder. Only one of the 8 decoder outputs can be high, and that one opens a gate (actually, 16
gates) that moves the selected register to the A bus and from there, in subcycle 2, to the A latch.
Figure 4.4a shows the details of moving data (16 bits) from the MBR to the AMUX. This is controlled
by 16 AND gates which are opened when the ‘MBR’ field in the MIR is 1. This field is updated in subcycle
1 so, if it is 1, the MBR will move to the AMUX and from there, (if the AMUX is switched to the left) to
the ALU, during subcycle 2.
The end of subcycle 2 sees stable inputs at the ALU.
3. Subcycle 3 is devoted to the ALU. Notice that there are no gates at the ALU inputs and no way to
enable/disable it. The ALU is always active, it always looks at its inputs and uses them to generate some
output. Most of the time the inputs are wrong, producing wrong output. The trick of using the ALU is to
disregard its output until the moment it is supposed to be correct. The earliest time the ALU inputs are
right is the start of subcycle 3, which is why the control unit should disregard the ALU output until the
start of subcycle 4.
Exercise 4.6: Consider the 3 ‘function’ input lines to the ALU. They tell the ALU which of its 8 operations
to perform. When does the ALU look at these inputs?
Loading the MAR from the B latch is also done during this subcycle. Figure 4.4b shows the details of
moving the B latch to the MAR. This is done through an AND gate (actually 12 gates) that opens only
when subcycle 3 starts AND when the microinstruction selects this particular move (when the ‘MAR’ field
in the MIR is 1).
4. Since the ALU output is now stable, it is only natural to use subcycle 4 for moving the ALU output
to its destination. There are two destinations, the C bus and the MBR.
Figure 4.4a shows the details of moving the ALU to the MBR. This move is controlled by 16 gates which
open only when the ‘MBR’ field in the MIR is 1 AND it is subcycle 4.
Figure 4.5 shows the details of moving the C bus to one of the registers. Notice that the C-decoder is
normally disabled. It is enabled only when the ‘ENC’ field in the MIR is 1 AND it is subcycle 4.
[Figure: register gating detail (the SP register, the subcycle 2 clock line, the C bus, and the A latch).]
A microinstruction jump also takes place during subcycle 4. A jump in a program is always done by
resetting the PC. Similarly, a jump in a microprogram is done by resetting the MPC. Figure 4.6 shows the
details at the MPC. The value of the MPC is always incremented by 1 and is sent back to the MMUX
multiplexor, from which it goes back to the MPC in subcycle 4 (the MMUX is normally switched to the
left). This is true for all microinstructions that don’t need to jump. If a microinstruction decides to jump,
it simply switches the MMUX to the right, resetting the MPC to the value of the 8-bit ‘ADDR’ field in the
MIR.
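
The choice made by the MMUX can be written as a small Python function. The numeric encoding of the 2-bit JUMP field used here (0 for no jump, 1 for jump if N, 2 for jump if Z, 3 for an unconditional jump) is an assumption made for the sketch; the text only says that these four cases exist. The wraparound of the 8-bit MPC from 255 to 0 is likewise assumed.

```python
# Assumed encoding of the 2-bit JUMP field (not specified in the text).
NO_JUMP, JUMP_IF_N, JUMP_IF_Z, JUMP_ALWAYS = 0, 1, 2, 3


def next_mpc(mpc, jump, addr, n, z):
    """Return the next MPC value: ADDR if the jump is taken, else MPC + 1."""
    take_jump = (
        jump == JUMP_ALWAYS
        or (jump == JUMP_IF_N and n == 1)
        or (jump == JUMP_IF_Z and z == 1)
    )
    if take_jump:
        return addr & 0xFF          # MMUX switched to the right: ADDR field
    return (mpc + 1) & 0xFF         # MMUX switched to the left: incremented MPC
```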
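[Figure 4.6: Details at the MPC: the MMUX, the +1 incrementer, the 8-bit MPC (loaded on clock line 4), the control store, and the MIR.]
[Figure: the micro sequence logic, with inputs from the ALU flags N and Z and from the MIR, controlling the MMUX.]
[Figure: two ways of decoding the opcode in the IR: (a) an opcode decoder selecting execution circuits and (b) a ROM, addressed by the opcode, that supplies an address to the MPC.]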
In our example computer, no ROM is used. Instead, the 4-bit opcode is moved from the IR to the
MPC each time the next instruction is fetched from main memory. Our opcodes (Table 4.12) have values 1
through 12, implying that the next microinstruction will be fetched from one of these locations (locations 1
through 12) in the control store. All that’s necessary is to place ‘goto’ microinstructions at those locations,
that will go to the start of the right microprograms. The PUSH instruction looks different in Table 4.12
since it has a 6-bit opcode. However, Section 4.9 shows how the last 2 bits of this opcode are isolated and
analyzed by microinstructions. The control unit can therefore treat this instruction as if it had the 4-bit
opcode 1011.
Figure 4.9 shows the details of moving the opcode of the current machine instruction from the IR to the
MPC through the MMUX. The four leftmost bits of the IR are moved to the MMUX, through AND gates
followed by OR gates. The microinstruction which does that has to satisfy the following three conditions:
(1) it has to open the AND gates; (2) it should have an ADDR field of all zeros, so as not to send anything
through the OR gate; (3) it has to make sure that the MMUX is flipped to the right. All three conditions are
achieved by the simple microinstruction ‘irg; goto 0;’ (see Section 4.9 for this notation). A little thinking
should convince the reader that a ‘goto 0;’ combined with an ‘irg’ does not cause a jump to location
zero of the control store. Rather it moves the eight bits 0000pppp (where pppp are the opcode bits from
the IR) to the MMUX and from there, to the MPC. The MPC is thus reset to 0000pppp, causing the next
microinstruction to be fetched from location pppp of the control store. Since pppp is a 4-bit number, it can
be between 0 and 15 (actually it should be between 1 and 15, see below). Locations 1 through 15 of the
control store should therefore be reserved for special use. Location L should contain the microinstruction
‘goto P’ where P is the start address, in the control store, of the microprogram for the machine instruction
whose opcode is L. Location 0 of the control store is reserved for a ‘goto’ microinstruction that starts the
fetch sequence of the first machine instruction (Section 4.9).
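The selection of the next MPC value described above can be sketched in Python. This is only a rough model of the gates, not the actual circuit: the function name `next_mpc` and its parameters are invented here for illustration, and the AND and OR gates are modeled as bitwise operations.

```python
def next_mpc(mpc, addr, jump, irg, ir):
    """Return the next 8-bit MPC value (hypothetical model of the MMUX).

    mpc  -- current microprogram counter
    addr -- 8-bit ADDR field of the microinstruction in the MIR
    jump -- True when the MMUX is switched to the right (a jump is taken)
    irg  -- True when the IRG bit opens the AND gates on the IR opcode
    ir   -- 16-bit instruction register
    """
    if not jump:
        return (mpc + 1) & 0xFF               # MMUX left: sequential execution
    # The AND gates pass the four leftmost IR bits only when irg is set.
    opcode = ((ir >> 12) & 0xF) if irg else 0
    # The OR gates merge ADDR with 0000pppp on the way to the MMUX.
    return (addr | opcode) & 0xFF
```

With the IR holding 3ABC hex, `next_mpc(37, 0, True, True, 0x3ABC)` yields 3, so ‘irg; goto 0;’ lands on control store location 3 and not on location 0.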
Figure 4.9: Moving the opcode to the MPC
Figure 4.10: Moving a constant to the AX register
The next feature discussed in this section has to do with the use of constants. When writing machine
instructions, many computers offer the immediate mode (Section 2.6), making it easy to use constants. Our
microinstructions do not use any addressing modes and can only use the existing registers and gates. Since
constants are useful, one more feature has been added to our microinstructions, making it possible to create
a 16-bit constant in the AX register. The two fields AXL and AXR of every microinstruction can be used
to move the 8-bit ADDR field to the left or right halves of the AX register. Recall that the ADDR field
normally contains a jump address; it now has a second use as a constant. Figure 4.10 shows the details of
the gates involved in this move.
Exercise 4.7: When does this move occur (in what subcycle)?
Figure 4.11 shows the entire control path. Areas surrounded by circles or ellipses indicate details shown
in previous diagrams.
[Figure 4.11: The entire control path]
and ‘r3:=lshift(inv(r3));’, where band stands for “boolean AND”, inv, for “inverse” and lshift, for
“left shift.” The phrase ‘alu:=r1;’ means selecting R1, moving it to the A latch and, from there, to the
ALU, where it is passed through without any processing (ALU function-code 2) and then discarded. Nothing
is moved to either the C bus or the MBR. The result of such an operation is new values of the status flags
Z and N, so such a phrase is usually combined in the microinstruction with a conditional jump phrase.
Exercise 4.8: What notation should be used to move R2 to both the C bus and the MBR?
A jump is specified by either ‘goto abc;’ or ‘if N goto xyz;’.
The microinstruction ‘axr 127;’ moves the 8-bit constant 127 (= 01111111 in binary) to the right half of AX,
without modifying the left half. If this is followed by ‘axl 0;’, then the AX register is loaded with 9 zeros
on the left followed by 7 ones on the right; a number useful as a mask to isolate 7 bits out of 16. Another
example is the microinstruction ‘axl axr 15;’. It loads the AX with the 16-bit pattern ‘0000111100001111’.
We assume that the microassembler can handle binary and hex constants, in addition to decimal. Thus the
microinstructions ‘axl 15;’, ‘axl F’X;’ and ‘axl 1111’B;’ are all identical.
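The effect of the AXL and AXR fields on the AX register can be checked with a few lines of Python. The functions `axl` and `axr` below are illustrative stand-ins for the two gates, assuming a 16-bit AX whose left half is the high-order byte.

```python
def axl(ax, const):
    """Load an 8-bit constant into the left (high) half of AX."""
    return ((const & 0xFF) << 8) | (ax & 0x00FF)

def axr(ax, const):
    """Load an 8-bit constant into the right (low) half of AX."""
    return (ax & 0xFF00) | (const & 0xFF)

ax = 0xABCD                   # arbitrary previous contents of AX
ax = axr(ax, 127)             # 'axr 127;' changes only the right half
ax = axl(ax, 0)               # 'axl 0;' clears the left half
mask7 = ax                    # 9 zeros followed by 7 ones

both = axl(axr(0, 15), 15)    # 'axl axr 15;' loads both halves
```

Printing `format(mask7, '016b')` and `format(both, '016b')` shows the two bit patterns mentioned in the text.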
Table 4.13 shows some microinstructions written in this notation, together with their numeric fields.
Microinstruction        AXL  AXR  IRG  AMUX  JUMP  ALU  MBR  MAR  RD  WR  ENC  C  B  A  ADDR
mar:=pc; rd;            0    0    0    x     0     2    0    1    1   0   0    x  6  x  xx
rd; pc:=pc+1;           0    0    0    0     0     6    0    0    1   0   1    6  x  6  xx
ir:=mbr;                0    0    0    1     0     2    1    0    0   0   1    4  x  x  xx
irg; goto 0;            0    0    1    0     3     0    0    0    0   0   0    0  0  0  00
axl 12; alu:=r1;        1    0    0    0     0     2    0    0    0   0   0    x  x  1  12
alu:=r0; if Z goto 10;  0    0    0    0     2     2    0    0    0   0   0    x  x  0  10
r2:=lshift(r2);         0    0    0    0     0     4    0    0    0   0   1    2  x  2  xx
r3:=band(r2,ax);        0    0    0    0     0     1    0    0    0   0   1    3  5  2  xx
r1=mbr:=r2; mar:=pc;    0    0    0    0     0     2    1    1    0   0   1    1  6  2  xx
Table 4.13: Microinstruction examples (x=don’t care)
Exercise 4.9: What is the difference between the two microinstructions ‘r1:=mbr; goto fech;’ and
‘goto fech; r1:=mbr;’ ?
Before we look at the entire microcode, here are some examples of machine instructions and their
microprograms.
1. A direct load, LODD R1,123. The 12-bit x field in the machine instruction is a direct address, and the
contents of memory location x are moved to register R1. This instruction uses the direct mode (Section 2.4)
since the entire 12-bit address is included in the instruction. The microprogram is,
mar:=ir; rd;
rd;
r1:=mbr;
Notice that the MAR is a 12-bit register while the IR has 16 bits. All 16 bits are moved to the B bus
and the B latch, but only the 12 least significant ones are moved from there to the MAR. Notice also how
the ‘read’ signal to memory is maintained during two microinstructions, i.e., during two clock cycles. This
makes sense since memory is always slower than the processor. We assume that our processor can perform
4 operations per cycle, while memory needs 2 cycles to complete one of its operations.
This instruction is basically a memory-read operation, where the result is sent to R1. A real computer
would, of course, have similar instructions for all 4 registers, and also the similar STORE instructions, which
are memory-write.
Exercise 4.10: Write the microprogram for the Store Direct instruction STOD R1.
2. A load in the relative mode, LODR R1,14. The 12-bit x field in the machine instruction is considered
a relative address, and the effective address is the sum of x and the PC.
axl F’X;
axr FF’X; [ax=0FFF hex]
ax:=band(ax,ir); [ax becomes the 12 lsb of the ir]
ax:=ax+pc; [ax becomes the effective address]
mar:=ax; rd;
rd;
r1:=mbr;
Four microinstructions are needed just to compute the effective address. Those familiar with the relative
mode (Section 2.5) may remember that in this mode the displacement is signed, which introduces a slight
problem. After the first 3 microinstructions isolate the 12 least significant bits of the IR and move them to
the AX register, that register contains 0000sxxx...x, where s is the sign bit of the 12-bit relative address.
This sign bit should be moved to the leftmost position to create the pattern sssssxxx...x in the AX.
We conveniently ignore this point since it is tedious to do by our microinstructions. We also ignore the
complications resulting in the ALU from the different sizes of the two registers being added (the AX is 16
bits long, whereas the PC is only 12 bits).
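The sign extension that the microprogram skips is easy to state in a higher-level language. The Python sketch below is not part of our microcode; the function names are made up, and the result is truncated to 12 bits on the assumption that only the 12 least significant bits reach the MAR.

```python
def low12(ir):
    """Isolate the 12 least significant bits, like band(ax, ir) with ax=0FFF."""
    return ir & 0x0FFF

def sign_extend12(x):
    """Turn 0000sxxx...x into sssssxxx...x (12-bit to 16-bit two's complement)."""
    x = low12(x)
    return (x | 0xF000) if (x & 0x0800) else x

def effective_address(ir, pc):
    """PC-relative effective address, truncated to the 12-bit MAR width."""
    return (sign_extend12(ir) + pc) & 0x0FFF
```

For example, a displacement field of FFE hex represents -2, so with the PC at 100 hex the effective address is 0FE hex.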
3. Indirect load, LODN R1,13. The 12-bit x field in the machine instruction is considered an indirect
address, and the effective address is the content of m[x] (we denote EA=m[x]).
mar:=ir; rd; [indirect address is sent to memory]
rd;
ax:=mbr; [ax becomes the effective address]
mar:=ax; rd; [effective address is sent to memory]
rd;
r1:=mbr;
Exercise 4.11: Why is it necessary to read the effective address to the AX register and then send it from
there to the MAR? Why not move the EA directly from the MBR to the MAR with a microinstruction such
as ‘mar:=mbr’?
4. Immediate load, LODI R1,17. The 12-bit x field in the machine instruction is considered an immediate
quantity to be moved to R1. There is no effective address and no memory operations.
axl F’X;
axr 255; [ax=0FFF hex]
r1:=band(ax,ir); [r1 becomes 0000+the 12 lsb of the ir]
Exercise 4.12: The next important addressing mode, after direct, relative, indirect and immediate, is the
index mode. Design a reasonable format for the ‘LODX R1,indx’ (Load Index) instruction, and write the
microprogram.
5. Increment SP, INSP 12. This instruction uses an 8-bit constant, which has to be extracted from the
IR before any addition can take place. The microprogram is straightforward.
axl 0;
axr FF’X;
ax:=band(ax,ir);
sp:=sp+ax;
Exercise 4.13: The previous examples show that the AX register is used a lot as an auxiliary register.
What other registers can the microinstructions use as auxiliaries?
6. Push a register into the stack, PUSH Ri. This is an example of an instruction with a 6-bit opcode.
The opcode starts with 1011, followed by two more bits that indicate the register to be pushed into the
stack. The microprogram is much longer than the average, since 9 microinstructions are necessary just to
isolate the extra two opcode bits and decide what register to use.
sp:=sp-1;
axl 00001000’B;
axr 0; [ax:=0000100...0, a mask]
ax:=band(ax,ir); if Z goto r0r1; [isolate bit ir(11)]
r2r3: axl 00000100’B; [ir(11)=1, check ir(10)]
axr 0;
ax:=band(ax,ir); if Z goto r2;
r3: mbr:=r3; mar:=sp; wr; goto writ;
r2: mbr:=r2; mar:=sp; wr; goto writ;
r0r1: axl 00000100’B; [ir(11)=0, check ir(10)]
axr 0;
ax:=band(ax,ir); if Z goto r0;
r1: mbr:=r1; mar:=sp; wr; goto writ;
r0: mbr:=r0; mar:=sp; wr;
writ: wr; goto fech; [Done!]
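The decision tree of this microprogram can be mirrored in Python to see why so many microinstructions are needed. The sketch below is only an illustration (both function names are invented); it tests bits ir(11) and ir(10) with the same two masks and compares the result with a direct shift, an operation our microinstructions do not have.

```python
MASK11 = 0b0000100000000000    # same pattern as 'axl 00001000'B; axr 0;'
MASK10 = 0b0000010000000000    # same pattern as 'axl 00000100'B; axr 0;'

def push_register(ir):
    """Follow the labels r2r3, r0r1, r0..r3 of the PUSH microprogram."""
    if (ir & MASK11) == 0:                  # Z set: goto r0r1
        return 1 if (ir & MASK10) else 0    # r1, or Z set: goto r0
    return 3 if (ir & MASK10) else 2        # r3, or Z set: goto r2

def push_register_direct(ir):
    """What a shift-and-mask would give in one step."""
    return (ir >> 10) & 0b11
```

Both functions return the same register number for every PUSH instruction, i.e., for every IR that starts with 1011.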
Exercise 4.14: Show how an ADD instruction can be included in our instruction set. Write the micropro-
grams for two types of ADD; an ADD R1, which adds m[x] to register R1, and an ADD R1,Ri, which adds
register Ri to R1.
One more thing should be discussed before delving into the details of the microprograms, namely what
to do at the end of a microprogram. A look at Table 4.14 shows that every microprogram ends with a ‘goto
fech’. Label ‘fech:’ starts a microprogram that fetches the next machine instruction. This is another
important feature of microprogramming. The control unit does not need a special circuit to fetch the next
machine instruction, since this operation can easily be done by a short microprogram.
The microcode is thus one big loop. It starts at control store location 0, where a ‘goto fech’ causes the
very first machine instruction to be fetched. Following that, the microcode moves the opcode to the MPC,
resetting the MPC to a value between 1 and 12, and causing a jump to the start of the right microprogram.
After executing the microprogram, a ‘goto fech’ is executed, sending control to the microprogram that
fetches the next machine instruction.
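The loop structure of the microcode resembles a small interpreter. The following Python sketch shows the idea for two instructions only. It is a simplification under stated assumptions: opcode 4 for LODI follows Table 4.14, opcode 1 for LODD is assumed, each microprogram is collapsed into one Python function, and the loop simply stops after a given number of machine instructions.

```python
def run(memory, steps):
    """Emulate `steps` machine instructions; return the final (R1, PC)."""
    state = {"pc": 0, "ir": 0, "r1": 0}

    def lodd(s):                        # direct load: r1 := m[x]
        s["r1"] = memory[s["ir"] & 0x0FFF]

    def lodi(s):                        # immediate load: r1 := x
        s["r1"] = s["ir"] & 0x0FFF

    # Plays the role of the 'goto' microinstructions at locations 1..15.
    jump_table = {1: lodd, 4: lodi}     # opcode 1 for LODD is an assumption

    for _ in range(steps):
        # 'fech': read the next machine instruction and advance the PC.
        state["ir"] = memory[state["pc"]]
        state["pc"] = (state["pc"] + 1) & 0x0FFF
        # 'irg; goto 0;': the opcode selects the microprogram.
        jump_table[state["ir"] >> 12](state)
        # Each microprogram ends with 'goto fech', i.e., the loop repeats.
    return state["r1"], state["pc"]
```

With `memory = [0x4011, 0x1003, 0, 0x1234]`, one step loads the immediate value 17 into R1, and a second step replaces it with the contents of location 3.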
Exercise 4.15: The microcode is a loop, but there is no HALT microinstruction. How does it stop? Also,
how can we be sure that the loop will start at control store location 0 (i.e., that the MPC will contain zero
when the computer starts)?
Table 4.14 shows the entire microcode for our example. It is not hard to read since the machine
instructions are simple.
2: goto lodr;
3: goto lodn;
4: goto lodi;
5: goto jmp;
6: goto ijmp;
7: goto jzer;
8: goto jneg;
9: goto call;
10: goto ret;
11: goto push;
12: goto insp;
sp:=sp+ax; goto fech;
[end of microcode]
Table 4.14. The microcode
Exercise 4.16: The short sequence
axl F’X;
axr FF’X;
ax:=band(ax,ir);
occurs often in our microcode. When such a thing happens in a program, we write it as a procedure. Discuss
the use of procedures in the microcode.
The following should be noted about the microcode.
1. Microprograms should be efficient, squeezing maximum performance out of every last nanosecond,
but they don’t have to be readable. Remember that they are written and debugged once, then used often,
perhaps over years, without any changes.
2. One example of microprogram efficiency is the microprogram for JZER which says ‘if Z goto jmp;’.
It branches to the microprogram for JUMP, which saves one microinstruction in the microcode. Another
example is ‘pc:=pc+1; rd;’. It increments the PC while waiting for the slow memory. Our microcode has
quite a few microinstructions that say ‘rd’ or ‘wr’ and just wait for a memory operation to complete. A
real microprogrammed computer can perform many useful background tasks while waiting for slow memory.
Examples are: refresh dynamic RAM, check for enabled interrupt lines, refresh the time and date displayed
on the screen.
3. Another example of how the microprograms can be made more efficient is to start fetching the next
machine instruction before the current microprogram is complete. As an example, here is a modified version
of the microprogram for LODD.
lodd: mar:=ir; rd; [load direct]
rd;
r1:=mbr; mar:=pc; rd; goto f1;
The last line finishes the load operation by saying ‘r1:=mbr;’, and also starts fetching the next machine
instruction by saying ‘mar:=pc; rd;’. It then goes to ‘f1’ instead of to ‘fech’. Other microprograms can
be modified in a similar way. This is a primitive form of pipelining, where the control unit performs several
tasks on several consecutive machine instructions simultaneously.
4. Our microprograms tend to be short, since the machine instructions are so simple. Real micropro-
grams, however, may be quite long. A good example is the microprogram for PUSH. It is 17 microinstructions
long (compared to 4 microinstructions which is the average of all the other microprograms) because it has
to isolate and check individual bits.
Exercise 4.17: A microinstruction can only perform certain types of data movements, since the data path
(Figure 4.1) is limited and does not allow arbitrary register transfers. Microinstructions such as ‘r1:=r2;’
or ‘mbr:=r2;’ perform one move each. It’s been mentioned earlier that the notation ‘r1=mbr:=r2;’ can be
used to move R2 to both R1 and the MBR (notice the use of the equal sign ‘=’ and the assignment operator
‘:=’). The question is, can one microinstruction perform 3 data moves?
Exercise 4.18: Consider adding a ‘shift’ instruction to our small instruction set. The instruction should
have a 4-bit opcode followed by a 12-bit x field. It should shift R1 to the left x positions. Discuss the
problem of writing a microprogram for this instruction.
4.10 Final Notes
One way to intuitively understand microprogramming is to consider it an additional level in the computer.
It is possible to consider the computer a multilevel machine where the lowest level, level 0, is the hardware
and each level uses the ones below it. In a conventional (hardwired) computer, the level above the hardware,
level 1, is the machine instructions. If we write a program in machine language, we need the hardware of
level 0 to execute it. Level 2 is the assembler, which isn’t hardware, but is very useful. Level 2 is used
to translate programs into level 1, to be executed by the hardware of level 0. Level 3 is the higher-level
languages.
With this picture in mind, we can consider microprogramming an additional level, inserted between
the hardware and the machine instructions. In a microprogrammed computer, the hardware constitutes
level 0, the microinstructions, level 1, the machine instructions constitute level 2, and the assembler makes
up level 3. The hardware no longer executes the machine instructions but the microinstructions. The
microinstructions, in turn, execute the machine instructions, except that the word “execute” isn’t the right
one. There is a difference between the way the hardware executes the microinstructions and the way they
execute the machine instructions. We therefore say that the hardware interprets the microinstructions,
and they emulate the machine instructions. The word “simulate” is reserved for simulation by instructions
(software), while the word “emulate” is used to indicate simulation by microinstructions (sometimes called
firmware).
Exercise 4.19: How about the assembler and the microassembler? Do they simulate or emulate?
In principle it is possible to write a compiler that would translate a program in a higher-level language
directly into microinstructions. Such a compiler would generate a microprogram for each machine instruction
that it normally generates. The trouble with this approach is that the microprograms would have to be stored
in a large memory (since programs can be very large) and large, fast memories are still very expensive. A
large slow memory would result in slow execution. This is why contemporary computers use the paradigm
of compiler translation, followed by microcode emulation, in a double-memory configuration (a large, slow
main memory, and a small, fast control store).
One promising approach is RISC. A RISC machine has simple instructions, an approach that makes
the instructions easy to execute by hardware but requires more instructions to do the same job. A RISC
computer, by definition, does not use microprogramming, which makes it faster and more than compensates
for the larger number of instructions that have to be fetched and executed. (Some microprogrammed
computers are advertised as RISC, but experts agree that “pure” RISC should not utilize microprogramming.
A microprogrammed computer can therefore at best be called “RISC-like.”)
Another interesting point to consider is speeding up the microprogrammed computer by adjusting the
subcycles. Up until now we assumed that each cycle is the same length and is divided into 4 subcycles in
the same way. This is clearly not efficient. ALU operations require different times, so subcycle 3 should
only last for as long as the ALU needs. Subcycle 4 may require different times depending on how the ALU
output is routed and whether a jump is necessary.
How can we change the duration of subcycles? Speeding up and slowing down the clock frequency is
not practical since the clock is driven by a crystal that gives it precise frequency. The solution is to use a
faster clock and have each subcycle last a certain number of clock cycles. We now use the term “period”
instead of subcycle, so every microinstruction is fetched and executed in 4 periods, each lasting a certain
number of clock cycles (Figure 4.15). Period 3, for example, may last between say, 1 and 20 clock cycles on
our example computer, depending on the ALU operation.
How does the control unit know how long each period should be? One approach is to add 4 fields
to each microinstruction, containing the lengths of the 4 periods for the microinstruction (in clock cycles).
The microprogrammer should know how long each ALU operation takes, and how long period 4 takes
for every possible output move and every possible jump. That information should be determined by the
microprogrammer and added to every microinstruction in the microcode. Another approach is to have all
the timing information stored in a ROM (or in a PLA). The ALU function code is sent to the address bus
of the ROM at the start of period 3, and out, on the data bus of the ROM, comes the necessary duration of
the period.
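The ROM approach amounts to a table lookup, which the Python sketch below imitates. All the numbers are invented for illustration (they are not the timings of our example computer), and the three other periods are lumped together into one assumed constant.

```python
# Hypothetical durations of period 3, in clock cycles, indexed by the
# 3-bit ALU function code. The values are invented for illustration.
TIMING_ROM = [1, 2, 1, 4, 2, 8, 3, 20]

def period3_cycles(alu_code):
    """The ALU function code is the ROM address; the duration is the data."""
    return TIMING_ROM[alu_code & 0b111]

def variable_total(alu_codes, other_periods=3):
    """Clock cycles used when period 3 lasts only as long as needed."""
    return sum(other_periods + period3_cycles(c) for c in alu_codes)

def fixed_total(alu_codes, other_periods=3):
    """Clock cycles used when period 3 always fits the slowest operation."""
    return len(alu_codes) * (other_periods + max(TIMING_ROM))
```

For the four function codes 2, 6, 2, 4 these made-up timings give 19 clock cycles with variable periods against 92 with fixed ones, which illustrates the trade-off between speed and hardware complexity.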
A microprogrammed computer designed for speed should use this approach even though it complicates
the hardware and results in a higher price for the computer. Computers designed to be inexpensive normally
use fixed-size periods, so each period should be long enough to accommodate the slowest operation possible
in that period.
The next important point to discuss has to do with the encoded fields in our microinstructions. The
‘A’, ‘B’, ‘C’, ‘ALU’ and ‘JUMP’ fields are encoded, and have to be decoded when the microinstructions are
executed. We have seen the A-, B- and C decoders. The ALU has its own internal decoder, to decode its three
“function” input lines. The two bits of the ‘JUMP’ field are decoded by the gates in the micro-sequencing-
logic circuit. It is possible to redesign our microinstructions without any encoded fields (Section 4.11). The
‘A’, ‘B’ and ‘C’ fields would be 8 bits wide each. The ‘ALU’ field would be 5 bits wide (even though our
ALU can execute 6 functions) and the ‘JUMP’ field would be 3 bits wide. This would eliminate the need for
decoders and would consequently speed up the hardware. The microinstructions would be longer (51 bits
instead of the present 31). Such an approach to microprogramming is called horizontal because the control
store can be visualized as wide and flat (Figure 4.16a). The opposite approach is vertical microprogramming,
where a microinstruction has just a few, highly encoded fields. The microinstruction is short, but more of
them are needed for the microcode. The control store looks like Figure 4.16b.
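The difference between an encoded field and an explicit one can be shown with a few lines of Python. This is a sketch, assuming a 3-bit register number and 8 select lines; the function names are not part of the notes.

```python
def decode_field(code):
    """What a decoder does: a 3-bit register number selects 1 of 8 lines."""
    return 1 << (code & 0b111)

def encode_field(select_lines):
    """The reverse mapping, valid only when exactly one line is active."""
    if select_lines == 0 or (select_lines & (select_lines - 1)) != 0:
        raise ValueError("exactly one select line must be active")
    return select_lines.bit_length() - 1
```

A vertical microinstruction stores the 3-bit value and pays for the decoding at execution time; a horizontal one stores the 8-bit result of `decode_field` directly, which is why its fields are wider.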
The fields of a vertical microinstruction typically include an opcode, source operand (a register or a
memory address), destination operand (the same), ALU function, and jump. All are encoded and require
decoders to execute the microinstruction. Tanenbaum’s book discusses such an example on pp. 194–199.
Vertical microinstructions are generally less powerful than horizontal ones, so more are needed. Horizontal
microinstructions are longer and harder to write and debug. This is why most real microprogrammed
computers use microinstructions that are in-between these two extremes. A typical microinstruction has
some encoded fields and some explicit (decoded) ones. This is why our example, while being simple, is also
representative.
Figure 4.16. Horizontal and vertical control stores
In the late 1960s, a research team under Robert Rosin at the State University of New York at Buffalo
came up with a combination of the two methods. They envisioned a design where the hardware fetches and
executes extremely simple horizontal nanoinstructions from a nanostore. The nanoinstructions are organized
into nanoprograms, each for the execution of a microinstruction. The microinstructions, which are vertical,
are thus fetched and executed by the nanoinstructions. The microinstructions themselves fetch and execute
the machine instructions.
This approach adds two levels to the computer, and is therefore slow. Its main advantage is that it is
now easy to change the microinstructions! Changing the microinstructions is like changing the instruction
set, something that users never do. It may, however, be useful for a computer manufacturer or a research
lab, where one computer could be used to develop, compare and implement different instruction sets. Two
computers based on those ideas, the QM-1 and QM-2, were actually built and sold in small quantities during
the 1970s by Nanodata Inc. Two references describing this approach are [Rosin 69] and [Salisbury 76].
The terms nanoprogramming, nanoinstructions and nanostore are also used to describe another approach
to microprogramming. The nanostore contains the set of all different microinstructions. The control store
contains pointers to the microinstructions in the nanostore. Thus a microprogram in the control store is
a set of pointers to microinstructions in the nanostore. This approach is slow since it is dual-level, but it
may make sense in cases where the microinstructions are very long (horizontal) and there are relatively few
different microinstructions.
This idea can be generalized to the case where several microinstructions are very similar, and can be
made identical by changing one field into a parameter. Imagine a set of microinstructions that have a field
indicating a register. One representative microinstruction from this set, with zeros in the register field, is
placed in the nanostore. When a microprogram needs one of these microinstructions for, say R3, it places a
pointer and the parameter 3 in the control store (Figure 4.17).
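The pointer-plus-parameter scheme can be sketched as follows. The dictionary layout is purely hypothetical (a real nanostore holds bit patterns); the field values of the template are those of ‘r2:=lshift(r2);’ in Table 4.13 with the register fields zeroed.

```python
# One representative microinstruction, with zeros in the register fields.
NANOSTORE = [
    {"alu": 4, "enc": 1, "c": 0, "a": 0},    # template for 'Rn:=lshift(Rn);'
]

# Each control store entry is a (pointer, register parameter) pair.
CONTROL_STORE = [(0, 3), (0, 2), (0, 1)]

def expand(entry):
    """Build the full microinstruction from a pointer and a parameter."""
    pointer, reg = entry
    micro = dict(NANOSTORE[pointer])    # copy, so the template stays intact
    micro["c"] = reg                    # destination register field
    micro["a"] = reg                    # source register field
    return micro
```

Here `expand(CONTROL_STORE[0])` produces the fields of ‘r3:=lshift(r3);’ from the single template in the nanostore.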
The last important point to mention is that some modern microprogrammed computers are only partially
microprogrammed. In such a computer, the control unit has execution circuits for those machine instructions
that are simple to execute, and a control store for the more complex instructions. This results in relatively
simple hardware, fast execution for the simple instructions and slow execution for the complex ones. The
ALU, registers and the entire control unit, including the control store, are fabricated on one small chip, so
users don’t even know if the computer is microprogrammed, and to what extent.
[Figure: the horizontal microinstruction format, with fields AXL, AXR, IRG, AMUX, JUMP (3), ALU (5), MBR, MAR, RD, WR, C (8), B (8), A (8) and ADDR (8).]
[Figure 4.20: the micro sequence logic, with the N and Z flags from the ALU and the L, R and A bits from the MIR.]
The three bits of the JUMP field are denoted L, R and A in Figure 4.20, which stand for “left” (jump
if Z=1), “right” (jump if N=1) and “always”. Only 5 bits are shown for ALU operations but they can be
increased to 8 or any other number.
The microcode does not change, which shows that our original example was very close to horizon-
tal. Converting this example to pure vertical would result in completely different microinstructions, and
consequently different microcode. The interested reader should consult Tanenbaum, pp. 195–201.
There are—and, for the foreseeable future, will be—applications requiring faster computers than we
have today.
Conventional (sequential) computers are reaching fundamental physical limits on their speed.
With VLSI devices, it is possible to produce a computer with hundreds or even thousands of processors,
at a price that many users can afford.
Parallel computers can provide an enormous increase in speed, using standard components available
today.
5.2 Classifying Architectures
This short section discusses coarse-grained and fine-grained architectures, scalar, superscalar and vector
architectures, and the Flynn classification.
A parallel computer with coarse-grained architecture consists of a few units (processors or PEs), each
relatively powerful. A fine-grained architecture consists of a large number of relatively weak units (typically
1-bit PEs). The concept of graininess is also used to classify parallel algorithms. A coarse-grained algo-
rithm involves tasks that are relatively independent and communicate infrequently; a fine-grained algorithm
involves many simple, similar subtasks that are highly dependent and need to communicate often. Often,
they are the individual repetitions of a loop. The fine-grained algorithms are the ones we consider parallel,
but the concept of graininess is qualitative. It is impossible to assign it a value, so sometimes we cannot tell
whether improving an algorithm changes it from coarse-grained to fine-grained.
A scalar computer is one that can only do one thing at a time. In a superscalar computer there
is hardware that makes it possible to do several things at a time. Examples are (1) executing an integer
operation and a floating-point operation simultaneously, (2) executing an addition and a multiplication at the
same time, and (3) moving several pieces of data in and out of different memory banks simultaneously, using
separate buses. The PowerPC, the Intel i860, and the Pentium are examples of superscalar architectures.
A vector computer has even more hardware, enabling it to operate on several pieces of data at the same
time, in a pipeline located in the ALU.
The Flynn classification is based on the observation that a parallel computer may execute one or more
programs simultaneously, and that it may operate on one or more sets of data. As a result, the following
192 5. Parallel Computers
It can be shared. If, at a certain time, one node needs just a little memory, the other nodes can get
more memory. A while later, the memory allocation can be changed.
Files can be shared between programs. An important data base, for example, used by several programs,
needs to be loaded into memory only once.
A third design issue is general-purpose vs. special-purpose. A general-purpose computer is generally
more useful. A special-purpose computer, however, is much faster for certain applications. Experience shows
that there are major computer users, with special problems, willing to pay for the design and development
of a special-purpose computer. Another aspect of this design issue is the nature of the individual nodes.
They may all be identical, or they may differ, with some designed for special tasks.
5.5 The Hypercube
This is a very common architecture for parallel computers, and it deserves a special mention [Siegel and
McMillan 81]. In a hypercube, the individual nodes are located at the corners of an n-dimensional cube,
with the edges of the cube serving as communication lines between the nodes. The hypercube has two
important properties: (1) it has an intermediate number of connections between the individual nodes; (2)
the nodes can be numbered in a ‘natural’ way.
[Figure 5.1: Three connection schemes: (a) linear, (b) three-dimensional cube, (c) fully connected.]
The first property is easy to understand by comparing the three connection schemes of Figure 5.1. In
Figure 5.1a, four nodes are connected linearly. The number of connections is three (in general, for n nodes,
it is n − 1). In Figure 5.1b, eight nodes are connected in a three-dimensional cube. Each node is connected
to three other ones, and the total number of connections is 8×3/2 = 12. In general, for n nodes, this number
is (n log₂ n)/2. In Figure 5.1c the eight nodes are fully connected. Each is connected to all the other ones.
The total number of connections is 8×7/2 = 28 and, in general, n(n − 1)/2.
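The three connection counts are easy to compare numerically. The short Python sketch below uses function names chosen here for illustration, and assumes that the number of nodes in a hypercube is a power of 2.

```python
def linear_links(n):
    """n nodes in a row need n - 1 connections."""
    return n - 1

def hypercube_links(n):
    """n = 2**d nodes, each with d = log2(n) connections, counted once."""
    d = n.bit_length() - 1
    if n < 1 or n != 1 << d:
        raise ValueError("a hypercube needs a power-of-2 number of nodes")
    return n * d // 2

def full_links(n):
    """Every node connected to all the others."""
    return n * (n - 1) // 2
```

For 1024 nodes the three functions give 1023, 5120 and 523,776 connections, respectively, which shows how quickly full connectivity becomes impractical.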
The more connections between the nodes, the easier it is for individual nodes to communicate. However,
with many nodes (large n), it may be impossible to fabricate all the connections, and it may be impractical
for the nodes to control their individual connections. This is why the hypercube architecture is a compromise
between too few and too many connections.
The second important property of the hypercube has to do with the numbering of its nodes, and is
discussed later. Figure 5.2 shows cubes of 1, 2, 3, and 4 dimensions. Several cube properties immediately
become clear.
An n-dimensional cube is obtained (Figure 5.2a) by generating two (n − 1)-dimensional cubes and
connecting corresponding nodes.
An n-dimensional cube consists of 2^n nodes, each connected to n edges.
[Figure 5.2: Cubes of one, two, three, and four dimensions (a through d), with their node numbers.]
There is a natural way to number the nodes of a cube. The two nodes of a one-dimensional cube are
simply numbered 0 and 1. When the one-dimensional cube is duplicated, to form a two-dimensional cube,
the nodes of one copy get a zero bit appended, becoming 00 and 01, and the nodes of the other copy
get a 1 bit appended in the same way, forming 10 and 11. The additional bit can be appended on either
the left or the right. Figure 5.2c shows how the node numbers are carried over from a two-dimensional to
a three-dimensional cube. Again, the new, third bit can be added on the left, on the right, or in between
the two existing bits. This numbering method is not unique. Numbers can be carried over from the
(n − 1)-dimensional to the n-dimensional cube in several different ways. However, the method has one interesting
property, namely each bit in the node number corresponds to a dimension of the cube. In Figure 5.2c, all
four nodes on the top of the cube have numbers of the form x1x, while the numbers of the four nodes on the
bottom are x0x. The middle bit of the node number thus corresponds to the vertical dimension of the cube.
In a similar way, the leftmost bit corresponds to the depth dimension, and the rightmost bit corresponds to
the horizontal dimension.
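The duplicate-and-connect construction can be written as a short recursive Python function. This is a sketch that always adds the new bit on the left (one of the several choices mentioned above) and represents node numbers as bit strings.

```python
def hypercube(n):
    """Return (nodes, edges) of the n-dimensional cube built by duplication."""
    if n == 0:
        return [""], []                      # a single node, no edges
    nodes, edges = hypercube(n - 1)
    zero = ["0" + v for v in nodes]          # first copy gets a 0 bit
    one = ["1" + v for v in nodes]           # second copy gets a 1 bit
    new_edges = ([("0" + a, "0" + b) for a, b in edges]
                 + [("1" + a, "1" + b) for a, b in edges]
                 + list(zip(zero, one)))     # connect corresponding nodes
    return zero + one, new_edges
```

Calling `hypercube(3)` produces 8 nodes and 12 edges, and the two node numbers on every edge differ in exactly one bit.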
Exercise 5.1: How can we use our numbering method to generate different node numbers for the four-
dimensional cube in Figure 5.2d?
This property helps in sending messages in a hypercube computer. When node abcd in a four-dimensional
cube decides to send a message to its ‘next door’ neighbor on the second dimension, it can easily figure out
that the number of this neighbor is ab'cd, where b' is the complement of bit b.
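Complementing one bit is a single exclusive-OR operation, as the Python sketch below shows. It assumes that dimensions are counted from 1 starting at the leftmost bit of the node number, which matches the example above but is otherwise an arbitrary convention.

```python
def neighbor(node, dimension, n):
    """Number of the neighbor of `node` along `dimension` in an n-cube."""
    return node ^ (1 << (n - dimension))    # flip exactly one bit

def neighbors(node, n):
    """All n 'next door' neighbors of a node."""
    return [neighbor(node, d, n) for d in range(1, n + 1)]
```

For node 1011 in a four-dimensional cube, the neighbor on the second dimension is 1111, and `neighbors(0b1011, 4)` lists all four neighbors.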
Exercise 5.2: Summarize the above mentioned properties for a 12-dimensional cube.
[Figure 5.3: Other ways to connect the nodes of a parallel computer (a, b, c).]
It should be noted that there are many other ways to connect individual nodes in a parallel computer.
Figure 5.3 shows a few possibilities. The star configuration is especially interesting since it has both a short
communication distance between the nodes and a small number of connections. It is not very practical,
however, because its operation depends heavily on the central node.
5.6 Array Processors
[Figure: the control unit, the program memory, and the communication lines of an array processor.]
The program memory holds one program, and instructions are fetched, one by one, by the control unit.
The control unit decodes each instruction, executes some of them, and sends the rest to be executed by
the ALUs. Instructions such as Enable Interrupts, JMP, and CALL are executed by the control unit. Other
instructions, involving operations on numbers, are executed by the ALUs. What makes the array processor
special is that each such instruction is executed by all the ALUs simultaneously. The ALUs, of course,
operate on different numbers. The communication lines can be used to transfer data between the ALUs.
A word about terminology. Since the ALUs operate on different data, each should also have its own
registers and data memory. An ALU with registers is usually called a PE (processing element). A PE with
data memory (PEM) is called a PU (processing unit).
The principle of operation of this computer is that all the PUs execute the same instruction simul-
taneously (they operate in lockstep), but on different data items. This, of course, makes sense for certain
applications only. Examples of applications that don’t lend themselves to this kind of execution are: text
processing, the compilation of programs, and most operating systems tasks. Examples of applications that
can benefit from such execution are: payroll, weather forecasting, and image processing.
In a typical payroll application, the program performs a loop where it inputs the record of an employee,
calculates the amount of net pay, and writes the paycheck to the output file, to be printed later. The total
processing time is, therefore, proportional to the number of employees M .
An array processor also involves a loop, but it starts by reading the records of the first N employees
(where N is the number of PUs) into the data memories of the PUs. This input is performed serially, one
record at a time. The array processor then executes a program to calculate the net pay which, of course,
calculates N net pays simultaneously. The last step is to write N paychecks to a file, one at a time, to be
eventually printed. The entire process is repeated for the next group of N employees. The last group may,
of course, contain fewer than N employee records. The main advantage of this type of processing is that the
time spent on the calculations is proportional to M/N rather than to M (the input and output are still
performed one record at a time). However, the following points should be carefully considered, since
they shed important light on the operation of the array processor.
The input/output operations are not performed by the PUs. These operations can be performed by
the control unit, but most array processors have a conventional computer, called a host, attached to them,
to perform all the I/O and the operating system tasks. The host computer treats the PE memories (and the
program memory) as part of its own memory, and can thus easily move data into and from them.
An interesting result is that an array processor can be viewed in two different ways. One way is to view
the array as the central component, and the host computer as a peripheral. The other way is to think of the
array as a peripheral of the host computer. Figure 5.5 illustrates the two points of view.
Figure 5.5. Two views of an array processor
It is now clear that a program for an array processor includes three types of instructions: control unit
instructions, such as a JMP, PU instructions (operations on numbers), and host instructions (input/output
and memory reference). After fetching an instruction, the control unit decodes it and either executes it or
sends it to the host or the PUs for execution. To simplify the coding process, the control unit should wait
with the next instruction until it is sure that the current instruction has been completed.
In the payroll example above, the last group of employees normally has fewer than N records. In the
case of a payroll, this does not present a problem. The extra PUs perform meaningless operations and end
up with meaningless results, that are ignored by the program. In other cases, however, it is important to
disable certain PUs. Consider, for example, the program fragment (a) below:
      .              .              .              .
      .              .              .              .
      JMP L1         BPL L1         DPL            BPL L1
      ADD ..         ADD ..         ADD ..         BMI L2
      .              .              .              .
      .              .              .              .
  L1  SUB ..     L1  SUB ..         ENA        L1  BCC L3
      .              .              SUB ..         .
      .              .              .              .
     (a)            (b)            (c)            (d)
This is easy to execute, since JMP is a control unit instruction. The control unit executes it by resetting
its PC to the address of the SUB instruction. If, however, the JMP is changed to a conditional jump, such as
a BPL (branch on plus, see (b) earlier), the situation becomes more complicated. BPL means “branch if the
most recent result is positive (if the S flag is zero).” Since the PUs operate on different numbers, some have
positive results while others have negative results. The first group should branch to L1 while the remaining
PUs should continue normally.
This problem is solved by adding a status bit to each PU. The PU can be either enabled or disabled,
according to the value of its status bit. Instead of a BPL, the compiler generates a DPL instruction (disable if
plus), that is sent to all the PUs. All PUs with an S flag of zero disable themselves. The ADD instruction and
the instructions that follow are only executed by the PUs that remain enabled. At point L1 the compiler
should generate an ENA (enable all) instruction, to enable all the PUs. This is case (c) earlier.
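The effect of DPL and ENA in fragment (c) can be imitated with a list of enable bits, one per PU. This is only a sequential Python sketch of the idea. The function name and the operands are hypothetical, and a result of zero is treated as positive because its sign bit is zero.

```python
def run_fragment(results, add_operand, sub_operand):
    """Simulate fragment (c) on a list of PUs, one accumulator per PU.

    results: the most recent result in each PU (its sign plays the
    role of the S flag).
    """
    acc = list(results)
    # DPL: every PU whose last result is positive (S flag = 0) disables itself.
    enabled = [value < 0 for value in acc]

    # ADD: sent to all PUs, but executed only by those still enabled.
    acc = [a + add_operand if on else a for a, on in zip(acc, enabled)]

    # ENA: enable all the PUs again (this is point L1).
    enabled = [True] * len(acc)

    # SUB: executed by every PU.
    acc = [a - sub_operand if on else a for a, on in zip(acc, enabled)]
    return acc


print(run_fragment([5, -3, 0, -8], add_operand=10, sub_operand=1))
# [4, 6, -1, 1]
```

The PUs holding 5 and 0 skip the ADD, exactly as if they had taken the branch to L1, while all four PUs execute the SUB.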
What if the program employs a complex net of conditional branches, such as in (d) earlier? In such a
case, there may not be any way to compile it. This is one reason why array processors are special-purpose
computers. They cannot execute certain programs that conventional computers find easy.
An important class of applications for array processors is image processing. With many satellites and
space vehicles active, many images arrive at receiving stations around the world every day, and have to be
processed.
An image taken from space is digitized, i.e., made up of many small dots, called pixels, arranged in rows
and columns. The picture is transmitted to Earth by sending the values of the pixels row by row, or column
by column. The value of a pixel depends on the number of colors sensed by the camera. It can be as little
as a single bit, in case of a black and white (monochromatic) image. If the image contains many colors, or
many shades of grey, the value of each pixel consists of several bits.
Image processing may be necessary if certain aspects of the image should be enhanced. A simple example
is contrast enhancement. To achieve more contrast in an image, the program has to loop over all the pixels;
bright pixels are made brighter, and dark ones are made darker. Another example is enhancing certain
colors, which involves a similar type of processing.
Image processing is a highly-developed art, and is advancing all the time. Typically, an image processing
program consists of a loop where all the pixels are scanned, and each pixel undergoes a change in value,
based on its original value and the values of the pixels around it. A typical example is:
$$p[i,j] \leftarrow p[i-1,j] + p[i+1,j] + p[i,j-1] + p[i,j+1] - 4\,p[i,j],$$
where i and j go over the rows and columns of the picture, respectively.
On a conventional computer, such a computation is done by a double-loop, over the rows and columns
of the picture. On an array processor, however, the process is different, and much faster. The PUs are
arranged in a two-dimensional grid, with each PU connected to its four nearest neighbors. The image is
read in and pixel p[i, j] is stored in, say, location 32 of the memory of the PU in row i and column j. The
program executes one instruction of the form “get the contents of memory location 32 of your four nearest
neighbors, add them and subtract 4 times your own location 32. Store the result as the new value of your
location 32.” This instruction is executed simultaneously by all PUs, thus processing N pixels in the time
it takes a conventional computer to process one pixel.
Another group of N pixels may be stored in location 33 of all the PUs, and may be processed by a single
instruction, similar to the one above.
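The single instruction quoted above (add the four nearest neighbors and subtract 4 times the PU's own value) can be imitated sequentially as follows. This is a sketch under stated assumptions: the function name is hypothetical, and neighbors that fall outside the picture are taken as 0, since the text does not say how the edges are handled.

```python
def neighbor_step(p):
    """One lockstep update of every pixel, computed from the old image.

    new[i][j] = up + down + left + right - 4 * p[i][j]
    ASSUMPTION: neighbors outside the picture count as 0.
    """
    rows, cols = len(p), len(p[0])

    def old(i, j):
        return p[i][j] if 0 <= i < rows and 0 <= j < cols else 0

    # Every new value is computed from the OLD image, as if all the PUs
    # read their neighbors before any PU stores its result.
    return [
        [old(i - 1, j) + old(i + 1, j) + old(i, j - 1) + old(i, j + 1) - 4 * p[i][j]
         for j in range(cols)]
        for i in range(rows)
    ]


image = [[1, 1, 1],
         [1, 5, 1],
         [1, 1, 1]]
print(neighbor_step(image)[1][1])   # 4 - 20 = -16
```

A conventional computer executes the two nested loops one pixel at a time. On the array processor the body of the loops is what every PU does simultaneously.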
This is a typical example of image processing, and it shows that a computer designed for such applications
should have its PUs arranged in a two-dimensional array, with as many connections between the PUs
as possible. Other applications may require a one-dimensional or a three-dimensional topology, but the
two-dimensional ones are the most common. Figure 5.6 shows a typical array processor featuring a two-
dimensional array of PUs with nearest neighbor connections.
If an application requires more than just nearest neighbor connection, several instructions may be
necessary, each moving data between near neighbors.
5.7 Example: MPP
The Massively Parallel Processor (MPP) is a typical example of an array processor. It is large, consisting of
16K PUs. Each PU, however, is only 1-bit wide. The MPP is thus an example of a fine-grained architecture.
The MPP was designed and built for just one application, image processing of pictures sent by the
Landsat satellites. This is a huge task since each picture consists of eight layers, each recorded by a sensor
sensitive to a different wavelength. Each layer is made of 3000×3000 7-bit pixels. Each picture thus consists
of 8 × 3000 × 3000 × 7 = 504×10^6 bits, and many pictures are recorded each day. Anticipating the huge amount of data processing
needed, NASA, in 1975, awarded a contract to the Goodyear Aerospace company to develop a special-purpose
computer to do the task. The MPP was delivered in 1983.
[Figure 5.7: organization of the MPP (labels: Array Control Unit (ACU), PDP-11 (PDMU), I/O devices, host computer)]
The 16K PUs are arranged in a 128 × 128 grid (Figure 5.7) and are controlled by the Array Control
Unit (ACU). The ACU itself is controlled by the Program and Data Management Unit (PDMU), which is
a PDP-11 computer, and by the host computer, a VAX-11/780. The PDMU receives the compiled program
from the host and sends it to the ACU instruction by instruction. It is also responsible for handling the
disk, printer and other peripherals of the MPP. A third task of the PDMU is to move data from the output
switch back to the input switch.
The host computer compiles MPP programs. It is also responsible for receiving the data from the
satellite and sending it to the staging memory. From that memory, the data goes to the input switches
which hold 128 bits. From the input switches, 128 bits are fed to the leftmost column of the array, and the
program is written such that each PU receives a bit of data from its neighbor on the left, processes it, and
sends it to its neighbor on the right. In normal operation, groups of 128 bits appear at the rightmost column
all the time, and are sent to the output switches, from which they can be sent back to the host computer or
to the staging memory, to be fed again into the leftmost column.
It is interesting to look at the architecture of a PU on the MPP. It is a simple, 1-bit-wide circuit
consisting of six registers, a full adder, a comparison circuit, a logic circuit, and a 1K×1 data memory.
[Figure 5.8: a PU of the MPP (labels: full adder, C, A, B, N-bit shift register, P, G, S, =, logic, data bus, to and from 4 neighbor PUs)]
The main operation performed by a PU (Figure 5.8) is adding numbers. The full adder adds three bits
(two input bits plus an input carry) and produces two bits, a sum and a carry. Numbers of different sizes
can be added using the shift register. It can be adjusted to 2, 6, 10, 14, 18, 22, 26, or 30 bits, and an n-bit
number can be stored by adjusting the shift register size to n − 2 bits, and holding the other 2 bits in the A
and B registers. The other n-bit number to be added should come from the P register, bit by bit.
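Because the full adder handles one bit position per step, an n-bit addition takes n steps, with the carry passed from each step to the next. The sketch below illustrates only this bit-serial principle. It does not model the shift register or the A, B, and P registers of the MPP. The function names are hypothetical, and dropping the final carry is a choice made for this illustration.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three bits; return (sum bit, carry-out bit)."""
    total = a + b + carry_in
    return total & 1, total >> 1


def bit_serial_add(x: int, y: int, n_bits: int) -> int:
    """Add two n-bit numbers one bit per step, least significant bit first.

    The result is truncated to n_bits (the final carry is dropped).
    """
    carry = 0
    result = 0
    for k in range(n_bits):
        s, carry = full_adder((x >> k) & 1, (y >> k) & 1, carry)
        result |= s << k
    return result


print(bit_serial_add(13, 6, 8))   # 19
```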
The P register is used for routing and for logic operations. It is possible, e.g., to instruct all the PUs
to send their P register to their neighbor on the left. Each PU thus sends its P to the left and, at the same
time, receives a new P from its neighbor on the right. The box labeled ‘logic’ can perform all the 16 boolean
operations on two bits. Its inputs are the P register and the data bus. Its single output bit stays in the P
register.
The G register is the mask bit. If G=0, the PU is disabled. It still participates in data routing and
arithmetic operations, but not in the logic operations.
The S register is connected to the S registers of the neighbors on the left and right. The S registers of
the extreme PUs on the left and right of the array are connected to the input and output switches.
The box labeled ‘=’ is used for comparisons. The P and G registers are compared and the result is sent
to the data bus.
5.8 Example: The Connection Machine
This is a very fine-grained SIMD computer with 64K PUs, each 1-bit wide, arranged as a 12-dimensional
hypercube, where each of the 4K nodes is a cluster of 16 PUs. The entire machine is driven by a host
computer, typically a VAX. The Connection Machine was designed by W. Daniel Hillis, as a Ph.D. thesis
[Hillis 85], around 1985. The machine was designed to overcome the main restriction of older SIMD computers,
namely rigid array connections.
Traditionally, the PUs of SIMD computers were connected in a two-dimensional grid. In such a config-
uration, it is easy for a PU to send a piece of data to any of its four nearest neighbors. Sending data to any
other PU, however, requires several steps, in which the data is sent between near neighbors.
The main idea in the design of the connection machine was to make it easy for PUs to send messages to
each other, thereby simulating all kinds of connections. It turns out that this is easy to do in a hypercube,
because of the natural numbering of the nodes. However, before we go into the details of message routing,
let’s look at the main architectural components of the machine.
The 64K PUs are organized as a 12-dimensional cube, with 4K corners called clusters. Each cluster is
thus directly connected to 12 other clusters. A cluster is made up of a central circuit, called a router, and
16 PUs.
Each PU (Figure 5.9) is one bit wide, has sixteen 1-bit registers called flags, and a 4K×1 memory. There
is a simple logic circuit to execute the instructions, but there is no ALU.
[Figure 5.9: a PU of the Connection Machine (labels: 4K-bit memory, logic, A, B, 16 registers, I, O, truth tables, instruction)]
Since there is no ALU, instructions sent to the PU have no opcode. Instead, each instruction specifies
three input bits, two output bits, and two truth tables, one for each output bit. A truth table is made up of
eight bits, specifying the output bit for every possible combination of the three input bits. The instruction
has two 12-bit address fields A and B and two 4-bit register fields I and O. The three input bits come from
locations A, B in memory, and from the register specified by I. The two output bits go to location A and the
register specified by O.
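Examples: The two truth tables 01110001 and 00001111 create two output bits that are the sum and
the carry of the three inputs. (These two tables list the eight input combinations in the order 000, 001,
010, 100, 011, 101, 110, 111; in plain binary order the same two functions are 01101001 and 00010111.)
The truth tables 01111111 and 00000001 create the logical OR and logical
AND of the three inputs.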
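An instruction without an opcode is simply a pair of table lookups. The Python sketch below checks the sum and carry tables quoted in the examples. The function names are hypothetical. The order of the input combinations is an assumption inferred from those examples (combinations sorted by their number of 1 bits), since the text does not state it.

```python
# Order in which the eight input combinations index a truth table.
# ASSUMPTION: inferred from the examples in the text, not stated there.
ORDER = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0),
         (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]


def lookup(table: str, a: int, b: int, flag: int) -> int:
    """Output bit selected from an 8-character truth table string."""
    return int(table[ORDER.index((a, b, flag))])


def execute(first_table: str, second_table: str, a: int, b: int, flag: int):
    """Return the two output bits produced by one instruction."""
    return lookup(first_table, a, b, flag), lookup(second_table, a, b, flag)


# The sum/carry pair from the text behaves as a full adder:
for a, b, f in ORDER:
    s, c = execute("01110001", "00001111", a, b, f)
    assert s == (a + b + f) % 2 and c == (a + b + f) // 2
```

The OR and AND tables give the same result under any ordering that starts with 000 and ends with 111, so they do not help to determine the order.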
The main feature of the connection machine, however, is the message routing mechanism. The program
contains instructions to the PUs to send a message, and, operating in lockstep, they all create messages
and send them simultaneously. The 16 PUs in a cluster are connected in a 4×4 grid. Thus, in the cluster,
messages can be sent to nearest neighbors easily. However, if a message has to be sent to a PU in another
cluster, then it should consist of an address part and a data part. The data bits are, of course, the important
part, but the address part is necessary. It consists of a 12-bit destination cluster address, a 4-bit PU number
within the destination cluster, and a 12-bit memory address of the destination PU.
The important part is the 12-bit cluster address. It is supplied by the instructions and may be the same
for all PUs. It is a relative address. Each of its 12 bits specifies one of the 12 dimensions of the hypercube.
Thus address 000110010000 (with the bits numbered from 1, starting on the right) should cause the message
to be sent from the sending cluster, along the fifth, eighth, and ninth dimensions, to the destination cluster.
The message is sent by each PU to its router, and is accumulated by the router, bit by bit, in an internal
buffer. When the router has received the entire message from the sending PU, it scans the 12-bit destination
address. In our example, the first nonzero bit is the fifth one. The router clears this bit and sends the
message to the fifth of its 12 router neighbors. That router does the same and ends up sending the message
to the eighth of its 12 immediate neighbors. When a router receives a message with all 12 bits cleared, it
sends it to the destination PU in its own cluster (actually, to the memory of that PU).
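The scan-and-clear rule used by the routers can be written in a few lines. This Python sketch is only an illustration of that rule. The function name is hypothetical, it assumes that dimension 1 is the rightmost bit of the relative address, and it ignores the case of a router whose buffers are full.

```python
def route(relative_address: int, n_dims: int = 12) -> list[int]:
    """Return the dimensions a message crosses, in the order they are used.

    Each router clears the lowest-numbered nonzero bit and forwards the
    message along that dimension; delivery happens when all bits are zero.
    ASSUMPTION: dimension 1 is the rightmost bit of the address.
    """
    if relative_address < 0 or relative_address >> n_dims:
        raise ValueError("address does not fit in n_dims bits")
    hops = []
    address = relative_address
    while address:
        dim = (address & -address).bit_length()   # lowest set bit, 1-based
        hops.append(dim)
        address &= address - 1                    # clear that bit
    return hops


print(route(0b000110010000))   # [5, 8, 9]
```

The number of hops equals the number of 1 bits in the relative address, so no message needs more than 12 hops unless it is referred to another router.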
It should be emphasized that the PUs operate in lockstep, but the routers do not! While messages are
sent in the machine, all the buffers of a certain router may be full. If a message arrives at that router, it may
have to refer it to another router. Referring means that the router sets one of the 12 bits of the destination
cluster address, and sends the message in that direction. That message will therefore take longer to arrive
at its destination. Message routing, therefore, is not done in lockstep and, once the program initiates the
process of sending a message, it should wait until all the PUs have received their messages.
5.9 MIMD Computers
An MIMD computer consists of several processors that are connected, either by means of shared memory
or by special communication lines. In the former case, the MIMD computer is called a multiprocessor. In
the latter case, it is called a multicomputer. Their main components are summarized in Figs. 5.10 and 5.11.
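[Figures 5.10 and 5.11: main components of a multiprocessor and a multicomputer (labels: CM 1 through CM N, processor, memory, I/O processor, bus, switch, message routing network)]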
Each module (CM) consists of a processor, its local memory, if any, its I/O processor, and a switch
connecting the module to the rest of the multiprocessor.
The main problem in such a computer is memory conflicts. If the number of processors is large, a
processor may have to wait an unacceptably long time for memory access. The solution is to use memory
banks (Figure 5.10b) interleaved on the most significant bits. In the ideal case, processors use different
banks, and never have to wait for memory. In practice, programs have different sizes, and as a result, some
processors need less than one bank, others need more than a bank, and there is a certain number of memory
conflicts.
In a multicomputer, all the memory is local, and the individual processors communicate by passing
messages between them. The hypercube architecture is a popular way to connect processors in this type of
computer.
5.10 Parallel Algorithms
The key to an understanding of MIMD computers is an understanding of parallel algorithms. We therefore
start by looking at examples of algorithms designed specifically for MIMD computers. We describe one
algorithm for multiprocessors and several for multicomputers.
5.10.1 Parallel Rank Sorting
1. This is an example of a shared-memory algorithm, to be run on a multiprocessor. We assume that n
numbers are to be sorted (with no repetitions), and $n^2$ processors are available (an overkill, but it simplifies
the algorithm). The original numbers are stored in array V, and the sorted ones will be stored in array W.
We declare an $n \times n$ matrix A, and assign each processor to one of the matrix elements (each processor is assigned
indexes i and j). The processors communicate only through shared memory. They are not connected in a
grid or any other topology, and they do not send messages to each other.
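Step 1. Each processor calculates one matrix element:
$$A[i,j] = \begin{cases} 1, & \text{if } V[i] \le V[j];\\ 0, & \text{otherwise;} \end{cases}$$
(this takes time 1 if we ignore memory conflicts). Note that A is antisymmetric in the sense that
A[j, i] = 1 − A[i, j] for i ≠ j (the numbers are distinct), so only about $n^2/2$ processors are
actually needed. Each calculates one matrix element A[i, j] and also stores its complement in position A[j, i].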
Step 2. n processors are selected out of the $n^2$, and each computes one of the sums $T[i] = \sum_{j=1}^{n} A[i,j]$.
This takes time O(n). Note that T contains the rank of each element.
Step 3. Scatter V into W by W [T [i]] ⇐ V [i] (n processors do it in time 1).
The total time is thus O(n). Excellent, but requires too many processors. We can decrease the number
of processors, which will make the method a little less efficient. Example: V = (1, 5, 2, 0, 4) (n = 5). The
results are:
$$
\begin{bmatrix}
1&1&1&0&1\\
0&1&0&0&0\\
0&1&1&0&1\\
1&1&1&1&1\\
0&1&0&0&1
\end{bmatrix}
\begin{matrix}4\\1\\3\\5\\2\end{matrix}
\qquad
\begin{bmatrix}
1&0&0&1&0\\
1&1&1&1&1\\
1&0&1&1&0\\
0&0&0&1&0\\
1&0&1&1&1
\end{bmatrix}
\begin{matrix}2\\5\\3\\1\\4\end{matrix}
$$
$$A[i,j] = (V[i] \le V[j]) \qquad\qquad A[i,j] = (V[i] \ge V[j])$$
The sums on the right show how the numbers are to be placed in W . The matrix on the left results in
descending sort, and the one on the right, in ascending order. (End of example.)
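The three steps can be checked with a sequential simulation. The Python sketch below is not a parallel program. It performs, one after the other, the comparisons and sums that the processors would perform simultaneously. The function name and the `ascending` parameter are hypothetical.

```python
def rank_sort(v, ascending=True):
    """Sequential simulation of parallel rank sorting (distinct values only)."""
    n = len(v)
    # Step 1: one comparison per (i, j) pair; n*n processors would do
    # all of these at the same time.
    if ascending:
        a = [[1 if v[i] >= v[j] else 0 for j in range(n)] for i in range(n)]
    else:
        a = [[1 if v[i] <= v[j] else 0 for j in range(n)] for i in range(n)]
    # Step 2: the row sums give the rank (1..n) of each element.
    t = [sum(row) for row in a]
    # Step 3: scatter. Ranks are 1-based, Python lists are 0-based.
    w = [None] * n
    for i in range(n):
        w[t[i] - 1] = v[i]
    return w, t


print(rank_sort([1, 5, 2, 0, 4]))          # ([0, 1, 2, 4, 5], [2, 5, 3, 1, 4])
print(rank_sort([1, 5, 2, 0, 4], False))   # ([5, 4, 2, 1, 0], [4, 1, 3, 5, 2])
```

The rank lists returned here are the column of sums next to each matrix in the example. Repeated values would receive equal ranks and collide in step 3, which is why the algorithm assumes no repetitions.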
5.10.2 Parallel Calculation of Primes
2. Calculating primes. Here we assume that the multicomputer has a linear topology. Each processor (except
the two extreme ones) has two immediate neighbors. We use a parallel version of the sieve of Eratosthenes.
All the processors, except the first one, execute the same algorithm: A processor receives a sequence of
numbers from its left neighbor, it assumes that the first number is a prime, and it prints it (or writes it on a
disk, or sends it to a host, or stores it in shared memory). All the other numbers are compared to the first
number. Any that are multiples of the first number are ignored; those that are not multiples are sent to the
neighbor on the right.
The first processor (that could also be a host) prints the number 2 (the first prime) and creates and
sends all the odd numbers, starting at 3, to the right. The second processor receives and prints the number
3. It then receives 5, 7, 9, 11, . . ., discards 9, 15, 21, . . . and sends the rest to the right. The third processor
receives and prints the number 5. It then discards 25, 35, 55, . . . and so on. An elegant algorithm.
Note that more and more processors join the computations, and work in parallel to eliminate the non-
primes, to print one prime each, and to send more work to the right.
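The chain of processors can be imitated with a chain of Python generators, each one filtering the stream produced by its left neighbor. This is a sequential sketch of the data flow, not a parallel implementation, and the function names are hypothetical.

```python
from itertools import count


def stage(numbers):
    """One processor: the first number received is a prime; the rest are
    filtered, and the survivors are passed on to the right."""
    prime = next(numbers)
    return prime, (x for x in numbers if x % prime != 0)


def pipeline_primes(n_processors):
    """Primes reported by a line of n_processors processors (at least 1)."""
    primes = [2]                       # the first processor prints 2 ...
    stream = count(3, 2)               # ... and sends 3, 5, 7, 9, ... to the right
    for _ in range(n_processors - 1):  # every other processor runs `stage`
        prime, stream = stage(stream)
        primes.append(prime)
    return primes


print(pipeline_primes(8))   # [2, 3, 5, 7, 11, 13, 17, 19]
```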
Exercise 5.3: The number of primes that can be calculated equals the number of processors. How can this
limitation be overcome?
3. Sum 16 numbers. On a sequential computer this takes 16 (well, 15) steps. On a parallel computer, it
can be done in log2 16 = 4 steps. We imagine 16 processors arranged in a binary tree. A host sends two
numbers to each of the eight leaves of the tree. Each leaf sums one pair, and sends the sum to its parent.
The parent adds two such sums and sends the result to its parent. Eventually, the root adds the two sums
that it receives, and the process is over.
The root is also processor 0, which is responsible for the overhead (sending the data to the leaf proces-
sors). It does not seem to make sense to send many messages all the way down the tree just to add numbers.
In practice, however, the tree of processors may be embedded in a larger topology (such as a hypercube), so
the distance between processor 0 and the leaves cannot exceed the diameter of that topology.
The process can be visualized as a wave of activity, propagating from the leaves to the root. If more than
16 numbers are to be summed, the algorithm can easily be extended. The host sends the first 16 numbers
to the leaves, waits one step (for the leaves to send their results to their parents), sends the next group of
16 numbers, and so on.
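The wave of additions can be simulated level by level. In this Python sketch (the function name is hypothetical) each pass of the loop stands for one parallel step, in which all the processors on one level of the tree add their two inputs.

```python
def tree_sum(numbers):
    """Sum a list whose length is a power of two; return (total, steps)."""
    level = list(numbers)
    if not level or len(level) & (len(level) - 1):
        raise ValueError("length must be a power of two")
    steps = 0
    while len(level) > 1:
        # All the additions on one level are independent, so a tree of
        # processors would perform them in a single parallel step.
        level = [level[k] + level[k + 1] for k in range(0, len(level), 2)]
        steps += 1
    return level[0], steps


print(tree_sum(range(1, 17)))   # (136, 4)
```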
4. Sorting. We again assume a binary tree topology. The array of numbers to be sorted is divided into
equal parts that are sent, by the host, to all the nodes. Each node sorts its part, waits for its sons to send
it their parts (if it has sons), merges all the sorted streams, and sends the result to its parent. When node
0 is done, the process is complete.
Note that the higher a node is in the tree, the longer it has to wait for input from its sons. A possible
improvement is to divide the original array of numbers into unequal parts and to send the bigger parts to
the higher-placed nodes in the tree.
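The sort-then-merge scheme can be sketched with a recursive function, where a recursive call stands for waiting for a son to send its sorted stream. The numbering of the nodes (the sons of node k are nodes 2k+1 and 2k+2) and the function name are assumptions made for this illustration.

```python
import heapq


def tree_sort(parts, node=0):
    """Sort using a binary tree of nodes.

    parts[k] is the unsorted part sent by the host to node k.
    ASSUMPTION: the sons of node k are nodes 2k+1 and 2k+2.
    """
    own = sorted(parts[node])                      # each node sorts its part
    streams = [own]
    for son in (2 * node + 1, 2 * node + 2):
        if son < len(parts):                       # wait for the sons, if any
            streams.append(tree_sort(parts, son))
    return list(heapq.merge(*streams))             # merge, send to the parent


parts = [[9, 2], [7, 1], [8, 3], [6, 0], [5, 4]]
print(tree_sort(parts))   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```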
5.11 The Intel iPSC/1
[Figure 5.12: an iPSC/1 node (labels: 80286, 80287, ROM, RAM, 82586) and the Ethernet cable connecting the host to nodes 0 through 127]
The host features an 80286 CPU (8 MHz), 2 MB of DRAM, a 140 MB hard drive, a 1.2 MB floppy drive, a
terminal, a 60 MB tape, and the Xenix operating system.
Each node contains (Figure 5.12) an 80286 (8 MHz), an 80287, 64 KB of PROM, 512 KB of DRAM, seven 82586
communication channels (LAN controllers) connected to the seven nearest neighbors, and one 82586
connecting the node to the host by means of an Ethernet cable. An NX mini OS resides in each node.
A cube of 32 nodes is called a unit. There may be 1, 2, or 4 units in the iPSC/1, resulting in cubes of
5, 6, or 7 dimensions. A cube can be partitioned into subcubes.
Communication between the nodes is point-to-point. A node may also send a message directly to the
host by specifying an id of −32768. The host can send messages to the nodes by means of the Ethernet
cluster. This is a multiplexor that connects the single host to all the nodes. The host may select either one
node, several nodes, or all the nodes, and send a message.
To send or receive a message, a node executes a subroutine call. The subroutines are part of the OS,
and execute 82586 commands. Here are the steps for sending a message:
1. The program prepares the message in a buffer and calculates the number of the destination node. It
calls the appropriate subroutine.
2. The subroutine (part of the OS) decides which nearest neighbor will be the first to receive the message.
It creates 82586 commands with the buffer address, the number of the receiving node, and the number of
the neighbor. If the message is sent in the ‘blocked’ mode, the OS does not immediately return to the user
program, but waits for an interrupt from the 82586.
3. The 82586 reads the message from the buffer and sends it on the serial link. When done, it interrupts
the 80286.
4. The OS responds to the interrupt. If the message was sent in the ‘blocked’ mode, it restarts the user
program.
The steps for receiving a message are:
1. The message comes into one of the seven 82586s, which writes it in a buffer in the system area, and
interrupts the CPU.
2. The OS processes the interrupt. It gets the RAM address from the 82586, and checks to see if the
message is for that node. If not, it automatically sends the message to one of the nearest neighbors.
3. If the message is for this node, the OS checks to see if the user program is waiting for a message of the
same type as the received message (i.e., if it is waiting in a blocked subroutine call). If yes, the OS transfers
the message from its buffer to the user’s buffer. Otherwise, the message stays in the system buffer, which
may eventually overflow.
A ‘blocked’ send stops the sending program until the message is sent by the OS (but not necessarily
received). Using ‘unblocked send’ is faster but, before sending the next message, the program must ask the
OS (through function STATUS) whether the message has been sent (whether the buffer is available). Similarly,
a ‘blocked’ receive is safe, but may delay the receiving program since, if no message has been received, the
OS will put the receiving program on hold. In an ‘unblocked’ receive, the program first has to check if there
is a message to be received.
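The role of the system buffer and of the message type can be illustrated with a toy model. The Python class below is not the iPSC operating system and does not use its routines. The class, its methods, and the capacity are all hypothetical. It only shows messages waiting in a system buffer until the program asks for a message of a given type.

```python
from collections import deque


class Mailbox:
    """Toy model of a node's system buffer (NOT the iPSC operating system)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pending = deque()          # messages not yet claimed by the program

    def deliver(self, msg_type, data):
        """Called when a message addressed to this node arrives."""
        if len(self.pending) >= self.capacity:
            raise OverflowError("system buffer overflow")
        self.pending.append((msg_type, data))

    def probe(self, msg_type):
        """Unblocked-style check: is a message of this type waiting?"""
        return any(t == msg_type for t, _ in self.pending)

    def receive(self, msg_type):
        """Return the oldest message of the given type, or None if there is
        none (a real blocked receive would put the program on hold instead)."""
        for item in self.pending:
            if item[0] == msg_type:
                self.pending.remove(item)
                return item[1]
        return None


box = Mailbox()
box.deliver(7, "first")
box.deliver(3, "second")
print(box.probe(3), box.receive(3), box.receive(3))   # True second None
```

A program using the unblocked style would call `probe` before `receive`, which corresponds to checking whether there is a message to be received.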
Here is a list of the important procedures and functions used for message passing and cube management
(we use Fortran syntax, but the same routines can be called from C).
CRECV(TYPE,BUF,LEN) receives a message in the blocked mode. Parameter TYPE is a 32-bit cardinal.
The programmer may assign meanings to several different types, and a program may want to receive only
a certain type of message at any time. The type is the only clue that the receiving program has about the
sender of the message. BUF is an address in the receiving node’s memory, where there is room for LEN bytes.
If the incoming message is longer than LEN bytes, execution halts.
CSEND(TYPE,BUF,LEN,NODE,PID) sends (in the blocked mode) a message of length LEN and type TYPE
from address BUF. The message is intended for process with id number PID, running on node # NODE (note
that more than one process may be running on a node, so each process must be assigned an id by the host).
A node number of −1 will send the message to all the nodes, but they will accept it selectively, depending
on the type and current pid. To send a message to the host, write ‘myhost()’ in the NODE field.
KILLCUBE(NODE,PID) kills process PID running on node NODE. This is necessary since many node
programs don’t halt. They end by waiting for a message that never arrives, so they have to be killed.
LOAD(’FILENAME’,NODE,PID) loads a program file into the memory of NODE, and assigns it the id
number PID. This is used by the host to load nodes with (identical or different) programs. A NODE of −1
selects all nodes.
SETPID(PID) sets the id number of a process. This is supposed to be used only by the host program.
WAITALL(NODE,PID) waits for a certain process (on a certain node) to finish. This is used mostly by the
host. Both NODE and PID can be −1.
The following are functions. They don’t have any parameters, and are used, e.g., as in ‘i=infocount()’:
INFOCOUNT() returns the size (in bytes) of the most recent message received.
INFONODE(), INFOPID() and INFOTYPE() return the number of the node that sent the most recent
message, the process id, and the type of that message, respectively.
MYNODE(), MYHOST() and MYPID() return the node number of the node executing the call, the node
number of the host, and the process id number of the current process, respectively.
NUMNODES() returns the total number of nodes in the cube.
There is also a SYSLOG function where a node can send data to the system log file. This is useful for
debugging.
5.12 Vector Processors
[Figure: labels x, y, x+1, y+1, Sum, Sum]
There are two main types of vector processors, register-register and memory-memory. In the first type,
numbers are sent to the ALU from special vector registers. In the second type, they are sent from memory.
The CRAY computers are an example of the first type; the Cyber 205 is an example of the second one.
The CRAY-1 computer has eight V registers, each with room for 64 numbers. A typical vector instruction
is ‘FADD V1,V3(55)’, which adds the first 55 pairs of numbers in registers V1 and V3 as floating-point numbers.
On the Cyber 205, the instruction ‘FADD A,B(55)’ works similarly, except that it fetches the 55 pairs from
arrays A and B in memory.
A register-register vector processor seems faster since the operands are already in registers, but how do
they get there? In some cases the operands go into the registers as results of previous operations. More
commonly, though, they are fetched from memory. Since only one number can be read from memory at
any time, this seems to defeat the purpose of vector processing. This is why vector processors use memory
banks interleaved on the least significant bits. In such a memory, successive memory words are located in
consecutive banks, speeding up the process of reading an entire array. To read the elements of an array from
memory, the processor initiates a read from the first bank and, without waiting for the result, initiates a
read from the second bank, and so on. When results start arriving from the memory banks to the processor,
they do so at a rate much greater than the bandwidth of a single memory bank.
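With interleaving on the least significant bits, the bank that holds a word is the remainder of its address divided by the number of banks. The Python sketch below shows how consecutive array elements fall in consecutive banks. The function name, the number of banks, and the starting address are arbitrary choices made for the illustration.

```python
def bank_and_offset(address: int, n_banks: int) -> tuple[int, int]:
    """Low-order interleaving: the least significant bits select the bank,
    the remaining bits select the word inside that bank."""
    return address % n_banks, address // n_banks


# Eight consecutive words of an array that starts at address 100,
# stored in a memory of 4 banks (both numbers are arbitrary).
banks = [bank_and_offset(a, 4)[0] for a in range(100, 108)]
print(banks)   # [0, 1, 2, 3, 0, 1, 2, 3]
```

Since no two successive elements share a bank, a read from one bank can be started while the previous banks are still busy.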
Figure 5.14 illustrates the time it takes to perform an operation on n pairs of numbers, as a function of
n. Three types of computers are compared: a conventional (also called serial, sequential, or scalar) computer, a reg-reg
vector processor, and a mem-mem vector processor.
[Figure 5.14: execution time as a function of n for conventional, reg.-reg., and mem.-mem. computers]
The curve for a scalar computer is a straight line. Such a computer takes twice as long to operate on
24 pairs as it takes to operate on 12. The curve for the mem-mem vector is also a straight line, but with
a much lower slope. Such a computer fetches the numbers from memory at a constant rate, sends them to
the ALU at the same rate, and the results are obtained at that rate. However, it should be noted that, for
small values of n, the line is higher than the one for the scalar computer. This is because it takes time to
initialize (or to set up) the vector circuits before they can be started.
The situation with the reg-reg vector is different. Assuming that each vector register has room for 64
numbers, it takes a linear time to operate on values of n between 1 and 64. In this range the curve is a
straight line with about the same slope as in the mem-mem case. However, if the program needs to operate
on 65 pairs, two vector instructions are necessary, which increases the execution time required. This is why
the curve looks like a staircase, where each part is a straight line, slightly sloping.
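The shapes of the three curves can be reproduced with a simple cost model. The sketch below is an illustration only: all the constants (time per pair, setup time, and the function names) are arbitrary assumptions chosen to show the shapes, not measurements of any real computer. Only the register length of 64 comes from the discussion above.

```python
import math

def scalar_time(n, per_pair=10):
    """Conventional computer: a straight line through the origin."""
    return n * per_pair

def mem_mem_time(n, setup=50, per_pair=1):
    """Mem-mem vector processor: one setup cost, then a low constant slope."""
    return setup + n * per_pair

def reg_reg_time(n, setup=20, per_pair=1, register_length=64):
    """Reg-reg vector processor: one vector instruction (and one setup)
    for every register_length pairs, which produces the staircase."""
    instructions = math.ceil(n / register_length)
    return instructions * setup + n * per_pair
```

Under these assumed constants, mem_mem_time(2) exceeds scalar_time(2) because of the setup cost, and going from 64 to 65 pairs on the reg-reg machine adds a whole extra setup, which is the step in the staircase.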
It should be noted that a good compiler is an essential part of any vector processor. A good compiler, of
course, is always good to have, but on a vector processor, it can make a significant difference. The compiler
should identify all the cases where vector instructions should be used, which is not always easy. Obviously,
vector instructions should not be used in cases such as z:=x+y, and should be used in cases such as
maximum size of memory. In an associative memory, there are no addresses and, therefore, no limit on the
size of memory. There are, of course, practical limits to the size of those memories. Note that, even though
words do not have addresses, they are still ordered. There is such a thing as the first word in the associative
memory, the second word, the last one, and so on. This order is important and is used in certain ‘read’
operations. However, it is impossible to refer to a word in memory by its position. A word can only be
accessed by its content.
Since an associative memory performs certain operations in parallel, it requires more hardware than a
conventional memory. Each word must have enough hardware to perform the search and the other parallel
operations. Conventional memory, in contrast, requires less hardware since it is sequential and operates on
one word at a time.
The pattern is specified by means of two registers, the comparand and the mask. The comparand
contains the values searched for, and the mask contains the bit positions. In the example above, the
comparand should contain a 1 in bit position 3, a 1 in position 7, a 0 in position 8, etc., while the mask
should have 1’s in positions 3, 7, 8, . . . Two registers are needed, since with only a comparand, it would be
impossible to tell which comparand bits are part of the pattern and which are unused. Figure 5.15 shows
the two registers connected to all the memory words by means of the memory bus.
[Figure 5.15: the control unit, the comparand and mask registers, and memory words 1 through N, connected by a bus that carries commands and the pattern]
Before getting into the details of the various memory operations, let’s try to convince ourselves that the
entire idea is useful, by looking at certain applications.
Example 1. Finding the largest number. On a conventional computer, finding the largest of M numbers
takes M steps. We will show that, on our associative computer, the same process takes fixed time. We
start by loading both the comparand and the mask register with the pattern 10 . . . 00. When the process
ends, the comparand will contain the largest number. We start by performing a ‘search’, and ask if there
are ‘any responders’. If the answer is yes, then we know that the largest number has a 1 in the leftmost
position. We accordingly keep the 1 in that position of the comparand. If the answer is no, then we clear the
leftmost position of the comparand. The comparand now contains x0 . . . 00, where x equals the leftmost bit
of the largest number. Next, we set the comparand to x10 . . . 00 and shift the mask to the right 010 . . . 00.
We repeat the process and get a value of xy0 . . . 00 in the comparand, where y equals the largest number’s
second bit from the left. The loop is repeated N times, where N is the word size, and thus takes fixed time.
The execution time is independent of M , the number of words.
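The loop of Example 1 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the parallel search is simulated here by a sequential scan, so the simulation (unlike the hardware) spends M steps on each search. The sketch also assumes that the mask keeps the bit positions that have already been decided, so that earlier bits continue to constrain later searches.

```python
def search(words, comparand, mask):
    """Simulated parallel search: one response bit per word.
    A word responds if it agrees with the comparand in every masked position."""
    return [(w & mask) == (comparand & mask) for w in words]

def find_largest(words, word_size):
    """Return the largest word, using word_size searches regardless of len(words)."""
    comparand = 0
    mask = 0
    for position in range(word_size - 1, -1, -1):   # leftmost bit first
        bit = 1 << position
        comparand |= bit        # tentatively ask for a 1 in this position
        mask |= bit             # assumption: decided positions stay in the mask
        if not any(search(words, comparand, mask)):
            comparand &= ~bit   # no responders: the largest number has a 0 here
    return comparand
```

For example, find_largest([5, 12, 9, 3], 4) performs four searches and leaves 12 in the comparand.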
Exercise 5.6: Describe the process for finding the smallest number.
Example 2. Sorting M numbers is now a trivial process. The largest number is selected as shown earlier,
and the comparand is stored at the start of an M -word array in conventional memory. Since the comparand
contains a copy of the largest number, a ‘search’ will identify that word in memory, and it should be cleared
or somehow flagged and disabled (see below). Repeating the process would get the second largest number
and so on. The entire sort thus takes M steps, instead of a best case of M log M steps on a conventional
computer. Since sorting is such a common application of computers, this example is a practical one.
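The sort of Example 2 can be sketched as repeated selection of the largest enabled word. The code below is a simulation under stated assumptions: the name associative_sort is hypothetical, the list named enabled plays the role of the flags that disable words already selected, and each parallel search is again imitated by a sequential scan.

```python
def associative_sort(words, word_size):
    """Return the words in descending order by selecting the largest M times."""
    enabled = [True] * len(words)   # False = word already selected and disabled
    result = []
    for _ in range(len(words)):
        comparand, mask = 0, 0
        for position in range(word_size - 1, -1, -1):
            bit = 1 << position
            comparand |= bit
            mask |= bit             # assumption: decided positions stay in the mask
            responders = [e and (w & mask) == (comparand & mask)
                          for w, e in zip(words, enabled)]
            if not any(responders):
                comparand &= ~bit
        result.append(comparand)    # store the largest remaining number
        # Disable the first enabled word that equals the comparand.
        for i, (w, e) in enumerate(zip(words, enabled)):
            if e and w == comparand:
                enabled[i] = False
                break
    return result
```

Disabling only the first matching word means that duplicates are kept, each being selected in its own round.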
Example 3. Payroll is a typical example of a business application that requires a sort. In a simple payroll
program, records are read from a master file—each containing the payroll information of one employee—pay
checks are prepared, and the (modified) records are written to a new master file. The master file is sorted by
employee id number (or by social security numbers). As a record is read from the master file, the program
compares it to modification records stored in memory, to see if there are any modifications to the payroll
information of the current employee. On a conventional computer, the modification records are either sorted
in memory or are written, in sorted order, on a file. On an associative computer, the modification records
may be stored in memory in any order. When an employee record is read off the master file, the social
security number is used as a pattern to search and flag all the modification records for that employee. Those
modification records are then read, one by one, and used to produce the pay stub and the modified record
of the employee.
This example is very general, since there are many business applications which work in a similar way,
and may greatly benefit from parallel searches in an associative memory. It also shows the importance of an
architectural approach which combines an associative and a conventional computer.
These examples, although simple, illustrate the need for certain instructions and features in an associa-
tive computer. The most important of those are listed below.
Instructions are needed to load the comparand and the mask, either from other registers, from a
conventional memory, or with a constant. It should also be possible to perform simple arithmetic operations,
logical operations, and shifts on the two registers. It should also be possible to save those registers in several
places.
After a search, a special instruction is needed to find out whether any memory words have responded.
Conventional instructions are necessary to perform loops, maintain other registers, do I/O, etc.
Since we rarely want to perform a search, a sort, or any other operation on all the words in memory,
there should be more hardware for enabling and disabling words. As a result, the response store of each
word should be extended from 1 to 2 bits. The first bit (the response bit), indicates whether the word is
a responder to the current search. The second bit (the select bit), tells whether the word is enabled and
should participate in the current operation, or whether it is disabled and should be ignored.
Suppose that it is necessary to search a certain array. How can the program enable the words that
belong to the array, and disable the rest of the associative memory? The key is to uniquely identify each
data structure stored in the associative memory. Our array must be identified by a unique key, and each
word in the array should contain that key. A search is first performed, to identify all the words of the array.
Those words are then enabled by copying the response bits to the select bits of the response store.
This discussion shows the importance of the response store, and as a result, it seems reasonable to add
a third, auxiliary, bit to it. The auxiliary bit could be used to temporarily store either the response bit or
the select bit. It also makes sense to add instructions that can save the response store, either in conventional
memory or in the associative memory itself.
As a result of the discussion above, it is clear that words should be large, since each should contain
the data operated on, and a key (or several keys) to identify the word.
There are two ways to read from the associative memory. A ‘read’ instruction can either read the first
responder (in which case, it should be possible to turn off that responder after the read), or it can read the
logical OR of all the responders. In the payroll example above, responders should be read one by one.
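The response store and the two kinds of read can be modeled as follows. This is a minimal sketch, not a description of any real associative computer: the class and method names are hypothetical, and the word layout in the usage example (a 4-bit key followed by 4 bits of data) is an assumption made for illustration.

```python
class AssociativeMemory:
    """Simulated associative memory with response and select bits."""

    def __init__(self, words):
        self.words = list(words)
        self.response = [False] * len(self.words)   # response bits
        self.select = [True] * len(self.words)      # select bits (True = enabled)

    def search(self, comparand, mask):
        """Set the response bit of every enabled word that matches the pattern."""
        self.response = [s and (w & mask) == (comparand & mask)
                         for w, s in zip(self.words, self.select)]

    def any_responders(self):
        return any(self.response)

    def select_responders(self):
        """Enable exactly the current responders (copy response bits to select bits)."""
        self.select = list(self.response)

    def read_first(self):
        """Read the first responder and turn it off; None if there is none."""
        for i, responded in enumerate(self.response):
            if responded:
                self.response[i] = False
                return self.words[i]
        return None

    def read_or(self):
        """Read the logical OR of all the responders."""
        result = 0
        for w, responded in zip(self.words, self.response):
            if responded:
                result |= w
        return result
```

A possible use: with the words 0xA1, 0xB7, 0xA4, 0xB1, a search with comparand 0xA0 and mask 0xF0 identifies the two words whose key is A. After select_responders(), a second search with comparand 0x01 and mask 0x0F finds 0xA1 but ignores 0xB1, which is disabled.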
As a result of these points, designers have concluded that a good way to design an associative com-
puter is as a combination of a conventional computer and an associative one. In such an architecture, the
conventional computer is the host, and the associative computer is a peripheral of the host. The two com-
puters can process different parts of the same application, with the conventional computer executing the
sequential parts, sending to the associative computer those parts that involve parallel memory operations.
[Figure 5.16: the search operation, showing the comparand bits C, the mask bits M, the ‘perform search’ signal, the bits W of each memory word, and the response-store bits labeled S, T, and R of each word]
Figure 5.16 illustrates the search operation in an associative memory. The control unit executes a search
by sending a ‘set T’ signal, followed by a ‘perform search’ signal. The first signal sets all the T bits of the
response store. The second signal clears the response bits of all words that don’t match the pattern.
5.15 Data Flow Computers
Another approach to the design of a non-von Neumann computer is the data flow idea. A data flow computer
resembles a von Neumann machine in one respect only: it has addressable memory. In all other respects,
this computer is radically different. Figure 5.17 lists some of the differences between a von Neumann and a
data flow computer and can serve as an informal definition of a data flow computer.
What is the earliest time that an instruction can be executed? Consider an instruction such as ‘ADD A,B’.
In principle, this instruction can be executed as soon as both its operands A, B are ready. In a von Neumann
machine, execution is sequential and our instruction may have to wait for its turn. In the sequence:
.
STO R1,A
.
STO R2,B
SUB R1,C
ADD A,B
both A and B are ready after the second STO and, at that point, the ADD instruction can, in principle, be
executed. Let’s assume, however, that the SUB instruction also becomes ready at this point. The SUB, of
course, is executed first, because of the order in which the instructions were written. The ADD has to wait.
In a data flow computer, our ADD instruction wouldn’t have to wait. It would be executed as soon as its
operands are ready, which implies that it would be executed in parallel with the SUB. Thus, one important
aspect of data flow computers is parallel execution. A data flow computer should have a number of PEs
(processing elements) where several instructions can be executed simultaneously.
Research on data flow computers started in the late 1960s, both at MIT and at Stanford University.
The first working prototype, however, was not built until 1978 (at the Burroughs Corp.). Since then, several
research groups, at Texas Instruments, UC Irvine, MIT, University of Utah, and Tokyo University, have been
exploring ways to design data flow computers, and to compile programs for them. Today, several working
prototypes exist, but not enough is known on how to program them. A short survey of data flow computers
is [Agerwala and Arvind 82].
Since the data flow concept is so different from the sequential, von Neumann design, it takes
a conceptual breakthrough to design such a machine. This section describes one possible implementation,
the Manchester data flow computer, so called because it was developed and implemented in Manchester,
England [Watson and Gurd 82].
To concentrate our discussion, we start with a simple, one-statement program. Figure 5.18a shows a
common assignment statement. In a conventional (von Neumann) computer, such a statement is typically
executed in four consecutive steps (Figure 5.18b). However, in a data flow computer, where several instruc-
tions can be executed simultaneously, we first have to identify the parallel elements of the program. The
example is so simple that it is immediately clear that the two multiplications can be executed in parallel.
The graph in Figure 5.18c exposes the parallelism in the program by displaying the two multiplications on
the same level. Note that there are only three instructions. We later see that there is no need for a final
STO instruction. Note also that, since execution is not sequential, instructions can be stored anywhere in
memory. They do not have to be stored sequentially. Addresses 2, 7, and 11 have been assigned arbitrarily
to the three instructions.
(a) E:=A*B-C*D
(b) MUL A,B
    MUL C,D
    ADD A,C
    STO B
(c) [Data flow graph: inputs A, B, C, and D; the ‘∗’ at address 11 and the ‘∗’ at address 2 feed the ‘−’ at address 7, which produces the final result]
Figure 5.18. (a) The program, (b) conventional execution, (c) the data flow graph
The first multiplication instruction ‘∗’, at address 11, generates a product (= 6) that should become the
left operand of the subtract ‘−’ instruction at address 7. The ‘∗’ therefore generates a result that consists of
the product and the destination address, 7. Such a result is called a token. A token is a result generated by
an instruction. It can either be an intermediate result or a final one. If the token represents an intermediate
result, then it should serve as an input operand to some other instruction and it should contain the address
of that instruction. If it is a final result, however, then it should be stored in memory (or be output) and
should contain the memory address where it should be stored. In either case, the token consists of the
following fields:
A value (which could be integer, floating-point, boolean, or any other data type).
A type. This describes the type of the ‘value’ field.
A one-bit field to indicate an intermediate or a final result.
A destination address. If the token is intermediate, this is the address of an instruction in the
instruction store, where the token is headed. Otherwise, this is an address (in the host’s memory, see later),
where the token will be stored for good.
Handedness. Our example token goes to the ‘−’ instruction at address 7 as its left input operand.
The right input comes from the ‘*’ instruction at address 2. Each (intermediate) token should carry its
handedness, except if it is destined for a single operand instruction (such as a DUP, see later).
Bypass matching store? A one-bit field indicating whether the token should go to the matching store
(see later) or should bypass it.
A tag field (see later).
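The list of fields can be summarized as a record. The following Python sketch is one possible representation, not the Manchester machine's actual token layout: the field names, the encoding of handedness as 'L' or 'R', and the use of an integer tag are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class Token:
    value: Any                  # integer, floating-point, boolean, ...
    value_type: str             # describes the type of the value field
    final: bool                 # False = intermediate result, True = final result
    destination: int            # instruction address, or host address if final
    handedness: Optional[str]   # 'L' or 'R'; None for a single-operand instruction
    bypass: bool                # True = bypass the matching store
    tag: int = 0                # placeholder; the tag field is discussed later

# The token produced by the '*' at address 11 in the example:
# the product 6, headed for the '-' at address 7 as its left operand.
example = Token(value=6, value_type='integer', final=False,
                destination=7, handedness='L', bypass=False)
```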
The seven tokens used in our example are shown (first five fields only) in the table
The next point to discuss is the matching of tokens. Tokens 1 and 2 in our example are the operands
of the ‘∗’ instruction at 11. Before attaching themselves to that instruction, they should somehow meet and
make sure they constitute a pair. Tokens are matched by destination address and handedness. Tokens 1 and
2 match since they have the same destination address (= 11) and opposite values of handedness. Tokens 5
and 6 also match for the same reason.
The matching is done in a special memory, the matching store. A token arriving at the matching store
searches the entire store to find a match. If it does not find any, it remains in the matching store, waiting
for a matching token. If it finds a match, it attaches itself to the matched token and the two tokens proceed,
as a pair, to the instruction store.
Exercise 5.7: What kind of memory should the matching store be?
In the instruction store, the two tokens pick up a copy of the instruction they are destined for, and
proceed, as a packet, to be executed by an available PE. Figure 5.19 illustrates the basic organization of our
computer so far. Note that the buses connecting the three main components are different since each carries
a different entity.
[Figure 5.19: a ring consisting of the matching store, the instruction store, and the PEs; the buses carry tokens, token pairs, and packets]
The basic organization of the particular data flow computer discussed here is a ring consisting of the
three main components—the matching store, instruction store, and PEs. As with any other loop, it is
important to understand how it starts and how it stops. A good, practical idea is to add a host computer to
help with those and with other operations. The host performs tasks such as:
Injecting the initial tokens (A, B, C, and D in our example) into the data flow machine (the main
ring).
Receiving the final tokens and storing them in its memory (to be printed, saved on a file, or further
processed by the host).
Compiling the program for the data flow computer and sending it to the instruction store.
Executing those parts of the program that do not benefit from parallel execution.
In general, the host may consider the data flow computer an (intelligent) peripheral. The host may run
a conventional program, executing those parts of the program that are sequential in nature, and sending the
other parts to be executed by the data flow computer. The results eventually arrive back at the host, as
tokens, and can be saved, output, or further processed. The host is interfaced to the data flow computer by
means of a switch, placed in the main ring between the PEs and the matching store (Figure 5.20).
The switch routes the initial tokens from the host to the matching store, where they match and proceed
to the instruction store, starting the data flow operation. Any tokens generated in the PEs pass through the
switch and are routed by it to one of three places. A token destined for a double-operand instruction is sent
to the matching store. A token destined for a single-operand instruction is sent directly to the instruction
store. A token representing a final result is routed to the host.
At the end of the program, the last final result is generated and is sent to the host. At that point,
the data flow computer should have no more tokens, either on its buses or in the matching store. This,
obviously, is the condition for it to stop. Note that certain bugs in the program can cause the data flow
computer to run forever. It is possible, for example, to have unmatched tokens lying in the matching store,
waiting for a match that never comes. It is easy to create a situation where a token is executed by an
instruction, generating an identical token, which causes an infinite loop. There can also be a case where
tokens are duplicated without limit, flooding the matching store and clogging the buses.
Similar situations are also possible on a conventional computer and are dealt with by means of interrupts.
A well-designed data flow computer should have interrupts generated by an overflowing matching store,
overflowing queue (see below), a timer limit, and other error situations. However—since the data flow
computer does not have a PC, and there are no such things as the next instruction or a return address—
conventional handling of interrupts is impossible, and the easiest way to implement interrupts is to send
them to the host.
The next step in refining the design of the main ring is to consider two extreme situations. One is the
case where no PEs are available at a certain point. This may happen if the program has many parallel
elements, and many instructions are executed simultaneously. In such a case, the next packet (or even more
than one packet) has to wait on its way from the instruction store to the PEs. Another hardware component,
a queue, should be added to the ring at that point. The PE queue is a FIFO structure that should be large
enough not to overflow under normal conditions. If it does overflow, because of a program bug, or because
of an unusual number of parallel instructions, an interrupt should be generated, as discussed earlier.
The other extreme situation is the case where a token arrives at the matching store while the store is
busy, trying to match a previous token. This is a common case and is also handled by a queue, the token
queue. All tokens passing through the switch go into the token queue, and the top of the queue goes to the
matching store when the store becomes ready. The token queue should be large because it is common to
have large quantities of tokens injected from the host, all arriving at the queue on their way to the matching
store.
The final design of the ring is shown in Figure 5.21. Note the special bus connecting the token queue
to the instruction store. This is used by tokens that do not need a match and should bypass the matching
store. Another bus, not shown in the diagram, connects the host to the instruction store. It is used to load
the program in the instruction store before execution can start.
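To make the flow of tokens around the ring concrete, here is a small simulation of the one-statement program E:=A*B-C*D. It is a sketch under several assumptions: there is a single PE (so nothing actually runs in parallel), every token goes through the matching store (the bypass bus is not modeled), the name 'E' stands for a hypothetical host address, and the input values A=2, B=3, C=1, D=4 are chosen only so that the first product is 6, as in the text.

```python
from collections import deque
import operator

# Instruction store: address -> (operation, destination, handedness, is_final).
# Addresses 2, 7, and 11 follow the example in the text.
INSTRUCTIONS = {
    11: (operator.mul, 7, 'L', False),    # A*B becomes the left operand of '-'
    2:  (operator.mul, 7, 'R', False),    # C*D becomes the right operand of '-'
    7:  (operator.sub, 'E', None, True),  # the difference is a final result
}

def run(initial_tokens):
    """initial_tokens: (value, destination, handedness) triples injected by the host."""
    token_queue = deque(initial_tokens)   # tokens waiting for the matching store
    matching_store = {}                   # destination -> waiting (value, handedness)
    pe_queue = deque()                    # packets waiting for a PE
    host_memory = {}
    while token_queue or pe_queue:
        if token_queue:
            value, dest, hand = token_queue.popleft()
            if dest in matching_store:    # assumes the partner has opposite handedness
                other_value, _ = matching_store.pop(dest)
                left, right = ((value, other_value) if hand == 'L'
                               else (other_value, value))
                pe_queue.append((INSTRUCTIONS[dest], left, right))
            else:
                matching_store[dest] = (value, hand)
        if pe_queue:
            (op, out_dest, out_hand, is_final), left, right = pe_queue.popleft()
            result = op(left, right)
            if is_final:
                host_memory[out_dest] = result   # the switch routes it to the host
            else:
                token_queue.append((result, out_dest, out_hand))
    return host_memory, matching_store

host, leftovers = run([(2, 11, 'L'), (3, 11, 'R'), (1, 2, 'L'), (4, 2, 'R')])
```

The run ends with the value 2 stored under 'E' and an empty matching store. A non-empty matching store at the end would indicate unmatched tokens, one of the bugs mentioned earlier.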
To better understand how the data flow computer operates, we look at a few more simple programs.
Example 1. E:=A*B+C/A+4×A↑2; This is a short program consisting, as before, of one assignment
statement. The difference is that variable A is used here three times. In a conventional computer, such a
situation is simple to handle since A is stored in memory, and as many copies as necessary can be made.
LOD R1,A     ; 1st use
MUL R1,B
LOD R2,C
DIV R2,A     ; 2nd use
LOD R3,A     ; 3rd use
MUL R3,A     ; 4th use
MUL R3,#4
ADD R1,R2
ADD R1,R3
STO R1,E
[Diagram: the host, switch, token queue, instruction store, PE queue, and PEs, connected by buses that carry tokens and packets]
Figure 5.21. Data flow computer organization (3rd version)
In a data flow computer, however, there are no variables stored in memory; only tokens flowing along
buses (arcs of the ring). In order to use A several times, extra copies of A’s token have to be made. We now
realize that a new instruction, DUP, is needed, to duplicate a token. This is a common instruction in data
flow computers and is an example of an instruction with one input and two output tokens.
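[Diagram: a data flow graph with three DUP instructions, the inputs B and C, the immediate constant 4, and the instructions ∗, /, ∗, ∗, +, +, ending in the final result]
Figure 5.22. A simple data flow graph with DUP instructions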
The data flow graph is shown in Figure 5.22 and it raises two interesting points. First, the graph is
six levels deep, meaning that it takes six cycles of the ring to execute (compare this to the 10 instructions
above). Second, one ‘∗’ instruction uses the immediate constant 4. Instead of injecting or generating a
special token with a value of 4, it is better to have the instruction contain the immediate quantity.
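The depth of the graph can be checked mechanically. The sketch below encodes one plausible reading of the graph for E:=A*B+C/A+4×A↑2; the node names and the exact wiring of the three DUP instructions are assumptions reconstructed from the description, not a copy of the figure.

```python
# Each instruction: name -> list of its inputs.
# A, B, and C are initial tokens injected by the host.
GRAPH = {
    'dup1':   ['A'],                 # first copy of A
    'mul_ab': ['dup1', 'B'],         # A*B
    'dup2':   ['dup1'],              # second copy of A
    'div_ca': ['C', 'dup2'],         # C/A
    'dup3':   ['dup2'],              # third copy: both operands of A*A
    'square': ['dup3', 'dup3'],      # A*A
    'mul4':   ['square'],            # multiply by the immediate constant 4
    'add1':   ['mul_ab', 'div_ca'],  # A*B + C/A
    'add2':   ['add1', 'mul4'],      # final result
}

def level(node):
    """Initial tokens are at level 0; an instruction is one level
    below its deepest input."""
    if node not in GRAPH:
        return 0
    return 1 + max(level(source) for source in GRAPH[node])
```

With this wiring, level('add2') is 6, in agreement with the six levels mentioned above, and the graph contains nine instructions.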
We thus arrive at an instruction format for a data flow computer, containing the following fields (Fig-
ure 5.23):
1. OpCode
2. Destination address for the first output token.
3. Handedness for the first output token.
4. First token is an intermediate or a final result (I/F).
5. First token needs matching or should bypass the matching store (M/B).
6. A zero in this field indicates that fields 7–10 are a second output token; a one indicates they are a
second, immediate, input operand.
7. Destination address for the second output token.
8. Handedness for the second output token.
9. Second token is intermediate or final result (I/F).
10. Second token needs matching or should bypass the matching store (M/B).
There are 10 fields, but fields 3–6, 8–10 consist of a single bit each. If field 6 is one (meaning that the
second operand is an immediate input operand), then fields 7–10 contain the immediate quantity.
This format allows for instructions with either one or two outputs. If the instruction generates only one
output, then one of its input operands can be immediate. Note that our instructions can have any number
of input operands, although in practice they have just one or two.
Example 2. Any non-trivial program has to make decisions. In a conventional computer, decisions
are made by the various comparison instructions and are executed by the conditional branch instructions.
In a data flow computer, some ‘compare’ instructions, and a conditional branch instruction (BRA) have to
be designed. The ‘compare’ instructions are straightforward to design. Each receives two input tokens,
compares them, and outputs one boolean token. The two input tokens should normally be of the same type,
and their values are compared. Several such comparisons should be implemented in a data flow computer,
the most important ones being EQ, GE, and GT. Note that there is no need for LE and LT instructions since an
‘LE A,B’ instruction is identical to a ‘GE B,A’ (and an ‘LT A,B’ to a ‘GT B,A’). Also, an NE instruction is not needed, since it can be replaced
by the two-instruction sequence EQ, NEG (see Figure 5.24b for such a sequence). The ‘negate’ instruction NEG
is quite general and can negate the value of a numeric or a boolean input.
Since those instructions have just one output token, they can have one immediate input. Thus instruc-
tions such as ‘EQ A,0’; ‘GE B,2’; ‘GT C,-6’; are valid.
The conditional branch instruction, BRA, is not that easy to design, since the data flow computer cannot
branch; there is no PC to reset. One way to design a BRA is as a 2-input, 2-output instruction that always
generates just one of its two outputs. Such an instruction receives an input token X and generates one of two
possible output tokens, depending on its second input, that should be boolean. The two output tokens are
identical to the input X, except for their destinations. Figure 5.24a shows this instruction and Figure 5.24b
is an example of an absolute value calculation. A numeric token X is tested and, if negative, is negated to
obtain its absolute value.
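The behavior of BRA and the absolute value calculation can be imitated in ordinary Python. This is only a sketch of the semantics described above: the function names are hypothetical, a missing output token is represented by None, and the tokens are reduced to their values.

```python
def bra(x, condition):
    """Two inputs, two outputs: return (false_output, true_output).
    Exactly one output carries the token x; the other is None."""
    return (None, x) if condition else (x, None)

def neg(x):
    """NEG negates a boolean (logical not) or a numeric value."""
    return (not x) if isinstance(x, bool) else -x

def absolute(x):
    """Absolute value in the style of the data flow graph."""
    copy1, copy2 = x, x                   # DUP: one copy to test, one to route
    flag = copy1 >= 0                     # compare with an immediate 0
    false_out, true_out = bra(copy2, flag)
    if true_out is not None:
        return true_out                   # non-negative: already the final result
    return neg(false_out)                 # negative: negate first
```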
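[Diagram: (a) a BRA instruction with an input X, a boolean input, and two outputs labeled false and true; (b) a graph with the input A and the instructions dup, ≥0, bra, and neg, ending in the final result]
Figure 5.24. (a) A BRA instruction, (b) An ABS calculation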
[Diagram: two versions of a loop with the input X, a processing instruction at address 8, the instructions −1, ≥0, and dup, and a bra at address 12 that produces the final result; version (b) adds incl instructions]
Figure 5.25. (a) The basic mechanism, (b) with labels added
The data flow graph (Figure 5.25a) illustrates a simple loop. We assume that the instruction at address
8 does some processing. The processing may, of course, require more than one instruction. The loop is
straightforward except for one subtle point—the two loops do not run at the same rate. The main loop—
consisting of instructions 8 and 12—is slow since it involves processing. The counting loop—instructions
9–11—is faster since it involves three simple instructions. This is an obvious but important point, since it
may throw off the matching of tokens. Recall that tokens are matched, in the matching store, by destination
and handedness. In our loop, however, tokens with the same destination and handedness are generated,
one in each iteration, by the BRA instruction at address 12, part of the slow loop. We denote such a token,
generated in iteration i, by Ai . Our Ai is sent to the matching store to match itself to a token Bi generated,
in iteration i, by the “≥ 0” instruction at address 10 (part of the fast loop). When Ai gets to the matching
store, however, it may find several B tokens, Bi , Bi−1 , Bi−2 , . . ., generated by several iterations of the fast
loop, already waiting. They all have the same destination and handedness, since they were all generated by
the same instruction, but they may have different values. Most are ‘true’ but the last one should be ‘false’.
Our particular token Ai should match itself to Bi and not to any of the other Bj s.
We conclude that matching must involve the iteration number, and a general way of achieving this
is to add a new field, a label field, to each token, and to match tokens by destination, handedness, and
label. Figure 5.25b shows how this approach is implemented. Each iteration increments the label, thereby
generating unique tokens. A token Ai arriving at the matching store will match itself to a B token with the
same label. Such a token is Bi (a B token that was generated by the same iteration as Ai ).
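Matching by destination, handedness, and label can be sketched with a dictionary keyed by destination and label. The class below is a hypothetical illustration, not the design of an actual matching store, and the destination address 12 used in the example is an assumption.

```python
class LabeledMatchingStore:
    """Tokens wait here until a partner with the same destination and label,
    and the opposite handedness, arrives."""

    def __init__(self):
        self.waiting = {}   # (destination, label) -> (handedness, value)

    def arrive(self, destination, label, handedness, value):
        """Return the (left, right) pair if a match is found, else None."""
        key = (destination, label)
        if key in self.waiting and self.waiting[key][0] != handedness:
            _, other_value = self.waiting.pop(key)
            return ((value, other_value) if handedness == 'L'
                    else (other_value, value))
        self.waiting[key] = (handedness, value)
        return None

store = LabeledMatchingStore()
# The fast loop runs ahead: B tokens of iterations 1, 2, 3 are already waiting.
for label, flag in [(1, True), (2, True), (3, False)]:
    store.arrive(12, label, 'R', flag)
# The slow loop now delivers A2; it pairs with B2, not with B1 or B3.
pair = store.arrive(12, 2, 'L', 'A2')
```

Here pair is ('A2', True), and the B tokens of iterations 1 and 3 remain in the store, waiting for their own partners.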
There is another approach to the same problem. The matching store can be redesigned such that
matching is done in order of seniority. A token Ai will always match itself to the oldest token that has the
same destination and opposite handedness. In our case this would mean that Ai would match itself to Bi .
In other cases, however, matching by seniority does not work.
Exercise 5.8: What could be a good data structure for such a matching store?
Even the use of labels does not provide a general solution to the matching problem. There are cases,
such as nested loops, where one label is not enough! A good design for a data flow computer should allow
for tokens with several label fields, and there should be instructions to update each of them individually.
Exercise 5.9: What could be another case where simple labels may not be enough to uniquely identify a
token?
Parallel Processing is an efficient form of information
processing which emphasizes the exploitation of
concurrent events in the computing process
—K. Hwang and F. A. Briggs, Computer Architecture & Parallel Proc.
6. Reduced Instruction Set Computers
6.1 Reduced and Complex Instruction Sets
RISC (reduced instruction set computers) is an important trend in computer architecture, a trend that
advocates the design of simple computers. Before explaining the word ‘simple’ and delving into the details
of RISC architectures, let’s look at the historical background of, and the reasons behind, RISC and try to
understand why certain designers have decided to take this approach and implement simple computers.
One trend has been dominant in computer architecture since the early days of computer design, namely
a trend toward CISC (complex instruction set computers). Larger, faster computers with more hardware
features and more instructions were developed, and served as the predecessors of even larger, more complex
computers. The following table shows a comparison of computer architectures from the mid 1950s to about
1976. It is clear that computer architects and designers did an excellent job designing and building complex
computers.
One important feature of CISC architectures is a large instruction set, with powerful machine instructions,
many types, sizes and formats of instructions, and many addressing modes. A typical example of a powerful
machine instruction is EDIT. This instruction (available on old computers such as the IBM 360, IBM 370,
and DEC VAX), scans an array of characters in memory and edits them according to a pattern provided
in another array. Without such an instruction, a loop has to be written to perform this task. Another
example is a square-root machine instruction, available on the CDC STAR100 computer (now obsolete). In
the absence of such an instruction, a program must be written to do this computation.
While most computer designers have pursued the CISC approach, performance evaluation studies,
coupled with advances in hardware and software, have caused some designers, starting in the early 1970s,
to take another look at computer design, and to consider simple computers. Their main arguments were:
The complex instructions are rarely used. Most programmers prefer simple, conventional instructions
and are not willing to take the time to learn complex instructions, which are not needed very often. Even
compiler writers may feel that they are presented with too many choices of instructions and modes, and may
decide not to take the trouble of teaching their compilers the proper use of the many addressing modes and
complex instructions.
Studies of program execution, involving tracing the execution of instructions, have shown that a few
basic instructions such as load, store, jump, and integer add, account for the majority of the instructions
executed in a typical program.
Procedures are important programming tools and are used all the time in all types of programs. An
efficient handling of procedures on the machine instruction level is one way of speeding up the computer.
Complex hardware is slow to design and costly to debug, lengthening the development time of a new
computer and increasing its price, thereby reducing the competitiveness of the manufacturer.
Complex instructions are implemented by microprogramming which, while making it easy to
implement complex machine instructions, takes longer to execute.
In a microprocessor, where the entire processor fits on one chip, the hardware and control store needed
to implement complex instructions take up a large part of the chip area, area that could otherwise be devoted
to more useful features.
Complex instruction sets tend to include highly-specialized instructions that correspond to statements
in a certain higher-level language (for example, COBOL). The idea is to simplify the construction of compilers
for that language. However, those instructions may be useless for compiling any other higher-level language,
and may represent a lot of hardware implemented for a very specific use.
Most older architectures up to the 1970s did not use pipelining. Pipelining was a relatively new
feature in computers and RISC advocates considered it very useful and worth the cost of the extra hardware
it requires.
Based on these arguments, RISC advocates have come up with a number of design principles that today
are considered the basis—or even the definition—of RISC. They are:
as a definition. A computer design following all those points is certainly a RISC. A design implementing
just some of them may be considered RISC by some, and non-RISC by others. Perhaps a good analogy
is the concept of fine-grained architecture vs. coarse-grained architecture, discussed in Chapter 5. Certain
parallel computers have fine-grained architectures, others have coarse-grained architectures. Most parallel
computers, however, are in-between and are hard to classify as belonging to either type.
The above points also serve to illustrate the advantages of RISC. Those are listed by RISC advocates
as follows:
Easy VLSI realization. Because of the simplified instruction set, the control unit occupies a small area
on the chip. Generally, the control unit of a microprocessor occupies more than 50% of the chip area. In
existing RISCs, however, that figure is typically reduced to around 10%. The available area can then be used
for other features such as more registers, cache memory, or I/O ports.
Faster execution. Because of the simple instruction set, the compiler can generate the best possible
code under any circumstances. Also, instructions are faster to fetch, decode and execute, since they are
uniform. RISC advocates also see the large number of registers as a point in favor of fast execution, since it
requires fewer memory accesses.
Designing a RISC is fast and inexpensive, compared to other architectures. This entails the following
advantages:
• A short design period results in less expensive hardware.
• It gets the computer to the market faster, thus increasing the competitiveness of the manufacturer and
resulting in products that truly reflect the state of the art. A delay in getting a computer to market
may result in computers that are outdated the day they are first introduced.
• The resulting computer contains fewer hardware bugs, and any new bugs found are easier to fix.
It is easier to write an optimizing compiler for a RISC. To compile a given statement in a higher-level
language, a compiler on a RISC usually has just one choice of instruction and mode. This simplifies the
design of the compiler itself, making it easier to optimize code. Also, the large number of registers makes
it easier for the compiler to optimize the code by generating more register-register instructions and fewer
memory reference instructions. On a CISC, in contrast, the compiler has several choices of instructions and
modes at any given point, which makes it much harder to implement a good optimizing compiler.
As with most other concepts in the computer field, it is impossible to absolutely prove or disprove the
advantages listed here. RISC detractors were therefore quick to announce that they too have a list, a list of
RISC disadvantages. Here it is:
Because of the simple instruction set, certain functions that, on a CISC, require just one instruction,
will require two or more instructions on a RISC. This results in slower RISC execution. RISC proponents,
however, point out that, in a RISC, only the rarely-executed instructions are eliminated, keeping the
commonly executed ones. It is well known, they say, that in a typical program, a few instructions account for
more than 90% of all instruction executions. Therefore, eliminating rarely-used instructions should not affect
the running time.
Floating-point numbers and operations are important but are complex and slow to execute. Most
RISCs today either do not support floating-point numbers in hardware or offer this feature as a complex,
non-RISC, option.
RISCs may not be efficient with respect to virtual memory and other OS functions. There is not
enough experience with RISCs today to decide one way or the other.
Given the above arguments, one thing is clear: RISC is an important architectural topic today, if
only because of the controversy it generates. This controversy has been going on since 1975 and will
perhaps continue in the future. This is healthy, since it leads to more research in computer architecture and
to the development of diverse computer architectures.
Opcode. The fixed-size, 7-bit opcode allows for up to 128 instructions. The computer only has 31
instructions, so a 5-bit opcode would have been enough. However, when the RISC I was designed, the
designers already had in mind the next model (RISC II), and decided to make the opcode longer than
needed, for future expansion.
SCC (Set Condition Code), a 1-bit field. When set, the results of the instruction affect the condition
codes (status flags).
Rd. The destination field, one of the 32 available registers.
Rs. A source field, a register.
6.3 The RISC I Computer 225
IM. A 1-bit mode field. IM=0 implies that the second source is also a register. In such a case, the S2
field specifies a register and only 5 of the 13 bits of that field are used. IM=1 implies that the second source
is a 13-bit signed constant used, by the index mode, to obtain the address.
S2. The second source operand to the instruction. Either a register or an index.
The long instructions are used for all the branches and all the PC–relative instructions. The fields are:
Opcode and SCC. Those are the same as for the Short Format instructions.
Rd. This is interpreted as a destination register, except in the conditional branch instructions, where its
four least-significant bits constitute the condition code and the most-significant bit is unused.
IMM. A 19-bit signed constant.
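The two formats can be illustrated by a small field extractor. This is only a sketch: the field widths (7, 1, 5, 5, 1, 13 bits for the short format and 7, 1, 5, 19 bits for the long format) come from the descriptions above, but the placement of the opcode in the most-significant bits, the use of the low 5 bits of S2 as the register number, and the names decode_short, decode_long, and sign_extend are assumptions made for illustration.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def decode_short(word):
    """Split a 32-bit word into short-format fields (assumed bit layout)."""
    fields = {
        "opcode": (word >> 25) & 0x7F,  # 7 bits
        "scc": (word >> 24) & 0x1,      # 1 bit
        "rd": (word >> 19) & 0x1F,      # 5 bits
        "rs": (word >> 14) & 0x1F,      # 5 bits
        "im": (word >> 13) & 0x1,       # 1 bit
    }
    s2 = word & 0x1FFF                  # 13 bits
    if fields["im"]:
        fields["s2_const"] = sign_extend(s2, 13)  # signed constant
    else:
        fields["s2_reg"] = s2 & 0x1F    # assumption: register in the low 5 bits
    return fields


def decode_long(word):
    """Split a 32-bit word into long-format fields (assumed bit layout)."""
    return {
        "opcode": (word >> 25) & 0x7F,
        "scc": (word >> 24) & 0x1,
        "rd": (word >> 19) & 0x1F,
        "imm": sign_extend(word & 0x7FFFF, 19),   # 19-bit signed constant
    }


# A made-up word: opcode 1, SCC set, Rd = 3, Rs = 4, IM = 1, S2 = all ones.
word = (1 << 25) | (1 << 24) | (3 << 19) | (4 << 14) | (1 << 13) | 0x1FFF
print(decode_short(word))
```

Note that the six short-format fields add up to exactly 32 bits, as do the four long-format fields.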
The addressing modes are:
Indexed. EA = Rx + S2, where Rx is either Rs or Rd, and S2 is a signed constant.
Register Indirect. EA = Rd.
Array Indexed. EA = Rx + S2, where S2 specifies a register.
PC–relative. EA = PC + IMM.
The instruction set is divided into four groups. There are 12 arithmetic and logical instructions, 8
Load/Store instructions, 7 branch and call, and 4 miscellaneous ones. Following are some examples of
RISC I instructions:
Integer Add. It has fields Rs, S2, Rd and performs Rd ← Rs + S2. All the arithmetic and logic instructions
have this format.
Load Long. The fields are Rx, S2, Rd. The description is Rd ← M[Rx + S2]. The first source field serves
as an index register.
Store Long. Its fields are Rm, (Rx), S2. It stores Rm (the first source field) in the memory location
whose address is Rx + S2. The description is M[Rx + S2] ← Rm.
Conditional Jump. COND, (Rx), S2. The description is PC ← Rx + S2, but the jump is only executed
if the status flags agree with the COND field (which occupies the Rd field). The following table summarizes
the 16 COND values that can be used.
Code  Syntax     Condition
 0    none       Unconditional
 1    GT         Greater Than
 2    LE         Less or equal
 3    GE         Greater or equal
 4    LT         Less Than
 5    HI         Higher Than
 6    LOS        Lower or Same
 7    LO or NC   Lower Than or No Carry
 8    HIS or C   Higher or Same or Carry
 9    PL         Plus
10    MI         Minus
11    NE         Not Equal
12    EQ         Equal
13    NV         No Overflow
14    V          Overflow
15    ALW        Always
Call Relative (and change window). Uses fields Rd ,IMM. The call is done by saving the PC in
the destination register and incrementing the PC by the signed, 19-bit constant IMM. Thus Rd ←PC,
PC←PC+IMM.
an interrupt is generated (register file overflow) and the parameters of the new call are pushed into a stack
in main memory. When the corresponding return is executed, another interrupt (register file underflow) is
generated.
Exercise 6.1: What could be a common reason for such deep nesting of procedure calls?
This shows the importance of an optimizing compiler on the RISC I. A good optimizing compiler should be
able to generate code with just a few NOPs, thereby making full use of the delayed jumps.
As a result, the compiler writer has to be aware of the pipeline, a fact that makes the RISC I pipeline
an architectural feature. Normally, a pipeline is an organizational feature.
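Gate      Output                 Truth values
          A                      0 0 1 1
          B                      0 1 0 1
Inverse   $\overline{A}$         1 1 0 0
AND       $AB$                   0 0 0 1
OR        $A+B$                  0 1 1 1
XOR       $A\oplus B$            0 1 1 0
NAND      $\overline{AB}$        1 1 1 0
NOR       $\overline{A+B}$       1 0 0 0
XNOR      $\overline{A\oplus B}$ 1 0 0 1
(a)
(b) NAND-only circuits for Inverse ($\overline{A}$), AND ($AB$), and OR ($\overline{\overline{A}\,\overline{B}} = A+B$).
Figure 7.1: Logic gates (a) and universal NAND (b)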
Logic gates (with the exception of NOT) can have more than two inputs. A multi-input AND gate, for
example, outputs a 1 when all its inputs are 1’s.
230 7. Digital Devices
Exercise 7.1: Explain the outputs of the OR and XOR gates with several inputs.
The last three gates of Figure 7.1a are NAND, NOR, and XNOR. They are combinations of AND, OR,
and XOR, respectively, with a NOT. Such combinations are useful, and NAND and NOR are also universal
(a universal logic gate is one that can be used to simulate all other gates). Figure 7.1b shows how NAND
gates can be used to simulate NOT, AND, and OR gates.
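The universality of NAND is easy to check by brute force. The following sketch models a gate as a Python function on the bits 0 and 1; the function names are illustrative only, and the three constructions are the standard ones (a NAND with both inputs tied together, a NAND followed by such an inverter, and a NAND of two inverted inputs).

```python
def nand(a, b):
    """Two-input NAND on bits 0/1."""
    return 1 - (a & b)


def not_(a):
    # NOT: tie both NAND inputs together.
    return nand(a, a)


def and_(a, b):
    # AND: a NAND followed by a NAND-based inverter.
    return not_(nand(a, b))


def or_(a, b):
    # OR: NAND of the two inverted inputs (De Morgan's law).
    return nand(not_(a), not_(b))


# Print the truth table: A, B, NOT A, A AND B, A OR B.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, not_(a), and_(a, b), or_(a, b))
```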
Exercise 7.2: Show how NOR gates can be used to produce the same effects.
Exercise 7.3: Show how the XOR gate can be constructed out of either NOR gates or NAND gates.
The AND-OR-INVERT (AOI) gate is not really a logic gate but a simple digital device that consists
of two AND gates and a NOR gate (Figure 7.2a). If we label its four inputs a, b, c, and d, then its output
is $\overline{ab + cd}$. The AOI gate is useful because of its universality. Many useful digital devices can be generated
as combinations of these gates. Figure 7.2b,c,d shows how this device can be used to construct the NOR,
NAND, and XOR gates, respectively. Figure 7.2e shows how it can simulate an XNOR.
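A quick way to convince oneself of this universality is to model the AOI device and derive the other gates from it. The sketch below assumes that the output is the complement of ab + cd (two ANDs feeding a NOR). The particular wirings are one possible choice and are not necessarily those of Figure 7.2; all function names are illustrative.

```python
def aoi(a, b, c, d):
    """AND-OR-INVERT: complement of (a AND b) OR (c AND d)."""
    return 1 - ((a & b) | (c & d))


def inv(a):
    # All four inputs tied together give NOT a.
    return aoi(a, a, a, a)


def nor(a, b):
    # Each AND gate passes one input unchanged.
    return aoi(a, a, b, b)


def nand(a, b):
    # Both AND gates compute ab.
    return aoi(a, b, a, b)


def xor(a, b):
    # NOT(ab + a'b') is 1 exactly when a and b differ.
    return aoi(a, b, inv(a), inv(b))


def xnor(a, b):
    # NOT(ab' + a'b) is 1 exactly when a and b are equal.
    return aoi(a, inv(b), inv(a), b)
```

In this sketch the XOR and XNOR wirings need two extra AOI devices acting as inverters.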
Logic gates are easy to implement in hardware and are easy to combine into more complex circuits. Since
the gates have binary inputs and outputs, they are the main reason why computers use binary numbers.
Another reason for the use of binary numbers is that the logic gates can perform only logic operations, but
with binary numbers there is a close relationship between logic and arithmetic operations. Logic gates can
therefore be combined to form circuits that perform arithmetic operations.
Computers are composed of nothing more than logic gates stretched out to the
horizon in a vast irrigation system.
—Stan Augarten.
the latch as a memory unit for a single bit. Registers and most types of computer memories are based on
latches.
The simplest latch is shown in Figure 7.3a. It consists simply of two NOT gates connected in series.
Once power is turned on, this simple device settles in a state where one NOT gate outputs a 1, which is
input by the other NOT gate, which outputs a 0, which is input by the first NOT. Thus, the output of each
gate reinforces the input of the other, and the state is stable. Figure 7.3b shows the same device arranged
in a symmetric form, with the two NOT gates side by side.
Figure 7.3: Latches. Panels (a)-(e) show the latch circuits (inputs S and R, outputs Q and Q̄); panel (f) shows waveforms of S, R, and Q.
In order for this device to be practical, two inputs and an output are required. Figure 7.3c shows how
one of the NOT gates is arbitrarily selected and its output labeled Q. The output of the other NOT is always
the opposite, so it is labeled Q̄ and is termed the complementary output. The two inputs are called S (for
“set”) and R (for “reset”) and they are fed to the NOT gates through two OR gates. These inputs are
normally low (i.e., binary 0), so they do not affect the state of the latch. However, if S is momentarily set
to 1, the OR gate outputs a 1 regardless of its other input. This 1 is fed to the NOT on the left, which then
outputs a 0, which is sent to the OR gate on the right and, through it, to the NOT on the right. This NOT
starts outputting a 1, which is fed back, through the OR gate on the left, to the NOT on the left. The S input
can now be dropped, and the final result is that Q becomes 1 (and Q̄, of course, becomes a 0). We say that
the latch has been set.
Similarly, a momentary pulse on the R input resets the latch. This behavior is illustrated in Figure 7.3f and
this type of latch is called an SR latch (although some authors prefer the term RS latch). An SR latch is
sometimes called an asynchronous device.
Feeding 1’s through both inputs simultaneously results in an unpredictable state, so it is important to
make sure that this does not happen.
Figure 7.3d shows how a pair of NOT and OR gates is replaced by a single NOR gate, and Figure 7.3e
shows how a latch can be constructed from two NAND gates and inverted inputs. In this type of device, the
inputs are normally held high and are pulled low momentarily to change the state.
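The behavior of the NOR-based latch can be explored with a small simulation. This is an idealized sketch, not a timing-accurate model: it assumes the cross-coupled wiring Q = NOR(R, Q̄) and Q̄ = NOR(S, Q), re-evaluates both gates together, and stops when the outputs no longer change. The name settle and the max_steps limit are hypothetical.

```python
def nor(a, b):
    return 1 - (a | b)


def settle(s, r, q, qbar, max_steps=10):
    """Re-evaluate both NOR gates until the outputs stop changing."""
    for _ in range(max_steps):
        new_q = nor(r, qbar)
        new_qbar = nor(s, q)
        if (new_q, new_qbar) == (q, qbar):
            return q, qbar
        q, qbar = new_q, new_qbar
    raise RuntimeError("latch did not settle")


q, qbar = 0, 1                      # start in the reset state
q, qbar = settle(1, 0, q, qbar)     # momentary pulse on S
q, qbar = settle(0, 0, q, qbar)     # S dropped: the state is retained
print("after set:  ", q, qbar)
q, qbar = settle(0, 1, q, qbar)     # momentary pulse on R
q, qbar = settle(0, 0, q, qbar)     # R dropped
print("after reset:", q, qbar)
```

In this idealized model, driving S and R high together forces both outputs low, and releasing them at the same instant never settles (the function raises an error), which mirrors the unpredictable state mentioned above.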
In practical situations, it is preferable to have a latch that’s normally disabled. Such a latch has to be
enabled before it responds to its inputs. The advantage of such a device is that the designer does not have
to guarantee that the inputs will always have the right values. The latch is disabled by a control line, and as
long as that line is low, the inputs can have any values without affecting the state of the device. Figure 7.4a
shows such a device. It requires two AND gates, and the E input controls the response of the device. Such a
latch is said to be synchronized with the control line, and is called a flip-flop. Figure 7.4b shows a variation
of an SR flip-flop where two new inputs P (or PR) and CR can be used to preset and clear the device even
when it is disabled. Such a device is both synchronous and asynchronous as shown by the waveforms of
Figure 7.4c.
Many times, the control line is the output pulse produced by the computer clock. Computer operations
are synchronized by a clock that emits a regular train of pulses, and operations are generally
triggered by the rising (or falling) edge of the clock pulse. If the clock output is the control line, the flip-flop
can change state only on a clock edge.
The symbol shown in Figure 7.5a indicates a flipflop that’s enabled on a rising edge of the clock. In
contrast, Figure 7.5b,c shows flipflops that are enabled on a falling edge of the clock.
When both inputs of an SR latch are high, its state is undefined. Thus, the circuit designer has to make
sure that at most one input will be high at any time. One approach to this problem is to design a latch
(known as a JK flipflop) whose state is defined when both inputs are high. Another approach is to design a
D latch, where there is just one input.
The D input of a D latch (D for “data”) determines the state of the latch. However, a binary input line
can have only two values (high and low), and it needs a third value to indicate no change in the current
state. This is why the D latch has to have a control input E. When E is low, the D input does not change
the state of the latch. Figure 7.6a shows the standard symbol of a D latch, Figure 7.6b shows the basic
design, based on an SR latch (notice the asynchronous inputs; they should not be high simultaneously), and
Figure 7.6c shows how a D latch can be constructed from an SR latch and a NOT gate. Figure 7.6d is a
typical waveform.
Figure 7.6e illustrates a problem with this design. The D input goes high while the clock pulse is high,
and as a result Q goes high immediately. In the computer, we want all states to change on a clock edge
(rising or falling), so this design of a D latch is called untimed. Untimed bistables are generally called latches
and timed ones are called flipflops. In a latch, the E input is called “enable,” while in a flipflop it is called
a “clock” and is denoted by CK. In a timed flipflop, the state can change only on a clock edge (leading or
trailing), as illustrated in Figure 7.7a. One way to design such a device is to connect two SR latches in a
master-slave configuration, as shown in Figure 7.7b.
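The difference between an untimed latch and a timed flipflop can be seen by feeding the same sampled waveforms to behavioral models of both. The sketch below is an assumption-laden illustration: waveforms are lists of 0/1 samples, the flipflop is taken to respond to the rising edge (the text allows either edge), and the function names are hypothetical. It models behavior only, not the master-slave circuit itself.

```python
def d_latch(d_wave, e_wave, q=0):
    """Level-sensitive: Q follows D for as long as E is high."""
    out = []
    for d, e in zip(d_wave, e_wave):
        if e:
            q = d
        out.append(q)
    return out


def d_flipflop(d_wave, ck_wave, q=0):
    """Edge-triggered: Q copies D only when CK goes from 0 to 1."""
    out = []
    previous_ck = 0
    for d, ck in zip(d_wave, ck_wave):
        if previous_ck == 0 and ck == 1:
            q = d
        previous_ck = ck
        out.append(q)
    return out


clock = [0, 1, 1, 1, 0, 0, 1, 1]
data = [0, 0, 1, 1, 1, 1, 1, 1]     # D goes high while the clock is high
print("latch:   ", d_latch(data, clock))
print("flipflop:", d_flipflop(data, clock))
```

The latch output rises as soon as D rises during the clock pulse, whereas the flipflop output waits for the next rising edge.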
Figure 7.7: Timed D flip-flop. (a) Waveforms of D, CK, and Q; (b) master-slave configuration of two SR latches (inputs S1, R1, E1 and S2, R2, E2; outputs Q1 and Q2).
7.2 Multivibrators 233
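Figure 7.6: D flip-flops. (a) Standard symbol; (b) basic design based on an SR latch, with asynchronous inputs PR and CR; (c) construction from an SR latch and a NOT gate; (d), (e) waveforms of D, E, and Q.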
This figure shows that the first SR latch is enabled only on a trailing edge of the clock. Therefore, Q1
will change states only at that time. Output Q2 , however, can change states only during a leading edge. A
high input at D would therefore pass through Q1 during a trailing edge and would appear at Q2 during a
leading clock edge. This is a simple example of the use of the clock pulse to synchronize the propagation of
signals in digital devices.
The D latch solves the problem of S = R = 1 by having just one input. A JK flipflop solves the same
problem in a different way. This device is a timed SR flipflop with two inputs J (for “set”) and K (for
“reset”). When J = K = 1, the device toggles its state. Figure 7.8a shows the standard symbol for the
JK flipflop. Figure 7.8b shows how this device can be constructed from two SR flipflops, and Figure 7.8c
illustrates typical waveforms.
Another useful type of latch is the T flipflop. It has a single input labeled T. When T is low, the state
of the T latch does not change. When T is high, the state Q toggles for each leading clock edge. Figure 7.9a
shows how this type can be constructed from a JK flipflop, while Figure 7.9b shows typical waveforms. This
type of latch is used in counters (Section 7.3).
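The JK and T flipflops are simple enough to describe by their next-state behavior. The sketch below is a behavioral model only: each call to clock() stands for one active clock edge, and the T flipflop is obtained by tying J and K together, which is one common construction and is assumed (not confirmed) to be the one of Figure 7.9a. The class names are hypothetical.

```python
class JKFlipFlop:
    def __init__(self, q=0):
        self.q = q

    def clock(self, j, k):
        """Apply one active clock edge and return the new state Q."""
        if j and k:
            self.q ^= 1      # J = K = 1: toggle
        elif j:
            self.q = 1       # set
        elif k:
            self.q = 0       # reset
        return self.q        # J = K = 0: no change


class TFlipFlop:
    def __init__(self, q=0):
        self._jk = JKFlipFlop(q)

    def clock(self, t):
        """T drives both J and K: hold when T = 0, toggle when T = 1."""
        return self._jk.clock(t, t)

    @property
    def q(self):
        return self._jk.q


t_ff = TFlipFlop()
print([t_ff.clock(t) for t in (1, 1, 0, 1)])
```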
7.2.2 Monostable Multivibrators
A monostable (also called a one shot) is a digital device that has one stable state (normally the 0 state)
and one unstable state. In a typical cycle, a pulse on the input line switches the monostable to its unstable
state, from which it automatically drops back (after a short delay) to its stable state. Figure 7.10a shows
this behavior and Figure 7.10b shows how such a device can be built from two NOR gates and a delay line
(this is an example of a device that requires more than just logic gates).
Figure 7.9: The T flipflop. (a) Construction from a JK flipflop; (b) waveforms of T, CK, and Q.
Figure 7.10: A Monostable multivibrator. (a) Waveforms of the input and the state, showing the delay; (b) circuit with input, delay line, and output (state).
Figure 7.11: The Astable multivibrator (panels a and b).
Other designs may have different delays for the two states and may also have a “reset” input that resets
the state of the astable to low. This is a simple design for the computer clock, a design that emits a
continuous train of clock pulses.
7.3 Counters
A counter is a device (constructed from flipflops) that can perform the following functions:
1. Count pulses.
2. Divide a pulse train.
3. Provide a sequence of binary patterns for sequencing purposes.
A modulo-N counter is a device that provides an output pulse after receiving N input pulses.
Figure 7.12a is the general symbol of a counter. The main input is denoted by CK (clock). A pulse on this input
triggers the counter to increment itself. The CR (clear) input resets the counter. Counters use flipflops to
count and are therefore binary. As a result, it is easy to construct a counter that will count up to a power
of 2. It is possible to modify a binary counter to count up to any positive integer.
In the computer, counters are used in the control unit and the ALU to control the execution of
instructions. A simple example is the multiplication of two n-bit integers. Certain ALUs may perform this
multiplication by a loop that iterates n times and uses additions and shifts to multiply the numbers
(Section 9.3). Such a loop is controlled by a counter that emits an output pulse after being triggered n times,
thereby terminating the loop.
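The role of the counter in such a loop can be sketched in software. The following is a generic shift-and-add multiplication of unsigned integers, written under the assumption that the operands are nonnegative and fit in n bits; it is not necessarily the algorithm of Section 9.3, and the function name is hypothetical.

```python
def shift_add_multiply(a, b, n):
    """Multiply two unsigned n-bit integers in exactly n add-and-shift steps."""
    assert 0 <= a < (1 << n) and 0 <= b < (1 << n)
    product = 0
    counter = 0
    while counter < n:              # the counter terminates the loop after n steps
        if (b >> counter) & 1:      # examine one bit of the multiplier
            product += a << counter # add the suitably shifted multiplicand
        counter += 1
    return product


print(shift_add_multiply(13, 11, 4))
```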
Counters can be synchronous or asynchronous; the latter type is also called a ripple counter. A ripple
counter can be constructed from T flipflops by connecting the output of each flipflop to the clock (CK) input of
its successor flipflop. Figure 7.12b shows a three-stage ripple counter where each stage toggles on the trailing
edge of the input (this is implied by the small circle at the CK input). Each stage toggles its successor, since
the T inputs are permanently high. Typical waveforms are shown in Figure 7.12c. Notice that all three
outputs go back to their initial states after eight input pulses, which is why this counter is said to count
modulo-8.
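The rippling of a toggle from stage to stage can be followed in a short simulation. The sketch below assumes that every T input is held high and that a stage toggles only when its clock input falls from 1 to 0; gate delays are ignored and the function name is hypothetical. Each state is listed as (Q1, Q2, Q3), with Q1 the fastest-changing output.

```python
def ripple_count(pulses, stages=3):
    """Return the list of states (Q1, ..., Qn) after each input pulse."""
    q = [0] * stages
    history = []
    for _ in range(pulses):
        # The trailing edge of the input pulse toggles stage 1. A stage that
        # falls from 1 to 0 presents a trailing edge to its successor.
        i = 0
        while i < stages:
            q[i] ^= 1
            if q[i] == 1:       # rose from 0 to 1: the ripple stops here
                break
            i += 1
        history.append(tuple(q))
    return history


for pulse, state in enumerate(ripple_count(8), start=1):
    print(pulse, state)
```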
Exercise 7.4: There is also a reason to claim that this counter counts to 4. What is it?
Figure 7.12: A Ripple Counter with T flipflops. (a) General counter symbol with Input (CK), Reset (CR), and Output; (b) three-stage ripple counter with outputs Q1, Q2, and Q3; (c) waveforms of Input, Q1, Q2, and Q3 over eight input pulses.
Exercise 7.5: Show how a 3-stage ripple counter can be designed with JK flipflops.
Sometimes, it is preferable to have a counter where only one output line goes high for each input pulse.
Figure 7.13a shows how a 1-of-8 decoder (Section 7.6) can be connected to the outputs of a 3-stage counter,
such that each input pulse causes the next decoder output to go high. Figure 7.13b shows how such an
effect can be achieved explicitly, using AND gates instead of a decoder (this also illustrates a use for the
complementary outputs Q̄i of a flipflop).
Figure 7.13: A counter followed by a decoder. (a) Counter outputs Q1, Q2, Q3 feeding a decoder with outputs 0 through 7; (b) the same effect achieved with AND gates.
A nonbinary ripple counter can be constructed by modifying a binary counter in such a way that it
resets itself after the desired number of counts. A modulo-5 counter, for example, is easily constructed from a
modulo-8 counter where the three output lines Q1 , Q̄2 , and Q3 are connected to an AND gate whose output
resets the counter after five input pulses (Figure 7.14). Notice that the use of Q̄2 is not really necessary,
since the first time both outputs Q1 and Q3 will simultaneously be high is after the fifth input pulse.
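The modulo-5 behavior can be checked with a behavioral sketch. It treats the three stages as a binary count (Q1 is the least-significant bit), advances the count by one per input pulse, and clears all stages as soon as the AND of Q1, Q̄2, and Q3 goes high. The momentary existence of the state 101 before the clear takes effect is not modeled in time, and the function name is hypothetical.

```python
def mod5_count(pulses):
    """Return the counter value after each input pulse."""
    q1 = q2 = q3 = 0
    values = []
    for _ in range(pulses):
        count = (q1 | (q2 << 1) | (q3 << 2)) + 1   # binary count advances by one
        count &= 0b111
        q1, q2, q3 = count & 1, (count >> 1) & 1, (count >> 2) & 1
        if q1 and not q2 and q3:    # AND gate on Q1, NOT Q2, and Q3
            q1 = q2 = q3 = 0        # its output clears every stage
        values.append(q1 + 2 * q2 + 4 * q3)
    return values


print(mod5_count(10))
```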
Exercise 7.6: Show how to construct a nonbinary counter that counts up to any given positive integer N .
The counters described so far count up. They count from 0 to N − 1 by a design that toggles output Qi
whenever its predecessor Qi−1 changes from 1 to 0. It is possible to design a counter that will count from
N − 1 down to 0. This is achieved by toggling output Qi when its predecessor Qi−1 changes from 0 to 1.
The idea is to use the complementary output line Q̄i−1 as the input to stage i (Figure 7.15a).
Figure 7.15: Down counters. Panels (a) and (b) show three-stage counters with outputs Q1, Q2, and Q3; panel (c) has Up and Down control inputs.
Q3 Q2 Q1 Q3 Q2 Q1
0 0 0 1 0 0
0 0 1 1 0 1
0 1 0 1 1 0
0 1 1 1 1 1
Table 7.16: Toggling states
This mechanism of starting the output cycle when the last output is high may fail in the presence of
errors. Imagine a 5-stage ring counter. If, as a result of an error, the output pattern gets corrupted into,
say, 01001 (two 1’s), then this bad pattern will continue rotating on the output lines indefinitely. A better
design for a ring counter should be able to correct such an error automatically. Such a design is shown in
Figure 7.18b. A high signal is fed into the D input only when Q1 = Q2 = · · · = Qm−1 = 0, i.e., when all
outputs except the last one are low. In such a counter, the bad output 01001 will go through the sequence
00100 (even though Q5 is high, nothing is fed into the first input, since not all other outputs are low), 00010,
00001, 10000, etc.
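The self-correcting rule is easy to try out. The sketch below models the state as a list [Q1, ..., Qm], shifts it by one position on every clock pulse, and feeds a 1 into the first stage only when all outputs except the last are low. It is a behavioral model under these assumptions, and the function name is hypothetical.

```python
def ring_step(q):
    """Return the state [Q1, ..., Qm] after one clock pulse."""
    feed = 0 if any(q[:-1]) else 1   # high only when Q1 ... Q(m-1) are all low
    return [feed] + q[:-1]           # every stage passes its bit to the next


state = [0, 1, 0, 0, 1]              # the corrupted pattern 01001
for _ in range(4):
    state = ring_step(state)
    print("".join(str(bit) for bit in state))
```

Starting from 01001, the four printed patterns are 00100, 00010, 00001, and 10000, in agreement with the sequence given above.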
It is also possible to design an m-stage ring counter that counts up to 2m or up to other values. Such
devices are called twisted ring counters.
7.3.3 Applications of Counters
1. Digital clocks. The 60 Hz frequency of the power line is fed into a modulo-60 counter that produces an
output pulse once a second. This is again divided by 60 by another counter, to produce an output pulse
once a minute. These pulses are then used to increment a display. This is the principle of a simple consumer
digital clock.
2. Clocking ALU operations. Some ALU operations are done in steps, and a counter can be used to
trigger the start of each step.
3. Clocking the control unit. Each step performed by the control unit takes a certain amount of time.
These times are expressed as multiples of one time unit (the clock cycle) and a counter can be used to count
any desired number of clock cycles before the next step is started.
7.4 Registers
Intuitively, a register is a small storage device that can store one number. Physically, a register is a set of
latches. Each latch stores one bit, so an n-bit register consists of n latches, each with its two input and two
output lines. There are two general types of registers as follows:
1. A latch register (also referred to as a buffer). This type is used for storage. The individual latches
are not connected in any way, and modifying the state of any of them will normally not affect the others.
General-purpose registers in a computer are usually of this type. Figure 7.19a is an example of an n-bit
latch register.
2. A functional register. This type of register can store a number and can also perform an operation
on it. Common examples of functional registers are a counter, an adder, and a shifter, but there are many
more. In such a register, the individual latches are connected such that changing the state of one may affect
other ones.
Figure 7.19: Registers. (a) An n-bit latch register; (b) moving register A to register B through AND gates controlled by a "Move A to B" line.
Most registers are parallel; the n bits are sent to the register in parallel on n lines, and are read from
the register in parallel, on another set of n lines. Serial registers are also possible. The n bits are sent to
such a register serially, one at a time, on the same line. Reading a serial register is also done serially.
A register transfer is a very common and important operation in computers. It is a task any computer
performs many times each second. Given two registers A and B, how can A be transferred to B? We know
from Section 1.3.2 that when data is moved inside the computer, only a copy is moved and the original
always remains unchanged. This implies that transferring A to B makes register B identical to A and does
not change A in any way.
The principle of transferring A to B is to connect the outputs of every latch in A to the inputs of the
corresponding latch in B. This connection must be done through gates, since otherwise A would always
transfer to B, and A and B would be identical. Figure 7.19b shows how a register transfer is done. Two
AND gates connect the outputs of each latch of A to the inputs of the corresponding latch of B. The Q
output is connected to the S input and the Q̄ output, to the R input. The gates act as switches and are
normally off, so the two registers are separate. When the control unit decides to move A to B, it sends a
short pulse on the line controlling the gates. The gates switch on momentarily, which copies each latch of A
to B. After the control line returns to low, the two registers are effectively disconnected. Notice that only
one control line is necessary, regardless of how large the registers are.
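The gating described above can be written out bit by bit. In this sketch each register is a list of bits, the AND gates are modeled with the & operator, and each latch of B is treated as an SR latch that sets when S is high, resets when R is high, and otherwise keeps its state. The function name and the single move argument (the control line) are illustrative assumptions.

```python
def transfer(a_bits, b_bits, move):
    """Return the new contents of B after the control line takes the value `move`."""
    new_b = []
    for a, b in zip(a_bits, b_bits):
        s = a & move            # Q output of A's latch, gated by the control line
        r = (1 - a) & move      # complementary output of A's latch, gated likewise
        if s:
            b = 1               # S high: the latch of B is set
        elif r:
            b = 0               # R high: the latch of B is reset
        new_b.append(b)         # both low: B keeps its old bit
    return new_b


A = [1, 0, 1, 1]
B = [0, 0, 0, 1]
print(transfer(A, B, 0))        # control line low: B is unchanged
print(transfer(A, B, 1))        # control pulse: B becomes a copy of A
```

Register A is only read, never written, so it is unchanged by the transfer.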
Exercise 7.8: If register A can be transferred to B (i.e., if all the necessary AND gates are in place), can
B also be transferred to A?
7.5 Multiplexors
A multiplexor is a digital switch that’s used to switch one of several input buses to its output bus. Each bus
consists of n lines. Figure 7.20a shows a 2-to-1 multiplexor switching either bus A or bus B to the output
bus. This multiplexor must have a control line that switches it up or down. Figure 7.20 shows the details of
this simple device. Three gates, two AND and one OR, are required for each bus line. Thus, a multiplexor
to switch 16-bit buses requires 48 gates. It is also possible to have a 4-to-1 multiplexor, switching one of
four input buses to its output bus. Such a multiplexor requires two control lines.
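The three-gates-per-line structure of the 2-to-1 multiplexor can be expressed directly. In the sketch below a bus is a list of bits, and the polarity of the control line (0 selects bus A, 1 selects bus B) is an assumption, as are the function names. The inverted control signal is computed once and shared by all lines, so it is not counted among the per-line gates.

```python
def mux2(a_bus, b_bus, control):
    """2-to-1 multiplexor: control = 0 selects A, control = 1 selects B."""
    not_control = 1 - control
    # Per bus line: two AND gates feeding one OR gate.
    return [(a & not_control) | (b & control) for a, b in zip(a_bus, b_bus)]


def gate_count(n):
    """Gates needed to switch n-bit buses, at three gates per line."""
    return 3 * n


print(mux2([1, 0, 1, 0], [0, 0, 1, 1], 0))
print(mux2([1, 0, 1, 0], [0, 0, 1, 1], 1))
print(gate_count(16))
```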
Exercise 7.9: (Easy.) How many control lines are needed in a multiplexor with k inputs?
7.6 Decoders
A decoder is one of the most common digital devices. Figure 7.21 shows the truth table and the design of
a decoder with two inputs x and y and four outputs A, B, C, and D (also denoted by 0, 1, 2, and 3). The
principle of decoding is: For each possible combination of the input lines, only one output line goes high.
We say that the decoder’s input is encoded but its output is decoded, which means that the input lines can
have any binary values, but only one output line can be high at any time. A decoder has one more input
called D/E (for disable/enable). When this input is high, the decoder operates normally. When this input is
pulled low, the decoder is disabled; all its output lines are zero.
Inputs    x:  0 0 1 1
          y:  0 1 0 1
Outputs   A:  1 0 0 0   (output 0)
          B:  0 1 0 0   (output 1)
          C:  0 0 1 0   (output 2)
          D:  0 0 0 1   (output 3)
Two input lines can have $2^2 = 4$ different states, since they are encoded. A decoder with two input lines
should therefore have four output lines. It is called a “1-of-4 decoder” or a “2-to-4 decoder.” In principle,
it is possible to have a 1-of-n decoder, where n is any power of 2. In practice, n is limited to small values,
typically in the range [4, 16].
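The 1-of-4 decoder reduces to four AND terms, one per output. The sketch below follows the truth table given above, with x as the more significant input, and treats the D/E input as a third input to every AND gate. This gating is one plausible realization rather than a transcription of Figure 7.21, and the function name is hypothetical.

```python
def decoder_2to4(x, y, enable=1):
    """Return the outputs (A, B, C, D); exactly one is high when enabled."""
    not_x, not_y = 1 - x, 1 - y
    a = not_x & not_y & enable      # output 0: x = 0, y = 0
    b = not_x & y & enable          # output 1: x = 0, y = 1
    c = x & not_y & enable          # output 2: x = 1, y = 0
    d = x & y & enable              # output 3: x = 1, y = 1
    return a, b, c, d


for x in (0, 1):
    for y in (0, 1):
        print(x, y, decoder_2to4(x, y))
print("disabled:", decoder_2to4(1, 1, enable=0))
```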
7.7 Encoders
An encoder is the opposite of a decoder. Its inputs are decoded (i.e., at most one is high) and its output
lines are encoded (i.e., in binary code). Generally, such a device has $2^n$ inputs and n outputs. It is easy to
design an encoder, but its use in computers is limited.
Exercise 7.11: Show the design of an encoder with eight inputs.
Exercise 7.12: An encoder with n outputs generally has $2^n$ inputs. Can it have a different number of
inputs?
7.7.1 Priority Encoders
A priority encoder is a special case of an encoder and it happens to be a useful device. It assigns priorities to
its inputs, and its output is the encoded value of the highest-priority input. Table 7.22 is the truth table of
a priority encoder with five inputs and three outputs (plus an extra output, none, that indicates the absence
of any inputs). Input I5 has the highest priority. When this input is high, the outputs are Y2 Y1 Y0 = 101
regardless of the other inputs.
Such a device can be designed in a straightforward way, writing the logical expressions for each of the
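three outputs. We first define the five auxiliary entities
$$P_5 = I_5,\quad P_4 = \overline{I_5}\cdot I_4,\quad P_3 = \overline{I_5}\cdot\overline{I_4}\cdot I_3,\quad P_2 = \overline{I_5}\cdot\overline{I_4}\cdot\overline{I_3}\cdot I_2,\quad P_1 = \overline{I_5}\cdot\overline{I_4}\cdot\overline{I_3}\cdot\overline{I_2}\cdot I_1$$
(where the dot indicates logical AND). Then define the four outputs by
$$Y_2 = P_5 + P_4,\quad Y_1 = P_3 + P_2,\quad Y_0 = P_5 + P_3 + P_1,\quad \text{none} = \overline{P_5 + P_4 + P_3 + P_2 + P_1}.$$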
I5 I4 I3 I2 I1 Y2 Y1 Y0 none
1 x x x x 1 0 1 0
0 1 x x x 1 0 0 0
0 0 1 x x 0 1 1 0
0 0 0 1 x 0 1 0 0
0 0 0 0 1 0 0 1 0
0 0 0 0 0 0 0 0 1
Table 7.22: A priority encoder truth table
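The truth table can be checked mechanically. The sketch below uses one set of logical expressions that is consistent with Table 7.22: each auxiliary signal is high only when its input is high and all higher-priority inputs are low. The names p1 through p5 mirror the labels P1 through P5 of the figure, but the exact expressions and the function name are assumptions made for illustration.

```python
def priority_encode(i5, i4, i3, i2, i1):
    """Return (Y2, Y1, Y0, none) for a five-input priority encoder."""
    # Auxiliary signals: an input wins only if no higher-priority input is high.
    p5 = i5
    p4 = (1 - i5) & i4
    p3 = (1 - i5) & (1 - i4) & i3
    p2 = (1 - i5) & (1 - i4) & (1 - i3) & i2
    p1 = (1 - i5) & (1 - i4) & (1 - i3) & (1 - i2) & i1
    y2 = p5 | p4                    # codes 101 and 100
    y1 = p3 | p2                    # codes 011 and 010
    y0 = p5 | p3 | p1               # codes 101, 011, and 001
    none = 1 - (p5 | p4 | p3 | p2 | p1)
    return y2, y1, y0, none


print(priority_encode(0, 1, 1, 0, 1))   # I4 is the highest-priority active input
```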
This chapter covers some of the important digital devices used in computers. There are many more,
less important digital devices, as well as more sophisticated versions of the devices described here. Those
interested in more information on this important topic are referred to [Mano 97] and [Mano 91] as well as
to the many other texts in this area.
3. As read/write (RWM) or read only (ROM).
4. By whether it is volatile (i.e., loses its content when power is turned off) or nonvolatile.
5. Whether it is destructive or not (in a destructive memory, reading a word destroys its content).
Exercise 8.1: Reading a word in a destructive memory erases its content. How can the word be reread?
8.1 A Glossary of Memory Terms
Memory is a computer system’s primary workspace. It works in tandem with the CPU, to store data,
programs, and processed information that can be made immediately and directly accessible to the CPU or
to other parts of the computer. Memory is central to a computer’s operation because it forms the critical
link between software and the CPU. Computer memory also determines the size and number of programs
that can be run simultaneously, and helps to increase the capabilities of today’s powerful microprocessors.
There are several different types of memory, each with its own features and advantages. Unfortunately,
the different types are easy to confuse. This glossary is an attempt to help
sort out the confusion and act as a quick reference.
The two main forms of RAM are DRAM and SRAM.
DRAM (Dynamic RAM). This is the most common type of computer memory. It is called “dynamic”
because it must be refreshed, or re-energized, hundreds of times each second in order to retain the data in its
words. Each bit in a word in DRAM is designed around a tiny capacitor that can store an electrical charge.
A charged capacitor indicates a 1-bit. However, the capacitor loses its charge rapidly, which is why DRAM
must be refreshed.
SRAM (Static RAM). This type of memory is about five times faster, twice as expensive, and twice as
big as DRAM. SRAM does not need to be refreshed like DRAM. Each bit in SRAM is a flip-flop, a circuit
that has two stable states. Once placed in a certain state, it stays in that state. Flip-flops are faster to read
and write than the capacitors of DRAM, but they consume more power.
244 8. The Memory
Because of its lower cost and smaller size, DRAM is used mostly for main memory, while the faster,
more expensive SRAM is used primarily for cache memory.
Cache RAM. Cache is small, fast memory (usually SRAM) located between the CPU and main
memory, that is used to store frequently used data and instructions. When the processor needs data, a
special piece of hardware (the cache manager) checks the cache first to see whether the data is there. If not,
the data is read from main memory both into the cache and the processor.
Cache works much like a home pantry. A pantry can be considered a “cache” for groceries. Instead of
going to the grocery store (main memory) every time you’re hungry, you can check the pantry (cache) first
to see if the food you want is there. If it is, then you’ve saved a lot of time. Otherwise, you have to spend
the extra time to get food from the store, leave some of it in the pantry, and consume the rest.
FPM (Fast page mode) DRAM. This type of DRAM memory is divided into equal-size chunks called
pages. It has special hardware that makes it faster to read data items if they are located in the same page.
Using FPM memory is like searching in a dictionary. If the word you want is on the same page as the
previous one, it is fast and easy to scroll down the page and find the definition. If you have to flip pages,
however, it takes a little longer to find what you want.
EDO (Extended Data Out) DRAM. EDO DRAM is similar to FPM, the difference being that back-
to-back memory accesses are much faster in EDO. Because EDO is easy to implement, it has become very
popular. (EDO is sometimes called Hyper Page Mode DRAM.)
BEDO (Burst EDO) DRAM. This is a variation of EDO where a burst of data can be read from
memory with a single request. The assumption behind this feature is that the next data-address requested
by the CPU will be sequential to the last one, which is usually true. In BEDO DRAM all memory accesses
occur in bursts.
SDRAM (Synchronous DRAM). This type of DRAM can synchronize itself with the computer clock
that controls the CPU. This eliminates timing delays and makes memory reads more efficient.
SGRAM (Synchronous graphics RAM). SGRAM is an extension of SDRAM that includes graphics-
specific read/write features. SGRAM allows data to be read or written in blocks, instead of as individual
words. This reduces the number of reads and writes that the memory must perform, and increases the
performance of a graphics controller.
VRAM (Video RAM). Graphics memory must be very fast because it has to refresh the graphics screen
(CRT) many times a second in order to prevent screen “flicker.” While graphics memory refreshes the CRT,
the CPU or the graphics controller writes new data into it in order to modify the image on the screen. With
ordinary DRAM, the CRT and CPU must compete for memory accesses, causing a bottleneck of data traffic.
VRAM is a “dual-ported” memory that solves this problem by using two separate data ports. One port
is dedicated to the CRT, for refreshing the screen. The second port is dedicated for use by the CPU or
graphics controller, to modify the image data stored in video memory.
A good analogy to VRAM is a fast food drive-through with two windows. A customer places an order
and pays at one window, then drives up and picks up the food at the next window.
Figure 8.1a shows a memory unit of size M ×N and the buses connecting it to the rest of the computer
(Section 1.3 shows why M is a power of 2, $M = 2^k$). The two main types of memory used in modern
computers are semiconductor RAM (read/write) and ROM (read only) and the two main types of RAM are
static (SRAM) and dynamic (DRAM). Static RAM is the most important type of computer memory, and
its design is the first topic discussed in this chapter.
8.2 Static RAM
Static RAM stores each bit of information in a latch. Thus, there may be millions of latches in a single
SRAM memory integrated circuit (chip). A single latch and the gates around it are called a bit cell (BC,
Figure 8.1b,c) and the main problem in RAM design is to construct a large memory with the minimum
number of gates and wires. The individual BCs are organized in words, and the address lines are decoded,
to become select lines that select the BCs of an individual word.
Figure 8.1: RAM (a) and a single BC (b,c)
An intuitive way of combining individual BCs into a memory unit is shown in Figure 8.2a. Four words
of three bits each are shown. The diagram is simple and easy to understand, but is not practical because
of the multi-output decoder needed. Even a memory with 1K words (which is extremely small by today’s
standards) requires a decoder with 1024 output lines! The solution is to use two or more decoders, and to
construct a memory unit from bit-cell building blocks that are selected by two or more select lines each.
Figure 8.2b shows an example of such memory, with an individual BC shown in Figure 8.2c. Such a memory
unit has a capacity of M×1 (where M can be in the millions) and is fabricated on a single chip. Assuming
that the memory has 1K words, each decoder receives five of the 10 address lines and has 32 output lines.
The total number of decoder outputs is therefore 64 instead of 1024. There is one output line and one input
line for the entire memory.
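The saving in decoder outputs is a matter of simple arithmetic, which the following Python sketch reproduces. It is only an illustration; the function name decoder_outputs and the assumption that the address lines are split as evenly as possible among the decoders are not part of the notes.

```python
def decoder_outputs(address_lines, decoders=1):
    """Total number of decoder output lines when the address lines are
    divided as evenly as possible among the given number of decoders."""
    lines_each, extra = divmod(address_lines, decoders)
    # The first `extra` decoders receive one more address line each.
    sizes = [lines_each + (1 if d < extra else 0) for d in range(decoders)]
    # A decoder with s inputs has 2**s outputs.
    return sum(2 ** s for s in sizes)
```

With 10 address lines, decoder_outputs(10, 1) yields 1024 and decoder_outputs(10, 2) yields 64, the two values quoted in the text.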
Exercise 8.2: The memory unit of Figure 8.2a has two control lines, R/W and enable. Show how to replace
them with the two lines R and W (i.e., read and write, with no enable line).
Exercise 8.3: Suggest a way to design a 1M memory unit with a small number of decoder outputs.
A complete M×N memory unit consists of N such memory chips, as shown in Figure 8.3a. Figure 8.3b
shows how several small memory chips can be connected to increase the number of words (i.e., the value of
M). The diagram shows four memory chips of size 256K×N each, connected to form a 1M×N memory unit.
Recall that $1\text{M} = 2^{20}$, so a memory with 1M words requires 20 address lines. Each chip, however, contains
$256\text{K} = 2^{18}$ words, and requires only 18 address lines. The two most significant address lines are therefore
used to select one of the four memory chips by means of a 1-of-4 decoder.
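The way a 20-bit address is divided between the 1-of-4 decoder and the chips can be sketched in a few lines of Python. The names CHIP_BITS and split_address are hypothetical and serve only this illustration.

```python
CHIP_BITS = 18  # each 256K chip needs 18 address lines


def split_address(address):
    """Split a 20-bit address into (chip number, address inside the chip)."""
    if not 0 <= address < (1 << 20):
        raise ValueError("address must fit in 20 bits")
    # The two most-significant bits feed the 1-of-4 decoder.
    chip = address >> CHIP_BITS
    # The 18 least-significant bits are sent to every chip.
    offset = address & ((1 << CHIP_BITS) - 1)
    return chip, offset
```

Address 0 selects word 0 of chip 0, while the largest 20-bit address selects the last word of chip 3.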
8.3 ROM
The simplest type of ROM is programmed when it is manufactured and its content cannot be changed later.
Such ROM consists of a grid of M word lines and N bit lines, with no connections between word lines and bit
lines at the grid points (Figure 8.4a). The N outputs of such a grid are all zeros. In the last manufacturing
step, the data is written in the ROM by fabricating a diode at each grid intersection where a 1 is required.
This process is called masking and it sets some of the N outputs to 1’s, depending on which word is selected.
The advantage of this fabrication process is that a customer can order a custom made ROM, and this affects
only the last stage of manufacturing, not the entire process. If large quantities of identical ROMs are needed,
the price per unit is small.
When only a small number of ROM units is needed, a programmable ROM (PROM) may be the solution.
This type of ROM is manufactured as a uniform grid of words and bits, where every grid intersection has a
small diode (Figure 8.4b). Thus, every word in the PROM consists of all 1’s. The PROM is programmed
by placing it in a special machine that fuses the diodes at those grid intersections where zeros are required.
Once programmed, the content of a PROM cannot be changed (strictly speaking, bits of 1 can be changed
to 0, but not the opposite).
For applications where the content of the ROM has to be modified several times, an erasable PROM
(EPROM) is the natural choice. Such a device can be programmed, erased and reprogrammed several times.
Programming is done by trapping electrical charges at those grid points where 1’s are needed. Erasing is
done by releasing the charges, either by exposing the device to ultraviolet light or by applying high voltage
to it in a direction opposite that of the charges.
Figure 8.2: RAM (a,b) and a single BC (c)
Figure 8.3: Increasing N (a) and increasing M (b)
[Figure 8.4: ROM (a) and PROM (b) as grids of word lines and bit lines driven by a decoder]
Exercise 8.4: Suggest a typical application where the content of a ROM has to be changed several times
before it stabilizes.
8.4 PLA
The term PLA stands for programmable logic array. This device is similar to a ROM. An address is sent
to it, and out comes a piece of data. However, in a PLA, only those locations that are actually needed
are fabricated. In a conventional M×N ROM or PROM, all M locations are fabricated and exist on the
memory chip. A 1K PROM has all 1024 locations, even though any particular application may use just a
small number of locations. A PLA, on the other hand, may have 10 address lines but only 75 actual locations.
Most of the 1024 possible addresses may not correspond to any actual words, while several addresses may
correspond to the same word.
An interesting application of a PLA is conversion from the old, obsolete punched cards character codes
to ASCII codes. The standard punched cards that were so popular with computers in the past had 12
rows and 80 columns where holes could be punched. Each column corresponded to one character, so each
character had a 12-bit code. However, only 96 characters were actually used. An application that reads
punched cards and converts the character codes to ASCII may benefit from a PLA with 12-bit addresses and
96 8-bit locations. Each of the 96 locations is set to an ASCII code, and a character is converted by simply
using its original, 12-bit code as an address to the PLA. The content of that address is the required ASCII
code.
Exercise 8.5: Can this problem be solved with a ROM?
A PLA consists of an array of AND gates followed by an array of OR gates. Every input is connected
to all the AND gates, and the output of every AND gate is connected to all the OR gates. The resulting
device has two grids, one between the inputs and the AND gates and the other between the AND gates and
the OR gates. The device is made such that there are no electrical connections at the grid points. The last
step in the fabrication process is to program the device by connecting the two wires at certain grid points.
Figure 8.5a shows a PLA with four inputs a3 a2 a1 a0 , three AND gates, and two OR gates. With four input
(i.e., address) lines, there are 16 possible addresses. However, the connections shown allow only for the seven
addresses 1010 (the left AND gate), 0x01 (the center gate) and 00xx (the right one). The bottom OR gate
(output bit d0 ) outputs a 1 for addresses 1010 and 0x01, and the top OR gate outputs a 1 for addresses 1010
and 00xx.
Exercise 8.6: List the 2-bit content of each of the seven addresses.
Figure 8.5: PLA (a) and PAL (b)
A relative of the PLA is the FPLA (field programmable logic array). This device is manufactured as
a PLA with connections at all the grid points and with fuses at every input to a logic gate. A special
instrument is used to burn some fuses selectively, thereby disconnecting their inputs from the gates. This
effectively programs the FPLA.
A third member of the PLA family is the PAL (programmable array logic). This useful digital device
is similar to a PLA with the difference that the array of OR gates is not programmable (i.e., there is no
complete grid between the AND gates and the OR gates). Figure 8.5b shows an example.
         A+B     A−B
A  B     S  C    D  R
0  0     0  0    0  0
0  1     1  0    1  1
1  0     1  0    1  0
1  1     0  1    0  0
Table 9.1: Adding and subtracting two bits
It is clear from the table that $S = D = A \oplus B$, $C = A \cdot B$, and $R = \overline{A} \cdot B$ (where ⊕ is the XOR operation,
Table 2.6). Thus, adding two bits requires an XOR gate and an AND gate, and subtracting them requires the same
gates plus an inverter. These two simple operations are similar, so they can be combined in one device with
a control line specifying which operation is required. Such a device is called a half-adder-subtractor (HAS)
because it cannot add or subtract entire numbers.
A full adder (FA) and full subtractor (FS) operate on entire numbers, not just a single pair of bits, by
propagating the carry or borrow. A FA should be able to add two bits and an input carry, and produce
a sum and an output carry. A FS should be able to subtract two bits, subtract an input borrow from the
result, and produce a difference and an output borrow. Table 9.2 lists the eight cases that have to be handled
by a full-adder-subtractor. The input carry and input borrow are denoted by IC and IB, respectively.
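It is easy to see that the four outputs can be produced by
$S = A \oplus B \oplus IC$,
$C = \overline{A} \cdot B \cdot IC + [A \cdot \overline{B} \cdot IC + A \cdot B \cdot \overline{IC} + A \cdot B \cdot IC] = \overline{A} \cdot B \cdot IC + A(B + IC)$,
$D = A \oplus B \oplus IB$,
$R = [\overline{A} \cdot \overline{B} \cdot IB + \overline{A} \cdot B \cdot \overline{IB} + \overline{A} \cdot B \cdot IB] + A \cdot B \cdot IB = A \cdot B \cdot IB + \overline{A}(B + IB)$.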
250 9. The ALU
            A + B + IC    A − B − IB
A  B  IC    S  C          IB    D  R
0  0  0     0  0          0     0  0
0  0  1     1  0          1     1  1
0  1  0     1  0          0     1  1
0  1  1     0  1          1     0  1
1  0  0     1  0          0     1  0
1  0  1     0  1          1     0  0
1  1  0     0  1          0     0  0
1  1  1     1  1          1     1  1
Table 9.2: Adding and subtracting three bits
As a result, a full-adder-subtractor can be designed with four inputs A, B, Z (IC or IB), and T (high for
add and low for subtract) and two outputs X (S or D) and Y (C or R). This device is defined by
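$X = A \oplus B \oplus Z$,
$Y = C \cdot T + R \cdot \overline{T} = \overline{A} \cdot B \cdot Z \cdot T + A(B + Z)T + A \cdot B \cdot Z \cdot \overline{T} + \overline{A}(B + Z)\overline{T}$
$\quad = B \cdot Z(\overline{A}\,T + A\,\overline{T}) + (B + Z)(A \cdot T + \overline{A} \cdot \overline{T})$
$\quad = B \cdot Z(A \oplus T) + (B + Z)\overline{(A \oplus T)}$.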
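The definition of the FAS can be checked against ordinary arithmetic for all eight input combinations. The following Python sketch is a minimal check, not a gate-level design; the function names fas and check_fas are hypothetical, and the second factor of Y is taken to be the complement of A ⊕ T, which is the reading consistent with Table 9.2.

```python
from itertools import product


def fas(a, b, z, t):
    """Full adder/subtractor. t = 1 adds, t = 0 subtracts.

    Returns (x, y): x is the sum or difference bit,
    y is the output carry or borrow.
    """
    x = a ^ b ^ z
    a_xor_t = a ^ t
    # Y = B.Z.(A xor T) + (B + Z).NOT(A xor T)
    y = (b & z & a_xor_t) | ((b | z) & (1 - a_xor_t))
    return x, y


def check_fas():
    """Compare fas() with integer addition and subtraction of single bits."""
    for a, b, z in product((0, 1), repeat=3):
        total = a + b + z
        assert fas(a, b, z, 1) == (total & 1, total >> 1)
        diff = a - b - z
        assert fas(a, b, z, 0) == (diff & 1, 1 if diff < 0 else 0)
    return True
```

Calling check_fas() runs through the eight rows of Table 9.2 for both operations and raises an AssertionError if any row disagrees.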
Exercise 9.1: Use logic gates to design an FAS.
Such a FAS, combined with a latch, can be used as a serial adder/subtractor (Figure 9.3a). This
circuit is triggered by the clock pulse, and it adds one pair of bits for each pulse. The numbers to be
added/subtracted are held in two shift registers whose least-significant bits are fed to the A and B inputs of
the FAS. The X output of the FAS is fed into the result register (also a shift register), and the Y output is
sent to the latch, where it is saved until the next clock pulse, when it is sent to the Z input.
[Figure 9.3: a serial adder/subtractor (a) and a parallel (ripple-carry) adder/subtractor (b)]
Figure 9.3b shows how the FAS can serve as a building block of a parallel adder/subtractor. Each FAS
receives a pair of bits Ai and Bi . The carry/borrow is sent to the next most-significant FAS, where it is
added to the next pair of bits. The least significant bit position does not have any input carry/borrow, so
it needs only a HAS, not a FAS. This is a simple design, but it is not completely parallel. All the stages get
their inputs in parallel and produce their outputs simultaneously, but the two outputs of stage i are initially
wrong, since the Z input to this stage is initially wrong. Stage i receives the correct Z input from stage
i − 1 only when that stage produces correct outputs, and this happens only after stage i − 1 has received
the correct Z input from stage i − 2. All the stages operate continuously, but there is a ripple effect that
propagates from the least-significant to the most-significant stages. The most-significant stage produces
correct outputs after a delay proportional to the number of stages, so this device is not truly parallel and
deserves the name ripple-carry adder/subtractor.
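The ripple effect is easy to see in a software model, where the loop variable carry plays the role of the Z input that each stage must wait for. The following Python sketch is only a behavioral illustration (it uses integer arithmetic inside each stage instead of gates); the names ripple_add_sub, to_bits, and from_bits are hypothetical.

```python
def to_bits(value, n):
    """Return the n low-order bits of value, least-significant bit first."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Inverse of to_bits: combine bits (LSB first) into an unsigned integer."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_add_sub(a_bits, b_bits, add=True):
    """Add or subtract two equal-length bit lists (LSB first).

    Returns (result bits, final carry or borrow).
    """
    carry = 0  # the least-significant stage has no input carry/borrow
    result = []
    for a, b in zip(a_bits, b_bits):
        # Stage i cannot finish before stage i-1 has produced `carry`.
        if add:
            total = a + b + carry
            result.append(total & 1)
            carry = total >> 1
        else:
            diff = a - b - carry
            result.append(diff & 1)
            carry = 1 if diff < 0 else 0
    return result, carry
```

The number of loop iterations equals the number of stages, which mirrors the delay of the hardware: it is proportional to the word size.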
In practice, the computer uses two’s complement numbers (Section 2.20.3), so the ALU does not need
a subtractor, but it needs a complementor. More sophisticated methods for adding integers are described
below, but first here is a short discussion of two’s complementors.
A parallel two’s complementor is shown in Figure 9.4a. The “complement” control signal transfers the
contents of register A to a counter while inverting every bit. After a short delay, to guarantee that all the
bits have been transferred, the control signal arrives at the counter and increments its content by 1, to form
the two’s complement.
There are two special cases. The case A = 0 generates overflow, since the counter is first filled up
with 1’s, then incremented. The case A = 10 . . . 0 is also problematic, since the counter is first filled up
with 01 . . . 1, then incremented to produce the original value 10 . . . 0. The conclusion is that this number
(the smallest negative integer) cannot be 2’s complemented. This case can easily be detected, though, by
comparing the final sign of the counter to the original sign of A. They should be different and the case above
is the only one where they will be identical.
[Figure 9.4: two’s complementors, parallel (a) and serial (b)]
It is also possible to generate the two’s complement of an integer serially, bit by bit. This is based on the
observation that the 2’s complement of the integer x . . . x10 . . . 0 is x̄x̄ . . . x̄10 . . . 0, i.e., the least-significant
zeros and the first bit of 1 preceding them are preserved in the process. All the bits to the left of that 1
should be flipped.
are shifted to the right). As long as these bits are zeros, the latch stays reset, so the upper AND gate is
open. The first 1 sets the latch (and also passes to the result register unchanged), so the lower AND gate
opens and all the remaining bits from A are complemented before they pass to the result register.
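The serial rule (keep the least-significant zeros and the first 1, flip everything to the left of that 1) can be modeled in software. In the following Python sketch the Boolean variable seen_one stands in for the latch; the function name serial_twos_complement is hypothetical and the code is a behavioral illustration, not a description of the circuit itself.

```python
def serial_twos_complement(bits):
    """Two's complement of a bit list given least-significant bit first.

    The result is returned in the same order.
    """
    seen_one = False  # plays the role of the latch, initially reset
    out = []
    for bit in bits:
        if seen_one:
            out.append(1 - bit)   # latch set: the bit is complemented
        else:
            out.append(bit)       # latch reset: the bit passes unchanged
            if bit == 1:
                seen_one = True   # the first 1 sets the latch
    return out
```

For the 5-bit integer 00110 (the list [0, 1, 1, 0, 0], least-significant bit first) the function returns [0, 1, 0, 1, 1], which is 11010.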
Exercise 9.3: Show that the remaining bits from A do not affect the state of the latch regardless of their
values.
Once the ALU knows how to produce the two’s complement of an integer, only an adder is needed in
order to implement both addition and subtraction. The next section describes an efficient adding technique
known as carry look-ahead.
where $G_i = A_i B_i$ and $P_i = \overline{A_i} B_i + A_i \overline{B_i} = A_i \oplus B_i$. This is a recursive expression for $C_i$ that can be written
explicitly
$C_0 = G_0$,
$C_1 = G_1 + P_1 C_0$,
$C_2 = G_2 + P_2 C_1 = G_2 + P_2 (G_1 + P_1 C_0) = G_2 + P_2 G_1 + P_2 P_1 C_0$,
$C_3 = G_3 + P_3 C_2 = G_3 + P_3 (G_2 + P_2 G_1 + P_2 P_1 C_0) = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 C_0$,
$\vdots$
$C_n = G_n + P_n C_{n-1} = G_n + P_n G_{n-1} + P_n P_{n-1} G_{n-2} + \cdots + P_n P_{n-1} P_{n-2} \cdots P_1 C_0$.
Thus, the CLA circuit receives the Gi ’s and Pi ’s from the individual modified full adders (except the
single half adder, which sends C0 ), and generates all the Ci ’s (from C1 to Cn ) in parallel. Figure 9.5a is a
schematic diagram of the FAs and the CLA circuit. Figure 9.5b shows the details of generating C1 , C2 , and
C3 . It is clear that fan-in becomes a problem. Even C3 requires gates with a fan-in of 4! Therefore, a practical
CLA adder must be designed in several parts. Each part waits for a carry bit from its predecessor, adds its
input bits using the CLA method, then sends its most-significant carry to the next part. Figure 9.5c shows
a 12-bit, three-part adder, where each part adds four bits. Such an adder is a cross between a ripple-carry
adder and a true CLA adder.
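The explicit expressions for the carries can be evaluated directly, each one from the G's and P's alone, without waiting for the previous carry. The following Python sketch does this and is only an illustration of the formulas (software evaluates them one after the other, whereas the circuit evaluates them in parallel); the function name cla_carries is hypothetical.

```python
def cla_carries(a_bits, b_bits):
    """Return the list of carries, one per stage, computed from the
    generate and propagate signals. Bit lists are LSB first.

    Stage 0 is a half adder, so its carry is simply G0.
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]   # Gi = Ai.Bi
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # Pi = Ai xor Bi
    carries = [g[0]]
    for i in range(1, len(g)):
        # Expanded form:
        # Ci = Gi + Pi.G(i-1) + Pi.P(i-1).G(i-2) + ... + Pi...P1.G0
        c = g[i]
        prod = 1
        for j in range(i, 0, -1):
            prod &= p[j]
            c |= prod & g[j - 1]
        carries.append(c)
    return carries
```

The inner loop for stage i combines i + 1 terms, the longest of which has i + 1 factors. This is the software counterpart of the growing fan-in mentioned above.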
Exercise 9.5: The computation of C3 requires the expression P3 P2 P1 G0 , which is one reason for the fan-in
of 4. However, the subexpression P2 P1 G0 is available as part of the calculation of C2 , and so can be used to
reduce the fan-in required for C3 . What’s wrong with this argument?
9.1 Integer Addition and Subtraction 253
[Figure 9.5: the FAs and the CLA circuit (a), generating C1, C2, and C3 (b), and a 12-bit, three-part adder (c)]
[Figure: the double register AU, AL with Up, Down, Left, and Right shifts (a) and details of its stages (b)]
Adding and subtracting requires an FAS for each bit position (except the least-significant one, where
only an HAS is needed). Any register S can be transferred to the row of FASs, added to or subtracted from
AU, with the result sent to AL. Thus, the two control signals ADD and SUB perform AL ← AU + S and
AL ← AU − S, respectively. Figure 9.7a,b shows the general setup and details.
The 1’s complement operation in our accumulator is $AL \leftarrow \overline{AU}$. This is done by transferring the Q
output of each stage AUi to the R input of stage ALi and transferring the Q̄ output of AUi to the S input
of ALi . The logical AND operation is AL ← (AL and AU). This is done by feeding the Q outputs of ALi
and AUi to an AND gate, and sending the gate’s output to the S input of ALi while the inverse of the gate’s
output is sent to the R input of ALi .
9.3 Integer Multiplication 255
[Figure 9.7: adding and subtracting in the accumulator, general setup (a) and details (b)]
     1001    multiplicand
   × 0110    multiplier
  -------
     0000
    1001
   1001
  0000
  -------
  0110110
Table 9.8: Multiplying two 4-bit integers
2. MQ (multiplier-quotient). This is originally set to the multiplier. At the end of the operation, the
MQ contains the least-significant part of the product.
3. S (source). This contains the multiplicand (it can be a register or a word in memory).
The algorithm is based on shifts and additions, but is different from the rule above. It generates the
partial product only when this should be the multiplicand. A partial product of zeros is never generated.
Instead of shifting the partial product to the left, the sum-so-far (which is contained in the pair AC, MQ) is
shifted to the right and nonzero partial products are added to it.
1. AC ← 0.
2. If MQ0 = 1, then AC = AC + S.
3. right shift AC and MQ as one unit.
4. Repeat steps 2 and 3 n times.
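The four steps can be traced in software. The following Python sketch models AC, MQ, and S as unsigned integers; the function name shift_add_multiply is hypothetical, and the sketch assumes that the carry out of the addition in step 2 is kept and shifted back into AC in step 3.

```python
def shift_add_multiply(multiplicand, multiplier, n):
    """Unsigned n-bit multiplication with registers AC and MQ and source S."""
    mask = (1 << n) - 1
    s = multiplicand & mask
    ac, mq = 0, multiplier & mask       # step 1: AC <- 0; MQ holds the multiplier
    for _ in range(n):                  # step 4: repeat steps 2 and 3 n times
        if mq & 1:                      # step 2: test MQ0
            ac += s                     # AC may briefly hold n+1 bits (carry out)
        # Step 3: shift AC and MQ right as one unit. The bit leaving AC
        # on the right enters MQ on the left.
        mq = (mq >> 1) | ((ac & 1) << (n - 1))
        ac >>= 1
    return (ac << n) | mq               # AC is the upper half, MQ the lower half
```

With the operands of Table 9.8, shift_add_multiply(9, 6, 4) returns 54, which is 0110110 in binary.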
In order to simplify our implementation, we assume that both the AC and MQ are double registers, i.e.,
there are two registers AU and AL (for upper and lower) and two registers MU and ML. We assume that
each can be shifted up and down (i.e., AU ← AL and AL ← AU) and also left and right. The control signals
for the vertical shifts are denoted by SAU, SAD, SMU, and SMD. The algorithm now becomes
1. AU ← 0.
2. SMD. If MU0 = 1, then AL = AU + S, else SAD.
3. right shift AC and MQ as one unit.
4. Repeat steps 2 and 3 n times.
Figure 9.10a is a schematic diagram of the integer multiplication circuit of the ALU. It shows that bit 0
of the MU register is fed to this circuit. In addition, the circuit also receives a “start” input from the control
unit, and the clock pulse φ. When done, the multiplication circuit sends a “done” signal back to the control
unit. The ADD output is a control signal to perform the addition AL = AU + S. The SR output is a control
signal that shifts AC and MQ to the right. The circuit contains a modulo 2n + 2 counter (Figure 9.10b)
that’s reset by the “start” signal and is clocked by the clock pulse. The counter has L + 1 output lines C0 ,
C1 ,. . . ,CL that generate the correct sequence of shift signals sent to the AC and MQ. This sequence is listed
in Table 9.9.
C0  C1  C2  ...  CL    output
0   0   0   ...  0     CR
1   0   0   ...  0     SMD and either ADD or SAD
0   1   0   ...  0     SR (1st time)
1   1   0   ...  0     SMD and either ADD or SAD
0   0   1   ...  0     SR (2nd time)
.   .   .        .     .
0   1   1   ...  1     SR (nth time)
1   1   1   ...  1     done
Table 9.9: Outputs of the multiplication circuit
Using the table, it is not hard to figure out how each of the outputs of the multiplication circuit is
generated by the L + 1 counter outputs.
Therefore, only the two expressions C1 C2 . . . CL and C̄1 C̄2 . . . C̄L have to be generated. This part of the
multiplication circuit is shown in Figure 9.10c.
Figure 9.10: Integer multiplication (a) general, (b) counter, and (c) details
Exercise 9.6: There is a problem with the circuit as shown in Figure 9.10b. When the multiplication is
over, the clock pulse continues to clock the counter, thereby generating SMD, ADD, and other signals that
may corrupt the content of the accumulator and MQ registers. Show how to stop clocking the counter as
soon as the “done” signal is generated.
This simple method can multiply only unsigned numbers. When 2’s complement numbers are used,
any negative numbers should be complemented before the multiplication starts, and the sign of the product
should be determined by comparing the signs of the original numbers. It is possible, however, to design a
multiplication circuit that multiplies signed numbers directly. Such a circuit must be able to handle two
cases as follows:
1. The multiplicand (A in A×B) is negative. The partial products (which are either A or zero) are
negative. When shifting AC and MQ to the right, a 1 should be inserted on the left end, instead of a zero.
In general, the sign bit of A should be inserted.
2. The multiplier B is negative. Here we observe that if an integer N is represented by n bits, then the
2’s complement of N equals $2^n - N$. Therefore, a negative B has an unsigned value of $2^n - |B|$ and multiplying
A by this value results in $A(2^n - |B|) = 2^n A - A \cdot |B| = 2^n A + A \cdot B$. The result should therefore be corrected by subtracting $2^n A$.
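The correction of case 2 can be tried numerically. The following Python sketch treats the bit pattern of a negative multiplier as an unsigned integer, multiplies, and then subtracts $2^n A$. The function name multiply_negative_multiplier is hypothetical, and the sketch covers only a nonnegative multiplicand.

```python
def multiply_negative_multiplier(a, b, n):
    """Multiply a >= 0 by b < 0, where b is an n-bit two's complement integer."""
    # The n-bit pattern of b, read as an unsigned number, is 2**n - |b|.
    b_unsigned = b & ((1 << n) - 1)
    # This is what an unsigned multiplier produces: a * (2**n - |b|).
    raw = a * b_unsigned
    # Correct the result by subtracting 2**n * a.
    return raw - (a << n)
```

For example, with n = 4 the integer −3 has the pattern 1101, whose unsigned value is 13. Multiplying 5 by 13 gives 65, and subtracting 16×5 = 80 leaves −15.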
Exercise 9.7: Illustrate these concepts by multiplying 12×(−6) and (−6)×12.
$$\begin{array}{cccccccc}
 & & & & A_4 & A_3 & A_2 & A_1 \\
 & & & & B_4 & B_3 & B_2 & B_1 \\ \hline
R_8 & R_7 & R_6 & R_5 & R_4 & R_3 & R_2 & R_1.
\end{array}$$
Four partial products are generated by ANDing A with each of the four bits of B
$A_4 B_1 \quad A_3 B_1 \quad A_2 B_1 \quad A_1 B_1$,
$A_4 B_2 \quad A_3 B_2 \quad A_2 B_2 \quad A_1 B_2$,
$A_4 B_3 \quad A_3 B_3 \quad A_2 B_3 \quad A_1 B_3$,
$A_4 B_4 \quad A_3 B_4 \quad A_2 B_4 \quad A_1 B_4$.
The eight bits of the result are produced by adding the seven columns of these partial products as follows
(where the notation $A_i B_j$ means a logical AND but a “+” means arithmetic addition):
$R_1 = A_1 B_1$,
$R_2 = A_2 B_1 + A_1 B_2$,
$R_3 = A_3 B_1 + A_2 B_2 + A_1 B_3 + \text{carry from } R_2$,
$R_4 = A_4 B_1 + A_3 B_2 + A_2 B_3 + A_1 B_4 + \text{carry from } R_3$,
$R_5 = A_4 B_2 + A_3 B_3 + A_2 B_4 + \text{carry from } R_4$,
$R_6 = A_4 B_3 + A_3 B_4 + \text{carry from } R_5$,
$R_7 = A_4 B_4 + \text{carry from } R_6$,
$R_8 = \text{carry from } R_7$.
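The column sums can be simulated directly. The following Python sketch adds the columns of ANDed bits from right to left and passes the carry of each column to the next one; it is a behavioral illustration only, and the function name column_multiply is hypothetical. Notice that the carry out of a middle column may be greater than 1, since up to four products and an incoming carry are added there.

```python
def column_multiply(a_bits, b_bits):
    """Multiply two unsigned numbers given as equal-length bit lists.

    a_bits = [A1, A2, A3, A4] and b_bits = [B1, B2, B3, B4], least-significant
    bit first. Returns [R1, R2, ..., R8] in the same order.
    """
    n = len(a_bits)
    result = []
    carry = 0
    for col in range(2 * n - 1):            # the seven columns for n = 4
        total = carry
        for i in range(n):
            j = col - i
            if 0 <= j < n:
                total += a_bits[i] & b_bits[j]   # the product A(i+1).B(j+1)
        result.append(total & 1)            # the low bit is the result bit
        carry = total >> 1                  # the rest moves to the next column
    result.append(carry)                    # R8 is the carry from R7
    return result
```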
[Figure: array multiplier that adds the products AiBj with rows of full adders (FA) and half adders (HA)]
We denote each example by a = b − c = d. The integer d following the second equal sign in each of the
examples combines the bits of b and c, with a bar added above the bits of c. This integer is the recoded
version of the multiplier. Using this principle, the (unsigned) multiplication

    11101010                                   11101010
  ×    11110     can be written in the form  ×   10001̄0 .

Originally, four copies of the multiplicand (shifted relative to each other) had to be added. In the Booth
form, only two copies (one of them two’s complemented) need be added.
Recoding any (signed) n-bit multiplier to the compact form used by the Booth algorithm is done in
three simple steps as follows:
Step 1. Denote the bits of the multiplier by $b_{n-1} \ldots b_1 b_0$, where $b_{n-1}$ is the sign bit. Add a bit $b_{-1} = 0$
on the right side of the multiplier. For example, 111001111 becomes 111001111|0.
Step 2. Scan the multiplier from $b_0$ on the right to $b_{n-1}$ on the left. Since each $b_i$ (including $b_0$) now
has a predecessor, Table 9.12 can be used to determine the bit (0, 1, or −1) that recodes $b_i$.
Step 3. Remove the extra bit on the right.
The numbers can now be multiplied by preparing a partial product for each nonzero bit of the multiplier.
For a bit of 1, the partial product is a copy of the multiplicand. For a bit of −1, the partial product is the
two’s complement of the multiplicand. The partial products are shifted and added in the usual way.
   Multiplier        Recoded
bit i    bit i−1     bit i
  0        0           0
  0        1           1
  1        0          −1
  1        1           0
Table 9.12: Booth compact multiplier
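The three recoding steps and Table 9.12 translate into a few lines of code, because each recoded digit is simply the predecessor bit minus the current bit. The following Python sketch is an illustration only; the function names booth_recode and booth_value are hypothetical, and the multiplier is passed as a string of bits with the sign bit first.

```python
def booth_recode(bits):
    """Recode a multiplier given as a bit string, sign bit first.

    Returns the recoded digits (each 0, 1, or -1), most significant first.
    """
    extended = bits + "0"                      # step 1: add the extra bit on the right
    digits = []
    for i in range(len(bits)):                 # step 2: recode every bit
        current = int(extended[i])
        predecessor = int(extended[i + 1])     # the bit to the right of the current one
        # Table 9.12: 00 -> 0, 01 -> 1, 10 -> -1, 11 -> 0
        digits.append(predecessor - current)
    return digits                              # step 3: the extra bit is not included


def booth_value(digits):
    """Evaluate a list of recoded digits as a signed integer."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value
```

The multiplier 0011110 (30) is recoded into [0, 1, 0, 0, 0, -1, 0], and the 5-bit multiplier 11010 (−6) is recoded into [0, -1, 1, -1, 0]. In both cases booth_value returns the original signed value.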
We illustrate this method with two examples of multiplying signed integers. The first example is the
two 7-bit positive integers $46 \times 30 = 0101110_2 \times 0011110_2 = 10101100100_2 = 1380$. Table 9.13a shows how
these numbers are multiplied “normally.” The Booth method is shown in Table 9.13b (notice that there is
a carry that should be ignored).
(a)
        0101110
      × 0011110
  -------------
        0000000
       0101110
      0101110
     0101110
    0101110
   0000000
  0000000
  -------------
  0010101100100

(b)
           0101110
         × 010001̄0
    --------------
    00000000000000
    1111111010010      (2’s compl.)
    000000000000
    00000000000
    0000000000
    000101110
    00000000
    --------------
 1) 00010101100100

(c)
          01100
        × 01̄11̄0
     ----------
     0000000000
     111110100
     00001100
     1110100
     000000
     ----------
    11110111000

Table 9.13: Booth multiplication examples
The second example is of 5-bit integers where the multiplier is negative, $12 \times (-6) = 01100_2 \times 11010_2 =
110111000_2 = -72$. The multiplier is recoded as 01̄11̄0 and the process is summarized in Table 9.13c.
It is clear that the Booth algorithm is data dependent. Certain multipliers are better than others and
result in fast multiplication. The best Booth multiplier is a number with just one long run of 1’s, such as
0011 . . . 100 . . . 0. Such a multiplier recodes into just two nonzero bits and therefore takes full advantage of
the algorithm.
Exercise 9.8: What is the worst multiplier for the Booth algorithm?
The two advantages of the Booth algorithm are (1) both positive and negative multipliers are recoded
to the compact representation in the same way and (2) it normally speeds up the multiplication by skipping
over 1’s. On the other hand, the hardware implementation of this algorithm is complex, so it is normally
implemented in microcode.
9.3.3 Three-Bit Shift
The principle of the basic integer multiplication method of Section 9.3 is to multiply the multiplicand by
each bit of the multiplier to generate the partial products. This results in a simple algorithm, since each
partial product is either zero or is the multiplicand. The tradeoff is slow execution, because the algorithm
performs an iteration for each bit of the multiplier. The three-bit shift method performs an iteration for
each segment of three bits of the multiplier. Since a group of three bits can have values in the range [0, 7],
each partial product has the form i×multiplicand where 0 ≤ i ≤ 7.
The three-bit method starts by preparing the eight quantities i × multiplicand for i = 0, 1, . . . , 7 in a
buffer. The multiplier is segmented into groups of three bits, denoted by . . . , s_3, s_2, s_1, s_0. The algorithm
then goes into a loop where in iteration i it uses segment s_i as a pointer to the buffer. The number pointed
to becomes the ith partial product. It is shifted to the left by 3i positions (i.e., multiplied by 2^(3i)) and is
added to the product-so-far. The advantage of this method is a short loop (one-third the number of iterations,
compared to the basic algorithm of Section 9.3). The downside is the overhead needed to prepare the eight
multiples of the multiplicand. Also, since the eight words of the buffer contain up to 7×multiplicand, they
have to be three bits wider than the multiplicand. If the multiplicand occupies an n-bit word, the buffer must
consist of special (n + 3)-bit registers.
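The loop of the three-bit method can be sketched in a few lines of Python. This is a software illustration under our own naming (multiply_three_bit_shift); it ignores register widths and assumes unsigned operands.

```python
def multiply_three_bit_shift(multiplicand, multiplier):
    """Unsigned multiplication that consumes three multiplier bits per iteration."""
    # The eight prestored multiples 0*m ... 7*m (hardware would build them by additions).
    buffer = [i * multiplicand for i in range(8)]
    product = 0
    i = 0
    while multiplier:
        segment = multiplier & 0b111             # s_i, used as a pointer into the buffer
        product += buffer[segment] << (3 * i)    # partial product, shifted left 3i positions
        multiplier >>= 3                         # move on to the next segment
        i += 1
    return product
```

A 24-bit multiplier needs only eight iterations here, compared with 24 iterations of the bit-by-bit method.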
It is also possible to have a four-bit shift multiplication, and this method can have two versions. The
simple version is a straightforward extension of the three-bit method. It uses a buffer where the 16 multiples
i×multiplicand for i = 0, 1, . . . , 15 are prestored. Each iteration uses the next four bits of the multiplier as
a pointer to the buffer.
The sophisticated version uses just a 9-word buffer where the nine multiples i × multiplicand for i =
0, 1, . . . , 8 are prestored. Each group of four multiplier bits is converted into a partial product that is shifted
to the left by 4i positions (i.e., multiplied by 2^(4i)) and is either added to or subtracted from the
product-so-far. If the current segment s_i of four
multiplier bits is in the range [0, 7], it is used as a pointer to the 9-word buffer and the word pointed to is
added (after being shifted) as a partial product. If s_i is in the range [8, 15], then the difference 16 − s_i is first
computed. This quantity is in the range [1, 8]. It is used as a pointer to the buffer, and the word pointed
to is shifted and then subtracted from the product-so-far. This means that the product-so-far is smaller
by 16×multiplicand (shifted) from what it should be. To compensate, the next multiplier segment s_(i+1) is
incremented by 1. The following (unsigned) example illustrates this version.
We select an arbitrary multiplicand m and the integer
    0011 1010 1111 1100 0100 = 241,604
     s_4  s_3  s_2  s_1  s_0
as our multiplier. The first segment, s_0, is 4, so it is used as a pointer to the buffer, and the multiple 4 · m
is added (multiplied by 2^0, i.e., unshifted) to the product-so-far (that’s still zero). The second segment, s_1, is 12,
so 16 − 12 = 4 is used as a pointer, the multiple 4 · m is subtracted (after being multiplied by 2^4) from the
product-so-far, and segment s_2 is incremented by 1. This segment is originally 15, so incrementing it causes
it to overflow. Its value is wrapped around to become zero, and the next segment, s_3, is incremented from
10 to 11. Segment s_2 does not change the product-so-far, and segment s_3 subtracts 16 − 11 = 5 multiples
of the multiplicand (multiplied by 2^12) from it and increments segment s_4 from 3 to 4. The last step is to
use segment s_4 as a pointer and to increment the product-so-far by 4 · m (multiplied by 2^16). The final
value of the product-so-far is
4·m − 4·m·2^4 − 5·m·2^12 + 4·m·2^16 = (4 − 64 − 20,480 + 262,144)·m = 241,604·m.
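The sophisticated four-bit version, including the increment that propagates from one segment to the next, can be sketched in Python as follows. The function name and the explicit carry variable are our own choices; the sketch assumes unsigned operands and is not meant as a description of any particular hardware.

```python
def multiply_four_bit_shift(m, multiplier):
    """Unsigned multiplication using a 9-word buffer and four multiplier bits per step."""
    buffer = [i * m for i in range(9)]        # the nine prestored multiples 0*m ... 8*m
    product = 0
    carry = 0                                 # the "increment the next segment" flag
    i = 0
    while multiplier or carry:
        segment = (multiplier & 0xF) + carry  # s_i, possibly incremented by the previous step
        multiplier >>= 4
        carry = 0
        if segment == 16:                     # a segment of 15 was incremented: wrap to zero
            segment = 0
            carry = 1                         # and pass the increment on to the next segment
        if segment <= 7:
            product += buffer[segment] << (4 * i)        # add the shifted multiple
        else:
            product -= buffer[16 - segment] << (4 * i)   # subtract (16 - s_i) multiples
            carry = 1                                    # compensate in the next segment
        i += 1
    return product
```

With the multiplier 241,604 used in the example, the loop adds 4·m, subtracts 4·m·2^4, skips the third segment, subtracts 5·m·2^12, and adds 4·m·2^16, for a total of 241,604·m.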
9.4 Integer Division
dividend / divisor = quotient, remainder or dividend ÷ divisor = quotient, remainder.
The algorithm described here assumes that the dividend is stored in the register pair AC and MQ and the
divisor is in S (a register or a memory location). When the process is over, the MQ contains the quotient
and the AC contains the remainder. The original divisor in S is unchanged. The basic algorithm is
1. If S = 0 or S ≤ AC, set error flag and abort.
2. Shift AC and MQ to the left, creating a zero in MQ0.
3. If S ≤ AC, then AC ← AC − S and MQ0 ← 1.
4. Repeat steps 2 and 3 n times.
As an example, we assume 4-bit registers and divide the 8-bit integer 66 = 0100 0010₂ by 5 = 0101₂ to
obtain a quotient of 13 = 1101₂ and a remainder of 1. Table 9.14 lists the individual steps. Note that the
process stops when steps 2 and 3 of the above algorithm have been repeated four times.
AC    MQ    S     Description
0100  0010  0101  Initial. S ≠ 0 and S > AC, so proceed.
1000  0100        1st shift. S < AC, so subtract.
0011  0101        Set MQ0 ← 1.
0110  1010        2nd shift. S < AC, so subtract.
0001  1011        Set MQ0 ← 1.
0011  0110        3rd shift. S > AC.
0110  1100        4th shift. S < AC, so subtract.
0001  1101        Set MQ0 ← 1.
Table 9.14: Dividing two integers
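The four steps of the basic algorithm translate almost directly into code. The following Python sketch uses our own function name (divide) and models AC, MQ, and S as ordinary integers; it assumes an unsigned 2n-bit dividend and lets the AC temporarily hold one extra bit after a shift, a detail that a hardware design has to handle explicitly.

```python
def divide(dividend, divisor, n):
    """Divide a 2n-bit unsigned dividend by an n-bit divisor; return (quotient, remainder)."""
    mask = (1 << n) - 1
    ac, mq, s = dividend >> n, dividend & mask, divisor   # AC holds the high half, MQ the low half
    if s == 0 or s <= ac:                                  # step 1
        raise OverflowError("divisor is zero or the quotient does not fit in n bits")
    for _ in range(n):                                     # step 4: repeat n times
        ac = (ac << 1) | (mq >> (n - 1))                   # step 2: shift the AC/MQ pair left,
        mq = (mq << 1) & mask                              #         leaving a zero in MQ0
        if s <= ac:                                        # step 3
            ac -= s
            mq |= 1                                        # MQ0 <- 1
    return mq, ac
```

The call divide(66, 5, 4) goes through the same register values as Table 9.14 and returns (13, 1).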
We would like to use our design of double accumulator and MQ registers, and this design has no gates
to compare AC to S, so we change our basic method and replace the comparisons by subtractions. The
algorithm is modified to become
0. If S = 0, abort.
1. AC ← AC − S
2. If AC ≥ 0, abort.
3. AC ← AC + S. Restore, cancel step 1.
4. Left shift AC and MQ, creating a zero in MQ0.
5. AC ← AC − S.
6. If AC ≥ 0, set MQ0 ← 1, else restore AC ← AC + S.
Step 0 can be omitted because we assume an unsigned dividend. If step 0 is omitted and S is zero, then
step 1 would be AC ← AC − 0, and in step 2 the AC would be nonnegative and the algorithm would abort.
In our design for the accumulator, subtraction is done by the two signals AL ← AU + S̄ and UP.
Therefore, steps 1 and 5 in the algorithm above become AL ← AU + S̄ and steps 3 and 6 (AC ← AC + S)
can be eliminated. There is no need to restore, since the original AU is not affected. If the subtraction is
to become permanent (i.e., if the “then” part of step 6 is executed), then an UP shift would set AU to the
result of the subtraction. When these changes are incorporated into the algorithm above, we realize that S
is no longer added to anything, just subtracted from the AC. We therefore prepare the 2’s complement of S
in a temporary register T. Thus, the final version of our algorithm is
1. AL ← S (2’s complement).
2. T ← AL.
3. AL ← AU + T. (replaces AC ← AC − S).
4. If ALn = 0, abort (replaces previous step 2).
5. Shift AC and MQ down (replaces previous step 4).
6. Shift AC and MQ left and up (also replaces previous step 4).
7. AL ← AU + T (replaces previous step 5).
8. If ALn = 0, shift AC up and set MQ0 ← 1 (replaces previous step 6).
9. Repeat steps 5 through 8 n times.
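The idea of never restoring, because the trial difference is formed in AL while AU keeps its old value, can be imitated in software. The Python sketch below is a loose analogy under stated assumptions: the adder is taken to be n + 1 bits wide so that its leftmost bit can play the role of ALn, the two shift steps 5 and 6 are modeled as a single left shift of the AU/MQ pair, and the function name is ours.

```python
def divide_with_complement(dividend, divisor, n):
    """Return (quotient, remainder); the only arithmetic used is the addition of T = -S."""
    mask = (1 << n) - 1
    wide = (1 << (n + 1)) - 1              # assumed adder width: n bits plus a sign bit
    au, mq = dividend >> n, dividend & mask
    t = -divisor & wide                    # steps 1-2: T <- 2's complement of S
    al = (au + t) & wide                   # step 3: trial subtraction, AU is not modified
    if not (al >> n) & 1:                  # step 4: sign bit 0 means AU >= S (or S = 0)
        raise OverflowError("abort")
    for _ in range(n):                     # step 9: repeat n times
        au = (au << 1) | (mq >> (n - 1))   # steps 5-6: shift the AU/MQ pair left
        mq = (mq << 1) & mask
        al = (au + t) & wide               # step 7: trial subtraction
        if not (al >> n) & 1:              # step 8: difference is nonnegative, keep it
            au = al
            mq |= 1                        # MQ0 <- 1
    return mq, au
```

Note that a zero divisor needs no separate test: T is then zero, the first trial difference equals AU, its sign bit is 0, and the sketch aborts, as argued above.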
Based on this algorithm, an integer division circuit can be designed with the inputs and outputs shown
in Figure 9.15a. Steps 1–4 are executed once. Steps 5–8 are executed n times each, and one more step is
needed, to send a “done” signal at the end. The total number of steps needed for the division is therefore
4 + 4n + 1 = 4n + 5, an odd number. Instead of constructing a counter modulo an odd number, we use an
L-stage counter, where L is determined by 2^(L−1) < 4n + 5 < 2^L. An AND gate is connected to the outputs
of those stages that go high after 4n + 5 counts, and the output of this gate becomes the “done” signal. As
an example, consider n = 4. The value of 4n + 5 is 21, implying L = 5. After 21 counts, the 5-stage counter
outputs 20, or 10100₂, so the AND gate should be connected as shown in Figure 9.15b. Table 9.16 is the
truth table of the general integer division circuit.
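The choice of L and of the stages that feed the AND gate is a small calculation that can be checked in Python. The function name done_detector is ours, and the sketch assumes, as in the example, that the counter starts at zero.

```python
def done_detector(n):
    """Return (L, stages), where stages lists the counter stages wired to the AND gate."""
    steps = 4 * n + 5                      # total number of steps; always odd
    stages_needed = steps.bit_length()     # the L that satisfies 2^(L-1) < steps < 2^L
    final_count = steps - 1                # the counter starts at 0
    high = [k for k in range(stages_needed, 0, -1)
            if (final_count >> (k - 1)) & 1]   # stage C_k is high in the final count
    return stages_needed, high
```

For n = 4 the result is (5, [5, 3]): a 5-stage counter whose stages C5 and C3 are connected to the AND gate, in agreement with the pattern 10100.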
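We already know how the “done” signal is generated by the division circuit. An examination of Table 9.16
tells how the remaining eight output signals depend on the counter outputs C_i. We first define the auxiliary
quantity X = C_3 + C_4 + · · · + C_L. The eight output signals are:
\begin{align*}
\text{COMPL} &= \bar{C}_1 \bar{C}_2 \cdots \bar{C}_L = \bar{C}_1 \bar{C}_2\, \overline{(C_3 + \cdots + C_L)} = \bar{C}_1 \bar{C}_2 \bar{X},\\
\text{TAL} &= C_1 \bar{C}_2 \cdots \bar{C}_L = C_1 \bar{C}_2 \bar{X},\\
\text{ADD} &= \bar{C}_1 C_2 \text{ and any } C_3 \ldots C_L = \bar{C}_1 C_2,\\
\text{abort} &= C_1 C_2 \bar{C}_3 \cdots \bar{C}_L\, \overline{AL_n} = C_1 C_2 \bar{X}\, \overline{AL_n},\\
\text{SD} &= \bar{C}_1 \bar{C}_2 (C_3 + C_4 + \cdots + C_L)\, \overline{\text{done}} = \bar{C}_1 \bar{C}_2 X\, \overline{\text{done}},\\
\text{SL} &= C_1 \bar{C}_2 (C_3 + \cdots + C_L) = C_1 \bar{C}_2 X,\\
\text{SU},\ \text{MU01} &= C_1 C_2\, \overline{AL_n}.
\end{align*}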
The last point to be mentioned is how the counter is stopped. This is shown in Figure 9.15c. The
counter is clocked by the clock pulse φ, and is stopped when either “done” or “abort” is generated by the
division circuit. These signals are sent back to the control unit, to indicate the state of the division, but
they are also used to stop the counter when the division is complete.
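[Figure 9.15: (a) The integer division circuit. Inputs: ALn, “start division,” and the clock φ. Outputs to the
ALU: COMPL (AL ← 2’s complement of S), TAL (T ← AL), ADD, SD, SL, SU, and MU01 (MU0 ← 1). Outputs to
the control unit: abort and done. (b) The AND gate that generates “done” for n = 4, connected to the counter
stages that are high in C5 C4 C3 C2 C1 = 1 0 1 0 0. (c) The counter, clocked by φ, reset by “start division,”
and stopped by done or abort.]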
9.5 Shifts
The shift operation, with its various modes and applications, is discussed in Section 2.13.3. The ALU may
have one or more shift registers in order to implement the different types of shifts. This section shows two
designs for general shift registers, one using D flipflops and the other using JK flipflops. A shift register
may be unidirectional or bidirectional. It may have serial or parallel inputs (or both). In a shift register
with serial input, the bits to be shifted are input one by one. Each time a bit is shifted into one end, the
register shifts its content, which moves one bit out of the other end. Such a register should be cleared before
it is used. In contrast, parallel input is fed into the shift register in one step, following which the register is
shifted by the desired amount. A shift register may also have serial or parallel outputs and it may also have
both.
Figure 9.17a shows a serial-serial (i.e., serial input and serial output) shift register based on D flipflops.
This register shifts because the output Q of a D flipflop is set to the input D on a clock pulse (in the diagram,
they are set on the trailing edge of the clock pulse). The register has serial output, but parallel output can
be obtained “for free” just by connecting lines to the various Q outputs. The figure also shows how parallel
input can be added to this shift register by using flipflops with a “preset” input.
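[Figure 9.17: (a) A serial-serial shift register based on D flipflops, with parallel output and parallel input.
(b) A shift register based on JK flipflops, with serial input and serial output, S and R inputs gated by an
“enable parallel input” line, and shift and reset lines.]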
Figure 9.17b illustrates a similar shift register based on JK flipflops connected in a master-slave
configuration. These flipflops have synchronous JK inputs and asynchronous SR inputs. The former perform the
shifts and the latter are used for the parallel input.
These diagrams also suggest that a typical shift register can easily be extended to perform circular shifts
by feeding the serial output into the serial input of the register.
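The behavior of such a register (as opposed to its flipflop-level design) can be modeled with a short Python class. The class and method names are ours, and the model assumes one shift per call, with the bit entering at index 0 and leaving from the last position.

```python
class ShiftRegister:
    """Behavioral model of a shift register with serial and parallel input and output."""

    def __init__(self, size):
        self.bits = [0] * size          # cleared before use; bits[0] is the input end

    def load(self, values):
        """Parallel input: set all stages in one step."""
        if len(values) != len(self.bits):
            raise ValueError("wrong number of bits")
        self.bits = list(values)

    def shift(self, serial_in=0):
        """One clock pulse: a bit enters at one end and a bit leaves at the other."""
        serial_out = self.bits[-1]
        self.bits = [serial_in] + self.bits[:-1]
        return serial_out

    def rotate(self):
        """Circular shift: the serial output is fed back into the serial input."""
        return self.shift(self.bits[-1])
```

The parallel output is simply the list bits, which corresponds to connecting lines to all the Q outputs.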
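[Figure 9.18: (a) A serial comparator fed from registers A and B, with outputs A<B, A=B, and A>B. (b) A
parallel comparator for 3-bit integers, with inputs A2, B2, A1, B1, A0, B0 and outputs A>B, A=B, and A<B.
(c) A two-stage comparator for 9-bit integers, with outputs A=B, A<B, and A>B.]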
9.6 Comparisons
Section 2.11 discusses integer comparisons. Two integers can be compared either by subtracting them and
checking the sign of the difference (but subtraction can generate overflow) or by comparing pairs of bits from
left to right, looking for the first pair that differs. The result of a comparison is one of the three relations
“>”, “=”, and “<”. This section describes simple designs for a serial comparator and a parallel comparator.
Figure 9.18a shows a serial comparator. The two numbers to be compared are first moved to the
temporary registers A and B. These registers are rotated to the left bit by bit, and pairs of bits are sent to
the comparator from the left ends of A and B. Eventually, one of the three output lines goes high. This line
should set the correct status flags and also stop the comparison.
A parallel comparator is more complex, since it requires enough gates to compare all bit pairs in parallel.
It also uses high fan-in gates, which makes it impractical for large numbers. Figure 9.18b shows a parallel
comparator for 3-bit integers. The design is straightforward. Comparing larger numbers can be done in
two stages. The first stage uses several parallel comparators, each comparing 3-4 consecutive bit pairs. The
second stage uses the results of the first stage to decide on the relation between the numbers. Figure 9.18c
shows such a comparator for 9-bit integers. The first stage consists of three 3-bit comparators (like the one
shown in Figure 9.18b) and the second stage has gates with fan-in of up to three.
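Both comparison schemes can be mimicked in Python. In the sketch below (our names, unsigned operands assumed), compare_serial scans bit pairs from the left, and compare_two_stage splits the operands into 3-bit groups, compares each group, and lets the leftmost group that differs decide. In hardware the first-stage comparisons are done in parallel; here they are simply computed one after the other.

```python
def compare_serial(a, b, nbits):
    """Compare two unsigned nbits-wide integers bit by bit, starting at the left end."""
    for i in range(nbits - 1, -1, -1):
        bit_a, bit_b = (a >> i) & 1, (b >> i) & 1
        if bit_a != bit_b:                  # the first pair that differs decides
            return ">" if bit_a else "<"
    return "="


def compare_two_stage(a, b, nbits=9, group=3):
    """Two-stage comparison; nbits is assumed to be a multiple of group."""
    mask = (1 << group) - 1
    # First stage: one small comparator per group of bits, leftmost group first.
    results = [compare_serial((a >> shift) & mask, (b >> shift) & mask, group)
               for shift in range(nbits - group, -1, -group)]
    # Second stage: the leftmost group that is not "=" determines the relation.
    for relation in results:
        if relation != "=":
            return relation
    return "="
```

For example, compare_two_stage(300, 301) returns "<" because the two leftmost groups are equal and the rightmost groups are 100 and 101.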
9.6.1 Summary
This short chapter discusses a few basic ALU operations. In addition to those operations, the ALU of a
modern computer has circuits to perform floating-point arithmetic, BCD arithmetic, certain logical oper-
ations, and several types of shifts. The floating-point arithmetic operations are described in Section 2.21.
They are by far the most complex operations implemented in the ALU. In fact, until the 1990s, most ALUs
did not include circuits for the floating-point operations and a computer user had to either obtain a special
coprocessor (a so-called “math chip”) for these operations or implement them in software.
This means that each source instruction is translated into exactly one target instruction.
This definition has the advantage of clearly describing the translation process of an assembler. It is not a
precise definition, however, because an assembler can do (and usually does) much more than just translation.
It offers a lot of help to the programmer in many aspects of writing the program. The many types of help
offered by the assembler are grouped under the general term directives (or pseudo-instructions).
Another good definition of assemblers is:
An assembler is a translator that translates a machine-oriented
language into machine language.
This definition distinguishes between assemblers and compilers, the latter being translators of problem-oriented
languages or of machine-independent languages. This definition, however, says nothing about the
one-to-one nature of the translation, and thus ignores a most important operating feature of an assembler.
One reason for studying assemblers is that the operation of an assembler reflects the architecture of
the computer. The assembler language depends heavily on the internal organization of the computer.
Architectural features such as memory word size, number formats, internal character codes, index registers,
and general-purpose registers affect the way assembler instructions are written and the way the assembler
handles instructions and directives. This fact explains why there is an interest in assemblers today and why
a course on assembler language is still required for many, perhaps even most, computer science degrees.
The first assemblers were simple assemble-go systems. All they could do was to assemble code directly
in memory and start execution. It was quickly realized, however, that linking is an important feature,
required even by simple programs. The pioneers of programming developed the concept of the routine
library very early, and they needed assemblers that could locate library routines, load them into memory,
and link them to the main program. It is from this task of locating, loading, and linking—of assembling a
single working program from individual pieces—that the name assembler originated. Today, assemblers are
translators and they work on one program at a time. The tasks of locating, loading, and linking (as well as
many other tasks) are performed by a loader.
A modern assembler has two inputs and two outputs. The first input is short, typically a single line
typed at a keyboard. It activates the assembler and specifies the name of a source file (the file containing
the source code to be assembled). It may contain other information that the assembler should have before
it starts. This includes commands and specifications such as:
The names of the object file and listing file.
Display (or do not display) the listing on the screen while it is being generated.
Display all error messages but do not stop for any error.
Save the listing file and do not print it (see below).
This program does not use macros.
The symbol table is larger (or smaller) than usual and needs a certain amount of memory.
All these terms are explained elsewhere. An example is the command line that invokes MACRO, the
VAX assembler. The line
MACRO /SHOW=MEB /LIST /DEBUG ABC
activates the assembler, tells it that the source program name is abc.mar (the .mar extension is implied),
that binary lines in macro expansions should be listed (shown), that a listing file should be created, and that
the debugger should be included in the assembly.
Another typical example is the following command line that invokes the Microsoft Macro assembler
(MASM) for the 80x86 microprocessors.
MASM /d /Dopt=5 /MU /V
It tells the assembler to create a pass 1 listing (/d), to create a variable opt and set its value to 5 (/Dopt=5),
to convert all letters read from the source file to upper case (/MU), and to include certain information in the
listing file (the /V, or verbose, option).
The second input is the source file. It includes the symbolic instructions and directives. The assembler
translates each symbolic instruction into one machine instruction. The directives, however, are not translated.
The directives are our way of asking the assembler for help. The assembler provides the help by executing
(rather than translating) the directives. A modern assembler can support as many as a hundred directives.
They range from ORG, which is very simple to execute, to MACRO, which can be very complex.
The first and most important output of the assembler is the object file. It contains the assembled
instructions (the machine language program) to be loaded later into memory and executed. The object file
is an important component of the assembler-loader system. It makes it possible to assemble a program once,
and later load and run it often. It also provides a natural place for the assembler to leave information for the
loader, instructing the loader in several aspects of loading the program. This information is called loader
directives. Note, however, that the object file is optional. The user may specify no object file, in which case
the assembler generates only a listing.
The second output of the assembler is the listing file. For each line in the source file, a line is created
in the listing file, containing:
The Location Counter (page 275).
The source line itself.
The machine instruction (if the source line is an instruction), or some other relevant information (if
the source line is a directive).
The listing file is generated by the assembler, sent to the printer, printed, and then discarded.
The user, however, can specify either not to generate a listing file or not to print it. There are also directives
that control the listing. They can be used to suppress parts of the listing, to print page headers, or to control
the printing of macro expansions.
The cross-reference information is normally a part of the listing file, although the MASM assembler
creates it in a separate file and uses a special utility to print it. The cross-reference is a list of all symbols
used in the program. For each symbol, the point where it is defined and all the places where it is used are
listed.
Exercise 10.1: Why would anyone want to suppress the listing file or not to print it?
As mentioned earlier, the first assemblers were assemble-go type systems. They did not generate any
object file. Their main output was machine instructions loaded directly into memory. Their secondary
output was a listing. Such assemblers are also in use today (for reasons explained in Section 10.6) and are
called one-pass assemblers. In principle, a one-pass assembler can produce an object file, but such a file
would be absolute and its use is limited.
Most current assemblers are of the two-pass variety. They generate an object file that is relocatable and
can be linked and loaded by a loader.
A loader, as the name implies, is a program that loads programs into memory. Modern loaders, however,
do much more than that. Their main tasks are loading, relocating, linking and starting the program. In
a typical run, a modern linking-loader can read several object files, load them one by one into memory,
relocating each as it is being loaded, link all the separate object files into one executable module, and start
execution at the right point. Using such a loader has several advantages (Section 10.7), the most important
being the ability to write and assemble a program in several separate parts.
Writing a large program in several parts is advantageous, for reasons that will be briefly mentioned
but not fully discussed here. The individual parts can be written by different programmers (or teams of
programmers), each concentrating on his own part. The different parts can be written in different languages.
It is common to write the main program in a higher-level language and the procedures in assembler language.
The individual parts are assembled (or compiled) separately, and separate object files are produced. The
assembler or compiler can only see one part at a time and does not see the whole picture. It is only the loader
that loads the separate parts and combines them into a single program. Thus when a program is assembled,
the assembler does not know whether this is a complete program or just a part of a larger program. It
therefore assumes that the program will start at address zero and assembles it based on that assumption.
Before the loader loads the program, it determines its true start address, based on the memory areas available
at that moment and on the previously loaded object files. The loader then loads the program, making sure
that all instructions fit properly in their memory locations. This process involves adjusting memory addresses
in the program, and is called relocation.
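Relocation can be illustrated by a very small Python sketch. The object-file representation used here (a list of words, each accompanied by a flag that says whether the word holds an address) is a hypothetical simplification of our own, not the format of any real assembler or loader.

```python
def relocate(words, is_address, start_address):
    """Adjust the address words of a program that was assembled to start at address zero."""
    return [word + start_address if flag else word
            for word, flag in zip(words, is_address)]


# A three-word program assembled at address 0; only the second word holds an address.
program = [5, 2, 7]
flags = [False, True, False]
loaded = relocate(program, flags, 100)   # the loader decides to start the program at 100
```

After the call, the address 2 has become 102, while the two words that do not hold addresses are unchanged.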
Since the assembler works on one program at a time, it cannot link individual programs. When it
assembles a source file containing a main program, the assembler knows nothing about the existence of any
other source files containing, perhaps, procedures called by the main program. As a result, the assembler
may not be able to properly assemble a procedure call instruction (to an external procedure) in the main
program. The object file of the main program will, in such a case, have missing parts (holes or gaps) that
the assembler cannot fill. The loader has access to all the object files that make up the entire program. It
can see the whole picture, and one of its tasks is to fill up any missing parts in the object files. This task is
called linking.
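The filling of holes by the loader can also be sketched in Python. The module layout below is hypothetical and chosen only for illustration: each object module is a dictionary with its code (a list of words), the symbols it defines (exports, as offsets from its start), and its holes (offset and name of each missing external address).

```python
def link(modules):
    """Load the modules one after the other and fill the holes left by the assembler."""
    symbols = {}
    bases = []
    base = 0
    for module in modules:                       # first scan: decide where each module goes
        bases.append(base)
        for name, offset in module["exports"].items():
            symbols[name] = base + offset        # true address of each external symbol
        base += len(module["code"])
    image = []
    for module in modules:                       # second scan: fill the holes
        code = list(module["code"])
        for offset, name in module["holes"]:
            code[offset] = symbols[name]
        image.extend(code)
    return image


main = {"code": [10, None, 99], "exports": {}, "holes": [(1, "SQRT")]}
library = {"code": [7, 8], "exports": {"SQRT": 0}, "holes": []}
image = link([main, library])
```

Here the hole in the main program is filled with 3, the address at which the routine SQRT ends up once the two modules are placed one after the other. A real loader would also relocate the address words inside each module, a step omitted here to keep the sketch short.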
The task of preparing a source program for execution includes translation (assembling or compiling),
loading, relocating, and linking. It is divided between the assembler (or compiler) and the loader, and
dual assembler-loader systems are very common. The main exception to this arrangement is interpretation.
Interpretive languages such as BASIC or APL use the services of one program, the interpreter, for their
execution, and do not require an assembler or a loader. It should be clear from the earlier discussion that
the main reason for keeping the assembler and loader separate is the need to develop programs (especially
large ones) in separate parts. The detailed reasons for this will not be discussed here. We will, however, point
out the advantages of having a dual assembler-loader system. They are listed below, in order of importance.
It makes it possible to write programs in separate parts that may also be in different languages.
It keeps the assembler small. This is an important advantage. The size of the assembler depends on
the size of its internal tables (especially the symbol table and the macro definition table). An assembler
designed to assemble large programs is large because of its large tables. Separate assembly makes it possible
to assemble very large programs with a small assembler.
When a change is made in the source code, only the modified program needs to be reassembled. This
property is a benefit if one assumes that assembly is slow and loading is fast. Often, however, loading is
slower than assembling, and this property is just a feature, not an advantage, of a dual assembler-loader
system.
The loader automatically loads routines from a library. This is considered by some an advantage of
a dual assembler-loader system but, actually, it is not. It could easily be done in a single assembler-loader
program. In such a program, the library would have to contain the source code of the routines, but this is
typically not larger than the object code.
10.2 A Short History of Assemblers
One of the first stored program computers was the EDSAC (Electronic Delay Storage Automatic Calculator)
developed at Cambridge University in 1949 by Maurice Wilkes and W. Renwick. From its very first days the
EDSAC had an assembler, called Initial Orders. It was implemented in a read-only memory formed from
a set of rotary telephone selectors, and it accepted symbolic instructions. Each instruction consisted of a
one letter mnemonic, a decimal address, and a third field that was a letter. The third field caused one of 12
constants preset by the programmer to be added to the address at assembly time.
It is interesting to note that Wilkes was also the first to propose the use of labels (which he called
floating addresses), the first to use an early form of macros (which he called synthetic orders), and the first
to develop a subroutine library.
The IBM 650 computer was delivered around 1953 and had an assembler very similar to present day
assemblers. SOAP (Symbolic Optimizer and Assembly Program) did symbolic assembly in the conventional
way, and was perhaps the first assembler to do so. However, its main feature was the optimized calculation
of the address of the next instruction. The IBM 650 (a decimal computer, incidentally) was based on a
magnetic drum memory and the program was stored in that memory. Each instruction had to be fetched
from the drum and had to contain the address of its successor. For maximum speed, an instruction had
to be placed on the drum in a location that would be under the read head as soon as its predecessor was
completed. SOAP calculated those addresses, based on the execution times of the individual instructions.
One of the first commercially successful computers was the IBM 704. It had features such as floating-
point hardware and index registers. It was first delivered in 1956 and its first assembler, the UASAP-1,
was written in the same year by Roy Nutt of United Aircraft Corp. (hence the name UASAP—United
Aircraft Symbolic Assembly Program). It was a simple binary assembler, did practically nothing but one-
to-one translation, and left the programmer in complete control over the program. SHARE, the IBM users’
organization, adopted a later version of that assembler and distributed it to its members together with
routines produced and contributed by members. UASAP pointed the way for early assembler writers,
and many of its design principles are used by assemblers to this day. The UASAP was later modified to
support macros.
In the same year another assembler, the IBM Autocoder, was developed by R. Goldfinger for use on the
IBM 702/705 computers. This assembler (actually several different Autocoder assemblers) was apparently
the first to use macros. The Autocoder assemblers were used extensively and were eventually developed into
large systems with large macro libraries used by many installations.
Another pioneering early assembler was the UNISAP, for the UNIVAC I & II computers, developed in
1958 by M. E. Conway. It was a one-and-a-half pass assembler, and was the first one to use local labels
(Section 10.9).
By the late fifties, IBM had released the 7000 series of computers. These came with a macro assembler,
SCAT, that had all the features of modern assemblers. It had many directives (pseudo instructions in the
IBM terminology), an extensive macro facility, and it generated relocatable object files.
The SCAT assembler (Symbolic Coder And Translator) was originally written for the IBM 709 and
was modified to work on the IBM 7090. The GAS (Generalized Assembly System) assembler was another
powerful 7090 assembler.
[Figure: Phases in the historical development of assemblers and loaders. Labels in the diagram: Machine
Language; Assembler Language; Absolute Assembler; Directives; Relocation Bits; External Routines;
Relocatable Assembler and Loader; Conditional Assembly; Full-Feature, Relocatable Macro Assembler, with
Conditional Assembly.]
The idea of macros originated with several people. M. D. McIlroy was probably the first to propose
the modern form of macros and the idea of conditional assembly. He implemented these ideas in the GAS
assembler mentioned earlier.
One of the first full-feature loaders, the linking loader for the IBM 704–709–7090 computers, is an
example of an early loader supporting both relocation and linking.
The earliest discussion of meta-assemblers seems to be by Ferguson. The idea of high-level assemblers
originated with Wirth and was extended, a few years later, by an anonymous software designer at NCR,
who proposed the main ideas of the NEAT/3 language.
The diagram summarizes the main phases in the historical development of assemblers and loaders.
A One-pass Assembler: One that performs all its functions by reading the source file once.
A Resident Assembler: One that is permanently loaded in memory. Typically such an assembler
resides in ROM, is very simple (supports only a few directives and no macros), and is a one-pass assembler.
The above assemblers are described below.
A Cross-Assembler: An assembler that runs on one computer and assembles programs for another.
Many cross-assemblers are written in a higher-level language to make them portable. They run on a large
machine and produce object code for a small machine.
A Disassembler: This, in a sense, is the opposite of an assembler. It translates machine code into a
source program in assembler language.
A High-Level Assembler: This is a translator for a language combining the features of a higher-level
language with some features of assembler language. Such a language can also be considered a
machine-dependent higher-level language.
A Bootstrap Loader: It uses its first few instructions to either load the rest of itself, or load another
loader, into memory. It is typically stored in ROM.
An Absolute Loader: Can only load absolute object files, i.e., can only load a program starting from
a certain, fixed location in memory.
A Relocating Loader: Can load relocatable object files and thus can load the same program starting
at any location.
A Linking Loader: Can link programs that were assembled separately, and load them as a single
module.
A Linkage Editor: Links programs and does some relocation. Produces a load module that can later
be loaded by a simple relocating loader.
The comment must be separated from the operand by at least one blank. If there is no operand, the
comment may not start before column 17.
The comment extends through column 80 but columns 73–80 are normally used for sequencing and
identification.
It is obviously very hard to enter such source lines from a keyboard. Modern assemblers are thus more
flexible and do not require any special format. If a label exists, it must end with a ‘:’. Otherwise, the
individual fields should be separated by at least one space (or by a tab character), and subfields should be
separated by either a comma or parentheses. This rule makes it convenient to enter source lines from a
keyboard, but is ambiguous in the case of a source line that has a comment but no operand.
example: EI ;ENABLE ALL INTERRUPTS
The semicolon guarantees that the word ENABLE will not be considered an operand by the assembler.
This is why many assemblers require that comments start with a semicolon.
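The free-format rules just described (a label ends with a colon, fields are separated by blanks, subfields by commas, and a comment starts with a semicolon) can be sketched as a small Python function. The function name and the returned tuple are our own choices, and the sketch ignores complications such as a semicolon inside a character constant or subfields in parentheses.

```python
def parse_line(line):
    """Split a free-format source line into (label, mnemonic, operands, comment)."""
    code, _, comment = line.partition(";")      # everything after ';' is a comment
    fields = code.split()                       # fields are separated by spaces or tabs
    label = None
    if fields and fields[0].endswith(":"):      # a label must end with ':'
        label = fields.pop(0)[:-1]
    mnemonic = fields[0] if fields else None
    operands = fields[1].split(",") if len(fields) > 1 else []
    return label, mnemonic, operands, comment.strip()
```

The line 'EI ;ENABLE ALL INTERRUPTS' is parsed into the mnemonic EI with no operands. Without the semicolon, the same function would report ENABLE as an operand, which is exactly the ambiguity mentioned above.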
Exercise 10.2: Why a semicolon and not some other character such as ‘$’ or ‘@’ ?
Many modern assemblers allow labels without an identifying ‘:’. They simply have to work harder in
order to identify labels.
The instruction sets of some computers are designed such that the mnemonic specifies more than just
the operation. It may also contain part of the operand. The Signetics 2650 microprocessor, for example,
has many mnemonics that include one of the operands. A ‘Store Relative’ instruction on the 2650 may be
written ‘STRR,R0 SAV’; the mnemonic field includes R0 (the register to be stored in location SAV), which is
an operand.
On other computers, the operation may partly be specified in the operand field. The instruction
‘IX7 X2+X5’, on the CDC Cyber computers means: “add register X2 and register X5 as integers, and store
the sum in register X7.” The operation appears partly in the operation field (‘I’) and partly in the operand
field (‘+’), whereas X7 (an operand) appears in the mnemonic. This makes it harder for the assembler to
identify the operation and the operands and, as a result, such instruction formats are not common.
Exercise 10.3: What is the meaning of the Cyber instruction FX7 X2+X5?
To translate an instruction, the assembler uses the opcode table, which is a static data structure. The
two important columns in the table are the mnemonic and opcode. Table 10.1 is an example of a simple
opcode table. It is part of the IBM 360 opcode table and it includes other information.
Exercise 10.4: Why does the IBM 360 have 16 general purpose registers and not a round number such as
15 or 20?
Example: The instruction AR 4,6 means: add register 6 (the source) to register 4 (the destination
operand). It is assembled as the 16-bit machine instruction 1A46, in which 1A is the opcode and 46 holds the two
operands.
Type RX stands for Register-indeX. In these instructions the operand consists of a register followed by
an address.
Example: BAL 5,14. This instruction calls a procedure at location 14, and saves the return address in
register 5 (BAL stands for Branch And Link). It is assembled as the 32-bit machine instruction 4550000E in
which 00E is a 12-bit address field (hexadecimal E is decimal 14), 45 is the opcode, 5 is register 5, and the two zeros
in the middle are irrelevant to our discussion. (A note to readers familiar with the IBM 360: this example
ignores base registers as they do not contribute anything to our discussion of assemblers.)
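The two encodings above can be reproduced with a few lines of Python. This is only a sketch under simplifying assumptions: the function names are invented for illustration, the opcodes 1A and 45 are the ones quoted in the examples, the two middle digits of the RX format are simply left at zero (as in the text), and an address that does not fit the 12-bit field is rejected rather than handled.

```python
def encode_rr(opcode, r1, r2):
    """RR format: an 8-bit opcode followed by two 4-bit register numbers."""
    return f"{opcode:02X}{r1:X}{r2:X}"


def encode_rx(opcode, r1, address):
    """Simplified RX format: opcode, register, two zero digits, 12-bit address."""
    if not 0 <= address < 2**12:
        raise ValueError("address does not fit in the 12-bit field")
    return f"{opcode:02X}{r1:X}00{address:03X}"


print(encode_rr(0x1A, 4, 6))     # AR 4,6   -> 1A46
print(encode_rx(0x45, 5, 14))    # BAL 5,14 -> 4550000E
```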
Exercise 10.5: What are the two zeros in the middle of the instruction used for?
This example is atypical. Numeric addresses are rarely used in assembler programming, since keeping
track of their values is a tedious task better left to the assembler. In practice, symbols are used instead of
numeric addresses. Thus the above example is likely to be written as BAL 5,XYZ, where XYZ is a symbol
whose value is an address. Symbol XYZ should be the label of some source line. Typically the program will
contain the two lines
XYZ A 5,ABC ;THE SUBROUTINE STARTS HERE
.
.
BAL 5,XYZ
Besides the basic task of assembling instructions, the assembler offers many services to the user, the
most important of which is handling symbols. This task consists of two different parts, defining symbols,
and using them. A symbol is defined by writing it as a label. The symbol is used by writing it in the
operand field of a source line. A symbol can only be defined once but it can be used any number of times.
To understand how a value is assigned to a symbol, consider the example above. The ‘add’ instruction A
is assembled and is eventually loaded into memory as part of the program. The value of symbol XYZ is the
memory address of that instruction. This means that the assembler has to keep track of the addresses where
instructions are loaded, since some of them will become values of symbols. To do this, the assembler uses
two tools, the location counter (LC), and the symbol table.
The LC is a variable, maintained by the assembler, that contains the address into which the current
instruction will eventually be loaded. When the assembler starts, it clears the LC, assuming that the first
instruction will go into location 0. After each instruction is assembled, the assembler increments the LC by
the size of the instruction (the size in words, not in bits). Thus the LC always contains the current address.
Note that the assembler does not load the instructions into memory. It writes them on the object file, to be
eventually loaded into memory by the loader. The LC, therefore, does not point to the current instruction.
It just shows where the instruction will eventually be loaded. When the source line has a label (a newly
defined symbol), the label is assigned the current value of the LC as its value. Both the label and its value
(plus some other information) are then placed in the symbol table.
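The bookkeeping just described can be sketched as follows. The sketch assumes that each source line has already been reduced to a pair (label, size in words); the function name and the sample program are invented, and the types REL and MTDF are borrowed from elsewhere in this chapter.

```python
def define_labels(lines):
    """Walk the source once, maintaining the LC and recording label values.

    `lines` is a list of (label, size_in_words) pairs; label is None when
    the source line has no label.
    """
    lc = 0                          # the assembler clears the LC at the start
    symbol_table = {}
    for label, size in lines:
        if label is not None:
            if label in symbol_table:
                # a symbol may be defined only once; flag the duplicate
                symbol_table[label] = (symbol_table[label][0], "MTDF")
            else:
                symbol_table[label] = (lc, "REL")
        lc += size                  # advance by the instruction size in words
    return symbol_table, lc


table, final_lc = define_labels([(None, 2), ("XYZ", 1), (None, 1), ("END", 1)])
print(table)       # {'XYZ': (2, 'REL'), 'END': (4, 'REL')}
print(final_lc)    # 5
```

Note that nothing is loaded anywhere; the LC merely records where each instruction will eventually go.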
The symbol table is an internal, dynamic table that is generated, maintained, and used by the assembler.
Each entry in the table contains the definition of a symbol and has fields for the name, value, and type of the
symbol. Some symbol tables contain other information about the symbols. The symbol table starts empty,
labels are entered into it as their definitions are found in the source, and the table is also searched frequently
to find the values and types of symbols whose names are known. Various ways to implement symbol tables
are discussed in Section 10.14.
In the above example, when the assembler encounters the line
XYZ A 5,ABC ;THE SUBROUTINE STARTS HERE
it performs two independent operations. It stores symbol XYZ and its value (the current value of the LC)
in the symbol table, and it assembles the instruction. These two operations have nothing to do with each
other. Handling the symbol definition and assembling the instruction are done by two different parts of the
assembler. Often, they are performed in different phases of the assembly.
If the LC happens to have the value 260, then the entry
XYZ 104 REL
will be added to the symbol table (104 is the hex value of decimal 260, and the type REL will be explained
later).
When the assembler encounters the line
BAL 5,XYZ
it assembles the instruction but, in order to assemble the operand, the assembler needs to search the symbol
table, find symbol XYZ, fetch its value and make it part of the assembled instruction. The instruction is,
therefore, assembled as 45500104.
Exercise 10.6: The address in our example, 104, is a relatively small number. Often, instructions have a
12-bit field for the address, allowing addresses up to 2^12 − 1 = 4095. What if the value of a certain symbol
exceeds that number?
This is, in a very general way, what the assembler has to do in order to assemble instructions and handle
symbols. It is a simple process and it involves only one problem, which is illustrated by the following example.
BAL 5,XYZ
.
.
XYZ A 5,ABC
In this case the value of symbol XYZ is needed before label XYZ is defined. When the assembler gets to the
first line (the BAL instruction), it searches the symbol table for XYZ and, of course, does not find it. This
situation is called the future symbol problem or the problem of unresolved references. The XYZ in our example
is a future symbol or an unresolved reference.
Obviously, future symbols are not an error and their use should not be prohibited. The programmer
should be able to refer to source lines which either precede or follow the current line. Thus the future
symbol problem has to be solved. It turns out to be a simple problem and there are two solutions, a one-
pass assembler and a two-pass assembler. They represent not just different solutions to the future symbol
problem but two different approaches to assembler design and operation. The one-pass assembler, as the
name implies, solves the future symbol problem by reading the source file once. Its most important feature,
however, is that it does not generate a relocatable object file but rather loads the object code (the machine
language program) directly into memory. Similarly, the most important feature of the two-pass assembler
is that it generates a relocatable object file, that is later loaded into memory by a loader. It also solves
the future symbol problem by performing two passes over the source file. It should be noted at this point
that a one-pass assembler can generate an object file. Such a file, however, would be absolute, rather than
relocatable, and its use is limited. Absolute and relocatable object files are discussed later in this chapter.
Figure 10.2 is a summary of the most important components and operations of an assembler.
[Block diagram showing the main program together with the pass indicator, the location counter, the source file
and source line buffer, the lexical scan routine, the table search procedures, the error procedure, the object
code assembly area, and the object file.]
Figure 10.2. The main components and operations of an assembler
Exercise 10.7: What if a certain symbol is needed in pass 2, to assemble an instruction, and is not found
in the symbol table?
To assign values to labels in pass 1, the assembler has to maintain the LC. This in turn means that the
assembler has to determine the size of each instruction (in words), even though the instructions themselves
are not assembled.
In many cases it is easy to figure out the size of an instruction. On the IBM 360, the mnemonic
determines the size uniquely. An assembler for this machine keeps the size of each instruction in the opcode
table together with the mnemonic and the opcode (Table 10.1). On the DEC PDP-11 the size is determined
both by the type of the instruction and by the addressing mode(s) that it uses. Most instructions are one
word (16-bits) long. However, if they use either the index or index deferred modes, one more word is added
to the instruction. If the instruction has two operands (source and destination) both using those modes, its
size will be 3 words. On most modern microprocessors, instructions are between 1 and 4 bytes long and the
size is determined by the opcode and the specific operands used.
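As an illustration, the PDP-11 rule quoted above can be written as a one-line computation. The sketch covers only the rule as stated here (one extra word for each operand in the index or index deferred mode); the function name and the mode names are invented for the example.

```python
EXTRA_WORD_MODES = {"index", "index deferred"}


def pdp11_size(operand_modes):
    """Size in 16-bit words, given the addressing mode of each operand."""
    return 1 + sum(1 for mode in operand_modes if mode in EXTRA_WORD_MODES)


print(pdp11_size(["register", "register"]))      # 1
print(pdp11_size(["index", "register"]))         # 2
print(pdp11_size(["index", "index deferred"]))   # 3
```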
This means that, in many cases, the assembler has to work hard in the first pass just to determine the
size of an instruction. It has to look at the mnemonic and, sometimes, at the operands and the modes,
even though it does not assemble the instruction in the first pass. All the information about the mnemonic
and the operand collected by the assembler in the first pass is extremely useful in the second pass, when
instructions are assembled. This is why many assemblers save all the information collected during the first
pass and transmit it to the second pass through an intermediate file. Each record on the intermediate file
contains a copy of a source line plus all the information that has been collected about that line in the first
pass. At the end of the first pass the original source file is closed and is no longer used. The intermediate
file is reopened and is read by the second pass as its input file.
A record in a typical intermediate file contains:
The record type. It can be an instruction, a directive, a comment, or an invalid line.
The LC value for the line.
A pointer to a specific entry in the opcode table or the directive table. The second pass uses this
pointer to locate the information necessary to assemble or execute the line.
A copy of the source line. Notice that a label, if any, is not used by pass 2 but must be included in the
[Flow chart of the two passes. Pass 1: read a line from the source file; if a label is defined, store its name and
value in the symbol table; determine the size of the instruction; LC:=LC+size; at eof, go to pass 2. Pass 2:
assemble each instruction; at eof, stop.]
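The division of labor between the two passes, including the intermediate records, can be sketched as follows. This is a toy model and not a real assembler: the opcode table, its numeric opcodes, and the three-line program are invented, every source line is assumed to be already split into (label, mnemonic, operand), and undefined symbols are not handled.

```python
from dataclasses import dataclass

# Toy opcode table: mnemonic -> (opcode, size in words). Invented values.
OPCODES = {"LOD": (1, 1), "ADD": (2, 1), "JMP": (3, 1)}


@dataclass
class Record:
    kind: str        # record type; only "instruction" is used in this sketch
    lc: int          # LC value for the line
    mnemonic: str    # key into OPCODES (stands in for the table pointer)
    operand: str     # the operand field copied from the source line


def pass1(lines):
    """Build the symbol table and the intermediate records."""
    lc, symtab, records = 0, {}, []
    for label, mnemonic, operand in lines:
        if label:
            symtab[label] = lc
        records.append(Record("instruction", lc, mnemonic, operand))
        lc += OPCODES[mnemonic][1]
    return symtab, records


def pass2(symtab, records):
    """Assemble each record; every label is known by now."""
    return [(r.lc, OPCODES[r.mnemonic][0], symtab[r.operand]) for r in records]


program = [
    (None, "JMP", "XYZ"),     # XYZ is a future symbol at this point
    (None, "LOD", "XYZ"),
    ("XYZ", "ADD", "XYZ"),
]
symtab, records = pass1(program)
print(pass2(symtab, records))    # [(0, 3, 2), (1, 1, 2), (2, 2, 2)]
```

Pass 2 never looks at the original source; it works from the records alone.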
Labels normally have a maximum size (typically 6 or 8 characters), must start with a letter, and may
only consist of letters, digits, and a few other characters. Labels that do not conform to these rules are
invalid labels and are normally considered a fatal error. However, some assemblers will truncate a long label
to the maximum size and will issue just a warning, not an error, in such a case.
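A label check along these lines might look as follows. The maximum size of 6 and the set of extra characters are assumptions made for the example (real assemblers differ), and the sketch shows the lenient behavior, where a long label is truncated with a warning.

```python
import string

MAX_LABEL = 6            # assumed maximum label size
EXTRA_CHARS = "_$."      # assumed set of "a few other characters"
ALLOWED = set(string.ascii_letters + string.digits + EXTRA_CHARS)


def check_label(label):
    """Return (label, warning); raise ValueError for an invalid label."""
    if not label or label[0] not in string.ascii_letters:
        raise ValueError(f"invalid label {label!r}: must start with a letter")
    if any(ch not in ALLOWED for ch in label):
        raise ValueError(f"invalid label {label!r}: illegal character")
    if len(label) > MAX_LABEL:
        # lenient behavior: truncate and warn instead of failing
        return label[:MAX_LABEL], "warning: label truncated"
    return label, None


print(check_label("LOOP"))            # ('LOOP', None)
print(check_label("VERYLONGNAME"))    # ('VERYLO', 'warning: label truncated')
```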
Exercise 10.8: What is the advantage of allowing characters other than letters and digits in a label?
The only problem with symbols in the second pass is bad symbols. These are either multiply-defined or
undefined symbols. When a source line uses a symbol in the operand field, the assembler looks it up in the
symbol table. If the symbol is found but has a type of MTDF, or if the symbol is not found in the symbol
table (i.e., it has not been defined), the assembler responds as follows.
It flags the instruction in the listing file.
It assembles the instruction as far as possible, and writes it on the object file.
It flags the entire object file. The flag instructs the loader not to start execution of the program. The
object file is still generated and the loader will read and load it, but not start it. Loading such a file may be
LC
36 BEQ AB ;BRANCH ON EQUAL
.
.
67 BNE AB ;BRANCH ON NOT EQUAL
.
.
89 JMP AB ;UNCONDITIONALLY
.
.
126 AB anything
Figure 10.4. Example of future symbols
Symbol AB is used three times as a future symbol. On the first reference, when the LC happens to stand
at 36, the assembler searches the symbol table for AB, does not find it, and therefore assumes that it is a
future symbol. It then inserts AB into the symbol table but, since AB has no value yet, it gets a special type.
Its type is U (undefined). Even though it is still undefined, it now occupies an entry in the symbol table, an
entry that will be used to keep track of AB as long as it is a future symbol. The next step is to set the ‘value’
field of that entry to 36 (the current value of the LC). This means that the symbol table entry for AB is now
pointing to the instruction in which AB is needed. The ‘value’ field is an ideal place for the pointer since it
is the right size, it is currently empty, and it is associated with AB. The BEQ instruction itself is only partly
assembled and is stored, incomplete, in memory location 36. The field in the instruction where the value of
AB should be stored (the address field) remains empty.
When the assembler gets to the BNE instruction (at which point the LC stands at 67), it searches the
symbol table for AB, and finds it. However, AB has a type of U, which means that it is a future symbol and
thus its ‘value’ field (=36) is not a value but a pointer. It should be noted that, at this point, a type of
U does not necessarily mean an undefined symbol. While the assembler is performing its single pass, any
undefined symbols must be considered future symbols. Only at the end of the pass can the assembler identify
undefined symbols (see below). The assembler handles the BNE instruction by:
Partly assembling it and storing it in memory location 67.
Copying the pointer 36 from the symbol table to the partly assembled instruction in location 67. The
instruction has an empty field (where the value of AB should have been), where the pointer is now stored.
There may be cases where this field in the instruction is too small to store a pointer. In such a case the
assembler must resort to other methods, one of which is discussed below.
Copying the LC (=67) into the ‘value’ field of the symbol table entry for AB, overwriting the 36.
When the assembler reaches the JMP AB instruction, it repeats the three steps above. The situation at
those three points is summarized below.
          after BEQ (LC=36)      after BNE (LC=67)      after JMP (LC=89)
memory    loc  contents          loc  contents          loc  contents
          36   BEQ  -            36   BEQ  -            36   BEQ  -
                                 67   BNE  36           67   BNE  36
                                                        89   JMP  67
symbol    n    v    t            n    v    t            n    v    t
table     AB   36   U            AB   67   U            AB   89   U
It is obvious that an indefinite number of instructions can refer to AB as a future symbol. The result will
be a linked list linking all these instructions. When the definition of AB is finally found (the LC will be 126
at that point), the assembler searches the symbol table for AB and finds it. The ‘type’ field is still U which
tells the assembler that AB has been used as a future symbol. The assembler then follows the linked list
of instructions using the pointers found in the instructions. It starts from the pointer found in the symbol
table and, for each instruction in the list, the assembler:
Saves the value of the pointer found in the address field of the instruction. The pointer is saved in
a register or a memory location (‘temp’ in the figure below), and is later used to find the next incomplete
instruction.
Stores the value of AB (=126) in the address field of the instruction, thereby completing it.
The last step is to store the value 126 in the ‘value’ field of AB in the symbol table, and to change the
type to D. The individual steps taken by the assembler in our example are shown in the table below.
It therefore follows that at the end of the single pass, the symbol table should only contain symbols
with a type of D. At the end of the pass, the assembler scans the symbol table for undefined symbols. If it
finds any symbols with a type of U, it issues an error message and will not start the program.
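The whole mechanism can be simulated in a few lines. The sketch below models memory as a dictionary from location to a [mnemonic, address field] pair and the symbol table as a dictionary from name to [value, type]; these representations, the function names, and the use of None for an empty field are choices made for the illustration only. The data is that of Figure 10.4.

```python
def use_symbol(memory, symtab, mnemonic, name, lc):
    """Assemble, at location lc, an instruction whose operand is `name`."""
    value, kind = symtab.setdefault(name, [None, "U"])
    # For a defined symbol, `value` is its address; for a future symbol it is
    # the pointer to the previous waiting instruction (None for the first).
    memory[lc] = [mnemonic, value]
    if kind == "U":
        symtab[name][0] = lc        # the table now points to this instruction


def define_symbol(memory, symtab, name, lc):
    """Handle the label `name`, found when the LC equals lc."""
    value, kind = symtab.get(name, [None, "U"])
    if kind == "D":
        raise ValueError(f"label {name} is doubly defined")
    pointer = value
    while pointer is not None:      # follow the chain of waiting instructions
        temp = memory[pointer][1]   # save the link before overwriting it
        memory[pointer][1] = lc     # complete the instruction
        pointer = temp
    symtab[name] = [lc, "D"]


memory, symtab = {}, {}
use_symbol(memory, symtab, "BEQ", "AB", 36)
use_symbol(memory, symtab, "BNE", "AB", 67)
use_symbol(memory, symtab, "JMP", "AB", 89)
print(symtab["AB"], memory[89])     # [89, 'U'] ['JMP', 67]
define_symbol(memory, symtab, "AB", 126)
print(memory)   # {36: ['BEQ', 126], 67: ['BNE', 126], 89: ['JMP', 126]}
undefined = [n for n, (_, kind) in symtab.items() if kind == "U"]
print(undefined)                    # []
```

The final list comprehension corresponds to the scan for undefined symbols at the end of the pass.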
Figure 10.5 is a flow chart of a one-pass assembler.
The one-pass assembler loads the machine instructions in memory and thus has no trouble in going back
and completing instructions. However, the listing generated by such an assembler is incomplete since it cannot
backspace the listing file to complete lines previously printed. Therefore, when an incomplete instruction
(one that uses a future symbol) is loaded in memory, it also goes into the listing file as incomplete. In
the example above, the three lines using symbol AB will be printed with asterisks ‘*’ or question marks ‘?’,
instead of the value of AB.
[Flow chart]
Figure 10.5. The operations of the one-pass assembler (part 1)
The key to the operation of a one-pass assembler is the fact that it loads the object code directly in
memory and does not generate an object file. This makes it possible for the assembler to go back and
complete instructions in memory at any time during assembly.
The one-pass assembler can, in principle, generate an object file by simply writing the object program
from memory to a file. Such an object file, however, would be absolute. Absolute and relocatable object
files are discussed in Section 10.7.
One more point needs to be mentioned here. It is the case where the address field in the instruction
is too small for a pointer. This is a common case, since machine instructions are designed to be short and
normally do not contain a full address. Instead of a full address, a typical machine instruction contains two
fields, mode and displacement (or offset), such that the mode tells the computer how to obtain the full
address from the displacement (Section 2.3). The displacement field is small (typically 8–12 bits) and has
no room for a full address.
To handle this situation, the one-pass assembler has an additional data structure, a collection of linked
lists, each corresponding to a future symbol. Each linked list contains, in its nodes, pointers to instructions
that are waiting to be completed. The list for symbol AB is shown in Figure 10.6 in three successive stages
of its construction.
When symbol AB is found, the assembler uses the information in the list to complete all incomplete
instructions. It then returns the entire list to the pool of available memory.
An easy way to maintain such a collection of lists is to house them in an array. Figure 10.7 shows our
list, occupying positions 5,9,3 of such an array. Each position has two locations, the first being the data item
stored (a pointer to an incomplete instruction) and the second, the array index of the next node of the list.
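symbol table          array position   3    4    5    6    7    8    9
 n    v   t           data            36        89                  67
 AB   5   U           next             /         9                   3
[Figure 10.7: the list for AB housed in an array; ‘/’ marks the end of the list.]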
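One way to manage such an array is to thread the unused positions into a free list, as in the sketch below. The class and method names are invented, NIL = -1 plays the role of the ‘/’ end marker, and the slots handed out depend on which positions happen to be free, so they will not match positions 5, 9, 3 of the figure.

```python
NIL = -1      # end-of-list marker


class FutureLists:
    """Linked lists of pointers to incomplete instructions, kept in arrays."""

    def __init__(self, size):
        self.data = [None] * size
        self.next = list(range(1, size)) + [NIL]   # all slots start out free
        self.free = 0

    def push(self, head, lc):
        """Put pointer lc in front of the list starting at head; return the new head."""
        slot = self.free
        if slot == NIL:
            raise MemoryError("no free nodes")
        self.free = self.next[slot]
        self.data[slot] = lc
        self.next[slot] = head
        return slot

    def drain(self, head):
        """Return all pointers of a list and give its nodes back to the pool."""
        pointers = []
        while head != NIL:
            pointers.append(self.data[head])
            following = self.next[head]
            self.next[head] = self.free     # the node rejoins the free list
            self.free = head
            head = following
        return pointers


lists = FutureLists(10)
head = NIL                      # this is what the symbol table entry holds
for lc in (36, 67, 89):
    head = lists.push(head, lc)
print(lists.drain(head))        # [89, 67, 36]
```

The value returned by push is what goes into the ‘value’ field of the symbol table entry.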
Exercise 10.9: What would be good Pascal declarations for such a future symbol list:
a. Using absolute pointers.
b. Housed in an array.
 n    v   t          n    v   t          n    v   t
 AB   •   U          AB   •   U          AB   •   U
      ↓                   ↓                   ↓
      36                  67                  89
                          ↓                   ↓
                          36                  67
                                              ↓
                                              36
[Figure 10.6: the list for symbol AB in three successive stages of its construction.]
The relocation bits themselves are not loaded into memory since memory should contain only the object
code. When the computer executes the program, it expects to find just instructions and data in memory.
Any relocation bits in memory would be interpreted by the computer as either instructions or data.
This explains why a one-pass assembler cannot generate a relocatable object file. The type of the
instruction (absolute or relocatable) can be determined only by examining the original source instruction.
The one-pass assembler loads the machine instructions directly in memory. Once in memory, the instruction
is just a number. By looking at a machine instruction in memory, it is impossible to tell whether the original
instruction was absolute or relocatable. Writing the machine instructions from memory to a file will create an
object file without any relocation bits, i.e., an absolute object file. Such an object file is useful on computers
where the program is always loaded at the same place. In general, however, such files have limited value.
Some readers are tempted, at this point, to find ways to allow a one-pass assembler to generate relocation
bits. Such ways exist, and two of them will be described here. The point is, however, that the one-pass
assembler is a simple, fast, assemble-load-go program. Any modifications may result in a slow, complex
assembler, thereby losing the main advantages of one-pass assembly. It is preferable to keep the one-pass
assembler simple and, if a relocatable object file is necessary, to use a two-pass assembler.
Another point to realize is that a relocatable object file contains more than relocation bits. It contains
loader directives and linking information. All this is easy for a two-pass assembler to generate but hard for
a one-pass one.
10.7.2 One-Pass, Relocatable Object Files
Two ways are discussed below to modify the one-pass assembler to generate a relocatable object file.
1. A common approach to modify the basic one-pass assembler is to have it generate a relocation bit each
time an instruction is assembled. The instruction is then loaded into memory and the relocation bit may be
stored in a special, packed array outside the program area. When the object code is finally written on the
object file, the relocation bits may be read from the special array and attached each to its instruction.
Such a method may work, but is cumbersome, especially because of future symbols. In the case of a
future symbol, the assembler does not know the type (absolute or relocatable) of the missing symbol. It
thus cannot generate the relocation bit, resulting in a hole in the special array. When the symbol definition
is finally found, the assembler should complete all the instructions that use this symbol, and also generate
the relocation bit and store it in the special array (a process involving bit operations).
2. Another possible modification to the one-pass assembler will be briefly outlined. The assembler can write
each machine instruction on the object file as soon as it is generated and loaded in memory. At that point
the source instruction is available and can be examined, so a relocation bit can be prepared and written
on the object file with the instruction. The only problem is, as before, instructions using future symbols.
They must go on the object file incomplete and without relocation bits. At the end of the single pass, the
assembler writes the entire symbol table on the object file.
The task of completing those instructions is left to the loader. The loader initially skips the first part of
the object file and reads the symbol table. It then rereads the file, and loads instructions in memory. Each
time it comes across an incomplete instruction, it uses the symbol table to complete it and, if necessary, to
relocate it as well.
The trouble with this method is that it shifts assembler tasks to the loader, forcing the loader to do
what is essentially a two-pass job.
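The packed array of relocation bits used by the first method, and the bit operations it involves, can be sketched as follows. The class name and the choice of eight bits per byte, indexed by memory location, are assumptions made for the illustration.

```python
class RelocationBits:
    """One relocation bit per memory word, packed eight to a byte."""

    def __init__(self, n_words):
        self.bits = bytearray((n_words + 7) // 8)

    def set(self, loc, relocatable):
        byte, bit = divmod(loc, 8)
        if relocatable:
            self.bits[byte] |= 1 << bit
        else:
            self.bits[byte] &= ~(1 << bit) & 0xFF

    def get(self, loc):
        byte, bit = divmod(loc, 8)
        return (self.bits[byte] >> bit) & 1


reloc = RelocationBits(128)
reloc.set(36, True)      # filled in late, once the future symbol is defined
print(reloc.get(36), reloc.get(37))    # 1 0
```

A location whose bit has not yet been set reads as 0, which is the hole mentioned above.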
None of these modifications is satisfactory. The lesson to learn from these attempts is that, traditionally,
the one-pass and two-pass assemblers have been developed as two different types of assemblers. The first is
fast and simple; the second, a general purpose program which can support many features.
10.7.3 The Task of Relocating
The role of the loader is not as simple as may seem from the above discussion. Relocating an instruction is
not always as simple as adding a start address to it. On the IBM 7090/7094 computers, for example, many
instructions have the format:
Prefix (3 bits) | Decrement (15 bits) | Tag (3 bits) | Address (15 bits)
The exact meaning of the fields is irrelevant except that the Address and Decrement fields may both contain
addresses. The assembler must determine the types of both fields (either can be absolute or relocatable),
and prepare two relocation bits. The loader has to read the two bits and should be able to relocate either
field. Relocating the Decrement field means adding the start address just to that field and not to the entire
instruction.
Exercise 10.10: How can the loader add something to a field in the middle of an instruction?
Those familiar with separate assembly (the EXTRN and ENTRY directives) know that each field can in fact
have three different types, Absolute, Relocatable, and special relocation. Thus the assembler generally has
to generate two relocation bits for each field which, in the case of the IBM 7090/7094 (or similar computers),
implies a total of four relocation bits. The loader uses those pairs of relocation bits as identification bits,
identifying each line in the relocatable object file as one of four types: an absolute instruction, a relocatable
instruction, an instruction requiring special relocation, or a loader directive.
On the PC, an absolute object file has the extension .COM, and a relocatable object file has the extension
.EXE.
10.7.4 Relocating Packed Instructions
An interesting problem is, how does the assembler handle relocation bits when several instructions are packed
in one word?
In a computer such as the CDC Cyber, only 30-bit and 60-bit instructions may contain addresses. There
are only six ways of combining instructions in a 60-bit word, as Figure 10.8 shows.
60
30 30
30 15 15
15 30 15
15 15 30
15 15 15 15
The assembler has to generate one of the values 0–5 as a 3-bit relocation field attached to each word
as it is written on the object file. The loader reads this field and uses it to perform the actual relocation
(Figure 10.9).
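A loader could decode the 3-bit field with a small table. In the sketch below the mapping from the values 0 through 5 to the six packings follows the order in which the packings are listed; the function name and the convention of counting bit offsets from the left end of the word are assumptions made for the example.

```python
# relocation field value -> sizes (in bits) of the instructions in the word
LAYOUTS = {
    0: (60,),
    1: (30, 30),
    2: (30, 15, 15),
    3: (15, 30, 15),
    4: (15, 15, 30),
    5: (15, 15, 15, 15),
}


def address_parcels(code):
    """Return (bit offset, size) of each instruction that may hold an address."""
    offset, result = 0, []
    for size in LAYOUTS[code]:
        if size >= 30:              # 15-bit instructions never contain addresses
            result.append((offset, size))
        offset += size
    return result


print(address_parcels(1))    # [(0, 30), (30, 30)]
print(address_parcels(3))    # [(15, 30)]
print(address_parcels(5))    # []
```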
10.8 Absolute and Rel. Address Expressions
Most assemblers can handle address expressions. Generally, an address expression may be written instead
of just a symbol. Thus the instruction LOD R1,AB+1 loads register 1 from the memory location following AB;
the instruction ADD R1,AB+5 similarly operates on the memory location whose address is 5 greater than the
address AB. More complex expressions can be used, and the following two points should be observed:
Many assemblers (almost all the old ones and some of the newer ones) do not recognize operator
precedence. They evaluate any expression strictly from left to right and do not even allow parentheses.
Thus the expression A+B*C will be evaluated by the assembler as (A+B)*C and not, as might be expected, as
A+(B*C). The reason is that complex address expressions are rarely necessary in assembler programs and it
is therefore pointless to add parsing routines to the assembler.
 relocation field    instruction sizes (bits)
        0            60
        1            30 30
        2            30 15 15
        3            15 30 15
        4            15 15 30
        5            15 15 15 15
[Figure 10.9: the value of the relocation field (left) and the corresponding packing of the 60-bit word.]
When an instruction using an expression is assembled, the assembler should generate a relocation bit
based on the expression. Every expression should therefore have a well defined type. It should be either
absolute or relative. As a result, certain expressions are considered invalid by the assembler. Examples:
AB+1 has the same type as AB. Typically AB is relative, but it is possible to define absolute symbols (by
means of EQU and SET). In general, expressions of the form rel + abs and rel − abs are relative, and expressions
of the form abs ± abs are absolute.
An expression of the form rel − rel is especially interesting. Consider the case
LC
16 A LOD
.
.
27 B STO
The value of A is 16 and its type is relative (meaning A is a regular label, defined by writing it to the left of an
instruction). Thus A represents address 16 from the start of the program. Similarly B is address 27 from the
start of the program. It is thus reasonable to define the expression B-A as having a value of 27 − 16 = 11 and
a type of absolute. It represents the distance between the two locations, and that distance is 11, regardless
of where the program starts.
Exercise 10.11: What about the expression A-B? Is it valid? If yes, what are its value and type?
On the other hand, an expression of the form rel + rel has no well-defined type and is, therefore, invalid.
Both A and B above are relative and represent certain addresses. The sum A+B, however, does not represent
any address. In a similar way abs ∗ abs is abs, rel ∗ abs is rel but rel ∗ rel is invalid. abs/abs is abs, rel/abs
is rel but rel/rel is invalid. All expressions are evaluated at the last possible moment. Expressions in any
pass 0 directives (see Chapter 11 for a discussion of pass 0) are evaluated when the directive is executed,
in pass 0. Expressions in any pass 1 directives are, similarly, evaluated in pass 1. All other expressions (in
instructions or in pass 2 directives) are evaluated in pass 2.
An extreme example of an address expression is A-B+C-D+E where all the symbols involved are relative.
It is evaluated from left to right as (((A-B)+C)-D)+E, generating the intermediate types: (((rel − rel) + rel) −
rel) + rel → ((abs + rel) − rel) + rel → (rel − rel) + rel → abs + rel → rel. The expression is therefore valid.
Generally, expressions of the type X+A-B+C-D+· · ·+M-N+Y are valid when X,Y are absolute and A,B,. . .,N
are relative. The relative symbols must come in pairs like A-B except the last one M-N, where N may be
missing. If N is missing, the entire expression is relative, otherwise, it is absolute.
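The left-to-right evaluation and the type rules for + and − can be combined in one short routine. This is a sketch under stated assumptions: the function name is invented, the values given to C, D, and E are made up, only + and − are supported, and abs − rel is treated as invalid, which the text above does not state explicitly.

```python
import re

# (left type, operator, right type) -> result type; anything else is invalid
TYPE_RULES = {
    ("abs", "+", "abs"): "abs",
    ("rel", "+", "abs"): "rel",
    ("abs", "+", "rel"): "rel",
    ("abs", "-", "abs"): "abs",
    ("rel", "-", "abs"): "rel",
    ("rel", "-", "rel"): "abs",
}


def evaluate(expr, symbols):
    """Evaluate expr strictly left to right; return (value, type)."""
    tokens = re.findall(r"[A-Za-z]\w*|\d+|[+-]", expr)

    def operand(tok):
        return (int(tok), "abs") if tok.isdigit() else symbols[tok]

    value, kind = operand(tokens[0])
    for op, tok in zip(tokens[1::2], tokens[2::2]):
        v, k = operand(tok)
        if (kind, op, k) not in TYPE_RULES:
            raise ValueError(f"invalid expression: {kind} {op} {k}")
        kind = TYPE_RULES[(kind, op, k)]
        value = value + v if op == "+" else value - v
    return value, kind


symbols = {"A": (16, "rel"), "B": (27, "rel"),
           "C": (40, "rel"), "D": (50, "rel"), "E": (60, "rel")}
print(evaluate("B-A", symbols))          # (11, 'abs')
print(evaluate("B+1", symbols))          # (28, 'rel')
print(evaluate("A-B+C-D+E", symbols))    # (39, 'rel')
```

An expression such as A+B raises ValueError, since rel + rel has no entry in the table.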
Exercise 10.12: How does the assembler handle an expression such as A-B+K-L in which all the symbols
are relative but K,L are external?
10.8.1 Summary
The two-pass assembler generates the machine instructions in pass two, where it has access to the source
instructions. It checks each source instruction and generates a relocation bit according to:
If the instruction uses a relative symbol, then it is relocatable and the relocation bit is 1.
If the instruction uses an absolute symbol (such as one defined by EQU) or uses no symbols at all, then the instruction
is absolute and the relocation bit is 0.
An instruction in the relative mode contains an offset, not the full address, and is therefore absolute
(see Section 2.5 for the relative mode).
The one-pass assembler generates the object file at the end of its single pass, by dumping all the
machine instructions from memory to the file. It has no access to the source at that point and therefore
cannot generate relocation bits.
As a result, those two types of assemblers have evolved along different lines, and represent two different
approaches to the overall assembler design, not just to the problem of resolving future symbols.
10.9 Local Labels
In principle, a label may have any name that obeys the simple syntax rules of the assembler. In practice,
though, label names should be descriptive. Names such as DATE, MORE, LOSS, RED are preferable to A001,
A002,. . .
There are exceptions, however. The use of the non-descriptive label A1 in the following example:
.
JMP A1
DDCT DS 12 reserve 12 locations for array DDCT
A1 .
.
is justified since it is only used to jump over the array DDCT. (Note that the array’s name is descriptive,
possibly meaning deductions or double-dictionary.) The DS directive is used to reserve memory for an array.
We say that A1 is used only locally, to serve a limited purpose.
As a result, many assemblers support a feature called local labels. It is due to M. E. Conway who used it
in the early UNISAP assembler for the UNIVAC I computer. The main idea is that if a label is used locally
and does not require a descriptive name, why not give it a name that will signify this fact. Conway used
names such as 1H, 2H for the local labels. The name of a local label in our examples is a single decimal digit.
When such a label is referred to (in the operand field), the digit is followed by either B or F (for Backward
or Forward).
LC
.
.
13 1: ...
.
.
17 JMP 1F jump to 24
.
.
24 1: LOD R2,1B 1B here means address 13
.
.
31 1: ADD R1,2F 2F is address 102
.
.
102 2: DC 1206,-17
.
.
115 SUB R3,2B-1 102-1=101
The example shows that local labels are a simple, useful concept that is easy to implement. In a two-pass
assembler, each local label is entered into the symbol table as any other symbol, in pass 1. Thus, the symbol
table in our example contains
n v
1 13
1 24
1 31
2 102
Symbol Table
The order of the labels in the symbol table is important. If the symbol table is sorted between the two
passes, all occurrences of each local label should remain sorted by value. In pass 2, when an instruction uses
a local label such as 1F, the assembler identifies the specific occurrence of label 1 by comparing all local
labels 1 to the current value of the LC. The first such instruction in our example is the JMP 1F at LC=17.
Clearly, the assembler should look for a local label with the name ‘1’ and a value ≥ 17. The smallest such
label has value 24. In the second case, LC=24 and the assembler is looking for a 1B. It needs the label with
name ‘1’ and a value which is the largest among all values < 24. It therefore identifies the label as the ‘1’ at
13.
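The pass 2 lookup just described can be sketched in a few lines of Python. This is a minimal sketch, not the code of any real assembler: the function name resolve_local and the representation of the symbol table as a list of (name, value) pairs are assumptions. The forward rule follows the text literally (a value ≥ the current LC).

```python
def resolve_local(symtab, name, direction, lc):
    """Resolve a reference such as 1F or 1B found at location counter lc.

    symtab is a list of (name, value) pairs, as built in pass 1.
    Returns the selected value, or None if no suitable label exists.
    """
    values = [v for (n, v) in symtab if n == name]
    if direction == "F":
        candidates = [v for v in values if v >= lc]   # rule from the text: value >= LC
        return min(candidates) if candidates else None
    if direction == "B":
        candidates = [v for v in values if v < lc]    # largest value below the LC
        return max(candidates) if candidates else None
    raise ValueError("direction must be 'F' or 'B'")

# The symbol table of the example above.
symtab = [("1", 13), ("1", 24), ("1", 31), ("2", 102)]
```

With this table, resolve_local(symtab, "1", "F", 17) selects the label at 24 and resolve_local(symtab, "1", "B", 24) selects the one at 13, as in the text.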
Exercise 10.13: If we modify the instruction at 24 above to read 1: LOD R2,1F would the 1F refer to
address 31 or 24?
In a one-pass assembler, again the labels are recognized and put into the symbol table in the single pass.
An instruction using a local label iB is no problem, since it needs the most recent occurrence of the local
label ‘i’ in the table. An instruction using an iF is handled like any other future symbol case. An entry is
opened in the symbol table with the name iF, a type of U, and a value which is a pointer to the instruction.
In the example above, a snapshot of the symbol table at LC=32 is
n v t
1 13 D
1 24 D
1 31 D 31 is the value of the third 1
2 31 U 31 is a pointer to the ADD instruction
Symbol Table
An advantage of this feature is that the local labels are easy to identify as such, since their names start with
a digit. Most assemblers require regular label names to start with a letter.
In modern assemblers, local labels sometimes use a syntax different from the one shown here.
10.9.1 The LC as a Local Symbol
Virtually all assemblers allow a notation such as BPL *+6 where ‘*’ stands for the current value of the LC.
The operand in this case is located at a point 6 locations following the BPL instruction.
The LC symbol can be part of any address expression and is, of course, relocatable. Thus *+A is valid if
A is absolute, while *-A is always okay (and is absolute if A is relative, relative if A is absolute). This feature
is easy to implement. The address expression involving the ‘*’ is calculated, using the current value of the
LC, and the value is used to assemble the instruction, or execute the directive, on the current source line.
Nothing is stored in the symbol table.
Some assemblers use the asterisk for multiplication, and may designate the period ‘.’ or the ‘$’ for the
LC symbol.
On the PDP-11 the notation X: .=.+8 is used to increment the LC by 8, and thus to reserve eight
locations (compare this to the DS directive).
Exercise 10.14: What is the meaning of JMP *, JMP *-*?
However, at run time, the hardware, after executing the ADD instruction, would try to execute the first
element of array D as an instruction. Obviously, instructions and data have to be separated, and normally
all the arrays and constants are declared at the end of the program, following the last executable instruction
(HLT).
Multiple location counters make it possible to enjoy the best of both worlds. The data can be declared
when first used, and can be loaded at the end of the program or anywhere else the programmer wishes.
This feature uses several directives, the most important of which will be described here. It is based on the
principle that new location counters can be declared and given names, at assembly time, at any point in
the source code. The example above can be handled by declaring a location counter with a name (such as
DATA) instructing the assembler to assemble the DS directive under that LC, and to switch back to the main
LC—which now must have a name—like any other LC. Its name is ‘ ’ (a space).
This is done by the special directive USE:
ADD D, ...
USE DATA
D DS 12
USE *
.
.
This directive directs the assembler to start using (or to resume the use of) a new location counter. The
name is specified in the operand field, so an empty operand means the main LC. The asterisk ‘*’ implies the
previous LC, the one that was used before the last USE.
Exercise 10.15: The previous section discusses the use of asterisk as the LC value. When executing a USE
*, how does the assembler know that the asterisk is not the LC value?
The USE directives divide the program into several sections, which are loaded, by the loader, into separate
memory areas. The sections are loaded in the order in which their names appear in the source. Figure 1–8
is a good example:
.
. (1)
.
USE DATA
.
. (2)
.
USE *
.
. (3)
.
USE BETA
.
. (4)
.
USE DATA
.
. (5)
.
USE <space>
.
. (6)
.
USE GAMMA
.
. (7)
.
END
At load time, the sections would be loaded in the order MAIN, DATA, BETA, GAMMA or 1,3,6,2,5,4,7. Such
a load involves an additional loader pass.
Exercise 10.16: Can we start a program with a USE ABC? In other words, can the first section be other
than the main section?
Another example of the same feature is procedures. In assembler language, a procedure can be written
as part of the main program. However, the procedure should only be executed when called from the main
program. Therefore, it should be separated from the main instruction stream, since otherwise the hardware
would execute it when it runs into the first instruction of the procedure. So something like:
.
.
0 LOD ...
.
.
15 SUB ...
16 CALL P
17 P ADD R5,N
.
.
45 RET
46 CLR ...
.
.
104 END
is wrong. The procedure is defined on lines 17–45 and is called on line 16. This makes the source program
more readable, since the procedure is written next to its call. However, the hardware would run into the
procedure and start executing it right after it executes line 16, i.e., right after it has been called. The
solution is to use a new LC (named, perhaps, PROC) by placing a USE PROC between lines 16 and 17 and a USE
* between lines 45 and 46.
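The way USE directives sort source chunks into sections can be simulated with a short Python sketch. It is illustrative only: the function name load_order, the tuple encoding of the source, the use of the empty string as the name of the main LC, and the choice to list the main section first are all assumptions made here. The input is the seven-chunk example with USE DATA, USE *, USE BETA, USE DATA, USE <space> and USE GAMMA.

```python
def load_order(source):
    """Group numbered source chunks by location counter, in order of first use.

    source is a list of ("chunk", n) and ("USE", name) items. The main LC is
    named "" here; "*" means the LC that was in use before the last USE.
    """
    sections = {"": []}            # dicts keep insertion order (Python 3.7+)
    current, previous = "", ""
    for kind, arg in source:
        if kind == "USE":
            target = previous if arg == "*" else arg
            previous, current = current, target
            sections.setdefault(current, [])
        else:
            sections[current].append(arg)
    # Concatenate the sections in the order in which their names appeared.
    return [n for chunks in sections.values() for n in chunks]

example = [("chunk", 1), ("USE", "DATA"), ("chunk", 2), ("USE", "*"),
           ("chunk", 3), ("USE", "BETA"), ("chunk", 4), ("USE", "DATA"),
           ("chunk", 5), ("USE", ""), ("chunk", 6), ("USE", "GAMMA"),
           ("chunk", 7)]
print(load_order(example))         # [1, 3, 6, 2, 5, 4, 7]
```

The same grouping handles the procedure example: the lines placed under USE PROC end up in their own section, away from the main instruction stream.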
The two arrays A, B would be loaded in the labeled common /NAM/, while the constants labeled C would
end up as part of section DAT.
The IBM 360 assembler has a CSECT directive, declaring the start of a control section. However, a
control section on the 360 is a general feature. It can be used to declare sections like those described here,
or to declare sections that are considered separate programs and are assembled separately. They are later
loaded together, by the loader, to form one executable program. The different control sections are linked by
global symbols, declared either as external or as entry points.
The VAX MACRO assembler has a .PSECT directive similar to CSECT, and it does not support multiple
LCs. A typical VAX example is:
.TITLE CALCULATE PI
.PSECT DATA, NOEXE,WRT
A=2000
B: .WORD 6
C: .LONG 8
.PSECT CODE, EXE,NOWRT
.ENTRY PI,0
.
.
<instructions>
.
.
$EXIT
.PSECT CONS, NOEXE,NOWRT
K: .WORD 1230
.END PI
Each .PSECT includes the name of the section, followed by attributes such as EXE, NOEXE, WRT, NOWRT.
The memory on the 80x86 microprocessors is organized in 64K (highly overlapping) segments. The
microprocessor can only generate 16-bit addresses, i.e., it can only specify an address within a segment. A
physical address is created by combining the 16-bit processor generated address with the contents of one
of the segment registers in a special way. There are four such registers: The DS (data segment), CS (code
segment), SS (stack segment) and ES (extra segment).
When an instruction is fetched, the PC is combined with the CS register and the result is used as the
address of the next instruction (in the code segment). When an instruction specifies the address of a piece
of data, that address is combined with the DS register, to obtain a full address in the data segment. The
extra segment is normally used for string operations, and the stack segment, for stack-oriented instructions
(PUSH, POP or any instructions that use the SP or BP registers).
The choice of segment register is done automatically, depending on what the computer is doing at the
moment. However, there are directives that allow the user to override this choice, when needed.
As a result of this organization, there is no need for multiple LCs on those microprocessors.
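The notes do not spell out the "special way" in which a segment register and a 16-bit address are combined. On the original 8086 (real mode) the usual rule is that the segment value is shifted left four bits and added to the offset, giving a 20-bit physical address. The sketch below assumes that rule; the function name is chosen here for illustration.

```python
def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment value and a 16-bit offset (8086 real mode)."""
    return ((segment << 4) + offset) & 0xFFFFF   # keep 20 bits

# Two different segment:offset pairs can name the same byte, which is why
# the 64K segments are described as highly overlapping.
print(hex(physical_address(0x1234, 0x0010)))     # 0x12350
print(hex(physical_address(0x1235, 0x0000)))     # 0x12350
```

Since a new segment can start every 16 bytes, code, data and stack can each be given their own segment register value.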
10.11 Literals
Many instructions require their operands to be addresses. The ADD instruction is typically written ADD AB,R3
or ADD R3,AB where AB is a symbol and the instruction adds the contents of location AB to register 3.
Sometimes, however, the programmer wants to add to register 3, not the contents of any memory location
but a certain constant, say the number −7. Modern computers support the immediate mode which allows
the programmer to write ADD #-7,R3. The number sign ‘#’ indicates the immediate mode and it implies that
the instruction contains the operand itself, not the address of the operand. Most old computers, however,
do not support this mode; their instructions have to contain addresses, not the operands themselves. Also,
in many computers, an immediate operand must be a small number.
To help the programmer in such cases, some assemblers support literals. A notable example is the MPW
assembler for the Macintosh computer. A literal is a constant preceded by an equal sign ‘=’. Using literals,
the programmer can write ADD =-7,R3 and the assembler handles this by:
Preloading the constant −7 in the first memory location in the literal table. The literal table is loaded
in memory immediately following the program.
Assembling the instruction as ADD TMP,R3 where TMP is the address where the constant was loaded.
Such assemblers may also support octal (=O377 or =377B), hex (=HFF0A), real (=1.37 or =12E-5) or
other literals.
10.11.1 The Literal Table
To handle literals, the assembler maintains a table, the literal table, similar to the symbol table. It has
columns for the name, value, address and type of each literal. In pass 1, when the assembler finds an
instruction that uses a literal, such as −7, it stores the name (−7) in the first available entry in the literal
table, together with the value (1...11001 in binary) and the type (decimal). The instruction itself is treated as any
other instruction with a future symbol. At the end of pass 1, the assembler uses the LC to assign addresses to
the literals in the table. In pass 2, the table is used, in much the same way as the symbol table, to assemble
instructions using literals. At the end of pass 2, every entry in the literal table is treated as a DC directive
and is written on the object file in the usual way. There are three points to consider when implementing
literals.
Two literals with the same name are considered identical; only one entry is generated in the literal
table. On the other hand, literals with different names are treated as different even if they have identical
values, such as =12.5 and =12.50.
All literals are loaded following the end of the program. If the programmer wants certain literals to
be loaded elsewhere, the LITORG directive can be used. The following example clarifies a point that should
be mentioned here.
.
ADD =-7,R3
.
LITORG
.
SUB =-7,R4
.
The first −7 is loaded, perhaps with other literals, at the point in the program where the LITORG is specified.
The second −7, even though identical to the first, is loaded separately, together with all the literals used
since the LITORG, at the end of the program.
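The behavior of the literal table, including the effect of LITORG, can be sketched as follows. This is a simplified illustration, not the design of any real assembler: the class and method names are assumptions, only decimal literals are parsed, every literal is assumed to occupy one word, and address assignment is folded into a single flush operation instead of being split between the two passes.

```python
class LiteralTable:
    """Collects literals by name and assigns addresses when the pool is flushed."""

    def __init__(self):
        self.pending = {}      # name -> value, literals not yet placed in memory
        self.placed = []       # (address, name, value) for every literal emitted

    def use(self, name):
        """Record a literal such as '=-7'; identical names share one entry."""
        self.pending.setdefault(name, float(name[1:]))

    def flush(self, lc):
        """Place the pending literals starting at lc (LITORG or end of program)."""
        for name, value in self.pending.items():
            self.placed.append((lc, name, value))
            lc += 1            # assumption: every literal occupies one word
        self.pending.clear()
        return lc

lt = LiteralTable()
for name in ("=-7", "=-7", "=12.5", "=12.50"):
    lt.use(name)
lc = lt.flush(100)             # a LITORG at LC=100 (address chosen arbitrarily)
lt.use("=-7")                  # same name, but used after the LITORG
lc = lt.flush(200)             # end of program
print(lt.placed)
```

The two uses of =-7 before the LITORG share one entry, =12.5 and =12.50 get separate entries because their names differ, and the =-7 used after the LITORG is placed again in the final pool.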
The LITORG directive is commonly used to make sure that a literal is loaded in memory close to the
instruction using it. This may be important in case the relative mode is used.
The LC can be used as a literal ‘=*’. This is an example of a literal whose name is always the same,
but whose value is different for each use.
Exercise 10.17: What is the meaning of JMP =*?
10.11.2 Examples
As has been mentioned before, some assemblers support literals even though the computer may have an
immediate mode, because an immediate operand is normally limited in size. However, more and more
modern computers, such as the 68000 and the VAX, support immediate operands of several sizes. Their
assemblers do not have to support any literals. Some interesting VAX examples are:
1. MOVL #7,R6 is assembled into D0 07 56. D0 is the opcode, 07 is a byte with two mode bits and six bits
of operand. The two mode bits (00) specify the short literal mode. This is really a short immediate
mode. Even though the word ‘literal’ is used, it is not a use of literal but rather an immediate mode.
The difference is that, in the immediate mode, the operand is part of the instruction whereas, when a
literal is used, the instruction contains the address of the operand, not the operand itself. The third byte
(56) specifies the use of register 6 in mode 5 (register mode). The assembler has generated a three-byte
MOVL instruction in the short literal mode. This mode is automatically selected by the assembler if the
operand fits in six bits.
2. MOVW I^#7,R6 is assembled into B0 8F 0007 56. Here the user has forced the assembler to use the
immediate mode by specifying I^. The immediate operand becomes a word (2 bytes or 16 bits) and the
instruction is now 5 bytes long. The second byte specifies register F (which happens to be the PC on
the VAX) in mode 8 (autoincrement). This combination is equivalent to the immediate mode, where
the immediate operand is stored in the third byte of the instruction. The last byte (56) is as before.
3. Again, a MOVL instruction but in a different context.
LC
MOVL #DATA,R6 assembled into D0 8F 00000037’ 56
.
.
0037 DATA .BYTE ...
.
.
Even though the operand is small (0037) and fits in six bits, the assembler has automatically selected
the immediate mode (8F) and has generated the constant as a long word (32 bits). The reason is that
the source instruction uses a future symbol (DATA). The assembler has to determine the instruction size
in pass 1 and, since DATA is a future symbol, the assembler does not have its value and has to assume
the largest possible value. The result is a seven byte instruction instead of the three bytes in the first
example!
Incidentally, the quote in (00000037’) indicates that the constant is relocatable.
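The mode selection in the three VAX examples can be summarized by a small Python sketch. It reproduces only the listing fields shown above (opcodes D0 and B0, the short literal byte, the 8F immediate specifier, and the register byte); the function names and parameters are assumptions, and the constant is formatted as it appears in the listing, not as it is laid out in memory.

```python
def encode_move(opcode, size, value, reg, known=True, force_immediate=False):
    """Return the listing fields of a VAX move with a constant source operand.

    opcode: 0xD0 for MOVL, 0xB0 for MOVW (values taken from the examples).
    size:   operand size in bytes (4 for MOVL, 2 for MOVW).
    known:  False if the operand is a future symbol when sizes are fixed.
    """
    dest = f"{0x50 | reg:02X}"                  # mode 5 (register) and register number
    if known and not force_immediate and 0 <= value <= 63:
        # Short literal mode: two mode bits 00 followed by six bits of operand.
        return [f"{opcode:02X}", f"{value:02X}", dest]
    # Immediate mode: mode 8 with register F (the PC), then a full-size constant.
    return [f"{opcode:02X}", "8F", f"{value:0{2 * size}X}", dest]

def length_in_bytes(fields):
    """Instruction length, counting two hex digits per byte."""
    return sum(len(f) // 2 for f in fields)

print(encode_move(0xD0, 4, 7, 6))                        # example 1
print(encode_move(0xB0, 2, 7, 6, force_immediate=True))  # example 2
print(encode_move(0xD0, 4, 0x37, 6, known=False))        # example 3
```

The three calls yield instructions of three, five, and seven bytes, matching the sizes discussed in the examples.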
10.12 Attributes of Symbols
The value of a symbol is just one of several possible attributes of the symbol, stored, together with the
symbol name, in the symbol table. Other attributes may be the type, LC name, and length. The LC name is
important for relocation. So far the meaning of relocation has been to add the start address of the program.
With multiple LCs, the meaning of ‘to relocate’ is to add the start address of the current LC section. When
a relocatable instruction is written on the object file, it is no longer enough to assign it a relocation bit
of 1. The assembler also has to write the name of the LC under which the instruction should be relocated.
Actually, a code number is written instead of the name.
Not every assembler supports the length attribute and, when supported, this attribute is defined in
different ways by different assemblers. The length of a symbol is defined as the length of the associated
instruction. Thus A LOD R1,54 assigns to label A the size of the LOD instruction (in words). However the
directive C DS 3 may assign to label C either length 3 (the array size) or length 1 (the size of each array
element). Also, a directive such as D DC 1.786,‘STRNG’,9 may assign to D either length 3 (the number of
constants) or the size of the first constant, in words.
The most important attribute of a symbol is its value, such as in SUB R1,XY. However, any attribute
supported by the assembler should be accessible to the programmer. Thus things such as T’A, L’B specify
the type and length of symbols and can be used throughout the program. Examples such as:
H DC L’X the length of X (in words) is preloaded in location H
G DS L’X array G has L’X elements
AIF (T’X=ABS).Z a conditional assembly directive, (Chapter 11).
are possible, even though not common.
10.13 Assembly-Time Errors
Many errors can be detected at assembly time, both in pass 1 and pass 2. Chapter 11 discusses pass 0, in
connection with macros, and that pass, of course, can have its own errors.
Assembler errors can be classified in two ways: by their severity, and by their location on the source line.
The first classification has three classes: warnings, errors, and fatal errors. A warning is issued when the
assembler finds something suspicious, but there is still a chance that the program can be assembled and run
successfully. An example is an ambiguous instruction that can be assembled in several ways. The assembler
decides how to assemble it, and the warning tells the user to take a careful look at the particular instruction.
A fatal error is issued when the assembler cannot continue and has to abort the assembly. Examples are a
bad source file or a symbol table overflow.
If the error is neither a warning nor fatal, the assembler issues a message and continues, trying to find
as many errors as possible in one run. No object file is created, but the listing created is as complete as
possible, to help the user to quickly identify all errors.
The second classification method is concerned with the field of the source instruction where the error
was detected. Wrong comments cannot be detected by the assembler, which leaves four classes of errors,
label, operation, operand, and general.
1. Label errors. A label can either be invalid (syntactically wrong), undefined, or multiply-defined. Since
labels are handled in pass 1, all label errors are detected in that pass (although undefined errors are
detected only at the end of pass 1).
2. Operation errors. The mnemonic may be unknown to the assembler. This is a pass 1 (or even pass 0)
error since the mnemonic is necessary to determine the size of the instruction.
3. Operand errors. Once the mnemonic has been determined, the assembler knows what operands to
expect. On many computers, a LOD instruction requires a register followed by an address operand. A
MOV instruction may require two address operands, and a RET instruction, no operands. An error ‘wrong
operand(s)’ is issued if the right type of operand is not found.
Even if the operands are of the right type, their values may be out of range. In a seemingly innocent
instruction such as LOD R16,#70000, either operand, or even both, may be invalid. If the computer has 16
registers, R0-R15, then R16 is out of range. If the computer supports 16-bit integers, then the number 70000
is too large.
Even if the operands are valid, there may still be errors such as a bad addressing mode. Certain
instructions can only use certain modes. A specific mode can only use addresses in a certain range.
4. General errors do not pertain to any individual line and have to do with the general status of the
assembler. Examples are ‘out of memory’, ‘cannot read/write file xxx’, ‘illegal character read from
source file’, ‘table xxx overflow’, and ‘phase error between passes’.
The last example is particularly interesting and will be described in some detail. It is issued when pass 1
makes an assumption that turns out, in pass 2, to be wrong. This is a severe error that requires a reassembly.
Phase errors require a computer with sophisticated instructions and complex memory management; they
don’t exist on computers with simple architectures. The Intel 80x86 microprocessors—with variable-size
instructions, several offset sizes, and segmented memory management—are a good example of computer
architecture where phase errors may easily occur.
Here are two examples of phase errors on those microprocessors.
An instruction in a certain code segment refers to a variable declared in a data segment following the
code segment. In pass 1, the assembler assumes that the variable is declared in the same segment as the
instruction, and is a future symbol. The instruction is determined accordingly. In pass 2, when the time
comes to assemble the instruction, all the variables are known, and the assembler discovers that the variable
in question is far. A longer instruction is necessary, the pass-1 assumption turns out to be wrong, and
pass 2 cannot assemble the instruction. This error is illustrated below.
An instruction in the relative mode has a field for the relative address (the offset). Several
offset sizes are possible on the 80x86, depending on the distance between the instruction and its operand.
If the operand is a future symbol, even in the same segment as the instruction, the assembler has to guess
a size for the offset. In pass 2 the operand is known and, if it is too far from the instruction, the offset size
guessed in pass 1 may turn out to be too small.
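The second kind of phase error can be illustrated by a short Python sketch. It is a simplification under stated assumptions: offsets are signed, the distance is measured from the LC of the instruction itself, and the names PhaseError, fits and pass2_check are invented here for illustration.

```python
class PhaseError(Exception):
    """Raised when a pass 1 size assumption turns out to be wrong in pass 2."""

def fits(distance, offset_bytes):
    """True if a signed offset of the given size can hold the distance."""
    bits = 8 * offset_bytes
    return -(1 << (bits - 1)) <= distance < (1 << (bits - 1))

def pass2_check(instr_lc, target_lc, guessed_offset_bytes):
    """Verify, in pass 2, the offset size that pass 1 guessed for a future symbol."""
    distance = target_lc - instr_lc
    if not fits(distance, guessed_offset_bytes):
        raise PhaseError(
            f"offset {distance} does not fit in {guessed_offset_bytes} byte(s)")
    return distance
```

For an instruction at LC 100 whose operand turns out to be at 400, a one-byte guess fails (the distance 300 exceeds 127) while a two-byte guess succeeds. An assembler can avoid the error by always guessing the largest size, at the cost of longer instructions.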
The Microsoft macro assembler (MASM), a typical modern assembler for the 80x86 microprocessors,
features a list of about 100 error messages. Even an early assembler such as IBMAP for the IBM 7090 had
a list of 125 error messages, divided into four classes according to the severity of the error.
A linear array.
A hash table.
Such a symbol table has a variable size. More nodes can be allocated and added to the buckets, and
the table can, in principle, use the entire available memory.
Advantages. Fast operations. Flexible table size.
Disadvantages. Although the number of steps is small, each step involves the use of a pointer and is therefore
slower than a step in the previous methods (that use arrays). Also, some programmers always tend to assign
names that start with an A. In such a case all the symbols will go into the first bucket, and the table will
behave essentially as a linear array.
Such an implementation is recommended only if the assembler is designed to assemble large programs,
and the operating system makes it convenient to allocate storage for list nodes.
Exercise 10.18: What if symbol names can start with a character other than a letter? Can this data
structure still be used? If yes, how?
10.14.4 A Binary Search Tree
This is a general data structure, used not just for symbol tables, and is quite efficient. It can be used by
either a one-pass or a two-pass assembler with the same efficiency.
The table starts as an empty binary tree, and the first symbol inserted into the table becomes the root
of the tree. Every subsequent symbol is inserted into the table by (lexicographically) comparing it with the
root. If the new symbol is less than the root, the program moves to the left son of the root and compares
the new symbol with that son. If the new symbol is greater than the root, the program moves to the right
son of the root and compares as above. If the new symbol turns out to be equal to any of the existing tree
nodes, then it is a doubly-defined symbol. Otherwise, the comparisons continue until a node is reached that
does not have a son. The new symbol becomes the (left or right) son of that node.
Example: Assume that the following symbols are defined, in this order, in a program.
BGH J12 MED CC ON TOM A345 ZIP QUE PETS
Symbol BGH becomes the root of the tree, and the final binary search tree is shown in Figure 10.10.
        BGH
       /   \
   A345     J12
           /   \
         CC     MED
                   \
                    ON
                      \
                       TOM
                      /   \
                   QUE     ZIP
                  /
              PETS
Most texts on Data Structures discuss binary search trees. The minimum number of steps for insertion
or search is obviously 1. The maximum number of steps depends on the height of the tree. The tree in
Figure 10.10 above has a height of 7, so the next insertion will require from one to seven steps. The height
of a binary tree with N nodes varies between log_2 N (which is the height of a fully balanced tree), and N
(the height of a skewed tree). It can be proved that an average binary tree is closer to a balanced tree than
to a skewed tree, and this implies that the average time for insertion or search in a binary search tree is of
the order of log_2 N.
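The insertion rule and the height of the example tree can be checked with a short Python sketch. The names Node, bst_insert and height are assumptions made here, and Python's built-in string comparison stands in for the lexicographic comparison of the text.

```python
class Node:
    def __init__(self, name):
        self.name, self.left, self.right = name, None, None

def bst_insert(root, name):
    """Insert name into the tree; return (root, number of comparisons made)."""
    if root is None:
        return Node(name), 0           # the first symbol becomes the root
    node, steps = root, 1
    while True:
        if name == node.name:
            raise ValueError(f"doubly-defined symbol {name}")
        side = "left" if name < node.name else "right"
        child = getattr(node, side)
        if child is None:              # a node without that son: attach here
            setattr(node, side, Node(name))
            return root, steps
        node, steps = child, steps + 1

def height(node):
    """Number of nodes on the longest path from this node down to a leaf."""
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

root = None
for sym in "BGH J12 MED CC ON TOM A345 ZIP QUE PETS".split():
    root, _ = bst_insert(root, sym)
print(height(root))                    # 7
```

Inserting the ten symbols in the order given produces a tree of height 7, in agreement with the text; a different insertion order would generally give a different shape.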
Advantages. Efficient operation (as measured by the average number of steps). Flexible size.
Disadvantages. Each step is more complex than in an array-based symbol table.
The recommendations for use are the same as for the previous method.
It should consider all the bits in the original name. Thus when two names that are slightly different
are hashed, there should be a good chance of producing different hash indexes.
For a group of names that are uniformly distributed over the alphabet, the function should produce
indexes uniformly distributed over the range 0 ... 2^N − 1.
Once the hash index is produced, it is used to insert the symbol into the array. Searching for symbols
is done in an identical way. The given name is hashed, and the hashed index is used to retrieve the value
and the type from the array.
Ideally, a hash table requires fixed time for insert and search, and can be an excellent choice for a large
symbol table. There are, however, two problems associated with this method, namely collisions and overflow,
that make hash tables less than ideal.
Collisions involve the case where two entirely different symbol names are hashed into identical indexes.
Names such as SYMB and ZWYG6 can be hashed into the same value, say, 54. If SYMB is encountered first in
the program, it will be inserted into entry 54 of the hash table. When ZWYG6 is found, it will be hashed, and
the assembler should discover that entry 54 is already taken. The collision problem cannot be avoided just
by designing a better hash function. The problem stems from the fact that the set of all possible symbols is
very large, but any given program uses a small part of it. Typically, symbol names start with a letter, and
consist of letters and digits only. If such a name is limited to six characters, then there are 26 × 36^5 (≈ 1.572
billion) possible names. A typical program rarely contains more than, say, 500 names, and a hash table of
size 512 (= 2^9) may be sufficient. When 1.572 billion names are mapped into 512 positions, more than 3
million names will map into each position. Thus even the best hash function will generate the same index
for many different names, and a good solution to the collision problem is the key to an efficient hash table.
The simplest solution involves a linear search. All entries in the symbol table are originally marked as
vacant. When the symbol SYMB is inserted into entry 54, that entry is marked occupied. If symbol ZWYG6
should be inserted into entry 54 and that entry is occupied, the assembler tries entries 55, 56 and so on.
This implies that, in the case of a collision, the hash table degrades to a linear table.
Another solution involves trying entry 54 + P where P and the table size are relatively prime. In either
case, the assembler tries until a vacant entry is found or until the entire table is searched and found to be
all occupied.
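Both collision strategies can be expressed by one insertion routine with a step parameter. The following Python sketch is illustrative: the name probe_insert, the wrap-around at the end of the table, and the contrived hash function used to force a collision at entry 54 are assumptions made here.

```python
def probe_insert(table, name, value, hash_fn, step=1):
    """Insert (name, value) into a fixed-size table; return the index used.

    step=1 gives the linear search; any step that is relatively prime to
    len(table) also visits every entry before giving up.
    """
    size = len(table)
    index = hash_fn(name) % size
    for _ in range(size):
        if table[index] is None:             # vacant entry found
            table[index] = (name, value)
            return index
        if table[index][0] == name:
            raise ValueError(f"doubly-defined symbol {name}")
        index = (index + step) % size        # collision: try another entry
    raise OverflowError("hash table is full")

table = [None] * 512
collide = lambda name: 54                    # contrived hash: every name maps to 54
print(probe_insert(table, "SYMB", 100, collide))    # 54
print(probe_insert(table, "ZWYG6", 200, collide))   # 55
```

Searching follows the same sequence of entries and stops either at the name or at the first vacant entry.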
It can be shown that the average number of steps to insert (or search for) a symbol is 1/(1 − p) where
p is the fraction of the table that is occupied. The value p = 0 corresponds to an empty table, p = 0.5 means a half-full
table, etc. The following table gives the average number of steps for a few values of p.
  p      number of steps
  0      1
  .4     1.67
  .5     2
  .6     2.5
  .7     3.33
  .8     5
  .9     10
  .95    20
It is clear that when the hash table gets more than 50%–60% full, performance suffers, no matter how
good the hashing function is. Thus a good hash table design makes sure that the table never gets more than
60% occupied. At that point the table is considered overflowed.
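The figures quoted for the average number of steps follow directly from the formula 1/(1 − p). A few lines of Python (the function name is chosen here for illustration) recompute them:

```python
def average_steps(p):
    """Average number of steps, 1/(1 - p), for a table that is a fraction p full."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return 1 / (1 - p)

for p in (0, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
    print(f"{p:<5} {average_steps(p):.2f}")
```

The steep growth beyond p = 0.6 is what motivates treating a 60% full table as overflowed.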
The problem of hash table overflow can be handled in a number of ways. Traditionally, a new, larger
table is opened and the original table is moved to the new one by rehashing each element. The space taken
by the original table is then released. A better solution, though, is to use open hashing.
10.14.7 Open Hashing
An open hash table is a structure consisting of buckets, each of which is the start of a linked list of symbols.
It is very similar to the buckets with linked lists discussed earlier. The principle of open hashing is to hash
the name of the symbol and use the hash index to select a bucket. This is better than using the first character
in the name, since a good hash function can evenly distribute the names over the buckets, even in cases
where many symbols start with the same letter.
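A minimal sketch of open hashing follows. Python lists stand in for the linked lists of the text, and the function names, as well as the particular hash function, are assumptions made here for illustration only.

```python
def make_table(n_buckets):
    """Create an open hash table with n_buckets empty buckets."""
    return [[] for _ in range(n_buckets)]

def simple_hash(name):
    """Illustrative hash that mixes every character of the name."""
    h = 0
    for ch in name:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

def bucket_insert(table, name, value, hash_fn=simple_hash):
    """Append (name, value) to the bucket selected by the hash index."""
    bucket = table[hash_fn(name) % len(table)]
    if any(n == name for n, _ in bucket):
        raise ValueError(f"doubly-defined symbol {name}")
    bucket.append((name, value))

def bucket_search(table, name, hash_fn=simple_hash):
    """Return the value stored for name, or None if it is not in the table."""
    bucket = table[hash_fn(name) % len(table)]
    for n, v in bucket:
        if n == name:
            return v
    return None
```

Because the bucket is chosen by the hash index and not by the first character, names such as A1, A2 and A3 need not share a bucket, and the table never overflows as long as memory is available for new list nodes.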
[Figure 11.1, panels (a) through (d)]
though section 1 is the first in the program, it is not the first to be executed since it constitutes the
definition of a subroutine. When the program is assembled, the assembler reads and assembles the source
file straight through. It does not change the positions of the different sections, and does not treat section 1
(the subroutine) in any special way. The order of execution has therefore to do only with the way the CALL
instruction works. This is why subroutines are a hardware feature.
Program B (Figure 11.1b) is handled in a different way. The assembler reads the MACRO, ENDM directives
and thus recognizes the two instructions DIV, OUT as the body of macro N. It then places a copy of that
body wherever it finds a source line with N in the operation field. The output is the same program, with the
macro definition removed, and with all the expansions in place. This output (Figure 11.1c) is ready to be
assembled, in the usual way, by passes 1 and 2. This is why macros are an assembler feature and handling
them is done by a special pass, pass 0, where a new source file is generated (Figure 11.1d) to be read by pass
1 as its source file.
Having a separate pass 0 simplifies the design of the assembler, since it divides the entire assembly job
in a logical way between the three passes. The user, of course, has to pay a price in the form of increased
assembly time, but this is a reasonable price to pay for the added power of the assembler. It is possible
to combine passes 0 and 1 into one pass, which speeds up the assembler. However, this results in a very
complex pass 1, which takes more time to write and debug, and reduces assembler reliability.
The task of pass 0 is thus to read the source file, handle all macro definitions and expansions, and
generate a new source file that is identical to the original file, except that it does not have the macro
definitions, and it has all the macro expansions in place. In principle, the new source file should have no
mention of macros; in practice, it needs to have some macro information which eventually is transferred to
pass 2, to be written on the listing file. This point is further discussed below.
11.1.1 Macro Definition And Expansion
To define a macro, the MACRO, ENDM directives are used.
Those two directives always come in pairs. The MACRO directive defines the start of the macro definition,
and should have the macro name in the label field. The ENDM directive specifies the end of the definition.
Some assemblers use different syntax to define a macro. The IBM 360 assembler uses the following
syntax:
MACRO
&p1 name &p2,&p3,. . .
.
.
MEND instead of ENDM
where &p1, &p2 are parameters (Section 11.2), each starting with an ampersand ‘&’.
To expand a macro, the name of the macro is placed in the operation field, and no special directives are
necessary.
.
.
COMP ..
NU
SUB ..
.
.
The assembler recognizes NU as the name of a macro, and expands the macro by placing a copy of the macro
definition between the COMP and SUB instructions. The object code generated will contain the codes of the
five instructions:
COMP ..
LOD A
ADD B
STO C
SUB ..
Handling macros involves two separate phases: handling the definition and handling the expansions. A
macro can only be defined once (see the discussion of nested macros in Section 11.6 for exceptions, however),
but it can be expanded many times. Handling the definition is a relatively simple process. The assembler
reads the definition from the source file and saves it in a special table, the Macro Definition Table (MDT).
The assembler does not try to check the definition for errors, to assemble it, execute it, or do anything else
with it. It just saves the definition as it is (again, there is an exception, mentioned on page 305, that has
to do with identifying parameters). On encountering the MACRO directive, the assembler switches from the
normal mode of operation to a special macro-definition mode in which it:
locates available space in the MDT
reads source lines and saves them in the MDT until an ENDM is read.
Upon reading ENDM from the source file, the assembler switches back to the normal mode. If the ENDM is
missing, the assembler stays in the macro definition mode and saves source lines in the MDT until an obvious
error is found, such as another MACRO, or the END of the entire program. In such a case, the assembler issues
an error (runaway definition) and aborts the assembly.
Handling a macro expansion starts when the assembler reads a source line whose operation is neither an instruction nor a
directive. The assembler searches the MDT for a macro with that name and, on locating it, switches from
the normal mode of operation to a special macro-expansion mode in which it:
Reads a source line from the MDT.
Writes it on the new source file, unless it is a pass 0 directive, in which case it is immediately executed.
Repeats the two steps until the end of the macro is located in the MDT.
The following example illustrates this process. The macro definition contains an error and a label.
BAD MACRO
ADD #1,R4
A$D R5 wrong mnemonic
LAN CMP R3,R5
ENDM
The definition is stored in the MDT with the error (A$D) and the label. Since the assembler copies the macro
definition verbatim, it does not recognize LAN as a label at this point. The macro may later be expanded
several times, causing several copies to be written onto the new source file. Pass 0 does not check these
copies in any way and, as a result, does not issue any error messages (note that pass 0 does not handle labels
and does not maintain the symbol table). When pass 1 reads the new source file, it discovers the multiple
definitions of LAN and issues an error on the second and subsequent definitions. When pass 2 assembles the
instructions, it discovers the bad A$D instructions and flags each of them.
Exercise 11.1: In such a case, how can we ever define a macro with a label?
This does not sound like a good way to implement macros. It would seem better to assemble the macro
when it is first encountered, i.e., when its definition is found, and to store the assembled version in the MDT.
The reason why assemblers do not do that, but rather treat macros as described earlier, is the use
of parameters.
11.2 Macro Parameters
The use of parameters is the most important feature of macros. It is similar to the use of parameters in
subroutines, but there are important differences. The following examples illustrate the use of parameters in
a simple macro. They show that parameters can be used in all the fields of an instruction, not just in the
operand field.
      1               2                    3                    4
MG1 MACRO       MG2 MACRO A,B,C      MG3 MACRO A,B,C      MG4 MACRO P
    LOD G           LOD A                A G                  LOD G
    ADD H           ADD B                B H              P   ADD H
    STO I           STO C                C I                  STO I
    ENDM            ENDM                 ENDM                 ENDM
Example 1 is a simple, three-line macro without parameters. Every time it is expanded, the same source
lines are generated. They add the contents of memory locations G and H, and store the result in location I.
Example 2 uses three parameters A,B,C for the three locations involved. The macro still does the same but,
each time it is expanded, different locations are added, and the result is stored in a different location. When
such a macro is expanded, the user should specify values (actual arguments) for the three parameters. Thus
the expansion MG2 X,Y,Z would generate:
LOD X
ADD Y
STO Z
For the assembler to be able to assemble those instructions, the arguments X,Y,Z must be valid symbols
defined in the program, i.e., the program should contain:
X DS 4
Y DC 44
Z EQU $FF00
The process of assigning the value of an actual argument to a formal parameter is called binding. Thus
the formal parameter A is bound to the actual argument X. The process of placing the actual arguments in
place of the formal parameters when expanding a macro, is called parameter substitution.
Exercise 11.2: Consider the case of an actual argument that happens to be identical to a formal parameter.
If the macro of example 2 above is expanded as MG2 B,X,Y, we would end up with the expansion
LOD B
ADD X
STO Y
However, B is the name of the second parameter. Would the assembler perform double substitution, to end
up with LOD X?
Example 3 is even more striking. Here the parameters are used in the operation field. The operands
are always the same. When such a macro is expanded, the user should specify three arguments which are
valid mnemonics. The expansion MG3 LOD,SUB,STO would generate:
LOD G
SUB H
STO I
whereas the expansion MG3 LOD,CMP,JNE would generate:
LOD G
CMP H
JNE I
which is a very different macro. It is obvious now that such a macro cannot be assembled when it is defined.
In example 4, the parameter is in the label field. Each expansion of this macro will have to specify an
argument for the parameter, and that argument will become a new label. Thus MG4 NON generates
LOD G
NON ADD H
STO I
Since an expansion may define new labels, macro expansions must be done early
in the assembly process. They cannot be done in the second pass because the symbol table must be stable
during that pass. All macro expansions are done in pass 0 (except in assemblers that combine passes 0 and
1).
Exercise 11.3: How many parameters can a macro have?
Another, more important, reason why macros must be expanded early is the need to maintain the
LC during the first pass. The LC is used, during that pass, to assign values to symbols. It is important,
therefore, to execute each expansion and to increment the LC while individual lines are expanded. If macros
are handled in pass 0, then pass 1 is not concerned about them and it proceeds normally.
The main advantage of having a separate pass 0 is simplicity. Pass 0 only handles macros and pass 1
remains unchanged. Another advantage is that the memory area used for the MDT in pass 0 can be used
for other purposes—like storing the symbol table—in the other passes. The main disadvantage of pass 0 is
the additional time it takes, but there is a simple way to speed it up. It is only necessary to require that
all macro definitions appear at the very beginning of the source file. When pass 0 starts, it examines the
first source line. If it is not a MACRO directive, pass 0 assumes that there are no macro definitions (and,
consequently, no macro expansions) in the program, and it immediately starts pass 1, directing it to use the
original, instead of the new, source file.
Another point that should be noted here is the listing information that relates to macros. In principle,
the new source file, generated by pass 0, contains no macro information. Pass 0 completely handles all
macro definitions and expansions, and pass 1 needs to know nothing about macros. In practice, though,
the user wants to see a listing of the macro definitions and, sometimes, also listings of all the expansions.
This information can only be generated by pass 0 and should be transferred, first to pass 1 through the
new source file, and then to pass 2—where the listing is created—through the intermediate file. This listing
information cannot be transferred by storing it in memory since, with many definitions and expansions, it
may be too large.
A related problem is the listing of macro expansions. The definition of a macro should obviously be
listed, but a macro may be expanded often, and the user may want to suppress the listing of all or some
of the expansions. Special directives that tell the assembler how to handle listing of macro expansions are
discussed in the literature.
11.2.1 Attributes of Macro Parameters
The last example of the use of parameters is a macro whose arguments may be compound.
C MACRO L1,L2,L3,L4,L5,L6
ADD L1,L2(2) L2 is assumed compound and its 2nd component used
L3
B’L4 DEST
C’L5’D L6
.
.
ENDM
Which illustrates the following points about handling arguments in a macro expansion:
1. There are two spaces between the ADD and the SUM on the first line. This is because the macro definition
has two spaces between the ADD and the L1. In the second line, though, there is only one space between
the SUB and the R1. This is because the argument in the expansion has one space in it. The assembler
expands a macro by copying the arguments, as strings, into the original macro line without any editing
(except that when the argument is compound, its parentheses are stripped off). In the third line,
the parameter occupies three positions (‘L4) but the argument Z only takes one position. When Z is
substituted for ‘L4, the source line becomes two positions shorter, which means that the rest of the line
is moved two positions to the left. The assembler also preserves the original space between L4 and DEST.
As a result, the expanded line has one space between BZ and DEST.
2. The second line of the definition simply reads L3. The assembler replaces L3 by the corresponding
argument SUB R1,L1 and, before trying to assemble it, scans the line again, looking for more occurrences
of parameters. In our example it finds that the newly generated line has an occurrence of L1 in it. This
is immediately replaced by the argument corresponding to L1 (SUM) and the line is then completely
expanded. The process of macro expansion turns out to be a little more complex than originally
described.
3. In line three of the definition the quote (’) separates the character B from the parameter L4. On
expanding this line, the assembler treats the quote as a separator, removes it, and concatenates the
argument Z to the character B, thus forming BZ. If BZ is a valid mnemonic, the line can be assembled.
This is another example of a macro line that has no meaning before being completely expanded.
4. The argument corresponding to parameter L5 is null. The result is the string CD with nothing between
the C and the D. Again, if CD is a valid mnemonic (or directive), the line can eventually be assembled
(or executed). Otherwise, the assembler flags it as an error.
Note that there is a difference between a null argument and an argument that is blank. In the expansion
C SUM,(D,T,U),(SUB R1,L1),Z, ,SN, the fifth argument is a blank space, which ends up being inserted
between the C and the D in the expanded source line. The final result is C D SN which is not the same as
CD SN. It could even be interpreted by the assembler as C=label, D=mnemonic, SN=operand. If a mnemonic
D exists, the instruction would be assembled and C would be placed in the symbol table, without any error
messages or warnings.
Exercise 11.4: What if the last argument of an expansion is null? How can the assembler distinguish
between a missing last argument and a null one?
A little thinking shows that the precise names of the formal parameters are not important. The parameters
are only used when the macro is expanded. No parameter names are used after pass 0 (except that
the original names should appear in the listing). As a result, the assembler replaces the parameter names
with serial numbers when the macro is placed in the MDT. This makes it easier to locate occurrences of the
parameters when the macro is later expanded. Thus in one of the examples above:
D MACRO A,B,C
LOD A
ADD B
STO C
ENDM
D MACRO AD,BCD,ST7
LOD AD
ADD BCD
STO ST7
ENDM
Let’s follow the editing of the second line. It is broken up into the two tokens ADD and BCD, and each
token is compared with all three parameter names. The first token ‘ADD’ almost matches the first parameter
‘AD’. The second one ‘BCD’ exactly matches the second parameter. That token is replaced by the serial
number ‘#2’ of the parameter, and the entire line—including the serial number—is stored in the MDT.
Our new example will be stored in the MDT in exactly the same way as before, illustrating the fact that
parameter names can be changed without affecting the meaning of the macro.
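The editing of a definition line can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: tokens are separated by blanks and commas, the serial-number marker is the string #n, and the function names store_definition and expand are invented here and do not come from any real assembler. A real assembler would also need a marker that cannot be confused with an immediate operand such as #1.

```python
import re

def store_definition(params, body):
    """Replace every token that equals a parameter name by its serial number."""
    serial = {name: f"#{i}" for i, name in enumerate(params, start=1)}
    stored = []
    for line in body:
        # Split on blanks and commas, keeping the separators so spacing survives.
        pieces = re.split(r"([ ,]+)", line)
        stored.append("".join(serial.get(piece, piece) for piece in pieces))
    return stored

def expand(stored, args):
    """Replace each serial number #i by the i-th actual argument."""
    return [re.sub(r"#(\d+)", lambda m: args[int(m.group(1)) - 1], line)
            for line in stored]

first = store_definition(["A", "B", "C"], ["LOD A", "ADD B", "STO C"])
second = store_definition(["AD", "BCD", "ST7"], ["LOD AD", "ADD BCD", "STO ST7"])
print(first)                           # the stored form of the first macro
print(first == second)                 # both macros are stored identically
print(expand(first, ["X", "Y", "Z"]))  # expansion with arguments X, Y, Z
```

Because whole tokens are compared, the mnemonic ADD is not mistaken for the parameter AD, which is the near match discussed above.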
[Figure 11.2: Flowchart of pass 0, showing the normal (N), definition (D), and expansion (E) modes.]
11.3 Pass 0
Pass 0 is devoted to handling macros. Macro definitions and expansions are fully done in pass 0, and the
output of that pass, the new source file, contains no trace of any macros (except the listing information
mentioned earlier). The file contains only instructions and directives, and is ready for pass 1. Pass 0 does
not handle label definitions, does not use the symbol table, does not maintain the LC, and does not calculate
the size of instructions. It should, however, execute certain directives that pertain to macros, and it has
to handle two types of special symbols. The directives are called pass 0 directives; they have to do with
conditional assembly, and are explained in Section 11.8. The special symbols involved are SET symbols
and sequence symbols. They are fully handled by pass 0, are stored in a temporary symbol table, and are
discarded at the end of the pass. The following rules (and Figure 11.2) summarize the operations of pass 0.
1. Read the next source line.
2. If it is MACRO, read the entire macro definition and store in MDT. Goto 1.
3. If it is a pass 0 directive, execute it. Goto 1. Such a directive is written on the new source file but in a
special way, not as a normal directive, since it is only needed for the listing in pass 2.
4. If it is a macro name, expand it by bringing in lines from the MDT, substituting parameters, and writing
each line on the new source file (or executing it if it is a pass 0 directive). Goto 1.
5. In any other case, write the line on the new source file. Goto 1.
6. If current line was the END directive, stop (end of pass 0).
To implement those rules, the concept of an operating mode is introduced. The assembler can be in
one of three modes: the normal mode (N)—in which it reads lines from the source file and writes them to
the new source file; the macro definition mode (D)—in which it reads lines from the source file and writes
them into the MDT; and the macro expansion mode (E)—in which it reads lines from the MDT, substitutes
parameters, and writes the lines on the new source file. A fourth mode (DE) is introduced in Section 11.6
in connection with nested macros.
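The three modes can be illustrated with a small Python sketch of pass 0. It is a simplified model and not the code of any actual assembler: it assumes that a label, if present, starts in column 1, that operands contain no blanks, that pass 0 directives and nested definitions are absent, and that parameters are substituted token by token. The names split_fields, substitute and pass0 are hypothetical.

```python
import re

def split_fields(line):
    """Return (label, operation, operands); a label, if present, starts in column 1."""
    fields = line.split()
    if line[:1].isspace():
        fields.insert(0, "")              # no label on this line
    fields += [""] * (3 - len(fields))
    return fields[0], fields[1], fields[2]

def substitute(line, binding):
    """Replace every token that is a formal parameter by its actual argument."""
    pieces = re.split(r"([ ,]+)", line)
    return "".join(binding.get(piece, piece) for piece in pieces)

def pass0(source_lines):
    mdt = {}                              # macro name -> (parameters, body lines)
    new_source = []
    mode, name = "N", None
    for line in source_lines:
        label, op, operands = split_fields(line)
        if mode == "D":                   # definition mode: save lines in the MDT
            if op == "ENDM":
                mode = "N"
            elif op in ("MACRO", "END"):
                raise SyntaxError("runaway definition")
            else:
                mdt[name][1].append(line)
            continue
        if op == "MACRO":                 # normal mode from here on
            name = label
            mdt[name] = (operands.split(",") if operands else [], [])
            mode = "D"
        elif op in mdt:
            mode = "E"                    # expansion mode: read lines from the MDT
            params, body = mdt[op]
            binding = dict(zip(params, operands.split(",")))
            for stored in body:
                new_source.append(substitute(stored, binding))
            mode = "N"
        else:
            new_source.append(line)
            if op == "END":
                break
    if mode == "D":
        raise SyntaxError("runaway definition")
    return new_source

program = ["MG2 MACRO A,B,C", "    LOD A", "    ADD B", "    STO C", "    ENDM",
           "    COMP R1,R2", "    MG2 X,Y,Z", "    SUB R1,R2", "    END"]
result = pass0(program)
print("\n".join(result))
```

The new source file produced for this small program contains the COMP and SUB lines, the three expanded lines between them, and the END directive, with no trace of the definition.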
Exercise 11.5: Normally, the definition of a macro must precede any expansions of it. If we eliminate that
restriction, what modifications do we have to make to the assembler?
11.4 MDT Organization
Most computers have operating systems (OS) that provide supervision and offer services to the users. The
date and time are two examples of such services. To use such a service (e.g., to ask the OS for the date) the
user has to place a request to the OS. Since the typical user cannot be expected to be familiar with the OS,
the OS provides built-in macros, called system macros. To get the date, for example, the user should write
a source line such as DATE D where DATE is a system macro, and D is an array in which DATE stores the date.
As a result, the MDT does not start empty. When pass 0 starts, the MDT contains all the system macros.
As more macro definitions are added to the MDT, it grows larger and larger and the problem of efficient
search becomes more and more important. The MDT should be organized to allow for efficient search, and
most MDTs are organized in one of two ways: as a chained MDT, or as an array where each macro is pointed
to from a Macro Name Table (MNT).
A chained MDT is a long array containing all the macro definitions, linked with backwards pointers.
Each definition points to its predecessor and each is made up of individual fields separated by a special
character that we will denote by . A typical definition contains:
where the last separator is immediately followed by the name of the next macro. Such an MDT is easy
to search by following the pointers and comparing names. Since the pointers point backwards, the table is
searched from the end to the beginning; an important feature. It guarantees that when a multiply-defined
macro is expanded, the back search will always find the last definition of the macro. Multiply-defined
macros are a result of nested macro definitions (Section 11.9), or of macros that users write to supersede
system macros. In either case, it is reasonable to require that the most recent definition always be used.
The advantage of this organization is its flexibility. It is only limited by one parameter, the size of the array.
The total size of all the definitions cannot exceed this parameter. Thus we can define a few long macros or
many short ones.
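The backward search of a chained MDT can be sketched as follows. The sketch is an assumption-laden model: the array is represented by a Python list, the capacity is counted in stored lines rather than bytes, and the class name ChainedMDT, its methods, and the one-line macro bodies are invented for the illustration.

```python
class ChainedMDT:
    def __init__(self, capacity):
        self.capacity = capacity      # the single size limit of the table
        self.used = 0
        self.entries = []             # each entry: (name, body, index of predecessor)
        self.last = None              # index of the most recent definition

    def define(self, name, body):
        size = 1 + len(body)          # one unit for the name plus one per line
        if self.used + size > self.capacity:
            raise MemoryError("MDT overflow")
        self.entries.append((name, list(body), self.last))
        self.last = len(self.entries) - 1
        self.used += size

    def find(self, name):
        """Follow the backward pointers, so the latest definition is found first."""
        index = self.last
        while index is not None:
            entry_name, body, previous = self.entries[index]
            if entry_name == name:
                return body
            index = previous
        return None

mdt = ChainedMDT(capacity=100)
mdt.define("DATE", ["system version"])          # a system macro, present at the start
mdt.define("NU", ["LOD A", "ADD B", "STO C"])
mdt.define("DATE", ["user version"])            # the user supersedes the system macro
print(mdt.find("DATE"))
```

Searching for DATE returns the user's definition because the search starts at the end of the chain; the system macro is still stored but is never reached.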
The other way to organize the MDT is to store the macros in an MDT array with separators as described
before, but without pointers. An additional array, called the MNT, contains pairs <macro name, pointer>
where the pointers point to the start of a definition in the MDT array. The advantage of this organization
is that the MNT has fixed size entries, allowing for a faster search of names. However, the total number of
macros that can be stored in such an MDT is limited by two parameters: the size of the MDT array, which
limits the total size of all the macros, and the size of the MNT, which limits the number of macros that
can be defined. Figure 11.3 is an example of such an MDT organization. It shows an MDT array with 3
macros. The first has 3 parameters, the second, 4, and the third, 2. The MNT array has fixed-size entries.
11.4.1 The REMOVE Directive
Regardless of the way the MDT is organized, if the assembler supports system macros, it should also support
a directive to remove a macro from the MDT. A user writing a macro to redefine an existing system macro
may want the redefinition to be temporary. They define their macro, expand it as many times as necessary,
and then remove it such that the original system macro can be used again. Such a directive is typically
called REMOVE.
In old assemblers this directive is executed by removing the pointer that points to the macro, not the
macro itself. Removing the macro itself and freeing the space in the MDT is done by many new assemblers
(Chapter 10). It is done by changing the macro definition to a null string, and storing another (smaller)
macro in the space thus freed. After defining and removing many macros, the MDT becomes fragmented; it
can be defragmented by moving macros around and changing pointers in the MNT.
Exercise 11.6: What could be a practical example justifying the actual removal of a macro from the MDT?
Associating by name, however, is different. Using the macro definition above, we can have an expansion
M P2=DON,P1=SON,P3=YON. Here each of the three actual arguments SON, DON, YON is explicitly associated
with one of the parameters.
It is possible to combine the two methods, such as in M P3=MAN,DAN,P1=JAN. Here the second argument
has no name association and it therefore corresponds to the second parameter. This, of course, implies that
an expansion such as M P2=MAN,DAN,P1=JAN, is wrong.
Exercise 11.8: There is, however, a way to assign such an expansion a unique meaning. What is it?
The third method uses a special parameter named SYSLIST such that SYSLIST(i) refers to the ith
the first parameter would be bound to ‘-12’, and the second, to ‘45,=a98.62’. Note that the comma is part
of the second actual argument, not a delimiter. Also, the period in ‘98.62’ is considered part of the second
argument, not a delimiter.
Exercise 11.9: What happens if the user forgets the period-space?
11.5.3 Numeric Values of Arguments
5. Macro arguments are normally treated as strings. However, MACRO, the VAX assembler, can optionally
use the value, rather than the name, of an argument. This is specified by means of a ‘\’. A simple
example is:
.MACRO CLEER ARG
CLRL R’ARG
.ENDM
After defining this macro, we assign CONS=5, and expand CLEER twice. The expansion CLEER CONS generates
CLRL RCONS (which is probably wrong), whereas the expansion CLEER \CONS generates CLRL R5.
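The difference between passing a name and passing a value can be modeled in Python. This is a rough sketch of the idea only, not of the VAX assembler itself: the functions bind and expand_cleer are hypothetical, and the symbol values are kept in an ordinary dictionary.

```python
def bind(argument, symbols):
    """Return the string substituted for a parameter.

    A leading backslash asks for the value of the symbol instead of its name.
    """
    if argument.startswith("\\"):
        return str(symbols[argument[1:]])
    return argument

def expand_cleer(argument, symbols):
    # The body of CLEER is the single line  CLRL R'ARG ; the apostrophe only
    # separates R from the parameter ARG and is removed on expansion.
    return "CLRL R'ARG".replace("'ARG", bind(argument, symbols))

symbols = {"CONS": 5}
by_name = expand_cleer("CONS", symbols)
by_value = expand_cleer("\\CONS", symbols)
print(by_name)
print(by_value)
```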
11.5.4 Attributes of Macro Arguments
Each argument in a macro expansion has attributes that can be used to make decisions—inside the macro
definition—each time the macro is expanded. At the time the macro definition is written, the arguments are
unknown. They only become known when the macro is expanded, and may have different attributes each
time the macro is expanded.
Exercise 11.10: Chapter 10 mentions attributes of symbols. What is the difference between attributes of
symbols and of macro arguments?
We will discuss the six attributes supported by the IBM 360 assembler and will use, as an example, the
simple macro definition:
M MACRO P1
P1
ENDM
followed by the three expansions:
M FIRST
M SEC
M (X,Y,Z) the argument is compound
FIRST DC P’+1.25’. DC is the Define Constant directive, and symbol FIRST is the name of a packed
decimal constant.
SEC ADD 5,OP. Symbol SEC is the label of an ADD instruction.
The example illustrates the following attributes:
The count attribute, K, is the length of the actual argument. Thus K‘P1 is 5 in the first expansion, 3
in the second one, and 7, in the third.
The type attribute, T, is the type of the actual argument. In the first expansion it is ‘P’ (for Packed
decimal) and in the second, ‘I’ (since the argument is an Instruction). In the third expansion the type is ‘N’
(a self-defined term).
The length attribute, L, is the number of bytes occupied by the argument in memory; 2 in the first
case, since the packed decimal constant 1.25 occupies two bytes, and 4 in the second case, since the ADD
instruction takes 4 bytes. The compound argument of the third expansion has no length attribute.
The integer attribute, I, is the number of decimal digits in the integer part of the argument; 1 in the
first expansion, 0 in the second.
The scaling attribute, S, is the number of decimal digits in the fractional part of the argument; 2 in
the first example and 0 in the second one.
The number attribute, N only has a meaning if the argument is compound. It specifies the number of
elements in the compound argument. In the third example above N‘P1 is 3.
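Two of these attributes depend only on the text of the argument and are easy to model. The following Python sketch computes the count (K) and number (N) attributes for the three expansions above. It assumes that a compound argument is a parenthesized list without nested parentheses, and the function names are invented for the illustration. The other attributes need information about the symbol (its type, length, and value) and are not modeled.

```python
def count_attribute(argument):
    """K: the number of characters in the actual argument."""
    return len(argument)

def number_attribute(argument):
    """N: the number of elements of a compound argument, or None otherwise."""
    if argument.startswith("(") and argument.endswith(")"):
        return len(argument[1:-1].split(","))
    return None

arguments = ["FIRST", "SEC", "(X,Y,Z)"]
counts = [count_attribute(a) for a in arguments]
numbers = [number_attribute(a) for a in arguments]
print(counts)
print(numbers)
```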
The attributes can be used in the macro itself and the most common examples involve conditional
assembly (Section 11.8).
11.5.5 Directives Related to Arguments
MACRO, the VAX assembler, supports several interesting directives that make it easy to write sophisticated
macros. Here are a few:
The .NARG directive provides the number of actual arguments. There may be fewer arguments than
parameters. Its format is .NARG symbol and it returns the number of arguments in the symbol. Thus macro
Block
.MACRO Block A,B,C,D
.NARG Num
...
.BLKW Num
.ENDM
creates a block of 1–4 words each time it is expanded, depending on the number of arguments in the expansion.
The directive .NCHR returns the size of a character string. Its format is .NCHR symbol,<string>.
Thus after defining:
.MACRO Exmpl L,S
.NCHR Size,S
...
L: .BLKW Size
.ENDM
the expansion Exmpl M,<Yours T> will generate M: .BLKW 7
The IF-ELSE-ENDIF directive can be used to determine whether an argument is BLANK or NOT BLANK,
or whether two arguments are IDENTICAL or DIFFERENT. It can also be used outside macros to compare any
quantities known in pass 0, and to determine if a pass 0 quantity is defined or not.
11.5.6 Default Arguments
Some assemblers allow a definition such as N MACRO M1,M2=Z,M3, meaning that if the actual argument
binding M2 is null in a certain expansion, then M2 will be bound to Z by default.
Exercise 11.11: What is the meaning of N MACRO M1,M2=,M3?
The IRP directive loops 4 times, assigning the actual arguments of D, C, B, A to Temp. The .IIF directive
generates a PUSHL instruction, to push the argument’s address on the stack, for any non-blank argument.
Finally, a CALLS instruction is generated, to perform the actual procedure call. This is a handy macro that
can be used to call procedures with up to 4 parameters.
IRP stands for Indefinite RePeat.
expansions (PRINT NOGEN) or to turn on such listings (PRINT GEN). This directive does not affect the listing
of the macro definitions or of the body of the program. Those listings are controlled by the LIST, NOLIST
directives.
MACRO A,B,C
* WATCH OUT, PARAMETER B IS SPECIAL
.
.
C R1
* THE PREVIOUS LINE CHANGES ITS MEANING
.
.
The comments should be printed, together with the definition of the macro, in the listing file, but should
they also be printed with each expansion? The most general answer is: It depends. Some comments refer to
the lines in the body of the macro and should be printed each time an expansion is printed (as mentioned
elsewhere, the printing of macro expansions is optional). Other comments refer to the formal parameters of
the macro, and should be printed only when the macro definition is printed. The decision should be made
by the programmer, which means that the assembler should have two types of comment lines, the regular
type, which is indicated by an asterisk, and the special type, indicated by another character, such as a ‘!’,
for comments that should be printed only as part of a macro definition.
There is nothing unusual about macro C. An expansion of macro A, however, is special since it involves an
expansion of C. An expansion of B is even more involved. Most assemblers support nested macro expansion
since it is useful and also easy to implement. They allow it up to a certain maximum depth. In our example,
the expansion of B turns out to be nested to a depth of 2. The expansion of C is nested to a depth of 0.
[Figure 11.4: Three steps (I, II, III) in a nested macro expansion.]
discards the second pointer and switches to the first one—which means resuming the expansion of the original
macro B. Three typical steps in this process are shown in Figure 11.4.
In part I, the second line of B has been fetched and the (first) pointer points to that line. The expansion
of macro A has just started and the second pointer points to the first line of A.
In part II, the second pointer points to the second line of A. This means that the line being processed
is the second line of A. The expansion of C has just started.
In part III, the expansion of C has been completed, the third pointer discarded, and the assembler is in
the process of fetching the third line of A.
The rules for nested macro expansion therefore are:
In the macro expansion mode, when encountering the name of a macro, find it in the MDT, set up
a new pointer to point to the first line, save the arguments of the current macro, and continue expanding,
using the new pointer.
After fetching and expanding the last source line of a macro, discard the current pointer and start
using the previous one (and the previous set of arguments).
If there is no previous pointer, the (nested) macro expansion is over.
From this discussion it is clear that the pointers are used in a Last In First Out (LIFO) order, and should
thus be stored in a stack. This stack is called the macro expansion stack (MES), and its size determines the
number of possible pointers and thus the maximum depth of nesting.
Implementing nested macro expansions is, therefore, done by declaring a stack and using it while expanding
a macro. All other details of the expansion remain the same as in a normal expansion (however, see
conditional assembly, Section 11.8).
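The use of the MES can be sketched in Python. The sketch is a simplified model under stated assumptions: macros have no parameters (so no arguments are saved on the stack), the MDT is a dictionary from macro names to body lines, and the bodies of macros A, B and C shown here are made up for the illustration. The function name expand_nested and the default stack size are also hypothetical.

```python
def expand_nested(name, mdt, max_depth=8):
    """Expand macro `name`; `mdt` maps a macro name to its list of body lines."""
    output = []
    mes = [[name, 0]]                 # stack of [macro name, index of next line]
    while mes:
        macro, index = mes[-1]
        body = mdt[macro]
        if index == len(body):        # end of macro: discard the current pointer
            mes.pop()
            continue
        mes[-1][1] = index + 1        # advance the current pointer
        line = body[index]
        op = line.split()[0]
        if op in mdt:                 # a macro name: set up a new pointer
            if len(mes) == max_depth:
                raise RecursionError("MES overflow")
            mes.append([op, 0])
        else:
            output.append(line)
    return output

mdt = {
    "C": ["INSTC"],
    "A": ["INSTA1", "C", "INSTA2"],
    "B": ["INSTB1", "A", "INSTB2"],
}
expansion = expand_nested("B", mdt)
print(expansion)
```

When the stack becomes empty the expansion is over, which corresponds to switching back to the normal mode. The size limit max_depth plays the role of the size of the MES.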
The following is a set of rules and a flow chart (Figure 11.5, a generalized subset of Figure 11.2) which
illustrate the main operations of a pass 0 supporting nested macro expansions.
1. Input line from MDT (since mode=E).
2. If it is a pass 0 directive, execute it. Goto 1.
3. If it is a macro name, start a new expansion. Goto 1.
4. If it is an end-of-macro character, stop current expansion and look back in MES.
• If MES empty, change mode to N. Goto main input.
[Figure 11.5: Flowchart of the expansion mode of pass 0 with nested macro expansions, using the MES.]
NOGOOD MACRO
INST1
NOGOOD
INST2
ENDM
An attempt to expand this macro will generate INST1 and will then start the inner expansion. The inner
expansion will do the same. It will generate INST1 and start a third expansion. There is nothing in this
macro to stop any of the inner expansions and go back to complete the outer ones. Such an expansion will
very quickly overflow the MES, no matter how large it is.
It turns out, though, that such macros, called recursive macros, are very useful. To implement such
a macro, a mechanism is necessary that will stop the recursion (the self expansion) at some point. Such a
mechanism is supported by many assemblers and is called conditional assembly.
11.8 Conditional Assembly
Q SET 1
NOGOOD
will start a recursive expansion where, in each step, the value of Q (in the temporary symbol table) will
be incremented by 1. This expansion is still an infinite one and, in order to stop it after N steps, a test is
necessary, to compare the value of Q to N. This test is another pass 0 directive, typically called AIF (Assembler
IF). Its general form is AIF exp.symbol, where exp is a boolean expression containing only quantities known
to the assembler. The assembler evaluates the expression and, if its value is true, the assembler goes to the
line labeled .symbol.
The next version of the same macro is now:
GOOD MACRO
INST1
Q SET Q+1
AIF Q=N.F if Q equals N then go to line labeled .F
GOOD
.F INST2
ENDM
An expansion such as
N EQU 2
Q SET 0
GOOD
The macro has been recursively expanded to a depth of 2 because of the way symbols Q, N have been
defined. It is also possible to say AIF Q=2.F, in which case the depth of recursion will depend only on the
initial value of Q.
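The expansion of GOOD can be traced with a small Python model of the SET and AIF directives. It is a sketch under several assumptions: each stored line is a (label, operation, operand) triple, SET expressions are sums of constants and symbols, the AIF condition is a single equality, and both Q and N are kept in one dictionary that stands for the quantities known in pass 0. The function names and the recursion limit are invented for the illustration.

```python
def term(text, symbols):
    """A term is a decimal constant or the name of a known symbol."""
    return int(text) if text.isdigit() else symbols[text]

def evaluate(expression, symbols):
    """Evaluate a sum of terms such as Q+1."""
    return sum(term(part, symbols) for part in expression.split("+"))

def expand(name, mdt, symbols, output, depth=1, limit=20):
    """Expand macro `name`, executing SET and AIF; return the deepest nesting level."""
    if depth > limit:
        raise RecursionError("MES overflow")
    body = mdt[name]
    deepest = depth
    i = 0
    while i < len(body):
        label, op, operand = body[i]
        i += 1
        if op == "SET":
            symbols[label] = evaluate(operand, symbols)
        elif op == "AIF":
            condition, target = operand.split(".")
            left, right = condition.split("=")
            if term(left, symbols) == term(right, symbols):
                # Continue with the line labeled by the sequence symbol.
                i = next(k for k, line in enumerate(body) if line[0] == "." + target)
        elif op in mdt:
            deepest = max(deepest, expand(op, mdt, symbols, output, depth + 1, limit))
        else:
            output.append(op)
    return deepest

mdt = {"GOOD": [("", "INST1", ""), ("Q", "SET", "Q+1"), ("", "AIF", "Q=N.F"),
                ("", "GOOD", ""), (".F", "INST2", "")]}
symbols = {"N": 2, "Q": 0}
generated = []
depth = expand("GOOD", mdt, symbols, generated)
print(generated, depth)
```

With N=2 and Q=0 the model generates INST1 twice and then INST2 twice, and reports a depth of 2.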
Exercise 11.14: Is it valid to write AIF Q>N.F?
An important question that should be raised at this point is: what can N be? Clearly it can be anything
known in pass 0, such as another (non-future) SET symbol, a constant, or an actual argument of the current
macro. This, however, offers only a limited choice and some assemblers allow N to be an absolute symbol,
defined by an EQU. An EQU, however, is a pass 1 directive, so such an assembler should look at each EQU
directive in pass 0 and—if the EQU defines an absolute symbol—execute it, store the symbol in the temporary
symbol table, and write the EQU on the new source file for a second execution in pass 1. In pass 1, all EQU
directives are executed and all EQU symbols, absolute and relative, end up in the permanent symbol table.
Some assemblers go one more step and combine pass 0 with pass 1. This makes for a very complex
pass but, on the other hand, it means that conditional assembly directives can use any quantities known in
pass 1. They can use the values of any symbols in the symbol table (i.e., any non-future symbols), the value
of the LC, and other things known in pass 1 such as the size of the last instruction read. The following is
an example along these lines.
.
.
P DS 10 P is the start address of an array of length 10
N DS 3 N is the start address of an array of length 3 immediately
. following array P. Thus N=P+10
.
Q SET P The value of Q is an address
GOOD depth of recursion will be 10 since
. it takes 10 steps to increment the
. value of Q from address P to address N.
The next example is more practical. It is a recursive macro FACT that calculates a factorial. Calculating
a factorial is a common process and is done by computers all the time. In our example, however, it is done
at assembly time, not at run time.
FACT MACRO N
S SET S+1
K SET K*S
AIF S=N.DON
FACT N
.DON ENDM
The expansion:
S SET 0
K SET 1
FACT 4
will calculate 4! (=24) and store it in the temporary symbol table as the value of the SET symbol K. The
result can only be used at assembly time, since the SET symbols are wiped out at the end of pass 0. Symbol
K could be used, for example, in an array declaration such as: FACT4 DS K which declares FACT4 as an array
of length 4!. This, of course, can only be done if the assembler combines passes 0 and 1.
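The behavior of the recursive FACT macro can be mimicked in a few lines of Python. This is only a sketch, not assembler code: the dictionary `symbols` stands in for the temporary symbol table, and the function name `expand_fact` is hypothetical.

```python
def expand_fact(n, symbols):
    """Mimic one expansion of FACT N; the recursive call plays the role of
    the nested expansion FACT N inside the macro body."""
    symbols["S"] += 1                 # S SET S+1
    symbols["K"] *= symbols["S"]      # K SET K*S
    if symbols["S"] == n:             # AIF S=N.DON
        return                        # .DON ENDM
    expand_fact(n, symbols)           # FACT N


# S SET 0, K SET 1, followed by the expansion FACT 4
symbols = {"S": 0, "K": 1}
expand_fact(4, symbols)
print(symbols["K"])                   # 24
```

As in the macro itself, the recursion stops only when S reaches N, so an argument smaller than 1 would never terminate the expansion.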
The FACT macro is rewritten below using the IIF directive. IIF stands for Immediate IF. It has the
form IIF condition,source line. If the condition is true, the source line is expanded, otherwise, it is ignored.
FACT MACRO N
S SET S+1
K SET K*S
IIF S=N,(FACT N)
ENDM
line 3
ENDIF
If X=2, then line 1 will be expanded, otherwise, lines 2 and 3 will be expanded.
Exercise 11.15: What can X be?
The IBM 360 and 370 assemblers were the first to offer an extensive conditional assembly facility, and
similar facilities are featured by most modern assemblers. It is interesting to note that the MPW assembler
for the Macintosh computer supports conditional assembly directives that are almost identical to those of the
old 360. Some of the conditional assembly features supported by the 360, 370 assemblers will be discussed
here, with examples. Notice that those assemblers require all macro parameters and all SET symbols to start
with an ampersand ’&’.
The AIF directive on those assemblers has the format AIF (exp)SeqSymbol. The expression may con-
tain SET symbols, absolute EQU symbols, constants, attributes of arguments, arithmetic operators, the six
relationals (EQ NE GT LT GE LE), the logical operators AND OR NOT, and parentheses. SeqSymbol (sequence
symbol) is a symbol that starts with a period. Such a symbol is special: it has no value and is not
stored in any symbol table. It is used only for conditional assembly and thus only exists in pass 0. When the
assembler executes an AIF and decides to go to, say, symbol .F, it searches, in the MDT, for a line labeled
.F, sets the current MES pointer to point to that line, and continues the macro expansion. If a line fetched
from the MDT has such a label, the line goes on the new source file without the label, which guarantees
that only pass 0 will see the sequence symbols. In contrast, regular symbols (address symbols and absolute
symbols) do not participate in pass 0. They are defined in pass 1, and their values used in pass 2.
Examples:
AIF (&A(&I) EQ ‘ABC’).TGT where &A is a compound parameter and &I is a SET symbol used to
select one component of &A.
AIF (T‘&X EQ O).KL if the type attribute of argument &X is O, meaning a null argument, then go to
symbol .KL.
AIF (&J GE 8).HJ where &J could be either a parameter or a SET symbol.
AIF (&X AND &B EQ ‘(’ ).LL where &X is a B type SET symbol (see below) and &B is either a C type
SET symbol or a parameter.
The AGO Directive: The general format is ‘AGO SeqSymbol’. It directs the assembler to the line labeled
by the symbol. (This is an unconditional goto at assembly time, not run time.)
The ANOP (Assembler No OPeration) directive. The assembler does nothing in response to this directive,
and its only use is to provide a line that can be labeled with a sequence symbol. This directive is used in
one of the examples at the end of this chapter.
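The way AIF, AGO, ANOP, and sequence symbols steer an expansion can be sketched with a tiny interpreter. This is a simplified model, not the 360 assembler: the (label, op, operand) representation of a macro body, the use of Python functions for conditions and SETA expressions, and the generated line `AD 5,X` are all assumptions made for illustration.

```python
def expand(body, env):
    """Expand a macro body given as (label, op, operand) triples.

    Hypothetical representation: an AIF operand is (condition, target) and
    a SETA operand is a function of the SET-symbol table `env`.
    """
    # Sequence symbols (labels starting with a period) map to line indexes.
    targets = {lab: i for i, (lab, _, _) in enumerate(body) if lab.startswith(".")}
    out, pc = [], 0
    while pc < len(body):
        label, op, operand = body[pc]
        pc += 1
        if op == "AIF":                       # conditional assembly-time goto
            condition, target = operand
            if condition(env):
                pc = targets[target]
        elif op == "AGO":                     # unconditional assembly-time goto
            pc = targets[operand]
        elif op == "ANOP":                    # does nothing; only carries a label
            pass
        elif op == "SETA":                    # the label field names the SET symbol
            env[label] = operand(env)
        else:
            # An ordinary line is emitted without its sequence symbol, if any.
            kept = "" if label.startswith(".") else label
            out.append(" ".join(part for part in (kept, op, operand) if part))
    return out


body = [
    ("&I", "SETA", lambda env: 0),
    (".LOOP", "AIF", (lambda env: env["&I"] >= 3, ".DONE")),
    ("&I", "SETA", lambda env: env["&I"] + 1),
    ("", "AD", "5,X"),
    ("", "AGO", ".LOOP"),
    (".DONE", "ANOP", None),
]
lines = expand(body, {})
print(lines)
```

The loop emits the AD line three times; none of the sequence symbols or conditional assembly directives reaches the output.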
SET symbols. They can be of three types: A (arithmetic), B (boolean), or C (character). An A type SET
symbol has an integer value. B type have boolean values of true/false or 1/0. The value of a C type SET
symbol is a string.
Any SET symbol has to be declared as one of the three types and its type cannot be changed. Thus:
LCLA &A,&B
LCLB &C,&D
LCLC &E,&F
declare the six symbols as local SET symbols of the appropriate types. A local SET symbol is only known
inside the macro in which it is declared (or inside a control section, but those will not be discussed here).
There are also three directives to assign values to the different types of SET symbols.
&A SETA 1
&A SETA &A+1
&B SETA 1+(B‘1011’*X‘FF1’-15)/&A-N‘SYSLIST(&A) where B‘1011’ is a binary constant,
X‘FF1’ is a hex constant, and N’ is the number attribute
(the number of components) of the second argument of the current
expansion (the second, because &A=2).
&C SETB (&A LT 5)
&D SETB 1 means ‘true’
&E SETC ‘’ the null string
&E SETC ‘12’ the string ‘12’
&F SETC ‘34’
&F SETC ‘0&E&F.5’ the string ‘012345’. The dot separates the value of &F from the 5
&E SETC ‘ABCDEF’(2,3) the string ‘BCD’. Substring notation is allowed.
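The two less obvious SETC results above (concatenation and substring notation) can be checked with a short Python sketch. The helper `substring` is hypothetical; it only mirrors the 1-based (start, length) notation.

```python
def substring(s, start, length):
    """Substring in the 1-based (start, length) notation used above."""
    return s[start - 1:start - 1 + length]


e, f = "12", "34"               # &E SETC '12' and &F SETC '34'
f = "0" + e + f + "5"           # &F SETC '0&E&F.5'
e = substring("ABCDEF", 2, 3)   # &E SETC 'ABCDEF'(2,3)
print(f, e)
```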
The three directives GBLA, GBLB, GBLC declare global SET symbols. This feature is very similar to the COMMON
statement in Fortran. Once a symbol has been declared global in one macro definition, it can be declared
global in other macro definitions that follow, and all those declarations will use the same symbol.
N1 MACRO
GBLA &S
&S SETA 1
AR &S,2
N2
ENDM
N2 MACRO
LCLA &S
&S SETA 2
SR &S,2 the local symbol is used
N3
ENDM
N3 MACRO
GBLA &S
CR &S,2 the global symbol is used
ENDM
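The difference between the global and the local &S in macros N1, N2, and N3 can be imitated in Python. This is a rough analogy only: one shared dictionary plays the role of the global SET symbols, and a dictionary created inside a function plays the role of the local ones.

```python
global_symbols = {}                       # shared by every macro that says GBLA &S


def n1(out):
    global_symbols["&S"] = 1              # GBLA &S, then &S SETA 1
    out.append(f"AR {global_symbols['&S']},2")
    n2(out)


def n2(out):
    local_symbols = {"&S": 2}             # LCLA &S, then &S SETA 2
    out.append(f"SR {local_symbols['&S']},2")   # the local symbol is used
    n3(out)


def n3(out):
    out.append(f"CR {global_symbols['&S']},2")  # the global symbol is used


generated = []
n1(generated)
print(generated)
```

The local &S of N2 hides the global one only inside N2; N3, which declares &S global again, sees the value set by N1.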
This macro uses the conditional assembly directives to loop and generate several AD (Add Double) instruc-
tions. The number of instructions generated equals the number of components (the N attribute) of the third
argument.
A more sophisticated version of this macro lets the user specify, in argument &REG, which register to
use. If argument &REG is omitted (its T attribute equals O) the macro selects register 5.
(We ignore any pointers.) Macro Y will not be recognized as a macro but will be stored in the MDT as part
of the definition of macro X.
Another method for matching MACRO-ENDM pairs while reading-in a macro definition is to require each
ENDM to contain the name of the macro it terminates. Thus the above example should be written:
X MACRO
MULT
Y MACRO
ADD
JMP
ENDM Y
DIV
ENDM X
This requires more work on the part of the programmer but, on the other hand, makes the program easier
to read.
Exercise 11.16: In the case where both macros X, Y end at the same point
X MACRO
-
-
Y MACRO
-
-
ENDM Y
ENDM X
Each time Y is expanded, the assembler should, of course, expand the most recent definition of Y
(otherwise nested macro definition would be a completely useless feature). Expanding the most recent
definition of Y is simple. All that the assembler has to do is to search the MDT in reverse order; start
from the new macros and continue with the older ones. This feature of backward search has already been
mentioned, in connection with macros that redefine existing instructions.
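The backward search amounts to very little code. The following sketch assumes, purely for illustration, that the MDT is a list of (name, body) entries in order of definition; the function name `find_macro` is hypothetical.

```python
def find_macro(mdt, name):
    """Return the body of the most recently defined macro called `name`."""
    for entry_name, body in reversed(mdt):    # newest definitions first
        if entry_name == name:
            return body
    return None                               # not a macro name


# Y is defined twice; the second definition is the more recent one.
mdt = [("Y", ["ADD", "JMP"]), ("X", ["MULT", "DIV"]), ("Y", ["SUB"])]
print(find_macro(mdt, "Y"))
```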
The above example can be written, with the inclusion of parameters, and with some slight changes, to
make it more realistic.
X MACRO A,B,C,D
MULT A
Y MACRO C
C
ADD DIR
JMP B
ENDM Y
DIV C
Y D
ENDM X
The body of X now contains a definition of Y and also an expansion of it. An expansion of X will generate:
A MULT instruction.
A definition of Y in the MDT.
A DIV instruction.
An expansion of Y, consisting of three lines, the last two of which are ADD, JMP.
The expansion X SEC,TOR,DIC,BPL will generate:
The only thing that may be a surprise in this example is the fact that macro Y is stored in the MDT
without C being substituted. In other words, Y is defined as Y MACRO C and not as Y MACRO DIC. The rule
in such a case is the same as in a block structured language. Parameters of the outer macro are global and
are known inside the inner macro unless they are redefined by that macro. Thus parameter B is replaced by
its value TOR when Y is defined, but parameter C is not replaced by DIC.
Since we are interested in how things are done by the assembler, the implementation of this feature will
be discussed in detail. In fact, we will describe in detail two ways to implement nested macro definitions.
One is the traditional way, described below. The other, due to G. Revesz, is more recent and more elegant;
it is described in Section 11.9.2.
11.9.1 The Traditional Method
Normally, when a macro definition is entered into the MDT, each parameter is replaced with a serial number
#1, #2, . . . To support nested macro definition, the assembler replaces each parameter, not with a single
serial number, but with a pair of numbers (definition level, serial number). To determine those pairs, a stack,
called the macro definition stack (MDS), is used.
When the assembler starts pass 0, it clears the stack and initializes a special counter (the Dlevel counter
mentioned earlier) to 0. Every time the assembler encounters a MACRO line, it increments the level counter
by 1 and pushes the names of the parameters of that level into the stack, each with a pair (level counter,
i ) attached, where i is the serial number of the parameter. The assembler then starts copying the definition
into the MDT, comparing every token on every line with the stack entries (starting with the most recent
stack entry). If a token in one of the macro lines matches a stack entry, the assembler considers it to be a
parameter (of the current level or any outer one). It fetches the pair (l,i ) from the stack entry that matched,
and stores #(l,i ) in the MDT instead of storing the token itself. If the token does not match any stack entry,
it is considered a stand-alone token and is copied into the MDT as part of the source line.
When an ENDM is encountered, the stack entries for the current level are popped out and Dlevel is
decremented by 1. After the last ENDM in a nested definition is encountered, the stack is left empty and
Dlevel should be 0.
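The bookkeeping just described can be sketched in Python. This is a minimal model under stated assumptions, not the assembler's actual code: tokens are taken to be identifiers, the MDT is a plain list of lines, the outermost MACRO and ENDM lines are not stored, and the parameters on an inner MACRO line are themselves replaced by pairs. The names `substitute_pairs` and `store_definition` are hypothetical, and the input is the X/Y macro shown earlier.

```python
import re


def substitute_pairs(line, mds):
    """Replace every token that names a parameter with its #(level,serial) pair."""
    def repl(match):
        token = match.group(0)
        for name, level, serial in reversed(mds):    # most recent entry first
            if name == token:
                return f"#({level},{serial})"
        return token                                  # stand-alone token
    return re.sub(r"[A-Za-z_]\w*", repl, line)


def store_definition(lines):
    """Return the MDT body of the outermost macro defined by `lines`."""
    mds, mdt, dlevel = [], [], 0                      # stack, MDT lines, Dlevel
    for line in lines:
        fields = line.split()
        if "MACRO" in fields:
            dlevel += 1
            k = fields.index("MACRO")
            params = fields[k + 1].split(",") if len(fields) > k + 1 else []
            for serial, name in enumerate(params, start=1):
                mds.append((name, dlevel, serial))
            if dlevel > 1:                            # inner MACRO line is body text
                mdt.append(substitute_pairs(line, mds))
        elif fields and fields[0] == "ENDM":
            if dlevel > 1:                            # inner ENDM is body text
                mdt.append(line)
            while mds and mds[-1][1] == dlevel:
                mds.pop()                             # pop parameters of this level
            dlevel -= 1
        else:
            mdt.append(substitute_pairs(line, mds))
    if dlevel != 0 or mds:
        raise ValueError("unmatched MACRO/ENDM")
    return mdt


source = [
    "X MACRO A,B,C,D",
    "MULT A",
    "Y MACRO C",
    "C",
    "ADD DIR",
    "JMP B",
    "ENDM Y",
    "DIV C",
    "Y D",
    "ENDM X",
]
for stored in store_definition(source):
    print(stored)
```

Inside Y, the token C is tagged #(2,1) because the search starts at the most recent stack entry, while B, which Y does not redefine, is tagged #(1,2).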
The example below shows three nested definitions and the contents of the MDT. It also shows the macro
definition stack when the third, innermost, definition is processed (the stack is at its maximum length at
this point).
The following points should be noted about this example:
Lines 3,5,8, and 10 in the MDT show that the assembler did not treat the inner macros Q, R as
independent ones. They are just part of the body of macro P.
On line 4, the #(2,1) in the MDT means parameter 1 (A) of level 2 (Q), while the #(1,3) means
parameter 3 (C) of level 1 (P).
On line 7, #(3,3) is parameter 3 (E) of level 3 (R) and not that of level 2 (Q). The H is not found in
the stack and is therefore considered a stand-alone symbol, not a parameter.
On line 11, the assembler is back to level 1 where none of the symbols is a parameter. The stack at
this point only contains the four bottom lines, and symbols E,F,G,H are all considered stand-alone.
Figure 11.6 is a flow chart (a generalized subset of Figure 11.2) summarizing the operations described
earlier.
11.9.2 Revesz’s Method
There is another, newer and more elegant method—due to G. Revesz—for implementing nested macro
definitions. It uses a single serial number—instead of a pair—to tag each macro parameter, and also has the
advantage that it allows for an easy implementation of nested macro expansions and nested macro definitions
within a macro expansion.
Macro definition stack (contents while the innermost definition is being processed):
G (3,4) top
E (3,3)
C (3,2)
A (3,1)
F (2,4)
E (2,3)
B (2,2)
A (2,1)
D (1,4)
C (1,3)
B (1,2)
A (1,1) bottom
Figure 11.6. The classical method for nested macro definitions (part I). [Flow chart not reproduced.]
The method is based on the following observation: When a macro A is being defined, we are only
concerned with the definition of A (the level 1 definition) and not with any inner, nested definitions, since A
is the only one that is stored in the MDT as a separate macro. In such a case why be concerned about the
parameters of all the inner, higher level, nested definitions inside A? Any inner definitions are only handled
when A is expanded, so why not determine their actual arguments at that point?
This is an example of an algorithm where laziness pays. We put off any work to the latest possible point
in time, and the result is simple, elegant, and correct.
Figure 11.6. The classical method for nested macro definitions (part II). [Flow chart not reproduced. On a
match, the pair (l,i) associated with the matched name is fetched and stored in the MDT instead of the
token; otherwise the token itself is stored in the MDT.]
The method uses the two level counters Dlevel and Elevel as before. There are also three stacks, one for
formal parameters (P), the second for actual arguments (A), and the third, (MES), for the nested expansion
pointers. A non-empty formal parameter stack implies that the assembler is in the definition mode (D),
and a non-empty argument stack, that it is in the expansion mode (E). By examining the state (empty/non-
empty) of those two stacks, we can easily tell in which of the 4 modes the assembler currently is.
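The mode test can be written as a small function. The sketch assumes the mode numbering used in the steps that follow (1 normal, 2 definition, 3 expansion, 4 definition inside an expansion); the function name `mode` and the use of Python lists for the stacks are hypothetical.

```python
def mode(p_stack, a_stack):
    """Return the mode number (1 to 4) implied by the state of stacks P and A."""
    defining = bool(p_stack)      # non-empty P: definition mode (D)
    expanding = bool(a_stack)     # non-empty A: expansion mode (E)
    if defining and expanding:
        return 4                  # definition nested inside an expansion (DE)
    if expanding:
        return 3                  # expansion (E)
    if defining:
        return 2                  # definition (D)
    return 1                      # normal (N)


print(mode([], []), mode(["A"], []), mode([], ["X"]), mode(["A"], ["X"]))
```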
Following are the details of the method in the case of a nested definition. When the assembler is in
mode 1 or 3, and it finds a MACRO line, it switches to mode 2 or 4 respectively, where it:
1. Stores the names of the parameters in stack P, each with a serial number attached. Since stack P should
be originally empty, this is level 1 in P.
2. Opens up an area in the MDT for the new macro.
3. Brings in source lines to be stored in the MDT, as part of the new definition. Each line is first checked
to see if it is a MACRO or a MEND.
4. If the current source line is MACRO, the line itself goes into the MDT and the assembler enters the names
of the parameters of the inner macro in stack P as a new, higher level. Just the names are entered, with
no serial numbers.
This, again, is an important principle of the method. It distinguishes between the parameters of level
1 and those of the higher levels, but it does not distinguish between the parameters of the different higher
levels. Again the idea is that, at macro definition time we only handle the level 1, outer macro, so why
bother to resolve parameter conflicts on higher levels at this time. Such conflicts will be resolved by the
same method anytime an inner macro is defined (becomes level 1).
5. If the current line is a MEND, the line itself again goes into the MDT and the assembler removes the
highest level of parameters from P. Thus after finding the last MEND, stack P should be empty again,
signifying a non-macro definition mode. The assembler should then switch back to the original mode
(either 1 or 3).
[Flow chart not reproduced. It shows the main loop of pass 0: initialize Dlevel:=0, Elevel:=0, mode:=N;
read the next line either from the source file or, using the pointer at the top of the MES, from the MDT;
test the line for MACRO, ENDM, or a macro name; stop at end of file.]
6. If the source line is neither of the above, it is scanned, token by token, to determine what tokens are
parameters. Each token is compared with all elements on stack P, from top to bottom. There are three
cases:
a: No match. The token is not a parameter of any macro and is not replaced by anything.
b: The token matches a name in P that does not have a serial number attached. The token is thus
a parameter of an inner macro, and can be ignored for now. We simply leave it alone, knowing
that it will be replaced by some serial number when the inner macro is eventually defined (that will
happen when some outer macro is eventually expanded).
c: The token matches a name in P that has a serial number attached. The token is thus a parameter
of the currently defined macro (the level 1 macro), and is replaced by the serial number.
After comparing all the tokens of a source line, the line is stored in the MDT. It is a part of the currently
defined macro.
The important point is that the formal parameters are replaced by serial numbers only for level 1, i.e.,
only for the outermost macro definition. For all other nested definitions, only the names of the parameters
are placed on the stack, reflecting the fact that only level 1 is processed in macro definition mode. That
level ends up being stored in the MDT with each of its parameters being replaced by a serial number.
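Steps 1 through 6 can be sketched in Python as follows. This is a simplified model, not Revesz's published algorithm verbatim: stack P is a list of dictionaries (one per level), tokens are taken to be identifiers, both ENDM and MEND are accepted as terminators, and the outermost MACRO line and its terminator are not copied into the stored body. The names `replace_level1` and `define_outermost` are hypothetical, and the input is the X/Y macro shown earlier.

```python
import re


def replace_level1(line, p_stack):
    """Apply cases a, b, and c of step 6 to every token of one source line."""
    def repl(match):
        token = match.group(0)
        for level in reversed(p_stack):           # top of stack P first
            if token in level:
                serial = level[token]
                # case b: inner parameter, left alone; case c: level 1, replaced
                return token if serial is None else f"#{serial}"
        return token                              # case a: not a parameter
    return re.sub(r"[A-Za-z_]\w*", repl, line)


def define_outermost(lines):
    """Return the body stored in the MDT for the outermost (level 1) macro."""
    p_stack, body = [], []
    for line in lines:
        fields = line.split()
        if "MACRO" in fields:
            k = fields.index("MACRO")
            names = fields[k + 1].split(",") if len(fields) > k + 1 else []
            if not p_stack:                       # level 1: names with serial numbers
                p_stack.append({n: i for i, n in enumerate(names, start=1)})
            else:                                 # higher level: names only
                p_stack.append({n: None for n in names})
                body.append(line)                 # the line itself goes into the MDT
        elif fields and fields[0] in ("ENDM", "MEND"):
            p_stack.pop()                         # remove the highest level from P
            if p_stack:
                body.append(line)
        else:
            body.append(replace_level1(line, p_stack))
    return body


source = [
    "X MACRO A,B,C,D",
    "MULT A",
    "Y MACRO C",
    "C",
    "ADD DIR",
    "JMP B",
    "ENDM Y",
    "DIV C",
    "Y D",
    "ENDM X",
]
for stored in define_outermost(source):
    print(stored)
```

The token C inside Y stays untouched, since it matches a name without a serial number, while B is replaced by #2. The inner definition is resolved later, when an expansion of X causes Y to be defined as a level 1 macro.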
On recognizing the name of a macro, which can happen either in mode 1 (normal) or 3 (expansion), the
assembler enters mode 3, where it:
7. Loads stack A with a new level containing the actual arguments. If the new macro has no arguments,
Figure 11.7. Revesz’s method for nested macro definitions (part II). [Flow chart not reproduced. It covers
the MACRO and ENDM cases: for a new macro, space is allocated in the MDT and each parameter is pushed
into P with #i attached; for an inner MACRO, all parameters are pushed into stack P without #i; an ENDM
pops the current level of parameters from P; an unmatched ENDM is an error.]
Figure 11.7. Revesz’s method for nested macro definitions (part III). [Flow chart not reproduced. On a
macro name, the actual arguments are pushed, with #i attached, into stack A, and a new pointer to the
first line of the macro to be expanded is pushed into the MES. During expansion, each token #i is looked
up in stack A, top to bottom, and replaced with the matching argument. During definition, stack P is
searched, top to bottom, for a pair (token,#i); if one is found, the token is replaced with #i.]
h: Only one is empty. This should never happen and can only result from a bug in the assembler
itself. The assembler should issue a message like ‘impossible error, inform a systems programmer’.
i: Both are empty. This is the end of the expansion. The assembler switches to mode 1.
The following example illustrates a macro L1 whose definition is nested by L2 which, in turn, is nested
by a level 3 macro, L3.
Figure 11.7. Revesz’s method for nested macro definitions (part IV). [Flow chart not reproduced. It covers
the end of a definition (copy into the current definition in the MDT, Dlevel:=Dlevel-1) and the end of an
expansion (pop the current level of arguments from stack A, pop the top of the MES, Elevel:=Elevel-1),
with an error reported for an illegal line read in normal mode.]
Stack contents (top of stack first) after each of the following lines:
after line:   1      4      7      10     13     16
              D #4   C      G      C      D #4   empty
              C #3   D      C      D      C #3
              B #2   F      E      E      B #2
              A #1   E      A      F      A #1
                     D #4   C      D #4
                     C #3   D      C #3
                     B #2   F      B #2
                     A #1   E      A #1
                            D #4
                            C #3
                            B #2
                            A #1
2. Lines 4–13 become the 8-line definition of L2 in the MDT as shown earlier. Later, when L2 is expanded,
with arguments W,X,Y,Z, it results in:
1. Lines 5,6,11,12 written on the new source file as:
MOV W,Z CMP M,X ADD M,Y SUB Z,M.
2. Lines 7–10 stored in the MDT as the 2-line definition of L3. From that point on, L3 can also be
expanded.
Figure 11.7 is a summary of this method, showing how to handle nested macro definitions and expansions,
as well as definition mode nested inside expansion mode.
11.9.3 A Note
The TEX typesetting system, mentioned earlier, supports a powerful macro facility that also allows for nested
macro definitions. It is worth mentioning here because it uses an unconventional notation. To define a macro
a with two parameters, the user should say ‘\def\a#1#2{...#1...#2...}’. Each #p refers to one of the
parameters. To nest the definition of macro b inside a, a double ## notation is used. Thus in the definition
‘\def\a#1{..#1..\def\b##1{..##1..#1}} ’ the notation ##1 refers to the parameter of b, and #1, to that
of macro a.
The rule is that each pair ## is reduced to one #. This way, macro definitions can be nested to any
depth.
Example:
\def\a#1{‘#1’\def\b##1{[#1,##1] \def\x####1{(#1,##1,####1)}\x X}\b Y}
The definition of macro a consists of:
Defining macro b.
Expanding x with the argument X.
Macro x simply prints all three arguments, a’s, b’s and its own, in parentheses.
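The rule that each ## is reduced to one # when the outer macro is expanded can be imitated with a one-line substitution. This Python sketch handles a single parameter only and ignores everything else TeX does during expansion; the function name `expand_body` is hypothetical.

```python
import re


def expand_body(body, arg):
    """Substitute `arg` for #1 and reduce each ## to a single #."""
    return re.sub(r"##|#1", lambda m: "#" if m.group(0) == "##" else arg, body)


# The body of \a after its parameter #1 has been replaced by the argument A.
inner = expand_body(r"\def\b##1{[#1,##1]}", "A")
print(inner)
```

After one expansion, the ##1 of the inner macro has become an ordinary #1, ready to serve as the parameter of b, while the #1 of the outer macro has already been replaced by the argument.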
error. Example :
BEGIN EX
A MACRO P,Q
.
.
ENDM
M MACRO C,II
.
.
ENDM
X DS 12 an array declaration
N MACRO
.
.
ENDM
.
.
The definition of macro N will be flagged, since it occurs too late in the source.
With this separation of macro definition and expansion, pass 0 is easy to implement. In the normal
mode such an assembler simply copies the source file into the new source. Macro definitions only affect the
MDT, while macro expansions are written onto the new source file.
Such an assembler, however, cannot support nested macro definitions, since those generate new macro
definitions too late in pass 0 (when the outer macro is expanded).
An assembler that supports nested macro definitions cannot impose the restriction above, and can only
require the user to define each macro before it is expanded.
Having a separate pass 0 simplifies pass 1. The flow charts in this chapter show that pass 0 is at least as
complicated as pass 1, so keeping them separate reduces the total size and the complexity of the assembler.
Combining pass 0 and pass 1, however, has the following advantages:
No new source file needed, saving both file space and assembly time.
Procedures—such as input procedures or procedures for checking the type of a symbol—that are
needed in both passes, only need to be implemented once.
All pass 1 facilities (such as EQU and other directives) are available during macro expansion. This
simplifies the handling of conditional assembly.
. . . Then there are young men who dance around and get paid by the
women, they’re called ‘macros’ and aren’t much use to anyone . . .
—Noël Coward (1939), To Step Aside
A
History of Computers
This short chapter presents an outline of the major developments in the history of computers, concentrating
on topics that are not well known. Because of the importance and popularity of computers, there are many
books on this topic ([Augarten 84], [Burks 88], [Goldstine 72], [Randell 82], and [Singh 99] to cite a few
examples), and historical resources on the web (see [IowaState] or search under computers and history).
A.1 A Brief History of Computers
The old history of computing machines starts with the French mathematician Blaise Pascal, although some
sources, such as [Shallit 00] mention much older devices for computations. Strictly speaking, the first
mechanical calculator was built by Wilhelm Schickard, a professor at the university of Tübingen in Germany.
His machine, however, remained practically unknown in his time and did not have any influence on subsequent
historical developments. This is why today we consider Pascal the father of calculating machines. In 1642,
at age nineteen, he designed and built an 8-digit, 2-register mechanical calculator (termed the Pascaline)
that could add and subtract. His calculator had two sets of wheels, one to enter the next number and the
other, to hold the sum/difference. He was the first to realize that decimal digits could be arranged on wheels
in such a way that each wheel—when passing from 9 to 0—would drag its neighbor on the left one-tenth of
a revolution, thereby propagating the carry mechanically. This method of carry propagation has remained
the basis of all the mechanical and electromechanical calculators up to the present.
It is interesting to note that Pascal was also the father of the modern science of probability and also
the inventor of the game of roulette. He also created, in his last year, the first public transportation system,
an omnibus service in Paris. Legend has it that he developed his calculator after his father, a provincial tax
collector in Rouen, asked him to help with that year’s tax rolls.
The next step occurred thirty years later, when Gottfried Wilhelm von Leibniz improved on Pascal’s
design by inventing the stepped wheel, which made it possible to do multiplication and division, as well as
addition and subtraction. His machine had four registers, the two additional ones for the multiplier and the
multiplicand. It used chains and pulleys, and is described in [Smith 29]. Surprisingly, his invention found
little use in his time.
In the nineteenth century, the British mathematician and philosopher Charles Babbage devoted the
better part of his life to the development of calculating devices (engines). Babbage, who was also the inventor
of the speedometer and the cowcatcher, and who was also the first to develop reliable life expectancy tables,
kept designing better calculating engines. His first important design, the difference engine (1823), used the
method of finite differences for calculating a large number of useful functions. His main project, however, was
the analytical engine (1834), which was to be a general-purpose calculating machine, similar to present day
computers. It had data input (from punched cards), automatic sequencing (a program on punched cards), a
control unit, a memory unit (consisting of 1000 words of 50 decimal digits each, equivalent to about 170.9K
bits), and an ALU (mill) that could perform arithmetic and logic operations. A number could be read from
memory, sent to the mill, operated on, and stored back in memory. An important feature was the program
store, a memory unit where the program was stored in the form of machine instructions with opcodes and
addresses, similar to those found in modern computers. The idea of using punched cards was adopted from
the Jacquard loom, which used them to control the weaving of a pattern. For more information on this
original and complex design see [Augarten 84], pages 63–64. A detailed biography of Babbage and his work
is [Hyman 82].
Babbage ended up spending most of his $100,000 inheritance, and another $1500 in government grants,
without completing any machines. It is commonly believed today that his engines could not be built because
of the state of the art of mechanical engineering at his time. We now know, however, that during Babbage’s
lifetime, some of his designs had been implemented by George and Edvard Scheutz, a Swedish father and
son team. This implies that lack of technology was just one of Babbage’s problems. His main problem was
that he lost interest in any work in progress as soon as he came up with a new, more powerful design. He
was a perfectionist and a difficult person to work with.
Seventy years after Babbage, in the 1930s and 1940s, there were three roughly parallel efforts, directed
by Konrad Zuse, Alan Turing, and Howard Aiken, to build larger calculating machines.
In Berlin, the young German engineer Konrad Zuse, motivated by what he called “those awful calcula-
tions required of civil engineers,” built three electromechanical calculating machines that worked successfully.
The Z-1, his first machine, was completely mechanical. It was built from erector parts in his parents’ living
room and was used mainly to help him do stress analysis calculations.
The Z-2, Zuse’s second attempt (again at home), used electromechanical relays (the type used in tele-
phones in his day) and could solve simultaneous algebraic equations. It worked as planned, and it encouraged
his employers to help him build the Z-3.
The Z-3, completed in late 1941, was electromechanical, consisted of 2600 relays, and contained most of
the components of a modern, general-purpose digital computer. It had a 64 × 22 memory and an ALU that
could perform the four arithmetic operations. It was slow, taking 3–5 seconds for a single multiplication. The
program was read from punched tape. Zuse received a German patent (#Z23624, filed April 11, 1936) on the
Z-3. It seems that this machine deserves the title “the world’s first operational, programmable computer.”
Encouraged by the performance of the Z-3, Zuse started on the Z-4, and founded a company. The
end of the war, however, put a stop to his efforts. It is also interesting to note that Zuse was the first to
develop a higher-level programming language, the Plankalkül. [Roja 00] is an article on Zuse and his work. A
translation of Zuse’s own article appears in [Randell 82], pages 159–166. Arnold Fast, a blind mathematician
hired by Zuse in 1941 to program the Z-3, probably deserves the title “the world’s first programmer of an
operational programmable computer.”
Most of Zuse’s machines were destroyed during the war, and information about his work did not reach
the world until the mid-1960s. It seems that Zuse was, for a while, way ahead of other computer developers
but, perhaps fortunately for the world, the German government did not understand his work, gave him little
support, and even drafted him as a simple soldier in World War II. Luckily for him, his boss at the Henschel
aircraft manufacturing company needed him, and secured his discharge.
In Bletchley Park, a manor house in Buckinghamshire, England, a team that included Alan Turing spent
the war years trying to break the Enigma, the German code used before and during the war (see [Singh 99]
for the Enigma code and its cracking).
Part of the British effort to crack the Enigma code involved the construction, in 1940, of machines—
primitive by today’s standards—called ‘bombes.’ These were electromechanical devices that could perform
just one type of computation. (Another name for the bombes was the “Robinson machine,” after a popular
cartoonist who drew Rube Goldberg machines.) It seems that the bombes deserve the title “the world’s
first operational computer” since Babbage’s machines were never completed. Later in the war, the British
discovered that the Germans used another cipher, dubbed the Lorenz, that was far more complicated than
the Enigma. Breaking the Lorenz code required a machine much more sophisticated than the bombes,
a machine that could perform statistical analysis, data searching, and string matching, and that could
easily be reconfigured to perform different operations when necessary. Max Newman, one of the Bletchley
mathematicians, came up with a design for such a machine, but his superiors were convinced that constructing
it, especially during the war, was beyond their capabilities.
Fortunately, an engineer by the name of Tommy Flowers heard about the idea and believed that it was
workable. He worked for the British post office in North London, where he managed to convert Newman’s
design into a working machine in ten months. He finished the construction and delivered his machine
to Bletchley Park in December 1943. It was called Colossus, and it had two important features: it was
completely electronic, using 1500 vacuum tubes and no mechanical relays, and it was programmable. It used
paper tapes for input and output. Today, the Colossus is one of the few candidates for the title “the first
modern electronic computer,” but for a long time it was kept secret.
After the war, Colossus was dismantled, and its original blueprints destroyed. This is why for many
years, others were credited with the invention of the modern computer.
After the war, Turing designed the ACE computer, an ambitious project for that time. It was supposed
to be a large, stored-program machine, driven at a million steps per second. A scaled down version was
eventually completed in 1950.
Both Zuse and Turing, independently of each other, decided to use base-two (binary) numbers in their
machines, thus marking a departure from the traditional base-ten numbers, and opening the way to efficient
arithmetic circuits and algorithms.
At about the same time, Howard Aiken, a physicist at Harvard University, approached IBM with a
proposal to build a large electromechanical calculator. The result was the Mark I computer, completed in
January, 1943. It used mechanical relays and sounded, as an observer once noted, “like a roomful of ladies
knitting.” The Mark I lasted 16 years, during which it was used mostly for calculating mathematical tables.
Like all early computers, the Mark I had a very small internal memory and had to read its instructions
from paper tape. It seems that the Mark I deserves the title “the first programmable computer built by an
American.” See [Augarten 84], pages 103–107, for more information on this machine.
An independent development was the Atanasoff-Berry computer (ABC), developed during 1937–42 in
Ames, Iowa, USA, at Iowa State University (then Iowa State College), by John V. Atanasoff, with the help
of graduate student Clifford Berry. Today it seems that this machine is the best candidate for the title “the
world’s first electronic computer” (however, it was nonprogrammable, see [Burks 88], [IowaState 00], and
[Mollenhoff 88]).
Atanasoff and Berry later founded the Atanasoff-Berry Computer company, that pioneered several
innovations in computing, including logic devices, parallel processing, regenerative memory, and a separation
of memory and computing functions.
On October 19, 1973, US Federal Judge Earl R. Larson made, following a lengthy trial, a decision
that declared the ENIAC patent of Mauchly and Eckert invalid and named Atanasoff the inventor of the
electronic digital computer. Details about this historically important machine are available in [Mollenhoff 88]
and [Burks 88].
In recognition of his achievement, Atanasoff was awarded the National Medal of Technology by President
George Bush at the White House on November 13, 1990.
“I have always taken the position that there is enough credit for ev-
eryone in the invention and development of the electronic computer.”
—John Vincent Atanasoff
342 A. History of Computers
Table A.1 summarizes the most important developments in calculating devices that were either mechan-
ical, electromechanical, or electronic analog.
computer. Parts were custom made for this machine and procedures were developed specifically for it.
Working with ENIAC was a slow and tedious process. A mathematical problem had to be understood and
its solution written in a way that allowed rewiring of the machine to solve the problem. Obviously, such a
process was worth the effort only for the problems that interested the sponsors (i.e., the army).
The modern approach to the design and programming of computers was developed at the Institute for
Advanced Study in Princeton by the mathematician John von Neumann and his collaborators. Their main
ideas were summarized in the now famous report that was published in 1945 [Burks et al. 45]. In that report
they called their design “The IAS computer” (IAS is the Institute for Advanced Study, in Princeton, New
Jersey). To the world in general their design is known as the “von Neumann machine.” The first machine
built according to this design was the EDVAC (Electronic Discrete Variable Automatic Computer), completed in 1952.
It contained 4000 tubes and 10000 diodes, and remained operative until 1962.
Von Neumann and his collaborators realized that controlling a computer can and should be done in-
ternally. They were also the first to understand that a computer performed logical operations and that the
arithmetic aspects of the computer were secondary to the logical ones. In 1944 this way of thinking was a
major step forward and opened up the way to the design of the modern digital computer. The main contri-
bution of this small group to the field of computer design is the idea of the stored program. The program is
stored inside the computer, in the same memory with the data, and the computer uses logic circuits—of the
same type used to perform arithmetic operations—to fetch instructions from memory, decode and execute
them.
To summarize, the main idea in a von Neumann machine is to have a storage unit (the memory) where
both the program and the data are stored. Instructions are fetched from memory one by one and are
executed by the control unit. Data items are also fetched from or written into the same memory. This idea
has proved so powerful and useful that even today most computers are von Neumann machines and relatively
few computers operate according to different principles.
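The stored-program idea can be illustrated with a toy simulator. The following is only a sketch: the four opcodes, the (opcode, address) instruction format, and the memory layout are invented for this illustration and are not the instruction set of the IAS computer or the EDVAC. The point is that the program and its data sit in the same memory, and a simple loop fetches, decodes, and executes one instruction at a time.

```python
# Toy stored-program machine: instructions and data live in the same memory.
# The opcodes and the (opcode, address) format are hypothetical.
LOAD, ADD, STORE, HALT = "LOAD", "ADD", "STORE", "HALT"

def run(memory):
    """Fetch, decode and execute instructions until HALT; return the memory."""
    pc = 0    # program counter: address of the next instruction
    acc = 0   # accumulator: the single working register
    while True:
        opcode, address = memory[pc]   # fetch
        pc += 1
        if opcode == LOAD:             # decode and execute
            acc = memory[address]
        elif opcode == ADD:
            acc += memory[address]
        elif opcode == STORE:
            memory[address] = acc
        elif opcode == HALT:
            return memory
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

# Addresses 0-3 hold the program, addresses 4-6 hold the data.
memory = [
    (LOAD, 4),    # acc <- memory[4]
    (ADD, 5),     # acc <- acc + memory[5]
    (STORE, 6),   # memory[6] <- acc
    (HALT, 0),
    2,
    3,
    0,
]
result = run(memory)
```

After the run, location 6 holds the sum of locations 4 and 5. Because instructions are ordinary memory contents, a program could in principle also overwrite them, which is impossible on a machine that is programmed by rewiring.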
Two good references on the early history of computers and computing machines are [Goldstine 72] and
[Randall 82]. In addition, the Computer Science Encyclopedia [Ralston 85] contains several articles on the
history of computers and computing machines, and on the main participants.
In 1947 Eckert and Mauchly formed the Eckert-Mauchly Computer Corporation. They started making
the UNIVAC line of computers and were subsequently acquired by the Sperry-Rand Corporation to become
its UNIVAC division.
The IBM corporation, long a maker of punched card equipment and business machines, supported the
Mark I computer, the last of the electro-mechanical relay computers. After the success of the first UNIVAC
computers, IBM decided to enter the computer business and delivered its first commercial computer, the
701, in 1953.
Thus, the early 1950s saw two main computer makers, Sperry and IBM. Sperry developed its UNIVAC
line and IBM, its 700 series.
A.3 Second Generation: Transistors
The invention of the transistor in 1947 was the cause of the first major revolution in computer design.
Suddenly, it became possible to build computers that were reasonably sized, did not consume huge quantities
of electrical power, and were reliable. The first two manufacturers to take advantage of the new technology
were NCR and RCA. IBM came in late but had a number of very good entries in the form of the 7000 series
of computers. The 7000 series computers were the first ones with an instruction back-up register (to buffer
the next instruction), I/O channels, and a multiplexor. The multiplexor was the forerunner of today’s bus
controllers.
The invention of core memories, by J. Forrester in 1953, provided computers with new powers. Almost
overnight it became possible to build memories with hundreds, or even thousands, of words. Core memories
were faster, smaller, cheaper, and much more reliable than any of the older storage methods.
A.4 Third Generation: ICs
In the early 1950s, with just a few years’ experience with transistors, engineers had pinpointed their next
problem—interconnections. With transistors, it became possible to build dense circuits, containing many
components on a small circuit board. However, the individual components still had to be connected by wires,
which put a limit to the density of a circuit. The problem was solved in October, 1958 by Jack Kilby of
Texas Instruments, and in April 1959, by Robert Noyce, then at Fairchild Camera and Instrument. They
invented the integrated circuit, popularly known as the chip.
Computer manufacturers immediately realized the importance of this development and came up, in the
early 1960s, with computers made with ICs. The two most important early ones were the IBM/360 and the
PDP-8. The former has already been mentioned in connection with the definition of computer architecture.
The latter was the first true minicomputer, and a very successful one! It changed the way computers
were used. The PDP-8 was first used as a stand-alone inexpensive computer until people realized that it was
small enough to be incorporated into other devices and instruments. This turned out to be a very successful
concept and it started what is known today as the OEM market (Original Equipment Manufacturers). An
OEM is a manufacturer who buys a computer and includes it as part of a larger system or device.
A.5 Later Developments
There is no general agreement as to what constitutes the fourth or fifth generations of computers, or even
whether they exist at all. Many authors view the advent of VLSI as the start of the fourth computer
generation. Some experts suggest that advances in artificial intelligence will soon start the fifth generation
of very intelligent computers.
The term VLSI (Very Large Scale Integration) refers to chips with many components. To qualify as
VLSI, a chip should have a minimum of 10,000 or 100,000 components, depending on who you listen to.
Many people regard the microprocessor as the start of the fourth generation. It should be noted that a
microprocessor is not a computer on a chip. A microprocessor is a processor on a chip (see Section 1.1 for
the definition of the term processor) and is not a complete computer. There are relatively few computers on
a chip (single chip computers) but they are computationally weak and are mostly special-purpose.
The story of the microprocessor starts on November 15, 1971, when Intel, a large electronics manufacturer,
announced the 4004 microprocessor in an ad in Electronic News, a trade magazine. This microprocessor
consisted of about 2300 transistors, and was designed by Ted Hoff, an engineer with Intel. His problem
was to design a new calculator with as few parts as possible, and he realized that one microprocessor could
replace many of the parts traditionally used in a calculator. The 4004 was an immediate success, so a year
later, Intel introduced the 8008, the first 8-bit microprocessor. It was organized around a set of six 8-bit
registers, and a set of 45 instructions. It could address 16 KB of memory. The first competition came, also
in 1972, from Fairchild Semiconductor and from Rockwell, who came up with their own microprocessors.
In 1973, Intel announced the 8080 microprocessor, consisting of about 5000 transistors. It had 74
instructions and could address 64 KB of memory—a huge memory space at that time. The 8080 offered
about 10 times the performance of the 8008. In 1974, Motorola announced the 6800 microprocessor, a direct
competitor to the 8080. By the end of 1974, there were about 20 different microprocessors on the market.
The Altair computer was advertised in the January 1975 issue of Popular Electronics magazine. This was a
personal computer based on the 8080, which was sold as a kit. It became an immediate success, and it
encouraged other manufacturers to come up with other microprocessors and personal computers. By the
end of 1976, more than 55 microprocessors were available, including the Zilog Z-80, a powerful 8-bit machine
with a set of 158 instructions.
In 1978, Intel came up with the 8086 microprocessor, the first successful 16-bit microprocessor. It
consisted of 29,000 transistors, and could address 1 MB of memory. Immediate competitors were the Zilog
Z-8000, Motorola 68000 and National Semiconductor 16000. The first 32-bit microprocessors took longer to
develop. They appeared on the scene in 1985–86 and included the DEC VAX, the Motorola 68020, National
Semiconductor NS32032, and the Intel iAPX 432. The latter consisted of about 200,000 transistors, could
address a main memory of 16 MB, and could execute two million instructions/second.
In the 1990s, microprocessors such as the Intel Pentium and the Motorola PowerPC have been ap-
proaching the performance of supercomputers. If current trends continue, supercomputers may eventually
disappear, and only microprocessors (and networks of microprocessors) will be used in the future.
The reader should notice that this short history discusses just computers. It does not include details on
the development of programming languages, operating systems, and networks. The history of these topics is,
of course, well known and widely available. Section 3.27 discusses the Internet, Section 4.2 is a short history
of microprogramming, and Section 6.2 is a short history of RISC.
characters currently defined by ISO 10646. All other ISO 10646 code values are reserved for future expansion.
ISO 10646’s full codeset is called Universal Character Set, four octets form (UCS-4).
What Characters Does the Unicode Standard Include?
The Unicode Standard defines codes for characters used in every major language written today. Scripts
include Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya,
Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan, Japanese kana, the complete set of
modern Korean hangul, and a unified set of Chinese/Japanese/Korean (CJK) ideographs.
The Unicode Standard also includes punctuation marks, diacritics, mathematical symbols, technical
symbols, arrows, dingbats, etc. It provides codes for diacritics, which are modifying character marks such
as the tilde (~), that are used in conjunction with base characters to encode accented or vocalized letters
(ñ, for example). In all, the Unicode Standard provides codes for nearly 39,000 characters from the world’s
alphabets, ideograph sets, and symbol collections.
The Unicode Standard has 18,000 unused code values for future expansion. The standard also reserves
over 6,000 code numbers for private use, which software and hardware developers can assign internally for
their own characters and symbols.
Design Basis
To make it possible for a character set to successfully encode, process, and interpret text, the character
set must:
define the smallest useful elements of text to be encoded;
assign a unique code to each element; and,
provide basic rules for encoding and interpreting text so that programs can successfully read and
process text.
These requirements are the basis for the design of the Unicode Standard.
Defining Elements of Text
When spoken language is written, it is represented by textual elements that are used to create words
and sentences. These elements may be letters such as “w” or “M”; characters such as those used in Japanese
hiragana to represent syllables; or ideographs such as those used in Chinese to represent full words or
concepts.
Each writing system defines its own text elements, a definition that often changes depending on the
process handling the text. For example, in historic Spanish language sorting, “ll” counts as a single text
element. When Spanish words are typed, however, “ll” is two separate text elements: “l” and “l”.
To avoid deciding what is and is not a text element in different processes, the Unicode Standard breaks
up a writing system into code elements (commonly called “characters” when writing about the Unicode
Standard). A code element is a part of a writing system defined by the Unicode Standard to be fundamental
and useful for computer text processing. For the most part, code elements correspond to the most commonly
used text elements. For example, each upper- and lowercase letter in the English alphabet is a single code
element.
The Unicode Standard, in cases that lack universal agreement over what constitutes a text element,
defines a unique code element using the text element definition that is most useful for encoding text with
a computer. For example, instead of defining “ll” as a single code element for Spanish text, the Unicode
Standard defines each “l” as a separate code element. The task of combining two “l”s together for alphabetic
sorting is left to the software processing the text.
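The division of labor described above, where the encoding keeps each “l” separate and the sorting software decides what counts as a text element, can be sketched as follows. This is an illustrative sketch only: the alphabet table and the function names are invented here, only the “ll” digraph from the text is handled, and only unaccented lowercase letters plus “ñ” are supported.

```python
# Historic Spanish sorting treats "ll" as one letter that sorts after "l".
# The alphabet table and helper names below are hypothetical.
ALPHABET = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
            "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
            "y", "z"]
RANK = {element: i for i, element in enumerate(ALPHABET)}

def text_elements(word):
    """Split a lowercase word into sorting elements, pairing up 'll'."""
    elements = []
    i = 0
    while i < len(word):
        if word[i:i + 2] == "ll":
            elements.append("ll")
            i += 2
        else:
            elements.append(word[i])
            i += 1
    return elements

def traditional_key(word):
    """Sort key: the rank of each text element in the table above."""
    return [RANK[e] for e in text_elements(word.lower())]

words = ["luz", "llama", "lima", "loro"]
by_code_element = sorted(words)                       # each "l" compared separately
by_text_element = sorted(words, key=traditional_key)  # "ll" follows every other "l..."
```

Sorting by code elements places “llama” between “lima” and “loro”, while the text-element key moves it after “luz”. The stored text is identical in both cases; only the software's interpretation differs.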
The Unicode Standard defines a few codes for the presentation of text. Some control the direction of
text—left-to-right or right-to-left—for rare cases where text must change directions within a single run of
text.
The Unicode Standard defines explicit characters for line and paragraph endings. With ASCII, the Line
Feed and Carriage Return characters are often used ambiguously. The Unicode Standard eliminates that
ambiguity.
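As a small illustration, the explicit line and paragraph characters are U+2028 and U+2029. The sketch below uses Python's unicodedata module, which follows a much later version of the standard than the one described here; the sample text and variable names are made up for the example.

```python
import unicodedata

LINE_SEP = "\u2028"   # explicit end of line
PARA_SEP = "\u2029"   # explicit end of paragraph

text = "first line" + LINE_SEP + "second line" + PARA_SEP + "next paragraph"

# Split into paragraphs first, then into the lines of each paragraph.
paragraphs = [p.split(LINE_SEP) for p in text.split(PARA_SEP)]

# The character names state the intended meaning explicitly.
names = [unicodedata.name(c) for c in (LINE_SEP, PARA_SEP)]
```

Unlike Line Feed and Carriage Return, whose meaning depends on the platform, each of these two characters has a single role.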
Text Processing
Almost all computer text processes encode and interpret text as they process it. For example, consider
a word processor user typing text at a keyboard. The computer’s system software receives a message that
the user pressed a key combination for “T”, which it encodes as U+0054. The word processor stores the
number in memory, and also passes it on to the display software responsible for putting the character on the
screen. The display software, which may be a window manager or part of the word processor itself, uses the
number as an index to find an image of a “T”, which it draws on the monitor screen. The process continues
as the user types in more characters.
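The path from key press to screen can be sketched in a few lines. This is a simplified illustration, not how any particular system software works: the text buffer is a plain list, and the one-entry "font" table with its five-row bitmap is a hypothetical stand-in for real display software.

```python
# A key press is encoded as a code value, stored, and later used as an index
# into a glyph table.
code_value = ord("T")              # encode the character as a number
label = f"U+{code_value:04X}"      # conventional notation for the code value

document = []                      # the word processor's text buffer
document.append(code_value)        # store the number, not a picture

# Hypothetical bitmap font: code value -> rows of a glyph image.
font = {0x0054: ["#####", "  #  ", "  #  ", "  #  ", "  #  "]}
glyph = font[document[0]]          # display software: code value -> image
```

The buffer holds only the number 84 (hexadecimal 54). The image of the letter comes from the font, which can be replaced without touching the stored text.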
The Unicode Standard directly addresses only encoding and interpreting text and not any other actions
performed on the text. For example, the word processor may check the typist’s input after it has been
encoded to look for misspelled words, and then beep if it finds any. Or it may insert line breaks when it
counts a certain number of characters entered since the last line break. It is an important principle of the
Unicode Standard that it does not specify how to carry out these processes as long as the character encoding
and decoding is performed properly.
Interpreting Characters and Rendering Glyphs
The difference between interpreting a character (identifying a code value) and rendering it on screen or
paper is crucial to understanding the Unicode Standard’s role in text processing. The character identified
by a Unicode code value is an abstract entity, such as: “latin character capital a” or “bengali digit 5.” The
mark made on screen or paper—called a glyph—is a visual representation of the character.
The Unicode Standard does not define glyph images. It defines only how characters are interpreted,
not how glyphs are rendered. The software or
hardware-rendering engine of a computer is responsible for the appearance of the characters on the screen.
The Unicode Standard does not specify the size, shape, or orientation of onscreen characters; it specifies
only the code values assigned to those characters.
Creating Composite Characters
Textual elements may be encoded as composed character sequences; in presentation, the multiple char-
acters are rendered together. For example, “â” is a composite character created by rendering “a” and “^”
together. A composed character sequence is typically made up of a base letter, which occupies a single space,
and one or more non-spacing marks, which are rendered in the same space as the base letter.
The Unicode Standard specifies the order of characters used to create a composite character. The base
character comes first, followed by one or more non-spacing marks. If a textual element is encoded with
more than one non-spacing mark, the order in which the non-spacing marks are stored isn’t important if the
marks don’t interact typographically. If they do, order is important. The Unicode Standard specifies how
competing non-spacing characters are applied to a base character.
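The ordering rules can be tried out with Python's unicodedata module. This is a sketch under the assumption that the module's normalization (which follows a later version of the standard than the one described here) reflects the same ordering principle; the choice of marks is arbitrary.

```python
import unicodedata

base = "a"
circumflex = "\u0302"   # COMBINING CIRCUMFLEX ACCENT, drawn above the base
dot_below = "\u0323"    # COMBINING DOT BELOW, drawn below the base

composed = base + circumflex   # base character first, then the mark

# combining() returns 0 for the base letter and a non-zero class for
# each of these two non-spacing marks.
classes = [unicodedata.combining(c) for c in (base, circumflex, dot_below)]

# The two marks attach to different sides of the base, so they do not
# interact; both storage orders normalize to the same canonical sequence.
order_1 = unicodedata.normalize("NFD", base + circumflex + dot_below)
order_2 = unicodedata.normalize("NFD", base + dot_below + circumflex)
```

Here both orders reduce to the same sequence, which is why the storage order of non-interacting marks does not matter. Two marks of the same class (for example, two marks above) keep the order in which they were stored.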
The Unicode Standard offers another option for some composite characters: precomposed characters. Each
precomposed character is represented by a single code value rather than two or more code values, which may
combine during rendering and thus create a “prefabricated” composite character. For example, the character
“ü” can be encoded as the single code value U+00FC “ü” or as the base character U+0075 “u” followed
by the non-spacing character U+0308 “¨”. The Unicode Standard offers precomposed characters to retain
compatibility with established standards such as Latin1, which includes many precomposed characters such
as “ü” and “ñ”.
The Unicode Standard defines decompositions for all precomposed characters. Precomposed charac-
ters may be decomposed for consistency or analysis. For example, a word processor importing a text file
containing the precomposed character “ü” may decompose that character into a “u” followed by the non-
spacing character “¨”. Once the character has been decomposed, it may be easier for the word processor to
work with the character because the word processor can now easily recognize the character as a “u” with
modifications. This allows easy alphabetical sorting for languages where character modifiers do not affect
alphabetical order.
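Decomposition and the mark-insensitive sorting it enables can be sketched with Python's unicodedata module. The helper name base_letters and the word list are invented for this example, and the approach is a simplification: real collation for a given language usually needs more than dropping marks.

```python
import unicodedata

precomposed = "\u00fc"                                  # "ü" as one code value
decomposed = unicodedata.normalize("NFD", precomposed)  # "u" followed by U+0308

def base_letters(word):
    """Decompose a word and drop its combining marks, keeping base letters."""
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if not unicodedata.combining(c))

words = ["zebra", "\u00fcber", "uhr", "t\u00fcr"]   # zebra, über, uhr, tür
naive = sorted(words)                          # compares raw code values
ignoring_marks = sorted(words, key=base_letters)
```

The naive sort places “über” last, because U+00FC is numerically larger than “z”. With the decomposed key it sorts among the other words beginning with “u”.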
Principles of the Unicode Standard
The Unicode Standard was created by a team of computer professionals, linguists, and scholars to
become a worldwide character standard, one easily used for text encoding everywhere. To that end, the
Unicode Standard follows a set of fundamental principles:
* Sixteen-bit characters
* Logical order
* Full Encoding
* Unification
* Characters, not glyphs
* Dynamic composition
* Semantics
* Equivalent Sequence
* Plain Text
* Convertibility
The character sets of many existing standards are incorporated within the Unicode Standard. For
example, the Latin-1 character set forms its first 256 characters. The Unicode Standard includes the repertoire of
characters from many other international, national and corporate standards as well.
Duplicate encoding of characters is avoided by unifying characters within scripts across languages;
characters that are equivalent in form are given a single code. Chinese/Japanese/Korean (CJK) consolidation
is achieved by assigning a single code for each ideograph that is common to more than one of these languages.
It does this instead of providing a separate code for the ideograph each time it appears in a different language.
(These three languages share many thousands of identical characters because their ideograph sets evolved
from the same source.)
Dynamic character composition allows marked character creation. Each character and diacritic or vowel
mark is encoded separately, which allows the characters to be combined to create a marked character (such
as é). The Unicode Standard also provides single codes for marked characters in order to provide consistency
with many preexisting character set standards.
Characters are stored in logical order. The Unicode Standard includes characters to specify changes in
direction when scripts of different directionality are mixed. For all scripts Unicode text is in logical order
within the memory representation, corresponding to the order in which text is typed on the keyboard. The
Unicode Standard specifies an algorithm for the presentation of text of opposite directionality (for example,
Arabic and English), as well as for occurrences of mixed-directionality text within it.
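The following sketch shows logical order and the per-character direction information on which the presentation algorithm relies. It does not implement that algorithm; it only queries the character properties through Python's unicodedata module, and the sample string is arbitrary.

```python
import unicodedata

# Text is stored in logical (typing) order, whatever the display direction.
# Sample: Latin "abc", a space, then Arabic alef, beh, jeem.
text = "abc \u0627\u0628\u062c"

# Bidirectional category of each character: "L" is left-to-right,
# "AL" is a right-to-left Arabic letter, "WS" is whitespace.
categories = [unicodedata.bidirectional(c) for c in text]

# U+200E and U+200F are explicit direction marks.
marks = [unicodedata.name(c) for c in "\u200e\u200f"]
```

In memory the Arabic letters follow the Latin ones in the order typed. A rendering engine uses the categories to decide that the last three characters are displayed from right to left.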
Assigning Character Codes
A single 16-bit number is assigned to each code element defined by the Unicode Standard. Each of these
16-bit numbers is called a code value and, when referred to in text, is listed in hexadecimal form following
the prefix “U+”. For example, the code value U+0041 is the hexadecimal number 0041 (equal to the decimal
number 65). It represents the character “A” in the Unicode Standard.
Each character is also assigned a unique name that specifies it and no other. For example, U+0041 is
assigned the character name “latin capital letter a”. U+0A1B is assigned the character name “gurmukhi
letter cha”. These Unicode names are identical to the ISO/IEC 10646 names for the same characters.
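The two examples above can be checked with a few lines of Python. The helper describe is a name made up for this sketch; ord, unicodedata.name, and unicodedata.lookup are standard library functions, and the module reports names in uppercase.

```python
import unicodedata

def describe(char):
    """Return (U+ notation, decimal value, character name) for one character."""
    value = ord(char)
    return f"U+{value:04X}", value, unicodedata.name(char)

latin_a = describe("A")
gurmukhi_cha = describe("\u0a1b")

# The name identifies exactly one character, so the lookup also works in reverse.
looked_up = unicodedata.lookup("LATIN CAPITAL LETTER A")
```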
The Unicode Standard groups characters together by scripts in code blocks. A script is any system
of related characters. The standard retains the order of characters in a source set where possible. When
the characters of a script are traditionally arranged in a certain order—alphabetic order, for example—the
Unicode Standard arranges them in its code space using the same order whenever possible. Code blocks
vary greatly in size. For example, the Cyrillic code block doesn’t exceed 256 code values, while the CJK
code block has a range of thousands of code values.
Code elements are grouped logically throughout the range of code values (called the codespace). The
coding starts at U+0000 with the standard ASCII characters, and continues with Greek, Cyrillic, Hebrew,
Arabic, Indic and other scripts, followed by symbols and punctuation. The code space continues with
hiragana, katakana, and bopomofo. The unified Han ideographs are followed by the complete set of mod-
ern hangul. The surrogate range of code values is reserved for future expansion. Towards the end of the
codespace is a range of code values reserved for private use, followed by the compatibility zone; the
compatibility zone contains character variants that are encoded only to enable transcoding to earlier standards and old
implementations. The last two code values are reserved—U+FFFE and U+FFFF.
A range of code values is reserved as user space. Those code values have no universal meaning, and
may be used for characters specific to a program or by a group of users for their own purposes. For example,
a group of choreographers may design a set of characters for dance notation and encode the characters using
code values in user space. A set of page-layout programs may use those same code values as control codes to
position text on the page. The main point of user space is that the Unicode Standard assigns no meaning to
these code values, and reserves them as user space, promising never to assign them meaning in the future.
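The reserved regions of the code space can be inspected through the general category of a code value. The sketch below assumes the private use range U+E000 to U+F8FF of the 16-bit code space and uses Python's unicodedata module, which reflects a later version of the standard; the dictionary keys are just labels.

```python
import unicodedata

PRIVATE_USE = range(0xE000, 0xF900)   # U+E000..U+F8FF
private_count = len(PRIVATE_USE)

# General categories: "Co" = private use, "Cs" = surrogate,
# "Cn" = not assigned to any character.
categories = {
    "U+E000": unicodedata.category("\ue000"),
    "U+D800": unicodedata.category("\ud800"),
    "U+FFFE": unicodedata.category("\ufffe"),
    "U+FFFF": unicodedata.category("\uffff"),
}
```

A program that meets a code value of category "Co" knows that its meaning is a private agreement between users and must be carried through unchanged rather than interpreted.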
Conformance to the Unicode Standard
The Unicode Standard specifies unambiguous requirements for conformance in terms of the principles
and encoding architecture it embodies. A conforming implementation has the following characteristics, as a
minimum requirement:
characters are 16-bit units;
characters are interpreted with Unicode semantics;
unassigned codes are not used; and,
unknown characters are not corrupted.
The full conformance requirements are available within The Unicode Standard, Version 2.0, Addison
Wesley, 1996.
For Further Information
The fundamental source of information about Unicode is The Unicode Standard, Version 2.0, published
by Addison Wesley, 1996. This book comes with a CD-ROM that contains character names and properties, as
well as tables for mapping and transcoding. The Unicode Standard, Version 2.0, as well as Proceedings of the
International Unicode Conferences may be ordered from the Unicode Consortium by using the Publications
Order Form. Updates and Errata of the Unicode Standard are posted on this web site.
Chapter 1—Introduction
This excerpt from the book The Unicode Standard, Version 2.0 has been reformatted and edited for use
on the web.
The Unicode Standard is a fixed-width, uniform encoding scheme for written characters and text. The
repertoire of this international character code for information processing includes characters for the major
scripts of the world, as well as technical symbols in common use. The Unicode character encoding treats
alphabetic characters, ideographic characters, and symbols identically, which means that they can be used
in any mixture and with equal facility. The Unicode Standard is modeled on the ASCII character set, but
uses a 16-bit encoding to support full multilingual text. No escape sequence or control code is required to
specify any character in any language.
The Unicode Standard specifies a numerical value and a name for each of its characters; in this respect, it
is similar to other character encoding standards from ASCII onwards (see Figure 1-1). The Unicode Standard
is code-for-code identical with International Standard ISO/IEC 10646-1:1993, Information Technology—
Universal Multiple-Octet Coded Character Set (UCS)-Part 1: Architecture and Basic Multilingual Plane.
As well as assigning character codes and names, the Unicode Standard provides other information not
found in conventional character set standards, but crucial for using character encoding in implementations.
The Unicode Standard defines properties for characters and includes application data such as case mapping
tables and mappings to the repertoires of international, national, and industry character sets. The Unicode
Consortium provides this additional information to ensure consistency in interchange of Unicode data.
1.1 Design Goals
The primary goal of the development effort for the Unicode Standard was to remedy two serious problems
common to most multilingual computer programs: overloading of the font mechanism when encoding char-
acters, and use of multiple, inconsistent character codes due to conflicting national and industry character
standards. The ASCII 7-bit code space and its 8-bit extensions, although used in most computing systems,
are limited to 128 and 256 code positions, respectively. These 7- and 8-bit code spaces are inadequate in the
global computing environment.
When the Unicode project began in 1988, groups most affected by the lack of a consistent international
character standard included the publishers of scientific and mathematical software, newspaper and book
publishers, bibliographic information services, and academic researchers. More recently, the computer in-
dustry has adopted an increasingly global outlook, building international software that can be easily adapted
to meet the needs of particular locations and cultures. The explosive growth of the Internet has added to
the demand for a character set standard that can be used all over the world.
The designers of the Unicode Standard envisioned a uniform method of character identification which
would be more efficient and flexible than previous encoding systems. The new system would be complete
enough to satisfy the needs of technical and multilingual computing and would encode a broad range of
characters for professional quality typesetting and desktop publishing worldwide.
The original design goals of the Unicode Standard were established as:
Universal. The repertoire must be large enough to encompass all characters that were likely to be used
in general text interchange, including those in major international, national, and industry character sets.
Efficient. Plain text, composed of a sequence of fixed-width characters, provides an extremely useful
model because it is simple to parse: software does not have to maintain state, look for special escape
sequences, or search forward or backward through text to identify characters.
Uniform. A fixed character code allows efficient sorting, searching, display, and editing of text.
Unambiguous. Any given 16-bit value always represents the same character.
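The practical benefit of fixed-width characters can be sketched as follows. The example assumes that every character fits in one 16-bit unit, as in the design described here; in today's UTF-16, characters outside that range occupy two units (a surrogate pair), so the direct indexing shown does not hold for them. The function name unit_at is made up for the illustration.

```python
text = "T\u00fc\u0a1b"               # "T", "ü", GURMUKHI LETTER CHA
encoded = text.encode("utf-16-be")   # big-endian 16-bit units, no byte-order mark

def unit_at(data, index):
    """Return the 16-bit code value of character number `index`."""
    offset = 2 * index               # fixed width: no scanning, no state
    return int.from_bytes(data[offset:offset + 2], "big")

values = [unit_at(encoded, i) for i in range(len(text))]
```

Characters from three different scripts are found by the same offset calculation, with no escape sequences and no need to examine the preceding bytes.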
Figure 1-2 demonstrates some of these features, contrasting Unicode encoding to mixtures of single-byte
character sets, with escape sequences to shift the meanings of bytes.
1.2 Coverage
The Unicode Standard, Version 2.0 contains 38,885 characters from the world’s scripts. These characters
are more than sufficient not only for modern communication, but also for the classical forms of many
languages. Languages that can be encoded include Russian, Arabic, Anglo-Saxon, Greek, Hebrew, Thai,
and Sanskrit. The unified Han subset contains 20,902 ideographic characters
defined by national and industry standards of China, Japan, Korea, and Taiwan. In addition, the Unicode
Standard includes mathematical operators and technical symbols, geometric shapes, and dingbats. Overall
character allocation and the code ranges are detailed in Chapter 2, General Structure.
Included in the Unicode Standard are characters from all major international standards approved and
published before December 31, 1990, in particular, the ISO International Register of Character Sets, the
ISO/IEC 6937 and ISO/IEC 8859 families of standards, as well as ISO/IEC 8879 (SGML). Other primary
sources included bibliographic standards used in libraries (such as ISO/IEC 5426 and ANSI Z39.64), the most
prominent national standards, and various industry standards in very common use (including code pages and
character sets from Adobe, Apple, Fujitsu, Hewlett-Packard, IBM, Lotus, Microsoft, NEC, WordPerfect,
and Xerox). The complete Hangul repertoire of Korean National Standard KS C 5601 was added in Version
2.0. For a complete list of ISO and national standards used as sources, see the bibliography.
The Unicode Standard does not encode idiosyncratic, personal, novel, rarely exchanged, or private-use
characters, nor does it encode logos or graphics. Artificial entities, whose sole function is to serve transiently
in the input of text, are excluded. Graphologies unrelated to text, such as musical and dance notations, are
outside the scope of the Unicode Standard. Font variants are explicitly not encoded. The Unicode Standard
includes a Private Use Area, which may be used to assign codes to characters not included in the repertoire
of the Unicode Standard.
The Unicode Consortium (see Section 1.4, The Unicode Consortium) periodically develops proposals
for new scripts. The Consortium welcomes the submission of new characters for possible inclusion in the
Unicode Standard. (For instructions on how to submit characters to the Unicode Consortium, see Appendix
B, Submitting New Characters.)
1.3 About this book
This book defines Version 2.0 of the Unicode Standard. The general principles and architecture of the
Unicode Standard, requirements for conformance, and guidelines for implementers precede the actual coding
information. The accompanying CD-ROM carries tables of use to implementers.
Chapter 2 sets forth the fundamental principles underlying the Unicode Standard and covers specific
topics such as text processes, overall character properties, and the use of non-spacing marks.
Chapter 3 constitutes the formal statement of conformance. It opens with the conformance clauses
themselves, which are followed by sections that define more precisely terms used in the clauses. The remainder
of this chapter presents the normative algorithms for three processes: the canonical ordering of combining
marks, the encoding of Korean Hangul syllables by conjoining jamo, and the formatting of bidirectional text.
Chapter 4 describes character properties, both normative (required) and informative. Since code charts
alone are not sufficient for implementation, the Unicode Standard also specifies character properties, some
of which are required for conformance.
Chapter 5 discusses implementation issues, including compression, strategies for dealing with unknown
and missing characters, and transcoding to other standards.
Chapter 6 contains character block descriptions, part of the coding information in the Unicode Standard.
A character block generally contains characters from a single script (for example, Tibetan) or is a collection
of a particular type of character (for example, Mathematical Operators). A character block description gives
basic information about the script or collection and may discuss specific characters.
Chapter 7 presents the individual characters, arranged by character block. An overview of a particular
character block is given by means of a code chart. With the exception of the blocks for East Asian ideographs
and Korean hangul syllables, the individual characters of a block are identified in the accompanying names
list.
Chapter 8 provides a radical/stroke index to East Asian ideographs.
The major table on the CD-ROM is the Unicode Character Database, which gives character codes, character
names (with Version 1.0 name if different), character properties, and decompositions for decomposable
or compatibility characters. The CD-ROM also includes property-based mapping tables (for example, tables
for case) and transcoding tables for international, national, and industry character sets (including the Han
cross-reference table). (For the complete contents of the CD-ROM, see its READ ME file.)
Notational Conventions
Throughout this book, certain typographic conventions are used. In running text, an individual Unicode
value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation, using the digits
0–9 and the letters A–F (for 10 through 15 respectively). In tables, the U+ may be omitted for brevity.
U+0416 is the Unicode value for the character named CYRILLIC CAPITAL LETTER ZHE.
A range of Unicode values is expressed as U+xxxx→U+yyyy or U+xxxx–U+yyyy, where xxxx and yyyy
are the first and last Unicode values in the range, and the arrow or long dash indicates a contiguous range.
The range U+0900–U+097F contains 128 character values.
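The notation can be checked mechanically. The following is a minimal sketch in Python (not part of the Unicode Standard); the helper names u_plus and range_size are made up for this illustration, and the character name is looked up with the standard unicodedata module.

```python
import unicodedata

def u_plus(code_point: int) -> str:
    """Format a code point in U+nnnn notation (at least four hex digits)."""
    return f"U+{code_point:04X}"

def range_size(first: int, last: int) -> int:
    """Number of values in the contiguous range first..last, inclusive."""
    return last - first + 1

zhe = 0x0416
print(u_plus(zhe), unicodedata.name(chr(zhe)))  # U+0416 CYRILLIC CAPITAL LETTER ZHE
print(range_size(0x0900, 0x097F))               # 128
```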
Unicode characters have unique names, which are identical to those of the English language version of
International Standard ISO/IEC 10646. Unicode character names contain only uppercase Latin letters A
through Z, space, and hyphen-minus; this convention makes it easy to generate computer language identifiers
automatically from the names. Unified East Asian ideographs are named CJK UNIFIED IDEOGRAPH-X,
where X is replaced with the hexadecimal Unicode value; for example, CJK UNIFIED IDEOGRAPH-4E00.
The names of Hangul syllables are generated algorithmically; for details, see Hangul Syllable Names in
Section 3.10, Combining Jamo Behavior.
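As an illustration of how identifiers might be derived from character names, here is a short Python sketch. The functions name_to_identifier and cjk_name are hypothetical helpers written for this example; the replacement of spaces and hyphens by underscores is only one possible convention.

```python
import unicodedata

def name_to_identifier(name: str) -> str:
    """Turn a Unicode character name into an identifier (one possible convention)."""
    return name.replace(" ", "_").replace("-", "_")

def cjk_name(code_point: int) -> str:
    """Build the name of a unified ideograph from its hexadecimal value."""
    return f"CJK UNIFIED IDEOGRAPH-{code_point:04X}"

print(name_to_identifier(unicodedata.name("\u03BC")))  # GREEK_SMALL_LETTER_MU
print(cjk_name(0x4E00))                                # CJK UNIFIED IDEOGRAPH-4E00
```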
In running text, a formal Unicode name is shown in small capitals (for example, GREEK SMALL
LETTER MU), and alternative names (aliases) appear in italics (for example, umlaut).
Italics are also used to refer to a text element that is not explicitly encoded (for example, pasekh alef),
or to set off a foreign word (for example, the Welsh word ynghyd).
The symbols used in the character names list are described at the beginning of Chapter 7, Code Charts.
In the text of this book, the word “Unicode” if used alone as a noun refers to the Unicode Standard or
a Unicode character value.
1.4 The Unicode Consortium
The Unicode Consortium was incorporated in January 1991, under the name Unicode, Inc., to promote
the Unicode Standard as an international encoding system for information interchange, to aid in its imple-
mentation, and to maintain quality control over future revisions. The Unicode Consortium is the central
focus and contact point for conducting these activities.
To further these goals, the Unicode Consortium cooperates with the International Organization for
Standardization (ISO). The Consortium holds a Class C liaison membership with ISO/IEC JTC1/SC2; it
participates both in the work of JTC1/SC2/WG2 (the working group within ISO responsible for computer
character sets) and in the work of the Ideographic Rapporteur Group of WG2. The Consortium is a member
company of ANSI Subcommittee X3U. In addition, member representatives in many countries also work
with their national standards bodies.
A number of standards organizations are Liaison Members of the Unicode Consortium: ECMA (a
European-based organization for standardizing information and communication systems), the Association of
Common Chinese Code of the Center for Computer & Information Development Research (China), the
Technical Committee on Information Technology of the Viet Nam General Department for Standardization,
Metrology, and Quality Control (TCVN/TC), and the WG2 standards committee of Korea.
Membership in the Unicode Consortium is open to organizations and individuals anywhere in the world
who support the Unicode Standard and who would like to assist in its extension and widespread implemen-
tation. Full and Associate Members represent a broad spectrum of corporations and organizations in the
computer and information processing industry. The Consortium is supported through the volunteer efforts
of employees of member companies and individual members, and financially through membership dues.
The Unicode Technical Committee
The Unicode Technical Committee (UTC) is the working group within the Consortium responsible for
the creation, maintenance, and quality of the Unicode Standard. The UTC controls all technical input to the
standard and makes associated content decisions. UTC members represent the companies that are Full and
Associate Members of the Consortium. Observers are welcome to attend UTC meetings and may participate
in the discussions, since the intent of the UTC is to act as an open forum for the free exchange of technical
ideas.
1.5 The Unicode Standard and ISO/IEC 10646
During 1991, the Unicode Consortium and the International Organization for Standardization (ISO)
recognized that a single, universal character code was highly desirable. Mutually acceptable changes were
made to Version 1.0 of the Unicode Standard and to the first ISO/IEC Draft International Standard DIS
10646.1, and their repertoires were merged into a single character encoding in January 1992. After inter-
national ballot and editorial changes to accommodate comments, the final ISO standard was published in
May 1993 as ISO/IEC 10646-1:1993, Information Technology-Universal Multiple-Octet Coded Character Set
(UCS)-Part 1: Architecture and Basic Multilingual Plane.
In accord with the merger agreement, a revision of the Unicode Standard was published in 1993 as
Unicode Technical Report #4, with the title: The Unicode Standard, Version 1.1, Prepublication Edition.
Version 1.1 of the Unicode Standard specified a repertoire and set of code assignments identical to those of
the new ISO/IEC standard.
After the initial release of ISO/IEC 10646 and the Unicode Standard Ver. 1.1, both ISO JTC1/SC2/WG2
(the ISO working group responsible for ISO/IEC 10646) and the Unicode Technical Committee continued
to develop the merged standard. These developments led to Version 2.0 of the Unicode Standard, incorporating
the first seven amendments made to or proposed for ISO/IEC 10646. (For details, see Appendix C,
Relationship to ISO/IEC 10646, and Appendix D, Cumulative Changes.)
Supported Scripts
The Unicode Character Standard primarily encodes scripts rather than languages. That is, where more
than one language shares a set of symbols that have a historically related derivation, the union of the set of
symbols of each such language is unified into a single collection identified as a single script. These collections
of symbols (i.e., scripts) then serve as inventories of symbols which are drawn upon to write particular
languages. In many cases, a single script may serve to write tens or even hundreds of languages (e.g.,
the Latin script). In other cases only one language employs a particular script (e.g., Hangul).
The primary scripts currently supported by Unicode 2.0 are:
* Arabic        * Gurmukhi      * Lao
* Armenian      * Han           * Malayalam
* Bengali       * Hangul        * Oriya
* Bopomofo      * Hebrew        * Phonetic
* Cyrillic      * Hiragana      * Tamil
* Devanagari    * Kannada       * Telugu
* Georgian      * Katakana      * Thai
* Greek         * Latin         * Tibetan
* Gujarati
In addition to the above primary scripts, a number of other collections of symbols are also encoded by
Unicode. These collections, which may be referred to as secondary scripts (or as pseudo-scripts), consist of
the following:
* Numbers * General Diacritics * General Punctuation * General Symbols * Mathematical Symbols *
Technical Symbols * Dingbats * Arrows, Blocks, Box Drawing Forms, and Geometric Shapes * Miscellaneous
Symbols * Presentation Forms
XA, Photo-CD, and CD-I. Older recordable CDs (CD-R) were labeled “74 min., 650 Mbytes.” Such a CD
had 336,075 sectors, which translates to 74 min., 43 sec. playing time. The capacity was 336,075×2048 =
688,281,600 bytes or 656.396484375 Mbytes (where mega is 2^20 = 1,048,576). However, it is possible to
record more sectors on a CD, thereby increasing its capacity. This is somewhat risky, since the extra sectors
approach the edge of the CD, where they can easily get scratched or dirty. New CD-Rs are labeled as
700 Mbyte or 80 minutes. They have 359,849 blocks for a total capacity of 359,849 ×2048 = 736,970,752
bytes or 702.83 Mbytes.
In order to read 702.83 Mbytes in 80 minutes, a CD drive has to read 150Kbytes per second. This speed
is designated 1X and was typical for CD players made before 1993. Current CD drives (year 2002) can read
CDs at speeds of up to 56X, where 700 Mbytes are read in just under 86 seconds!
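The capacity and speed figures above are easy to reproduce. The following Python sketch assumes 2048 user bytes per sector, mega = 2^20, and 1X = 150 × 1024 bytes per second; the function names are invented for this illustration.

```python
SECTOR_DATA_BYTES = 2048      # user data per sector, as used in the text
MEGA = 2**20                  # "mega" as defined in the text
BASE_RATE = 150 * 1024        # assumed meaning of 1X: 150 Kbytes per second

def capacity_bytes(sectors: int) -> int:
    """User-data capacity of a disc with the given number of sectors."""
    return sectors * SECTOR_DATA_BYTES

def read_time_seconds(n_bytes: int, speed: int = 1) -> float:
    """Time needed to read n_bytes at a drive speed of speed X."""
    return n_bytes / (BASE_RATE * speed)

for sectors in (336_075, 359_849):
    cap = capacity_bytes(sectors)
    minutes_at_1x = read_time_seconds(cap) / 60
    print(sectors, cap, round(cap / MEGA, 2), round(minutes_at_1x, 1))

# 700 Mbytes at 56X
print(round(read_time_seconds(700 * MEGA, speed=56), 1))
```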
Exercise C.1: What is the capacity of a CD-ROM with 345,000 sectors in mode 1 and in mode 2?
C.2 Description
Physically, the CD is a disc, 1.2 millimeters thick, with a 120mm diameter. The hole, at the center, is 15mm
in diameter. The distance between the inner and outer circumferences is thus (120 − 15)/2 = 52.5 mm.
Of this, only 35mm are actually used, leaving a safety margin both inside and outside. The information is
recorded on a metallic layer (typically aluminum, silver or gold) that is 0.5–1μ thick (where μ, or micron, is
10^−6 meter). Above this layer there is a protective lacquer coating (10–30μ thick), with the printed label.
Below the metal layer there is the disc substrate, normally made of transparent polycarbonate. It occupies
almost the entire thickness of the disc. Since the protective layer is so thin, any scratches on the label can
directly damage the metallic layer. Even pen marks can bleed through and cause permanent damage. On
the other hand, scratches on the substrate are usually handled by the error correcting code (see below).
The digital information is recorded in pits arranged in a spiral track that runs from the inner circumfer-
ence to the outer one. The pits are extremely small. Each is .5μ wide and .11μ deep. Pit lengths range from
.833μ to 3.56μ. The track areas between pits are called land. The distance between successive laps of the
track is 1.6μ. As a result, the track makes 22,188 revolutions in the 35 mm recording area. Its total length
is about 3.5 miles. The information is recorded such that any edge of a pit corresponds to binary 1, and
the area in the pits and in the lands (between pits) corresponds to consecutive zeros. To reduce fabrication
problems, the pits should not be too short or too long, which means that the number of binary ones recorded
should be carefully controlled (see below).
To read the disc, a laser beam is focused on the track through the disc substrate, and its reflection
measured. When the beam enters a pit, the reflection drops to almost zero, because of interference. When
it leaves the pit, the reflection goes back to high intensity. Each change in the reflection is read as binary
one. To read the zeros, the length of a pit, and between pits, must be measured accurately. The disc must
thus rotate with a constant linear velocity (CLV). This implies that, when the reading laser moves from the
inner parts of the track to the outer ones, the rotational speed has to decrease (from 500 RPM to about
200 RPM). The track contains synchronization information used by the CD player to adjust the speed. The
aim of the player is to read 4.3218 million bits per second, which translates to a CLV of 1.2–1.4 meter/sec.
The CD was made possible by two advances, a technological one (high precision manufacturing), and a
scientific one (error-correcting codes). Here are some numbers illustrating the two points:
1. Remember the old LP vinyl records? Information is stored on such a record in a narrow spiral groove
whose width is about the diameter of a human hair (about 0.05 mm). More than 30 laps of the CD track
would fit in such a groove!
2. A CD is read at a rate of 4.3218 million bits per second. The reading is done by measuring the
reflection of a laser beam from the track, and is thus sensitive to surface scratches, to imperfections in the
track, and to surface contamination because of fingerprints, dust particles, etc. All these factors introduce
errors in reading, but the digital nature of the recording allows for error correction.
(A note about cleaning. Clean a CD only if you must, since cleaning may create many invisible scratches
and cause more harm than good. Use a soft, moistened cloth, and work radially, from the center to the rim.
Cleaning with a circular motion might create a long scratch paralleling a track, thus introducing a long burst
of errors.)
C.3 Error-Correction
It is obvious that reading a CD-ROM must be error free, but error correction is also important in an audio
CD, because one bad bit can cause a big difference in the note played. Consider the two 16-bit numbers
0000000000000000 and 1000000000000000. The first represents silence and the second, a loud click. Yet they
differ by one bit only! The size of a typical dust particle is 40μ, enough to cover more than 20 laps of the
track, and cause several bursts of errors (Figure C.1b). Without extensive error correction, the music would
sound like one big scratch.
Any error correction method used must be very sophisticated, since the errors may come in bursts, or
may be individual. The use of parity bits makes it possible to correct individual errors, but not a burst of
consecutive errors. This is why interleaving is used, in addition to parity bits. The principle of interleaving
is to rearrange the samples before recording them on the CD, and to reconstruct them after they have been
read. This way a burst of errors during the read is translated to individual errors (Figure C.1a), that can
then be corrected by their parity bits.
I N T E R L E A V E D D A T A Original data
N E E I T L R V D A A D E T A Interleaved data
N E E I T ? ? ? ? ? A D E T A A burst of errors
I N T E ? ? E ? ? E ? D A T A Individual errors
a: Interleaving data
b: Relative sizes (human hair, 75μm; dust particle, 40μm; fingerprint, 15μm)
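The effect of interleaving can be tried out with a few lines of Python. This is only a sketch of a simple block interleaver (write the data row by row into a matrix, read it column by column); it is not the permutation used in the figure, nor the one used by CIRC. The function names and the choice of five rows are assumptions made for this illustration.

```python
def interleave(data: str, rows: int) -> str:
    """Write data row by row into a rows x cols matrix, read it column by column."""
    assert len(data) % rows == 0, "pad the data to a multiple of rows"
    cols = len(data) // rows
    return "".join(data[r * cols + c] for c in range(cols) for r in range(rows))

def deinterleave(data: str, rows: int) -> str:
    """Inverse of interleave(): restore the original order."""
    cols = len(data) // rows
    return "".join(data[c * rows + r] for r in range(rows) for c in range(cols))

message = "INTERLEAVEDDATA"
sent = interleave(message, rows=5)
damaged = sent[:5] + "?" * 5 + sent[10:]    # a burst of five bad symbols
received = deinterleave(damaged, rows=5)
print(sent)
print(damaged)
print(received)
```

With five rows, a burst of up to five consecutive bad symbols ends up as isolated errors, no two of them adjacent, after the data is put back in order.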
The actual code used in CDs is called the Cross Interleaved Reed-Solomon Code (CIRC). It is based on the
codes developed by Irving S. Reed and Gustave Solomon at MIT Lincoln Laboratory in 1960 and is a powerful
code. One version of
this code can correct up to 4000 consecutive bit errors, which means that even a scratch as long as three
millimeters can be tolerated on a CD. The principle of CIRC is to use a geometric pattern that is so familiar
that it can be reconstructed even if large parts of it are missing. It’s like being able to recognize the shape
of a rectangular chunk of cheese after a mouse has nibbled away large parts of it.
Suppose that the data consists of the two numbers 3.6 and 5.9. We consider them to be the y coordinates
of two-dimensional points and we assign them x coordinates of 1 and 2. We thus end up with the points
(1, 3.6) and (2, 5.9). We consider those points the endpoints of a line and we calculate four more points on
this line, with x coordinates of 3, 4, 5, and 6. They are (3, 8.2), (4, 10.5), (5, 12.8), and (6, 15.1). Since the x
coordinates are so regular, we only need to store the y coordinates of these points. We thus store (or write
on the CD) the six numbers 3.6, 5.9, 8.2, 10.5, 12.8, and 15.1.
Now suppose that two errors occur among those six numbers. When the new sequence of six numbers
is checked for the straight line property, the remaining four numbers can be identified as being collinear and
can still be used to reconstruct the line. Once this is done, the two bad numbers can be corrected since their
x coordinates are known. Even three bad numbers out of those six can be corrected since the remaining
three numbers would still be enough to identify the original straight line.
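The straight-line idea can be sketched in Python as follows. This is a toy illustration of the principle only, not the Reed-Solomon algorithm; the functions encode and correct are made up for the example, and the two bad values are arbitrary. Exact fractions are used so that the collinearity test is not disturbed by rounding.

```python
from fractions import Fraction
from itertools import combinations

def encode(a, b, n=6):
    """Extend the line through (1, a) and (2, b) to x = 1..n; return the y values."""
    slope = b - a
    return [a + slope * (x - 1) for x in range(1, n + 1)]

def correct(ys):
    """Find the line supported by the most points and re-derive every y from it."""
    points = list(enumerate(ys, start=1))
    best_support, best_line = 0, None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        slope = (y2 - y1) / (x2 - x1)
        support = sum(1 for x, y in points if y == y1 + slope * (x - x1))
        if support > best_support:
            best_support, best_line = support, (x1, y1, slope)
    x1, y1, slope = best_line
    return [y1 + slope * (x - x1) for x, _ in points]

stored = encode(Fraction("3.6"), Fraction("5.9"))
damaged = stored.copy()
damaged[1] = Fraction("7.7")     # two arbitrary bad values
damaged[4] = Fraction("0")
print([float(y) for y in correct(damaged)])
```

With two bad values the four good ones always outvote any other line. With three bad values the vote can end in a tie if the bad values happen to line up with a good one, so the sketch is not guaranteed to succeed in that case.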
It is even more reliable to start with three numbers a, b, and c, to convert them to the points (1, a),
(2, b), and (3, c), and to calculate the (unique) parabola that passes through these points. Four more points,
with x coordinates of 4, 5, 6, and 7, can then be calculated on this parabola. Once the seven points are
known they provide a strong pattern. Even if four of the seven get corrupted the remaining three can be
used to reconstruct the parabola and correct the four bad ones. It may happen that three numbers will get
corrupted in such a way that they will lie on a new parabola, but this is extremely rare.
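The parabola version can be sketched with Lagrange interpolation. In this Python illustration the data values 2, 7, 1 are hypothetical, the function lagrange is a helper written for the example, and the positions of the lost values are assumed to be known; locating bad values whose positions are unknown would need a vote among candidate parabolas, as in the straight-line case.

```python
from fractions import Fraction

def lagrange(points, x):
    """Value at x of the unique polynomial through the given (x, y) points."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

data = [2, 7, 1]                                    # hypothetical values a, b, c
base = list(enumerate(data, start=1))               # (1, a), (2, b), (3, c)
stored = [lagrange(base, x) for x in range(1, 8)]   # seven y values on the parabola

# Suppose four values are lost and only those at x = 2, 5, 7 survive.
survivors = [(x, stored[x - 1]) for x in (2, 5, 7)]
rebuilt = [lagrange(survivors, x) for x in range(1, 8)]
print(rebuilt == stored)
```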
C.4 Encoding
The coding process starts with a chunk of 24 data bytes (twelve 16-bit samples, or six 32-bit stereo samples),
and ends with 32 bytes of scrambled data and parity which are written on the CD (after EFM modulation,
see below) as a frame. For each frame the hardware generates another byte called the subcode (see below).
The overhead is thus 8 + 1 bytes for every 24 bytes, or 37.5%. It can be shown that even a burst of about
4000 bad data bits can be completely corrected by this code, which justifies the overhead.
Before writing a 33-byte frame on the CD, the recording hardware performs Eight-to-Fourteen Modu-
lation (EFM). Each byte is used as a pointer to a table with 14-bit entries, and it is the table entry which
is finally written on the CD. The idea is to control the length of the pits by controlling the number of con-
secutive zeros. It is easier to fabricate the pits if they are not too short and not too long. EFM modulation
produces 14-bit numbers that have at least two binary zeros and at most ten zeros between successive binary
ones. There are, of course, 2^14 = 16,384 14-bit numbers, and 267 of them satisfy the condition above. Of
those, 256 were selected and placed in the table. (Two more are used to modulate the subcodes.) Here are
the first eleven entries of the table:
8-bit pointer 14-bit pattern
00000000 01001000100000
00000001 10000100000000
00000010 10010000100000
00000011 10001000100000
00000100 01000100000000
00000101 00000100010000
00000110 00100001000000
00000111 00100100000000
00001000 01001001000000
00001001 10000001000000
00001010 10010001000000
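The count of 267 can be verified by brute force. The Python sketch below tests all 2^14 patterns; it assumes that the limit of ten zeros also applies to the runs of zeros at the start and at the end of a pattern (this assumption is what yields 267). The function name is_valid_efm is made up for the example.

```python
def is_valid_efm(word: str) -> bool:
    """True if ones are separated by at least two zeros and no run of zeros
    anywhere in the word (including both ends) is longer than ten."""
    if "1" not in word:
        return False                  # fourteen zeros in a row exceed the limit
    first, last = word.index("1"), word.rindex("1")
    inner_runs = word[first:last + 1].split("1")[1:-1]   # zero runs between ones
    if any(len(run) < 2 for run in inner_runs):
        return False
    return max(len(run) for run in word.split("1")) <= 10

candidates = [format(n, "014b") for n in range(2**14)]
valid = [w for w in candidates if is_valid_efm(w)]
print(len(valid))                     # 267

# The table entries listed above all pass the same test.
table_sample = [
    "01001000100000", "10000100000000", "10010000100000", "10001000100000",
    "01000100000000", "00000100010000", "00100001000000", "00100100000000",
    "01001001000000", "10000001000000", "10010001000000",
]
print(all(is_valid_efm(w) for w in table_sample))
```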
There is still the possibility of a 14-bit number ending with a binary 1, followed by another 14-bit number
starting with a 1. To avoid that, three more merging bits are recorded on the track after each 14-bit number.
Two of them are zeros and the third one is selected to suppress the signal’s low frequency component. This
process results in about 3 billion pits being fabricated on the track.
To summarize the encoding process, it starts with a group of 24 bytes (192 bits), which are encoded
into 24 + 8 + 1 = 33 bytes. These are translated into 33 14-bit numbers, which are recorded, each with 3
merging bits, on the track. A synchronization pattern of 24 bits (plus 3 merging bits) is also recorded at
the start of each frame. The original 192 bits thus require 24 + 3 + 33 × (14 + 3) = 588 bits to be recorded,
which corresponds to an overhead of (588 − 192)/192 = 396/192 ≈ 206%. To record 635,040,000 data bytes,
we thus need about 1,309,770,000 bytes of overhead (about 10.5 billion bits), for a total of about
1,944,810,000 bytes (about 15.6 billion bits)! The following summarizes the
format of a frame.
The CD player has to read the information at the rate of 44,100 samples per second (corresponding to
a 44.1 kHz sampling rate). Each (stereo) sample consists of two 16-bit numbers (4 bytes), so there are 24/4 = 6
samples per frame. This is why 44100/6 = 7350 frames have to be read each second, which translates to a bit rate of
7350 × 588 = 4,321,800 bits/sec.
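The frame arithmetic of this section can be collected in one place. The following Python sketch simply restates the numbers given above; the constant names are chosen for this illustration.

```python
DATA_BYTES = 24                # audio data bytes per frame
PARITY_BYTES = 8               # parity bytes added by the error-correcting code
SUBCODE_BYTES = 1
EFM_BITS = 14                  # bits per byte after EFM modulation
MERGE_BITS = 3
SYNC_BITS = 24
SAMPLE_RATE = 44_100           # stereo samples per second
BYTES_PER_STEREO_SAMPLE = 4    # two 16-bit numbers

symbols = DATA_BYTES + PARITY_BYTES + SUBCODE_BYTES
frame_bits = (SYNC_BITS + MERGE_BITS) + symbols * (EFM_BITS + MERGE_BITS)
data_bits = DATA_BYTES * 8
overhead = (frame_bits - data_bits) / data_bits
frames_per_second = SAMPLE_RATE * BYTES_PER_STEREO_SAMPLE // DATA_BYTES
bit_rate = frames_per_second * frame_bits

print(symbols, frame_bits, overhead)    # 33 588 2.0625
print(frames_per_second, bit_rate)      # 7350 4321800
```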
C.5 The Subcode
As stated earlier, each frame contains an additional byte called the subcode. The eight bits of the subcode
are labeled PQRSTUVW. In an audio CD, only the P & Q bits are used. When the CD is read, the subcode
bytes from 98 consecutive frames are read and assembled into a subcode frame. Recall that the frames
are read at a rate of 7350 per second. Since 7350/98 = 75, we get that 75 subcode frames are assembled
each second by the CD player! A subcode frame gets the two 14-bit sync patterns 00100000000001 and
00000000010010 (two of the patterns not used by the EFM table). The frame is then considered eight 98-bit
numbers. The 98 bits of P give the CD player information about the start and end of each song (each music
track). The bits of Q contain more information, such as whether the CD is audio or ROM, and whether
other digital recorders are legally permitted to copy the CD.
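The regrouping of subcode bytes into eight channels can be sketched in Python. The sketch follows the simplified description above (98 bytes regarded as eight 98-bit numbers) and assumes that P is the most significant bit of each subcode byte; the function split_subcode and the sample bytes are hypothetical.

```python
CHANNELS = "PQRSTUVW"
FRAMES_PER_SUBCODE_FRAME = 98

def split_subcode(subcode_bytes: bytes) -> dict:
    """Regroup 98 subcode bytes into eight 98-bit channels, P..W.

    Assumption: P is the most significant bit of each byte, W the least."""
    assert len(subcode_bytes) == FRAMES_PER_SUBCODE_FRAME
    return {
        name: [(byte >> (7 - i)) & 1 for byte in subcode_bytes]
        for i, name in enumerate(CHANNELS)
    }

sample = bytes([0xC0] * 98)         # hypothetical bytes with only P and Q set
channels = split_subcode(sample)
print(sum(channels["P"]), sum(channels["Q"]), sum(channels["R"]))
print(7350 // FRAMES_PER_SUBCODE_FRAME)   # subcode frames per second
```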
C.6 Data Readout
To read the data off the CD track, a narrow, focused laser beam is shined on it, and the reflection is measured.
The process is based on two physical principles: refraction and interference. Refraction occurs when a
beam of light passes from one medium to another; both its wavelength and its speed change, causing it to
bend. If it passes from a rare medium (low index of refraction) to a dense one (high index of refraction),
both its speed and its wavelength decrease.
The refraction index of vacuum is defined as 1 (that of air is virtually the same), and that of the
substrate must be 1.55. The wavelength of the laser beam in the air is 780 nanometers, so in the substrate
it drops to about 500 nanometers. The height of a pit is about a quarter of that (actually a little less
than a quarter), or 110 nanometers. The area of the beam hitting the metallic layer is about twice the
area of a pit (Figure C.2a), so half of it hits the pit, and the other half, the land around the pit. The two
halves are reflected with a phase difference of one-half the wavelength (λ/2), causing destructive interference
(Figure C.2b). This is why the reflection intensity drops when the beam scans a pit. The reflection drops
to about 25% of its maximum because the height of a pit is not exactly λ/4, and because the parts of the
beam hitting the pit and the land are not exactly equal.
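The numbers in this paragraph can be checked with a short calculation. The Python sketch below uses only the values quoted above and assumes that the two halves of the beam differ in path length by twice the pit height (the light travels the extra distance down and back).

```python
WAVELENGTH_AIR_NM = 780.0
REFRACTIVE_INDEX = 1.55       # substrate
PIT_HEIGHT_NM = 110.0

wavelength_substrate = WAVELENGTH_AIR_NM / REFRACTIVE_INDEX
quarter_wave = wavelength_substrate / 4
path_difference = 2 * PIT_HEIGHT_NM            # round trip over the pit height
phase_fraction = path_difference / wavelength_substrate

print(round(wavelength_substrate, 1))   # about 500 nm
print(round(quarter_wave, 1))           # a little more than the pit height
print(round(phase_fraction, 3))         # somewhat less than the ideal 0.5
```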
C.7 Fabrication
The process of fabricating CDs consists of the following steps:
1. The digital information has to be recorded on a master tape. It may consist of sound samples for
an audio CD, or data files for a CD-ROM. The tape has to be checked carefully, since any errors would be
transferred to all the final CDs made. Old music tapes can be translated to digital but the quality will
normally be low.
2. The master tape is transferred to a CD master tape, with the subcode information.
3. A master CD is made of a glass disc. It is about 240mm in diameter and 5.9mm thick. It is polished
smooth, and covered with a thin layer of photoresist material. The disc is then checked with a laser beam
to make sure it is flat, and the thickness of the photoresist material is uniform.
[Figure C.2. Reading a CD. Panels (a) and (b); labels: metallic layer, pit, substrate, laser beam, λ/4, λ/2, 2μm.]
4. A special “cutting” machine is used to fabricate the track on the disc. A computer reads the CD
master tape, encodes the information (i.e., performs the interleaving, calculates the parity bits and does the
EFM modulation) and uses a laser beam to expose the disc where the pits should be. This process requires
a quiet environment, so the machine must be isolated from vibrations.
The glass disc is then flooded by a chemical that etches away the exposed areas of the photoresist (not
the glass), thus creating the pits. A laser beam monitors this process by measuring the depth of the pits as
they are being formed.
5. A thin silver layer is evaporated on the disc, which is then played (or read), to ensure reliability.
6. A container with electrolytic solution is used to make a “negative” of the silvered glass disc. The
negative is made of nickel and is called the “father”. Making the father destroys the photoresist layer, so
the glass master can no longer be used. Several “mothers” may be made of the single father, and several
“sons” (also called stampers) made of each “mother”.
7. A stamper is used to make many copies of the CD. Each copy is made by injecting plastic over the
stamper, thus transferring the pits to the plastic. This becomes the substrate. A metallic layer is evaporated
over the track, followed by a top layer of acrylic, and by the printed label.
The entire process requires high precision machines, a very clean environment, and isolation from vibra-
tions. In spite of this, the final price of a CD depends on market demands and popularity of the performer,
more than on the technical difficulties of fabrication.
C.8 The CD-ROM Format
Because of the large capacity of the audio CD, it can be used to store large quantities of digital data of any
kind. This is the principle of the CD-ROM. It can be used as a read-only memory with a capacity of about
600Mbytes. The data is recorded as pits on a track, same as an audio CD, but the format is different. Data
is recorded on an audio CD in 24-byte frames, with a subcode (of 1 byte per frame) for each group of 98
frames. In the CD-ROM, the basic data unit is a group of 98 frames, called a sector. The size of a sector is
98 × 24 = 2352 bytes, and it is divided into four parts: (1) a 12-byte sync pattern; (2) a 4-byte header; (3)
2048 bytes (=2K) of user data; (4) 288 bytes of system data (used for error correction, in addition to the
CIRC code).
The maximum capacity of the CD-ROM is 333,000 sectors, which translates to 681,984,000 bytes (about
650.4Mbytes). If less data is required, the outer part of the CD remains blank, making it easier to handle
the disc by hand. (Even 550Mbytes is quite a large capacity, equal to about 275,000 pages of text.) Because
of the additional error correction information, CD-ROMs are extremely reliable. They typically create one
uncorrectable error for every 10^16 or 10^17 bits read! For comparison, there are about 31.5 × 10^17 nanoseconds
in a century.
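The sector layout and the capacity figure above can be checked as follows. This Python sketch uses the four field sizes given in the text; the dictionary SECTOR_LAYOUT and the function field_offsets are names invented for the illustration.

```python
FRAMES_PER_SECTOR = 98
DATA_BYTES_PER_FRAME = 24
SECTOR_LAYOUT = {"sync": 12, "header": 4, "user data": 2048, "system data": 288}

sector_bytes = FRAMES_PER_SECTOR * DATA_BYTES_PER_FRAME
assert sum(SECTOR_LAYOUT.values()) == sector_bytes == 2352

def field_offsets(layout: dict) -> dict:
    """Map each field name to its (start, end) byte offsets within a sector."""
    offsets, position = {}, 0
    for name, size in layout.items():
        offsets[name] = (position, position + size)
        position += size
    return offsets

max_sectors = 333_000
capacity = max_sectors * SECTOR_LAYOUT["user data"]
print(field_offsets(SECTOR_LAYOUT))
print(capacity, round(capacity / 2**20, 1))    # 681984000 650.4
```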
The CD-ROM reader is not compatible with the audio CD player. It has the same laser, modulation,
and error correction circuits, but instead of converting the information read to analog, the CD-ROM reader
sends it to the computer as digital information in sectors of size 2K each. Today it is common to have a CD
player and CD-ROM reader combined in one product. It uses the same laser to read both types of discs,
and uses the Q subcode information to distinguish between them.
It is important to realize that the CD-ROM standard does not specify how to organize files on the
disc, or even where to place the directory. This is why a CD-ROM for a PC cannot normally be read by
a Macintosh. To overcome this, an additional standard has been developed by the High Sierra group—a
group of CD-ROM makers—that specifies the logical organization of files on the disc. CD-ROMs made to
this standard can be read by any operating system that has the High Sierra utility programs.
C.9 Recordable CDs (CD-R)
permanent; the CD-R cannot be erased. The device that “burns” the CD-R is called a CD recorder or a
CD-R drive (a popular name is a CD burner) and current drives can record CDs at speeds of 24X to 40X
(they can also read CDs, and do this at higher speeds).
Exercise C.2: How long does it take to burn a CD-R at 24X and 40X speeds?
High quality CD-Rs have a gold reflective layer, so they look golden when viewed from the top (the label
side). Low quality CD-Rs have an aluminum reflective layer. When viewed from the bottom, the green
layer of dye may look either light-green (the high quality CDs) or dark green (the low quality ones). Also,
high-quality CD-Rs have unique serial numbers, written both in digits and in a bar code, thereby making it
easy to identify and track them.
It’s important to realize that the dye is fairly sensitive to light; it has to be in order for a laser to
zap it fast. It is therefore important to avoid exposing CD-R discs to sunlight. The low-quality CD-Rs
are especially sensitive, but it is good practice to keep CD-Rs in opaque containers both before and after
recording data. Another consideration is cleanliness. The laser beam that writes the CD-R has to penetrate
through the thick plastic substrate, which is why scratches and especially fingerprints and dirt are much
more dangerous to a recordable CD before writing than afterward. The fingerprint, dirt, or smudge can
scatter the beam of the writing laser, possibly weakening it to the point where the burn it makes is too small
or too light to be read later. Thus, extra care must be taken in the handling of a recordable CD before any
data is written on it.
C.10 Summary
The discussion here has emphasized the main technical problems that had to be solved in order to achieve
both the high storage capacity and the extreme reliability of the CD. The submicroscopic manufacturing
details, and the complex circuits needed to read and decode the information at such high rates, make the
CD player (and CD-ROM reader) a very complex consumer electronic device.
C.11 DVD
DVD (digital versatile disc) is the next step in the CD saga. (The CD format was developed jointly by
Philips and Sony between 1980 and 1984; the DVD format was agreed on by a larger group of companies
in 1995.) We start with the name itself. Depending on who you listen to,
DVD means “digital versatile disc,” “digital video disc,” or it is not an acronym and does not stand for
anything.
DVD-ROM stands for DVD-Read Only Memory, a format (similar to the familiar CD-ROM) that is
defined by Book A of the DVD specification. DVD-Video is defined by Book B of the specification. Basically,
DVD-Video is DVD-ROM with an added application layer that limits its use to playing movies. Thus, a
DVD-Video disc is DVD-ROM, but a DVD-ROM disc is not always DVD-Video. Divx stands for Digital
Video Express. This is a DVD-Video with Triple DES encoding and a unique serial number added. This
limits its application. A Divx player plays DVD-Video, but a DVD player will not play a Divx disc. DVD-
Audio has not been defined yet, but when it is, it will appear in Book C of the DVD specification.
There are currently four competing proposals.
There are more DVD formats. DVD-R is DVD-Recordable, defined by Book D of the DVD specification.
This type is currently made by Pioneer. DVD-RW stands for DVD-Rewritable. This type is rewritable up
to 1,000 times. It has the same capacity as DVD-R, 3.95 Gbytes, and this is expected to go up to 4.7 Gbytes. A
DVD-R/W drive will record DVD-R and rewrite DVD-R/W. Also, DVD-R/W discs can be read by first-
generation DVD hardware. This backwards compatibility with old, existing players is, of course, a great
marketing idea. Notice that DVD-R/W is not an official DVD format. It has been presented to the DVD
Forum for ratification, but it faces competition from the already-accepted DVD-RAM, which isn’t readable
on anything but DVD-RAM.
The following paragraphs constitute a summary of DVD features and applications.
A TECHNICAL TRIUMPH
How is DVD different from CD? For greater data density, there are smaller pits, a more closely-spaced
track and a shorter-wavelength red laser. The error correction is more robust. And thanks to Sony technology,
the modulation scheme is more efficient. All this means that a standard DVD can hold 4.7 gigabytes of data—
an amazing seven times the data capacity of a current Compact Disc! So you’ll enjoy higher-resolution
pictures, more channels of digital sound, richer graphics, and far more multimedia. Not enough? Dual-layer
DVDs can hold more than twelve times the information of a CD on a single side. So there’s no need to turn
the disc over.
HOLLYWOOD SPECTACULAR
DVD is major news for anyone who owns a television and anyone who enjoys movies. Thanks to digital
video decoding technology, DVD delivers far and away the best color, sharpness and clarity in home video. In
fact, a DVD picture approaches the “D-1” TV studio production standard. And thanks to variable bit-rate
MPEG2 compression, it all fits easily onto a single side of a 4-3/4-inch disc. In fact, a single-layer DVD
can hold a two hour, 13 minute movie—with room to spare for Dolby AC-3(TM) discrete 5.1-channel digital
sound in your choice of three languages! Dual-layer, single-sided discs can hold movies more than four hours
long. And because DVD is an optical disc, you get instant access. You can play it repeatedly without wear
and tear. You’ll never need to rewind.
ULTIMATE MULTIMEDIA
Digital convergence is erasing the old distinctions between entertainment and information, recreation
and education. So it’s the perfect time for DVD-ROM. With up to 8.5 Gb capacity on a single side it’s
the optical disc that gives you more. More on-line capacity for software publishers. More room for large
databases. And high-quality full-motion video for better interactive games. DVD-ROM drives are also
speed readers. Even a standard DVD-ROM blasts along at higher data transfer rates than even the fastest
current CD-ROM. Since they’re able to play your existing CD-ROMs, DVD-ROM drives also respect the
past. With the recent development of DVD-Write Once and DVD-Rewritable, the DVD format is focused
on the future. Best of all, DVD is supported by the world’s leading hardware manufacturers and software
companies (Table C.3).
Huffman, D., (1952) “A Method for the Construction of Minimum Redundancy Codes,” Proceedings of the
IRE, 40,1098–1101, Sept.
Hyman, Anthony (1982) Charles Babbage: Pioneer of the Computer, Oxford, Oxford University Press.
IBM Corp (1963) The IBM System/360 Principles of Operation, form GA22-6821.
IBM Corp (1963a) IBM 7040 and 7044 Data Processing Systems, Student Text, IBM Form No. C22-6732.
Intel Corp (1975) Intel 8080 Microcomputer Systems User’s Manual, ref. 98-153c, Santa Clara, CA.
Intel Corp (1985) Introduction to the 80386, Order No. 231252-001, Santa Clara, CA.
Intel Corp (1987) Microprocessor and Peripheral Handbook, Vol. I, Order #230843-004, Santa Clara, CA.
IowaState (2000) is URL http://www.cs.iastate.edu/jva/jva-archive.shtml.
ispworld 2002 is URL http://www.ispworld.com/.
Juffa (2000) is URL ftp://ftp.math.utah.edu/pub/tex/bib/fparith.bib.
Kane, G. (1981) 68000 Microprocessor Handbook, Berkeley, CA, Osborne/McGraw-Hill,
Krol, E. (1994) The Whole Internet, Sebastopol, CA, O’Reilly and Assoc.
Leventhal, L. A. (1979) 6502 Assembly Language Programming, Berkeley, CA, Osborne/McGraw-Hill.
Lin, Shu, (1970) An Introduction to Error Correcting Codes, Englewood Cliffs, NJ, Prentice-Hall.
Linde, Y., A. Buzo, and R. M. Gray (1980) “An Algorithm For Vector Quantization Design,” IEEE Trans-
actions on Communications, COM-28:84–95, January.
List 1999 “The List: The Definitive ISP Buyer’s Guide,” http://thelist.internet.com/
Mano, Morris M. (1991) Digital Design, Englewood Cliffs, NJ, Prentice-Hall.
Mano, Morris M. (1997) Logic and Computer Design Fundamentals, Upper Saddle River, NJ, Prentice Hall.
Marks, Leo (1999) Between Silk and Cyanide: A Codemaker’s War 1941–1945, Free Press.
Matula, D., and P. Kornerup (1978) A Feasibility Analysis of Binary Fixed-Slash and Floating-Slash Number
Systems, Tech. Rep. CS 7818, Southern Methodist University, Dallas, TX.
Metcalfe, R. M. and D. R. Boggs (1976) “Ethernet: Distributed Packet Switching for Local Computer
Networks,” Communications of the ACM, 19,395–404, July.
Mokhoff, N. (1986) “New RISC Machines Appear as Hybrids with both RISC and CISC Features.”
Mollenhoff, Clark (1988) Atanasoff, Forgotten Father of the Computer, Ames, Iowa State University Press.
Osborne, A. (1978) An Introduction to Microcomputers, Vol. III, Berkeley, CA, Osborne and Assoc.
Osborne, A. and G. Kane (1981) 4 and 8 Bit Microprocessor Handbook, Berkeley, CA, Osborne/McGraw-Hill.
Patterson, D. A., and D. R. Ditzel (1980) “The Case for the RISC,” Computer Architecture News, 8(6)25–33,
Oct.
Patterson, D. (1985) “Reduced Instruction Set Computers,” Communications of the ACM, 28(1)8–21, Jan.
Press, W. H., B. P. Flannery, et al. (1988) Numerical Recipes in C: The Art of Scientific Computing,
Cambridge University Press. (Also available on-line from http://www.nr.com/.)
Radin, G. (1983) “The 801 Minicomputer,” IBM Journal of Research and Development, 27(3)237–246, May.
Ralston, A., ed. (1985) Encyclopedia of Computer Science and Engineering, 2nd Ed., Van Nostrand.
Ramabadran, Tenkasi V., and Sunil S. Gaitonde (1988) “A Tutorial on CRC Computations,” IEEE Micro
pp. 62–75, August.
References 371
Randell, B. ed. (1982) The Origin of Digital Computers, Berlin, Springer-Verlag.
RFC821 (1982) is http://www.cis.ohio-state.edu/htbin/rfc/rfc821.html.
Roja, Raul (2000) is URL http://hjs.geol.uib.no/html/zuse/zusez1z3.htm.
Rosin, R. F. (1969) “Contemporary Concepts of Microprogramming and Emulation,” Computer Surveys,
1(4)197–212, Dec.
RSA (2001) is http://www.rsasecurity.com/rsalabs/challenges/factoring/index.html.
Salisbury, Alan B. (1976) Microprogrammable Computer Architecture, NY, American Elsevier.
Savard, John (2001) is http://home.ecn.ab.ca/~jsavard/crypto/mi0604.htm.
Scott McCartney (1999) ENIAC: The Triumphs and Tragedies of the World’s First Computer, New York,
Walker and Company.
Seitz, C. L. (1985) “The Cosmic Cube,” Communications of the ACM 28(1)22–33.
Siegel, H. J., and McMillan, R. J. (1981) “The Multistage Cube,” IEEE Computer, 14(12),65–76, Dec.
Siewiorek, D., et al (1982) Computer Structures, McGraw-Hill.
Singh, Simon (1999) The Code Book, New York, Doubleday.
Shallit 2000 is URL http://www.math.uwaterloo.ca/~shallit/Courses/134/history.html.
Smith, David Eugene (1923) History of Mathematics, reprinted by Dover Publications 1958, volume 2, p. 514.
Smith, D. A. (1929) A Source Book in Mathematics, New York, NY, McGraw-Hill.
Smotherman (1999) is URL http://www.cs.clemson.edu/~mark/uprog.html.
Stevenson, D. (1981) “A Proposed Standard for Binary Floating-Point Arithmetic,” IEEE Computer, 14(3)51-
62.
Tabak, D. (1987) RISC Architecture, New York, John Wiley.
Watson, I., and J. Gurd (1982) “A Practical Data Flow Computer,” IEEE Computer, 15(2),51–57, Feb.
Zimmermann, Philip (1995) PGP Source Code and Internals, MIT Press.
Zimmermann, Philip (2001) is URL http://www.philzimmermann.com/.
Ziv J. and A. Lempel (1977) “A Universal Algorithm for Sequential Data Compression,” IEEE Trans on
Information Theory, IT-23(3), pp. 337–343.
1.14: Yes. The program sends the content of an entire page to the printer, then has to check status and
wait until the printer has finished printing the page before it can send the next page. In practice, the user
program invokes a print manager (part of the operating system), and it is the print manager that sends the
output to the printer and checks status.
1.15: Register I/O is simple to use, and it requires a minimum amount of hardware. It makes sense to use
it in cases where only a few pieces of data need to be input or output; it also makes sense in cases where
computer time is inexpensive and can be carelessly spent. This is why register I/O is commonly used on
personal computers, but not on large, multiuser computers.
2.1: A JMP instruction in the direct mode is a good example. Such an instruction is long, because the
entire address (25–35 bits) is included in the instruction. It is executed by resetting the PC, so it is fast.
2.2: It is obvious, from Figure 2.2, that there can be 16 and 64 instructions of formats 3 and 4, respectively.
2.3: “LOD R5,#ARY”. This instruction uses the immediate mode.
2.4: In a computer using cascaded indirect, the assembler should have a special directive for this purpose.
2.5: Many stack operations use the element or two elements on top of the stack as their operands. Such an
instruction removes the operands from the stack and pushes the result back into the stack. If the programmer
wants to use the same operands in the future, they should stay in the stack, and an easy way to accomplish
this is to generate their copies before executing a stack operation. Example:
LOD #38 loads the element right below the top.
1. PUSH it becomes the new top.
LOD #51 loads the element right below the new top (the old top).
2. PUSH it becomes the new top.
3. ADD adds the two elements on the top and removes them.
The successive states of the stack are shown in Figure Ans.1.
51 51
38 38 38 89
2.6: The identity operation, where each bit is transformed into itself.
2.7: The password is used to select cover files C1, C2, and C3. They are XORed to obtain 1010, which is
XORed with F to yield 0011, which is XORed with one of the Cj’s, say C2, to produce a new C2 = 1101.
Once this is done, file F can be recovered (if P is known) by
2.10: No. The smallest mantissa is 0.5, so the smallest product of two normalized mantissas is $0.25 = 0.01_2$.
2.11: The exponents are equal, so the mantissas can be added directly. Adding the two fractions
$0.10\ldots0_2 = 0.5$ and $0.11\ldots0_2 = 0.75$ produces 1.25, or in binary $1.010\ldots0$. The sum has to be normalized by shifting it
to the right one position. To compensate, the exponent should be incremented by 1 but this causes overflow,
since the exponent is already the largest one that can be expressed in three magnitude bits. The result is
0|1000|101 . . . 0 and the V flag should be set.
2.12: We start with an example of the real division 5/2 = 2.5. We have to find a way to divide the integers
10 and 4 and end up with 5 (the integer representing 2.5). We already know that to obtain the right product
we have to divide by the resolution, so it is natural to try to obtain the right quotient by multiplying by the
resolution. We therefore try (10 · 2)/4 = 20/4 = 5. The rule for dividing fixed-point numbers is: Multiply
the dividend (10) by the resolution (2), then divide the result (which is greater than the dividend, since the
resolution is normally greater than 1) by the divisor (4).
Next, we try the real division 2.5/2. Applying the rule above to this case results in (5 · 2)/4 = 10/4 and
yields the integer 2, which is the representation of the fixed-point number 1. Obviously, the real division
2.5/2 should produce 1.25, but this number cannot be represented in our fixed-point system whose resolution
is 2. Therefore, the value 1.25 has to be truncated to 1.
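The rule above can be checked with a few lines of Python. This is only a minimal sketch, assuming a resolution of 2 as in the example and non-negative operands; the names `to_fixed` and `fixed_div` are illustrative and do not appear in the text.

```python
RESOLUTION = 2  # assumed, as in the example: one fixed-point unit represents 1/2


def to_fixed(x):
    """Represent the real number x as an integer count of 1/RESOLUTION units (truncating)."""
    return int(x * RESOLUTION)


def fixed_div(dividend, divisor):
    """Divide two fixed-point integers: multiply by the resolution, then integer-divide."""
    return (dividend * RESOLUTION) // divisor


q1 = fixed_div(to_fixed(5), to_fixed(2))    # (10 * 2) // 4 = 5, which represents 2.5
q2 = fixed_div(to_fixed(2.5), to_fixed(2))  # (5 * 2) // 4 = 2, which represents 1.0
print(q1, q1 / RESOLUTION)
print(q2, q2 / RESOLUTION)
```

The second quotient shows the truncation discussed above: the exact result 1.25 is not representable with resolution 2.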
2.13: The number of 4-bit codes where only 10 of the 16 combinations are used is
$$\frac{16!}{(16-10)!} = \frac{16!}{6!} \approx 29 \cdot 10^9.$$
There can be, of course, longer BCD codes, and the longer the code, the more codes are possible.
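The count for 4-bit codes is easy to confirm numerically. The short Python check below is a sketch that uses only the standard `math` module; the variable name `count` is illustrative.

```python
import math

# Ordered selections of 10 codewords out of the 16 possible 4-bit combinations.
count = math.factorial(16) // math.factorial(16 - 10)
print(count)  # 29059430400, about 29 * 10**9
```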
2.14: These codes are listed in Table Ans.2.
2.15:
Table Ans.3 shows how individual bits change when moving through the binary codes of the first 32
integers. The 5-bit binary codes of these integers are listed in the odd-numbered columns of the table, with
the bits of integer i that differ from those of i − 1 shown in boldface. It is easy to see that the least-significant
bit (bit $b_0$) changes all the time, bit $b_1$ changes for every other number, and, in general, bit $b_k$ changes every $2^k$
integers. The even-numbered columns list one of the several possible reflected Gray codes for these integers.
The table also lists a recursive Matlab function to compute RGC.
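For readers without Matlab, the same recursion can be sketched in Python. This is not the function listed in the table; it is an independent sketch of the usual reflect-and-prefix construction, and the name `rgc` is chosen here for illustration.

```python
def rgc(n):
    """Return the n-bit reflected Gray codewords as a list of bit strings."""
    if n == 0:
        return [""]
    prev = rgc(n - 1)
    # Prefix the (n-1)-bit list with 0, then its mirror image with 1.
    return ["0" + c for c in prev] + ["1" + c for c in reversed(prev)]


codes = rgc(5)
for i, code in enumerate(codes[:8]):
    print(format(i, "05b"), code)
```

Consecutive codewords in `codes` differ in exactly one bit, which is the defining property of a Gray code.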
2.16: Such an addition is really a subtraction. The (absolute value of the) result is less than either of the
two original operands. If each operand fits in a register, the result is certainly going to fit in a register, and
there can be no overflow.
3.1: It is possible to have programs that do not input any data. Such a program always performs the same
operations and always ends up with the same results. Examples are (1) a program that prints the ASCII
code table, (2) a program that calculates π to a certain, fixed precision, and (3) a program that generates
and plays a song on the computer’s speaker. Such programs are not very useful, though, and even they
376 Answers to Exercises
must generate output. A program that does not generate any output is useless, since there is no way to use
its results. The output of a program, of course, does not have to be printed or displayed. It can be stored
in memory or in registers, or it can be sent on a communications line to another device. An example is a
computer that controls traffic lights at an intersection. The outputs control the lights, and no printer or
monitor is used.
3.2: In principle, yes, but in practice an I/O device is normally assigned two or more select numbers, for
its status and data.
3.3: It is unimportant to know the character codes in any base. When we want to use the character A in a
computer, we can simply type A. In the few cases where we need to know the code of a character, perhaps
a control character, it is preferable to know it in binary. It turns out that it is much easier to translate
between binary and either octal or hex, than between binary and decimal.
3.4: Answer not provided.
3.5: Assuming that a file of n bits is given and that 0.9n is an integer, the number of files of sizes up to
0.9n is
$$2^0 + 2^1 + \cdots + 2^{0.9n} = 2^{1+0.9n} - 1 \approx 2^{1+0.9n}.$$
For n = 100 there are $2^{100}$ files and $2^{1+90} = 2^{91}$ can be compressed well. The ratio of these numbers is
$2^{91}/2^{100} = 2^{-9} \approx 0.00195$. For n = 1000, the corresponding fraction is $2^{901}/2^{1000} = 2^{-99} \approx 1.578 \cdot 10^{-30}$.
These are still extremely small fractions.
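The two fractions can be reproduced with exact integer arithmetic. The following Python sketch assumes, as the answer does, that 0.9n is an integer; the function name `compressible_fraction` is illustrative.

```python
from fractions import Fraction


def compressible_fraction(n):
    """Fraction of the 2**n files of n bits that can map to a distinct file of at most 0.9n bits."""
    short_files = 2 ** (1 + (9 * n) // 10) - 1  # 2^0 + 2^1 + ... + 2^(0.9n)
    return Fraction(short_files, 2 ** n)


f100 = float(compressible_fraction(100))
f1000 = float(compressible_fraction(1000))
print(f100)   # close to 2**-9
print(f1000)  # close to 2**-99
```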
3.6: The receiver will interpret the transmission as “01 00 111 110 10 0”, and will hiccup on the last zero.
3.7: A typical fax machine scans lines that are about 8.2 inches wide (≈ 208mm). A blank scan line thus
produces 1664 consecutive white pels.
3.8: There may be fax machines (now or in the future) built for wider paper, so the group 3 code was
designed to accommodate them.
3.9: Each scan line starts with a white pel, so when the decoder inputs the next code it knows whether it
is for a run of white or black pels. This is why the codes of Table 3.20 have to satisfy the prefix property in
each column but not between the columns.
3.10: The code of a run length of one white pel is 000111, and that of one black pel is 010. Two consecutive
pels of different colors are thus coded into 9 bits. Since the uncoded data requires just two bits (01 or 10),
the compression ratio is 9/2 = 4.5 (the compressed file is 4.5 times bigger than the uncompressed one).
3.11: The next step matches the space and encodes the string ‘e’.
sir␣sid|␣eastman␣easily ⇒ (4,1,‘e’)
sir␣sid␣e|astman␣easily␣te ⇒ (0,0,‘a’)
and the next one matches the ‘a’.
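The two encoding steps can be reproduced with a small sliding-window encoder. The Python code below is a simplified sketch, not the exact procedure of the text: it assumes a 16-character search buffer, an unbounded look-ahead buffer, and that the input is the string "sir sid eastman easily te" with ordinary spaces. The name `lz77_tokens` is illustrative.

```python
def lz77_tokens(text, window=16):
    """Return LZ77 tokens (offset, length, next_char) for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_off, best_len = 0, 0
        for cand in range(max(0, pos - window), pos):
            length = 0
            # Extend the match, but always leave one character for the token's third field.
            while (pos + length < len(text) - 1
                   and text[cand + length] == text[pos + length]):
                length += 1
            if length > best_len:
                best_off, best_len = pos - cand, length
        tokens.append((best_off, best_len, text[pos + best_len]))
        pos += best_len + 1
    return tokens


tokens = lz77_tokens("sir sid eastman easily te")
for token in tokens[:7]:
    print(token)
```

Under these assumptions the sixth and seventh tokens are (4, 1, 'e') and (0, 0, 'a'), in agreement with the two steps shown above.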
3.12: Figure Ans.4 shows two 32×32 matrices. The first one, a, has random (and therefore decorrelated)
values, and the second one, b, is its inverse (and therefore has correlated values). Their covariance matrices
are also shown, and it is obvious that matrix cov(a) is close to diagonal, whereas matrix cov(b) is far from
diagonal. The Matlab code for this figure is also listed.
[Figure Ans.4: four 32×32 grayscale panels labeled a, b, cov(a), and cov(b); both axes of each panel are marked from 5 to 30.]
a=rand(32); b=inv(a);
figure(1), imagesc(a), colormap(gray); axis square
figure(2), imagesc(b), colormap(gray); axis square
figure(3), imagesc(cov(a)), colormap(gray); axis square
figure(4), imagesc(cov(b)), colormap(gray); axis square
3.13: The Mathematica code of Figure Ans.5 yields the coordinates of the rotated points
(7.071, 0), (9.19, 0.7071), (17.9, 0.78), (33.9, 1.41), (43.13, −2.12),
(notice how all the y coordinates are small numbers) and shows that the cross-correlation drops from 1729.72
before the rotation to −23.0846 after it. A significant reduction!
p={{5,5},{6, 7},{12.1,13.2},{23,25},{32,29}};
rot={{0.7071,-0.7071},{0.7071,0.7071}};
Sum[p[[i,1]]p[[i,2]], {i,5}]
q=p.rot
Sum[q[[i,1]]q[[i,2]], {i,5}]
3.14: In an 8×8 template there should be 8 · 8/4 = 16 holes. The template is written as four 4×4 small
templates, and each of the 16 holes can be selected in four ways. The total number of hole configurations is
therefore $4^{16} = 4{,}294{,}967{,}296$.
3.15: For 12 November 2001, the weighted sum is
50 · 1 + 51 · 2 + 52 · 1 + 53 · 1 + 54 · 0 + 55 · 1 = 312
4.9: There is no difference. The individual phrases of a microinstruction can be written in any order. As
long as they are written on one line, the microassembler will assemble them into one, 31-bit microinstruction
with the individual fields in the right order.
4.10: This is simply a memory write operation where the content of R1 is written in m[x].
mar:=ir; mbr:=r1; wr;
wr;
4.11: Because there is no direct path from the MBR to the MAR. The MBR can be moved only to the
ALU and, from there, to the C bus. The MAR can be loaded only from the B latch. This is another aspect
of the microinstructions being low level. Even though we use a high-level notation for our microinstructions,
we cannot write anything that we want. We have to limit ourselves to existing registers and existing data
paths. Our hypothetical microassembler has to be able to identify invalid microinstructions.
4.12: Perhaps the simplest format for a “Load Index” instruction is obtained when we assume that the
index register is always R0. Such an instruction should have the usual 4-bit opcode, followed by a 12-bit
indx field. The microinstructions should add the index and R0, using the sum as the effective address.
Assuming that R0 has been preloaded with the number 5, the instruction ‘LODX R1,7’ would load R1 with
the content of memory location 12. The microprogram is:
axl F’X;
axr FF’X; [ax=0FFF hex]
ax:=band(ax,ir); [ax becomes the 12 lsb of the ir]
ax:=ax+r0; [ax becomes the effective address]
mar:=ax; rd;
rd;
r1:=mbr;
Notice how similar it is to the microprogram for relative load.
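The effective-address computation performed by this microprogram can be mirrored in Python. This is only a sketch: the 4-bit opcode value and the memory content are hypothetical, chosen for illustration, and the function name `lodx` is not part of the text.

```python
MASK_12 = 0x0FFF  # the mask 0FFF hex built by axl/axr in the microprogram


def lodx(ir, r0, memory):
    """Return the value that a load-index instruction held in ir would load."""
    ax = ir & MASK_12   # ax becomes the 12 lsb of the ir
    ax = ax + r0        # ax becomes the effective address
    return memory[ax]   # mar:=ax; rd; r1:=mbr


OPCODE = 0b1010            # hypothetical 4-bit opcode, for illustration only
ir = (OPCODE << 12) | 7    # encodes 'LODX R1,7'
memory = {12: 99}          # hypothetical content of memory location 12
r1 = lodx(ir, r0=5, memory=memory)
print(r1)
```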
4.13: None. Obviously, the PC and IR should not be disturbed by the microprograms. Also, the machine
instructions may be using the SP and any of the four general-purpose registers at any time, so the
microinstructions should not disturb these registers either. A real computer has more than one auxiliary register
that is invisible to the machine instructions and that the control unit can use at any time. Also, the ALU
in a real computer has internal registers that it uses for temporary, intermediate results.
4.14: The ADD R1 instruction is easy to implement. It adds memory location m[x] to R1 using the direct
mode, but it is easy to modify the microprogram to use other modes.
mar:=ir; rd;
rd;
r1:=r1+mbr;
The “ADD R1,Ri” instruction requires more microinstructions, since Ri can be one of four registers.
The microprogram is a little similar to the one for “PUSH Ri”.
axl 00001000’B;
axr 0; [ax becomes the mask 0000100...0]
ax:=band(ax,ir); if Z goto r0r1; [isolate bit ir(11)]
r2r3: axl 00000100’B; [ir(11)=1, check ir(10)]
axr 0;
ax:=band(ax,ir); if Z goto r2;
r3: r1:=r1+r3; goto fech;
r2: r1:=r1+r2; goto fech;
r0r1: axl 00000100’B; [ir(11)=0, check ir(10)]
axr 0;
ax:=band(ax,ir); if Z goto r0;
r1: r1:=r1+r1; goto fech;
r0: r1:=r1+r0; goto fech;
4.15: When the computer starts, and also each time it is reset, a special hardware circuit resets many
components in the computer, among them the PC and the MPC. The microcode stops either when the
power to the computer is turned off, or when the HLT machine instruction is executed. The microprogram
for this instruction is simply one microinstruction that jumps to itself, thus, ‘l: jump l;’. The computer
is still running at full speed but it is effectively idle since no further machine instructions will be fetched. It
can be restarted by hitting the ‘reset’ button.
4.16: In principle, it is possible to have procedures in the microcode, but in practice it is better to write
the same sequence of microinstructions several times than to have the overhead associated with procedures.
Procedures illustrate another difference between code and microcode. Procedures are used in programs
because a program should be readable and easy to write. A microprogram, however, should satisfy one
important criterion, namely it should be fast. Calling a procedure and returning from it are operations that
require time, time which constitutes overhead associated with the procedure. This overhead can be tolerated
in a program, but in a microprogram, where every microsecond counts, it may be degrading performance
too much. This is why procedures are rarely used in microcode.
Procedures can be added to our microinstructions by adding two 1-bit fields ‘CALL’ and ‘RET’. A
microinstruction where ‘CALL’ is set to 1 saves the MPC in a special, auxiliary register PROC, then resets
the MPC to the ‘ADDR’ field. A microinstruction where ‘RET’ is set to 1 resets the MPC from register
PROC. These operations should take place in subcycle 4. They also require extra hardware (gates and buses)
to move eight bits between the MPC and PROC.
4.17: Yes. It is possible to also move one of the eight registers into the MAR. The microinstruction
‘r1=mbr:=r2; mar:=pc;’ does just that. It selects R2, moves it to the A bus, the A latch, through the ALU
and, from there, to the C bus (where it ends up in R1) and to the MBR. It also selects the PC, moves it to
the B bus, the B latch, and to the MAR.
4.18: The first approach that comes to mind is to isolate the 12-bit x field, save it in the AX register, and
write a microprogram with a loop. In each iteration, this microprogram should shift R1 one position to the
left, decrement AX by 1, and exit if the new AX is negative (by means of ‘if N goto...’). If AX is not
negative, the microprogram should loop again. The only trouble is that decrementing is done by adding a
−1, and the only way to get a −1 in our microprograms is to prepare it in the AX register. Since that register
is being used, our approach cannot be implemented. This is an example of a relatively simple microprogram
that requires two auxiliary registers (see also exercise 4.13).
4.19: Neither. They translate instructions from one language to another. The same is true for a compiler.
5.1: When generating the four-dimensional cube we have used a certain convention. All the nodes located
in the left three-dimensional subcube were numbered 0xxx, and those in the right subcube were numbered
1xxx. To obtain different numbers, this convention can be reversed. Note that the particular number of a
node does not matter as long as we realize that each bit position corresponds to one of the dimensions of
the cube.
5.2: It has $2^{12} = 4096$ (4K) nodes, each connected to 12 other nodes. The node number has 12 bits, each
corresponding to one of the 12 dimensions of the cube.
5.3: By changing the topology to a ring. The programs would also have to be generalized.
5.4: To calculate $\int_a^b f(x)\,dx$, the interval [a, b] is divided into many small subintervals. Each node calculates
the area under the function in a number of intervals determined by the host, sums the areas, adds the total
to those sent by its children, and sends the result to its parent.
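The division of labor in answer 5.4 can be imitated sequentially in Python. This is a sketch under several assumptions that are not in the text: each node receives exactly one subinterval, the nodes form a binary tree in which node k has children 2k+1 and 2k+2, and each node uses the midpoint rule. All function names are illustrative.

```python
import math


def node_integral(f, a, b, steps=1000):
    """Midpoint-rule area under f on [a, b], as computed by a single node."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def tree_integral(f, pieces, node=0):
    """Area computed by `node` plus the totals reported by its children."""
    total = node_integral(f, *pieces[node])
    for child in (2 * node + 1, 2 * node + 2):
        if child < len(pieces):
            total += tree_integral(f, pieces, child)
    return total  # this is what the node sends to its parent


def integrate(f, a, b, nodes=7):
    """The host's job: split [a, b] into one piece per node and collect the root's total."""
    w = (b - a) / nodes
    pieces = [(a + i * w, a + (i + 1) * w) for i in range(nodes)]
    return tree_integral(f, pieces)


result = integrate(math.sin, 0.0, math.pi)
print(result)  # close to 2
```

On a real parallel machine the calls for the children would run concurrently; here they run one after the other, which changes the timing but not the sum.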
5.5: The best way is to use a zero-dimensional subcube (just one node, typically #0). To take advantage of
the parallel architecture, however, one must find a problem that can be divided into identical subproblems.
5.6: Start with a comparand of 00 . . . 00 and a mask of 10 . . . 00. A search will tell whether any words in
memory have a zero on the left. If yes, then the smallest number has a zero in that position. Otherwise, it
has a one. Generate that value in the leftmost position of the comparand, shift the mask to the right, and
loop N times.
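The search for the smallest number can be simulated in Python, with a list standing in for the associative memory. The sketch below makes one assumption beyond the text: the mask is widened at each step, so that the bit positions already decided keep constraining the search, rather than being moved as a single bit. The name `find_smallest` is illustrative.

```python
def find_smallest(words, nbits):
    """Find the smallest word by one masked associative search per bit position."""
    comparand = 0
    mask = 0
    for pos in range(nbits - 1, -1, -1):  # leftmost bit first
        mask |= 1 << pos                  # include this position in the search
        # Associative search: does any word match the comparand (with a 0 here) under the mask?
        match = any((w & mask) == (comparand & mask) for w in words)
        if not match:
            comparand |= 1 << pos         # no word has a 0 here, so the minimum has a 1
    return comparand


smallest = find_smallest([13, 6, 9, 11], nbits=4)
print(smallest)
```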
5.7: By now it should be obvious that a good choice for the matching store is an associative memory. The
matching store is an ideal application for an associative memory since a quick search is important.
5.8: A FIFO queue. In such a queue, the items on top are the oldest, so searching top to bottom always
yields a match with the oldest candidate.
5.9: A recursive procedure. Each time a recursive procedure calls itself, the same instructions are executed.
To match properly, however, the tokens generated by each call should be different.
6.1: A recursive procedure calling itself repeatedly because of a bug.
6.2: Because it is transparent to the user.
7.1: An OR gate outputs a 1 when any or all of its inputs are 1’s. In fact, the only case where it outputs a
0 is when all its inputs are 0’s. An XOR gate is similar but it excludes the case where more than one input
is 1. Such a gate outputs a 1 when one of its inputs, and only one, is a 1.
7.2: Figure Ans.6 shows how NOR gates can be combined to produce the outputs of NOT, OR, and AND.
[Figure Ans.6: The universality of NOR. Three panels show NOR gates wired as an inverter (Inverse), an OR gate, and an AND gate.]
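The three constructions can be verified exhaustively in a few lines of Python. This is a sketch in which bits are the integers 0 and 1; the function names are illustrative.

```python
def nor(a, b):
    """A two-input NOR gate on bits 0/1."""
    return int(not (a or b))


def not_(a):
    return nor(a, a)               # a NOR with both inputs tied together


def or_(a, b):
    return not_(nor(a, b))         # NOR followed by an inverter


def and_(a, b):
    return nor(not_(a), not_(b))   # NOR of the inverted inputs (De Morgan)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, not_(a), or_(a, b), and_(a, b))
```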
7.3: Figure Ans.7 shows how an XOR gate is constructed from NAND, OR, and AND gates. Simply
replace the OR and AND gates with NAND gates, or replace all three with NOR gates.
[Figure Ans.7: an XOR circuit with inputs A and B and output A⊕B.]
7.4: The last output, Q3, goes high after four input pulses.
7.5: Figure Ans.8 shows this design. The idea is to keep both inputs J and K permanently high.
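[Figure Ans.8: three JK flip-flops with outputs Q1, Q2, and Q3 and clock inputs CK; the line labeled Input feeds the first clock input, and the J and K inputs are tied to High.]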
7.6: Generalizing the approach shown in Figure 7.14, we (1) select the smallest n such that $2^{n-1} < N \le 2^n$,
(2) construct an n-bit (modulo-$2^n$) binary counter, (3) select all the Qi outputs that go high at the Nth step and
connect them to an AND gate, and (4) use the output of that gate to reset the counter. Another approach
is to connect a decoder to the outputs of the counter, and use the N th decoder output line to reset the
counter.
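The first approach of answer 7.6 can be simulated in Python to see that the counter indeed cycles through N states. The sketch below models the flip-flop outputs as the bits of an integer and assumes that the reset acts immediately; the name `modulo_counter` and the variable `taps` are illustrative.

```python
def modulo_counter(N, pulses):
    """Simulate an n-bit binary counter that an AND gate resets when it reaches N."""
    n = max(1, (N - 1).bit_length())                 # smallest n with 2**(n-1) < N <= 2**n
    taps = [i for i in range(n) if (N >> i) & 1]     # the outputs Qi that are high at step N
    count, states = 0, []
    for _ in range(pulses):
        count = (count + 1) % (2 ** n)               # the free-running n-bit counter
        if taps and all((count >> i) & 1 for i in taps):  # the AND gate fires
            count = 0                                # reset the counter
        states.append(count)
    return states


states = modulo_counter(5, 12)
print(states)
```

When N is a power of 2, `taps` is empty and the counter simply wraps around by itself, so no reset gate is needed.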
7.7: This is done by triggering the T flipflop on a rising clock edge (Figure 7.15b).
7.8: Yes. All that is needed is a set of AND gates connecting the outputs of B to the inputs of A. These
gates can be fabricated regardless of whether A can be transferred to B or not.
7.9: The number c of control lines should satisfy $k = 2^c$, so the number of control lines needed is $c = \log_2 k$.
7.10: This is shown in Figure Ans.9. Notice how the two control lines are decoded, so only one of the four
AND gates is selected.
[Figure Ans.9: a circuit with inputs Ai, Bi, Ci, and Di, output Oi, and a control decoder.]
7.11: Table Ans.10 is the truth table of this device. It is clear, from this table, that the three outputs are
defined by
Y2 = I4 + I5 + I6 + I7 ,
Y1 = I2 + I3 + I6 + I7 ,
Y0 = I1 + I3 + I5 + I7 .
Each of the three outputs is the output of an OR gate. Each of the three gates has four inputs taken from
among the eight input lines Ii .
I0 I1 I2 I3 I4 I5 I6 I7
Y2 0 0 0 0 1 1 1 1
Y1 0 0 1 1 0 0 1 1
Y0 0 1 0 1 0 1 0 1
Table Ans.10: A truth table for an encoder
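The three OR expressions can be checked against the truth table with a short Python sketch. It assumes, as the table does, that exactly one input line is high at a time; the name `encode` is illustrative.

```python
def encode(inputs):
    """8-to-3 encoder: inputs is the list [I0, ..., I7] with exactly one 1."""
    I = inputs
    y2 = I[4] | I[5] | I[6] | I[7]
    y1 = I[2] | I[3] | I[6] | I[7]
    y0 = I[1] | I[3] | I[5] | I[7]
    return y2, y1, y0


for k in range(8):
    one_hot = [1 if i == k else 0 for i in range(8)]
    print(k, encode(one_hot))
```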
7.12: Yes, it can have fewer than $2^n$ inputs. An encoder with three outputs generally has eight inputs. If
it happens to have, say, five inputs, then only five of the eight possible outputs can be obtained.
8.1: Destructive memory must have hardware that reconstructs a word after it’s been read.
8.2: The address decoder should be enabled each time memory is accessed. This can be achieved by
connecting both the R and W lines to the decoder’s “enable” input through an OR gate. The R line should
also be connected to all the BCs (in the same way the R/W line is connected in Figure 8.2a), so they receive
a 1 for a memory read and 0 otherwise.
8.3: The 20 address lines needed by a 1M memory can be divided into four sets of five lines each. Each
set is decoded by a 5-to-32 decoder, so the total number of decoder outputs is 4×32 = 128. Each BC receives
four select lines, so it should have AND gates with four inputs.
8.4: A good example of such an application is the control store of a new computer. When a new, micropro-
grammable computer is developed, the microprograms should be written, debugged, and stored permanently
in the control store. The best candidate for the control store is ROM. However, while the microprograms
are being debugged, that ROM has to be modified several times, suggesting the use of EPROM.
8.5: Yes. This would require a 4K×8 ROM where only 96 of the 4096 locations are actually used.
8.6: Address 1010 has 11. Addresses 0001 and 0101 contain 01, and addresses 0000, 0001, 0010, 0011
contain 10.
9.1: The design is shown in Figure Ans.11.
[Figure Ans.11: a logic circuit whose signals are labeled A, B, T, X, Y, and Z.]
9.2: This observation is based on the fact that 1 + x + x̄ equals either 1 + 1 + 0 or 1 + 0 + 1. In either case
the sum bit is 0 and there is a carry bit. Once this is understood, it is easy to see that the sum
xx . . . x10 . . . 0
+
x̄x̄ . . . x̄10 . . . 0
     (a)                         (b)
       11010                       01100
     × 01100                     × 11010
     -------                     -------
       00000                       00000
      00000                       01100
   1111010                       00000
   111010                       01100
   00000                       01100
  ----------                  ----------
  1110111000                  0100111000
                            − 0110000000   subtract ($2^5 \times 01100_2$)
                              ----------
                              1110111000
Table Ans.12: Multiplying signed integers
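Method (b) of Table Ans.12 can be reproduced in Python. This is a sketch that assumes two's complement operands given as bit strings; the helper names `to_signed`, `multiply_b`, and `twos` are illustrative.

```python
def to_signed(bits):
    """Value of a two's complement bit string."""
    value = int(bits, 2)
    return value - (1 << len(bits)) if bits[0] == "1" else value


def multiply_b(multiplicand, multiplier):
    """Treat the multiplier as unsigned, then correct the product for its sign bit."""
    n = len(multiplier)
    m = to_signed(multiplicand)
    product = m * int(multiplier, 2)   # the sum of the unsigned partial products
    if multiplier[0] == "1":
        product -= m << n              # subtract 2**n times the multiplicand
    return product


def twos(value, width):
    """Two's complement bit string of value in the given width."""
    return format(value & ((1 << width) - 1), "0{}b".format(width))


p = multiply_b("01100", "11010")       # 12 times -6
print(p, twos(p, 10))
```

The intermediate sum is 312 (0100111000), and subtracting 384 (0110000000) yields −72, the same 10-bit pattern 1110111000 that closes both columns of the table.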
9.8: An integer with alternating bits, such as 0101 . . . 01. Table 9.12 implies that the compact representation
of this multiplier is 11̄11̄ . . . 11̄. This multiplier does not reduce the number of partial products, and there is
also the overhead associated with two’s complementing the multiplicand.
9.9: Yes, but the overhead and the complex algorithm more than cancel out any gains from the shorter
loop.
10.1: Suppressing the listing file makes sense if a listing exists from a previous assembly. Also, if a printer
is not immediately available, the user may want to save the listing file and to print it later.
10.2: In many higher-level languages, a semicolon is used to indicate the end of a statement, so it seems a
good choice to indicate the end of an instruction. Also, we will see that many other characters already have
special meaning to the assembler.
10.3: The F indicates a floating-point operation, so this is a floating-point add.
10.4: The number of registers should be of the form $2^k$. This implies that a register number is k bits long,
and fits in a k-bit field in the instruction. A choice of 20 registers would mean a 5-bit register number—and
thus a 5-bit field in many instructions—but that field would not be fully utilized since with five bits it is
possible to select one of 32 registers.
10.5: To indicate an index register and a base register (these were special registers used for addressing;
they are not discussed in this book).
10.6: An addressing mode should be used to assemble the instruction. Addressing modes are discussed in
Section 2.3.
10.7: The symbol is simply undefined, an error situation that’s discussed, among other errors, at the end
of this chapter.
10.8: Certain characters, such as ‘–’ and ‘_’ allow for a natural division of the name into easy-to-read
components. Other characters, such as ‘$’ and ‘=’ make labels more descriptive. Examples: NO_OF_FONTS
is more readable than NoOfFonts. REG=DATA is more descriptive than RegEqData.
10.9:
a. type
     address=0..4095;   { an address in a 4K memory }
     node=record
       info: address;
       next: ^node
     end;
     list=^node;   { the type ‘list’ is a pointer to the beginning of such a list }
b. const
     lim=500; max=1000;   { some suitable constants }
   type
     list=0..lim;   { type ‘list’ is the index of the first list element in array house }
   var
     house: array[1..lim] of 0..max;   { lists of pointers are housed here }
     house2: array[1..lim] of 0..lim;   { pointers to the next element inside each list are housed here }
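
To see how part b works, here is a minimal sketch of traversing a list housed in two parallel arrays. It assumes that index 0 marks the end of a list (the arrays are indexed from 1, so 0 is free to play that role); the sample data and the name walk are made up for illustration.

```python
NIL = 0  # assumed end-of-list marker; valid indexes start at 1


def walk(first, house, house2):
    """Collect the items of the list whose first element is at index 'first'."""
    items = []
    i = first
    while i != NIL:
        items.append(house[i])  # the pointer stored in this list element
        i = house2[i]           # index of the next element of the same list
    return items


# Slot 0 is unused so that indexes match the Pascal declaration [1..lim].
house = [None, 700, 12, 345, 0]
house2 = [None, 3, 0, 2, 0]
print(walk(1, house, house2))  # elements 1 -> 3 -> 2
```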
10.10: Using logical operations and, perhaps, shifts to mask the rest of the instruction.
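
A minimal sketch of the idea follows. The 16-bit instruction layout (4-bit opcode, 4-bit register, 8-bit address) is an assumption made up for this example and does not describe any particular computer.

```python
def extract_field(word, shift, width):
    """Shift the field down to bit 0, then mask off everything above it."""
    mask = (1 << width) - 1
    return (word >> shift) & mask


instr = 0xA37F  # hypothetical instruction word

opcode = extract_field(instr, 12, 4)   # bits 15..12
register = extract_field(instr, 8, 4)  # bits 11..8
address = extract_field(instr, 0, 8)   # bits 7..0
print(hex(opcode), hex(register), hex(address))
```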
10.11: Yes, it is valid, its value is −11 and its type is absolute. It simply represents a negative distance.
10.12: External symbols, and the way they are handled by the loader, are discussed in texts on systems
programming. In the above example, the assembler calculates as much of the expression as it can (A-B) and
generates two modify loader directives, one to add the value of K and the other, to subtract L, both executed
at load time.
10.13: Address 24, because of the earlier phrasing “. . . with the name ‘1’ and a value ≥ 17 . . .”
10.14: ‘JMP *’ means jump to the current instruction. Such an instruction jumps to itself and thus causes
an infinite loop. In the early microprocessors, this was a common way to end the program, since many of
those microprocessors did not have a HLT instruction.
‘JMP *-*’ means jump to location 0. However, since the LC symbol ‘*’ is relative, the expression *-*
is rel − rel and is therefore absolute. The instruction would jump to location 0 absolute, not to the first
location in the program.
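
The type rule used here (rel − rel is absolute) can be sketched as a small lookup table. The remaining entries are the rules commonly found in relocating assemblers and are stated as an assumption; a particular assembler may differ.

```python
# Type of 'left op right', where each operand is absolute or relative.
# Combinations not listed (rel + rel, abs - rel) are treated as invalid.
TYPE_RULES = {
    ("abs", "+", "abs"): "abs",
    ("abs", "-", "abs"): "abs",
    ("rel", "+", "abs"): "rel",
    ("abs", "+", "rel"): "rel",
    ("rel", "-", "abs"): "rel",
    ("rel", "-", "rel"): "abs",  # a distance between two relative addresses
}


def expr_type(left, op, right):
    """Return 'abs', 'rel', or 'invalid' for a two-operand expression."""
    return TYPE_RULES.get((left, op, right), "invalid")


print(expr_type("rel", "-", "rel"))  # the case of '*-*'
print(expr_type("rel", "+", "rel"))
```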
10.15: Because the USE is supposed to have the name, not the value, of an LC, in its operand field.
10.16: Yes. The only problem is that the loader needs to be told where to start the execution of the entire
program. This, however, is specified in the END directive and is a standard feature, used even where no
multiple LCs are supported.
10.17: It is identical to:
JMP TMP
.
.
.
TMP: DC *
The computer will branch to the location labeled TMP but that location contains an address, not an
instruction. The likely result will be an interrupt (invalid opcode).
10.18: It depends on the characters allowed. The ASCII codes of the characters ‘<’, ‘=’, ‘>’, ‘?’, ‘@’
immediately precede the code of ‘A’. Similarly, the codes of ‘[’, ‘\’, ‘]’ immediately follow ‘Z’ in the ASCII
sequence. If those codes are used, then it is still easy to use buckets. Given the first character of a symbol
name, we only need to subtract from it the ASCII code of the first of the allowed characters, say ‘<’, to get
the bucket number. If other characters are allowed, then buckets may not be a good data structure for the
symbol table.
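
The bucket computation can be sketched as follows, assuming that the allowed first characters are exactly the contiguous ASCII range from ‘<’ to ‘]’. The names are ours.

```python
FIRST_ALLOWED = "<"  # lowest ASCII code among the allowed first characters
LAST_ALLOWED = "]"   # highest ASCII code among them


def bucket_number(symbol):
    """Bucket index derived from the first character of a symbol name."""
    first = symbol[0]
    if not FIRST_ALLOWED <= first <= LAST_ALLOWED:
        raise ValueError(f"symbol may not start with {first!r}")
    return ord(first) - ord(FIRST_ALLOWED)


NUM_BUCKETS = ord(LAST_ALLOWED) - ord(FIRST_ALLOWED) + 1

for name in ("<TMP", "ALPHA", "ZETA", "]END"):
    print(name, bucket_number(name))
```

The range covers 34 consecutive codes, so 34 buckets suffice and no search is needed to find the right one.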
11.1: Either make the label a parameter, use local labels (see Chapter 10), or use automatic label generation.
11.2: Generally not. In the above example, it makes no sense for the assembler to substitute X for B,
since it does not know if this is what the programmer wants. However, there are two cases where double
substitution (or even multiple substitution) makes sense. The first is where an argument is the name of a
macro. The second is the case of an assembler where parameters are identified syntactically, with an ‘&’
or some other character. The first case is called nested macro expansion and is discussed in Section 11.6.1.
The second case occurs in the IBM 360 where parameters must start with an ‘&’ and are therefore easy to
identify. The 360 assembler performs multiple substitution until no ‘&’ are left in the source line.
11.3: It depends on the MDT organization and on the size of the data structures used for binding. Typically,
the maximum number of parameters ranges between a few tens and a few hundreds.
11.4: If the last argument is null, it should be preceded by a comma. A missing last comma indicates a
missing argument.
11.5: Add another pass (pass −1?) to collect all macro definitions and store them in the MDT. Pass 0
would now be concerned only with macro expansions.
11.6: Nested macro definition. In such a case, each expansion of certain macros causes another macro
definition to be stored in the MDT, and space in the MDT may be exhausted very rapidly.
11.7: All three parameters are bound to the same argument and are therefore substituted by the same
string.
11.8: To retain the last binding and use default values. Thus, P2 is first bound to MAN and then to DAN.
Since P3 is not bound to any argument, it should get a default value.
11.9: The macro processor would continue scanning the source, reading and assigning more and more text
to the second argument, until one of the following happens:
1. It finds a period-space combination somewhere in the source. This would terminate the scan and
would bind the second parameter to a (long) argument.
2. It gets to the end of the source and realizes that something went wrong.
3. It finds a character (rather, a token) in a context that’s invalid inside a macro argument. In the case
of TeX, such a token could be the start of a new paragraph.
In cases 2 and 3, the macro processor would issue a ‘runaway argument’ error message, and either terminate
the macro expansion or give the user a chance to correct the source file interactively.
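
The three outcomes can be sketched in a few lines. This is a simplified model, not the algorithm of any real macro processor: the delimiter is a period followed by a space, a blank line stands for the start of a new paragraph, and the names are hypothetical.

```python
class RunawayArgument(Exception):
    """Raised when the scan cannot find the end of the argument."""


def scan_delimited_argument(source, start, delimiter=". "):
    """Return (argument, position after the delimiter), scanning from 'start'."""
    end = source.find(delimiter, start)
    paragraph = source.find("\n\n", start)
    # Case 3: a new paragraph starts before the delimiter is seen.
    if paragraph != -1 and (end == -1 or paragraph < end):
        raise RunawayArgument("paragraph ended before the argument was complete")
    # Case 2: the end of the source is reached.
    if end == -1:
        raise RunawayArgument("end of source reached")
    # Case 1: the delimiter is found, possibly far from where it was intended.
    return source[start:end], end + len(delimiter)


text = "first part, and much more text than intended. The rest"
argument, position = scan_delimited_argument(text, 0)
print(repr(argument), repr(text[position:]))
```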
11.10: The difference is in scope. Attributes of macro arguments exist only while the macro is expanded.
Attributes of a symbol exist while the symbol is stored in the symbol table.
11.11: The programmer probably meant a default value of null for parameter M2. However, depending on
the syntax rules of the assembler, this may also be considered an error.
11.12: It should be, since it alerts the reader to the fact that some of the listing is suppressed.
11.13: Because it allows the implementation of recursive macros.
11.14: Yes, many assemblers support simple expressions in the AIF directive.
11.15: A constant, a SET symbol, or a macro argument.
11.16: No, one ‘ENDM X’ is enough. It signals to the assembler the end of macro X and all its inner
definitions.
11.17: The string ‘A’[AY](AYX).
C.1: The capacities are shown in the table

                    Mode 1         Mode 2
  Sectors × bytes   345000×2048    345000×2336
  Bytes             706,560,000    805,920,000
  Mbytes            673.828        768.585

Notice that mega is defined as 2^20 = 1,048,576, whereas a million equals 10^6. The playing time (at 75
sectors per second) is 76 min, 40 s in either mode.
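
The figures can be reproduced with a short calculation. The constants are those given in the answer; the function names are ours.

```python
SECTORS = 345_000
SECTORS_PER_SECOND = 75
MEGA = 2**20  # 1,048,576, not one million


def capacity_bytes(bytes_per_sector):
    """Total capacity of the disc for the given sector size."""
    return SECTORS * bytes_per_sector


def playing_time():
    """Playing time as (minutes, seconds)."""
    return divmod(SECTORS // SECTORS_PER_SECOND, 60)


for mode, size in (("Mode 1", 2048), ("Mode 2", 2336)):
    total = capacity_bytes(size)
    print(mode, f"{total:,} bytes", f"{total / MEGA:.3f} Mbytes")

minutes, seconds = playing_time()
print(f"{minutes} min, {seconds} s")
```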
C.2: A full CD contains 700 Mbytes, which is the equivalent of 80 minutes. At 24X speed it therefore takes
80/24 = 3.33 minutes and at 40X it takes 80/40 = 2 minutes to record. In addition, the drive needs a few
seconds to finish up the CD and (optionally) 2–3 minutes to verify the recording by comparing the new CD
with the original data.
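
The same division works for any drive speed. This small sketch covers only the recording itself and ignores the time needed to finish up and verify the CD; the function name is hypothetical.

```python
def recording_minutes(speed, full_cd_minutes=80):
    """Minutes needed to record a full CD at the given speed multiple."""
    return full_cd_minutes / speed


for speed in (1, 24, 40):
    print(f"{speed}X: {recording_minutes(speed):.2f} minutes")
```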