Computer-Architecture Hari Aryal Ioe
Chapter – 1
Introduction
1.1 Computer Organization and Architecture
Computer Architecture refers to those attributes of a system that have a direct impact on
the logical execution of a program. Examples:
o the instruction set
o the number of bits used to represent various data types
o I/O mechanisms
o memory addressing techniques
Computer Organization refers to the operational units and their interconnections that
realize the architectural specifications. Examples are things that are transparent to the
programmer:
o control signals
o interfaces between computer and peripherals
o the memory technology being used
So, for example, the fact that a multiply instruction is available is a computer architecture
issue. How that multiply is implemented is a computer organization issue.
[Figure: computer and CPU structure — the computer comprises the CPU, main memory, I/O, and the systems interconnection, with peripherals attached; the CPU comprises the ALU, registers, the control unit, and an internal CPU bus; the control unit contains sequencing logic, control unit registers, and bus control logic.]
Speculative execution: Using branch prediction and data flow analysis, some
processors speculatively execute instructions ahead of their actual appearance in
the program execution, holding the results in temporary locations.
• Performance Mismatch
Processor speed increased
Memory capacity increased
Memory speed lags behind processor speed
The figure below depicts the history: while processor speed and memory capacity have grown
rapidly, the speed with which data can be transferred between main memory and the processor
has lagged badly.
The effects of these trends are shown vividly in the figure below. The amount of main memory
needed is going up, but DRAM density is going up faster, so the number of DRAM chips per
system is going down.
Solutions
Increase number of bits retrieved at one time
o Make DRAM “wider” rather than “deeper” to use wide bus data paths.
Change DRAM interface
o Cache
Reduce frequency of memory access
o More complex cache and cache on chip
Increase interconnection bandwidth
o High speed buses
o Hierarchy of buses
• Execute Cycle
o Processor interprets instruction and performs required actions, such as:
Processor - memory
o data transfer between CPU and main memory
Processor - I/O
o Data transfer between CPU and I/O module
Data processing
o Some arithmetic or logical operation on data
Control
o Alteration of sequence of operations
o e.g. jump
Combination of above
• The old contents of AC and the contents of location 941 are added and the result is stored
in the AC.
• The next instruction (2941) is fetched from location 302 and the PC is incremented.
• The contents of the AC are stored in location 941.
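The load/add/store sequence above comes from a hypothetical machine with a 4-bit opcode and a 12-bit address field. A minimal sketch, assuming the opcode values this example uses (1 = load AC from memory, 2 = store AC to memory, 5 = add memory word to AC):

```python
# Toy simulator of the hypothetical machine in the example above.
# Assumed instruction layout: 4-bit opcode, 12-bit address.
def run(memory, pc):
    ac = 0
    while pc in memory:
        instr = memory[pc]                  # fetch
        pc += 1                             # PC incremented during fetch
        opcode, addr = instr >> 12, instr & 0xFFF
        if opcode == 0x1:                   # load AC from memory
            ac = memory[addr]
        elif opcode == 0x2:                 # store AC to memory
            memory[addr] = ac
        elif opcode == 0x5:                 # add memory word to AC
            ac = (ac + memory[addr]) & 0xFFFF
        else:
            break                           # halt on unknown opcode
    return memory

mem = {0x300: 0x1940, 0x301: 0x5941, 0x302: 0x2941, 0x940: 3, 0x941: 2}
run(mem, 0x300)
print(hex(mem[0x941]))  # location 941 now holds 3 + 2 = 5
```

After the three instructions at 300-302 execute, the sum has overwritten location 941, exactly as the bullet list describes.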
Interrupts:
• Mechanism by which other modules (e.g. I/O) may interrupt normal sequence of
processing
• Program
o e.g. overflow, division by zero
• Timer
o Generated by internal processor timer
o Used in pre-emptive multi-tasking
• I/O
o from I/O controller
• Hardware failure
o e.g. memory parity error
• Instruction Cycle
o An interrupt cycle is added to the instruction cycle
o Processor checks for interrupt
Indicated by an interrupt signal
o If no interrupt, fetch next instruction
o If interrupt pending:
Suspend execution of current program
Save context
Set PC to start address of interrupt handler routine
Process interrupt
Restore context and continue interrupted program
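The checklist above can be sketched as a toy loop; the instruction and handler names are purely illustrative, and "context" is reduced to just the PC:

```python
# Sketch of the instruction cycle extended with an interrupt check.
def instruction_cycle(program, handler, interrupt_at):
    pc, trace, saved = 0, [], None
    while pc < len(program):
        trace.append(program[pc])     # fetch + execute (abstracted)
        pc += 1
        if pc == interrupt_at:        # interrupt pending?
            saved = pc                # save context (just the PC here)
            for step in handler:      # PC set to handler; process interrupt
                trace.append(step)
            pc = saved                # restore context, continue program
    return trace

print(instruction_cycle(["i0", "i1", "i2"], ["isr0", "isr1"], 2))
# → ['i0', 'i1', 'isr0', 'isr1', 'i2']
```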
• Multiple Interrupts
o Disable interrupts (approach #1)
Processor will ignore further interrupts whilst processing one interrupt
Interrupts remain pending and are checked after first interrupt has been
processed
Interrupts handled in sequence as they occur
o Define priorities (approach #2)
Low priority interrupts can be interrupted by higher priority interrupts
When higher priority interrupt has been processed, processor returns to
previous interrupt
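A toy model of approach #2: while one interrupt is being handled, a strictly higher-priority arrival preempts it, and the processor returns to the interrupted handler afterwards. Device names and priority values are illustrative:

```python
# Nested (prioritized) interrupt handling, approach #2 above.
def handle(name, prio, arrivals, log):
    log.append(f"start {name}")
    for other, other_prio in arrivals.get(name, []):
        if other_prio > prio:         # higher priority preempts this handler
            handle(other, other_prio, arrivals, log)
    log.append(f"end {name}")         # resume after nested handler returns

log = []
# While handling "printer" (priority 2), a "disk" interrupt (priority 4) arrives.
handle("printer", 2, {"printer": [("disk", 4)]}, log)
print(log)  # → ['start printer', 'start disk', 'end disk', 'end printer']
```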
• Memory Connection
o Receives and sends data
o Receives addresses (of locations)
o Receives control signals
Read
Write
Timing
• I/O Connection
o Similar to memory from computer’s viewpoint
o Output
Receive data from computer
Send data to peripheral
o Input
Receive data from peripheral
Send data to computer
o Receive control signals from computer
o Send control signals to peripherals
e.g. spin disk
o Receive addresses from computer
e.g. port number to identify peripheral
o Send interrupt signals (control)
• CPU Connection
o Reads instruction and data
o Writes out data (after processing)
o Sends control signals to other units
o Receives (& acts on) interrupts
• Data Bus
o Carries data
Remember that there is no difference between “data” and “instruction” at
this level
o Width is a key determinant of performance
8, 16, 32, 64 bit
• Address Bus
o Identify the source or destination of data
o e.g. CPU needs to read an instruction (data) from a given location in memory
o Bus width determines maximum memory capacity of system
e.g. 8080 has 16 bit address bus giving 64k address space
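The capacity claim above is just exponent arithmetic: an n-bit address bus selects among 2^n locations. A one-function sketch:

```python
# Address-space size implied by address-bus width.
def address_space(width_bits):
    return 2 ** width_bits

print(address_space(16))            # 65536 → the 8080's 64K address space
print(address_space(32) // 2**30)   # a 32-bit bus spans 4 GiB
```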
• Control Bus
o Control and timing information
Memory read
Memory write
I/O read
I/O write
Transfer ACK
Bus request
Bus grant
Interrupt request
Interrupt ACK
Clock
Reset
• Timing
o Co-ordination of events on bus
o Synchronous
Events determined by clock signals
Control Bus includes clock line
A single 1-0 clock pulse defines a bus cycle
All devices can read clock line
Usually sync on leading edge
Usually a single cycle for an event
• Bus Width
o Address: the width of the address bus has an impact on system capacity, i.e. a wider bus
means a greater range of locations that can be addressed.
o Data: the width of the data bus has an impact on system performance, i.e. a wider bus
means more bits transferred at one time.
• Data Transfer Type
o Read
o Write
o Read-modify-write
o Read-after-write
o Block
1.8 PCI
PCI is a popular high-bandwidth, processor-independent bus that can function as a mezzanine
or peripheral bus.
PCI delivers better system performance for high-speed I/O subsystems (graphics display
adapters, network interface controllers, disk controllers, etc.).
PCI is designed to support a variety of microprocessor-based configurations, including both
single- and multiple-processor systems.
It makes use of synchronous timing and a centralized arbitration scheme.
PCI may be configured as a 32- or 64-bit bus.
Current Standard
o up to 64 data lines at 33 MHz
o requires few chips to implement
o supports other buses attached to PCI bus
o public domain, initially developed by Intel to support Pentium-based systems
o supports a variety of microprocessor-based configurations, including multiple
processors
o uses synchronous timing and centralized arbitration
Note: Bridge acts as a data buffer so that the speed of the PCI bus may differ from that of the
processor’s I/O capability.
Note: In a multiprocessor system, one or more PCI configurations may be connected by bridges
to the processor’s system bus.
• Error lines
• Interrupt lines
o Not shared
• Cache support
• 64-bit Bus Extension
o Additional 32 lines
o Time multiplexed
o 2 lines to enable devices to agree to use 64-bit transfer
• JTAG/Boundary Scan
o For testing procedures
PCI Commands
• Transaction between initiator (master) and target
• Master claims bus
• Determine type of transaction
o e.g. I/O read/write
• Address phase
• One or more data phases
Chapter – 2
Central Processing Unit
The part of the computer that performs the bulk of data processing operations is called the
Central Processing Unit (CPU) and is the central component of a digital computer. Its purpose is
to interpret instructions received from memory and perform arithmetic, logic and control
operations with data stored in internal registers, memory words and I/O interface units. A CPU is
usually divided into two parts, namely the processor unit (Register Unit and Arithmetic Logic
Unit) and the control unit.
Processor Unit:
The processor unit consists of an arithmetic unit, a logic unit, a number of registers and internal
buses that provide data paths for the transfer of information between the registers and the
arithmetic logic unit. The block diagram of the processor unit is shown in the figure below, where
all registers are connected through common buses. The registers communicate with each other
not only for direct data transfers but also while performing various micro-operations.
Here two sets of multiplexers select the registers that supply input data to the ALU. A decoder
selects the destination register by enabling its load input. The function-select inputs of the ALU
determine the particular operation that is to be performed.
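The data path just described can be sketched in a few lines: two MUX selects pick the ALU's source registers, the function select picks the operation, and a decoder enables one register's load input. Register indices and the ALU function here are illustrative:

```python
# One cycle of the processor-unit data path described above.
def cycle(regs, sel_a, sel_b, dest, op):
    a, b = regs[sel_a], regs[sel_b]   # input MUXes pick the source registers
    result = op(a, b)                 # ALU function select
    regs[dest] = result               # decoder enables the destination's load
    return regs

regs = [0, 5, 7, 0]                        # R0..R3
cycle(regs, 1, 2, 3, lambda a, b: a + b)   # R3 <- R1 + R2
print(regs)  # → [0, 5, 7, 12]
```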
Compiled By: Er. Hari Aryal [haryal4@gmail.com] References: W. Stalling & M. Mano | 1
Computer Organization and Architecture Chapter 2 : Central Processing Unit
Control unit:
The control unit is the heart of the CPU. It consists of a program counter, an instruction register,
and timing and control logic. The control logic may be either hardwired or micro-programmed. If
it is hardwired, register decoders and a set of gates provide the logic that determines the action
required to execute various instructions. A micro-programmed control unit uses a control
memory to store microinstructions and a sequencer to determine the order in which the
instructions are read from control memory.
The control unit decides what the instructions mean and directs the necessary data to be moved
from memory to the ALU. It must communicate with both the ALU and main memory, and it
coordinates all activities of the processor unit, peripheral devices and storage devices. It can be
characterized, on the basis of design and implementation, by:
Defining basic elements of the processor
Describing the micro-operations that the processor performs
Determining the functions that the control unit must perform to cause the micro-operations
to be performed.
The control unit must have inputs that allow it to determine the state of the system and outputs
that allow it to control the behavior of the system.
Flag: flags are needed to determine the status of the processor and the outcome of previous
ALU operations.
Clock: All micro-operations are performed within each clock pulse. The period of this
clock is also called the processor cycle time or clock cycle time.
Control signals from control bus: The control bus portion of the system bus provides
interrupt and acknowledgement signals to the control unit.
Control signals within processor: These signals cause data transfers between registers and
activate ALU functions.
Control signals to control bus: These are signals to memory and I/O modules. All these
control signals are applied directly as binary inputs to individual logic gates.
Register Organization
Registers are at the top of the memory hierarchy. They serve two functions:
1. User-Visible Registers - enable the machine- or assembly-language programmer
to minimize main-memory references by optimizing use of registers
2. Control and Status Registers - used by the control unit to control the operation
Design Issues
Completely general-purpose registers or specialized use?
- Specialized registers save bits in instructions because their use can be implicit
- General-purpose registers are more flexible
- Trend is toward use of specialized registers
Number of registers provided?
- More registers require more operand specifier bits in instructions
- 8 to 32 registers appears optimum (RISC systems use hundreds, but are a
completely different approach)
Register Length?
- Address registers must be long enough to hold the largest address
- Data registers should be able to hold values of most data types
- Some machines allow two contiguous registers for double-length values
Automatic or manual save of condition codes?
- Condition restore is usually automatic upon call return
- Saving condition code registers may be automatic upon call instruction, or may be
manual
Data Flow
- Exact sequence depends on CPU design
- We can indicate sequence in general terms, assuming CPU employs:
a memory address register (MAR)
a memory buffer register (MBR)
a program counter (PC)
an instruction register (IR)
Fetch cycle data flow
- PC contains address of next instruction to be fetched
- This address is moved to MAR and placed on address bus
- Control unit requests a memory read
- Result is placed on the data bus, copied to MBR, and then moved to IR
- Meanwhile, PC is incremented
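The fetch sequence above, as a sketch using the four registers the text assumes (MAR, MBR, PC, IR):

```python
# Fetch-cycle data flow: PC -> MAR -> memory read -> MBR -> IR, then PC+1.
def fetch(memory, cpu):
    cpu["MAR"] = cpu["PC"]            # address of next instruction to MAR
    cpu["MBR"] = memory[cpu["MAR"]]   # memory read: result via data bus to MBR
    cpu["IR"] = cpu["MBR"]            # then moved to IR
    cpu["PC"] += 1                    # meanwhile, PC is incremented
    return cpu

cpu = {"PC": 0x300, "MAR": 0, "MBR": 0, "IR": 0}
fetch({0x300: 0x1940}, cpu)
print(hex(cpu["IR"]), hex(cpu["PC"]))  # → 0x1940 0x301
```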
Data are presented to the ALU in registers, and the results of an operation are stored in registers.
These registers are temporary storage locations within the processor that are connected by signal
paths to the ALU. The ALU may also set flags as the result of an operation; the flag values are
also stored in registers within the processor. The control unit provides the signals that control the
operation of the ALU and the movement of data into and out of the ALU.
[Figure: a 4X1 MUX with select lines S1, S0 and enable input Ei on the register's load.]
Example: Design a 2-bit ALU that can perform addition, AND, OR, & XOR.
[Figure: a 2-bit ALU — operand bits A0, B0 and A1, B1 feed full adders (carry chain Cin to
Cout) and logic gates; two 4X1 MUXes with select lines S1, S0 choose among the four
operations to produce Result0 and Result1.]
A computer usually has a variety of instruction code formats. It is the function of the control
unit within the CPU to interpret each instruction code and provide the necessary control
functions needed to process the instruction. An n-bit instruction with k bits in the address field
and m bits in the operation-code field can address 2^k locations directly and specify 2^m
different operations.
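The field arithmetic above can be checked with a small decoder sketch: the top m bits are the opcode, the low k bits the address.

```python
# Split an (m+k)-bit instruction into opcode and address fields.
def decode(instr, m, k):
    assert instr < 2 ** (m + k)       # instruction fits in m + k bits
    opcode = instr >> k               # top m bits: one of 2**m operations
    address = instr & (2**k - 1)      # low k bits: one of 2**k addresses
    return opcode, address

# A 16-bit instruction with m = 4, k = 12: 16 operations, 4096 addresses.
print(decode(0x1940, 4, 12))  # → (1, 2368)
```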
The bits of the instruction are divided into groups called fields.
The most common fields in instruction formats are:
o An Operation code field that specifies the operation to be performed.
o An Address field that designates a memory address or a processor
register.
o A Mode field that specifies the way the operand or the effective address is
determined.
Types of Instruction
Computers may have instructions of several different lengths containing varying
numbers of addresses.
The number of address fields in the instruction format of a computer depends on
the internal organization of its registers.
Most computers fall into one of 3 types of CPU organizations:
Single accumulator organization:- All operations are performed with an implied
accumulator register. For example: ADD X
General register organization:- The instruction format in this type of computer needs
three register address fields. For example: ADD R1,R2,R3
Stack organization:- All operations are performed on the top of the stack, so no address
field is needed. For example: ADD
Following are the types of instructions.
1. Three address Instruction
With this type of instruction, each instruction specifies two operand locations and a result
location. A temporary location T may be used to store an intermediate result so as not to
alter any operand location. The three-address instruction format requires a relatively
complex design to hold the three address references.
Format: Op X, Y, Z; X ← Y Op Z
Example: ADD X, Y, Z; X ← Y + Z
ADVANTAGE: It results in short programs when evaluating arithmetic
expressions.
DISADVANTAGE: The instructions require too many bits to specify 3
addresses.
Example: To illustrate the influence of the number of addresses on computer programs, we will
evaluate the arithmetic statement X = (A + B) * (C + D) using zero-, one-, two-, or three-address
instructions.
1. Three-Address Instructions:
ADD R1, A, B; R1 ← M[A] + M[B]
ADD R2, C, D; R2 ← M[C] + M[D]
MUL X, R1, R2; M[X] ← R1 * R2
It is assumed that the computer has two processor registers R1 and R2. The symbol M[A]
denotes the operand at memory address symbolized by A.
2. Two-Address Instructions:
MOV R1, A; R1 ← M[A]
ADD R1, B; R1 ← R1 + M[B]
MOV R2, C; R2 ← M[C]
ADD R2, D; R2 ← R2 + M[D]
MUL R1, R2; R1 ← R1 * R2
MOV X, R1; M[X] ← R1
3. One-Address Instruction:
LOAD A; AC ← M[A]
ADD B; AC ← AC + M[B]
STORE T; M[T] ← AC
LOAD C; AC ← M[C]
ADD D; AC ← AC + M[D]
MUL T; AC ← AC * M[T]
STORE X; M[X] ← AC
Here, T is the temporary memory location required for storing the intermediate result.
4. Zero-Address Instructions:
PUSH A; TOS ← A
PUSH B; TOS ← B
ADD; TOS ← (A + B)
PUSH C; TOS ← C
PUSH D; TOS ← D
ADD; TOS ← (C + D)
MUL; TOS ← (C + D) * (A + B)
POP X; M[X] ← TOS
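The zero-address program above can be run on a toy stack machine; the operand values are illustrative:

```python
# Toy stack machine executing the zero-address program above.
def run(program, memory):
    stack = []
    for op, *arg in program:
        if op == "PUSH":
            stack.append(memory[arg[0]])       # TOS <- operand
        elif op == "POP":
            memory[arg[0]] = stack.pop()       # M[X] <- TOS
        elif op == "ADD":
            stack.append(stack.pop() + stack.pop())
        elif op == "MUL":
            stack.append(stack.pop() * stack.pop())
    return memory

prog = [("PUSH", "A"), ("PUSH", "B"), ("ADD",),
        ("PUSH", "C"), ("PUSH", "D"), ("ADD",),
        ("MUL",), ("POP", "X")]
mem = run(prog, {"A": 1, "B": 2, "C": 3, "D": 4})
print(mem["X"])  # → (1 + 2) * (3 + 4) = 21
```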
[Figures: instruction format and addressing-mode diagrams — an instruction with opcode
and two register fields; register and immediate operands; displacement addressing, where a
register's contents plus a memory address form the operand address; and stack addressing,
where the operand is implicitly at the top of the stack.]
Arithmetic Instructions
Shift Instructions
Interrupt
The interrupt procedure is, in principle, quite similar to a subroutine call except for three
variations:
The interrupt is usually initiated by an external or internal signal rather than from
execution of an instruction.
The address of the interrupt service program is determined by the hardware rather
than from the address field of an instruction.
An interrupt procedure usually stores all the information necessary to define the
state of the CPU rather than storing only the program counter.
Berkeley RISC I
The Berkeley RISC I is a 32-bit integrated circuit CPU.
o It supports 32-bit addresses and either 8-, 16-, or 32-bit data.
o It has a 32-bit instruction format and a total of 31 instructions.
o There are three basic addressing modes: Register addressing, immediate operand,
and relative to PC addressing for branch instructions.
o It has a register file of 138 registers: 10 global registers and 8 windows of 32
registers each.
o The 32 registers in each window have an organization similar to the overlapped
register window scheme.
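The register-file count above follows from the overlapped-window arithmetic. The split assumed here (10 global, 10 local, and 6 overlapping registers shared with each neighboring window) is the one Mano describes for RISC I:

```python
# Overlapped register windows: G globals, L locals per window,
# C registers shared with each neighbor, W windows in the file.
def total_registers(G, L, C, W):
    window_size = G + L + 2 * C   # registers visible to one procedure
    total = G + W * (L + C)       # each overlap region is counted once
    return window_size, total

print(total_registers(G=10, L=10, C=6, W=8))  # → (32, 138)
```

Each procedure sees a 32-register window, yet the whole file holds only 138 registers because the overlap regions are shared between adjacent windows.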
64-bit processors have 64-bit ALUs, 64-bit registers, and 64-bit buses.
A 64-bit register can address up to 2^64 bytes of logical address space.
64-bit processors have been with us since 1992.
e.g. the 64-bit AMD processor.
Internal Architecture
The internal architecture is the internal logic design of the microprocessor, which
determines how and when various operations are performed.
The various functions performed by the microprocessor can be classified as:
o Microprocessor initiated operations
o Internal operations
o Peripheral operations
Microprocessor initiated operations mainly deal with memory and I/O read and write
operations.
Internal operations determine how and what operations can be performed with the
data. The operations include:
1. storing
2. performing arithmetic and logical operations
3. test for conditions
4. store in the stack
Externally initiated (peripheral) operations are initiated by external devices to perform
special operations like reset, interrupt, ready, etc.
The block diagram of a 64-bit microprocessor is shown below.
The major parts of the block diagram are:
Architecture Elements
Addressing Modes
General Purpose Registers
Non-modal and modal Instructions
New Instructions in Support of 64-bit
New immediate Instructions
Addressing modes
The addressing mode determines the working environment, i.e. 24-, 32-, or 64-bit mode.
PSW bits 31 and 32 designate the addressing mode (out of 64 bits).
o Addressing-mode bits: 00 = 24-bit mode
01 = 32-bit mode
11 = 64-bit mode
Computer Organization and Architecture Chapter 3 : Control Unit
Chapter – 3
Control Unit
3.1 Control Memory
Extra Stuff:
Microprogram
Program stored in memory that generates all the control signals required to execute the
instruction set correctly
Consists of microinstructions
Microinstruction
Contains a control word and a sequencing word
Control Word - All the control information required for one clock cycle
Sequencing Word - Information needed to decide the next microinstruction address
Vocabulary to write a microprogram
Microprogrammed Sequencer
The next address generator is sometimes called a microprogram sequencer, as it
determines the address sequence that is read from control memory.
Typical functions of a microprogram sequencer are:
o Incrementing the control address register by one
o Loading into the control address register an address from control memory
o Transferring an external address
o Loading an initial address to start the control operations
Pipeline Register
The data register is sometimes called a pipeline register.
o It allows the execution of the microoperations specified by the control word
simultaneously with the generation of the next microinstruction.
o This configuration requires a two-phase clock.
o The system can operate without the control data register by applying a single-phase
clock to the address register.
Thus, the control word and next-address information are taken directly
from the control memory.
Advantages
The main advantage of microprogrammed control is the fact that once the hardware
configuration is established, there should be no need for further hardware or wiring
changes.
Most computers based on the reduced instruction set computer (RISC) architecture
concept use hardwired control rather than a control memory with a microprogram.
(Why?)
Conditional Branching
The branch logic of Fig. 3-2 provides decision-making capabilities in the control unit.
The status conditions are special bits in the system that provide parameter
information.
o e.g. the carry-out, the sign bit, the mode bits, and input or output status
The status bits, together with the field in the microinstruction that specifies a branch
address, control the conditional branch decisions generated in the branch logic.
The branch logic hardware may be implemented with a multiplexer.
o Branch to the indicated address if the condition is met;
o Otherwise, the address register is incremented.
An unconditional branch microinstruction can be implemented by loading the branch
address from control memory into the control address register.
If Condition is true, then Branch (address from the next address field of the current
microinstruction)
else Fall Through
Conditions to Test: O(overflow), N(negative), Z(zero), C(carry), etc.
Unconditional Branch
Fixing the value of one status bit at the input of the multiplexer to 1
Mapping of Instructions
A special type of branch exists when a microinstruction specifies a branch to the first
word in control memory where a microprogram routine for an instruction is located.
The status bits for this type of branch are the bits in the operation code part of the
instruction.
One simple mapping process that converts the 4-bit operation code to a 7-bit address
for control memory is shown in Fig. 3-3.
o Placing a 0 in the most significant bit of the address
o Transferring the four operation code bits
o Clearing the two least significant bits of the control address register
This provides for each computer instruction a microprogram routine with a capacity
of four microinstructions.
o If the routine needs more than four microinstructions, it can use addresses
1000000 through 1111111.
o If it uses fewer than four microinstructions, the unused memory locations
would be available for other routines.
One can extend this concept to a more general mapping rule by using a ROM or
programmable logic device (PLD) to specify the mapping function.
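The three-step mapping above is simple bit manipulation: a leading 0, the four opcode bits, and two cleared low bits give a 7-bit control-memory address, so each instruction gets a four-microinstruction routine:

```python
# Map a 4-bit opcode to a 7-bit control-memory address: 0 xxxx 00.
def map_opcode(opcode):
    assert 0 <= opcode < 16
    return opcode << 2            # MSB stays 0, two LSBs are cleared

for op in (0b0000, 0b0001, 0b1111):
    print(f"{map_opcode(op):07b}")
# → 0000000, 0000100, 0111100
```

Consecutive opcodes land four words apart, which is exactly the four-microinstruction capacity per routine described above.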
Mapping from the OP-code of an instruction to the address of the Microinstruction which
is the starting microinstruction of its execution microprogram.
Subroutine
Subroutines are programs that are used by other routines to accomplish a particular
task.
Microinstructions can be saved by employing subroutines that use common sections
of microcode.
e.g. effective address computation
The subroutine register can then become the source for transferring the address for
the return to the main routine.
The best way to structure a register file that stores addresses for subroutines is to
organize the registers in a last-in, first-out (LIFO) stack.
Microinstruction Format
The microinstruction format for the control memory is shown in Fig. 3-6.
The 20 bits of the microinstruction are divided into four functional parts.
o The three fields F1, F2, and F3 specify microoperations for the computer.
o The CD field selects status bit conditions.
o The BR field specifies the type of branch.
o The AD field contains a branch address.
Microoperations
The three bits in each field are encoded to specify seven distinct microoperations
as listed in Table 3-1.
o No more than three microoperations can be chosen for a microinstruction,
one from each field.
o If fewer than three microoperations are used, one or more of the fields will
use the binary code 000 for no operation.
It is important to realize that two or more conflicting microoperations cannot be
specified simultaneously. e.g. 010 001 000
Each microoperation in Table 3-1 is defined with a register transfer statement and
is assigned a symbol for use in a symbolic microprogram.
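The 20-bit layout above (F1, F2, F3 of three bits each, CD of two, BR of two, AD of seven) can be packed and unpacked as a sketch; the field values used below are arbitrary:

```python
# Pack/unpack the 20-bit microinstruction format described above.
FIELDS = [("F1", 3), ("F2", 3), ("F3", 3), ("CD", 2), ("BR", 2), ("AD", 7)]

def pack(**values):
    word = 0
    for name, width in FIELDS:        # fields laid out left to right
        v = values[name]
        assert 0 <= v < 2 ** width
        word = (word << width) | v
    return word

def unpack(word):
    out = {}
    for name, width in reversed(FIELDS):   # peel fields off the right end
        out[name] = word & (2 ** width - 1)
        word >>= width
    return out

w = pack(F1=0b010, F2=0b001, F3=0, CD=0b01, BR=0b00, AD=64)
assert unpack(w) == {"F1": 2, "F2": 1, "F3": 0, "CD": 1, "BR": 0, "AD": 64}
print(f"{w:020b}")
```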
The CD field consists of two bits which are encoded to specify four status bit
conditions as listed in Table 3-1.
The BR field consists of two bits. It is used, in conjunction with the address
field AD, to choose the address of the next microinstruction.
o The jump and call operations depend on the value of the CD field.
o The two operations are identical except that a call microinstruction stores
the return address in the subroutine register SBR.
o Note that the last two conditions in the BR field are independent of the
values in the CD and AD fields.
Sample Format
Five fields: label; micro-ops; CD; BR; AD
The label field: may be empty or it may specify a symbolic address
terminated with a colon
The microoperations field: one, two, or three symbols separated by
commas; the NOP symbol is used when the microinstruction has no
microoperations
The CD field: one of the letters {U, I, S, Z} can be chosen where
U: Unconditional Branch
I: Indirect address bit
S: Sign of AC
Z: Zero value in AC
The BR field: contains one of the four symbols {JMP, CALL, RET,
MAP}
The AD field: specifies a value for the address field of the
microinstruction with one of {Symbolic address, NEXT, empty}
o When the BR field contains a RET or MAP symbol, the AD field
is left empty
Fetch Subroutine
During FETCH, read an instruction from memory, decode it, and update the PC.
The first 64 words are to be occupied by the routines for the 16 instructions.
The last 64 words may be used for any other purpose.
o A convenient starting location for the fetch routine is address 64.
The three microinstructions that constitute the fetch routine have been listed in three
different representations.
o The register transfer representation:
To see how the transfer and return from the indirect subroutine occurs:
o MAP microinstruction caused a branch to address 0
o The first microinstruction in the ADD routine calls subroutine INDRCT when
I=1
o The return address is stored in the subroutine register SBR.
o The INDRCT subroutine has two microinstructions:
INDRCT: READ U JMP NEXT
DRTAR U RET
o Therefore, the memory has to be accessed to get the effective address, which
is then transferred to AR.
o The execution of the ADD instruction is carried out by the microinstructions
at addresses 1 and 2
o The first microinstruction reads the operand from memory into DR.
o The second microinstruction performs an add microoperation with the content
of DR and AC and then jumps back to the beginning of the fetch routine.
Binary Microprogram
The symbolic microprogram must be translated to binary either by means of an
assembler program or by the user if the microprogram is simple.
The equivalent binary form of the microprogram is listed in Table 3-3.
Even though address 3 is not used, some binary value, e.g. all 0’s, must be
specified for each word in control memory. However, in case an unforeseen
error or a noise signal sets CAR to the value 3, it is wise to make the
microinstruction at address 3 jump to address 64 (the start of the fetch routine).
Control Memory
When a ROM is used for the control memory, the microprogram binary list
provides the truth table for fabricating the unit.
o To modify the instruction set of the computer, it is necessary to generate a
new microprogram and mask a new ROM.
The advantage of employing a RAM for the control memory is that the
microprogram can be altered simply by writing a new pattern of 1’s and 0’s
without resorting to a hardware procedure.
However, most microprogram systems use a ROM for the control memory
because it is cheaper and faster than a RAM.
Types of Micro-operation
Transfer data between registers
Transfer data from register to external interface
Transfer data from external interface to register
Perform arithmetic / logical operations, with registers supplying input and receiving output
Hardwired Implementation
In this implementation, the control unit is essentially a combinational circuit:
its input signals are transformed into a set of output logic signals, which are the control signals.
Control unit inputs
Flags and control bus
o Each bit means something
Instruction register
o Op-code causes different control signals for each different instruction
o Unique logic for each op-code
o A decoder takes the encoded op-code as input and produces a single output
o Each decoder input pattern activates a single unique output
Clock
o Repetitive sequence of pulses
o Useful for measuring duration of micro-ops
o Must be long enough to allow signal propagation along data paths and
through processor circuitry
o Different control signals at different times within instruction cycle
o Need a counter as i/p to control unit with different control signals being
used for t1, t2 etc.
o At end of instruction cycle, counter is re-initialised
Implementation
For each control signal, a Boolean expression for that signal is derived as a
function of the inputs; the resulting combinational circuit realizes the control unit.
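The idea of one Boolean expression per control signal can be sketched in code. The opcodes, timing steps, and signal names below are invented for illustration only; a real design derives one expression per signal from the machine's instruction set and timing diagram.

```python
# Hedged sketch of a hardwired control unit as pure Boolean logic.
# Opcode values and control-signal names here are assumptions made
# for illustration, not taken from any particular machine.

def control_signals(opcode, t):
    """Return the control signals active for a given opcode at
    timing step t (t1, t2, ... from the sequence counter)."""
    LOAD, ADD, STORE = 0b00, 0b01, 0b10   # assumed opcodes

    signals = {
        # Fetch phase is common to every instruction.
        "mem_read":  t == 1 or (t == 3 and opcode in (LOAD, ADD)),
        "ir_load":   t == 1,
        "pc_incr":   t == 2,
        # Execute phase differs per opcode: unique logic per op-code.
        "alu_add":   t == 4 and opcode == ADD,
        "mem_write": t == 3 and opcode == STORE,
    }
    return {name for name, active in signals.items() if active}

print(control_signals(0b01, 4))   # ADD at t4 -> {'alu_add'}
```

Each dictionary entry plays the role of one gate network; the counter value t and the decoded opcode are its inputs, exactly as described above.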
Micro-programmed Implementation
An alternative to hardwired CU
Common in contemporary CISC processors
Control operations are performed by sequences of microinstructions rather than by
fixed logic; this technique is called micro-programming, and the microprogram is
often referred to as firmware
Micro-instruction Types
Each micro-instruction specifies a single micro-operation or a few micro-operations -
vertical micro-programming
Each micro-instruction specifies many different micro-operations to be performed in
parallel - horizontal micro-programming
Horizontal Micro-programming
Wide memory word
High degree of parallel operations possible
Little encoding of control information
Vertical Micro-programming
Width is narrow
n control signals encoded into log2 n bits
Limited ability to express parallelism
Considerable encoding of the control information, requiring an external decoder on
the memory word to identify the exact control line being manipulated
Microprogram Sequencer
The basic components of a microprogrammed control unit are the control memory and
the circuits that select the next address.
The address selection part is called a microprogram sequencer.
A microprogram sequencer can be constructed with digital functions to suit a particular
application.
To guarantee a wide range of acceptability, an integrated circuit sequencer must provide
an internal organization that can be adapted to a wide range of applications.
The purpose of a microprogram sequencer is to present an address to the control memory
so that a microinstruction may be read and executed.
The block diagram of the microprogram sequencer is shown in Fig. 3-12.
o The control memory is included to show the interaction between the sequencer
and the memory attached to it.
o There are two multiplexers in the circuit: the first selects an address from
one of four sources and routes it to CAR; the second tests the value of the
selected status bit and applies the result to an input logic circuit.
o The output from CAR provides the address for the control memory; the contents
of CAR are also incremented and applied to one of the multiplexer inputs and
to SBR.
o Although the diagram shows a single subroutine register, a typical sequencer will
have a register stack about four to eight levels deep. In this way, push and pop
operations with a stack pointer support nested subroutine call and return instructions.
o The CD (Condition) field of the microinstruction selects one of the status bits in
the second multiplexer.
o The Test variable (either 1 or 0) i.e. T value together with the two bits from the
BR (Branch) field go to an input logic circuit.
o The input logic circuit determines the type of the operation.
Based on the function defined for each entry in Table 3-1, the truth table for the input
logic circuit is shown in Table 3-4.
Therefore, the simplified Boolean functions for the input logic circuit can be given as:
S1 = I1
S0 = I1·I0 + I1′·T
L = I1′·I0·T
o The bit values for S1 and S0 are determined from the stated function and the path in the
multiplexer that establishes the required transfer.
o Note that the incrementer circuit in the sequencer of Fig. 3-12 is not a counter constructed
with flip-flops but rather a combinational circuit constructed with gates.
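The simplified functions S1 = I1, S0 = I1·I0 + I1′·T, L = I1′·I0·T can be checked directly in code; the sketch below evaluates them for the BR/test combinations of Table 3-1.

```python
# Sketch of the sequencer's input logic: given the two BR bits
# (I1 I0) and the test result T, produce the multiplexer selects
# S1 S0 and the SBR load signal L, using the simplified functions
# S1 = I1, S0 = I1*I0 + I1'*T, L = I1'*I0*T.

def input_logic(i1, i0, t):
    s1 = i1
    s0 = (i1 & i0) | ((1 - i1) & t)   # 1 - i1 plays the role of I1'
    l  = (1 - i1) & i0 & t
    return s1, s0, l

# BR = 00 (JMP): branch to AD when T = 1, otherwise increment CAR.
assert input_logic(0, 0, 1) == (0, 1, 0)   # select AD
assert input_logic(0, 0, 0) == (0, 0, 0)   # select CAR + 1
# BR = 01 (CALL) with T = 1 also loads SBR with the return address.
assert input_logic(0, 1, 1) == (0, 1, 1)
# BR = 10 (RET) selects SBR; BR = 11 (MAP) selects the mapping output.
assert input_logic(1, 0, 0) == (1, 0, 0)
assert input_logic(1, 1, 0) == (1, 1, 0)
```

The multiplexer select encoding assumed here (00 = increment, 01 = AD, 10 = SBR, 11 = mapping) follows the ordering used in the truth table discussion above.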
Computer Organization and Architecture Chapter 4 : Pipeline and Vector processing
Chapter – 4
Pipeline and Vector Processing
4.1 Pipelining
Pipelining is a technique of decomposing a sequential process into
suboperations, with each subprocess being executed in a special dedicated
segment that operates concurrently with all other segments.
The overlapping of computation is made possible by associating a register
with each segment in the pipeline.
The registers provide isolation between each segment so that each can operate
on distinct data simultaneously.
Perhaps the simplest way of viewing the pipeline structure is to imagine that
each segment consists of an input register followed by a combinational
circuit.
o The register holds the data.
o The combinational circuit performs the suboperation in the particular
segment.
A clock is applied to all registers after enough time has elapsed to perform all
segment activity.
The pipeline organization will be demonstrated by means of a simple
example.
o To perform the combined multiply and add operations with a stream of
numbers
Ai * Bi + Ci for i = 1, 2, 3, …, 7
Each suboperation is to be implemented in a segment within a pipeline.
R1 ← Ai, R2 ← Bi        Input Ai and Bi
R3 ← R1 * R2, R4 ← Ci   Multiply and input Ci
R5 ← R3 + R4            Add Ci to product
Each segment has one or two registers and a combinational circuit as shown in
Fig. 4-1.
The five registers are loaded with new data every clock pulse. The effect of
each clock is shown in Table 4-1.
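The clock-by-clock behaviour can be sketched as a small simulation. The register names follow the text; the way the clock is modeled (all registers load together from the old values) is an assumption of the sketch.

```python
# Minimal cycle-by-cycle sketch of the three-segment pipeline for
# Ai*Bi + Ci. The five registers R1..R5 are modeled as variables
# that all load on the same clock edge.

def pipeline(a, b, c):
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None
    results = []
    for clock in range(n + 2):            # n inputs + 2 drain cycles
        # Compute the new register values from the OLD ones before
        # overwriting anything, as a common clock would.
        new_r5 = r3 + r4 if r3 is not None else None   # segment 3: add
        new_r3 = r1 * r2 if r1 is not None else None   # segment 2: multiply
        new_r4 = c[clock - 1] if 1 <= clock <= n else None
        if clock < n:                                  # segment 1: input
            new_r1, new_r2 = a[clock], b[clock]
        else:
            new_r1 = new_r2 = None
        r1, r2, r3, r4, r5 = new_r1, new_r2, new_r3, new_r4, new_r5
        if r5 is not None:
            results.append(r5)
    return results

a, b, c = [1, 2, 3], [4, 5, 6], [7, 8, 9]
print(pipeline(a, b, c))   # [1*4+7, 2*5+8, 3*6+9] = [11, 18, 27]
```

Note that the first result emerges only after the pipeline fills (three clocks), after which one result appears every clock, which is exactly the behaviour a space-time diagram visualizes.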
General Considerations
Any operation that can be decomposed into a sequence of suboperations of
about the same complexity can be implemented by a pipeline processor.
The general structure of a four-segment pipeline is illustrated in Fig. 4-2.
We define a task as the total operation performed going through all the
segments in the pipeline.
The behavior of a pipeline can be illustrated with a space-time diagram.
o It shows the segment utilization as a function of time.
SIMD
Represents an organization that includes many processing units under the
supervision of a common control unit.
All processors receive the same instruction from the control unit but operate on
different items of data.
The shared memory unit must contain multiple modules so that it can
communicate with all the processors simultaneously.
Pipeline Conflicts
In general, there are three major difficulties that cause the instruction pipeline
to deviate from its normal operation.
o Resource conflicts caused by access to memory by two segments at the
same time.
o Data dependency conflicts, which arise when an instruction depends on the
result of a previous instruction that is not yet available.
o Branch difficulties, which arise from branch and other instructions that
change the value of PC.
Loop buffer: This is a small very high speed register file maintained by the
instruction fetch segment of the pipeline.
Branch prediction: A pipeline with branch prediction uses some additional logic
to guess the outcome of a conditional branch instruction before it is executed.
Delayed branch: in this procedure, the compiler detects the branch instructions
and rearranges the machine language code sequence by inserting useful
instructions that keep the pipeline operating without interruptions.
o A procedure employed in most RISC processors.
o e.g. no-operation instruction
Fig 4-9(a): Three segment pipeline timing - Pipeline timing with data conflict
Fig. 4-9(b) shows the same program with a no-op instruction inserted after the
load to R2 instruction.
Fig 4-9(b): Three segment pipeline timing - Pipeline timing with delayed load
Thus the no-op instruction is used to advance one clock cycle in order to
compensate for the data conflict in the pipeline.
The advantage of the delayed load approach is that the data dependency is taken
care of by the compiler rather than the hardware.
Delayed Branch
The method used in most RISC processors is to rely on the compiler to redefine
the branches so that they take effect at the proper time in the pipeline. This
method is referred to as delayed branch.
The compiler is designed to analyze the instructions before and after the branch
and rearrange the program sequence by inserting useful instructions in the delay
steps.
It is up to the compiler to find useful instructions to put after the branch
instruction. Failing that, the compiler can insert no-op instructions.
An Example of Delayed Branch
The program for this example consists of five instructions.
o Load from memory to R1
o Increment R2
o Add R3 to R4
o Subtract R5 from R6
o Branch to address X
In Fig. 4-10(a) the compiler inserts two no-op instructions after the branch.
o The branch address X is transferred to PC in clock cycle 7.
The program in Fig. 4-10(b) is rearranged by placing the add and subtract
instructions after the branch instruction.
o PC is updated to the value of X in clock cycle 5.
Vector Operations
Many scientific problems require arithmetic operations on large arrays of
numbers.
A vector is an ordered set of a one-dimensional array of data items.
A vector V of length n is represented as a row vector by V = [v1, v2, …, vn].
To examine the difference between a conventional scalar processor and a vector
processor, consider the following Fortran DO loop:
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)
This is implemented in machine language by the following sequence of
operations.
Initialize I=0
20 Read A(I)
Read B(I)
Store C(I) = A(I)+B(I)
Increment I = I + 1
If I <= 100 go to 20
Continue
A computer capable of vector processing eliminates the overhead associated with
the time it takes to fetch and execute the instructions in the program loop.
C(1:100) = A(1:100) + B(1:100)
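The contrast between the scalar loop and the single vector statement can be sketched in code; here a Python list comprehension stands in for the one vector instruction, which is only an analogy for how a vector processor would treat it.

```python
# Sketch contrasting the scalar loop with the vector statement
# C(1:100) = A(1:100) + B(1:100). The "vector" version expresses the
# whole operation at once; on a vector processor this becomes a
# single instruction instead of per-element fetch/execute overhead.

A = list(range(1, 101))
B = list(range(101, 201))

# Scalar style: one add per iteration, plus loop bookkeeping
# (increment I, compare against 100, branch back).
C_scalar = []
for i in range(100):
    C_scalar.append(A[i] + B[i])

# Vector style: one whole-array operation.
C_vector = [a + b for a, b in zip(A, B)]

assert C_scalar == C_vector
print(C_vector[0], C_vector[99])   # 102 300
```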
A possible instruction format for a vector instruction is shown in Fig. 4-11.
o This assumes that the vector operands reside in memory.
It is also possible to design the processor with a large number of registers and
store all operands in registers prior to the addition operation.
o The base address and length in the vector instruction specify a group of
CPU registers.
Matrix Multiplication
The multiplication of two n × n matrices consists of n² inner products or n³
multiply-add operations.
o Consider, for example, the multiplication of two 3 x 3 matrices A and B.
o c11= a11b11+ a12b21+ a13b31
o This requires three multiplications and (after initializing c11 to 0) three
additions.
In general, the inner product consists of the sum of k product terms of the form
C= A1B1+A2B2+A3B3+…+AkBk.
o In a typical application k may be equal to 100 or even 1000.
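The inner product C = A1·B1 + A2·B2 + … + Ak·Bk can be written directly; this is the multiply-add kernel that a pipelined vector unit overlaps.

```python
# The inner product as code: one multiply-add per term, the
# operation repeated n^2 times in an n x n matrix multiplication.

def inner_product(a, b):
    c = 0
    for ai, bi in zip(a, b):
        c += ai * bi          # one multiply-add per term
    return c

# In the 3x3 example, c11 = a11*b11 + a12*b21 + a13*b31.
assert inner_product([1, 2, 3], [4, 5, 6]) == 4 + 10 + 18   # 32
```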
Memory Interleaving
Pipeline and vector processors often require simultaneous access to memory from
two or more sources.
o An instruction pipeline may require the fetching of an instruction and an
operand at the same time from two different segments.
o An arithmetic pipeline usually requires two or more operands to enter the
pipeline at the same time.
Instead of using two memory buses for simultaneous access, the memory can be
partitioned into a number of modules connected to common memory address and
data buses.
o A memory module is a memory array together with its own address and
data registers.
Fig. 4-13 shows a memory unit with four modules.
The advantage of a modular memory is that it allows the use of a technique called
interleaving.
In an interleaved memory, different sets of addresses are assigned to different
memory modules.
By staggering the memory access, the effective memory cycle time can be
reduced by a factor close to the number of modules.
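The address-to-module assignment of an interleaved memory can be sketched as simple arithmetic; the sketch below assumes low-order interleaving with four modules, so consecutive addresses fall in different modules and sequential accesses can be staggered.

```python
# Sketch of low-order interleaving with four modules: the low-order
# address bits pick the module, the remaining bits pick the word
# within that module.

NUM_MODULES = 4

def module_of(address):
    return address % NUM_MODULES          # low-order bits select module

def offset_in_module(address):
    return address // NUM_MODULES         # remaining bits address the word

for addr in range(8):
    print(addr, "-> module", module_of(addr), "word", offset_in_module(addr))
# Addresses 0,1,2,3 land in modules 0,1,2,3; address 4 wraps to module 0.
```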
Supercomputers
A commercial computer with vector instructions and pipelined floating-point
arithmetic operations is referred to as a supercomputer.
o To speed up the operation, the components are packed tightly together to
minimize the distance that the electronic signals have to travel.
This is augmented by instructions that process vectors and combinations of
scalars and vectors.
A supercomputer is a computer system best known for its high computational
speed, fast and large memory systems, and the extensive use of parallel
processing.
o It is equipped with multiple functional units and each unit has its own
pipeline configuration.
It is specifically optimized for the type of numerical calculations involving
vectors and matrices of floating-point numbers.
They are limited in their use to a number of scientific applications, such as
numerical weather forecasting, seismic wave analysis, and space research.
A measure used to evaluate computers in their ability to perform a given number
of floating-point operations per second is referred to as flops.
A typical supercomputer has a basic cycle time of 4 to 20 ns.
Examples of supercomputers:
o Cray-1: uses vector processing with 12 distinct functional units in parallel;
a large number of registers (over 150); multiprocessor configurations (Cray X-MP
and Cray Y-MP)
o Fujitsu VP-200: 83 vector instructions and 195 scalar instructions; 300
megaflops
Chapter – 5
Computer Arithmetic
Integer Representation: (Fixed-point representation):
An eight-bit word can represent unsigned numbers from 0 through 255, for example:
00000000 = 0
00000001 = 1
-------
11111111 = 255
In general, an n-bit sequence of binary digits a(n-1), a(n-2), ..., a1, a0 is
interpreted as an unsigned integer A:
A = Σ (i = 0 to n-1) 2^i · ai
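The unsigned interpretation A = Σ 2^i·ai can be written as a few lines of code; the bit list is given most significant bit first, as written in the text.

```python
# The unsigned interpretation A = sum(2^i * ai) as code. The bits
# are supplied a_{n-1} first; the Horner-style loop is equivalent
# to summing 2^i * ai over all i.

def unsigned_value(bits):
    a = 0
    for bit in bits:           # a_{n-1} first
        a = a * 2 + bit
    return a

assert unsigned_value([0, 0, 0, 0, 0, 0, 0, 0]) == 0
assert unsigned_value([0, 0, 0, 0, 0, 0, 0, 1]) == 1
assert unsigned_value([1, 1, 1, 1, 1, 1, 1, 1]) == 255
```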
There are several drawbacks to sign-magnitude representation. One is that addition and
subtraction require considering both the signs of the numbers and their relative
magnitudes to carry out the required operation. Another drawback is that there are two
representations of zero:
+0 = 00000000
-0 = 10000000
which is inconvenient.
For a positive integer A (sign bit a(n-1) = 0):
A = Σ (i = 0 to n-2) 2^i · ai, for A ≥ 0
The number zero is identified as positive and therefore has a zero sign bit and a
magnitude of all 0’s. The range of positive integers that may be represented is from 0
(all the magnitude bits are 0) through 2^(n-1) - 1 (all the magnitude bits are 1).
For a negative integer A, the sign bit a(n-1) is 1. The range of negative integers that
can be represented is from -1 to -2^(n-1).
2’s complement: A = -2^(n-1) · a(n-1) + Σ (i = 0 to n-2) 2^i · ai
This defines the 2’s complement representation of both positive and negative numbers.
For an example:
+18 = 00010010
1’s complement = 11101101
2’s complement = 11101110 = -18
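The +18 → -18 example can be reproduced in code: invert every bit (1's complement) and add 1 (2's complement), all within an 8-bit word.

```python
# Two's complement negation within an 8-bit word, following the
# worked example: flip all bits, then add 1.

def twos_complement(value, bits=8):
    ones = value ^ ((1 << bits) - 1)       # 1's complement: flip bits
    return (ones + 1) & ((1 << bits) - 1)  # add 1, keep 8 bits

x = 0b00010010                             # +18
neg = twos_complement(x)
print(format(neg, "08b"))                  # 11101110
assert twos_complement(neg) == x           # negating twice restores +18
```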
The first four examples illustrate successful operation: if the result of the operation is
positive, we get a positive number in ordinary binary notation; if the result is negative,
we get a negative number in 2’s complement form. Note that in some instances there is a
carry bit beyond the end of the word, which is ignored. On any addition, the result may be
larger than can be held in the word size being used. This condition is called overflow.
When overflow occurs, the ALU must signal the fact so that no attempt is made to use the
result. To detect overflow, the following rule is observed: if two numbers are added, and
they are both positive or both negative, then overflow occurs if and only if the result
has the opposite sign.
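The overflow rule translates directly into code: adding two n-bit numbers of the same sign overflows exactly when the result has the opposite sign.

```python
# The overflow-detection rule as code, for 8-bit two's complement
# addition: V = 1 iff both operands share a sign and the result's
# sign differs.

BITS = 8

def add_with_overflow(a, b):
    """Add two 8-bit two's complement values; return (result, V)."""
    result = (a + b) & ((1 << BITS) - 1)   # carry beyond bit 7 ignored
    sign = lambda x: (x >> (BITS - 1)) & 1
    v = int(sign(a) == sign(b) and sign(result) != sign(a))
    return result, v

assert add_with_overflow(0b01000000, 0b01000000) == (0b10000000, 1)  # 64+64 overflows
assert add_with_overflow(0b00000001, 0b00000010) == (0b00000011, 0)  # 1+2 is fine
```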
The data path and hardware elements needed to accomplish addition and subtraction is shown in
figure below. The central element is binary adder, which is presented two numbers for addition
and produces a sum and an overflow indication. The binary adder treats the two numbers as
unsigned integers. For addition, the two numbers are presented to the adder from two registers A
and B. The result may be stored in one of these registers or in the third. The overflow indication
is stored in a 1-bit overflow flag V (where 1 = overflow and 0 = no overflow). For subtraction,
the subtrahend (B register) is passed through a 2’s complement unit so that its 2’s complement is
presented to the adder (a – b = a + (-b)).
Fig: Block diagram of hardware for addition / subtraction — registers A and B feed an
n-bit adder; for subtraction, B passes through a complement switch (XOR gates) so that
its 2’s complement reaches the adder; the flags V, Z, S, C are set from the result, with
a check-for-zero circuit driving Z.
Start: M ← Multiplicand, Q ← Multiplier, C, A ← 0, Count ← number of bits of Q.
Repeat: if Q0 = 1 then A ← A + M; right shift C, A, Q; Count ← Count - 1; until Count = 0.
Stop: result in AQ.
Fig.: Flowchart of Unsigned Binary Multiplication
Start: X ← Multiplicand, Y ← Multiplier, Sum ← 0, Count ← number of bits of Y.
Repeat: if Y0 = 1 then Sum ← Sum + X; shift X left and Y right; Count ← Count - 1;
until Count = 0.
Stop: result in Sum.
Fig: Unsigned Binary Multiplication (alternate method)
Algorithm:
Step 1: Clear the sum (accumulator A). Place the multiplicand in X and multiplier in Y.
Step 2: Test Y0; if it is 1, add content of X to the accumulator A.
Step 3: Logical Shift the content of X left one position and content of Y right one position.
Step 4: Check for completion; if not completed, go to step 2.
Example: Multiply 7 × 6
Sum     X       Y    Count  Remarks
000000  000111  110  3      Initialization
000000  001110  011  2      Y0 = 0: no add; shift X left, Y right
001110  011100  001  1      Y0 = 1: Sum ← Sum + X; shift
101010  111000  000  0      Y0 = 1: Sum ← Sum + X; shift
Result = 101010 = 2^5 + 2^3 + 2^1 = 32 + 8 + 2 = 42
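The alternate shift-add method just traced can be written as a short routine: test Y0, add X into the sum if it is 1, then shift X left and Y right.

```python
# The shift-add multiplication algorithm above as code: one pass
# per multiplier bit.

def multiply_unsigned(x, y, bits):
    total = 0
    for _ in range(bits):     # Count <- number of bits of Y
        if y & 1:             # Y0 = 1: add the (shifted) multiplicand
            total += x
        x <<= 1               # shift X left one position
        y >>= 1               # shift Y right one position
    return total

assert multiply_unsigned(0b000111, 0b110, 3) == 42   # 7 * 6
assert multiply_unsigned(13, 11, 4) == 143
```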
Start: A ← 0, Q(-1) ← 0, M ← Multiplicand, Q ← Multiplier, Count ← number of bits of Q.
Repeat: examine the pair Q0 Q(-1);
if Q0 Q(-1) = 10 then A ← A - M; if Q0 Q(-1) = 01 then A ← A + M;
if Q0 Q(-1) = 00 or 11, do nothing;
arithmetic shift right A, Q, Q(-1); Count ← Count - 1; until Count = 0.
Stop: result in AQ.
Fig.: Flowchart of Signed Binary Number Multiplication (using 2’s Complement, Booth’s Method)
11110 01011 1     Arithmetic shift right A, Q, Q(-1), as Q0 Q(-1) = 11
11111 00101 1     Arithmetic shift right A, Q, Q(-1), as Q0 Q(-1) = 11
Result in AQ = 11111 00101 = -2^9 + 2^8 + 2^7 + 2^6 + 2^5 + 2^2 + 2^0
             = -512 + 256 + 128 + 64 + 32 + 4 + 1 = -27
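Booth's method as described in the flowchart can be sketched compactly; the register widths and the signed reinterpretation of AQ at the end are modeling choices of the sketch, not part of the hardware.

```python
# Compact sketch of Booth's algorithm: examine Q0,Q-1, add or
# subtract M accordingly, then arithmetic shift right A, Q, Q-1.

def booth_multiply(m, q, bits):
    mask = (1 << bits) - 1
    a, q_1 = 0, 0
    m, q = m & mask, q & mask             # hold operands as n-bit words
    for _ in range(bits):
        pair = ((q & 1) << 1) | q_1
        if pair == 0b10:                  # Q0 Q-1 = 10: A <- A - M
            a = (a - m) & mask
        elif pair == 0b01:                # Q0 Q-1 = 01: A <- A + M
            a = (a + m) & mask
        # Arithmetic shift right A, Q, Q-1 (A's sign bit replicated).
        q_1 = q & 1
        q = (q >> 1) | ((a & 1) << (bits - 1))
        sign = a >> (bits - 1)
        a = (a >> 1) | (sign << (bits - 1))
    result = (a << bits) | q              # product sits in AQ
    if result >> (2 * bits - 1):          # reinterpret as signed
        result -= 1 << (2 * bits)
    return result

assert booth_multiply(3, -9, 5) == -27    # final AQ = 11111 00101
assert booth_multiply(-7, -3, 5) == 21
```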
First, the bits of the dividend are examined from left to right, until the set of bits examined
represents a number greater than or equal to the divisor; this is referred to as the divisor being
able to divide the number. Until this event occurs, 0s are placed in the quotient from left to right.
When the event occurs, a 1 is placed in the quotient and the divisor is subtracted from the partial
dividend. The result is referred to as a partial remainder. The division follows a cyclic pattern.
At each cycle, additional bits from the dividend are appended to the partial remainder until the
result is greater than or equal to the divisor. The divisor is subtracted from this number to
produce a new partial remainder. The process continues until all the bits of the dividend are
exhausted.
Fig: Hardware for binary division — registers A and Q shift left together; an (N+1)-bit
adder performs the add / subtract of the divisor under the direction of the control unit.
Algorithm:
Step 1: Initialize A, Q and M registers to zero, dividend and divisor respectively and counter to n
where n is the number of bits in the dividend.
Step 2: Shift A, Q left one binary position.
Step 3: Subtract M from A placing answer back in A. If sign of A is 1, set Q0 to zero and add M
back to A (restore A). If sign of A is 0, set Q0 to 1.
Step 4: Decrease counter; if counter > 0, repeat process from step 2 else stop the process. The
final remainder will be in A and quotient will be in Q.
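The numbered steps above (restoring division) can be sketched as code; registers are plain Python integers, and a negative A is represented directly rather than in 2's complement, an assumption that keeps the sketch short.

```python
# Restoring division following Steps 1-4: shift AQ left, try
# A <- A - M, and restore A (leaving Q0 = 0) when the result
# goes negative.

def restoring_divide(dividend, divisor, bits):
    a, q, m = 0, dividend, divisor
    for _ in range(bits):
        # Step 2: shift A, Q left one binary position.
        a = (a << 1) | (q >> (bits - 1))
        q = (q << 1) & ((1 << bits) - 1)
        # Step 3: trial subtraction.
        a -= m
        if a < 0:
            a += m                # restore A, leave Q0 = 0
        else:
            q |= 1                # success: set Q0 = 1
    return q, a                   # quotient in Q, remainder in A

assert restoring_divide(7, 3, 4) == (2, 1)     # 7 = 2*3 + 1
assert restoring_divide(35, 6, 6) == (5, 5)    # 35 = 5*6 + 5
```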
Start: A ← 0, M ← Divisor, Q ← Dividend, Count ← number of bits of Q.
Repeat:
if A ≥ 0: left shift AQ and A ← A - M; otherwise: left shift AQ and A ← A + M;
if A < 0 then Q0 ← 0, else Q0 ← 1;
Count ← Count - 1;
while Count > 0.
Finally, if A < 0 then A ← A + M (restore the remainder).
Quotient in Q, remainder in A. Stop.
Fig.: Flowchart of Binary Division
The floating point representation is always interpreted as a number of the form ±M × R^(±E).
Only the mantissa M and the exponent E are physically represented in the register (including
their signs). The radix R and the radix-point position of the mantissa are always assumed.
A floating point binary number is represented in a similar manner except that it uses
base 2 for the exponent.
For example, the binary number +1001.11 is represented with an 8-bit fraction and a
6-bit exponent as follows:
0.1001110 × 2^100
Fraction   Exponent
01001110   000100
The fraction has a zero in the leftmost position to denote positive. The floating point
number is equivalent to M × 2^E = +(0.1001110)₂ × 2^(+4)
There are four basic operations for floating point arithmetic. For addition and subtraction, it is
necessary to ensure that both operands have the same exponent values. This may require shifting
the radix point on one of the operands to achieve alignment. Multiplication and division
are more straightforward.
A floating point operation may produce one of these conditions:
Exponent Overflow: A positive exponent exceeds the maximum possible exponent value.
Exponent Underflow: A negative exponent which is less than the minimum possible
value.
Significand Overflow: The addition of two significands of the same sign may produce a
carry out of the most significant bit.
Significand underflow: In the process of aligning significands, digits may flow off the
right end of the significand.
Example: Addition
X = 0.10001 × 2^110
Y = 0.101 × 2^100
Since EY < EX, adjust Y:
Y = 0.00101 × 2^100 × 2^010 = 0.00101 × 2^110
So, EZ = EX = EY = 110
Now, MZ = MX + MY = 0.10001 + 0.00101 = 0.10110
Hence, Z = MZ × 2^EZ = 0.10110 × 2^110
Example: Subtraction
X = 0.10001 × 2^110
Y = 0.101 × 2^100
Since EY < EX, adjust Y:
Y = 0.00101 × 2^100 × 2^010 = 0.00101 × 2^110
So, EZ = EX = EY = 110
Now, MZ = MX - MY = 0.10001 - 0.00101 = 0.01100
Z = MZ × 2^EZ = 0.01100 × 2^110 (un-normalized)
Hence, Z = 0.1100 × 2^110 × 2^(-001) = 0.1100 × 2^101
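The alignment and post-normalization steps used in these examples can be sketched in code; the sketch assumes binary fractions held as exact rationals via Python's `fractions` module, which stands in for the significand registers.

```python
# Hedged sketch of floating point addition: the operand with the
# smaller exponent is shifted right (significand divided by a power
# of two) until exponents match, significands are added, and the
# result is normalized to 1/2 <= |M| < 1.

from fractions import Fraction

def fp_add(mx, ex, my, ey):
    """mx, my are fractional significands (1/2 <= m < 1); ex, ey ints."""
    if ey < ex:
        my, ey = my / 2 ** (ex - ey), ex    # align Y to X's exponent
    elif ex < ey:
        mx, ex = mx / 2 ** (ey - ex), ey
    m, e = mx + my, ex
    while m != 0 and abs(m) < Fraction(1, 2):    # post-normalize left
        m, e = m * 2, e - 1
    while abs(m) >= 1:                           # post-normalize right
        m, e = m / 2, e + 1
    return m, e

# X = 0.10001 * 2^110, Y = 0.101 * 2^100 (binary exponents 6 and 4):
# 0.10001 = 17/32, 0.101 = 5/8.
m, e = fp_add(Fraction(17, 32), 6, Fraction(5, 8), 4)
assert (m, e) == (Fraction(22, 32), 6)    # 0.10110 * 2^110
```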
Flowchart of floating point addition:
Start: if X = 0 then Z ← Y and stop; if Y = 0 then Z ← X and stop.
Align the operands so that EY = EX, then add the significands.
If ½ ≤ MZ < 1 the result is already normalized; otherwise post-normalize. Stop.
Example:
X = 0.101 × 2^110
Y = 0.1001 × 2^(-010)
As we know, Z = X × Y = (MX × MY) × 2^(EX + EY)
Z = (0.101 × 0.1001) × 2^(110 - 010)
  = 0.0101101 × 2^100
  = 0.101101 × 2^011 (Normalized)
Side work for the significand product:
    0.1001
  × 0.101
  --------
      1001
     0000
    1001
  --------
  0.0101101
Example:
X = 0.101 × 2^110
Y = 0.1001 × 2^(-010)
As we know, Z = X / Y = (MX / MY) × 2^(EX - EY)
MX / MY = 0.101 / 0.1001 = (1/2 + 1/8) / (1/2 + 1/16) ≈ 1.11 (decimal) = 1.00011 (binary)
Converting the fraction 0.11 to binary:
0.11 × 2 = 0.22 → 0
0.22 × 2 = 0.44 → 0
0.44 × 2 = 0.88 → 0
0.88 × 2 = 1.76 → 1
0.76 × 2 = 1.52 → 1
EX - EY = 110 + 010 = 1000
Now, Z = MZ × 2^EZ = 1.00011 × 2^1000 = 0.100011 × 2^1001
Chapter – 6
Memory System
6.1 Microcomputer Memory
Memory is an essential component of the microcomputer system.
It stores binary instructions and data for the microcomputer.
The memory is the place where the computer holds current programs and data that are in
use.
No single technology is optimal in satisfying the memory requirements for a computer
system.
Computer memory exhibits perhaps the widest range of type, technology, organization,
performance and cost of any feature of a computer system.
The memory unit that communicates directly with the CPU is called main memory.
Devices that provide backup storage are called auxiliary memory or secondary memory.
Location
• Processor memory: The memory like registers is included within the processor and
termed as processor memory.
• Internal memory: It is often termed main memory; it is internal to the computer system (though external to the processor).
• External memory: It consists of peripheral storage devices such as disk and magnetic
tape that are accessible to processor via i/o controllers.
Capacity
• Word size: the natural unit of organisation. Common word lengths are 8, 16, 32 bits etc.
• Number of words (or bytes): capacity is expressed in terms of the number of words
or bytes.
Unit of Transfer
• Internal: For internal memory, the unit of transfer is equal to the number of data lines
into and out of the memory module.
• External: For external memory, data are transferred in blocks, which are usually much
larger than a word.
• Addressable unit
— Smallest location which can be uniquely addressed
— Word internally
— Cluster on Magnetic disks
Computer Organization and Architecture Chapter 6 : Memory System
Access Method
• Sequential access: Access must start at the beginning and read through in a
specific linear sequence. This means the access time of a unit of data depends on
its position (the record accessed) and on the previous location.
— e.g. tape
• Direct Access: Individual blocks of records have unique address based on location.
Access is accomplished by jumping (direct access) to general vicinity plus a
sequential search to reach the final location.
— e.g. disk
• Random access: The time to access a given location is independent of the sequence of
prior accesses and is constant. Thus any location can be selected at random and
directly addressed and accessed.
— e.g. RAM
• Associative access: This is random access type of memory that enables one to make a
comparison of desired bit locations within a word for a specified match, and to do this
for all words simultaneously.
— e.g. cache
Performance
• Access time: For random access memory, access time is the time it takes to perform a
read or write operation, i.e. the time to address the memory plus the time to read /
write the addressed location. For non-random access, it is the time needed to
position the read / write mechanism at the desired location.
— Time between presenting the address and getting the valid data
• Memory Cycle time: the minimum time between the start of one memory access and the
start of the next.
Memory cycle time = access time plus transient time (any additional time required
before a second access can commence).
— Time may be required for the memory to “recover” before next access
— Cycle time is access + recovery
• Transfer Rate: This is the rate at which data can be transferred in and out of a
memory unit.
— Rate at which data can be moved
— For random access, R = 1 / cycle time
— For non-random access, Tn = Ta + N / R; where Tn – average time to read or
write N bits, Ta – average access time, N – number of bits, R – Transfer rate
in bits per second (bps).
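The non-random-access formula Tn = Ta + N/R translates directly into code; the numbers below are illustrative, not taken from the text.

```python
# The timing formula Tn = Ta + N/R as code, with illustrative
# numbers (assumed for this sketch): average access time Ta = 0.1 s,
# transfer rate R = 1e6 bps, block of N = 1e6 bits.

def average_rw_time(ta, n_bits, rate_bps):
    """Average time to read or write N bits: Tn = Ta + N / R."""
    return ta + n_bits / rate_bps

tn = average_rw_time(ta=0.1, n_bits=1_000_000, rate_bps=1_000_000)
print(tn)   # 1.1 seconds to read or write the block
```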
Physical Types
• Semiconductor
— RAM
• Magnetic
— Disk & Tape
• Optical
— CD & DVD
• Others
— Bubble
— Hologram
Physical Characteristics
• Decay: stored information may decay over time, resulting in data loss.
• Volatility: information is lost when electrical power is switched off.
• Erasable: whether the stored contents can be erased and rewritten.
• Power consumption: how much power the memory consumes.
Organization
• Physical arrangement of bits into words
• Not always obvious
- e.g. interleaved
CPU logic is usually faster than main memory access time, with the result that processing
speed is limited primarily by the speed of main memory
The cache is used for storing segments of programs currently being executed in the CPU
and temporary data frequently needed in the present calculations
The memory hierarchy system consists of all storage devices employed in a computer
system from slow but high capacity auxiliary memory to a relatively faster cache memory
accessible to high speed processing logic. The figure below illustrates memory hierarchy.
Hierarchy List
Registers
L1 Cache
L2 Cache
Main memory
Disk cache
Disk
Optical
Tape
Types of ROM
Programmable ROM (PROM)
o It is non-volatile and may be written into only once. The writing process is
performed electrically and may be performed by a supplier or customer at
a time later than the original chip fabrication.
Erasable Programmable ROM (EPROM)
o It is read and written electrically. However, before a write operation, all
the storage cells must be erased to the same initial state by exposing the
packaged chip to ultraviolet (UV) radiation. Erasure is performed by shining an
intense ultraviolet light through a window designed into the memory chip.
Because erasure is optical, EPROM is more expensive than PROM, but it has the
advantage of multiple update capability.
External Memory
The devices that provide backup storage are called external memory or auxiliary
memory. It includes serial access type such as magnetic tapes and random access
type such as magnetic disks.
Magnetic Tape
A magnetic tape is a strip of plastic coated with a magnetic recording medium.
Data can be recorded and read as a sequence of characters through a read / write
head. The tape can be stopped, started to move forward or in reverse, or rewound.
Data on tapes are structured as a number of parallel tracks running lengthwise.
Earlier tape systems typically used nine tracks. This made it possible to store data
one byte at a time, with an additional parity bit as the 9th track. The recording of
data in this form is referred to as parallel recording.
Magnetic Disk
A magnetic disk is a circular plate constructed of metal or plastic and coated
with magnetic material. Often both sides of the disk are used, and several disks
are stacked on one spindle, with a read/write head available on each surface. All
disks rotate together at high speed. Bits are stored on the magnetized surface in
spots along concentric circles called tracks. The tracks are commonly divided into
sections called sectors. After the read/write heads are positioned over the
specified track, the system has to wait until the rotating disk brings the
specified sector under the read/write head. Information transfer is very fast once
the beginning of a sector has been reached. Disks that are permanently attached to
the unit assembly and cannot be removed by the occasional user are called hard
disks; a drive with a removable disk is called a floppy disk drive.
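The rotational wait described above can be quantified: on average the disk must turn half a revolution before the desired sector arrives. A rough sketch of the usual access-time model; the drive parameters below are made-up illustrative values, not numbers from the text:

```python
# Rough disk access time model: seek + average rotational delay +
# transfer. All parameter values used below are assumptions.
def avg_access_ms(rpm, seek_ms, kib_transferred, kib_per_ms):
    rot_ms = (60_000.0 / rpm) / 2      # half a revolution on average
    xfer_ms = kib_transferred / kib_per_ms
    return seek_ms + rot_ms + xfer_ms

# Hypothetical 7200 RPM drive, 4 ms seek, 4 KiB read at 100 KiB/ms
t = avg_access_ms(rpm=7200, seek_ms=4.0, kib_transferred=4, kib_per_ms=100)
```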
Optical Disk
The huge commercial success of the CD enabled the development of low-cost optical
disk storage technology that has revolutionized computer data storage. The disk is
formed from a resin such as polycarbonate. Digitally recorded information is
imprinted as a series of microscopic pits on the surface of the polycarbonate. This
is done with a finely focused, high-intensity laser. The pitted surface is then
coated with a reflective layer, usually aluminum or gold. The shiny surface is
protected against dust and scratches by a top coat of acrylic.
Information is retrieved from a CD by a low-power laser. The intensity of the
reflected laser light changes as it encounters a pit. Specifically, if the laser
beam falls on a pit, which has a somewhat rough surface, the light scatters and low
intensity is reflected back to the sensor. The areas between pits are called lands.
A land is a smooth surface, which reflects light back at higher intensity. The
change between pits and lands is detected by a photosensor and converted into a
digital signal. The sensor samples the surface at regular intervals.
DVD-Technology
Multi-layer
Very high capacity (4.7 GB per layer)
Full length movie on single disk
Using MPEG compression
Finally standardized (honest!)
Movies carry regional coding
Players only play correct region films
DVD-Writable
Loads of trouble with standards
First generation DVD drives may not read first generation DVD-W disks
First generation DVD drives may not read CD-RW disks
Cache memory is intended to give memory speed approaching that of the fastest
memories available, while at the same time providing a large memory size at the
price of less expensive types of semiconductor memory. A relatively large and slow
main memory is paired with a smaller, faster cache memory that contains a copy of
portions of main memory.
When the processor attempts to read a word of memory, a check is made to determine if
the word is in the cache. If so, the word is delivered to the processor. If not, a block of
main memory, consisting of a fixed number of words, is read into the cache and then the
word is delivered to the processor.
The locality of reference property states that over a short interval of time,
addresses generated by a typical program refer to a few localized areas of memory
repeatedly. So if programs and data that are accessed frequently are placed in a
fast memory, the average access time can be reduced. This type of small, fast
memory is called cache memory, and it is placed between the CPU and the main memory.
When the CPU needs to access memory, the cache is examined. If the word is found in
the cache, it is read from the cache; if the word is not found in the cache, main
memory is accessed to read the word. A block of words containing the one just
accessed is then transferred from main memory to cache memory.
When a cache hit occurs, the data and address buffers are disabled and the
communication is only between processor and cache with no system bus traffic. When a
cache miss occurs, the desired word is first read into the cache and then transferred from
cache to processor. In the latter case, the cache is physically interposed between the
processor and main memory for all data, address and control lines.
Locality of Reference
References to memory at any given interval of time tend to be confined within
a few localized areas of memory. This property is called locality of reference. It
arises because program loops and subroutine calls are encountered frequently.
When a program loop is executed, the CPU executes the same portion of the program
repeatedly. Similarly, when a subroutine is called, the CPU fetches the starting
address of the subroutine and executes the subroutine program. Thus loops and
subroutines localize references to memory.
This principle states that memory references tend to cluster: over a long period of
time the clusters in use change, but over a short period of time the processor
primarily works with fixed clusters of memory references.
Spatial Locality
It refers to the tendency of execution to involve a number of memory locations
that are clustered.
It reflects the tendency of a program to access data locations sequentially, such
as when processing a table of data.
Temporal Locality
It refers to the tendency for a processor to access memory locations that have been
used recently. For example, iterative loops execute the same set of instructions
repeatedly.
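The payoff from locality can be shown with a tiny cache simulation. This is only a sketch; the cache geometry (4 lines, 4 words per block) and both access patterns are assumptions for illustration, not values from the text:

```python
# Tiny direct-mapped cache hit counter. Geometry and access patterns
# are illustrative assumptions (4 words per block, 4 cache lines).
BLOCK = 4          # words per block
LINES = 4          # cache lines

def hit_rate(addresses):
    cache = [None] * LINES                 # tag held by each line
    hits = 0
    for a in addresses:
        block = a // BLOCK                 # which block the word is in
        line, tag = block % LINES, block // LINES
        if cache[line] == tag:
            hits += 1                      # locality pays off here
        else:
            cache[line] = tag              # miss: fill the line
    return hits / len(addresses)

sequential = list(range(64))               # strong spatial locality
scattered = [(i * 17) % 64 for i in range(64)]   # little locality
```

The sequential trace hits on three of every four accesses because each miss brings in a whole block; the scattered trace rarely reuses a block before it is evicted.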
Direct Mapping
It is the simplest technique: it maps each block of main memory into only one
possible cache line, i.e. a given main memory block can be placed in one and only
one place in the cache.
i = j mod m
where i = cache line number, j = main memory block number, and m = number of lines
in the cache
The mapping function is easily implemented using the address. For purposes of cache
access, each main memory address can be viewed as consisting of three fields.
The least significant w bits identify a unique word or byte within a block of main
memory. The remaining s bits specify one of the 2^s blocks of main memory.
The cache logic interprets these s bits as a tag of (s - r) bits in the most
significant positions and a line field of r bits. The latter field identifies one
of the m = 2^r lines of the cache.
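The field split can be sketched in a few lines. The widths below follow the 24-bit example discussed in this section (2-bit word, 14-bit line, 8-bit tag); the sketch is illustrative, not tied to any particular machine:

```python
# Splitting a 24-bit address into tag / line / word fields, using the
# example widths from this section: w = 2, r = 14, tag = 8 bits.
W_BITS, R_BITS = 2, 14

def split_address(addr):
    word = addr & ((1 << W_BITS) - 1)             # word within block
    line = (addr >> W_BITS) & ((1 << R_BITS) - 1) # cache line field
    tag = addr >> (W_BITS + R_BITS)               # remaining tag bits
    return tag, line, word

# The line field is just the block number modulo the number of lines:
# block j (= addr >> W_BITS) lands on line j mod 2**R_BITS.
tag, line, word = split_address(0xABCDEF)
```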
24 bit address
2 bit word identifier (4 byte block)
22 bit block identifier
8 bit tag (=22-14), 14 bit slot or line
No two blocks in the same line have the same Tag field
Check contents of cache by finding line and checking Tag
Cache line    Main memory blocks held
0             0, 5, 10, 15, 20
1             1, 6, 11, 16, 21
2             2, 7, 12, 17, 22
3             3, 8, 13, 18, 23
4             4, 9, 14, 19, 24
Note that
o all locations in a single block of memory have the same higher order bits (call them the
block number), so the lower order bits can be used to find a particular word in the block.
o within those higher-order bits, their lower-order bits obey the modulo mapping given
above (assuming that the number of cache lines is a power of 2), so they can be used to
get the cache line for that block
o the remaining bits of the block number become a tag, stored with each cache line, and
used to distinguish one block from another that could fit into that same cache
line.
o If a program repeatedly accesses two blocks that map to the same line, the
cache miss rate is very high
Associative Mapping
It overcomes the disadvantage of direct mapping by permitting each main memory
block to be loaded into any line of the cache.
Cache control logic interprets a memory address simply as a tag and a word field
The tag uniquely identifies a block of memory
Cache control logic must simultaneously examine every line's tag for a match,
which requires fully associative memory
Very complex circuitry; complexity increases exponentially with size
Cache searching gets expensive
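A fully associative lookup can be sketched as a scan over every line's tag (hardware performs all the comparisons in parallel; the loop merely stands in for that). The cache contents and 4-word block size below are made up for illustration:

```python
# Fully associative lookup: the address tag (here the whole block
# number) is compared against every line's stored tag. Hardware does
# this in parallel; the loop below stands in for it.
W_BITS = 2                            # 4-word blocks (assumed)

def lookup(cache, addr):
    tag = addr >> W_BITS              # no line field: tag is s bits
    for line_tag, block in cache:
        if line_tag == tag:
            return block[addr & ((1 << W_BITS) - 1)]   # hit
    return None                       # miss: fetch block from memory

cache = [(5, [10, 11, 12, 13]),       # block 5, held in some line
         (9, [40, 41, 42, 43])]       # block 9, held in another line
```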
e.g. two addresses that map to the same set (set-associative mapping):
Address (Tag + remainder)   Tag   Data       Set number
1FF 7FFC                    1FF   12345678   1FFF
001 7FFC                    001   11223344   1FFF
Write Through
All write operations are made to main memory as well as to the cache, so main
memory is always valid and both copies always agree
Other CPUs can monitor traffic to main memory to keep their local caches up to
date
Generates lots of memory writes to main memory; this substantial traffic may
create a bottleneck and slows down writes
Write back
When an update occurs, an UPDATE bit associated with that slot is set, so when
the block is replaced it is written back first
During a write, only the contents of the cache are changed
Main memory is updated only when the cache line is to be replaced
Causes "cache coherency" problems: different values for the contents of an
address exist in the cache and in main memory
Complex circuitry is needed to avoid this problem
Accesses by I/O modules must occur through the cache
Multiple caches can still become invalidated unless some cache coherency
system is used. Such systems include:
o Bus Watching with Write Through: other caches monitor memory writes
by other caches (using write through) and invalidate their own cache line
on a match
o Hardware Transparency: additional hardware links multiple caches so
that writes to one cache are made to the others
o Non-cacheable Memory: only a portion of main memory is shared by
more than one processor, and it is non-cacheable
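The two policies can be contrasted in a small sketch of a single cache line; the dirty flag plays the role of the UPDATE bit described above, and the names and structure are illustrative assumptions:

```python
# One cache line under the two write policies. The dirty flag is the
# UPDATE bit from the text; everything else is an illustrative sketch.
class Line:
    def __init__(self):
        self.data, self.dirty = None, False

def write(line, memory, addr, value, write_through):
    line.data = value
    if write_through:
        memory[addr] = value          # main memory is always valid
    else:
        line.dirty = True             # write-back: defer the update

def evict(line, memory, addr):
    if line.dirty:                    # write-back flushes on replace
        memory[addr] = line.data
        line.dirty = False
```

With write_through=False the cache and main memory briefly disagree, which is exactly the coherency problem named above.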
Split Cache
o The cache is split into two parts, one for instructions and one for data. This
can outperform a unified cache in systems that support parallel execution and
pipelining (it reduces cache contention)
o The trend is toward split caches because of superscalar CPUs
o Better for pipelining, pre-fetching, and other parallel instruction execution
designs
o Eliminates cache contention between the instruction fetch unit and the
execution unit (which uses data)
Chapter – 7
Input-Output organization
7.1 Peripheral devices
In addition to the processor and a set of memory modules, the third key element of a
computer system is the input-output (I/O) subsystem, which provides an efficient
mode of communication between the central system and the outside environment.
Programs and data must be entered into computer memory for processing and results
obtained from computations must be recorded or displayed for the user.
Devices that are under the direct control of the computer are said to be connected on-
line. These devices are designed to read information into or out of the memory unit
upon command from CPU.
Input or output devices attached to the computer are also called peripherals.
Among the most common peripherals are keyboards, display units, and printers.
Peripherals that provide auxiliary storage for the system include magnetic disks and tapes.
Peripherals are electromechanical and electromagnetic devices of some complexity.
We can broadly classify peripheral devices into three categories:
o Human Readable: Communicating with the computer users, e.g. video
display terminal, printers etc.
o Machine Readable: Communicating with equipment, e.g. magnetic disk,
magnetic tape, sensor, actuators used in robotics etc.
o Communication: Communicating with remote devices means exchanging
data with that, e.g. modem, NIC (network interface Card) etc.
Control logic associated with the device controls the device's operation in
response to direction from the I/O module.
The transducer converts data from electrical to other forms of energy during
output and from other forms to electrical during input.
Buffer is associated with the transducer to temporarily hold data being transferred
between the I/O module and external devices i.e. peripheral environment.
Input Device
Keyboard
Optical input devices
o Card Reader
o Paper Tape Reader
o Optical Character Recognition (OCR)
o Optical Bar code reader (OBR)
o Digitizer
o Optical Mark Reader
Magnetic Input Devices
o Magnetic Stripe Reader
o Magnetic Ink Character Recognition (MICR)
Screen Input Devices
o Touch Screen
o Light Pen
o Mouse
Analog Input Devices
Output Device
Card Puncher, Paper Tape Puncher
Monitor (CRT, LCD, LED)
Printer (Impact, Ink Jet, Laser, Dot Matrix)
Plotter
Analog
Voice
o Synchronizes the data flow and supervises the transfer rate between peripheral
and CPU or Memory
I/O commands that the interface may receive:
o Control command: issued to activate the peripheral and to inform it what to do.
o Status command: used to test various status conditions in the interface and the
peripheral.
o Output data: causes the interface to respond by transferring data from the bus into
one of its registers.
o Input data: the opposite of output data; the interface places data received from the peripheral onto the bus.
Memory-mapped I/O
o A single set of read/write control lines (no distinction between memory and I/O
transfer)
o Memory and I/O addresses share a common address space, which reduces the memory
address range available
o No specific input or output instruction so the same memory reference instructions can
be used for I/O transfers
o Considerable flexibility in handling I/O operations
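A minimal sketch of the memory-mapped idea: the same read and write operations reach either RAM or a device register, depending only on the address. The register addresses below are hypothetical, chosen for illustration:

```python
# Memory-mapped I/O sketch: one address space, one set of read/write
# operations; STATUS_REG and DATA_REG are hypothetical device
# registers, everything else is RAM.
RAM_SIZE = 0x100
STATUS_REG, DATA_REG = 0x1F0, 0x1F1   # assumed register addresses

class Bus:
    def __init__(self):
        self.ram = [0] * RAM_SIZE
        self.dev = {STATUS_REG: 0x01, DATA_REG: 0x41}  # ready, 'A'

    def read(self, addr):             # same operation for RAM or I/O
        return self.dev[addr] if addr in self.dev else self.ram[addr]

    def write(self, addr, value):     # an ordinary store reaches the device
        if addr in self.dev:
            self.dev[addr] = value
        else:
            self.ram[addr] = value
```

The same memory-reference instruction (here, read/write) serves both memory and I/O, which is the flexibility the bullet list above describes.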
Information in each port can be assigned a meaning depending on the mode of operation
of the I/O device
o Port A = Data; Port B = Command; Port C = Status
CPU initializes (loads) each port by transferring a byte to the Control Register
o Allows the CPU to define the mode of operation of each port
o Programmable Port: By changing the bits in the control register, it is possible to
change the interface characteristics
The I/O device places the data on the I/O bus and enables its data valid signal
The interface accepts the data into the data register, sets the F bit of the status
register, and enables the data accepted signal
The data valid line is disabled by the I/O device
The CPU continuously monitors the interface, checking the F bit of the status
register
o If it is set (1), the CPU reads the data from the data register and sets the F
bit to zero
o If it is reset (0), the CPU continues monitoring the interface
The interface disables the data accepted signal and the system returns to the initial
state, where the next item of data is placed on the data bus
Characteristics:
Continuous CPU involvement
CPU slowed down to I/O speed
Simple
Least hardware
Polling, or polled operation, in computer science, refers to actively sampling the status of an
external device by a client program as a synchronous activity. Polling is most often used in terms
of input/output (I/O), and is also referred to as polled I/O or software driven I/O.
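The polled transfer described above reduces to a busy-wait loop on the interface's F bit. A sketch, with the interface's registers modelled as a plain dictionary (an assumption for illustration):

```python
# Polled input: busy-wait on the interface's F (flag) bit, read the
# data register, then clear F so the device can supply the next item.
F_BIT = 0x01

def poll_read(interface):
    while not (interface["status"] & F_BIT):   # CPU tied up here
        pass
    data = interface["data"]                   # read data register
    interface["status"] &= ~F_BIT              # set F back to zero
    return data

iface = {"status": F_BIT, "data": 0x55}        # device already ready
```

The busy-wait loop is where the continuous CPU involvement noted above comes from: the processor can do nothing else until F is set.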
The problem with programmed I/O is that the processor has to wait a long time for
the I/O module of concern to be ready for either reception or transmission of data.
The processor, while waiting, must repeatedly interrogate the status of the I/O
module. As a result, the performance of the entire system is severely degraded.
An alternative is for the processor to issue an I/O command to a module
and then go on to do some other useful work. The I/O module will then interrupt the
processor to request service when it is ready to exchange data with processor. The
processor then executes the data transfer, and then resumes its former processing. The
interrupt can be initiated either by software or by hardware.
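A sketch of this flow: the processor does useful work and samples the interrupt request only at the end of each instruction cycle, then runs the service routine and resumes. The five-instruction program and the cycle at which the device interrupts are made-up values:

```python
# Interrupt-driven flow in miniature: the CPU executes instructions
# and samples the interrupt request only at the end of each cycle.
def run(program, interrupt_at, service_log):
    for cycle, instruction in enumerate(program):
        instruction()                     # normal useful work
        if cycle == interrupt_at:         # request checked at cycle end
            service_log.append(cycle)     # "ISR": exchange the data here

work_done, isr_log = [], []
run([lambda: work_done.append(1)] * 5, interrupt_at=2, service_log=isr_log)
```

Unlike the polling loop, every instruction of useful work still executes; the data transfer is slotted in only when the device asks for it.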
Priority Interrupt
Determines which interrupt is to be served first when two or more requests are
made simultaneously
Also determines which interrupts are permitted to interrupt the computer while
another is being serviced
Higher priority interrupts can make requests while a lower priority interrupt is
being serviced
2. Parallel Priority
Priority Encoder
Determines the highest priority interrupt when more than one interrupt takes place
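A priority encoder can be sketched as picking the lowest-numbered active request line from a bit vector of pending interrupts (treating line 0 as highest priority is an assumption for illustration; real hardware fixes its own ordering):

```python
# Priority encoder sketch: given pending interrupt requests as a bit
# vector, return the index of the highest-priority active line.
def priority_encode(requests):
    if requests == 0:
        return None                  # no interrupt pending
    line = 0
    while not (requests >> line) & 1:
        line += 1                    # skip inactive lines
    return line
```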
Interrupt Cycle
At the end of each Instruction cycle
The transfer of data between a peripheral and memory without CPU interaction,
letting the peripheral device manage the memory bus directly, is termed Direct
Memory Access (DMA).
The transfer of data between memory and I/O can take place in two ways: DMA burst
and cycle stealing.
DMA Burst: a block of data consisting of a number of memory words is transferred at
one time.
Cycle Stealing: DMA transfers one data word at a time, after which it must return
control of the buses to the CPU.
The CPU is usually much faster than I/O (DMA), so the CPU uses most of the
memory cycles
The DMA controller steals memory cycles from the CPU
For those stolen cycles, the CPU remains idle
For a slow CPU, the DMA controller may steal most of the memory cycles,
which may cause the CPU to remain idle for a long time
DMA Controller
The DMA controller communicates with the CPU through the data bus and control lines.
DMA select signal is used for selecting the controller, the register select is for selecting
the register. When the bus grant signal is zero, the CPU communicates through the data
bus to read or write into the DMA register. When bus grant is one, the DMA controller
takes the control of buses and transfers the data between the memory and I/O.
The address register specifies the desired location of the memory which is incremented
after each word is transferred to the memory. The word count register holds the number
of words to be transferred, which is decremented after each transfer until it reaches
zero. When it is zero, it indicates the end of the transfer, after which the bus grant
signal from the CPU is made low and the CPU returns to its normal operation. The
control register specifies the
mode of transfer which is Read or Write.
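The register behaviour just described (address register incrementing, word count decrementing to zero) can be sketched as follows; the memory is a plain list and all sizes are illustrative:

```python
# DMA register behaviour in miniature: each transferred word bumps the
# address register and decrements the word count; a count of zero
# marks the end of the transfer.
def dma_write(memory, address_reg, word_count, words):
    for w in words[:word_count]:
        memory[address_reg] = w       # word goes straight to memory
        address_reg += 1              # address register increments
        word_count -= 1               # word count decrements
    return address_reg, word_count    # count == 0: transfer complete

mem = [0] * 16
final_addr, remaining = dma_write(mem, 4, 3, [7, 8, 9])
```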
DMA Transfer
DMA request signal is given from I/O device to DMA controller.
DMA sends the bus request signal to the CPU, in response to which the CPU suspends
its current instructions and initializes the DMA by sending the following information.
o The starting address of the memory block where the data are available (for read)
and where data to be stored (for write)
o The word count which is the number of words in the memory block
o Control to specify the mode of transfer
o Sends a bus grant of 1 so that the DMA controller can take control of the buses
o DMA sends the DMA acknowledge signal in response to which peripheral device
puts the words in the data bus (for write) or receives a word from the data bus (for
read).
DMA Operation
CPU tells DMA controller:-
o Read/Write
o Device address
o Starting address of memory block for data
o Amount of data to be transferred
CPU carries on with other work
DMA controller deals with transfer
DMA controller sends interrupt when finished
A computer may incorporate one or more external processors and assign them the task of
communicating directly with the I/O devices, so that each interface need not communicate
with the CPU. An I/O processor (IOP) is a processor with direct memory access capability
that communicates with I/O devices. IOP instructions are specifically designed to
facilitate I/O transfer. The IOP can also perform other processing tasks such as
arithmetic, logic, branching and code translation.
The memory unit occupies a central position and can communicate with each processor by
means of direct memory access. The CPU is responsible for processing data needed in the
solution of computational tasks. The IOP provides a path for transferring data between various
peripheral devices and memory unit.
In most computer systems, the CPU is the master while the IOP is a slave processor. The
CPU initiates the IOP, after which the IOP operates independently of the CPU and
transfers data between the peripherals and memory. For example, the IOP may receive 5
bytes from an input device at the device rate and bit capacity, pack them into one block
of 40 bits, and transfer the block to memory. Similarly, an output word transferred from
memory to the IOP is directed from the IOP to the output device at the device rate and
bit capacity.
A data communication processor is an I/O processor that distributes and collects data
from remote terminals connected through telephone and other communication lines. In
ordinary processor communication, the processor communicates with the I/O devices
through a common bus, i.e. data and control lines shared by each peripheral. In data
communication, the processor communicates with each terminal through a single pair of wires.
The way that remote terminals are connected to a data communication processor is via telephone
lines or other public or private communication facilities. The data communication may be either
through synchronous transmission or through asynchronous transmission. One of the functions of
data communication processor is check for transmission errors. An error can be detected by
checking the parity in each character received. The other ways are checksum, longitudinal
redundancy check (LRC) and cyclic redundancy check (CRC).
Data can be transmitted between two points in three different modes. The first is simplex,
where data can be transmitted in only one direction, such as TV broadcasting. The second
is half duplex, where data can be transmitted in both directions but in only one direction
at a time, such as a walkie-talkie. The third is full duplex, where data can be
transmitted in both directions simultaneously, such as the telephone.
The communication lines, modems and other equipment used in the transmission of
information between two or more stations are collectively called a data link. The orderly
transfer of information in a data link is accomplished by means of a protocol.
Chapter – 8
Multiprocessors
8.1 Characteristics of multiprocessors
A multiprocessor system is an interconnection of two or more CPUs with memory
and input-output equipment.
The term “processor” in multiprocessor can mean either a central processing unit
(CPU) or an input-output processor (IOP).
Multiprocessors are classified as multiple instruction stream, multiple data stream
(MIMD) systems
The similarity and distinction between multiprocessor and multicomputer are
o Similarity
Both support concurrent operations
o Distinction
The network consists of several autonomous computers that may
or may not communicate with each other.
A multiprocessor system is controlled by one operating system that
provides interaction between processors and all the components of
the system cooperate in the solution of a problem.
Multiprocessing improves the reliability of the system.
The benefit derived from a multiprocessor organization is an improved system
performance.
o Multiple independent jobs can be made to operate in parallel.
o A single job can be partitioned into multiple parallel tasks.
Multiprocessing can improve performance by decomposing a program into
parallel executable tasks.
o The user can explicitly declare that certain tasks of the program be
executed in parallel.
This must be done prior to loading the program by specifying the
parallel executable segments.
o The other is to provide a compiler with multiprocessor software that can
automatically detect parallelism in a user’s program.
Multiprocessors are classified by the way their memory is organized.
o A multiprocessor system with common shared memory is classified as a
shared-memory or tightly coupled multiprocessor.
These tolerate a higher degree of interaction between tasks.
o A system in which each processor element has its own private local memory is
classified as a distributed-memory or loosely coupled system.
These are most efficient when the interaction between tasks is minimal.
Multiport Memory
A multiport memory system employs separate buses between each memory module and
each CPU.
The module must have internal control logic to determine which port will have access to
memory at any given time.
Memory access conflicts are resolved by assigning fixed priorities to each memory port.
Adv.:
o The high transfer rate can be achieved because of the multiple paths.
Disadv.:
o It requires expensive memory control logic and a large number of cables and
connections
Crossbar Switch
Consists of a number of crosspoints that are placed at intersections between processor
buses and memory module paths.
The small square in each crosspoint is a switch that determines the path from a processor
to a memory module.
Adv.:
o Supports simultaneous transfers from all memory modules
Disadv.:
o The hardware required to implement the switch can become quite large and
complex.
Below fig. shows the functional design of a crossbar switch connected to one memory
module.
Using the 2x2 switch as a building block, it is possible to build a multistage network to
control the communication between a number of sources and destinations.
o To see how this is done, consider the binary tree shown in Fig. below.
o Certain request patterns cannot be satisfied simultaneously, e.g. if P1 is
connected to destinations 000 through 011, then P2 can reach only 100 through 111
One such topology is the omega switching network shown in Fig. below
Some request patterns cannot be connected simultaneously, e.g. two sources cannot
be connected simultaneously to destinations 000 and 001
In a tightly coupled multiprocessor system, the source is a processor and the destination
is a memory module.
Set up the path, transfer the address into memory, then transfer the data
In a loosely coupled multiprocessor system, both the source and destination are
processing elements.
Hypercube System
The hypercube or binary n-cube multiprocessor structure is a loosely coupled system
composed of N = 2^n processors interconnected in an n-dimensional binary cube.
o Each processor forms a node of the cube; in effect, it contains not only a CPU
but also local memory and an I/O interface.
o Each processor address differs from that of each of its n neighbors by exactly one
bit position.
Fig. below shows the hypercube structure for n=1, 2, and 3.
Routing messages through an n-cube structure may take from one to n links from a
source node to a destination node.
o A routing procedure can be developed by computing the exclusive-OR of the
source node address with the destination node address.
o The message is then sent along the axes corresponding to the 1 bits of the
result, i.e. the axes on which the two nodes differ.
A representative of the hypercube architecture is the Intel iPSC computer complex.
o It consists of 128 (n = 7) microcomputers; each node consists of a CPU, a
floating-point processor, local memory, and serial communication interface units.
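The exclusive-OR routing rule described above can be sketched directly; correcting the differing bits from the low-order end is one arbitrary but valid choice of axis order:

```python
# XOR routing in an n-cube: src ^ dst has a 1 bit on every axis the
# message must cross; the hop count equals the number of 1 bits.
def route(src, dst, n):
    path, node = [src], src
    diff = src ^ dst                  # axes on which the nodes differ
    for axis in range(n):
        if (diff >> axis) & 1:
            node ^= 1 << axis         # one link per differing axis
            path.append(node)
    return path
```

For example, routing from node 000 to node 101 in a 3-cube crosses axes 0 and 2, taking two links.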
Interprocess Synchronization
The instruction set of a multiprocessor contains basic instructions that are used to
implement communication and synchronization between cooperating processes.
o Communication refers to the exchange of data between different processes.
o Synchronization refers to the special case where the data used to communicate
between processors is control information.