Chap1 Intro

Chapter 1
Basics of Architectural Design
RISC Machines
• The underlying philosophy of RISC machines is that
a system is better able to manage program
execution when the program consists of only a few
different instructions that are the same length and
require the same number of clock cycles to decode
and execute.
• RISC systems access memory only with explicit load
and store instructions.
• In CISC systems, many different kinds of instructions
access memory, making instruction length variable
and fetch-decode-execute time unpredictable.
• RISC systems shorten execution time by reducing
the clock cycles per instruction.
• CISC systems improve performance by reducing the
number of instructions per program.
• The simple instruction set of RISC machines
enables control units to be hardwired for maximum
speed.
• With fixed-length instructions, RISC lends itself to
pipelining and speculative execution.
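
The first two bullets can be read against the standard CPU time equation: time per program = instructions per program × cycles per instruction × time per cycle. The sketch below only illustrates that trade-off; the function name and every number in it are assumptions chosen for the example, not measurements of any real machine.

```python
def cpu_time_ns(instruction_count, cycles_per_instruction, cycle_time_ns):
    """Time per program = instructions x cycles/instruction x time/cycle."""
    return instruction_count * cycles_per_instruction * cycle_time_ns


# Hypothetical workloads: the CISC program is shorter, but each
# instruction needs more clock cycles on average.
cisc_time = cpu_time_ns(instruction_count=100, cycles_per_instruction=6,
                        cycle_time_ns=1.0)
risc_time = cpu_time_ns(instruction_count=250, cycles_per_instruction=1.5,
                        cycle_time_ns=1.0)

print(cisc_time, risc_time)
```

With these made-up figures the longer RISC program still finishes first, because its lower cycles-per-instruction term outweighs its higher instruction count.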
Hardwired control unit
• Hardwired control units are implemented with
combinational logic units: a finite number of gates that
generate specific control signals based on the
instruction being executed.
• There is no ROM inside the control unit (CU) and no microprogramming.
• Hardwired control units are generally faster than
microprogrammed designs.
• The more complex and variable instruction set of CISC
machines requires microprogram-based control
units that interpret instructions as they are fetched from
memory.
– This translation takes time.
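
As a rough illustration of the difference, the sketch below models a hardwired control unit as fixed Boolean functions of the opcode bits, and a microprogrammed one as a lookup into a stored sequence of microinstructions. The three 2-bit opcodes, the signal names, and the micro-operation names are all hypothetical and do not come from any real ISA.

```python
# Hypothetical 2-bit opcodes for a toy machine.
ADD, LOAD, STORE = 0b00, 0b01, 0b10


def hardwired_control(opcode):
    """Control signals as fixed Boolean functions of the opcode bits."""
    b1 = bool(opcode & 0b10)
    b0 = bool(opcode & 0b01)
    return {
        "reg_write": not b1,           # ADD and LOAD write a register
        "mem_read": (not b1) and b0,   # LOAD only
        "mem_write": b1 and not b0,    # STORE only
    }


# Microprogrammed alternative: the opcode selects a stored sequence of
# microinstructions that the control unit steps through one at a time.
CONTROL_STORE = {
    ADD: ["read_registers", "alu_add", "write_register"],
    LOAD: ["read_registers", "compute_address", "read_memory",
           "write_register"],
    STORE: ["read_registers", "compute_address", "write_memory"],
}


def microprogrammed_control(opcode):
    """Yield one microinstruction per step for the given opcode."""
    for micro_op in CONTROL_STORE[opcode]:
        yield micro_op


print(hardwired_control(LOAD))
print(list(microprogrammed_control(LOAD)))
```

The hardwired version produces its signals in one evaluation of the logic, while the microprogrammed version has to step through several stored entries, which is the translation cost the bullet above refers to.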
• Consider the program fragments:

CISC:
           mov ax, 10
           mov bx, 5
           mul bx, ax

RISC:
           mov ax, 0
           mov bx, 10
           mov cx, 5
    Begin: add ax, bx
           loop Begin
• The total clock cycles for the CISC version might be:
(2 movs × 1 cycle) + (1 mul × 30 cycles) = 32 cycles
• The total clock cycles for the RISC version would be:
(3 movs × 1 cycle) + (5 adds × 1 cycle) + (5 loops × 1 cycle) = 13 cycles
• With the RISC clock cycle also being shorter, RISC gives us
much faster execution speeds.
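
The arithmetic above can be checked with a short sketch. The cycle costs (1 per mov, add, and loop; 30 per mul) are the ones the text assumes; the function and variable names are introduced here only for illustration.

```python
def total_cycles(mix):
    """Sum executions x cycles over (mnemonic, executions, cycles_each)."""
    return sum(executions * cycles for _, executions, cycles in mix)


# Instruction mixes for the two fragments, with the text's assumed costs.
CISC_MIX = [("mov", 2, 1), ("mul", 1, 30)]
RISC_MIX = [("mov", 3, 1), ("add", 5, 1), ("loop", 5, 1)]


def risc_product(multiplicand=10, count=5):
    """Mimic the RISC fragment: add bx to ax once per pass of the loop."""
    ax = 0
    for _ in range(count):
        ax += multiplicand
    return ax


print(total_cycles(CISC_MIX), total_cycles(RISC_MIX), risc_product())
# prints: 32 13 50
```

Both fragments compute 10 × 5 = 50; the RISC version simply replaces one expensive multiply with five single-cycle additions.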
• Because of their load-store ISAs, RISC
architectures require a large number of CPU
registers.
• These registers provide fast access to data during
sequential program execution.
• They can also be employed to reduce the overhead
typically caused by passing parameters to
subprograms.
• Instead of pulling parameters off of a stack, the
subprogram is directed to use a subset of registers.
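
Overlapping register windows are one well-known way to do this: the caller's "out" registers are the same physical registers as the callee's "in" registers, so a call passes parameters without copying anything. The toy model below is a sketch under assumed sizes (4 windows, 2 in, 2 local, 2 out registers); the class and method names are hypothetical, and window overflow is only flagged, not handled.

```python
class RegisterWindows:
    """Toy overlapping windows: caller 'out' registers = callee 'in'."""

    def __init__(self, n_windows=4, n_in=2, n_local=2):
        self.n_in = n_in
        self.n_windows = n_windows
        self.stride = n_in + n_local      # distance between window bases
        self.regs = [0] * (n_windows * self.stride + n_in)
        self.cwp = 0                      # current window pointer

    def _index(self, bank, i):
        offset = {"in": 0, "local": self.n_in, "out": self.stride}[bank]
        return self.cwp * self.stride + offset + i

    def read(self, bank, i):
        return self.regs[self._index(bank, i)]

    def write(self, bank, i, value):
        self.regs[self._index(bank, i)] = value

    def call(self):
        if self.cwp + 1 >= self.n_windows:
            raise OverflowError("out of windows; hardware would spill")
        self.cwp += 1

    def ret(self):
        self.cwp -= 1


rw = RegisterWindows()
rw.write("out", 0, 42)       # caller places an argument in an out register
rw.call()                    # switching windows moves no data
arg = rw.read("in", 0)       # callee sees the argument as an in register
rw.write("in", 0, arg + 1)   # callee leaves its result in the same register
rw.ret()
result = rw.read("out", 0)   # caller reads the result from its out register
print(arg, result)
```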
RISC                                           CISC
Multiple register sets.                        Single register set.
Three operands per instruction.                One or two register operands per instruction.
Parameter passing through register windows.    Parameter passing through memory.
Single-cycle instructions.                     Multiple-cycle instructions.
Hardwired control.                             Microprogrammed control.
Highly pipelined.                              Less pipelined.
Simple instructions, few in number.            Many complex instructions.
Fixed-length instructions.                     Variable-length instructions.
Complexity in compiler.                        Complexity in microcode.
Only LOAD/STORE instructions access memory.    Many instructions can access memory.
Few addressing modes.                          Many addressing modes.
RISC Pipelining
• One of the driving forces for creating RISC processors was the
opportunity they would provide for efficient pipelining.
• In RISC architectures, most instructions are register to register,
and an instruction cycle has the following two stages:
– I: Instruction fetch.
– E: Execute (Performs an ALU operation with register input and output)
• For load and store operations, three stages are required:
– I: Instruction fetch.
– E: Execute (Calculates memory address)
– D: Memory (Register-to-memory or memory-to-register operation)
Figure (a) depicts the timing of a sequence of instructions
using no pipelining. Clearly, this is a wasteful process.
Even very simple pipelining can substantially improve
performance.
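
To make the unpipelined cost concrete, the sketch below lays the stages of each instruction end to end, using the two-stage and three-stage breakdowns given above. It assumes every stage takes one time unit; the instruction sequence is a made-up example, not the one in the figure.

```python
STAGES = {
    "alu": ["I", "E"],          # register-to-register instruction
    "load": ["I", "E", "D"],    # memory-to-register
    "store": ["I", "E", "D"],   # register-to-memory
}


def serial_timeline(instructions):
    """Return (total time units, list of (instruction index, stage))."""
    timeline = [(n, stage)
                for n, kind in enumerate(instructions)
                for stage in STAGES[kind]]
    return len(timeline), timeline


# Hypothetical sequence: two loads, one ALU operation, one store.
total, timeline = serial_timeline(["load", "load", "alu", "store"])
print(total)
print(timeline[:4])
```

Here no stage of one instruction ever overlaps a stage of another, so the total is just the sum of the stage counts (3 + 3 + 2 + 3 = 11 time units).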
• Figure (b) shows a two-stage pipelining scheme, in
which the I and E stages of two different instructions
are performed simultaneously. It is assumed that a
single-port memory is used and that only one
memory access is possible per stage.
• So E and D cannot be done simultaneously.
• We see that the instruction fetch stage of the
second instruction can be performed in parallel
with the first part of the execute/memory stage.
• However, the execute/memory stage of the
second instruction must be delayed until the first
instruction clears the second stage of the
pipeline.
• This scheme can yield up to twice the execution
rate of a serial scheme.
• Two problems prevent the maximum
speedup from being achieved.
1. First, we assume that a single-port
memory is used and that only one
memory access is possible per stage. This
requires the insertion of a wait state in
some instructions.
2. Second, a branch instruction interrupts the
sequential flow of execution.
To accommodate this with minimum circuitry,
a NOOP instruction can be inserted into the
instruction stream by the compiler or
assembler.
Pipelining can be improved further by permitting two
memory accesses per stage, as shown in Figure (c).
• Now, up to three instructions can be overlapped, and
the improvement is as much as a factor of 3.
• Again, branch instructions cause the speedup to fall
short of the maximum possible.
• Also, note that data dependencies have an effect.
• If an instruction needs an operand that is altered by the
preceding instruction, a delay is required.
• Again, this can be accomplished by a NOOP.
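
The data-dependency rule in the last bullets can be sketched as a simple pass over the instruction stream: whenever an instruction reads a register that the immediately preceding instruction writes, a NOOP is placed between them. This assumes a one-instruction delay is enough, as the text states; the tuple format and the example program are hypothetical.

```python
def insert_noops(program):
    """program: list of (mnemonic, dest_register_or_None, source_registers).

    Returns a new list with a NOOP placed before any instruction that
    reads the register written by the instruction just before it.
    """
    padded, prev_dest = [], None
    for mnemonic, dest, sources in program:
        if prev_dest is not None and prev_dest in sources:
            padded.append(("NOOP", None, ()))
        padded.append((mnemonic, dest, sources))
        prev_dest = dest
    return padded


# Hypothetical fragment: rC = rA + rB, then store rC.
program = [
    ("LOAD", "rA", ()),
    ("LOAD", "rB", ()),
    ("ADD", "rC", ("rA", "rB")),   # reads rB, written by the previous LOAD
    ("STORE", None, ("rC",)),      # reads rC, written by the previous ADD
]
padded = insert_noops(program)
print([mnemonic for mnemonic, _, _ in padded])
```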
• Figure (d) shows the result with a four-stage pipeline. Since
the E stage usually involves an ALU operation, it may be
longer than the other stages. In this case, we can divide it into
two substages:
◦ E1: Register file read
◦ E2: ALU operation and register write
• Up to four instructions at a time can be under way, and the
maximum potential speedup is a factor of 4.
• Note again the use of NOOPs to account for data and
branch delays.
NOOP: No Operation
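
Why the maximum speedup equals the number of stages can be seen from an idealized timing model. The sketch assumes every instruction passes through all k stages, every stage takes one time unit, and there are no stalls or NOOPs, which is exactly the best case that the delays above prevent. The function names are introduced here for illustration.

```python
def pipelined_time(n_instructions, n_stages):
    """First instruction needs n_stages units; each later one finishes
    one unit after its predecessor."""
    return n_stages + (n_instructions - 1)


def ideal_speedup(n_instructions, n_stages):
    """Serial time divided by pipelined time under the same assumptions."""
    serial_time = n_instructions * n_stages
    return serial_time / pipelined_time(n_instructions, n_stages)


for n in (1, 10, 1000):
    print(n, round(ideal_speedup(n, 4), 3))
```

As the instruction count grows, the ratio approaches the stage count (4 here) but never reaches it, and any inserted NOOP lowers it further.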
Optimization of Pipelining
• Because of the simple and regular nature
of RISC instructions, pipelining schemes
can be efficiently employed.
• There are few variations in instruction
execution duration, and the pipeline can
be tailored to reflect this.
• However, data and branch dependencies
reduce the overall execution rate.
To compensate for these dependencies, code reorganization
techniques have been developed.
Delayed branch
• The delayed branch, a way of increasing the efficiency of the
pipeline, makes use of a branch that does not take effect until after
execution of the following instruction (hence the term
delayed).
• The instruction location immediately following the branch
is referred to as the delay slot.
• After 102 is executed, the next instruction to be executed
is 105.
• To regularize the pipeline, a NOOP is inserted after this
branch.
• However, increased performance is achieved if the
instructions at 101 and 102 are interchanged.
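
The effect of filling the delay slot can be tried on a toy machine. The simulator below is a sketch, not a model of any real processor: it assumes a single delay slot, and both program listings, the mnemonics, and the operand order are hypothetical stand-ins chosen to match the addresses mentioned above.

```python
def run_delayed(program, memory, start=100):
    """Run a toy program on a machine with one branch delay slot.

    The instruction after a JUMP always executes before control
    transfers. Returns (final memory, list of addresses executed).
    """
    mem, regs = dict(memory), {}
    pc, pending, trace = start, None, []
    while pc in program:
        op, *args = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if pending is not None:       # this instruction sat in the delay slot
            next_pc, pending = pending, None
        if op == "JUMP":
            pending = args[0]         # takes effect after the next instruction
        elif op == "LOAD":
            regs[args[1]] = mem[args[0]]
        elif op == "ADDI":
            regs[args[1]] = regs.get(args[1], 0) + args[0]
        elif op == "ADD":
            regs[args[1]] = regs.get(args[1], 0) + regs.get(args[0], 0)
        elif op == "SUB":
            regs[args[1]] = regs.get(args[1], 0) - regs.get(args[0], 0)
        elif op == "STORE":
            mem[args[1]] = regs[args[0]]
        elif op != "NOOP":
            raise ValueError(f"unknown op {op}")
        pc = next_pc
    return mem, trace


# Delay slot filled with a NOOP (addresses after the jump shift by one).
WITH_NOOP = {
    100: ("LOAD", "X", "rA"), 101: ("ADDI", 1, "rA"), 102: ("JUMP", 106),
    103: ("NOOP",), 104: ("ADD", "rA", "rB"), 105: ("SUB", "rC", "rB"),
    106: ("STORE", "rA", "Z"),
}
# Jump moved up one place so that useful work sits in the delay slot.
REORDERED = {
    100: ("LOAD", "X", "rA"), 101: ("JUMP", 105), 102: ("ADDI", 1, "rA"),
    103: ("ADD", "rA", "rB"), 104: ("SUB", "rC", "rB"),
    105: ("STORE", "rA", "Z"),
}

mem_noop, trace_noop = run_delayed(WITH_NOOP, {"X": 7})
mem_swap, trace_swap = run_delayed(REORDERED, {"X": 7})
print(mem_noop["Z"], trace_noop)
print(mem_swap["Z"], trace_swap)
```

Both versions store the same value, but the reordered one reaches the store in one fewer instruction slot because the delay slot does useful work instead of holding a NOOP.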
• The JUMP instruction is fetched at time 4.
• At time 5, the JUMP instruction is executed at the same
time that instruction 103 (ADD instruction) is fetched.
• Because a JUMP occurs, which updates the program
counter, the pipeline must be cleared of instruction 103; at
time 6, instruction 105, which is the target of the JUMP, is
loaded.
• The table above shows the same pipeline handled by
a typical RISC organization. The timing is the same.
• However, because of the insertion of the NOOP
instruction, we do not need special circuitry to clear
the pipeline; the NOOP simply executes with no effect.
Classifying Instruction Set
Architectures
• The type of internal storage in a processor is the
most basic differentiation.
• The major choices are a stack, an accumulator, or a
set of registers.
• Operands may be named explicitly or implicitly:
– The operands in a stack architecture are implicitly on the
top of the stack.
– In an accumulator architecture, one operand is implicitly
the accumulator.
– The general-purpose register architectures have only
explicit operands.
The code sequence for C = A + B for four
classes of instruction sets.
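
The figure itself is not reproduced in this text. As a stand-in, the sketch below writes C = A + B for a stack, an accumulator, a register-memory, and a load-store machine, following the usual textbook form of this comparison, and runs each sequence on a tiny interpreter. The mnemonics, tuple encoding, and register names are hypothetical.

```python
def run(program, memory):
    """Interpret a toy program; returns the final memory as a dict."""
    mem = dict(memory)
    stack, regs, acc = [], {}, 0
    for op, *a in program:
        if op == "push":
            stack.append(mem[a[0]])
        elif op == "pop":
            mem[a[0]] = stack.pop()
        elif op == "add_stack":          # operands are implicit (top of stack)
            stack.append(stack.pop() + stack.pop())
        elif op == "load_acc":
            acc = mem[a[0]]
        elif op == "add_acc":            # one operand is implicitly the acc
            acc += mem[a[0]]
        elif op == "store_acc":
            mem[a[0]] = acc
        elif op == "load":
            regs[a[0]] = mem[a[1]]
        elif op == "store":
            mem[a[1]] = regs[a[0]]
        elif op == "add_rm":             # register + memory operand
            regs[a[0]] = regs[a[1]] + mem[a[2]]
        elif op == "add_rr":             # registers only
            regs[a[0]] = regs[a[1]] + regs[a[2]]
        else:
            raise ValueError(f"unknown op {op}")
    return mem


STACK = [("push", "A"), ("push", "B"), ("add_stack",), ("pop", "C")]
ACCUMULATOR = [("load_acc", "A"), ("add_acc", "B"), ("store_acc", "C")]
REGISTER_MEMORY = [("load", "R1", "A"), ("add_rm", "R3", "R1", "B"),
                   ("store", "R3", "C")]
LOAD_STORE = [("load", "R1", "A"), ("load", "R2", "B"),
              ("add_rr", "R3", "R1", "R2"), ("store", "R3", "C")]

results = {name: run(code, {"A": 2, "B": 3})["C"]
           for name, code in [("stack", STACK), ("accumulator", ACCUMULATOR),
                              ("register-memory", REGISTER_MEMORY),
                              ("load-store", LOAD_STORE)]}
print(results)
```

All four sequences leave the same value in C; they differ in how many operands each instruction names explicitly and in which instructions touch memory.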
• Virtually every new architecture designed after 1980 uses a
load-store register architecture.
• The major reasons for the emergence of general-purpose
register (GPR) computers are:
– Registers, like other forms of storage internal to the processor,
are faster than memory.
– Registers are more efficient for a compiler to use than other
forms of internal storage.
• For example, on a register computer the expression (A *
B) + (B * C) – (A * D) may be evaluated by doing the
multiplications in any order.
– Choosing the order may be more efficient because of the location of
the operands or pipelining concerns.
• On a stack computer the hardware must evaluate the
expression in only one order.
– Because operands are hidden on the stack, it may have to load an
operand multiple times.
• Registers can be used to hold variables. When variables
are allocated to registers, memory traffic is reduced,
the program speeds up, and code density improves
(a register can be named with fewer bits than a memory location).
• Two major instruction set characteristics divide GPR
architectures.
• Both characteristics concern the nature of operands for
arithmetic or logical instructions (ALU instructions).
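
The earlier point about operand reloading on a stack machine can be made concrete by counting memory loads for (A * B) + (B * C) – (A * D). The sketch assumes a pure stack machine with no way to duplicate or keep an operand, and a register machine with at least four free registers; the function names and the postfix encoding are introduced here for illustration.

```python
# (A * B) + (B * C) - (A * D) in postfix order for a pure stack machine.
STACK_CODE = ["A", "B", "*", "B", "C", "*", "+", "A", "D", "*", "-"]


def stack_loads(postfix):
    """Each variable token is one push, that is, one load from memory."""
    return sum(1 for token in postfix if token.isalpha())


def eval_stack(postfix, values):
    """Evaluate in the single order the postfix program dictates."""
    stack = []
    for token in postfix:
        if token.isalpha():
            stack.append(values[token])
        else:
            right, left = stack.pop(), stack.pop()
            if token == "*":
                stack.append(left * right)
            elif token == "+":
                stack.append(left + right)
            else:
                stack.append(left - right)
    return stack.pop()


def eval_registers(values, order=("AD", "BC", "AB")):
    """Load each variable once, then multiply in any chosen order."""
    regs = {name: values[name] for name in "ABCD"}   # four loads in total
    products = {pair: regs[pair[0]] * regs[pair[1]] for pair in order}
    return products["AB"] + products["BC"] - products["AD"], len(regs)


values = {"A": 2, "B": 3, "C": 4, "D": 5}
print(eval_stack(STACK_CODE, values), stack_loads(STACK_CODE))
print(eval_registers(values))
```

Under these assumptions the stack version loads A and B twice each (six loads), while the register version loads each variable once (four loads) and gives the same result whatever order the multiplications are done in.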
• The first concerns whether an ALU instruction has two or
three operands.
– In three-operand format, the instruction contains one result and
two source operands.
– In two-operand format, one of the operands is both a source and
a result for the operation.
• The second concerns how many of the operands may be
memory addresses in ALU instructions.
• The number of memory operands supported by an ALU
instruction may vary from none to three.
• Among all possible combinations, three serve to classify
existing computers.
– load-store (register-register), register-memory, and memory-memory.
