PS4-Solution
PS4-Solution
a) (2 pts) List all the true data dependences in the code above within one loop iteration. Record
the register, source instruction, and destination instruction.
b) (4 pts) Show the timing of the above instruction sequence for the 5-stage MIPS pipeline
without any forwarding hardware. Use a pipeline timing chart to show all stall cycles. Assume
that the branch is handled by predicting it as NOT taken. If the branch outcome is TAKEN, it
kills the next two instructions in the pipeline. How many cycles does this loop take to execute?
What is the average CPI?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
LD IF ID EX M WB
DADDI IF s s s ID EX M WB
SD IF s s s ID EX M
DADDI IF ID EX M WB
DADDI IF ID EX M WB
BNE IF s s s ID EX
next1
next2
LD IF ID EX M WB
Average of 17 cycles per iteration Start of next iteration
1 2 3 4 5 6 7 8 9 10 11 12 13
LD IF ID EX M WB
DADDI IF s ID EX M WB
SD IF ID EX M
DADDI IF ID EX M WB
BNE IF ID EX
DADDI IF ID EX M WB
next
LD IF ID EX M WB
8 cycles per iteration Next iterate
d) (4 pts) Cache memory stages sometimes take longer to access than other pipeline stages.
Consider a 7-stage pipeline: IF1, IF2, ID, EX, MEM1, MEM2, WB, where instruction fetch is split
into two stages: IF1 and IF2, and the data memory is also split into two stages: MEM1 and
MEM2. Show the timing of the above instruction sequence for the 7-stage pipeline will full
forwarding hardware. Assume that the branch is handled by predicting it as always TAKEN
with zero delay in the IF1 stage. How many cycles does this loop take to execute? What is the
average CPI?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
LD IF1 IF2 ID EX M1 M2 WB
DADDI IF1 IF2 s s ID EX M1 M2 WB
SD IF1 s s IF2 ID EX M1 M2
DADDI IF1 IF2 ID EX M1 M2 WB
DADDI IF1 IF2 ID EX M1 M2 WB
BNE IF1 IF2 ID EX
LD IF1 IF2 ID EX M1 M2 WB
8 cycles per iteration Next Iterate
a) (2 pts) We are examining a 5-stage processor pipeline where the unconditional jump and call
instructions are resolved at the end of the second stage, and the conditional branches are
resolved at the end of the third stage. Ignoring other pipeline stalls, how much faster would
the processor pipeline be without any control hazards?
b) (3 pts) Now assume a 10-stage deep pipeline, where unconditional jumps and calls are
resolved at the end of the fourth stage and conditional branches are resolved at the end of
the seventh stage. Ignoring other pipeline stalls, how much faster would the processor
pipeline be without any control hazards?
3) (7 pts) In this problem, we will explore how a deep processor pipeline affects performance in
two ways: faster clock cycle and increased stalls due to data and control hazards. Assume that
the original processor is a 5-stage pipeline with a 1 ns clock cycle. The second processor is a
12-stage pipeline with a 0.5 ns clock cycle. The 5-stage pipeline experiences one stall cycle due
to a data hazard every 5 instructions, whereas the 12-stage pipeline experiences 3 stall cycles
every 8 instructions. In addition, branches constitute 20% of the instruction count, and the
misprediction rate for both pipelines is 5%.
a) (3 pts) What is the speedup of the 12-stage pipeline over the 5-stage pipeline, taking into
account only data hazards?
Average CPI (5-stage pipeline) = 1 + 1/5 = 6/5 (Data hazard stalls only)
Average CPI (12-stage pipeline) = 1 + 3/8 = 11/8 (Data hazard stalls only)
Speedup = (6/5 × 1 ns) / (11/8 × 0.5 ns) = 1.745 (Data hazards only)
Average CPI (5-stage pipeline) = 1 + 1/5 + 0.2 × 0.05 × 2 = 1.22 (Data + Branch Hazards)
Average CPI (12-stage pipeline) = 1 + 3/8 + 0.2 × 0.05 × 6 = 1.435 (Data + Branch Hazards)
Speedup = (1.22 × 1 ns) / (1.435 × 0.5 ns) = 1.70 (Data + Branch Hazards)
4) (13 pts) We will now add support for register-memory ALU operations to the classic five-stage
MIPS pipeline. To simplify the problem, all memory addressing will be restricted to register
indirect. All addresses are simply a value held in a register. No displacement may be added to
the register value. For example, ADD R4, R5, (R8) means R4 = R5 + Memory(R8). Only one
memory operand can be read, but not written. To write memory, the store instruction should
be used instead. Register-register ALU operations are unchanged. For example, the instruction
ADD R4, R5, R8 means R4 = R5 + R8.
a) (2 pts) List a rearranged order of the five traditional stages of the MIPS pipeline that will
support register-memory operations implemented exclusively by register indirect addressing.
The memory stage should come before the execute stage to allow a memory operand to be
read from memory before execution.
b) (5 pts) Describe what forwarding paths are needed for the rearranged pipeline by stating the
source stage, destination stage, and information transferred on each needed new path. Give
an instruction sequence showing each data hazard that can be resolved by forwarding data
between stages. Draw a timing diagram showing the forwarding between stages.
1 2 3 4 5 6 7 8 9 10
LD R7, (R6) IF ID MEM EX WB
LD R8, (R7) IF ID MEM EX WB
c) (3 pts) For the reordered stages of the pipeline, what data hazards cannot be forwarded and
cause stall cycles? Give an instruction sequence showing each data hazard that causes stall
cycles. Draw a timing diagram showing the stall cycles caused by each data hazard.
Because the EX stage is rearranged after the MEM stage, some RAW data hazards cause stall
cycles in the new pipeline.
DADD R4, R5, (R6) ; R4 = R5 + Memory(R6)
SD R4, (R9) ; Memory(R9) = R4
Stall 1 cycle until the value of R4 is computed in the EX stage. The MEM stage is waiting for
data to be computed in the EX stage.
1 2 3 4 5 6 7 8 9 10
DADD R4, R5, (R6) IF ID MEM EX WB
SD R4, (R9) IF ID stall MEM
Because the EX stage is now the fourth stage in the pipeline, the penalty of the branch
instruction has increased from 2 cycles to 3 cycles.
e) (2 pts) List all of the ways that the new pipeline with register-memory ALU operations can
have a different instruction count for a given program than the original pipeline (that supports
register-register ALU operations only). Give specific instruction sequences, one for the original
pipeline and one for the rearranged pipeline, to illustrate each way.
Because register-memory operations are supported, the number of load instructions can be
reduced. For example, to translate A = B + C requires 4 instructions in the original MIPS
architecture, while 3 instructions only if register-memory operations are supported.
Because Register Indirect addressing can be used only and there is no displacement
addressing, additional ALU instructions are required to calculate memory addresses. For
example, the following instruction sequence:
LD R10, 8(R4)
LD R11, 16(R4)
LD R12, 24(R4)