F10 E1 Solution
F10 E1 Solution
F10 E1 Solution
CPIs of P1
CPIs of P2
Frequency
Class A
1
2
10%
Class B
2
2
10%
Class C
3
2
50%
Class D
4
2
30%
Given a program with 106 instructions divided into the four classes according to the frequencies in the
above table, (A) which implementation is faster? (B) And how much?
CPU Time
CPU Time P1
CPU Time P2
1 of 5
Q2. Assume that you have a program that has two execution phases. One phase runs for 8 seconds and the
second phase runs for 2 seconds. You can improve the performance of only one phase by a factor of 4.
(A) What best overall speedup you can get through this improvement? (B) And what is the best overall
speedup you can get through improving one phase only?
A) Best result is achieved when improving the 8-seconds phase.
f = 8 / (8+2) = 0.8
Overall speedup = 1 / (1-f + f/s)
= 1 / (0.2 + 0.8/4) = 1/.4 = 2.5
B) Best speedup is limited by the unenhanced phase.
Overall speedup = 1 / (1-f)
= 1 / (0.2) = 1/.4 = 5
2 of 5
Q3. Assume that you have a processor with the five-stage pipeline described in the class. Assume also that,
of all instructions executed in this processor, the following fraction of these instructions has a particular
type of RAW data dependence. The type of RAW data dependence is identified by the stage that
produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that
follows the one that produces the result, 2nd instruction that follows, or both). We assume that the
register write is done in the first half of the clock cycle and that register reads are done in the second half
of the cycle. Also, assume that the CPI of the processor is 1.0 if there are no data hazards.
EX to 1st only
15%
EX to 2nd only
10%
MEM to 1st
20%
If we use no forwarding, what fraction of cycles are we stalling due to data hazards? You must show
your work clearly including four pipeline diagrams.
EX to 1st only:
I0
F D E M W
I1
F D D D E M W (2 stall cycles)
Fraction of stalling cycles = 2 * 0.15 = 0.30
EX to 1st and 2nd:
I0
F D E M W
I1
F D D D E M W (2 stall cycles)
I2
F F F D E M W
Fraction of stalling cycles = 2 * 0.05 = 0.10
EX to 2nd only:
I0
F D E M W
I1
F D E M W
I2
F D D E M W (1 stall cycle)
Fraction of stalling cycles = 1 * 0.10 = 0.10
MEM to 1st:
I0
F D E M W
I1
F D D D E M W (2 stall cycles)
Fraction of stalling cycles = 2 * 0.20 = 0.40
Total Fraction of stalling cycles = 0.30 + 0.10 + 0.10 + 0.40 = 0.90
3 of 5
Q4. Unroll the following loop twice and schedule the unrolled code to achieve shortest execution time on
the five-stage pipeline described in the class. Assume that branches are resolved in the decode stage.
Then using pipeline diagrams, find how many cycles are needed to execute one iteration of the unrolled
loop.
Loop:
sw
r1, 0(r2)
addui r2, r2, -4
bne
r2, r3, Loop
6 7 8
W
M W
E M W
8 cycles
4 of 5
Q5. Show the best schedule for the following operations for the five-stage static dual-issue MIPS processor
described in the class. Remember that each issue packet can have one ALU/branch instruction and one
load/store instruction. Assume that the processor has full forwarding paths, branch instructions are
executed in one cycle, and memory instructions are executed in two cycles.
Loop:
lw
lw
add
sw
lw
lw
add
sw
addi
bne
r1,
r3,
r1,
r1,
r4,
r5,
r4,
r4,
r2,
r2,
0(r2)
1000(r2)
r1, r3
2000(r2)
-4(r2)
996(r2)
r4, r5
1996(r2)
r2, -8
zero, Loop
Load/store
nop
lw
r1, 0(r2)
nop
lw
r3, 1000(r2)
nop
lw
r4, -4(r2)
add
r1, r1, r3
lw
r5, 996(r2)
addi
r2, r2, -8
sw
r1, 2000(r2)
add
r4, r4, r5
nop
bne
sw
r4, 2004(r2)
<Good Luck>
5 of 5
Cycle
1
2
3
4
5
6
7
8
9
10