F10 E1 Solution

22541 Computer Architecture (Fall 2010)
First Exam Solution

1
:
:
:
===========================================================================
Instructions: Time 60 min. Closed books & notes. No calculators or mobile phones. No questions are
allowed. Show your work clearly. Every problem is for 5 marks.
===========================================================================
Q1. Consider two different implementations of the same instruction set architecture, P1 and P2. Processor
P1 runs on a clock rate of 1.5 GHz and P2 runs on 2.0 GHz. There are four classes of instructions, A, B,
C, and D. The CPIs of each implementation are given in the following table.
CPIs of P1
CPIs of P2
Frequency
Class A
1
2
10%
Class B
2
2
10%
Class C
3
2
50%
Class D
4
2
30%
Given a program with 106 instructions divided into the four classes according to the frequencies in the
above table, (A) which implementation is faster? (B) And how much?
CPU Time
CPU Time P1
= 106 * (1*0.1 + 2*0.1 + 3*0.5 + 4*0.3) / 1.5*109

= 2.0 * 10-3 sec
CPU Time P2
= 106 * (2*0.1 + 2*0.1 + 2*0.5 + 2*0.3) / 2.0*109

= 1.0 * 10-3 sec
Processor P2 is (2.0 * 10-3 / 1.0 * 10-3 = 2) faster than Processor P1.
1 of 5
Q2. Assume that you have a program that has two execution phases. One phase runs for 8 seconds and the
second phase runs for 2 seconds. You can improve the performance of only one phase by a factor of 4.
(A) What best overall speedup you can get through this improvement? (B) And what is the best overall
speedup you can get through improving one phase only?
A) Best result is achieved when improving the 8-seconds phase.
f = 8 / (8+2) = 0.8
Overall speedup = 1 / (1-f + f/s)
= 1 / (0.2 + 0.8/4) = 1/.4 = 2.5
B) Best speedup is limited by the unenhanced phase.
Overall speedup = 1 / (1-f)
= 1 / (0.2) = 1/.4 = 5
2 of 5
Q3. Assume that you have a processor with the five-stage pipeline described in the class. Assume also that,
of all instructions executed in this processor, the following fraction of these instructions has a particular
type of RAW data dependence. The type of RAW data dependence is identified by the stage that
produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that
follows the one that produces the result, 2nd instruction that follows, or both). We assume that the
register write is done in the first half of the clock cycle and that register reads are done in the second half
of the cycle. Also, assume that the CPI of the processor is 1.0 if there are no data hazards.
EX to 1st only
15%
EX to 1st and 2nd

5%
EX to 2nd only
10%
MEM to 1st
20%
If we use no forwarding, what fraction of cycles are we stalling due to data hazards? You must show
your work clearly including four pipeline diagrams.
EX to 1st only:
I0
F D E M W
I1
F D D D E M W (2 stall cycles)
Fraction of stalling cycles = 2 * 0.15 = 0.30
EX to 1st and 2nd:
I0
F D E M W
I1
I2
F F F D E M W
EX to 2nd only:
I0
F D E M W
I1
F D E M W
I2
F D D E M W (1 stall cycle)
MEM to 1st:
I0
F D E M W
I1
Total Fraction of stalling cycles = 0.30 + 0.10 + 0.10 + 0.40 = 0.90
3 of 5
Q4. Unroll the following loop twice and schedule the unrolled code to achieve shortest execution time on
the five-stage pipeline described in the class. Assume that branches are resolved in the decode stage.
Then using pipeline diagrams, find how many cycles are needed to execute one iteration of the unrolled
loop.
Loop:
sw
r1, 0(r2)
addui r2, r2, -4
bne
r2, r3, Loop
Duplicate the loop body and remove loop overhead.

Loop:
sw
r1, 0(r2)
addui r2, r2, -4
bne
r2, r3, Loop
sw
r1, -4(r2)
addui r2, r2, -8
bne
r2, r3, Loop
Schedule the loop (assume that there is forwarding from MEM to ID).
1 2 3 4 5 6 7 8
Loop:
sw
r1, 0(r2) F D E M W
addui r2, r2, -8
F D E M W
sw
r1, 4(r2)
F D E M W
(r2 is forwarded from MEM to EX)
bne
r2, r3, Loop
F D E M W
8 cycles
In case there is no forwarding from MEM to ID:
1 2 3 4 5
Loop:
addui r2, r2, -8 F D E M W
sw
r1, 8(r2)
F D E M
sw
r1, 4(r2)
F D E
bne
r2, r3, Loop
F D
6 7 8
W
M W
E M W
8 cycles
4 of 5
(r2 is forwarded from MEM to EX)

(r2 is forwarded from WB to EX)
Q5. Show the best schedule for the following operations for the five-stage static dual-issue MIPS processor
described in the class. Remember that each issue packet can have one ALU/branch instruction and one
load/store instruction. Assume that the processor has full forwarding paths, branch instructions are
executed in one cycle, and memory instructions are executed in two cycles.
Loop:
lw
lw
add
sw
lw
lw
add
sw
addi
bne
r1,
r3,
r1,
r1,
r4,
r5,
r4,
r4,
r2,
r2,
0(r2)
1000(r2)
r1, r3
2000(r2)
-4(r2)
996(r2)
r4, r5
1996(r2)
r2, -8
zero, Loop
Assuming that the branch instruction is resolved in the decode stage:

ALU/branch
Loop:
Load/store
nop
lw
r1, 0(r2)
nop
lw
r3, 1000(r2)
nop
lw
r4, -4(r2)
add
r1, r1, r3
lw
r5, 996(r2)
addi
r2, r2, -8
sw
r1, 2000(r2)
add
r4, r4, r5
nop
bne
r2, zero, Loop
sw
r4, 2004(r2)
<Good Luck>
5 of 5
Cycle
1
2
3
4
5
6
7
8
9
10

F10 E1 Solution

Uploaded by

Copyright:

Available Formats

F10 E1 Solution

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

F10 E1 Solution

Uploaded by

Copyright:

Available Formats

22541 Computer Architecture (Fall 2010)

First Exam Solution

= 106 * (10.1 + 20.1 + 30.5 + 40.3) / 1.5*109

= 106 * (20.1 + 20.1 + 20.5 + 20.3) / 2.0*109

Processor P2 is (2.0 * 10-3 / 1.0 * 10-3 = 2) faster than Processor P1.

EX to 1st and 2nd

Duplicate the loop body and remove loop overhead.

(r2 is forwarded from MEM to EX)

Assuming that the branch instruction is resolved in the decode stage:

r2, zero, Loop

You might also like