F10 E1 Solution


22541 Computer Architecture (Fall 2010)

First Exam Solution


===========================================================================
Instructions: Time 60 min. Closed books & notes. No calculators or mobile phones. No questions are
allowed. Show your work clearly. Every problem is for 5 marks.
===========================================================================
Q1. Consider two different implementations of the same instruction set architecture, P1 and P2. Processor
P1 runs at a clock rate of 1.5 GHz and P2 runs at 2.0 GHz. There are four classes of instructions, A, B,
C, and D. The CPIs of each implementation are given in the following table.

            CPIs of P1    CPIs of P2    Frequency
Class A         1             2            10%
Class B         2             2            10%
Class C         3             2            50%
Class D         4             2            30%

Given a program with 10^6 instructions divided into the four classes according to the frequencies in the
above table, (A) which implementation is faster? (B) By how much?
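CPU Time = Instruction count * Average CPI / Clock rate

CPU Time P1 = 10^6 * (1*0.1 + 2*0.1 + 3*0.5 + 4*0.3) / (1.5 * 10^9)
            = 10^6 * 3.0 / (1.5 * 10^9)
            = 2.0 * 10^-3 sec

CPU Time P2 = 10^6 * (2*0.1 + 2*0.1 + 2*0.5 + 2*0.3) / (2.0 * 10^9)
            = 10^6 * 2.0 / (2.0 * 10^9)
            = 1.0 * 10^-3 sec

Processor P2 is faster: (2.0 * 10^-3) / (1.0 * 10^-3) = 2, so P2 is 2 times faster than Processor P1.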
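
The Q1 arithmetic can be re-checked with a short script. This is only a sketch of the calculation above; the names (FREQ, CPI, CLOCK_HZ, average_cpi, cpu_time) are illustrative and are not part of the exam.

```python
# Sketch of the Q1 calculation: CPU time = instruction count * average CPI / clock rate.
# All identifiers are illustrative; the numbers come from the Q1 table.
INSTRUCTION_COUNT = 10**6

FREQ = {"A": 0.10, "B": 0.10, "C": 0.50, "D": 0.30}
CPI = {
    "P1": {"A": 1, "B": 2, "C": 3, "D": 4},
    "P2": {"A": 2, "B": 2, "C": 2, "D": 2},
}
CLOCK_HZ = {"P1": 1.5e9, "P2": 2.0e9}


def average_cpi(processor):
    """Frequency-weighted CPI over the four instruction classes."""
    return sum(CPI[processor][cls] * FREQ[cls] for cls in FREQ)


def cpu_time(processor):
    """Execution time in seconds for the 10^6-instruction program."""
    return INSTRUCTION_COUNT * average_cpi(processor) / CLOCK_HZ[processor]


if __name__ == "__main__":
    for name in ("P1", "P2"):
        print(name, f"CPI={average_cpi(name):.2f}", f"time={cpu_time(name):.1e} s")
    print("P1 time / P2 time =", round(cpu_time("P1") / cpu_time("P2"), 3))
```

Dividing the P1 time by the P2 time gives the factor by which P2 is faster.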


Q2. Assume that you have a program that has two execution phases. One phase runs for 8 seconds and the
second phase runs for 2 seconds. You can improve the performance of only one phase by a factor of 4.
(A) What is the best overall speedup you can get through this improvement? (B) And what is the best overall
speedup you can get by improving one phase only?
A) The best result is achieved by improving the 8-second phase.
f = 8 / (8+2) = 0.8
Overall speedup = 1 / (1-f + f/s)
= 1 / (0.2 + 0.8/4) = 1/0.4 = 2.5
B) The best speedup is limited by the unenhanced phase (the limit as s grows without bound).
Overall speedup = 1 / (1-f)
= 1 / (0.2) = 5
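
The same Amdahl's law arithmetic, written as a small sketch. The function names amdahl_speedup and speedup_limit are illustrative; f is the fraction of time in the improved phase and s is the improvement factor, as in the formulas above.

```python
# Sketch of Amdahl's law as used in Q2. Function names are illustrative.

def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the time is sped up by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)


def speedup_limit(f):
    """Upper bound on the overall speedup as s grows without bound."""
    return 1.0 / (1.0 - f)


if __name__ == "__main__":
    f_long = 8 / (8 + 2)   # improve the 8-second phase
    f_short = 2 / (8 + 2)  # improve the 2-second phase instead
    print("improve 8 s phase by 4x:", amdahl_speedup(f_long, 4))
    print("improve 2 s phase by 4x:", amdahl_speedup(f_short, 4))
    print("limit when improving the 8 s phase:", speedup_limit(f_long))
```

Comparing the two phases shows why part (A) picks the 8-second phase: improving the 2-second phase by the same factor gives a much smaller overall speedup.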


Q3. Assume that you have a processor with the five-stage pipeline described in the class. Assume also that,
of all instructions executed in this processor, the following fraction of these instructions has a particular
type of RAW data dependence. The type of RAW data dependence is identified by the stage that
produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that
follows the one that produces the result, 2nd instruction that follows, or both). We assume that the
register write is done in the first half of the clock cycle and that register reads are done in the second half
of the cycle. Also, assume that the CPI of the processor is 1.0 if there are no data hazards.
EX to 1st only       15%
EX to 1st and 2nd     5%
EX to 2nd only       10%
MEM to 1st           20%

If we use no forwarding, what fraction of cycles are we stalling due to data hazards? You must show
your work clearly including four pipeline diagrams.
EX to 1st only:
I0   F D E M W
I1     F D D D E M W      (2 stall cycles)
Fraction of stalling cycles = 2 * 0.15 = 0.30

EX to 1st and 2nd:
I0   F D E M W
I1     F D D D E M W      (2 stall cycles)
I2       F F F D E M W
Fraction of stalling cycles = 2 * 0.05 = 0.10

EX to 2nd only:
I0   F D E M W
I1     F D E M W
I2       F D D E M W      (1 stall cycle)
Fraction of stalling cycles = 1 * 0.10 = 0.10

MEM to 1st:
I0   F D E M W
I1     F D D D E M W      (2 stall cycles)
Fraction of stalling cycles = 2 * 0.20 = 0.40

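Total stall cycles per instruction = 0.30 + 0.10 + 0.10 + 0.40 = 0.90
With a base CPI of 1.0, the CPI becomes 1.0 + 0.90 = 1.90, so the fraction of all cycles spent stalling is 0.90 / 1.90 = 0.47 (approximately).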
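
The stall accounting can be tabulated with a short sketch. The dictionary layout and the function names are illustrative. The stall counts per hazard type are the ones read off the four pipeline diagrams above (no forwarding, register write in the first half of the cycle and read in the second half). The last function divides by the resulting CPI (base CPI plus stalls per instruction), which is one way to express the stalls as a share of all cycles.

```python
# Sketch of the Q3 stall accounting. Names and layout are illustrative.
# Each entry: hazard type -> (fraction of instructions, stall cycles without forwarding).
HAZARDS = {
    "EX to 1st only": (0.15, 2),
    "EX to 1st and 2nd": (0.05, 2),
    "EX to 2nd only": (0.10, 1),
    "MEM to 1st": (0.20, 2),
}


def stalls_per_instruction(hazards):
    """Average number of stall cycles added per executed instruction."""
    return sum(fraction * stalls for fraction, stalls in hazards.values())


def stall_share_of_cycles(hazards, base_cpi=1.0):
    """Stall cycles divided by total cycles (base CPI plus stalls)."""
    stalls = stalls_per_instruction(hazards)
    return stalls / (base_cpi + stalls)


if __name__ == "__main__":
    print("stall cycles per instruction:", round(stalls_per_instruction(HAZARDS), 2))
    print("share of all cycles:", round(stall_share_of_cycles(HAZARDS), 2))
```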


Q4. Unroll the following loop twice and schedule the unrolled code to achieve shortest execution time on
the five-stage pipeline described in the class. Assume that branches are resolved in the decode stage.
Then using pipeline diagrams, find how many cycles are needed to execute one iteration of the unrolled
loop.
Loop: sw    r1, 0(r2)
      addui r2, r2, -4
      bne   r2, r3, Loop

Duplicate the loop body and remove the loop overhead: the first copy's addui and bne are dropped, the
second sw uses offset -4, and the remaining addui decrements r2 by 8.

Loop: sw    r1, 0(r2)
      sw    r1, -4(r2)
      addui r2, r2, -8
      bne   r2, r3, Loop
Schedule the loop (assume that there is forwarding from MEM to ID).
                            1 2 3 4 5 6 7 8
Loop: sw    r1, 0(r2)       F D E M W
      addui r2, r2, -8        F D E M W
      sw    r1, 4(r2)           F D E M W      (r2 is forwarded from MEM to EX)
      bne   r2, r3, Loop          F D E M W
8 cycles
In case there is no forwarding from MEM to ID:
                            1 2 3 4 5 6 7 8
Loop: addui r2, r2, -8      F D E M W
      sw    r1, 8(r2)         F D E M W        (r2 is forwarded from MEM to EX)
      sw    r1, 4(r2)           F D E M W      (r2 is forwarded from WB to EX)
      bne   r2, r3, Loop          F D E M W
8 cycles
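
Both schedules are stall-free, so the cycle count follows from the pipeline depth: the first instruction needs five cycles and each later instruction finishes one cycle after the previous one. A minimal sketch of that count and of the diagram layout is below. The helper names total_cycles and pipeline_rows are illustrative, and the sketch does not model hazards; any stall cycles have to be supplied by hand.

```python
# Sketch: cycle count and text diagram for a five-stage pipeline.
# Helper names are illustrative; hazards are not detected automatically.
STAGES = ["F", "D", "E", "M", "W"]


def total_cycles(num_instructions, stall_cycles=0):
    """Cycles until the last instruction leaves W, given hand-counted stalls."""
    return num_instructions + len(STAGES) - 1 + stall_cycles


def pipeline_rows(instructions):
    """One text row per instruction, each shifted one cycle (two characters) right."""
    width = max(len(text) for text in instructions) + 2
    return [
        text.ljust(width) + "  " * index + " ".join(STAGES)
        for index, text in enumerate(instructions)
    ]


if __name__ == "__main__":
    unrolled = [
        "addui r2, r2, -8",
        "sw    r1, 8(r2)",
        "sw    r1, 4(r2)",
        "bne   r2, r3, Loop",
    ]
    for row in pipeline_rows(unrolled):
        print(row)
    print(total_cycles(len(unrolled)), "cycles")
```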

Q5. Show the best schedule for the following operations for the five-stage static dual-issue MIPS processor
described in the class. Remember that each issue packet can have one ALU/branch instruction and one
load/store instruction. Assume that the processor has full forwarding paths, branch instructions are
executed in one cycle, and memory instructions are executed in two cycles.

Loop: lw   r1, 0(r2)
      lw   r3, 1000(r2)
      add  r1, r1, r3
      sw   r1, 2000(r2)
      lw   r4, -4(r2)
      lw   r5, 996(r2)
      add  r4, r4, r5
      sw   r4, 1996(r2)
      addi r2, r2, -8
      bne  r2, zero, Loop

Assuming that the branch instruction is resolved in the decode stage:

Cycle          ALU/branch               Load/store
  1     Loop:  nop                      lw   r1, 0(r2)
  2            nop                      lw   r3, 1000(r2)
  3            nop                      lw   r4, -4(r2)
  4            add  r1, r1, r3          lw   r5, 996(r2)
  5            addi r2, r2, -8          sw   r1, 2000(r2)
  6            add  r4, r4, r5          nop
  7            bne  r2, zero, Loop      sw   r4, 2004(r2)

<Good Luck>
