Numerical: Central Processing Unit


Numerical

Central Processing Unit


Problem: ISA
Write down the instruction sequence to evaluate the expression for the
following architecture specifications. S=(B*B) -(A*C). Here A, B, C and S
are memory locations.
(a) Stack machine. (b) Accumulator machine
(c) Load Store machine with four registers R1, R2, R3 and R4
Stack machine:
  Push B
  Push B
  Mul            ; B*B on top of stack
  Push A
  Push C
  Mul            ; A*C on top of stack
  Sub            ; (B*B) - (A*C)
  Pop S

Accumulator machine:
  Load A
  Mul C          ; ACC = A*C
  Store S        ; S temporarily holds A*C
  Load B
  Mul B          ; ACC = B*B
  Sub S          ; ACC = (B*B) - (A*C)
  Store S

Load Store machine:
  Load R1, B
  Load R2, A
  Load R3, C
  Mul R4, R1, R1   ; R4 = B*B
  Mul R2, R2, R3   ; R2 = A*C
  Sub R4, R4, R2   ; R4 = (B*B) - (A*C)
  Store R4, S
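To make the stack-machine sequence concrete, here is a tiny Python interpreter (purely illustrative; the memory values and the convention that Sub computes next-on-stack minus top-of-stack are my own assumptions, not part of the original problem):

# Minimal sketch of a stack machine evaluating S = (B*B) - (A*C).
memory = {"A": 2, "B": 5, "C": 3, "S": 0}          # example values (assumed)
program = [("Push", "B"), ("Push", "B"), ("Mul",), ("Push", "A"),
           ("Push", "C"), ("Mul",), ("Sub",), ("Pop", "S")]

stack = []
for op, *arg in program:
    if op == "Push":
        stack.append(memory[arg[0]])
    elif op == "Pop":
        memory[arg[0]] = stack.pop()
    elif op == "Mul":
        stack.append(stack.pop() * stack.pop())
    elif op == "Sub":
        top = stack.pop()
        stack.append(stack.pop() - top)            # next-on-stack minus top-of-stack

print(memory["S"])                                 # 5*5 - 2*3 = 19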
Problem: Instruction Encoding
Consider the case of a processor with 32 general purpose registers and an
instruction length of 16-bits. The processor has a main memory (RAM)
addressability of 256 Bytes. It has 32 Type-I instructions each with two
register operands, two Type-II instructions each with a register and a
memory operand, and N Type-III instructions each with one register operand.
What is the maximum possible value of N? Show an instruction encoding
format and its interpretation.
16-bit instruction formats:

Type I  : 0 | 5-bit opcode | 5-bit register | 5-bit register
          (1 + 5 + 5 + 5 = 16 bits; the 5-bit opcode field gives the 32 Type-I instructions)

Type II : 1 | 00 or 01 | 5-bit register | 8-bit memory address
          (1 + 2 + 5 + 8 = 16 bits; the two patterns 00 and 01 encode the 2 Type-II instructions)

Type III: 1 | 10 or 11 | 8-bit opcode extension | 5-bit register
          (1 + 2 + 8 + 5 = 16 bits; 2 prefix patterns x 2^8 extensions = 512 combinations)

Hence there can be a maximum of 512 Type III instructions.
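As a quick check on the field widths, here is a minimal Python sketch (the helper name and the exact bit layout are illustrative assumptions) that packs a hypothetical Type-III instruction and counts the available opcodes:

# Sketch of the 16-bit encoding, assuming the layout above:
# Type III = 1 (1 bit) | 10 or 11 (2 bits) | 8-bit opcode extension | 5-bit register.

def encode_type3(opcode: int, reg: int) -> int:
    """Pack a hypothetical Type-III instruction; opcode in 0..511, reg in 0..31."""
    assert 0 <= opcode < 512 and 0 <= reg < 32
    group = 0b10 + (opcode >> 8)                   # selects the '1 10' or '1 11' prefix
    return (1 << 15) | (group << 13) | ((opcode & 0xFF) << 5) | reg

print(2 * 2 ** 8)                                  # 512 possible Type-III opcodes
print(f"{encode_type3(opcode=3, reg=7):016b}")     # 1 10 00000011 00111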


Problem: Amdahl’s Law
A new floating-point unit speeds up floating point operations by two times. In
an application one fifth of the instructions are floating-point operations.
(a) What is the overall speedup? (Ignore the penalty to other instructions).
(b) Assume that the speeding up of the floating-point unit mentioned above
slowed down data cache accesses resulting in a 1.5x slowdown.
Assume that load instructions constitute 15% and store instructions
constitute 9% of the total instructions. What is the effective overall
speedup now?

(a) S = 1 / { (1-f) + (f/N) } = 1 / { (1 - 0.2) + (0.2/2) } = 1.11 times

(b) Loads (15%) and stores (9%) together make up 24% of the instructions and are
    slowed down by 1.5x, i.e., N2 = 1/1.5 ≈ 0.67:
    S = 1 / { (1 - f1 - f2) + (f1/N1) + (f2/N2) }
      = 1 / { (1 - 0.2 - 0.24) + (0.2/2) + (0.24/0.67) }
      ≈ 0.98 times (a net slowdown)
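For reference, a small Python helper (the function name and interface are my own) evaluating this multi-fraction form of Amdahl's law reproduces both numbers; a slowdown is passed as a speedup smaller than 1:

# Amdahl's law with several independently affected fractions of execution time.
# A speedup below 1 models a slowdown (e.g., 1/1.5 for the slower cache accesses).

def amdahl(parts):
    """parts: list of (fraction, speedup) pairs; the remaining time is unchanged."""
    unchanged = 1.0 - sum(f for f, _ in parts)
    return 1.0 / (unchanged + sum(f / n for f, n in parts))

print(round(amdahl([(0.2, 2.0)]), 2))                     # (a) ~1.11
print(round(amdahl([(0.2, 2.0), (0.24, 1 / 1.5)]), 2))    # (b) ~0.98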
Problem: Amdahl’s Law
Suppose a processor spends 75% of its time in memory operations (both I-
cache and D-cache access together) when running a given program A and
its manufacturers make a change that improves its performance on memory
access by a factor of 5. The program A originally took 100s to execute.
What is the speed up from the old system to the new system?
What fraction of its execution time does the new system spend in memory
operations?

(a) S = 1/ {(1-f)+(f/N) } = 1 / {(1- 0.75)+(0.75/5)} = 2.5 times


S = ExT_old / ExT_new, so 2.5 = 100 / ExT_new and ExT_new = 40 s.

(b) Out of the 100 s on the old system, 75% is memory time, i.e., 75 s.
This 75 s is sped up by a factor of 5: 75/5 = 15 s.
Hence the fraction of memory time on the new system is 15/40 = 37.5%.
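The same two steps as a quick Python check (a sketch; variable names are my own):

# Old system: 100 s total, of which 75 s are memory operations; memory is sped up 5x.
new_time = 25.0 + 75.0 / 5             # non-memory part unchanged + memory part / 5
print(100.0 / new_time)                # speedup = 2.5
print((75.0 / 5) / new_time)           # memory fraction on the new system = 0.375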
Problem: Basic Performance Analysis
Consider two programs A and B that solve a given problem. A is scheduled
to run on a processor P1 operating at 1 GHz and B is scheduled to run on
processor P2 running at 1.4 GHz. A has total 10000 instructions, out of
which 20% are branch instructions, 40% load store instructions and rest are
ALU instructions. B is composed of 25% branch instructions. The number of
load store instructions in B is twice the count of ALU instructions. Total
instruction count of B is 12000. In both P1 and P2 branch instructions have
an average CPI of 5 and ALU instructions have an average CPI of 1.5. Both
the architectures differ in the CPI of load-store instruction. They are 2 and 3
for P1 and P2, respectively. Which mapping (A on P1 or B on P2) solves the
problem faster, and by how much?

A on P1 (1 GHz, CCT = 1 ns)          B on P2 (1.4 GHz, CCT ≈ 0.714 ns)


IC=10000 IC=12000
Fraction BR: L/S: ALU = 20: 40: 40 Fraction BR: L/S: ALU = 25: 50: 25
CPI of BR: L/S: ALU = 5: 2: 1.5 CPI of BR: L/S: ALU = 5: 3 : 1.5
Problem: Basic Performance Analysis

A on P1 (1 GHz, CCT = 1 ns)          B on P2 (1.4 GHz, CCT ≈ 0.714 ns)


IC=10000 IC=12000
Fraction BR: L/S: ALU = 20: 40: 40 Fraction BR: L/S: ALU = 25: 50: 25
CPI of BR: L/S: ALU = 5: 2: 1.5 CPI of BR: L/S: ALU = 5: 3 : 1.5

(a) CPI_A on P1 = (0.2 x 5 + 0.4 x 2 + 0.4 x 1.5) = 2.4
    ExT = CPI x IC x CCT = 2.4 x 10000 x 1 ns = 24000 ns

(b) CPI_B on P2 = (0.25 x 5 + 0.5 x 3 + 0.25 x 1.5) = 3.125
    ExT = 3.125 x 12000 x 0.714 ns = 26775 ns

Hence A on P1 is faster, by 26775/24000 ≈ 1.12 times.
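A compact Python version of the same weighted-CPI calculation (the function and dictionary keys are illustrative, not from the original):

# Execution time = IC x weighted CPI x clock cycle time.

def exec_time_ns(ic, mix, cpi, cct_ns):
    """mix and cpi are dicts keyed by instruction class; fractions in mix sum to 1."""
    weighted_cpi = sum(mix[k] * cpi[k] for k in mix)
    return ic * weighted_cpi * cct_ns

a_on_p1 = exec_time_ns(10000, {"br": 0.2, "ls": 0.4, "alu": 0.4},
                       {"br": 5, "ls": 2, "alu": 1.5}, 1.0)
b_on_p2 = exec_time_ns(12000, {"br": 0.25, "ls": 0.5, "alu": 0.25},
                       {"br": 5, "ls": 3, "alu": 1.5}, 1 / 1.4)
print(a_on_p1, b_on_p2)                # ~24000 ns vs ~26786 ns, so A on P1 is faster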
Problem: Amdahl’s Law
A company is releasing 2 latest versions (beta and gamma) of its basic
processor architecture named alpha. Beta and gamma are designed by
making modifications on three major components (X, Y and Z) of the alpha.
It was observed that for a program A the fractions of the total execution time
on these three components, X, Y, and Z are 40%, 30%, and 20%,
respectively. Beta speeds up X and Z by 2 times but slows down Y by 1.3
times, where as gamma speeds up X, Y and Z by 1.2, 1.3 and 1.4 times,
respectively.
(a) How much faster is gamma over alpha for running A?
(b) Whether beta or gamma is faster for running A? Find the speedup factor
Beta: S = 1 / { (1 - fx - fy - fz) + (fx/Nx) + (fy/Ny) + (fz/Nz) }
        = 1 / { 0.1 + (0.4/2) + (0.3/0.77) + (0.2/2) } = 1.266 times   [Ny = 1/1.3 ≈ 0.77]
(a) Gamma: S = 1 / { (1 - fx - fy - fz) + (fx/Nx) + (fy/Ny) + (fz/Nz) }
             = 1 / { 0.1 + (0.4/1.2) + (0.3/1.3) + (0.2/1.4) } = 1.239 times
(b) Beta is faster than gamma, by 1.266/1.239 = 1.021 times.
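A quick numeric check in Python (dividing by the slowdown factor 1/1.3 is written as multiplying by 1.3):

# Beta: X and Z sped up 2x, Y slowed down 1.3x; Gamma: X, Y, Z sped up 1.2x, 1.3x, 1.4x.
beta = 1 / (0.1 + 0.4 / 2 + 0.3 * 1.3 + 0.2 / 2)
gamma = 1 / (0.1 + 0.4 / 1.2 + 0.3 / 1.3 + 0.2 / 1.4)
print(round(beta, 3), round(gamma, 3), round(beta / gamma, 3))   # ~1.266  ~1.239  ~1.021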
Problem: Pipeline Hazards
Given a non-pipelined architecture running at 1.5 GHz, that takes 5 cycles to
finish an instruction. You want to make it pipelined with 5 stages. Due to
hardware overhead the pipelined design will operate only at 1 GHz. 5% of
memory instructions cause a stall of 50 cycles, 30% of branch instructions cause
a stall of 2 cycles and load-ALU combinations cause a stall of 1 cycle. Assume
that in a given program, there exist 20% of branch instructions and 30% of
memory instructions. 10% of instructions are load-ALU combinations. What is
the speedup of pipelined design over the non-pipelined design?

(a) Non-pipelined: CPI = 5, clock 1.5 GHz, so CCT ≈ 0.67 ns; pipelined: 1 GHz, so CCT = 1 ns.
    ExT_unpipelined = CPI x CCT = 5 x 0.67 ns = 3.33 ns / instruction
(b) Effective CPI_pipelined = base CPI + stall CPI
    {stalls = memory stalls + branch stalls + load-ALU stalls}
    = 1 + (0.3 x 0.05 x 50) + (0.2 x 0.3 x 2) + (0.1 x 1) = 1 + 0.75 + 0.12 + 0.1 = 1.97
    ExT_pipelined = CPI x CCT = 1.97 x 1 ns = 1.97 ns / instruction
    Speedup = ExT_unpipelined / ExT_pipelined = 3.33 / 1.97 = 1.69
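The same computation as a short Python sketch (variable names are illustrative):

# Non-pipelined: CPI 5 at 1.5 GHz; pipelined: base CPI 1 at 1 GHz plus stall cycles.
t_unpipelined = 5 * (1 / 1.5)                             # ns per instruction
stall_cpi = 0.3 * 0.05 * 50 + 0.2 * 0.3 * 2 + 0.1 * 1     # memory + branch + load-ALU
t_pipelined = (1 + stall_cpi) * 1.0                       # ns per instruction
print(round(t_unpipelined / t_pipelined, 2))              # speedup ~1.69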
Problem: Scheduling and Delay Slots
Consider a 5-stage MIPS pipelined processor with perfect I and D caches.
Default latency of both caches is 1 cycle. The instruction mix percentage for load,
branch, store and ALU operations are 30%, 20%, 8% and 42 % respectively. Out of
the branch instructions two third are conditional branch instructions and rest
unconditional branch instructions. It was observed that 50% of conditional branch
instructions and all unconditional branch instructions are schedulable to fill 1 cycle
delay slots. Half of conditional branch instructions are always taken and rest always
not taken. Generally load and branch instructions are offering a 2 and 1 cycle stall,
respectively. 60% of load instructions can be reordered to avoid these delay slots.
25% of load instructions cannot be reordered. Rest of the load instructions can be
reordered to fill 1 delay slot. Under these circumstances, what is the resultant CPI
of the machine?
(a) Without reordering or scheduling of any instructions.
(b) With reordering and scheduling of instructions to fill the delay slots.
Without reordering or scheduling of any instructions.
Effective CPI = Base CPI + stall CPI
Penalty of 1 cycle stall for conditional taken + unconditional taken
Branch Stalls = 0.2x{0.67x0.5x1+0.33x1} = 0.133
Load Stalls = 0.3x2 = 0.6 // 2 cycles for all loads
CPI = 1 + 0.133 + 0.6 = 1.733
Problem: Scheduling and Delay Slots
With reordering and scheduling the instructions for filling delay slots.

If a branch instruction is schedulable then no stall.


Branch (20% of the instruction mix):
  2/3 conditional: 50% non-schedulable (1-cycle stall), 50% schedulable (no stall)
  1/3 unconditional: 100% schedulable (no stall)
  Stall CPI = 0.2 x 0.667 x 0.5 x 1 = 0.0667

Load (30% of the instruction mix):
  60% can be fully reordered (no stall)
  25% cannot be reordered (2-cycle stall)
  15% can be partially reordered (1-cycle stall)
  Stall CPI = 0.3 x (0.25 x 2 + 0.15 x 1) = 0.195

Effective CPI = Base CPI + stall CPI


= 1 + 0.0667 + 0.195 = 1.2617
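A short Python check of both sub-cases (variable names are my own):

# (a) No scheduling: taken branches pay 1 stall cycle, every load pays 2 stall cycles.
branch_stalls_a = 0.2 * (2 / 3 * 0.5 * 1 + 1 / 3 * 1)
load_stalls_a = 0.3 * 2
print(round(1 + branch_stalls_a + load_stalls_a, 3))      # ~1.733

# (b) With scheduling: only non-schedulable branches and loads still stall.
branch_stalls_b = 0.2 * (2 / 3) * 0.5 * 1
load_stalls_b = 0.3 * (0.25 * 2 + 0.15 * 1)
print(round(1 + branch_stalls_b + load_stalls_b, 3))      # ~1.262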
Problem: Pipeline Hazards
A program has 2000 instructions in the sequence L.D, ADD.D, L.D, ADD.D,..... L.D,
ADD.D. The ADD.D instruction depends on the L.D instruction right before it. The L.D
instruction depends on the ADD.D instruction right before it. If the program is
executed on the 5-stage pipeline what would be the actual CPI with and without
operand forwarding technique?
Without operand forwarding.
The ID of the nth instruction can start only after the WB of the (n-1)th instruction,
so each dependent instruction stalls for 3 cycles in ID and successive instructions
reach WB 4 cycles apart.

Cycle:   1   2   3   4   5   6   7   8   9   10  11  12  13  14
L.D      IF  ID  EX  ME  WB
ADD.D        IF  *   *   *   ID  EX  ME  WB
L.D              IF  *   *   *   *   *   *   ID  EX  ME  WB
ADD.D                IF  *   *   *   *   *   *   *   *   *   ID

Instructions reach WB at clock cycles 5, 9, 13, 17, 21, 25, 29, ...
The last instruction (the 1000th ADD.D) reaches WB at cycle 5 + (1999 x 4) = 8001.
CPI = 8001/2000 ≈ 4.0
Problem: Pipeline Hazards
A program has 2000 instructions in the sequence L.D, ADD.D, L.D, ADD.D,..... L.D,
ADD.D. The ADD.D instruction depends on the L.D instruction right before it. The L.D
instruction depends on the ADD.D instruction right before it. If the program is
executed on the 5-stage pipeline what would be the actual CPI with and without
operand forwarding technique?
With operand forwarding.
Every ADD.D that follows an L.D incurs one stall (load-use hazard), but an L.D that follows an ADD.D does not stall, since the ALU result can be forwarded.

Cycle:   1   2   3   4   5   6   7   8   9   10
L.D      IF  ID  EX  ME  WB
ADD.D        IF  *   ID  EX  ME  WB
L.D                  IF  ID  EX  ME  WB
ADD.D                    IF  *   ID  EX  ME  WB
Instructions reach WB at clock cycles 5,7, 8,10, 11,13, 14,16


Last instruction (ADD) reaches WB in 7 + (999x3) = 3004 cycles.
CPI= 3004/2000=1.502
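A quick Python check of both cases, under the assumptions used above (strict ID-after-WB without forwarding; a single load-use stall with forwarding):

n = 2000
# Without forwarding: first WB in cycle 5, later instructions 4 cycles apart.
cycles_no_fwd = 5 + (n - 1) * 4
# With forwarding: WBs at 5, 7, 8, 10, ...; the last ADD.D writes back at 7 + (n/2 - 1)*3.
cycles_fwd = 7 + (n // 2 - 1) * 3
print(cycles_no_fwd / n, cycles_fwd / n)                  # ~4.0 and ~1.5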
Problem: Pipeline Hazards
Assume that the original machine is an 8-stage pipeline with a 1ns clock cycle. The
second machine is a 12-stage pipeline with 0.8ns clock cycle. The 8-stage pipeline
experiences a stall due to a data hazard for every 5 instructions, whereas the 12-
stage pipeline experiences 3 stalls for every 10 instructions. In addition, branches
constitute 30% of the instructions, and the mis-prediction rate for both machines is
10%. If the branch mis-prediction penalty for the first machine is 2 cycles but the
second machine is 5 cycles, What is the speedup of the 12-stage pipeline over the 8-
stage pipeline, taking into account (a) only data hazards (b) both data hazard and
mis-prediction?
8-stage pipeline (1ns cc). 6 cc / 5 instructions + 2 cc for branch
12-stage pipeline (0.8ns cc). 13 cc / 10 instructions+ 5 cc for branch.
30 % branches with 10% mis prediction.

Considering only data hazards


8 stage : CPI = 6/5= 1.2
ET_8 stage = CPI x CCT = 1.2 x1 ns = 1.2 ns/ instruction
12 stage : CPI = 13/10= 1.3
ET_12 stage = CPI x CCT = 1.3 x0.8 ns = 1.04 ns/ instruction
Speedup= 1.2/1.04 = 1.153 times
Problem: Pipeline Hazards
Assume that the original machine is an 8-stage pipeline with a 1ns clock cycle. The
second machine is a 12-stage pipeline with 0.8ns clock cycle. The 8-stage pipeline
experiences a stall due to a data hazard for every 5 instructions, whereas the 12-
stage pipeline experiences 3 stalls for every 10 instructions. In addition, branches
constitute 30% of the instructions, and the mis-prediction rate for both machines is
10%. If the branch mis-prediction penalty for the first machine is 2 cycles but the
second machine is 5 cycles, What is the speedup of the 12-stage pipeline over the 8-
stage pipeline, taking into account (a) only data hazards (b) both data hazard and
mis-prediction?
8-stage pipeline (1ns cc). 6 cc / 5 instructions + 2 cc for branch
12-stage pipeline (0.8ns cc). 13 cc / 10 instructions+ 5 cc for branch.
30 % branches with 10% mis prediction.

Considering data hazards + branch mis-prediction


CPI_8 stage : CPI with data hzd + stall branch = 1.2 + 0.3x0.1x2 = 1.26
ET_8 stage = CPI x CCT = 1.26 x1 ns = 1.26 ns/ instruction
CPI_12 stage : CPI with data hzd + stall branch = 1.3 +0.3x0.1x5 = 1.45
ET_12 stage = CPI x CCT = 1.45 x0.8 ns = 1.16 ns/ instruction
Speedup= 1.26/1.16 = 1.086 times
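Both parts in a few lines of Python (a sketch, not from the original slides):

# Data-hazard-only CPIs, then add the branch penalty: fraction x mis-prediction rate x cycles.
cpi8, cpi12 = 6 / 5, 13 / 10
print(round((cpi8 * 1.0) / (cpi12 * 0.8), 3))             # (a) speedup with data hazards only
cpi8 += 0.3 * 0.1 * 2
cpi12 += 0.3 * 0.1 * 5
print(round((cpi8 * 1.0) / (cpi12 * 0.8), 3))             # (b) speedup with mis-prediction included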
Problem: Branch Prediction
Consider the last 16 actual outcomes of a single static branch. T means
branch is taken and N means not taken.
{oldest T T N N T N T T T N T N T T N T  latest}
A two-level branch predictor of (1,2) type is used. Since there is only one
branch in the program, indexing into the BHT with the PC is irrelevant; hence only the
last branch outcome is used to index the BHT. How many mis-predictions
are there and which of the branches in this sequence would be mis-
predicted? Fill up the table for 16 branch outcomes.
Problem: Branch Prediction
Outcome sequence (oldest to latest): T T N N T N T T T N T N T T N T
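The mis-prediction count depends on the assumed initial predictor state, which the problem leaves open. Below is a minimal Python simulation of a (1,2) predictor (one global history bit selecting one of two 2-bit saturating counters), assuming the history bit starts at N and both counters start at strongly not-taken; a different initial state gives a different count:

# (1,2) two-level predictor: the last outcome (one history bit) selects one of two
# 2-bit saturating counters; predict taken when the selected counter is 2 or 3.
outcomes = "TTNNTNTTTNTNTTNT"                     # oldest -> latest

counters = {0: 0, 1: 0}        # assumed initial state: both counters strongly not-taken
history = 0                    # assumed initial history: last branch not taken
mispredictions = 0
for actual in (c == "T" for c in outcomes):
    predicted_taken = counters[history] >= 2
    if predicted_taken != actual:
        mispredictions += 1
    # Update the selected counter toward the actual outcome, then shift it into the history.
    counters[history] = min(3, counters[history] + 1) if actual else max(0, counters[history] - 1)
    history = int(actual)

print(mispredictions)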
Problem: Multicycle Pipeline
Consider the following instruction sequence executed on a MIPS floating
point pipeline. Operand forwarding is implemented. [R indicates integer
registers and F indicates floating point registers]. Find the clock cycle in
which STOR instruction reaches MEM stage.
LOAD F4, 8(R2);
FMUL F0, F4, F2;
FADD F3, F0, F2;
STOR F3, 16(R3);
Problem: Multicycle Pipeline
Consider the following instruction sequence executed on a MIPS floating
point pipeline. Operand forwarding is implemented. [R indicates integer
registers and F indicates floating point registers]. Find the clock cycle in
which STOR instruction reaches MEM stage.
LOAD F4, 8(R2);
FMUL F0, F4, F2;
FADD F3, F0, F2;
STOR F3, 16(R3);

Cycle:   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18
LOAD     IF  ID  EX  ME  WB
FMUL         IF  ID  *   M1  M2  M3  M4  M5  M6  M7  ME  WB
FADD             IF  ID  *   *   *   *   *   *   *   A1  A2  A3  A4  ME  WB
STOR                 IF  ID  *   *   *   *   *   *   *   *   *   *   *   ME  WB

LOAD forwards F4 from MEM (cycle 4) to FMUL's M1 in cycle 5; FMUL forwards F0 from M7
(cycle 11) to FADD's A1 in cycle 12; FADD's result F3 is forwarded to STOR, which must
also wait for the MEM stage to be free after FADD uses it in cycle 16.
Hence the STOR instruction reaches the MEM stage in clock cycle 17.
Problem: Compiler Scheduling
Assume a multi-cycle MIPS pipeline with 1 integer unit (EX), 1 FP Adder (4 stage
pipelined unit), 1 Multiplier (7 stage pipelined unit), 1 Divider (24 stage un-pipelined
unit).
ADDI R8, R0, #4 // R0 always holds '0' in MIPS
Loop: LW F1,0(R5) // value A is loaded to F1
LW F2,0(R6) // value B is loaded to F2
MUL.D F3, F1,F1
MUL.D F4, F2,F2
MULI.D F1,F1,#2
MUL.D F1,F1,F2
ADD.D F1, F1,F3
ADD.D F1,F1,F4
SUB.D F2,F3,F4
DIV.D F1,F1,F2
SW F1, 0(R7) // value C is stored to memory
ADD R5,R5,#8
ADD R6,R6,#8
ADD R7,R7,#8
SUB R8,R8,#1
BNEZ R8, Loop
ADDI R8, R0,#8
Problem: Compiler Scheduling
No | Instruction | Diff | WB clock cycle | Remarks

1 ADDI, R8, R0,#4 5 5


2 LW F1,0(R5) +1 6 Fetch @ 2
3 LW F2,0(R6) +1 7
4 MUL.D F3, F1,F1 +1, +6 14 M7@12, MUL is 7, 6 extra cc
5 MUL.D F4, F2,F2 +1 15 M7@13
6 MULI.D F1,F1,#2 +1 16 M7@14
7 MUL.D F1,F1,F2 +1, +6 23 M1@15, M7@21, RAW stall- (6)
8 ADD.D F1, F1,F3 +1,+3 27 A1@22, A4@25, RAW Stall – (3)
9 ADD.D F1,F1,F4 +1,+3 31 A1@26, A4@29, RAW stall –(3)
10 SUB.D F2,F3,F4 +1 32 A4@30
11 DIV.D F1,F1,F2 +1, +23 56 D1@31, D24@54 , DIV is 24, 23 extra cc
12 SW F1, 0(R7) +1 57
13 ADD R5,R5,#8 +1 58
14 ADD R6,R6,#8 +1 59
15 ADD R7,R7,#8 +1 60
16 SUB R8,R8,#1 +1 61
17 BNEZ R8, Loop +1 62 WB @ 62
18 ADDI R8, R0,#8 +1 63 1 loop takes 62-2 = 60 cycles
Problem 2: Compiler Scheduling
Orig. no | New no | Instruction | Diff | WB clock cycle | Remarks
1 1 ADDI, R8, R0,#4 5 5
2 2 LW F1,0(R5) +1 6 Fetch @ 2
3 3 LW F2,0(R6) +1 7
4 4 MUL.D F3, F1,F1 +1, +6 14 M7@12, MUL is 7, 6 extra cc
5 5 MUL.D F4, F2,F2 +1 15 M7@13
6 6 MULI.D F1,F1,#2 +1 16 M7@14
7 7 MUL.D F1,F1,F2 +1, +6 23 M1@15, M7@21, RAW stall- (6)
10 8 SUB.D F2,F3,F4 +1 24 Scheduled line
13 9 ADD R5,R5,#8 +1 25 Scheduled line
14 10 ADD R6,R6,#8 +1 26 Scheduled line
8 11 ADD.D F1, F1,F3 +1 27 A1@22, A4@25 RAW Stall nil
15 12 ADD R7,R7,#8 +1 28 Scheduled line
16 13 SUB R8,R8,#1 +1 29 Scheduled line
9 14 ADD.D F1,F1,F4 +1,+1 31 A1@26, A4@29, RAW stall –(1)
11 15 DIV.D F1,F1,F2 +1, +23 55 D1@31, D24@54 , DIV is 24, 23
extra
12 16 SW F1, -8(R7) +1 56
17 17 BNEZ R8, Loop +1 57 WB @ 57
18 18 ADDI R8, R0,#8 +1 58 One loop iteration takes 57-2 = 55 cycles
Problem: Tomasulo’s Algorithm
Consider an instruction pipeline of an issue width of 1 that uses Tomasulo's algorithm
with two reservation stations per functional unit. There is one instance of each
functional unit type and a single CDB. There is an arbitration mechanism for
resolving CDB entry collision. Preference is given in round robin fashion. The
functional units are not pipelined. An instruction waiting for data on CDB can move to
its EX stage in the cycle after the CDB broadcast.
Assume the following information about functional units.
Functional unit type Cycles in Execution stage
Integer Mul 4
Integer Div 8
Integer Add 1
Complete the following table using Tomasulo's algorithm with the above
specifications. Fill in the cycle numbers in each pipeline stage for each instruction,
and indicate where its source operands are read from (use RF for register file, ROB
for reorder buffer and CDB for common data bus). Some entries are filled in for you.
Fill the rest.
Problem: Tomasulo’s Algorithm

# | Instruction | Issue | Src Operand 1 | Src Operand 2 | EX | WB | Commit

1 DIVI R4, R4, #12 1 RF Imm 2 10 11

2 MUL R2, R6, R12 2 RF RF 3 7 12

3 DIV R1, R1, R2 3 RF CDB 8 16 17

4 ADD R5, R1, R3 4 CDB RF 17 18 19

5 ADDI R7, R2, #4 5 CDB Imm 8 9 20

6 ADD R5, R6, R7 10 RF ROB 11 12 21

7 ADDI R8,R8, #24 13 RF Imm 14 15 22

8 ADD R9, R6, R8 16 RF ROB 17 19* 23

9 MUL R5, R5, R10 17 ROB RF 18 22 24

10 ADD R6, R8, R5 19 ROB CDB 23 24 25


Problem: VLIW Processors
Consider the following code that performs the vector operation on Y=aX+Y. Assume
the initial value of R3 as 4. Consider a VLIW processor that could issue one
memory reference (L.D /S.D), one floating point operation and one integer
operation in an instruction bundle. Branch is also scheduled in the integer unit.
The floating point unit's EX stage has 4 stages and is fully pipelined. The integer unit's
EX stage takes the normal 1 cycle, and full operand forwarding is permitted.
foo:L.D F2, 0(R1) ; load X(i)
MUL.D F4, F2, F0 ; multiply a*X(i)
L.D F6, 0(R2) ; load Y(i)
ADD.D F6, F4, F6 ; add a*X(i) + Y(i)
S.D 0(R2), F6 ; store Y(i)
DADDIU R1, R1, #8 ; increment X index
DADDIU R2, R2, #8 ; increment Y index
DSUBI R3,R3,#1 ;R3=R3-1
BNEZ R3, foo ;branch to Loop if R3!=0
For each of the following cases, perform unrolling, list the instruction bundles, and
obtain the number of cycles for execution in the VLIW processor. For each of the
cases, what is the effective issue rate and the percentage of slots wasted due to a lack of
available instructions to fill the bundle slots?
(a) Case 1: unroll the loop 2 times. (b) Case 2: unroll the loop fully.
Problem: VLIW Processors
Case 1: With 2 iterations of the loop unrolled
Bundle | LD/SD             | FP                   | INT
1      | L.D F2, 0(R1)     |                      |
2      | L.D F12, 8(R1)    |                      |
3      | L.D F6, 0(R2)     | MUL.D F4, F2, F0     |
4      | L.D F16, 8(R2)    | MUL.D F14, F12, F0   |
5      |                   |                      | DADDIU R1, R1, #16
6      |                   |                      | DADDIU R2, R2, #16
7      |                   | ADD.D F6, F4, F6     | DSUBI R3, R3, #2
8      |                   | ADD.D F16, F14, F16  |
9      |                   |                      |
10     |                   |                      |
11     | S.D -16(R2), F6   |                      |
12     | S.D -8(R2), F16   |                      | BNEZ R3, foo

Issue rate = 14/12 ≈ 1.17 :: Total slot wastage = 22/36 = 61.1%


Problem: VLIW Processors
Case 2: With the loop fully unrolled (4 times)
Bundle | LD/SD             | FP                   | INT
1      | L.D F2, 0(R1)     |                      |
2      | L.D F12, 8(R1)    |                      |
3      | L.D F22, 16(R1)   | MUL.D F4, F2, F0     |
4      | L.D F32, 24(R1)   | MUL.D F14, F12, F0   |
5      | L.D F6, 0(R2)     | MUL.D F24, F22, F0   |
6      | L.D F16, 8(R2)    | MUL.D F34, F32, F0   |
7      | L.D F26, 16(R2)   | ADD.D F6, F4, F6     |
8      | L.D F36, 24(R2)   | ADD.D F16, F14, F16  |
9      |                   | ADD.D F26, F24, F26  |
10     |                   | ADD.D F36, F34, F36  |
11     | S.D F6, 0(R2)     |                      |
12     | S.D F16, 8(R2)    |                      |
13     | S.D F26, 16(R2)   |                      |
14     | S.D F36, 24(R2)   |                      |

Issue rate = 20/14 = 1.43 :: Total slot wastage = 22/42 = 52.4%
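A quick Python check of the issue rate and slot wastage for both cases (3 slots per bundle; the helper name is my own):

# Issue rate = instructions per bundle; wastage = empty slots / total slots.

def vliw_stats(instructions, bundles, slots_per_bundle=3):
    total_slots = bundles * slots_per_bundle
    return instructions / bundles, (total_slots - instructions) / total_slots

print(vliw_stats(14, 12))      # case 1: issue rate ~1.17, wastage ~0.611
print(vliw_stats(20, 14))      # case 2: issue rate ~1.43, wastage ~0.524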


THANK YOU
