CA PDF
National Institute of Technology, Tiruchirappalli
Department of Computer Science and Engineering
CYCLE TEST-I
CSPC51 - Computer Architecture
Date: 08.09.2023    Max Marks: 15
Application                    A    B    C    D
% resources needed            41   27   18   14
% resources parallelizable    50   80   60   90
5. There is a 5 stage processor having the stages Instruction Fetch (IF), Instruction
Decode (ID), Operand Fetch (OF), Execute (EX) and Write Operand (WO). The
EX stage takes 1 clock cycle for ADD and SUB instructions, 3 clock cycles for
MUL instruction, and 2 clock cycles for DIV instruction. Operand forwarding is
used in the pipeline (for a data dependency, the OF stage of the dependent instruction can
be executed only ...). If the IF, ID, OF, and WO stages each take 1 clock cycle, then
what is the number of clock cycles taken to complete the following sequence of
instructions? (3)
I0: MUL R2, R0, R1
I1: DIV R5, R3, R4
I2: ADD R2, R5, R2
I3: SUB R5, R2, R6
*****************Best Wishes****************
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY
TIRUCHIRAPPALLI - 620 015, TAMIL NADU, INDIA
CYCLE TEST-1
CSPC51 - COMPUTER ARCHITECTURE
Branch/Semester/Sec: CSE/V/A    Time: 03:30 PM to 04:30 PM
Date: 30.8.2024    Max Marks: 20
Answer All Questions
1. A company is releasing three modification versions of its processor architecture. The
modifications that are being considered have multiple ramifications to the three major
components (X, Y & Z) of the processor. The ramifications are outlined in the following table,
in which each entry denotes the factor by which the speedup of that component in the column
header will be affected. The fractions of the total execution time for the three components X,
Y & Z are 30%, 45% & 25% respectively. Identify the speedups (or slowdown) that are to be
expected from each of these three modification versions. Rank the modifications in terms of
speedup.
      Y     X     Z
A    1.4   0.8   1.5
{6}
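The calculation this question asks for can be sketched with the multi-component form of Amdahl's law: the new execution time is the sum of each component's time fraction divided by its speedup factor. The function name `overall_speedup` and the factor values below are hypothetical and chosen only for illustration; they are not taken from the table.

```python
def overall_speedup(fractions, factors):
    """Overall speedup when component i, which takes fractions[i] of the
    original execution time, is sped up by factors[i] (values below 1 are slowdowns)."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    # New time relative to an original time of 1.0.
    new_time = sum(f / s for f, s in zip(fractions, factors))
    return 1.0 / new_time


fractions = [0.30, 0.45, 0.25]   # X, Y, Z shares of execution time (from the question)
hypothetical = [2.0, 1.0, 0.5]   # illustrative factors only, NOT the table's values

print(round(overall_speedup(fractions, hypothetical), 3))  # 0.909, an overall slowdown
```

Applying the same function to each row of the table and sorting the results would give the requested ranking.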
2. Let there be two operations to be performed: one is a product of 3 scalar variables, and one
is a matrix sum of a pair of two-dimensional arrays, with dimensions 9 by 9. For now let's
assume only the matrix sum is parallelizable;
i. What speed-up do you get with 10 versus 40 processors?
ii. Calculate the speed-ups assuming the matrices grow to 20 by 20.
[3+3=6]
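One way to set up this kind of speed-up calculation is sketched below. It rests on assumptions that the question does not state: every scalar operation takes one time unit, the product of 3 scalars is counted as 2 serial multiplications, and the matrix additions divide evenly across processors with no overhead. The name `speedup` is illustrative.

```python
def speedup(serial_ops, parallel_ops, processors):
    """Speed-up over one processor when only parallel_ops can be spread out."""
    time_one = serial_ops + parallel_ops
    time_many = serial_ops + parallel_ops / processors
    return time_one / time_many


SERIAL_OPS = 2  # assumption: a product of 3 scalars needs 2 multiplications

for n in (9, 20):            # matrix dimension
    for p in (10, 40):       # processor count
        print(f"{n} by {n}, {p} processors: {speedup(SERIAL_OPS, n * n, p):.2f}")
```

If the serial part is counted differently (for example as 3 operations), only `SERIAL_OPS` changes.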
3. Suppose you are designing a system for a real time application in which specific deadlines
must be met. Finishing the computation faster gains nothing. You find that your system can
execute the necessary code, in the worst case, twice as fast as necessary.
a. How much energy do you save if you execute at the current speed and turn off the
system when the computation is complete?
b. How much energy do you save if you set the voltage and frequency to be half as
much? [2+3=5]
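For part b, a minimal sketch under the usual dynamic-energy model, in which the energy for a fixed amount of work is proportional to capacitive load times voltage squared and does not depend on frequency. Static (leakage) energy is ignored here, which is an assumption; the function name is illustrative.

```python
def dynamic_energy(capacitive_load, voltage):
    """Relative dynamic energy for a fixed task: proportional to C * V^2."""
    return capacitive_load * voltage ** 2


baseline = dynamic_energy(1.0, 1.0)
scaled = dynamic_energy(1.0, 0.5)   # half the voltage; frequency does not enter the model
saving = 1.0 - scaled / baseline

print(f"dynamic energy saved by halving voltage and frequency: {saving:.0%}")
```

Halving the frequency stretches the run to exactly fill the available time, so the deadline is still met under the question's premise.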
4. A program with 2000 instructions has 40% branch instructions, 40% load-store
instructions and the rest are ALU instructions. The program is running on a processor
operating at 5 GHz. The CPI of branch, load-store, and ALU instructions are 2, 2, and 3
respectively. What is the execution time of this program in microseconds (correct to 2
decimal places)?
[3]
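The execution time follows from total cycles divided by clock rate. The sketch below assumes the clock rate is 5 GHz (the scanned digit is hard to read) and uses an illustrative function name.

```python
def execution_time_us(instr_count, mix_percent, cpi, clock_hz):
    """Execution time in microseconds for a given instruction mix and per-class CPI."""
    cycles = sum(instr_count * mix_percent[k] / 100 * cpi[k] for k in mix_percent)
    return cycles / clock_hz * 1e6


mix = {"branch": 40, "load_store": 40, "alu": 20}   # percent of instructions
cpi = {"branch": 2, "load_store": 2, "alu": 3}

# 4400 cycles at an assumed 5 GHz clock
print(round(execution_time_us(2000, mix, cpi, 5e9), 2))  # 0.88
```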
National Institute of Technology, Tiruchirappalli
Department of Computer Science and Engineering
CYCLE TEST - II
CSPC51 - Computer Architecture
1. The following loop is the so-called DAXPY loop (double-precision aX plus Y) and is the
central operation in Gaussian elimination. The following code implements the DAXPY
operation, Y = aX + Y, for a vector length 100. Initially, R1 is set to the base address of
array X and R2 is set to the base address of Y: [4]
      DADDIU R4,R1,#800   ; R4 = upper bound for X
foo:  L.D    F2,0(R1)     ; (F2) = X(i)
      MUL.D  F4,F2,F0     ; (F4) = a*X(i)
      L.D    F6,0(R2)     ; (F6) = Y(i)
      ADD.D  F6,F4,F6     ; (F6) = a*X(i) + Y(i)
      S.D    F6,0(R2)     ; Y(i) = a*X(i) + Y(i)
      DADDIU R1,R1,#8     ; increment X index
      DADDIU R2,R2,#8     ; increment Y index
      DSLTU  R3,R1,R4     ; test: continue loop?
      BNEZ   R3,foo       ; loop if needed
Assume the functional unit latencies as shown in the table below. Assume a one cycle
delayed branch that resolves in the ID stage. Assume that results are fully bypassed.

Instruction producing result        Instruction using result   Latency in clock cycles
FP multiply                         FP ALU op                  6
FP add                              FP ALU op                  4
FP multiply                         FP store                   5
FP add                              FP store                   4
Integer operations and all loads    Any                        2
Assume a single-issue pipeline. Unroll the loop as many times as necessary to schedule it
without any stalls, collapsing the loop overhead instructions. How many times must the loop
be unrolled? Show the instruction schedule. What is the execution time per element of the
result?
2. Consider the following program code. What are the contents of ROB at the time when we have
issued all the instructions in the loop twice? Let's also assume that the L.D and MUL.D from
the first iteration have committed and all other instructions have completed execution. [3]
Loop: L.D    F0,0(R1)     ; F0 = array element
      ADD.D  F4,F0,F2     ; add scalar in F2
      S.D    F4,0(R1)     ; store result
      DADDUI R1,R1,#-8    ; decrement pointer; 8 bytes (per DW)
      BNE    R1,R2,Loop   ; branch R1 != R2
3. Consider the usage of critical word first and early restart on L2 cache misses. Assume a 1
MB L2 cache with 64 byte blocks and a refill path that is 16 bytes wide. Assume that the
L2 can be written with 16 bytes every 4 processor cycles, the time to receive the first 16
byte block from the memory controller is 120 cycles, each additional 16 byte block from
main memory requires 16 cycles, and data can be bypassed directly into the read port of
the L2 cache. Ignore any cycles to transfer the miss request to the L2 cache and the
requested data to the L1 cache. How many cycles would it take to service an L2 cache miss
with and without critical word first and early restart? (3)
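A minimal sketch of the miss-service arithmetic, assuming that with critical word first and early restart the requested word arrives in the first 16-byte chunk and is bypassed immediately, while without it the request waits for the whole block. The L2 write time is treated as overlapped with the transfer, which is also an assumption. The function and parameter names are illustrative.

```python
def l2_miss_service_cycles(block_bytes, path_bytes, first_chunk_cycles,
                           next_chunk_cycles, critical_word_first):
    """Cycles until the requested data is available after an L2 miss."""
    chunks = block_bytes // path_bytes
    if critical_word_first:
        # The needed chunk is fetched first and forwarded as soon as it arrives.
        return first_chunk_cycles
    # Otherwise wait for every chunk of the block.
    return first_chunk_cycles + (chunks - 1) * next_chunk_cycles


print(l2_miss_service_cycles(64, 16, 120, 16, critical_word_first=True))   # 120
print(l2_miss_service_cycles(64, 16, 120, 16, critical_word_first=False))  # 168
```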
4. An 8-way set associative cache memory unit with a capacity of 32 KB is built using a block
size of 8 words. The word length is 32 bits. The size of the physical address space is 8 GB.
What is the number of bits for the TAG field? [3]
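The tag width can be worked out by subtracting the block-offset and set-index bits from the physical address width. The sketch below assumes byte addressing and a 32 KB capacity (the capacity digit is unclear in the scan); `ilog2` and `tag_bits` are illustrative helper names.

```python
def ilog2(n):
    """Exact base-2 logarithm of a power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two")
    return n.bit_length() - 1


def tag_bits(capacity_bytes, ways, block_bytes, address_space_bytes):
    """Tag width = address bits - set index bits - block offset bits."""
    sets = capacity_bytes // (ways * block_bytes)
    return ilog2(address_space_bytes) - ilog2(sets) - ilog2(block_bytes)


block_bytes = 8 * 4  # 8 words of 32 bits each
# 33 address bits, 7 index bits (128 sets), 5 offset bits
print(tag_bits(32 * 1024, 8, block_bytes, 8 * 2**30))  # 21
```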
5. a) Which cache configuration, blocking or non-blocking, offers superior memory
performance? Justify your answer.
b) Between hardware prefetching and compiler prefetching, which one is more likely to
increase memory access time? [2]
CYCLE TEST-2
CSPC51 - COMPUTER ARCHITECTURE
Branch/Semester/Sec: CSE/V/A    Time: 10:30 PM to 11.30 PM
Date: 15.10.2024    Max Marks: 20
Answer All Questions
1. a. State the advantages of OoO (out-of-order) pipelines over in-order pipelines.
b. Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute,
memory, write-back) and the code in the given Figure 1. All ops are one cycle except LW and SW,
Loop: LW   R3,0(R0)
      LW   R1,0(R3)
      ADDI R1,R1,#1
National Institute of Technology, Tiruchirappalli
Department of Computer Science and Engineering
End Semester
CSPC51 - Computer Architecture
Branch/Semester/Section: CSE/V/A
Date: 11.12.2023    Time: 10:00 AM to 01:00 PM    Max Marks: 50
1.
a) Let there be two operations to be performed: one is a product of 3 scalar variables,
and one is a matrix sum of a pair of two-dimensional arrays, with dimensions 9 by 9.
For now let's assume only the matrix sum is parallelizable;
i. What speed-up do you get with 10 versus 40 processors? [3]
ii. Calculate the speed-ups assuming the matrices grow to 20 by 20. [3]
b) You are designing a system for a real-time application in which specific deadlines
must be met. Finishing the computation faster gains nothing. You find that your
system can execute the necessary code, in the worst case, twice as fast as necessary. [4]
a) How much energy do you save if you execute at the current speed and turn off the
system when the computation is complete?
b) How much energy do you save if you set the voltage and frequency to be half as
much?
2.
a) The average memory access time for a microprocessor with 1 level of cache is 2.4 clock
cycles.
- If data is present and valid in the cache, it can be found in 1 clock cycle.
- If data is not found in the cache, 80 clock cycles are needed to get it from off-chip
memory.
Designers are trying to improve the average memory access time to obtain a
65% improvement in average memory access time, and are considering adding a 2nd
level of cache on-chip.
- This second level of cache could be accessed in 6 clock cycles.
- The addition of this cache does not affect the first level cache's access patterns or hit
times.
- Off-chip accesses would still require 80 additional CCs.
To obtain the desired speedup, how often must data be found in the 2nd level cache? [6]
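One way to set up this calculation is sketched below. It reads "65% improvement" as a speedup of 1.65 in average memory access time (AMAT), which is an interpretation rather than something the question states; a different reading changes only the `speedup` argument. The function name is illustrative.

```python
def required_l2_hit_rate(amat_old, l1_hit_cycles, memory_cycles,
                         l2_access_cycles, speedup):
    """L2 hit rate needed to reach amat_old / speedup with a two-level cache."""
    # Single-level cache: amat_old = l1_hit + l1_miss_rate * memory_cycles
    l1_miss_rate = (amat_old - l1_hit_cycles) / memory_cycles
    amat_target = amat_old / speedup
    # Two levels: amat = l1_hit + l1_miss_rate * (l2_access + l2_miss_rate * memory_cycles)
    l1_miss_penalty = (amat_target - l1_hit_cycles) / l1_miss_rate
    l2_miss_rate = (l1_miss_penalty - l2_access_cycles) / memory_cycles
    return 1.0 - l2_miss_rate


rate = required_l2_hit_rate(2.4, 1, 80, 6, speedup=1.65)
print(f"required L2 hit rate: {rate:.1%}")
```

Under these assumptions the L1 miss rate is 1.75% and the L2 cache must hit on roughly three quarters of the accesses that reach it.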
b) Consider a cache with a line size of 32 bytes and a main memory that requires 30 ns to
transfer a 4-byte word. For any line that is written at least once before being swapped out
of the cache, what is the average number of times that the line must be written before
being swapped out for a write-back cache to be more efficient than a write-through cache?
How does the answer change if the main memory uses a block transfer capability that has
a first-word access time of 30 ns and an access time of 5 ns for each word thereafter? [4]
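The comparison can be sketched as a break-even ratio: the cost of writing a whole line back once versus the cost of one word write per update under write-through. The sketch assumes each write-through update costs one 30 ns word transfer; the function name is illustrative.

```python
def break_even_writes(words_per_line, first_word_ns, next_word_ns, word_write_ns):
    """Number of writes per line above which write-back costs less memory time."""
    line_writeback_ns = first_word_ns + (words_per_line - 1) * next_word_ns
    return line_writeback_ns / word_write_ns


words = 32 // 4  # 32-byte line, 4-byte words

print(break_even_writes(words, 30, 30, 30))  # no block transfer: 240 ns per line
print(break_even_writes(words, 30, 5, 30))   # block transfer: 65 ns per line
```

Write-back is the more efficient policy when the average number of writes per line exceeds the printed ratio.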
3.
a. Identify the RAW, WAR, and WAW dependencies in the following instruction sequence: [4]
I1: R1 = 100
I2: R1 = R2 + R4
I3: R2 = ~ - 25
I4: R4 = R1 + R3
I5: R1 = R1 + 30
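A small sketch of how such dependencies can be enumerated mechanically, by comparing the destination and source registers of every ordered pair of instructions. It reports every name-based pair, including ones that a later write to the same register would make irrelevant in hardware. The three-instruction sequence below is hypothetical and is not the sequence from the question.

```python
def find_dependencies(instrs):
    """instrs: list of (name, dest, sources). Returns (kind, earlier, later, register) tuples."""
    found = []
    for j, (name_j, dest_j, srcs_j) in enumerate(instrs):
        for name_i, dest_i, srcs_i in instrs[:j]:
            if dest_i in srcs_j:
                found.append(("RAW", name_i, name_j, dest_i))  # later reads what earlier wrote
            if dest_j in srcs_i:
                found.append(("WAR", name_i, name_j, dest_j))  # later overwrites what earlier read
            if dest_i == dest_j:
                found.append(("WAW", name_i, name_j, dest_i))  # both write the same register
    return found


# Hypothetical example sequence (illustration only):
example = [
    ("A", "R1", ["R2", "R3"]),   # A: R1 = R2 + R3
    ("B", "R2", ["R1"]),         # B: R2 = R1 + 4
    ("C", "R1", ["R5"]),         # C: R1 = R5 - 1
]

for dep in find_dependencies(example):
    print(dep)
```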
b. Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute, memory,
writeback) and the code below. All ops are one cycle except LW and SW, which are 1 + 2
cycles, and branches, which are 1 + 1 cycles. There is no forwarding. Show the phases of each
instruction per clock cycle for one iteration of the loop. How many clock cycles per loop
iteration are lost to branch overhead? [6]
Loop: LW   R3,0(R0)
      LW   R1,0(R3)
      ADDI R1,R1,#1
      SUB  R4,R3,R2
      SW   R1,0(R3)
      BNZ  R4,Loop
4.
a. Distinguish between static scheduling and dynamic scheduling.
b. Name three major techniques that can reduce control hazard stalls. Explain.
c. How is Tomasulo's algorithm modified to include speculation? What additional
hardware is required? [2+3+5]
5.
a) Assume a hypothetical GPU with the following characteristics: [4]