CA PDF

Explore our comprehensive collection of previous year question papers for the subject Computer Architecture. These papers provide valuable insights into the exam pattern, frequently asked questions, and key topics covered in this subject. Designed to help students strengthen their understanding of concepts such as processor design, memory hierarchy, and instruction sets, these papers are an excellent resource for exam preparation.

Uploaded by

Mohd Shoaib

National Institute of Technology, Tiruchirappalli
Department of Computer Science and Engineering

CYCLE TEST-I
CSPC51 - Computer Architecture

Branch/Semester/Section : CSE/V/A Time : 03:00 to 4:00 pm
Date : 08.09.2023 Max Marks : 15

1. In a server farm such as that used by Amazon or eBay, a single failure does not cause the entire system to crash. Instead, it will reduce the number of requests that can be satisfied at any one time. If a company has 10,000 computers, each with an MTTF of 35 days, and it experiences catastrophic failure only if 1/3 of the computers fail, what is the MTTF for the system? (2)

2. Your company has just bought a new 22-core processor, and you have been tasked with optimizing your software for this processor. You will run four applications on this system, but the resource requirements are not equal. Assume the system and application characteristics listed in the table below:

Application                  A    B    C    D
% resources needed           41   27   18   14
% resources parallelizable   50   80   60   90

The percentage of resources needed assumes that the applications are all run in serial. Assume that when you parallelize a portion of the program by X, the speedup for that portion is X. (4)
a) How much speedup would result from running application A on the entire 22-core processor, as compared to running it serially?
b) Given that application A requires 41% of the resources, if we statically assign it 41% of the cores, what is the overall speedup if A is run parallelized but everything else is run serially?
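The two parts of Question 2 can be checked with Amdahl's law. The sketch below is one possible reading, not an official solution: it assumes that 41% of 22 cores is kept as the fractional value 9.02, and the names `amdahl_speedup`, `A_SHARE` and `A_PARALLEL` are illustrative.

```python
def amdahl_speedup(parallel_fraction, factor):
    """Speedup when `parallel_fraction` of the work is accelerated by `factor`."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / factor)

CORES = 22
A_SHARE = 0.41      # share of total resources needed by application A
A_PARALLEL = 0.50   # parallelizable share of application A

# (a) Application A alone on all 22 cores, compared with running it serially.
speedup_a = amdahl_speedup(A_PARALLEL, CORES)

# (b) A runs on 41% of the cores (assumption: fractional 9.02 cores),
#     while the remaining 59% of the workload stays serial.
speedup_a_partial = amdahl_speedup(A_PARALLEL, A_SHARE * CORES)
overall = 1.0 / ((1.0 - A_SHARE) + A_SHARE / speedup_a_partial)

print(f"(a) {speedup_a:.2f}  (b) {overall:.2f}")
```

Under these assumptions the script reports roughly 1.91 for part (a) and 1.22 for part (b).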

3. The stage delays in a 4-stage pipeline are 800, 500, 400 and 300 picoseconds. The first stage is replaced with a functionally equivalent design involving two stages with respective delays 600 and 350 picoseconds. What is the throughput increase (in %) of the pipeline? (4)
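For Question 3, a short sketch of the usual reasoning: the clock period of a pipeline is set by its slowest stage, so throughput is the reciprocal of the largest stage delay. The sketch assumes no latch or register overhead between stages; the function name is illustrative.

```python
def throughput_gain_percent(old_stage_delays, new_stage_delays):
    """Percentage throughput increase when the cycle time equals the slowest stage."""
    old_cycle = max(old_stage_delays)
    new_cycle = max(new_stage_delays)
    # Throughput is 1 / cycle time, so the ratio of throughputs is old_cycle / new_cycle.
    return (old_cycle / new_cycle - 1.0) * 100.0

old = [800, 500, 400, 300]        # original 4-stage pipeline (ps)
new = [600, 350, 500, 400, 300]   # first stage split into 600 ps and 350 ps
gain = throughput_gain_percent(old, new)
print(f"{gain:.2f}%")
```

With the delays from the question, the cycle time drops from 800 ps to 600 ps, which the script reports as a gain of about 33.33%.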
4. What are the different types of hazards in pipelining, and how do these hazards impact the overall performance of a pipelined processor? (2)

5. There is a 5-stage processor having the stages Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF), Execute (EX) and Write Operand (WO). The EX stage takes 1 clock cycle for ADD and SUB instructions, 3 clock cycles for the MUL instruction, and 2 clock cycles for the DIV instruction. Operand forwarding is used in the pipeline (for a data dependency, the OF stage of the dependent instruction can be executed only [...]). If the IF, ID, OF, and WO stages each take 1 clock cycle, then what is the number of clock cycles taken to complete the following sequence of instructions? (3)
I0: MUL R2, R0, R1
I1: DIV R5, R3, R4
I2: ADD R2, R5, R2
I3: SUB R5, R2, R6


*****************Best Wishes****************

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY
TIRUCHIRAPPALLI - 620 015, TAMIL NADU, INDIA

CYCLE TEST-1
CSPC51 - COMPUTER ARCHITECTURE
Branch/Semester/Sec : CSE/V/A Time: 03:30 PM to 04:30 PM
Date : 30.8.2024 Max Marks: 20
Answer All Questions
1. A company is releasing three modification versions of its processor architecture. The modifications that are being considered have multiple ramifications to the three major components (X, Y & Z) of the processor. The ramifications are outlined in the following table, in which each entry denotes the factor by which the speedup of that component in the column header will be affected. The fractions of the total execution time for the three components X, Y & Z are 30%, 45% & 25% respectively. Identify the speedups (or slowdown) that are to be expected from each of these three modification versions. Rank the modifications in terms of speedup.

     Y     X     Z
A    1.4   0.8   1.5
B    0.6   1.6   1.8
C    1.3   1.4   0.9

{6}

2. Suppose there are two operations to be performed: one is a product of 3 scalar variables, and one is a matrix sum of a pair of two-dimensional arrays, with dimensions 9 by 9. For now let's assume only the matrix sum is parallelizable;
i. What speed-up do you get with 10 versus 40 processors?
ii. Calculate the speed-ups assuming the matrices grow to 20 by 20.
[3+3=6]
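One way to set up Question 2 is sketched below. It rests on assumptions that are not stated in the question: the product of 3 scalars is counted as 2 serial multiplications, every element addition costs the same time as a multiplication, and the matrix sum is perfectly load-balanced (fractional shares per processor are allowed). The function name `speedup` and the parameter `serial_ops` are illustrative.

```python
def speedup(n, processors, serial_ops=2):
    """Speedup for serial_ops sequential operations plus an n-by-n matrix sum.

    Assumption: only the n*n element additions are parallelizable and they
    divide evenly (fractionally) across the processors.
    """
    parallel_ops = n * n
    time_one_processor = serial_ops + parallel_ops
    time_many_processors = serial_ops + parallel_ops / processors
    return time_one_processor / time_many_processors

for n in (9, 20):
    for p in (10, 40):
        print(f"{n}x{n} matrices, {p} processors: speed-up {speedup(n, p):.2f}")
```

Under these assumptions the 9 by 9 case gives about 8.22 and 20.62, and the 20 by 20 case about 9.57 and 33.50. If whole additions must be assigned to processors, the parallel term would be rounded up and the figures would be lower.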

3. Suppose you are designing a system for a real-time application in which specific deadlines must be met. Finishing the computation faster gains nothing. You find that your system can execute the necessary code, in the worst case, twice as fast as necessary.
a. How much energy do you save if you execute at the current speed and turn off the system when the computation is complete?
b. How much energy do you save if you set the voltage and frequency to be half as much? [2+3=5]
4. A program with 2000 instructions has 40% branch instructions, 40% load-store instructions and the rest are ALU instructions. The program is running on a processor operating at 5 GHz. The CPI of branch, load-store, and ALU instructions are 2, 2, and 3 respectively. What is the execution time of this program in microseconds (correct to 2 decimal places)?
[3]
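The execution time in Question 4 follows from total cycles divided by clock rate. The sketch below reads the clock rate in the question as 5 GHz, which is an assumption about the scanned text; the function and variable names are illustrative.

```python
def execution_time_us(instruction_count, mix, clock_hz):
    """Execution time in microseconds.

    `mix` is a list of (fraction_of_instructions, cpi) pairs.
    """
    cycles = sum(instruction_count * fraction * cpi for fraction, cpi in mix)
    return cycles / clock_hz * 1e6

# 40% branch (CPI 2), 40% load-store (CPI 2), remaining 20% ALU (CPI 3)
MIX = [(0.40, 2), (0.40, 2), (0.20, 3)]
CLOCK_HZ = 5e9  # assumption: the clock rate is 5 GHz

time_us = execution_time_us(2000, MIX, CLOCK_HZ)
print(f"{time_us:.2f} microseconds")
```

The instruction mix gives an average CPI of 2.2, so 4400 cycles in total, which at the assumed 5 GHz comes to 0.88 microseconds.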
National Institute of Technology, Tiruchirappalli
Department of Computer Science and Engineering

CYCLE TEST - II
CSPC51 - Computer Architecture

Branch/Semester/Section : CSE/V/A Time : 3:00 to 4:00 pm
Date : 03.11.2023 Max Marks : 15

Answer All Questions

1. The following loop is the so-called DAXPY loop (double-precision aX plus Y) and is the central operation in Gaussian elimination. The following code implements the DAXPY operation, Y = aX + Y, for a vector length 100. Initially, R1 is set to the base address of array X and R2 is set to the base address of Y: [4]

      DADDIU R4,R1,#800   ; R4 = upper bound for X
foo:  L.D    F2,0(R1)     ; (F2) = X(i)
      MUL.D  F4,F2,F0     ; (F4) = a*X(i)
      L.D    F6,0(R2)     ; (F6) = Y(i)
      ADD.D  F6,F4,F6     ; (F6) = a*X(i) + Y(i)
      S.D    F6,0(R2)     ; Y(i) = a*X(i) + Y(i)
      DADDIU R1,R1,#8     ; increment X index
      DADDIU R2,R2,#8     ; increment Y index
      DSLTU  R3,R1,R4     ; test: continue loop?
      BNEZ   R3,foo       ; loop if needed

Assume the functional unit latencies as shown in the table below. Assume a one-cycle delayed branch that resolves in the ID stage. Assume that results are fully bypassed.

Instruction producing result        Instruction using result   Latency in clock cycles
FP multiply                         FP ALU op                  6
FP add                              FP ALU op                  4
FP multiply                         FP store                   5
FP add                              FP store                   4
Integer operations and all loads    Any                        2

Assume a single-issue pipeline. Unroll the loop as many times as necessary to schedule it without any stalls, collapsing the loop overhead instructions. How many times must the loop be unrolled? Show the instruction schedule. What is the execution time per element of the result?

2. Consider the following program code. What are the contents of ROB at the time when we have issued all the instructions in the loop twice? Let's also assume that the L.D and MUL.D from the first iteration have committed and all other instructions have completed execution. [3]
Loop: L.D    F0,0(R1)     ; F0 = array element
      ADD.D  F4,F0,F2     ; add scalar in F2
      S.D    F4,0(R1)     ; store result
      DADDUI R1,R1,#-8    ; decrement pointer; 8 bytes (per DW)
      BNE    R1,R2,Loop   ; branch R1 != R2

3. Consider the usage of critical word first and early restart on L2 cache misses. Assume a 1 MB L2 cache with 64 byte blocks and a refill path that is 16 bytes wide. Assume that the L2 can be written with 16 bytes every 4 processor cycles, the time to receive the first 16 byte block from the memory controller is 120 cycles, each additional 16 byte block from main memory requires 16 cycles, and data can be bypassed directly into the read port of the L2 cache. Ignore any cycles to transfer the miss request to the L2 cache and the requested data to the L1 cache. How many cycles would it take to service an L2 cache miss with and without critical word first and early restart? (3)
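A minimal sketch of the cycle count for Question 3, under two assumptions: with critical word first the requested word arrives in the first 16-byte chunk, and writing a chunk into the L2 (4 cycles) overlaps with the arrival of the next chunk (16 cycles), so it never sits on the critical path. The function name and parameter names are illustrative.

```python
def l2_miss_service_cycles(block_bytes=64, path_bytes=16,
                           first_chunk_cycles=120, next_chunk_cycles=16):
    """Return (with_cwf, without_cwf) cycles to service one L2 miss."""
    chunks = block_bytes // path_bytes
    # Critical word first + early restart: resume as soon as the first chunk,
    # which holds the requested word, has arrived.
    with_cwf = first_chunk_cycles
    # Otherwise wait for the whole block: first chunk plus the remaining chunks.
    without_cwf = first_chunk_cycles + (chunks - 1) * next_chunk_cycles
    return with_cwf, without_cwf

with_cwf, without_cwf = l2_miss_service_cycles()
print(with_cwf, without_cwf)
```

With the numbers from the question this gives 120 cycles with critical word first and early restart, and 120 + 3 × 16 = 168 cycles without.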
4. An 8-way set associative cache memory unit with a capacity of 32 KB is built using a block size of 8 words. The word length is 32 bits. The size of the physical address space is 8 GB. What is the number of bits for the TAG field? [3]
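The address breakdown behind Question 4 can be sketched as follows. The sketch assumes byte-addressable memory, power-of-two sizes, and a cache capacity of 32 KB (the capacity digit in the scanned question is hard to read, so that value is an assumption). The function name `tag_bits` is illustrative.

```python
def tag_bits(cache_bytes, ways, block_bytes, address_space_bytes):
    """Tag width for a set-associative cache; all sizes must be powers of two."""
    address_bits = address_space_bytes.bit_length() - 1   # log2 of address space
    offset_bits = block_bytes.bit_length() - 1            # log2 of block size
    sets = cache_bytes // (block_bytes * ways)
    index_bits = sets.bit_length() - 1                    # log2 of number of sets
    return address_bits - index_bits - offset_bits

bits = tag_bits(
    cache_bytes=32 * 1024,          # assumption: 32 KB capacity
    ways=8,
    block_bytes=8 * 4,              # 8 words of 32 bits each
    address_space_bytes=8 * 2**30,  # 8 GB physical address space
)
print(bits)
```

Under these assumptions the address is 33 bits wide, with 5 offset bits and 7 index bits (128 sets), which leaves 21 tag bits.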
5. a) Which cache configuration, blocking or non-blocking, offers superior memory performance? Justify your answer.
b) Between hardware prefetching and compiler prefetching, which one is more likely to increase memory access time? [2]

**********Best Wishes*********


DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY
TIRUCHIRAPPALLI - 620 015, TAMIL NADU, INDIA

CYCLE TEST-2
CSPC51 - COMPUTER ARCHITECTURE
Branch/Semester/Sec : CSE/V/A Time: 10:30 PM to 11.30 PM
Date : 15.10.2024 Max Marks: 20
Answer All Questions
1. a. State the advantages of OOO pipelines over in-order pipelines.
b. Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute, memory, write-back) and the code in the given Figure 1. All ops are one cycle except LW and SW, which are 1 + 2 cycles, and branches, which are 1 + 1 cycles. There is no forwarding. Show the phases of each instruction per clock cycle for one iteration of the loop.
a. How many clock cycles per loop iteration are lost to branch overhead?
b. Assume a static branch predictor, capable of recognizing a backwards branch in the Decode stage. Now how many clock cycles are wasted on branch overhead?
c. Assume a dynamic branch predictor. How many cycles are lost on a correct prediction?
[1+6=7]

Loop: LW   R3,0(R0)
      LW   R1,0(R3)
      ADDI R1,R1,#1
      SUB  R4,R3,R2
      SW   R1,0(R3)
      BNZ  R4,Loop
Figure 1

2. a. Discuss Tomasulo's algorithm and its necessity. Use the hardware configuration and show the disadvantage of it.
b. Suppose the following instruction sequence is being executed where the system, which is able to issue instructions in order but can perform out of order execution, consists of 3 load units, 3 adder units and 2 multiplier units. The floating point adder unit takes 2 clock cycles, the floating point multiplier takes 10 clock cycles and the divide operation takes 40 clock cycles. If we use Tomasulo's algorithm along with dynamic branch prediction, then find at what cycle the final value of F10 will be available and state the state of the functional units when the DIVD instruction is in execution. Also picturise the states of the reservation station.
[2+5=7]

LD    F6  34+ R2
LD    F2  45+ R3
MULTD F0  F2  F4
SUBD  F8  F6  F2
DIVD  F10 F0  F6
ADDD  F6  F8  F2
3. Consider the usage of critical word first and early restart on L2 cache misses. Assume a 1 MB L2 cache with 64 byte blocks and a refill path that is 16 bytes wide. Assume that the L2 can be written with 16 bytes every 4 processor cycles, the time to receive the first 16-byte block from the memory controller is 120 cycles, each additional 16-byte block from main memory requires 16 cycles, and data can be bypassed directly into the read port of the L2 cache. Ignore any cycles to transfer the miss request to the L2 cache and the requested data to the L1 cache.
How many cycles would it take to service an L2 cache miss with and without critical word first and early restart?
Do you think critical word first and early restart would be more important for L1 caches or L2 caches, and what factors would contribute to their relative importance?
[3]
4. Now you are designing a write buffer between a write-through L1 cache and a write-back L2 cache. The L2 cache write data bus is 16 B wide and can perform a write to an independent cache address every 4 processor cycles.
What should be the size of each write buffer entry?
What speedup could be expected in the steady state by using a merging write buffer instead of a non-merging buffer when zeroing memory by the execution of 64-bit stores if all other instructions could be issued in parallel with the stores and the blocks are present in the L2 cache?
[3]

National Institute of Technology, Tiruchirappalli
Department of Computer Science and Engineering

End Semester
CSPC51 - Computer Architecture
Branch/Semester/Section : CSE/V/A
Time: 10:00 AM to 01:00 PM Max Marks: 50
Date : 11.12.2023

Answer All Questions

1.
a) Suppose there are two operations to be performed: one is a product of 3 scalar variables, and one is a matrix sum of a pair of two-dimensional arrays, with dimensions 9 by 9. For now let's assume only the matrix sum is parallelizable;
i. What speed-up do you get with 10 versus 40 processors? (3)
ii. Calculate the speed-ups assuming the matrices grow to 20 by 20. (3)
b) You are designing a system for a real-time application in which specific deadlines must be met. Finishing the computation faster gains nothing. You find that your system can execute the necessary code, in the worst case, twice as fast as necessary. (4)
a) How much energy do you save if you execute at the current speed and turn off the system when the computation is complete?
b) How much energy do you save if you set the voltage and frequency to be half as much?

2.
a) The average memory access time for a microprocessor with 1 level of cache is 2.4 clock cycles.
- If data is present and valid in the cache, it can be found in 1 clock cycle.
- If data is not found in the cache, 80 clock cycles are needed to get it from off-chip memory.
Designers are trying to improve the average memory access time to obtain a 65% improvement in average memory access time, and are considering adding a 2nd level of cache on-chip.
- This second level of cache could be accessed in 6 clock cycles.
- The addition of this cache does not affect the first level cache's access patterns or hit times.
- Off-chip accesses would still require 80 additional CCs.
To obtain the desired speedup, how often must data be found in the 2nd level cache? (6)
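The two-level AMAT relation behind part (a) can be sketched as below. The sketch assumes that "65% improvement" means a speedup of 1.65, so the target AMAT is 2.4 / 1.65; reading it as a 65% reduction in AMAT would give a different target. All names are illustrative.

```python
AMAT_ONE_LEVEL = 2.4   # clock cycles, given
L1_HIT_TIME = 1        # clock cycles
MEMORY_PENALTY = 80    # clock cycles for an off-chip access
L2_HIT_TIME = 6        # clock cycles

# From AMAT = L1_HIT_TIME + l1_miss_rate * MEMORY_PENALTY
l1_miss_rate = (AMAT_ONE_LEVEL - L1_HIT_TIME) / MEMORY_PENALTY

def required_l2_hit_rate(target_amat):
    """Solve target = L1_HIT + m1 * (L2_HIT + m2 * MEMORY_PENALTY) for 1 - m2."""
    l2_miss_rate = ((target_amat - L1_HIT_TIME) / l1_miss_rate - L2_HIT_TIME) / MEMORY_PENALTY
    return 1.0 - l2_miss_rate

# Assumption: a 65% improvement is read as a 1.65x speedup in AMAT.
target_amat = AMAT_ONE_LEVEL / 1.65
l2_hit_rate = required_l2_hit_rate(target_amat)
print(f"L1 miss rate {l1_miss_rate:.4f}, required L2 hit rate {l2_hit_rate:.1%}")
```

Under that reading the L1 miss rate is 1.75%, the target AMAT is about 1.45 cycles, and the second-level cache must hit roughly 75% of the time.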

b) Consider a cache with a line size of 32 bytes and a main memory that requires 30 ns to transfer a 4-byte word. For any line that is written at least once before being swapped out of the cache, what is the average number of times that the line must be written before being swapped out for a write-back cache to be more efficient than a write-through cache?
How does the answer change if the main memory uses a block transfer capability that has a first word access time of 30 ns and an access time of 5 ns for each word thereafter? (4)
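A sketch of the break-even comparison in part (b), assuming that write-through sends one 4-byte word to memory on every write (30 ns each), that write-back writes the whole dirty line once at swap-out, and that read traffic is the same for both policies. The function name is illustrative.

```python
import math

WORDS_PER_LINE = 32 // 4   # 32-byte line, 4-byte words
WORD_WRITE_NS = 30         # cost of one write-through word write

def min_writes_for_write_back(line_write_back_ns, word_write_ns=WORD_WRITE_NS):
    """Smallest whole number of writes per line at which write-back is strictly cheaper."""
    break_even = line_write_back_ns / word_write_ns
    return math.floor(break_even) + 1

# Case 1: every word of the line costs 30 ns to transfer.
simple_line_ns = WORDS_PER_LINE * 30
# Case 2: block transfer, 30 ns for the first word and 5 ns for each later word.
block_line_ns = 30 + (WORDS_PER_LINE - 1) * 5

print(simple_line_ns, min_writes_for_write_back(simple_line_ns))
print(block_line_ns, min_writes_for_write_back(block_line_ns))
```

Under these assumptions the write-back costs 240 ns without block transfer, so the line must be written more than 8 times on average; with block transfer the write-back costs 65 ns and the threshold drops to more than about 2.17 writes, that is, 3 whole writes.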

3.
a. Identify the RAW, WAR, and WAW dependencies in the following instruction sequence: (4)
I1: R1 = 100
I2: R1 = R2 + R4
I3: R2 = R4 - 25
I4: R4 = R1 + R3
I5: R1 = R1 + 30
Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute, memory, writeback) and the code below. All ops are one cycle except LW and SW, which are 1 + 2 cycles, and branches, which are 1 + 1 cycles. There is no forwarding. Show the phases of each instruction per clock cycle for one iteration of the loop. How many clock cycles per loop iteration are lost to branch overhead? (6)
Loop: LW   R3,0(R0)
      LW   R1,0(R3)
      ADDI R1,R1,#1
      SUB  R4,R3,R2
      SW   R1,0(R3)
      BNZ  R4,Loop

4.
a. Distinguish between static scheduling and dynamic scheduling.
b. Name three major techniques that can reduce control hazard stalls. Explain.
c. How is Tomasulo's algorithm modified to include speculation? What additional hardware is required? (2+3+5)
5.
a) Assume a hypothetical GPU with the following characteristics: (4)
■ Clock rate 1.5 GHz
■ Contains 16 SIMD processors, each containing 16 single-precision floating-point units
■ Has 100 GB/sec off-chip memory bandwidth
Without considering memory bandwidth, what is the peak single-precision floating-point throughput for this GPU in GFLOP/sec, assuming that all memory latencies can be hidden? Is this throughput sustainable given the memory bandwidth limitation?

b) With multicore processors, the coherence among the processor cores is all implemented on chip, using either a snooping or simple central directory protocol. Which cache coherence protocol does Intel i7 use? Explain. (2)
c) There are many extensions of this basic three-state cache coherence protocol; one of the most common extensions is MESI. The MESI protocol has four states. What is the advantage of adding this fourth state and how does the protocol solve the cache coherence problem? Explain the transition of states with an example. (4)
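For part (a) of Question 5, peak throughput is the product of clock rate, SIMD processors and lanes per processor. The sketch below assumes one floating-point operation per lane per cycle, and for the sustainability check it assumes that each operation reads two 4-byte operands and writes one 4-byte result with no reuse from on-chip storage (12 bytes per operation). Both are assumptions, and the names are illustrative.

```python
CLOCK_HZ = 1.5e9
SIMD_PROCESSORS = 16
LANES_PER_PROCESSOR = 16
MEMORY_BANDWIDTH_BYTES_PER_S = 100e9

# Assumption: each lane completes one single-precision operation per cycle.
peak_flops = CLOCK_HZ * SIMD_PROCESSORS * LANES_PER_PROCESSOR

# Assumption: 2 operand reads + 1 result write of 4 bytes each, no data reuse.
BYTES_PER_FLOP = 12
required_bandwidth = peak_flops * BYTES_PER_FLOP
sustainable = required_bandwidth <= MEMORY_BANDWIDTH_BYTES_PER_S

print(f"peak: {peak_flops / 1e9:.0f} GFLOP/s")
print(f"bandwidth needed: {required_bandwidth / 1e9:.0f} GB/s, sustainable: {sustainable}")
```

Under these assumptions the peak is 384 GFLOP/sec, which would need about 4608 GB/sec of memory traffic, far above the 100 GB/sec available, so the peak rate is not sustainable unless most operands are reused on chip.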

••••••••
