Gate CoA

1 CO and Architecture (206)
Machine instructions and Addressing modes. ALU, data‐path and control unit. Instruction pipelining. Pipeline hazards, Memory
hierarchy: cache, main memory and secondary storage; I/O interface (Interrupt and DMA mode)
Mark Distribution in Previous GATE

Year 2021-1 2021-2 2020 2019 2018 2017-1 2017-2 2016-1 2016-2 Minimum Average Maximum
1 Mark Count 1 2 3 2 2 2 0 1 1 0 1.5 3
2 Marks Count 2 2 4 1 3 4 3 2 5 1 2.8 5
Total Marks 5 6 11 4 8 10 6 5 11 4 7.3 11
1.1 Addressing Modes (18) top☝
1.1.1 Addressing Modes: GATE CSE 1987 | Question: 1-V top☝ ☛ https://gateoverflow.in/80194
The most relevant addressing mode to write position-independent codes is:

A. Direct mode
B. Indirect mode
C. Relative mode
D. Indexed mode
gate1987 co-and-architecture addressing-modes
Answer ☟
1.1.2 Addressing Modes: GATE CSE 1988 | Question: 9iii top☝ ☛ https://gateoverflow.in/94388
In the program scheme given below indicate the instructions containing any operand needing relocation for position
independent behaviour. Justify your answer.
Y = 10
MOV X(R0 ), R1
MOV X, R0
MOV 2(R0 ), R1
MOV Y (R0 ), R5
⋅
⋅
⋅
X: WORD 0, 0, 0
gate1988 normal descriptive co-and-architecture addressing-modes
Answer ☟
1.1.3 Addressing Modes: GATE CSE 1989 | Question: 2-ii top☝ ☛ https://gateoverflow.in/87078
Match the pairs in the following questions:
(A) Base addressing (p) Reentranecy

(B) Indexed addressing (q) Accumulator
(C) Stack addressing (r) Array
(D) Implied addressing (s) Position independent
gate1989 match-the-following co-and-architecture addressing-modes easy
Answer ☟
1.1.4 Addressing Modes: GATE CSE 1993 | Question: 10 top☝ ☛ https://gateoverflow.in/2307
The instruction format of a CPU is:
© Copyright GATE Overflow. Some rights reserved.

Mode and RegR together specify the operand. RegR specifies a CPU register and Mode specifies an addressing mode. In
particular, Mode = 2 specifies that ‘the register RegR contains the address of the operand, after fetching the operand, the contents
of RegR are incremented by 1'.
An instruction at memory location 2000 specifies Mode = 2 and the RegR refers to program counter (PC).
A. What is the address of the operand?

B. Assuming that is a non-jump instruction, what are the contents of PC after the execution of this instruction?
gate1993 co-and-architecture addressing-modes normal descriptive
Answer ☟
1.1.5 Addressing Modes: GATE CSE 1996 | Question: 1.16, ISRO2016-42 top☝ ☛ https://gateoverflow.in/2720
Relative mode of addressing is most relevant to writing:

A. Co – routines
B. Position – independent code
C. Shareable code
D. Interrupt Handlers
gate1996 co-and-architecture addressing-modes easy isro2016
Answer ☟
1.1.6 Addressing Modes: GATE CSE 1998 | Question: 1.19 top☝ ☛ https://gateoverflow.in/1656
Which of the following addressing modes permits relocation without any change whatsoever in the code?
A. Indirect addressing
B. Indexed addressing
C. Base register addressing
D. PC relative addressing
gate1998 co-and-architecture addressing-modes easy
Answer ☟
A certain processor supports only the immediate and the direct addressing modes. Which of the following programming
language features cannot be implemented on this processor?
A. Pointers
B. Arrays
C. Records
D. Recursive procedures with local variable
gate1999 co-and-architecture addressing-modes normal multiple-selects
Answer ☟
The most appropriate matching for the following pairs

X: Indirect addressing 1: Loops
Y: Immediate addressing 2: Pointers
Z: Auto decrement addressing 3: Constants
is
A. X − 3, Y − 2, Z − 1
B. X − 1, Y − 3, Z − 2
C. X − 2, Y − 3, Z − 1
D. X − 3, Y − 1, Z − 2
gate2000-cse co-and-architecture normal addressing-modes
Answer ☟
Which is the most appropriate match for the items in the first column with the items in the second column:
X. Indirect Addressing I. Array implementation

Y. Indexed Addressing II. Writing relocatable code
Z. Base Register Addressing III. Passing array as parameter
A. (X, III), (Y, I), (Z, II)

B. (X, II), (Y, III), (Z, I)
C. (X, III), (Y, II), (Z, I)
D. (X, I), (Y, III), (Z, II)
gate2001-cse co-and-architecture addressing-modes normal
Answer ☟
In the absolute addressing mode:
A. the operand is inside the instruction

B. the address of the operand in inside the instruction
C. the register containing the address of the operand is specified inside the instruction
D. the location of the operand is implicit
gate2002-cse co-and-architecture addressing-modes easy
Answer ☟
Which of the following addressing modes are suitable for program relocation at run time?
I. Absolute addressing
II. Based addressing
III. Relative addressing
IV. Indirect addressing
A. I and IV
B. I and II
C. II and III
D. I, II and IV
Answer ☟

Consider a three word machine instruction

ADDA[R0 ], @B
The first operand (destination) “A[R0 ]” uses indexed addressing mode with R0 as the index register. The second operand (source)
“@B” uses indirect addressing mode. A and B are memory addresses residing at the second and third words, respectively. The first
word of the instruction specifies the opcode, the index register designation and the source and destination addressing modes. During
execution of ADD instruction, the two operands are added and stored in the destination (first operand).
The number of memory cycles needed during the execution cycle of the instruction is:
A. 3
B. 4
C. 5
D. 6
gate2005-cse co-and-architecture addressing-modes normal
Answer ☟
Match each of the high level language statements given on the left hand side with the most natural addressing mode from
those listed on the right hand side.
(1) A[I] = B[J] (a) Indirect addressing

(2) while (∗ A++); (b) Indexed addressing
(3) int temp =∗ x (c) Auto increment
A. (1, c), (2, b), (3, a)

B. (1, c), (2, c), (3, b)
C. (1, b), (2, c), (3, a)
D. (1, a), (2, b), (3, c)
Answer ☟
1.1.14 Addressing Modes: GATE CSE 2008 | Question: 33, ISRO2009-80 top☝ ☛ https://gateoverflow.in/444
Which of the following is/are true of the auto-increment addressing mode?
I. It is useful in creating self-relocating code

II. If it is included in an Instruction Set Architecture, then an additional ALU is required for effective address calculation
III. The amount of increment depends on the size of the data item accessed
A. I only
B. II only
C. III only
D. II and III only
gate2008-cse addressing-modes co-and-architecture normal isro2009
Answer ☟
Consider a hypothetical processor with an instruction of type LW R1, 20(R2) , which during execution reads a 32 − bit
word from memory and stores it in a 32 − bit register R1 . The effective address of the memory location is obtained by the addition
of a constant 20 and the contents of register R2 . Which of the following best reflects the addressing mode implemented by this
instruction for the operand in memory?
A. Immediate addressing
B. Register addressing
C. Register Indirect Scaled Addressing

D. Base Indexed Addressing
Answer ☟
1.1.16 Addressing Modes: GATE CSE 2017 Set 1 | Question: 11 top☝ ☛ https://gateoverflow.in/118291
Consider the C struct defined below:

struct data {
int marks [100];
char grade;
int cnumber;
};
struct data student;
The base address of student is available in register R1. The field student.grade can be accessed efficiently using:
A. Post-increment addressing mode, (R1)+

B. Pre-decrement addressing mode, −(R1)
C. Register direct addressing mode, R1
D. Index addressing mode, X(R1), where X is an offset represented in 2′ s complement 16 − bit representation
gate2017-cse-set1 co-and-architecture addressing-modes
Answer ☟
1.1.17 Addressing Modes: GATE IT 2006 | Question: 39, ISRO2009-42 top☝ ☛ https://gateoverflow.in/3578
Which of the following statements about relative addressing mode is FALSE?
A. It enables reduced instruction size

B. It allows indexing of array element with same instruction
C. It enables easy relocation of data
D. It enables faster address calculation than absolute addressing
gate2006-it co-and-architecture addressing-modes normal isro2009
Answer ☟
1.1.18 Addressing Modes: GATE IT 2006 | Question: 40 top☝ ☛ https://gateoverflow.in/3581
The memory locations 1000, 1001 and 1020 have data values 18, 1 and 16 respectively before the following program is
executed.
MOVI Rs , 1 ; Move immediate

LOAD Rd , 1000(Rs ) ; Load from memory
ADDI Rd , 1000 ; Add immediate
STOREI 0(Rd ), 20 ; Store immediate
Which of the statements below is TRUE after the program is executed ?

A. Memory location 1000 has value 20
B. Memory location 1020 has value 20
C. Memory location 1021 has value 20
D. Memory location 1001 has value 20
gate2006-it co-and-architecture addressing-modes normal
Answer ☟
Answers: Addressing Modes
1.1.1 Addressing Modes: GATE CSE 1987 | Question: 1-V top☝ ☛ https://gateoverflow.in/80194


(C) Relative Mode since we can just change the content of base register if we wish to relocate.
REFERENCE: https://gateoverflow.in/155280/self-doubt-computer-organization?show=155312
References
 26 votes -- srestha (85.2k points)
1.1.2 Addressing Modes: GATE CSE 1988 | Question: 9iii top☝ ☛ https://gateoverflow.in/94388
MOV instructions here are problematic for Position Independent behaviour if they use Indexed addressing mode and the
base address is loaded using some Absolute value.
I'm assuming R0 is having some relative address. So, the first two MOV instructions are fine for Position Independent
behaviour.
But MOV 2(R0 ), R1

is not fine for position independent behaviour as the source operand address here is 2 + R0 = 2 + X which remains the same
even when program is loaded to a new address. So, it should catch invalid content when program is loaded to a new address.
Similarly the last MOV will also cause problem.
 5 votes -- Arjun Suresh (332k points)
1.1.3 Addressing Modes: GATE CSE 1989 | Question: 2-ii top☝ ☛ https://gateoverflow.in/87078
(A) Base addressing (s) Position independent (By changing the value in
base register, location of address can be changed)
(B) Indexed addressing (r) Array
(C) Stack addressing (p) Reentranecy (Whenever code happens to be used
again, address need not be the same)
(D) Implied addressing (q) Accumulator (If an address is not specified, it is
assumed/implied to be the Accumulator)
 32 votes -- Prashant Singh (47.2k points)

A. Address of the operand = content of PC = 2001 as PC holds the address of the next instruction to be executed and
instruction size is 1 − word as given in the diagram.
B. After execution of the current instruction PC will be automatically incremented by 1 when the next instruction is fetched.
Also one extra increment will be done by operand fetch. So, PC = 2003 supposing next instruction is fetched. If we
assume next instruction fetch is not done (this should be the default here), it should be 2002 .
1.1.5 Addressing Modes: GATE CSE 1996 | Question: 1.16, ISRO2016-42 top☝ ☛ https://gateoverflow.in/2720

Answer is (B).
Relative mode addressing is most relevant to writing a position-independent code.
Reference: http://en.wikipedia.org/wiki/Addressing_mode#PC-relative
References

 26 votes -- Rajarshi Sarkar (27.9k points)

(D) PC relative addressing is the best option. For Base register addressing, we have to change the address in the base
register while in PC relative there is absolutely no change in code needed.

Pointer access requires indirect addressing which can be simulated with indexed addressing or register indirect
addressing but not with direct and immediate addressing. An array and record access needs a pointer access. So, options (A), (B)
and (C) cannot be implemented on such a processor.
Now, to handle recursive procedures we need to use stack. A local variable inside the stack will be accessed as *(SP + offset)
which is nothing but a pointer access and requires indirect addressing. Usually this is done by moving the SP value to Base
register and then using Base Relative addressing to avoid unnecessary memory accesses for indirect addressing- but not possible
with just direct and immediate addressing.
So, options (A), (B), (C) and (D) are correct.

(C) is the most appropriate one.
General instruction format: |opcode|Mode|Address Field|

Pointer dereference → Indirect addressing, E. A = M [Value stored in address field]
Operating with a constant → Immediate addressing, E. A = Address field of the instruction.
Loop iteration counter check → Auto decrement addressing R1 = R1 − 1 ; E. A = R1
E.A = Effective address, where the required operand will be found.
 28 votes -- Bhagirathi Nayak (11.7k points)

(A) is the answer.
Array implementation can use Indexed addressing.
While passing array as parameter we can make use of a pointer (as in (C)) and hence can use Indirect addressing.
Base Register addressing can be used to write relocatable code by changing the content of Base Register.

(B) is the answer. Absolute addressing mode means address of operand is given in the instruction.
option (A), operand is inside the instruction → immediate addressing

option (C), register containing the address in specified in operand → register Indirect addressing

option (D), the location of operand is implicit → implicit addressing
 60 votes -- gatecse (63.3k points)

Answer: (C)
A displacement type addressing should be preferred. So, (I) is not the answer.
Indirect Addressing leads to extra memory reference which is not preferable at run time. So, (IV) is not the answer.

1 memory read to get the first operand from memory address A + R0 (A is given as part of instruction)
1 memory read to get the address of the second operand (since second uses indirect addressing)
1 memory read to get the second operand from the address given by the previous memory read
1 memory write to store to first operand (which is the destination)
So, total of 4 memory cycles once the instruction is fetched.

The second and third words of the instruction are loaded as part of the Instruction fetch and not during the execute stage.
Reference: http://www.cs.iit.edu/~cs561/cs350/fetch/fetch.html
Correct Answer: B
References

C is the answer.
A[i] = B[j] ; Indexed addressing

(∗ A + +)

while (∗ A + +) ; Auto increment
temp =∗ x; Indirect addressing
1.1.14 Addressing Modes: GATE CSE 2008 | Question: 33, ISRO2009-80 top☝ ☛ https://gateoverflow.in/444

In auto increment addressing mode, the base address is incremented after operand fetch. This is useful in fetching
elements from an array. But this has no effect in self-relocating code (where code can be loaded to any address) as this works on
the basis of an initial base address.
An additional ALU is desirable for better execution especially with pipelining, but never a necessity.
Amount of increment depends on the size of the data item accessed as there is no need to fetch a part of a data.
So, answer must be C only.

Answer is (D).
Base Index Addressing, as the content of register R2 will serve as the index and 20 will be the Base address.
1.1.16 Addressing Modes: GATE CSE 2017 Set 1 | Question: 11 top☝ ☛ https://gateoverflow.in/118291

Answer is option (D).
Displacement Mode :-
Similar to index mode, except instead of a index register a base register will be used. Base register contains a pointer to a
memory location. An integer (constant) is also referred to as a displacement. The address of the operand is obtained by adding
the contents of the base register plus the constant. The difference between index mode and displacement mode is in the number
of bits used to represent the constant. When the constant is represented a number of bits to access the memory, then we have
index mode. Index mode is more appropriate for array accessing; displacement mode is more appropriate for structure (records)
accessing.
Reference:
http://www.cs.iit.edu/~cs561/cs350/addressing/addsclm.html
References
 50 votes -- Niraj Raghuvanshi (263 points)
1.1.17 Addressing Modes: GATE IT 2006 | Question: 39, ISRO2009-42 top☝ ☛ https://gateoverflow.in/3578

A. is true as instead of absolute address we can use a much smaller relative address in instructions which results in smaller
instruction size.
B. By using the base address of array as displacement and index in a base register (base relative addressing mode) we can
index array elements using relative addressing.
Ref: http://service.scs.carleton.ca/sivarama/asm_book_web/Instructor_copies/ch5_addrmodes.pdf
C. is true as we only need to change the base address in case of relocation- instructions remain the same.
D. is false. Relative addressing cannot be faster than absolute addressing as absolute address must be calculated from relative
address. With specialized hardware unit, this can perform equally as good as absolute addressing but not faster.

Correct Answer: D
References
1.1.18 Addressing Modes: GATE IT 2006 | Question: 40 top☝ ☛ https://gateoverflow.in/3581

Before the execution of the program, the memory contents are-
Now, let's execute the program instructions one by one-
1. Instruction-01: MOVI Rs , 1
This instruction uses immediate addressing mode.
The instruction is interpreted as Rs ← 1.
Thus, value = 1 is moved to the register Rs .
2. Instruction-02: LOAD Rd , 1000 (Rs )
This instruction uses indexed addressing mode.
The instruction is interpreted as Rd ← [1000 + [Rs ]].
Value of the operand = [1000 + [Rs ]] = [1000 + 1] = [1001] = 1.
Thus, value = 1 is moved to the register Rd.
3. Instruction-03: ADDI Rd , 1000
This instruction uses immediate addressing mode.
The instruction is interpreted as Rd ← [Rd ] + 1000.
Value of the operand = [Rd ] + 1000 = 1 + 1000 = 1001.
Thus, value = 1001 is moved to the register Rd.
4. Instruction-04: STOREI 0(Rd ), 20
This instruction uses indexed addressing mode.
The instruction is interpreted as 0 + [Rd ] ← 20.
Value of the destination address = 0 + [Rd ] = 0 + 1001 = 1001.
Thus, value = 20 is moved to the memory location 1001.
Hence, after the program execution is completed, memory location 1001 has a value 20.

Option D is correct.
 39 votes -- Sambhrant Maurya (2.6k points)
D is the answer. Memory location 1001 has value 20.

Rs ← 1 (Immediate Addressing)
Rd ← 1 (Indexed Addressing, value at memory location 1 + 1000 = 1001 is loaded to Rd which is 1)
Rd ← 1001 (Rd becomes 1 + 1000 )
store in address 1001 ← 20
 49 votes -- Abhinav Rana (723 points)
1.2 Cache Memory (63) top☝
1.2.1 Cache Memory: GATE CSE 1987 | Question: 4b top☝ ☛ https://gateoverflow.in/81360
What is cache memory? What is rationale of using cache memory?
gate1987 co-and-architecture cache-memory descriptive
Answer ☟
1.2.2 Cache Memory: GATE CSE 1989 | Question: 6a top☝ ☛ https://gateoverflow.in/88557
A certain computer system was designed with cache memory of size 1 Kbytes and main memory size of 256 Kbytes. The
cache implementation was fully associative cache with 4 bytes per block. The CPU memory data path was 16 bits and the memory
was 2− way interleaved. Each memory read request presents two 16−bit words. A program with the model shown below was run to
evaluate the cache design.
Answer the following questions:
i. What is the hit ratio?

ii. Suggest a change in the program size of model to improve the hit ratio significantly.
gate1989 descriptive co-and-architecture cache-memory
Answer ☟

A block-set associative cache memory consists of 128 blocks divided into four block sets. The main memory consists of
16, 384 blocks and each block contains 256 eight bit words.
1. How many bits are required for addressing the main memory?
2. How many bits are needed to represent the TAG, SET and WORD fields?
gate1990 descriptive co-and-architecture cache-memory
Answer ☟
1.2.4 Cache Memory: GATE CSE 1992 | Question: 5-a top☝ ☛ https://gateoverflow.in/584
The access times of the main memory and the Cache memory, in a computer system, are 500 n sec and 50 nsec, respectively.
It is estimated that 80% of the main memory request are for read the rest for write. The hit ratio for the read access only is 0.9 and a
write-through policy (where both main and cache memories are updated simultaneously) is used. Determine the average time of the
main memory (in ns).
gate1992 co-and-architecture cache-memory normal numerical-answers
Answer ☟
1.2.5 Cache Memory: GATE CSE 1993 | Question: 11 top☝ ☛ https://gateoverflow.in/2308
In the three-level memory hierarchy shown in the following table, pi denotes the probability that an access request will refer to
Mi .
Hierarchy Level Access Time Probability of Access Page Transfer Time

(Mi ) (ti ) (pi ) (Ti )
M1 10−6 0.99000 0.001 sec
−5
M2 10 0.00998 0.1 sec
M3 10−4 0.00002 ---
If a miss occurs at level Mi , a page transfer occurs from Mi+1 to Mi and the average time required for such a page swap is Ti .
Calculate the average time tA required for a processor to read one word from this memory system.
gate1993 co-and-architecture cache-memory normal descriptive
Answer ☟
1.2.6 Cache Memory: GATE CSE 1995 | Question: 1.6 top☝ ☛ https://gateoverflow.in/2593
The principle of locality justifies the use of:

A. Interrupts
B. DMA
C. Polling
D. Cache Memory
gate1995 co-and-architecture cache-memory easy
Answer ☟
A computer system has a 4 K word cache organized in block-set-associative manner with 4 blocks per set, 64 words per
block. The number of bits in the SET and WORD fields of the main memory address format is:
A. 15, 40
B. 6, 4
C. 7, 2
D. 4, 6

gate1995 co-and-architecture cache-memory normal
Answer ☟
A computer system has a three-level memory hierarchy, with access time and hit ratios as shown below:
Level 1 (Cache memory)Access time = 50nsec/byte Level 2 (Main memory)Access time = 200nsec/byte
Size Hit ratio Size Hit ratio Level 3Access time = 5μsec/byte
8M bytes 0.80 4M bytes 0.98 Size Hit ratio
16M bytes 0.90 16M bytes 0.99 260M bytes 1.0
64M bytes 0.95 64M bytes 0.995
A. What should be the minimum sizes of level 1 and 2 memories to achieve an average access time of less than 100nsec ?
B. What is the average access time achieved using the chosen sizes of level 1 and level 2 memories?
Answer ☟
For a set-associative Cache organization, the parameters are as follows:
tc Cache Access Time

tm Main memory access time
l Number of sets
b Block size
k×b Set size
Calculate the hit ratio for a loop executed 100 times where the size of the loop is n × b , and n = k × m is a non-zero integer and
1 ≤ m ≤ l.
Give the value of the hit ratio for l = 1.
gate1998 co-and-architecture cache-memory descriptive
Answer ☟
The main memory of a computer has 2 cm blocks while the cache has 2 c blocks. If the cache uses the set associative
mapping scheme with 2 blocks per set, then block k of the main memory maps to the set:
A. (k mod m) of the cache

B. (k mod c) of the cache
C. (k mod 2c) of the cache
D. (k mod 2 cm) of the cache
Answer ☟
1.2.11 Cache Memory: GATE CSE 2001 | Question: 1.7, ISRO2008-18 top☝ ☛ https://gateoverflow.in/700
More than one word are put in one cache block to:
A. exploit the temporal locality of reference in a program
B. exploit the spatial locality of reference in a program
C. reduce the miss penalty
D. none of the above

gate2001-cse co-and-architecture easy cache-memory isro2008
Answer ☟
A CPU has 32 − bit memory address and a 256 KB cache memory. The cache is organized as a 4 − way set associative
cache with cache block size of 16 bytes.
A. What is the number of sets in the cache?

B. What is the size (in bits) of the tag field per cache block?
C. What is the number and size of comparators required for tag matching?
D. How many address bits are required to find the byte offset within a cache block?
E. What is the total amount of extra memory (in bytes) required for the tag bits?
gate2001-cse co-and-architecture cache-memory normal descriptive
Answer ☟
In a C program, an array is declared as float A[2048] . Each array element is 4 Bytes in size, and the starting address of the
array is 0x00000000 . This program is run on a computer that has a direct mapped data cache of size 8 Kbytes , with block (line)
size of 16 Bytes .
A. Which elements of the array conflict with element A[0] in the data cache? Justify your answer briefly.
B. If the program accesses the elements of this array one by one in reverse order i.e., starting with the last element and ending
with the first element, how many data cache misses would occur? Justify your answer briefly. Assume that the data cache is
initially empty and that no other data or instruction accesses are to be considered.
gate2002-cse co-and-architecture cache-memory normal descriptive
Answer ☟
Consider a small two-way set-associative cache memory, consisting of four blocks. For choosing the block to be replaced, use
the least recently used (LRU) scheme. The number of cache misses for the following sequence of block addresses is:
8, 12, 0, 12, 8 .
A. 2
B. 3
C. 4
D. 5
gate2004-cse co-and-architecture cache-memory normal
Answer ☟
Consider a direct mapped cache of size 32 KB with block size 32 bytes . The CPU generates 32 bit addresses. The number
of bits needed for cache indexing and the number of tag bits are respectively,
A. 10, 17
B. 10, 22
C. 15, 17
D. 5, 17
gate2005-cse co-and-architecture cache-memory easy
Answer ☟

Consider two cache organizations. First one is 32 KB 2-way set associative with 32 byte block size, the second is of same
size but direct mapped. The size of an address is 32 bits in both cases . A 2-to-1 multiplexer has latency of 0.6 ns while a k-bit
k
comparator has latency of 10 ns. The hit latency of the set associative organization is h1 while that of direct mapped is h2 .
The value of h1 is:
A. 2.4 ns
B. 2.3 ns
C. 1.8 ns
D. 1.7 ns
Answer ☟
Consider two cache organizations. First one is 32 kB 2-way set associative with 32 byte block size, the second is of same size
but direct mapped. The size of an address is 32 bits in both cases . A 2-to-1 multiplexer has latency of 0.6ns while a k− bit
k
comparator has latency of 10 ns. The hit latency of the set associative organization is h1 while that of direct mapped is h2 .
The value of h2 is:
A. 2.4 ns
B. 2.3 ns
C. 1.8 ns
D. 1.7 ns
Answer ☟
A CPU has a 32KB direct mapped cache with 128 byte-block size. Suppose A is two dimensional array of size 512 × 512
with elements that occupy 8-bytes each. Consider the following two C code segments, P1 and P2 .
P1:
for (i=0; i<512; i++)
{
for (j=0; j<512; j++)
{
x +=A[i] [j];
}
}
P2:
for (i=0; i<512; i++)
{
for (j=0; j<512; j++)
{
x +=A[j] [i];
}
}
P1 and P2 are executed independently with the same initial state, namely, the array A is not in the cache and
i,
j,
x are in registers. Let the number of cache misses experienced by P1 be M1 and that for P2 be M2 .
The value of M1 is:
A. 0
B. 2048
C. 16384
D. 262144
Answer ☟

A CPU has a 32 KB direct mapped cache with 128 byte-block size. Suppose A is two dimensional array of size 512 × 512
with elements that occupy 8 − bytes each. Consider the following two C code segments, P1 and P2 .
P1 :
for (i=0; i<512; i++)
{
for (j=0; j<512; j++)
{
x +=A[i] [j];
}
}
P2:
for (i=0; i<512; i++)
{
for (j=0; j<512; j++)
{
x +=A[j] [i];
}
}
P1 and P2 are executed independently with the same initial state, namely, the array A is not in the cache and i, j, x are in registers.
Let the number of cache misses experienced by P1 be M1 and that for P2 be M2.
M1
The value of the ratio M2
:
A. 0
1
B. 16
C. 18
D. 16
co-and-architecture cache-memory normal gate2006-cse
Answer ☟
Consider a 4-way set associative cache consisting of 128 lines with a line size of 64 words. The CPU generates a 20 − bit
address of a word in main memory. The number of bits in the TAG, LINE and WORD fields are respectively:
A. 9, 6, 5
B. 7, 7, 6
C. 7, 5, 8
D. 9, 5, 6
Answer ☟
Consider a machine with a byte addressable main memory of 216 bytes. Assume that a direct mapped data cache consisting of
32 lines of 64 bytes each is used in the system. A 50 × 50 two-dimensional array of bytes is stored in the main memory starting
from memory location 1100H . Assume that the data cache is initially empty. The complete array is accessed twice. Assume that the
contents of the data cache do not change in between the two accesses.
How many data misses will occur in total?
A. 48
B. 50
C. 56
D. 59
Answer ☟

Consider a machine with a byte addressable main memory of 216 bytes. Assume that a direct mapped data cache consisting of 32
lines of 64 bytes each is used in the system. A 50 x 50 two-dimensional array of bytes is stored in the main memory starting from
memory location 1100H . Assume that the data cache is initially empty. The complete array is accessed twice. Assume that the
contents of the data cache do not change in between the two accesses.
Which of the following lines of the data cache will be replaced by new blocks in accessing the array for the second time?
A. line 4 to line 11
B. line 4 to line 12
C. line 0 to line 7
D. line 0 to line 8
Answer ☟
For inclusion to hold between two cache levels L1 and L2 in a multi-level cache hierarchy, which of the following are
necessary?
I. L1 must be write-through cache

II. L2 must be a write-through cache
III. The associativity of L2 must be greater than that of L1
IV. The L2 cache must be at least as large as the L1 cache
A. IV only
B. I and IV only
C. I, II and IV only
D. I, II, III and IV
Answer ☟
Consider a machine with a 2-way set associative data cache of size 64 Kbytes and block size 16 bytes . The cache is
managed using 32 bit virtual addresses and the page size is 4 Kbytes . A program to be run on this machine begins as follows:
double ARR[1024][1024];
int i, j;
/*Initialize array ARR to 0.0 */
for(i = 0; i < 1024; i++)
for(j = 0; j < 1024; j++)
ARR[i][j] = 0.0;
The size of double is 8 bytes. Array ARR is located in memory starting at the beginning of virtual page 0xFF000 and stored in
row major order. The cache is initially empty and no pre-fetching is done. The only data memory references made by the program
are those to array ARR .
The total size of the tags in the cache directory is:
A. 32 Kbits
B. 34 Kbits
C. 64 Kbits
D. 68 Kbits
Answer ☟
Consider a machine with a 2-way set associative data cache of size 64 Kbytes and block size 16 bytes. The cache is managed
using 32 bit virtual addresses and the page size is 4 Kbytes. A program to be run on this machine begins as follows:
double ARR[1024][1024];
int i, j;

for(i = 0; i < 1024; i++)
for(j = 0; j < 1024; j++)
ARR[i][j] = 0.0;
Which of the following array elements have the same cache index as ARR[0][0]?
A. ARR[0][4]
B. ARR[4][0]
C. ARR[0][5]
D. ARR[5][0]
Answer ☟
Consider a machine with a 2-way set associative data cache of size 64 Kbytes and block size 16 bytes . The cache is
managed using 32 bit virtual addresses and the page size is 4 Kbytes . A program to be run on this machine begins as follows:
double ARR[1024][1024];
int i, j;
for(i = 0; i < 1024; i++)
for(j = 0; j < 1024; j++)
ARR[i][j] = 0.0;
The cache hit ratio for this initialization loop is:
A. 0%
B. 25%
C. 50%
D. 75%
Answer ☟
Consider a 4-way set associative cache (initially empty) with total 16 cache blocks. The main memory consists of 256 blocks
and the request for memory blocks are in the following order:
0, 255, 1, 4, 3, 8, 133, 159, 216, 129, 63, 8, 48, 32, 73, 92, 155.
Which one of the following memory block will NOT be in cache if LRU replacement policy is used?
A. 3
B. 8
C. 129
D. 216
Answer ☟
A computer system has an L1 cache, an L2 cache, and a main memory unit connected as shown below. The block size in L1
cache is 4 words. The block size in L2 cache is 16 words. The memory access times are 2 nanoseconds, 20 nanoseconds and 200
nanoseconds for L1 cache, L2 cache and the main memory unit respectively.

When there is a miss in L1 cache and a hit in L2 cache, a block is transferred from L2 cache to L1 cache. What is the time taken
for this transfer?
A. 2 nanoseconds
B. 20 nanoseconds
C. 22 nanoseconds
D. 88 nanoseconds
gate2010-cse co-and-architecture cache-memory normal barc2017
Answer ☟
A computer system has an L1 cache, an L2 cache, and a main memory unit connected as shown below. The block size in L1
cache is 4 words. The block size in L2 cache is 16 words. The memory access times are 2 nanoseconds, 20 nanoseconds and 200
nanoseconds for L1 cache, L2 cache and the main memory unit respectively.
When there is a miss in both L1 cache and L2 cache, first a block is transferred from main memory to L2 cache, and then a block is
transferred from L2 cache to L1 cache. What is the total time taken for these transfers?
A. 222 nanoseconds
B. 888 nanoseconds
C. 902 nanoseconds
D. 968 nanoseconds
Answer ☟
An 8KB direct-mapped write-back cache is organized as multiple blocks, each size of 32-bytes . The processor generates
32-bit addresses. The cache controller contains the tag information for each cache block comprising of the following.
1 valid bit
1 modified bit
As many bits as the minimum needed to identify the memory block mapped in the cache.
What is the total size of memory needed at the cache controller to store meta-data (tags) for the cache?
A. 4864 bits
B. 6144 bits
C. 6656 bits
D. 5376 bits
Answer ☟

A computer has a 256-KByte, 4-way set associative, write back data cache with block size of 32-Bytes . The processor sends
32-bit addresses to the cache controller. Each cache tag directory entry contains, in addition to address tag, 2 valid bits, 1 modified
bit and 1 replacement bit.
The number of bits in the tag field of an address is
A. 11
B. 14
C. 16
D. 27
Answer ☟
A computer has a 256-KByte , 4-way set associative, write back data cache with block size of 32 Bytes . The processor sends
32 bit addresses to the cache controller. Each cache tag directory entry contains, in addition to address tag, 2 valid bits, 1 modified
bit and 1 replacement bit.
The size of the cache tag directory is:
A. 160 Kbits
B. 136 Kbits
C. 40 Kbits
D. 32 Kbits
normal gate2012-cse co-and-architecture cache-memory
Answer ☟
In a k -way set associative cache, the cache is divided into v sets, each of which consists of k lines. The lines of a set are
placed in sequence one after another. The lines in set s are sequenced before the lines in set (s + 1) . The main memory blocks are
numbered 0 onwards. The main memory block numbered j must be mapped to any one of the cache lines from
A. (j mod v) ∗ k to (j mod v) ∗ k + (k − 1)
B. (j mod v) to (j mod v) + (k − 1)
C. (j mod k) to (j mod k) + (v − 1)
D. (j mod k) ∗ v to (j mod k) ∗ v + (v − 1)
Answer ☟
1.2.34 Cache Memory: GATE CSE 2014 Set 1 | Question: 44 top☝ ☛ https://gateoverflow.in/1922
An access sequence of cache block addresses is of length N and contains n unique block addresses. The number of unique
block addresses between two consecutive accesses to the same block address is bounded above by k . What is the miss ratio if the
access sequence is passed through a cache of associativity A ≥ k exercising least-recently-used replacement policy?
n
A. ( )
N
1
B. ( )
N
1
C. ( )
A
D. ( )
k
n
gate2014-cse-set1 co-and-architecture cache-memory normal

Answer ☟
In designing a computer's cache system, the cache block (or cache line) size is an important parameter. Which one of the
following statements is correct in this context?
A. A smaller block size implies better spatial locality

B. A smaller block size implies a smaller cache tag and hence lower cache tag overhead
C. A smaller block size implies a larger cache tag and hence lower cache hit time
D. A smaller block size incurs a lower cache miss penalty
Answer ☟
If the associativity of a processor cache is doubled while keeping the capacity and block size unchanged, which one of the
following is guaranteed to be NOT affected?
A. Width of tag comparator
B. Width of set index decoder
C. Width of way selection multiplexer
D. Width of processor to main memory data bus
Answer ☟
A 4-way set-associative cache memory unit with a capacity of 16 KB is built using a block size of 8 words. The word length is
32 bits. The size of the physical address space is 4 GB. The number of bits for the TAG field is ____
gate2014-cse-set2 co-and-architecture cache-memory numerical-answers normal
Answer ☟
The memory access time is 1 nanosecond for a read operation with a hit in cache, 5 nanoseconds for a read operation with a
miss in cache, 2 nanoseconds for a write operation with a hit in cache and 10 nanoseconds for a write operation with a miss in
cache. Execution of a sequence of instructions involves 100 instruction fetch operations, 60 memory operand read operations and
40 memory operand write operations. The cache hit-ratio is 0.9 . The average memory access time (in nanoseconds) in executing the
sequence of instructions is ______.
gate2014-cse-set3 co-and-architecture cache-memory numerical-answers normal
Answer ☟
Assume that for a certain processor, a read request takes 50 nanoseconds on a cache miss and 5 nanoseconds on a cache
hit. Suppose while running a program, it was observed that 80% of the processor's read requests result in a cache hit. The average
read access time in nanoseconds is ______.
gate2015-cse-set2 co-and-architecture cache-memory easy numerical-answers
Answer ☟
Consider a machine with a byte addressable main memory of 220 bytes, block size of 16 bytes and a direct mapped cache
having 212 cache lines. Let the addresses of two consecutive bytes in main memory be (E201F)16 and (E2020)16 . What are the
(E201F)

tag and cache line addresses ( in hex) for main memory address (E201F)16 ?
A. E, 201
B. F, 201
C. E, E20
D. 2, 01F
Answer ☟
The width of the physical address on a machine is 40 bits. The width of the tag field in a 512 KB 8-way set associative cache
is ________ bits.
gate2016-cse-set2 co-and-architecture cache-memory normal numerical-answers
Answer ☟
A file system uses an in-memory cache to cache disk blocks. The miss rate of the cache is shown in the figure. The latency to
read a block from the cache is 1 ms and to read a block from the disk is 10 ms. Assume that the cost of checking whether a block
exists in the cache is negligible. Available cache sizes are in multiples of 10 MB.
The smallest cache size required to ensure an average read latency of less than 6 ms is _________ MB.
Answer ☟
Consider a two-level cache hierarchy with L1 and L2 caches. An application incurs 1.4 memory accesses per instruction on
average. For this application, the miss rate of L1 cache is 0.1 ; the L2 cache experiences, on average, 7 misses per 1000
instructions. The miss rate of L2 expressed correct to two decimal places is ________.
gate2017-cse-set1 co-and-architecture cache-memory numerical-answers
Answer ☟
A cache memory unit with capacity of N words and block size of B words is to be designed. If it is designed as a direct
mapped cache, the length of the TAG field is 10 bits. If the cache unit is now designed as a 16 − way set-associative cache, the
length of the TAG field is ____________ bits.

Answer ☟
In a two-level cache system, the access times of L1 and L2 caches are 1 and 8 clock cycles, respectively. The miss penalty
from the L2 cache to main memory is 18 clock cycles. The miss rate of L1 cache is twice that of L2 . The average memory access
time (AMAT) of this cache system is 2 cycles. The miss rates of L1 and L2 respectively are
A. 0.111 and 0.056
B. 0.056 and 0.111
C. 0.0892 and 0.1784
D. 0.1784 and 0.0892
gate2017-cse-set2 cache-memory co-and-architecture normal
Answer ☟
The read access times and the hit ratios for different caches in a memory hierarchy are as given below:
Cache Read access time (in nanoseconds) Hit ratio

I -cache 2 0.8
D-cache 2 0.9
L2-cache 8 0.9
The read access time of main memory in 90 nanoseconds . Assume that the caches use the referred-word-first read policy and the
write-back policy. Assume that all the caches are direct mapped caches. Assume that the dirty bit is always 0 for all the blocks in the
caches. In execution of a program, 60% of memory reads are for instruction fetch and 40% are for memory operand fetch. The
average read access time in nanoseconds (up to 2 decimal places) is _________
Answer ☟
Consider a machine with a byte addressable main memory of 232 bytes divided into blocks of size 32 bytes. Assume that a
direct mapped cache having 512 cache lines is used with this machine. The size of the tag field in bits is _______
Answer ☟
The size of the physical address space of a processor is 2P bytes. The word length is 2W bytes. The capacity of cache memory
is 2N bytes. The size of each cache block is 2M words. For a K-way set-associative cache memory, the length (in number of bits) of
the tag field is
A. P − N − log2 K
B. P − N + log2 K
C. P − N − M − W − log2 K
D. P − N − M − W + log2 K
Answer ☟
A certain processor uses a fully associative cache of size 16 kB, The cache block size is 16 bytes. Assume that the main

memory is byte addressable and uses a 32-bit address. How many bits are required for the Tag and the Index fields respectively in
the addresses generated by the processor?
A. 24 bits and 0 bits
B. 28 bits and 4 bits
C. 24 bits and 4 bits
D. 28 bits and 0 bits
Answer ☟
A certain processor deploys a single-level cache. The cache block size is 8 words and the word size is 4 bytes. The memory
system uses a 60-MHz clock. To service a cache miss, the memory controller first takes 1 cycle to accept the starting address of the
block, it then takes 3 cycles to fetch all the eight words of the block, and finally transmits the words of the requested block at the rate
of 1 word per cycle. The maximum bandwidth for the memory system when the program running on the processor issues a series of
read operations is ______×106 bytes/sec
gate2019-cse numerical-answers co-and-architecture cache-memory
Answer ☟
A direct mapped cache memory of 1 MB has a block size of 256 bytes. The cache has an access time of 3 ns and a hit rate of
94%. During a cache miss, it takes 20 ns to bring the first word of a block from the main memory, while each subsequent word
takes 5 ns. The word size is 64 bits. The average memory access time in ns (round off to 1 decimal place) is______.
gate2020-cse numerical-answers co-and-architecture cache-memory
Answer ☟
A computer system with a word length of 32 bits has a 16 MB byte- addressable main memory and a 64 KB, 4-way set
associative cache memory with a block size of 256 bytes. Consider the following four physical addresses represented in
hexadecimal notation.
A1 = 0x42C8A4 ,
A2 = 0x546888 ,
A3 = 0x6A289C ,
A4 = 0x5E4880
Which one of the following is TRUE?
A. A1 and A4 are mapped to different cache sets.

B. A2 and A3 are mapped to the same cache set.
C. A3 and A4 are mapped to the same cache set.
D. A1 and A3 are mapped to the same cache set.
gate2020-cse cache-memory
Answer ☟
Consider a computer system with a byte-addressable primary memory of size 232 bytes. Assume the computer system has a
direct-mapped cache of size 32 KB (1 KB = 210 bytes), and each cache block is of size 64 bytes.
The size of the tag field is __________ bits.
Answer ☟

Consider a set-associative cache of size 2KB (1KB = 210 bytes) with cache block size of 64 bytes. Assume that the cache is
byte-addressable and a 32 -bit address is used for accessing the cache. If the width of the tag field is 22 bits, the associativity of the
cache is _________
gate2021-cse-set2 numerical-answers co-and-architecture cache-memory
Answer ☟
Assume a two-level inclusive cache hierarchy, L1 and L2, where L2 is the larger of the two. Consider the following
statements.
S1 : Read misses in a write through L1 cache do not result in writebacks of dirty lines to the L2
S2 : Write allocate policy must be used in conjunction with write through caches and no-write allocate policy is used with
writeback caches.
Which of the following statements is correct?

A. S1 is true and S2 is false
B. S1 is false and S2 is true
C. S1 is true and S2 is true
D. S1 is false and S2 is false
gate2021-cse-set2 co-and-architecture cache-memory
Answer ☟
1.2.56 Cache Memory: GATE IT 2004 | Question: 12, ISRO2016-77 top☝ ☛ https://gateoverflow.in/3653
Consider a system with 2 level cache. Access times of Level 1 cache, Level 2 cache and main memory are 1 ns, 10 ns, and
500 ns respectively. The hit rates of Level 1 and Level 2 caches are 0.8 and 0.9 , respectively. What is the average access time of
the system ignoring the search time within the cache?
A. 13.0
B. 12.8
C. 12.6
D. 12.4
gate2004-it co-and-architecture cache-memory normal isro2016
Answer ☟
1.2.57 Cache Memory: GATE IT 2004 | Question: 48 top☝ ☛ https://gateoverflow.in/3691
Consider a fully associative cache with 8 cache blocks (numbered 0 − 7 ) and the following sequence of memory block
requests:
4, 3, 25, 8, 19, 6, 25, 8, 16, 35, 45, 22, 8, 3, 16, 25, 7
If LRU replacement policy is used, which cache block will have memory block 7?
A. 4
B. 5
C. 6
D. 7
gate2004-it co-and-architecture cache-memory normal
Answer ☟
Consider a 2-way set associative cache memory with 4 sets and total 8 cache blocks (0 − 7) and a main memory with 128
blocks (0 − 127) . What memory blocks will be present in the cache after the following sequence of memory block references if
LRU policy is used for cache block replacement. Assuming that initially the cache did not have any memory block from the current

job?
0 5 3 9 7 0 16 55
A. 0 3 5 7 16 55
B. 0 3 5 7 9 16 55
C. 0 5 7 9 16 55
D. 3 5 7 9 16 55
Answer ☟
A cache line is 64 bytes. The main memory has latency 32 ns and bandwidth 1 GBytes/s . The time required to fetch the
entire cache line from the main memory is:
A. 32 ns
B. 64 ns
C. 96 ns
D. 128 ns
Answer ☟
A computer system has a level- 1 instruction cache ( 1-cache), a level- 1 data cache (D-cache) and a level- 2 cache (L2-cache)
with the following specifications:
Capacity Mapping Method Block Size

I -Cache 4K words Direct mapping 4 words
D-Cache 4K words 2 -way set associative mapping 4 words
L2-Cache 64K words 4-way set associative mapping 16 words
The length of the physical address of a word in the main memory is 30 bits. The capacity of the tag memory in the I -cache, D-
cache and L2-cache is, respectively,
A. 1 K x 18-bit, 1 K x 19-bit, 4 K x 16-bit
B. 1 K x 16-bit, 1 K x 19-bit, 4 K x 18-bit
C. 1 K x 16-bit, 512 x 18-bit, 1 K x 16-bit
D. 1 K x 18-bit, 512 x 18-bit, 1 K x 18-bit
Answer ☟
Consider a Direct Mapped Cache with 8 cache blocks (numbered 0 − 7 ). If the memory block requests are in the following
order
3, 5, 2, 8, 0, 63, 9, 16, 20, 17, 25, 18, 30, 24, 2, 63, 5, 82, 17, 24.
Which of the following memory blocks will not be in the cache at the end of the sequence ?
A. 3
B. 18
C. 20
D. 30
Answer ☟

Consider a computer with a 4-ways set-associative mapped cache of the following characteristics: a total of 1 MB of main
memory, a word size of 1 byte, a block size of 128 words and a cache size of 8 KB .
The number of bits in the TAG, SET and WORD fields, respectively are:
A. 7, 6, 7
B. 8, 5, 7
C. 8, 6, 6
D. 9, 4, 7
Answer ☟
Consider a computer with a 4-ways set-associative mapped cache of the following characteristics: a total of 1 MB of main
memory, a word size of 1 byte , a block size of 128 words and a cache size of 8 KB .
While accessing the memory location 0C795H by the CPU, the contents of the TAG field of the corresponding cache line is:
A. 000011000
B. 110001111
C. 00011000
D. 110010101
Answer ☟
Answers: Cache Memory
1.2.1 Cache Memory: GATE CSE 1987 | Question: 4b top☝ ☛ https://gateoverflow.in/81360
A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time
or energy) to access data from the main memory.
A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main
memory locations. Most CPUs have different independent caches, including instruction and data caches, where the data cache is
usually organized as a hierarchy of more cache levels.
Source :- https://en.wikipedia.org/wiki/CPU_cache
References
 8 votes -- Satbir Singh (21k points)
Total number of block access = (512/4 + 100 × 1028/4 + 512/4) = 25956
All the blocks above and below the loop region can be assumed to be cache misses.
It is given that the cache is fully associative but the replacement policy is not mentioned. Let's assume it is FIFO. Also, for
simplicity let's assume that there's no reuse within a cache block.
In the first loop access every cache access will be a miss. Since the loop body size is 1028 and we assumed FIFO, the last cache
block will replace the first one. In the second iteration, first cache block access will be a miss and this will replace the second
cache block. Going like this every cache block access will be a miss.

(i) So, hit ratio = 0.
(ii) By just reducing the loop body size by 4 bytes, all the loop accesses from second iteration of the loop will be a hit. So, we can
99×1024
get a hit ratio of 101×1024 = 98.01%

For main memory, there are 214 blocks and each block size is 28 bytes (A byte is an eight-bit word)
1. Size of main memory = 214 × 28 = 4MB ( 22 − bits required for addressing the main memory).
2. For WORD field, we require 8 − bits , as each block contains 28 words.
As there are 4 blocks in 1 set, 32 sets will be needed for 128 blocks. Thus SET field requires 5 − bits .
Then, TAG field requires 22 − (5 + 8) = 9 − bits
9-bits (for tag) 5- bits (for set) 8-bits (for word)
 46 votes -- kirti singh (2.6k points)
1.2.4 Cache Memory: GATE CSE 1992 | Question: 5-a top☝ ☛ https://gateoverflow.in/584

Average memory access time = Time spend for read + Time spend for write
= Read time when cache hit + Read time when cache miss + Write time when cache hit + Write time when cache miss
= 0.8 × 0.9 × 50 + 0.8 × 0.1 × (500 + 50)

(assuming hierarchical read from memory and cache as only simultaneous write is mentioned in question)
+0.2 × 0.9 × 500 + 0.2 × 0.1 × 500 (simultaneous write mentioned in question)
= 36 + 44 + 90 + 10 = 180 ns.
Reference: http://www.howardhuang.us/teaching/cs232/24-Cache-writes-and-examples.pdf
References

We are given the probability of access being a hit in each level (clear since their sum adds to 1).
So, we can get the average access time as:
tA = 0.99 × 10−6 + 0.00998 × (10−6 + 10−5 + 0.001)
+0.00002 × (10−6 + 10−5 + 10−4 + 0.1 + 0.001)]
≈ (0.99 + 10 + 2) × [10−6 ] = 13μs .
We can also use the following formula- for 100% of accesses M1 is accessed,
whenever M1 is a miss, M2 is accessed and when both misses only M3 is accessed.
So, average memory access time,
tA = 10−6 + (1 − 0.99) × (10−5 + 0.001) + 0.00002 × (10−4 + 0.1) = 1 + 10.01 + 2μs = 13.01μs.
Quick Cache Maths:-

Suppose that in 250 memory references there are 30 misses in L1 and 10 misses in L2.
30
Miss rate of L1 = 250
10
Miss rate of L2 = 30 (In L1 we miss 30 requests, so at L2 we have 30 requests, but it misses 10 out of 30)
See this question.
Here, Probabilities are given, We need to convert it into Hit ratios.

p1 = 0.99000 , it says we hit 0.99 times in M1 but we miss 0.01 times. Here hit rate is same as probability H1 = 0.99
p2 = 0.00998 , it says we hit 0.00998 times out of 0.01 requests (0.01 misses from M1 ), H2 = 0.00998
0.01 = 0.998
(H3 is of course 1. we hit 0.00002 times in M3 out of 0.00002 misses from M2 )
H1 = 0.99 , H2 = 0.998 , H3 = 1. ti is Access time, and Ti is page transfer time.

tA = t1 + (1 − H1 ) × Miss penalty 1
Miss penalty 1 = (t2 + T1 ) + (1 − H2 ) × Miss penalty 2
Miss penalty 2 = (t3 + T2 )
Ref: question 3 here
https://www.cs.utexas.edu/~fussell/courses/cs352.fall98/Homework/old/Solution3.ps
References
 69 votes -- Sachin Mittal (15.8k points)

Answer is (D).
Locality of reference is actually the frequent accessing of any storage location or some value. We can say in simple language that
whatever things are used more frequently, they are stored in the locality of reference. So we have cache memory for the purpose.
 30 votes -- Gate Keeda (15.9k points)
 4K
Number of sets = = 16
(64 × 4)
So, we need 4-bits to identify a set ⇒ SET = 4 bits.
64 words per block mean WORD is 6-bits.
So, the answer is an option (D).

The equation for access time can be written as follows (assuming a, b are the hit ratios of level 1 and level 2
respectively).
T = T1 + (1 − a)T2 + (1 − a) × (1 − b)T3
Here T ≤ 100, T1 = 50ns, T2 = 200ns and T3 = 5000ns. On substituting the a, b for the first case we get
T = 95ns for a = 0.8 and b = 0.995. i.e., L1 = 8M and L2 = 64M.
T = 75ns for a = 0.9 and b = 0.99. i.e., L1 = 16M and L2 = 4M
B.

1. L1 = 8M, a = 0.8, L2 = 4M, b = 0.98 . So,
T = 50 + 0.2 × 200 + 0.2 × 0.02 × 5000 = 50 + 40 + 20 = 110ns
2. L1 = 16M, a = 0.9, L2 = 16M, b = 0.99 . So,
T = 50 + 0.1 × 200 + 0.1 × 0.01 × 5000 = 50 + 20 + 5 = 75ns
3. L1 = 64M, a = 0.95, L2 = 64M, b = 0.995 . So,
T = 50 + 0.05 × 200 + 0.05 × 0.005 × 5000 = 50 + 10 + 1.25 = 61.25ns
 23 votes -- kireeti (1k points)
Size of the loop = n × b = k × m × b
Size of a set = k × b (k − way associative)
Here, size of the loop is smaller than size of cache as m ≤ l . So we are guaranteed that the entire loop is in cache without any
replacement. (Here we assumed that the memory size of n × b used by the loop is contiguous, or else we could have a scenario
where the entire loop size gets mapped to the same set causing cache replacement).
For the first iteration:
No. of accesses = n × b
No. of misses = n as each new block access is a miss and loop body has n blocks each of size b for a total size of n × b .
For, the remaining 99 iterations:
No. of accesses = n × b
No. of misses = 0
So, total no. of accesses = 100nb
Total no. of hits = Total no. of accesses − Total no. of misses
= 100nb − n
100nb−n 1
So, hit ratio = 100nb
=1− 100b
1
The hit ratio is independent of l, so for l = 1 also we have hit ratio = 1 − 100b

Number of cache blocks = 2c
2c
Number of sets in cache = = c since each set has 2 blocks. Now, a block of main memory gets mapped to a set
2
(associativity of 2 just means there are space for 2 memory blocks in a cache set), and we have 2 cm blocks being mapped to c
sets. So, in each set 2m different main memory blocks can come and block k of main memory will be mapped to k mod c.
Correct Answer: B.
1.2.11 Cache Memory: GATE CSE 2001 | Question: 1.7, ISRO2008-18 top☝ ☛ https://gateoverflow.in/700

Exploit the spatial locality of reference in a program as, if the next locality is addressed immediately, it will already be in
the cache.
Consider the scenario similar to cooking, where when an ingredient is taken from cupboard, you also take the near by ingredients
along with it- hoping that they will be needed in near future.
Correct Answer: B


What is the number of sets in the cache
Cache memory
Number of sets =
(set associativity × cache block size)
256KB
=
(4 × 16B)
= 4096
What is the size (in bits) of the tag field per cache block?
Memory address size = 32-bit
Number of bits required to identify a particular set = 12 (Number of sets = 4096)
Number of bits required to identify a paticular location in cache line = 4 (cache block size = 16)
Size of tag field = 32 − 12 − 4 = 16-bit
What is the number and size of comparators required for tag matching?
We use 4-way set associate cache. So, we need 4 comparators each of size 16-bits
http://ecee.colorado.edu/~ecen2120/Manual/caches/cache.html
How many address bits are required to find the byte offset within a cache block?
Cache block size is 16-byte. so 4-bits are required to find the byte offset within a cache block.
What is the total amount of extra memory (in bytes) required for the tag bits?
size of tag = 16-bits
Number of sets = 4096
Set associativity = 4
Extra memory required to store the tag bits = 16 × 4096 × 4-bits = 218 bits = 215 bytes.
References
 42 votes -- suraj (4.8k points)

A. Data cache size = 8 KB.
Block line size = 16 B.
Since each array element occupies 4 B, four consecutive array elements occupy a block line (elements are aligned as
starting address is 0 )
8 KB
Number of cache blocks = = 512.
16 B
2048
Number of cache blocks needed for the array = = 512.
4
So, all the array elements has its own cache block and there is no collision.
We can also explain this with respect to array address.

Starting address is 0x00000000 = 0b 0000 … 0(32 0's).
Ending address is 0x00001FFF = 0b 0000 … 01111111111111(4 × 2048 = 8192 locations), 0 − 8191 .
Here, the last 4 bits are used as OFFSET bits and the next 9 bits are used as SET bits. So, since the ending address is not
extending beyond these 9 bits, all cache accesses are to diff sets.

B. If the last element is accessed first, its cache block is fetched. (which should contain the previous 3 elements of the array
also since each cache block hold 4 elements of array and 2048 is and exact multiple of 4 ). Thus, for every 4 accesses, we
will have a cache miss ⇒ for 2048 accesses we will have 512 cache misses. (This would be same even if we access array
in forward order).

We have 4 blocks and 2 blocks in a set
⟹ there are 2 sets. So blocks will go to sets as follows:
Set Number Block Number

0 0,8,12
1
Since the lowest bit of block address is used for indexing into the set, so 8, 12 and 0 first miss in cache with 0 replacing 8 (there
are two slots in each set due to 2 − way set) and then 12 hits in cache and 8 again misses. So, totally 4 misses.
Correct Answer: C
 cache size 32-KB

Number of blocks = = = 1024-Bytes.
block size 32
So, indexing requires 10-bits. Number of OFFSET bits required to access 32-bit block = 5.
So, number of TAG bits = 32 − 10 − 5 = 17.
So, answer is (A).

Cache size is 32 KB and cache block size is 32 B. So,
cache size
Number of sets =
no. of blocks in a set × block size
32 KB
= = 512
2 × 32 B
So, number of index bits needed = 9 ( since 29 = 512 ). Number of offset bits = 5 (since 25 = 32 B is the block size and
assuming byte addressing). So, number of tag bits = 32 − 9 − 5 = 18 (as memory address is of 32 bits).
So, time for comparing the data
= Time to compare the data + Time to select the block in set = 0.6 + 18/10 ns = 2.4 ns.
(Two comparisons of tag bits need to be done for each block in a set, but they can be carried out in parallel and the succeeding
one multiplexed as the output).
Reference: https://courses.cs.washington.edu/courses/cse378/09au/lectures/cse378au09-19.pdf
Correct Answer: A
References

 cache size
number of sets =
no. of blocks in a set × block size
32KB
= = 1024
1 × 32B
So, number of index bits = 10, and
number of tag bits = 32 − 10 − 5 = 17.
17
So, h2 = = 1.7 ns
10
Correct Answer: D

Code being C implies array layout is row-major.
http://en.wikipedia.org/wiki/Row-major_order
128
When A[0][0] is fetched, 128 consecutive bytes are moved to cache. So, for the next − 1 = 15 memory references there
8
won't be a cache miss.
For the next iteration of i loop also the same thing happens as there is no temporal locality in the code. So, number of cache
misses for P1 is
512
= × 512
16
= 32 × 512
= 214 = 16384
Correct Answer: C
References

215 B
Number of Cache Lines = = 256
128B
128B
In 1 Cache Line = = 16 elements
8B
total elements in array
P1 =
elements in a cache line
512 × 512
= = 214 = 16384.
16
P2 = 512 × 512 = 218
P1 16384
=
P2 512 × 512
1
= = =
1
= 214−18 = 2−4 =
16
It is so, because for P1 for every line there is a miss, and once a miss is processed we get 16 elements in memory.
So, another miss happens after 16 elements.
For P2 for every element there is a miss because storage is row major order(by default) and we are accessing column wise.
Hence, answer is option B.
 41 votes -- Amar Vashishth (25.2k points)
Code being C implies array layout is row-major.

http://en.wikipedia.org/wiki/Row-major_order
When A[0][0] is fetched, 128 consecutive bytes are moved to cache. So, for the next 128/8 -1= 15 memory references there won't
be a cache miss. For the next iteration of i loop also the same thing happens as there is no temporal locality in the code. So,
number of cache misses for P1
512
= 16 × 512
= 32 × 512
= 214 = 16384
In the case of P2, the memory references are not consecutive. After A[0][0], the next access is A[1][0] which is after 512 * 8
memory locations. Since our cache block can hold only 128 contiguous memory locations, A[1][0] won't be in cache after a[0]
[0] is accessed. Now, the next location after A[0][0] is A[0][1] which will be accessed only after 512 iterations of the inner loop-
after 512 distinct memory block accesses. In our cache we have only space for 32 KB/128 B = 256 memory blocks. So, by the
time A[0][1] is accessed, its cache block would be replaced. So, each of the memory access in P2 results in a cache miss. Total
number of cache miss
= 512 × 512
M1 32×512 1
So, M2
= 512×512 = 16
References
 cache size
Number of sets =
(size of a block * No. of blocks in a set)
128 ∗ 64
= (4 way set associative means 4 blocks in a set)
(64 ∗ 4)
= 32.
So, number of index (LINE) bits = 5 and number of WORD bits = 6 since cache block (line) size is 64.
So, number of TAG bits = 20 − 6 − 5 = 9.
Answer is (D) choice


Bits used to represent the address = log2 216 = 16
Each cache line size = 64 bytes; means offset requires 6-bits
Total number of lines in cache = 32; means line # requires 5-bits
So, tag bits = 16 − 6 − 5 = 5
We have a 2D-array each of its element is of size = 1 Byte;
Total size of this array = 50 × 50 × 1 Byte = 2500 Bytes
So, total number of lines it will require to get contain in cache
2500B
= = 39.0625 ≈ 40
64B
Starting address of array = 1100H = 00010 00100 000000
The group of bits in middle represents Cache Line number ⟹ array starts from cache line number 4,
We require 40 cache lines to hold all array elements, but we have only 32 cache lines
Lets group/partition our 2500 array elements in those 40 array lines, we call this first array line as A0 which will have 64 of its
elements.
This line(group of 64 elements) of array will be mapped to cache line number 4 as found by analysis of starting address of array
above.
This all means that among those 40 array lines some array lines will be mapped to same cache line, coz there are just 32 cache
lines but 40 of array lines.
This is how mapping is:
0 A28
1 A29
2 A30
3 A31
4 A0 A32
5 A1 A33
6 A2 A34
7 A3 A35
8 A4 A36
9 A5 A37
10 A6 A38
11 A7 A39
12 A8
⋮
30 A26
31 A27
So, if we access complete array twice we get = 32 + 8 + 8 + 8 = 56 miss because only 8 lines from cache line number 4 to 11
are miss operation, rest are Hits(not counted) or Compulsory misses(first 32).
Hence, answer is option (C).
216 = 64 KB main memory is mapped to 32 lines of 64 bytes. So, number of offset bits = 6 (to identify a byte in a line) and
number of indexing bits = 5 (to identify the line).
Size of array = 50* 50 = 2500 B. If array is stored in row-major order (first row, second-row..), and if elements are also accessed
in row-major order (or stored and accessed in column-major order), for the first 64 accesses, there will be only 1 cache miss, and
for 2500 accesses, there will be 2500/64 = 40 cache misses during the first iteration.
We have 5 index bits and 6 offset bits. So, for 2 11 (5 + 6 = 11) continuous memory addresses there wont be any cache conflicts
as the least significant bits are offset bits followed by index bits.
So, number of array elements that can be accessed without cache conflict = 2048 (as element size is a byte). The next 452
elements conflict with the first 452 elements. This means 452/64 = 8 cache blocks are replaced. (We used ceil, as even if one
element from a new block is accessed, the whole block must be fetched).

So, during second iteration we incur misses for the first 8 cache blocks as well as for the last 8 cache blocks. So, total data cache
misses across 2 iterations = 40 + 8 + 8 = 56.

Cache Organization:
Staring Address = 1100H = 163 + 162 + 0 + 0 = 4352B is the starting address.
4352 B
We need to find Starting block = = 68th block in main memory from where array start storing elements.
64 B
50 B
50 × 50 B = array size = 50 × = 39.0625 blocks needed ≊ 40 blocks
64 B
68,69,70....107 block we need = 40 blocks
Starting block is 68 (mod 32) = 4th cache block and after that in sequence they will be accessed.
As shown in below table, line number 4 to 11 has been replaced by array in second time
Cache Block Number First Cycle Second cycle

0 96
1 97
2 98
3 99
4 68 // 100 68
5 69 // 101 69
6 70//102 70
7 71//103 71
8 72//104 72
9 73//105 73
10 74/106 74
11 75//107 75
12 76
13 77
14 78
15 79
16 80
17 81
18 82
19 83
20 84
21 85
22 86
23 87
24 88
25 89
26 90
27 91
28 92
29 93
30 94
31 95

Correct Answer: A
 52 votes -- papesh (18k points)

1st is not correct as data need not to be exactly same at the same point of time and so write back policy can be used in
this.
2nd is not needed when talking only about L1 and L2.
For 3rd , associativity can be equal.
So, only 4th statement is Necessarily true - (A) choice.
 49 votes -- Shaun Patel (6.1k points)
 cache size
Number of sets =
size of a set
64 KB
= (two blocks per set)
(16 B × 2)
= 2 K = 211
So, we need 11-bits for set indexing.
Number of WORD bits required = 4 as a cache block consists of 16 bytes and we need 4-bits to address each of them.
So, number of tag bits = 32 − 11 − 4 = 17
Total size of the tag= 17 × Number of cache blocks

= 17 × 211 × 2 (since each set has 2 blocks)
= 68 Kbits
Answer is option D) 68 Kbits
We use the top 17-bits for tag and the next 11-bits for indexing and next 4 for offset. So, for two addresses to have the same
cache index, their 11 address bits after the 4 offset bits from right must be same.
ARR[0][0] is located at virtual address 0x FF000 000. (FF000 is page address and 000 is page offset).
So, index bits are 00000000000
Address of ARR[0][4] = 0xFF000 + 4 × sizeof (double)

= 0xFF000 000 + 4 × 8 = 0xFF000 020 (32 = 20 in hex) (index bits differ)
Address of ARR[4][0] = 0xFF000 + 4 × 1024 × sizeof (double)

[since we use row major storage]
= 0xFF000 000 + 4096 × 8 = 0xFF000 000 + 0x8000 = 0xFF008 000
( index bits matches that of ARR[0][0] as both read 000 0000 0000 )
Address of ARR[0][5] = 0xFF000 + 5 × sizeof (double) = 0xFF000 000 +5 × 8 = 0xFF000 028 (40 = 28 in hex)
(index bits differ)
Address of ARR[5][0] = 0xFF000 + 5 × 1024 × sizeof (double) [since we use row major storage]
= 0xFF000 000 + 5120 × 8 = 0xFF000 000 + 0xA000 = 0xFF00A 000 (index bits differ)
So, only ARR[4][0] and ARR[0][0] have the same cache index.
The inner loop is iterating from 0 to 1023, so consecutive memory locations are accessed in sequence. Since cache block size is
only 16 bytes and our element being double is of size 8 bytes, during a memory access only the next element gets filled in the
cache. i.e.; every alternative memory access is a cache miss giving a hit ratio of 50 (If loops i and j are reversed, all accesses will
be misses and hit ratio will become 0 ).


Number of sets = cache size/ size of a set
= 64 KB/(16 B × 2) (two blocks per set)
= 2 K = 211
So, we need 11 bits for set indexing.
Number of WORD bits required = 4 as a cache block consists of 16 bytes and we need 4 bits to address each of them.
So, number of tag bits = 32 − 11 − 4 = 17
Total size of the tag = 17× Number of cache blocks

= 17 × 211 × 2 (since each set has 2 blocks)
= 68 KB
We use the top 17 bits for tag and the next 11 bits for indexing and next 4 for offset. So, for two addresses to have the same
cache index, their 11 address bits after the 4 offset bits from right must be same.
ARR[0][0] is located at virtual address 0x FF000 000 . (FF000 is page address and 000 is page offset). So, index bits are
00000000000
Address of ARR[0][4] = 0xFF000 + 4× sizeof (double) = 0xFF000000 + 4 × 8 = 0xFF000 020 (32 = 20 in hex)
(index bits differ)
Address of ARR[4][0] = 0xFF000 + 4 × 1024× sizeof (double) [since we use row major storage]
= 0xFF000 000 + 4096 × 8 = 0xFF000 000 + 0x8000 = 0xFF008 000 ( index bits matches that of ARR [0][0] as both
read 000 0000 0000 )
Address of ARR[0][5] = 0xFF000 + 5× sizeof (double) = 0xFF000 000 + 5 × 8 = 0xFF000 028(40 = 28 in hex)
(index bits differ)
Address of ARR[5][0] = 0xFF000 + 5 × 1024× sizeof (double) [since we use row major storage]
= 0xFF000 000 + 5120 × 8 = 0xFF000 000 + 0xA000 = 0xFF00A 000 (index bits differ)
So, only ARR[4][0] and ARR[0][0] have the same cache index.
The inner loop is iterating from 0 to 1023 , so consecutive memory locations are accessed in sequence. Since cache block size is
only 16 bytes and our element being double is of size 8 bytes, during a memory access only the next element gets filled in the
cache. i.e.; every alternative memory access is a cache miss giving a hit ratio of 50%. (If loops i and j are reversed, all accesses
will be misses and hit ratio will become 0).
Correct Answer: B

Block size = 16B and one element = 8B.
So, in one block 2 element will be stored.
1024 × 1024
For 1024 × 1024 element num of block required = = 219 blocks required.
2
In one block the first element will be a miss and second one is hit(since we are transferring two unit at a time)
Total hit
⇒ hit ratio =
Total reference
219
=
220
1
= = 0.5
2

= 0.5 × 100 = 50%
Correct Answer: C
 26 votes -- asutosh kumar Biswal (8k points)

16 blocks and sets with 4 blocks each means there are 4 sets.So, the lower 2 bits are used
for getting a set and 4-way associative means in a set only the last 4 cache accesses can be stored.
0, 255, 1, 4, 3, 8, 133, 159, 216, 129, 63, 8, 48, 32, 73, 92, 155
mod 4 gives,
0, 3, 1, 0, 3, 0, 1, 3, 0, 1, 3, 0, 0, 0, 1, 0, 3
Now for each of 0..3, the last 4 accesses will be in cache.

So, {92, 32, 48, 8}, {155, 63, 159, 3}, {73, 129, 133, 1} and {} will be in cache.
So, the missing element from choice is 216.
Correct Answer: D

Ideally the answer should be 20 ns as it is the time to transfer a block from L2 to L1 and this time only is asked in
question. But there is confusion regarding access time of L2 as this means the time to read data from L2 till CPU but here we
need the time till L1 only. So, I assume the following is what is meant by the question.
A block is transferred from L2 to L1. And L1 block size being 4 words (since L1 is requesting we need to consider L1 block
size and not L2 block size) and data width being 4 bytes, it requires one L2 access (for read) and one L1 access (for store). So,
time = 20 + 2 = 22 ns.
Correct Answer: C

The transfer time should be 4 ∗ 200 + 20 = 820 ns. But this is not in option. So, I assume the following is what is meant
by the question.
L2 block size being 16 words and data width between memory and L2 being 4 words, we require 4 memory accesses(for read)
and 4 L2 accesses (for store). Now, we need to send the requested block to L1 which would require one more L2 access (for
read) and one L1 access (for store). So, total time
= 4 ∗ (200 + 20) + (20 + 2)
= 880 + 22
= 902 ns
Correct Answer: C
 cache size
Number of cache blocks =
size of a block
8 KB
=
32 B

= 256
So, we need 8-bits for indexing the 256 blocks of the cache. And since a block is 32 bytes we need 5 WORD bits to address
each byte. So, out of the remaining 19-bits (32 - 8 - 5) should be tag bits.
So, a tag entry size = 19 + 1(valid bit) + 1(modified bit) = 21 bits.
Total size of metadata = 21 × Number of cache blocks

= 21 × 256
= 5376 bits
Correct Answer: D

Total cache size = 256 KB
Cache block size = 32 Bytes
256 K
So, number of cache entries = =8K
32
8K
Number of sets in cache = = 2 K as cache is 4-way associative.
4
So, log(2048) = 11 bits are needed for accessing a set. Inside a set we need to identify the cache entry.
Memory size
No. of memory block possible =
Cache block size
232
= = 227 .
32
So, no. of memory block that can go to a single cache set
227
=
211
= 216 .
So, we need 16 tag bits along with each cache entry to identify which of the possible 216 blocks is being mapped there.
Correct Answer: C

Total cache size = 256 KB
Cache block size = 32 Bytes
256 K
So, number of cache entries = =8K
32
8K
Number of sets in cache = = 2 K as cache is 4-way associative.
4
So, log(2048) = 11 bits are needed for accessing a set. Inside a set we need to identify the cache entry.
232 232
Total number of distinct cache entries = = = 227
cache entry size 32
227
Out of this 227 , each set will be getting only 11 = 216 possible distinct cache entries as we use the first 11 bits to identify a
2
set. So, we need 16 bits to identify a cache entry in a set, which is the number of bits in the tag field.
Size of cache tag directory= Size of tag entry × Number of tag entries
= 16 + (2 + 1 + 1) bits (2 valid, 1 modified, 1 replacement as given in question) × 8 K

= 208 = 160 Kbits
Not needed for this question, still:
Valid bit: Tells if the memory referenced by the cache entry is valid. Initially, when a process is loaded all entries are invalid.
Only when a page is loaded, its entry becomes valid.
Modified bit: When processor writes to a cache location its modified bit is made 1. This information is used when a cache entry
is replaced- entry 0 means no update to main memory needed. Entry 1 means an update is needed.
Replacement bit: This is needed for the cache replacement policy. Explained in the below link:
https://www.seas.upenn.edu/~cit595/cit595s10/handouts/LRUreplacementpolicy.pdf
Correct Answer: A
References

Number of sets in cache = v .
The question gives a sequencing for the cache lines. For set 0, the cache lines are numbered 0, 1, . . , k − 1 . Now for set 1, the
cache lines are numbered k, k + 1, . . . k + k − 1 and so on.
So, main memory block j will be mapped to set (j mod v) , which will be any one of the cache lines from
(j mod v) ∗ k to (j mod v) ∗ k + (k − 1) .
(Associativity plays no role in mapping- k -way associativity means there are k spaces for a block and hence reduces the chances
of replacement.)
Correct Answer: A

There are N accesses to cache.
Out of these n are unique block addresses.
Now, we need to find the number of misses. (min. n misses are guaranteed whatever be the access sequence due to n unique
block addresses).
We are given that between two consecutive accesses to the same block, there can be only k unique block addresses. So, for a
block to get replaced we can assume that all the next k block addresses goes to the same set (given cache is set-associative)
which will be the worst case scenario (they may also go to a different set but then there is lesser chance of a replacement). Now,
if associativity size is ≥ k , and if we use LRU (Least Recently Used) replacement policy, we can guarantee that these k accesses
won't throw out our previously accessed cache entry (for that we need at least k accesses). So, this means we are at the best-cache
scenario for cache replacement -- out of N accesses we miss only n (which are unique and can not be helped from getting
missed and there is no block replacement in cache). So, miss ratio is n/N .
PS: In question it is given "bounded above by k ", which should mean k unique block accesses as k is an integer, but to ensure no
replacement this must be 'k − 1 '. Guess, a mistake in question.
Correct Answer: A

A. A smaller block size means during a memory access only a smaller part of near by addresses are brought to cache-

meaning spatial locality is reduced.
B. A smaller block size means more number of blocks (assuming cache size constant) and hence index bits go up and offset
bits go down. But the tag bits remain the same.
C. A smaller block size implying larger cache tag is true, but this can't lower cache hit time in any way.
D. A smaller block size incurs a lower cache miss penalty. This is because during a cache miss, an entire cache block is
fetched from next lower level of memory. So, a smaller block size means only a smaller amount of data needs to be
fetched and hence reduces the miss penalty (Cache block size can go til the size of data bus to the next level of memory,
and beyond this only increasing the cache block size increases the cache miss penalty).
Correct Answer: D

If associativity is doubled, keeping the capacity and block size constant, then the number of sets gets halved. So, width of
set index decoder can surely decrease - (B) is false.
Width of way-selection multiplexer must be increased as we have to double the ways to choose from- (C) is false
As the number of sets gets decreased, the number of possible cache block entries that a set maps to gets increased. So, we need
more tag bits to identify the correct entry. So, (A) is also false.
(D) is the correct answer- main memory data bus has nothing to do with cache associativity- this can be answered without even
looking at other options.
 cache size
Number of sets=
sizeof a set
Size of a set = blocksize × no. of blocks in a set

= 8 words × 4 (4-way set-associative)
= 8 × 4 × 4 (since a word is 32 bits = 4 bytes)
= 128 bytes.
16 KB
So, number of sets = = 128
(128 B)
Now, we can divide the physical address space equally between these 128 sets.
So, the number of bytes each set can access
4 GB
=
128
= 32 MB
32
= = 8 M words = 1 M blocks. (220 blocks)
4
So, we need 20 tag bits to identify these 220 blocks.


The question is to find the time taken for,
100 fetch operations and 60 operand read operations and 40 memory operand write operations
.
total number of instructions
Total number of instructions = 100 + 60 + 40 = 200

Time taken for 100 fetch operations(fetch = read) = 100 ∗ ((0.9 ∗ 1) + (0.1 ∗ 5))
1 corresponds to time taken for read when there is cache hit = 140 ns
0.9 is cache hit rate
Time taken for 60 read operations,
= 60 ∗ ((0.9 ∗ 1) + (0.1 ∗ 5))

= 84 ns
Time taken for 40 write operations
= 40 ∗ ((0.9 ∗ 2) + (0.1 ∗ 10))

= 112 ns
Here, 2 and 10 are the times taken for write when there is cache hit and no cache hit respectively.
So,the total time taken for 200 operations is,
= 140 + 84 + 112
= 336 ns
Average time taken = time taken per operation
336
=
200
= 1.68 ns
 52 votes -- Divya Bharti (8.8k points)
Fetch is also a memory read operation.
160(0.9 × 1 + 0.1 × 5) + 40(0.9 × 2 + 0.1 × 10) 160 × 1.4 + 40 × 2.8 336

Avg access time = = = = 1.68
200 200 200
 32 votes -- aravind90 (389 points)

Answer is: 14 ns = 0.8(5) + 0.2(50)
PS: Here instead of cache and main memory access times, time taken on a cache hit and miss are directly given in question. So,
Average Access Time = Hit Rate × Hit Time + Miss Rate × Miss Time
 48 votes -- Vikrant Singh (11.2k points)

Block size of 16 bytes means we need 4 offset bits. (The lowest 4 digits of memory address are offset bits)
Number of sets in cache (cache lines) = 212 so the next lower 12 bits are used for set indexing.
The top 4 bits (out of 20) are tag bits.
So, answer is A.

Physical Address = 40

Tag + Set + Block Offset = 40
T + S + B = 40 → (1)
We have: Cache Size = number of sets × blocks per set × Block size
512 KB = number of sets × 8 × Block size

512
Number of sets × Block size = KB = 64 KB
8
S + B = 16 → (2)
From (1), (2)

T = 24 bits (Ans)
Second way :
Cache Size = 219
MM size = 240
240
This means, We need to map = 221 Blocks to one line. And a set contain 23 lines.
219
Therefore, 224 blocks are mapped to one set.
Using Tag field, I need to identify which one block out of 224 blocks are present in this set.
Hence, 24 bits are needed in Tag field.
 94 votes -- Himanshu Agarwal (12.4k points)
In question block size has not been given,so we can assume block size 2x Byte.
512 × 210 19−x

Number of Blocks:- =2
2x
219−x 16−x
Number of sets:- =2
8
So number of bits for sets = 16 − x
Let number of bits for Tag = T
And we have already assumed block size 2x Byte,therefore number of bits for block size is x
And finally,
T + (16 − x) + x = 40
T + 16 = 40
T = 24.
 54 votes -- ajit (2.5k points)

Look aside Cache Latency = 1ms
Main Memory Latency = 10 ms
Lets try with 20 MB
Miss rate = 60% , Hit rate = 40%

Avg = 0.4(1) + 0.6(10)
= 0.4 + 6 = 6.4 ms > 6 ms
Next Take 30 MB
Miss rate = 40% , Hit rate = 60%

Avg = 0.6(1) + 0.4(10)

= 0.6 + 4 = 4.6 ms < 6 ms
So answer is 30 MB
 64 votes -- Akash Kanase (36k points)

Answer = 0.05.
 169 votes -- Debashish Deka (40.8k points)
In set-associative 1 set = 16 lines. So the number of index bits will be 4 less than the direct mapped case.
So, Tag bits increased to 14 bits.
 65 votes -- Ahwan Mishra (10.2k points)

In two-level memory system (hierarchical), it is clear that the second level is accessed only when first level access is a
miss. So, we must include the first level access time in all the memory access calculations. Continuing this way for any level, we
must include that level access time (without worrying about the hit rate in that level), to all memory accesses coming to that level
(i.e., by just considering the miss rate in the previous level). So, for the given question, we can get the following equation:
AMAT = L1 access time

+L1 miss rate × L2 access time
+L1 miss rate × L2 miss rate × Main memory access time
2 = 1 + x × 8 + 0.5x2 × 18
⟹ 9x2 + 8x − 1 = 0
−−−−−−
−8 ± √64 + 36 2
⟹ x= = = 0.111 .
18 18
So, Answer is option (A).


L2 cache is shared between Instruction and Data (is it always?, see below)
So, average read time

= Fraction of Instruction Fetch * Average Instruction fetch time + Fraction of Data Fetch * Average Data Fetch Time
Average Instruction fetch Time = L1 access time + L1 miss rate * L2 access time + L1 miss rate * L2 miss rate * Memory access
time
= 2 + 0.2 × 8 + 0.2 × 0.1 × 90
= 5.4 ns
Average Data fetch Time = L1 access time + L1 miss rate * L2 access time + L1 miss rate * L2 miss rate * Memory access time
= 2 + 0.1 × 8 + 0.1 × 0.1 × 90
= 3.7 ns
So, average memory access time
= 0.6 × 5.4 + 0.4 × 3.7 = 4.72 ns
Now, why L2 must be shared? Because we can otherwise use it for either Instruction or Data and it is not logical to use it for
only 1. Ideally this should have been mentioned in question, but this can be safely assumed also (not enough merit for Marks to
All). Some more points in the question:
" Assume that the caches use the referred-word-first read policy and the writeback policy
Writeback policy is irrelevant for solving the given question as we do not care for writes. Referred-word-first read policy means
there is no extra time required to get the requested word from the fetched cache line.
' Assume that all the caches are direct mapped caches.
Not really relevant as average access times are given
' Assume that the dirty bit is always 0 for all the blocks in the caches
Dirty bits are for cache replacement- which is not asked in the given question. But this can mean that there is no more delay
when there is a read miss in the cache leading to a possible cache line replacement. (In a write-back cache when a cache line is
replaced, if it is dirty then it must be written back to main memory).

232
No. of blocks of main Memory = = 227
25
9
512 =

And there are 512 = 29 lines in Cache Memory.
Tag bits tell us to how many blocks does 1 line in Cache memory points to
227
1 cache line points to = 218 lines
29
So, 18 bits are required as TAG bits.
 48 votes -- Manish Joshi (20.5k points)

Physical Address Space = 2P Bytes i.e. P bits to represent size of total memory.
Cache Size = 2N Byte i.e., N bits to represent Cache memory.
Tag size = 2X Bytes i.e., X bits to represent Tag.
Cache is K− way associative.
Cache Size
(Size of Tag) × K
= Total Memory Size
N
⟹ 2X × 2K = 2P
⟹ 2X+N−log(K) = 2P
⟹ 2X = 2P−N+log(K)
⟹ X(Size of Tag in bits) = P − N + log(K)
Correct Answer: B
 34 votes -- Digvijay (44.9k points)

Given that cache is Fully Associative.
Tag Bits Block Offset

28 4
There are no index bits in fully associative cache because every main memory block can go to any location in the cache
⟹ Index bits = 0.
Given that memory is byte addressable and uses 32-bit address.
Cache Block size is 16 Bytes ⟹ Number of bits required for Block Offset = ⌈log2 16⌉ = 4 bits
∴ Number of Tag bits = 32 − 4 = 28.
Answer is (D).
 50 votes -- Shaik Masthan (50.4k points)

Time to transfer a cache block = 1 + 3 + 8 = 12 cycles.
i.e., 4 bytes ×8 = 32 bytes in 12 cycles.

32 32
So, memory bandwidth = = = 160 × 106 bytes/s
12 cycle time 12/(60×106 )
6 Bytes
−−−−−−−−: 160 × 10
Answer sec
Explanation :
−−−−−−−−−−−−

Given frequency = 60 MHz
This means that the processor completes 60 × 106 cycles in 1 second.
1
∴ One cycle is completed in seconds
60 × 106
Now,
To service a cache miss, number of cycles needed


= 1 cylce (to accept starting address of the block) + 3 cylces (to fetch all the 8 words of the blocks) + 8×1 cylces (to transm
∵1 word per cycle
Note : Total data is the data which is used for trasmitting the words of the requested block at the rate of 1 word per cycle
= 8 words × 4 Byte (Size of each word)
Total Data
∴ Bandwidth =
12 cycle time
8 × 4 Bytes
=
12 cycles × 1 cycle time
32
=
1
12 ×
60 × 106
5
32 × 60 × 106
= 1
12
= 5 × 32 × 106
Bytes
= 160 × 106
sec
Bytes
∴ 160 × 106 is the correct answer.
sec
 16 votes -- Jeet (15.3k points)

Block size is 256 Bytes,word size is 64 bits or 8 bytes. So Block size in words is 8 words.
Number of words per block=32
Time to fetch a word from main-memory to cache is: 20 + 31 × 5 = 175 ns because first word takes 20ns and rest each
subsequent words take 5ns each.
So average Memory acces time is
0.94(3) + 0.06(3 + 175) = 13.5 ns

 23 votes -- Ayush Upadhyaya (28.4k points)

Block size is 256 Bytes. Number of sets in cache = 26 so Set offset bits=6 and word offset bits=8.
So check for set, check for the rightmost 4 digits of each physical address.(Last two byte denote the word address)
A1=C8A4 = C8 = 11001000
A2=6888 = 68 = 01101000
A3=289C = 28 = 00101000
A4=4880 = 48 = 01001000
Now look for lowest order 6 bits in the highlighted part of Each physical address(corresponds to set number).
8 and 8 match and 6=0110 and 2=0010 two low order bits of 6 and 2 match,So A2 and A3 go to same set.

So answer-B
 16 votes -- Ayush Upadhyaya (28.4k points)

Tag bits = PASbits – log2 (Cache Size) + log2 (K) (where K is associativity)
= 32 − 15 + 0 = 17 bits
 5 votes -- Nikhil Dhama (2.5k points)

32 bit address is used for accessing the cache.
It is given that cache is Set-Associative.
The address bits get split as follows:
Block Size = 64B ⟹ Block offset = 6 bits.

Given that Tag field width = 22 bits.
Therefore, width of Set Index field = 32 − 22 − 6 = 4 ⟹ 24 = 16 sets in the cache.
Cache size is 2KB and Block size = 64B ⟹ 25 = 32 blocks present in the cache.
16 sets contain 32 blocks ⟹ 2 blocks per set or associativity = 2.
Correct Answer: 2

S1 : Read Miss in a write through L1 cache results in read allocate. No write back is done here, as in a write through
L1 cache, both L1 and L2 caches are updated during a write operation (no dirty blocks and hence no dirty bits as in a write back
cache). So during a Read miss it will simply bring in the missed block from L2 to L1 which may replace one block in L1 (this
replaced block in L1 is already updated in L2 and so needs no write back). So, S1 is TRUE.
S2 : This statement is FALSE. Both write-through and write-back policies can use either of these write-miss policies, but usually
they are paired in this way.
No write allocation during write through as L1 and L2 are accessed for each write operation (subsequent writes to same
location gives no advantage even if the location is in L1 cache).
In write back we can to do write allocate in L1 after a write operation hoping for subsequent writes to the same location
which will then hit in L1 and thus avoiding a more expensive L2 access.
Correct Answer: A.
Cache Writing Policies
References

1.2.56 Cache Memory: GATE IT 2004 | Question: 12, ISRO2016-77 top☝ ☛ https://gateoverflow.in/3653

By default we consider hierarchical access - because that is the common implementation and simultaneous access cache
has great practical difficulty. But here the question is a bit ambiguous -- it says to ignore search time within the cache - usually
search is applicable for an associative cache but here no such information given. So, may be they are telling to ignore the search
time for L1 and just consider the time for L2 for an L1 miss and similarly just consider memory access time for L2 miss. This is
nothing but simultaneous access.
Access time for hierarchical access,

= t1 + (1 − h1 ) × t2 + (1 − h1 )(1 − h2 )tm
= 1 + 0.2 × 10 + 0.2 × 0.1 × 500
= 13ns.
Access time for simultaneous access,

= h1 × t1 + (1 − h1 )h2 × t2 + (1 − h1 )(1 − h2 )tm
= 0.8 + 0.2 × 0.9 × 10 + 0.2 × 0.1 × 500
= 12.6ns.
Both options in choice.

When 45 comes, the cache contents are:
4 3 25 8 19 6 16 35
LRU array (first element being least recently used)
[4 3 19 6 25 8 16 35] .
So, 45 replaces 4.
45 3 25 8 19 6 16 35 [3 19 6 25 8 16 35 45]
Similarly, 22 replaces 3 to give:
45 22 25 8 19 6 16 35 [19 6 25 8 16 35 45 22]
8 hits in cache.
45 22 25 8 19 6 16 35 [19 6 25 16 35 45 22 8]
3 replaces 19
45 22 25 8 3 6 16 35 [6 25 16 35 45 22 8 3]
16 and 25 hits in cache.
45 22 25 8 3 6 16 35 [6 35 45 22 8 3 16 25]
Finally, 7 replaces 6, which is in block 5.
So, answer is (B).

128 main memory blocks are mapped to 4 sets in cache. So, each set maps 32 blocks each. And in each set there is place
for two blocks (2-way set).

Now, we have 4 sets meaning 2 index bits. Also, 32 blocks going to one set means 5 tag bits.
Now, these 7 bits identify a memory block and tag bits are placed before index bits. (otherwise adjacent memory references-
spatial locality- will hamper cache performance)
So, based on the two index bits (lower 2 bits) blocks will be going to sets as follows:
Set Number Block Numbers

0 0, 16
1 5, 9
2
3 3 ,7, 55
Since, each set has only 2 places, 3 will be thrown out as it is the least recently used block. So, final content of cache will be
0 5 7 9 16 55
(C) choice.

Answer is C.
For 1 sec it is 109 bytes ( Note - by looking at the options, it can be decided that 1GB should be considered as 109 , but not as
230 , and in general, if bandwidth is given, then we consider 1 Giga = 109 )
So, for 64 bytes?
64 × 1
It is so it is 64 ns but mm latency is 32.
109
So, total time required to place cache line is 64 + 32 = 96 ns.
 57 votes -- K Rajashekar (997 points)

1. I-cache
4K
Number of blocks in cache = = 210 blocks.
4
Bits to represent blocks = 10
Number of words in a block = 4 = 22 words.
Bits to represent words = 2.
tag bits = 30 − (10 + 2) = 18.
Each block will have it's own tag bits. So total tag bits = 1K × 18 bits.
2. D-cache
4K
4
210
Number of sets in cache = = 29 sets.
2
Bits to represent sets = 9.
Number of words in a block = 4 = 22 words.
Bits to represent words = 2
tag bits = 30 − (9 + 2) = 19
Each block will have it's own tag bits. So total tag bits = 1K × 19 bits.
3. L2 cache
64K
16
212
Number of sets in cache = = 1024 sets.
4
Bits to represent sets = 10
4
= 16 = words.

Number of words in cache = 16 = 24 words.
Bits to represent words = 4.
tag bits = 30 − (10 + 4) = 16
Each block will have it's own tag bits. So total tag bits = 212 × 16-bits = 4K × 16-bits.
Option (A).
 38 votes -- Viral Kapoor (1.9k points)

Answer is (B).
Cache location (memory block) = block req mod number of cache blocks. Since each block has only one location (associativity is
1) the last mod 8 request will be in cache (no need of any replacement policy as mapping is direct).
3, 5, 2, 8, 0, 63, 9, 16, 20, 17, 25, 18, 30, 24, 2, 63, 5, 82, 17, 24
Block 0 − 8 , 0, 16, 24, 24 . At end contains 24.

1- 9, 17, 25, 17.
2- 2, 18, 2, 82.
3- 3.
4- 20.
5- 5, 5.
6- 30.
7- 6363.
So, memory block 18 is not in cache while 3, 20 and 30 are in cache.
 23 votes -- K Rajashekar (997 points)
 8KB
Number of cache blocks = = 64
(128 × 1)
Number of cache blocks 64

Number of sets in cache = = = 16
4 (4-way set) 4
So, number of SET bits required = 4(as 24 = 16, and with 4 bits we can get 16 possible outputs)
We can now straight away choose (D) as answer but for confirmation can proceed further.
Since, only physical memory information is given we can assume cache is physically tagged (which is anyway the common case
even in case of virtual memory).
So, we can divide the physical memory into 16 regions so that, each set maps into only its assigned region.
1MB 216
So, size of a region a set can address = = 216 Bytes = = 29 cache blocks (as cache block size is 128 words
16 128
= 128 bytes).
So, when an access comes to a cache entry, it must be able to determine which out of the 29 possible physical block it is. In
short, it needs 9 bits for TAG.
Now, cache block size is 128 words and so to identify a word we need 7 bits for WORD.

As shown in https://gateoverflow.in/3403/gate2008-it_80
We have 16 sets in cache and correspondingly 16 regions in physical memory to which each set is mapped.
Now, WORD bit size is 7 as we need 7 bits to address 128 possible words in a cache block.

So, the lowest 7 bits of 0C795H will be used for this giving us the remaining bits as 0000 1100 0111 1
Of these bits, the lower 4 are used for addressing the 16 possible sets, giving us the tag bits: 0000 1100 0 in (A) choice.
References
1.3 Cisc Risc Architecture (2) top☝
1.3.1 Cisc Risc Architecture: GATE CSE 1999 | Question: 2.22 top☝ ☛ https://gateoverflow.in/1499
The main difference(s) between a CISC and a RISC processor is/are that a RISC processor typically
A. has fewer instructions
B. has fewer addressing modes
C. has more registers
D. is easier to implement using hard-wired logic
gate1999 co-and-architecture normal cisc-risc-architecture multiple-selects
Answer ☟
1.3.2 Cisc Risc Architecture: GATE CSE 2018 | Question: 5 top☝ ☛ https://gateoverflow.in/204079
Consider the following processor design characteristics:
I. Register-to-register arithmetic operations only

II. Fixed-length instruction format
III. Hardwired control unit
Which of the characteristics above are used in the design of a RISC processor?
A. I and II only
B. II and III only
C. I and III only
D. I, II and III
gate2018-cse co-and-architecture cisc-risc-architecture easy
Answer ☟
Answers: Cisc Risc Architecture
1.3.1 Cisc Risc Architecture: GATE CSE 1999 | Question: 2.22 top☝ ☛ https://gateoverflow.in/1499

All are properties of the RISC processor.
http://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/whatis/index.html
http://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/risccisc/index.html
http://homepage3.nifty.com/alpha-1/computer/Control_E.html
References

1.3.2 Cisc Risc Architecture: GATE CSE 2018 | Question: 5 top☝ ☛ https://gateoverflow.in/204079

(D) All of these
' Hardwired control units are implemented through use of combinational logic units, featuring a finite number of
gates that can generate specific results based on the instructions that were used to invoke those responses. Their design
uses a fixed architecture—it requires changes in the wiring if the instruction set is modified or changed. This
architecture is preferred in reduced instruction set computers (RISC) as they use a simpler instruction set.
Instructions length cannot vary in RISC usually it's 32 bit. For CISC it can be between 16 to 64 bits.
The hardwired control unit is used when instructions are fixed.
Register to register operations is always possible in RISC. CISC can have memory to memory instructions also.
References: https://www-cs-faculty.stanford.edu/~eroberts/courses/soco/projects/2000-01/risc/risccisc/
https://en.wikipedia.org/wiki/Control_unit#Hardwired_control_unit
References
 26 votes -- Subham Mishra (11.4k points)
1.4 Clock Frequency (2) top☝
1.4.1 Clock Frequency: GATE CSE 1992 | Question: 01-iii top☝ ☛ https://gateoverflow.in/547
Many microprocessors have a specified lower limit on clock frequency (apart from the maximum clock frequency limit)
because _____
gate1992 normal co-and-architecture clock-frequency fill-in-the-blanks
Answer ☟
1.4.2 Clock Frequency: GATE IT 2007 | Question: 36 top☝ ☛ https://gateoverflow.in/3469
The floating point unit of a processor using a design D takes 2t cycles compared to t cycles taken by the fixed point unit.
There are two more design suggestions D1 and D2 . D1 uses 30% more cycles for fixed point unit but 30% less cycles for floating
point unit as compared to design D. D2 uses 40% less cycles for fixed point unit but 10% more cycles for floating point unit as
compared to design D. For a given program which has 80% fixed point operations and 20% floating point operations, which of the
following ordering reflects the relative performances of three designs?
(Di > Dj denotes that Di is faster than Dj)
A. D1 > D > D2
B. D2 > D > D1
C. D > D2 > D1
D. D > D1 > D2
gate2007-it co-and-architecture normal clock-frequency
Answer ☟
Answers: Clock Frequency
1.4.1 Clock Frequency: GATE CSE 1992 | Question: 01-iii top☝ ☛ https://gateoverflow.in/547

Clock frequency becomes low means time period of clock becomes high. When this time period increases beyond the
time period in which the volatile memory contents must be refreshed, we loose those contents. So, clock frequency can't go

below this value.
Reference: https://gateoverflow.in/261/microprocessors-specified-frequency-frequency-__________
References
1.4.2 Clock Frequency: GATE IT 2007 | Question: 36 top☝ ☛ https://gateoverflow.in/3469

(B) is correct.
'
T = 0.8 × time taken in fixed point + 0.2 × time taken in floating point
D = 0.8 × t + 0.2 × 2t = 1.2t

D1 = 0.8 × 1.3t + 0.2 × 0.7 × 2t = 1.04t + .28t = 1.32t
D2 = 0.8 × 0.6t + 0.2 × 1.1 × 2t = 0.48t + .44t = 0.92t
So, D2 is the best design for this given program followed by D and then D1 . Option B.
 45 votes -- Vicky Bajoria (4.1k points)
1.5 Conflict Misses (1) top☝
1.5.1 Conflict Misses: GATE CSE 2017 Set 1 | Question: 51 top☝ ☛ https://gateoverflow.in/118745
Consider a 2− way set associative cache with 256 blocks and uses LRU replacement. Initially the cache is empty. Conflict
misses are those misses which occur due to the contention of multiple blocks for the same cache set. Compulsory misses occur due
to first time access to the block. The following sequence of access to memory blocks :
{0, 128, 256, 128, 0, 128, 256, 128, 1, 129, 257, 129, 1, 129, 257, 129}
is repeated 10 times. The number of conflict misses experienced by the cache is _________ .
gate2017-cse-set1 co-and-architecture cache-memory conflict-misses normal numerical-answers
Answer ☟
Answers: Conflict Misses
1.5.1 Conflict Misses: GATE CSE 2017 Set 1 | Question: 51 top☝ ☛ https://gateoverflow.in/118745

{0, 128, 256, 128, 0, 128, 256, 128, 1, 129, 257, 129, 1, 129, 257, 129}
1st Iteration:
For {0, 128, 256, 128, 0, 128, 256, 128}

Block ID Type Set 0 content
0 Compulsory Miss 0
128 Compulsory Miss 0 128
256 Compulsory Miss 128 256
128 Hit 256 128
0 Conflict miss 128 0
128 Hit 0 128
128 Hit 256 128
Total number of conflict misses = 2;
Similarly for {1, 129, 257, 129, 1, 129, 257, 129} , total number of conflict misses in set1 = 2
Total number of conflict misses in 1st iteration = 2 + 2 = 4
2nd iteration:
for {0, 128, 256, 128, 0, 128, 256, 128}
Block ID Type Set 0 content

0 Conflict Miss 128 0
128 Hit 0 128
128 Hit 256 128
128 Hit 0 128
128 Hit 256 128
Total number of conflict misses = 4.
Similarly for {1, 129, 257, 129, 1, 129, 257, 129} , total number of conflict misses in set1 = 4
Total Number of conflict misses in 2nd iteration = 4 + 4 = 8
Note that content of each set is same, before and after 2nd iteration. Therefore each of the remaining iterations will also have 8
conflict misses.
Therefore, overall conflict misses = 4 + 8 ∗ 9 = 76

1.6 Control Unit (1) top☝
1.6.1 Control Unit: GATE CSE 1987 | Question: 1-vi top☝ ☛ https://gateoverflow.in/80199
A microprogrammed control unit
A. Is faster than a hard-wired control unit.

B. Facilitates easy implementation of new instruction.
C. Is useful when very small programs are to be run.
D. Usually refers to the control unit of a microprocessor.
gate1987 co-and-architecture control-unit microprogramming
Answer ☟

Answers: Control Unit
1.6.1 Control Unit: GATE CSE 1987 | Question: 1-vi top☝ ☛ https://gateoverflow.in/80199

A. is wrong. Microprogrammed Control Unit (CU) can never be faster than hardwired CU. Microprogrammed CU it has an
extra layer on top of hardwired CU and hence can only be slower than hardwired CU.
B. is a suitable answer as we can add new instruction by changing the content of control memory.
C. is not correct as when only small programs are there, hardwired control makes more sense.
D. control unit can also be hardwired, so this is also not correct.
Reference: Slides
Correct Answer: B
References
1.7 Data Dependences (2) top☝
1.7.1 Data Dependences: GATE CSE 2015 Set 3 | Question: 47 top☝ ☛ https://gateoverflow.in/8556
Consider the following code sequence having five instructions from I1 to I5 . Each of these instructions has the following
format.
OP Ri, Rj, Rk
Where operation OP is performed on contents of registers Rj and Rk and the result is stored in register Ri.
I1 : ADD R1, R2, R3
I2 : MUL R7, R1, R3
I3 : SUB R4, R1, R5
I4 : ADD R3, R2, R4
I5 : MUL R7, R8, R9
Consider the following three statements.
S1: There is an anti-dependence between instructions I2 and I5
S2: There is an anti-dependence between instructions I2 and I4
S3: Within an instruction pipeline an anti-dependence always creates one or more stalls
Which one of the above statements is/are correct?
A. Only S1 is true
B. Only S2 is true
C. Only S1 and S3 are true
D. Only S2 and S3 are true
gate2015-cse-set3 co-and-architecture pipelining data-dependences normal
Answer ☟
1.7.2 Data Dependences: GATE IT 2007 | Question: 39 top☝ ☛ https://gateoverflow.in/3472
Data forwarding techniques can be used to speed up the operation in presence of data dependencies. Consider the following
replacements of LHS with RHS.
i. R1 → Loc, Loc → R2 ≡ R1 → R2, R1 → Loc

ii. R1 → Loc, Loc → R2 ≡ R1 → R2
iii. R1 → Loc, R2 → Loc ≡ R1 → Loc
iv. R1 → Loc, R2 → Loc ≡ R2 → Loc

In which of the following options, will the result of executing the RHS be the same as executing the LHS irrespective of the
instructions that follow ?
A. i and iii
B. i and iv
C. ii and iii
D. ii and iv
gate2007-it data-dependences co-and-architecture
Answer ☟
Answers: Data Dependences
1.7.1 Data Dependences: GATE CSE 2015 Set 3 | Question: 47 top☝ ☛ https://gateoverflow.in/8556

Answer should be (B).
Anti-dependence can be overcome in pipeline using register renaming. So, "always" in S3 makes it false. Also, if I2 is completed
before I4 (execution stage of MUL), then also there won't be any stall.
 55 votes -- ppm (543 points)
1.7.2 Data Dependences: GATE IT 2007 | Question: 39 top☝ ☛ https://gateoverflow.in/3472

i. is true. Both LOC and R2 are getting the value of R1 in LHS and RHS .
ii. false, because R2 gets the correct data in both LHS and RHS , but LOC is not updated in RHS .
iii. is wrong because R2 is writing last, not R1 in LHS , but not in RHS .
iv. is true. The first write to LOC in LHS is useless as it is overwritten by the next write.
So, answer is (B).
1.8 Data Path (6) top☝
1.8.1 Data Path: GATE CSE 1990 | Question: 8a top☝ ☛ https://gateoverflow.in/85669
A single bus CPU consists of four general purpose register, namely, R0, … , R3, ALU, MAR, MDR, PC, SP and IR
(Instruction Register). Assuming suitable microinstructions, write a microroutine for the instruction, ADD R0, R1 .
gate1990 descriptive co-and-architecture data-path
Answer ☟
1.8.2 Data Path: GATE CSE 2001 | Question: 2.13 top☝ ☛ https://gateoverflow.in/731
Consider the following data path of a simple non-pipelined CPU. The registers A, B, A1 , A2 , MDR , the bus and the ALU are
8-bit wide. SP and MAR are 16-bit registers. The MUX is of size 8 × (2 : 1) and the DEMUX is of size 8 × (1 : 2) . Each
memory operation takes 2 CPU clock cycles and uses MAR (Memory Address Register) and MDR (Memory Date Register). SP
can be decremented locally.

The CPU instruction "push r" where, r = A or B has the specification
M[SP] ← r
SP ← SP − 1
How many CPU clock cycles are required to execute the "push r" instruction?
A. 2
B. 3
C. 4
D. 5
gate2001-cse co-and-architecture data-path machine-instructions normal
Answer ☟
1.8.3 Data Path: GATE CSE 2005 | Question: 79 top☝ ☛ https://gateoverflow.in/1402
Consider the following data path of a CPU.
The ALU, the bus and all the registers in the data path are of identical size. All operations including incrementation of the PC and
the GPRs are to be carried out in the ALU. Two clock cycles are needed for memory read operation – the first one for loading
address in the MAR and the next one for loading data from the memory bus into the MDR.
The instruction ‘‘add R0, R1” has the register transfer interpretation R0 ⇐ R0 + R1. The minimum number of clock cycles
needed for execution cycle of this instruction is:
A. 2
B. 3
C. 4
D. 5

gate2005-cse co-and-architecture machine-instructions data-path normal
Answer ☟
Consider the following data path of a CPU.
The ALU, the bus and all the registers in the data path are of identical size. All operations including incrementation of the PC and
the GPRs are to be carried out in the ALU. Two clock cycles are needed for memory read operation – the first one for loading
address in the MAR and the next one for loading data from the memory bus into the MDR.
The instruction "call Rn, sub” is a two word instruction. Assuming that PC is incremented during the fetch cycle of the first word of
the instruction, its register transfer interpretation is
Rn ← PC + 1 ;
PC ← M[PC] ;
The minimum number of CPU clock cycles needed during the execution cycle of this instruction is:
A. 2
B. 3
C. 4
D. 5
co-and-architecture normal gate2005-cse data-path machine-instructions
Answer ☟
1.8.5 Data Path: GATE CSE 2016 Set 2 | Question: 30 top☝ ☛ https://gateoverflow.in/39627
Suppose the functions F and G can be computed in 5 and 3 nanoseconds by functional units UF and UG , respectively. Given
two instances of UF and two instances of UG , it is required to implement the computation F(G(Xi )) for 1 ≤ i ≤ 10 . Ignoring all
other delays, the minimum time required to complete this computation is ____________ nanoseconds.
gate2016-cse-set2 co-and-architecture data-path normal numerical-answers
Answer ☟
Consider the following data path diagram.

Consider an instruction: R0 ← R1 + R2 . The following steps are used to execute it over the given data path. Assume that PC is
incremented appropriately. The subscripts r and w indicate read and write operations, respectively.
1. R2r , TEMP1 r , ALUadd , TEMP2 w

2. R1r , TEMP1 w
3. P Cr , MARw , MEM r
4. TEMP2 r , R0w
5. MDRr , IRw
Which one of the following is the correct order of execution of the above steps?
A. 2, 1, 4, 5, 3
B. 1, 2, 4, 3, 5
C. 3, 5, 2, 1, 4
D. 3, 5, 1, 2, 4
gate2020-cse co-and-architecture data-path
Answer ☟
Answers: Data Path
1.8.1 Data Path: GATE CSE 1990 | Question: 8a top☝ ☛ https://gateoverflow.in/85669

MAR ← PC
PC ← PC +3
MDR ← MEM[MAR]
IR ← MDR
MAR ← MAR+1
MDR ← MEM[MAR]
R0 ← MDR
MAR ← MAR +1
MDR ← MEM[MAR]
R1 ← MDR
R0 ← R0+R1
Reference: CPU Datapath Diagram from Hamacher

References
 3 votes -- Arnab Bhadra (3.7k points)
1.8.2 Data Path: GATE CSE 2001 | Question: 2.13 top☝ ☛ https://gateoverflow.in/731

A microinstruction cannot be further broken down into two or more. It can take more than a cycle if it involves a memory
access. The first instruction given here is not a microinstruction. It is an assembly language instruction.
It can be broken down as:

T 1, T 2 : MAR ← SP
T 3. : MDR ← r, SP ← SP − 1 ( It is not mandatory to decrement it in this cycle. Anyway, it can be decremented locally )
T 4, T 5 : M[MAR] ← MDR
The problem says, 8-bit MDR, 8-bit data bus, 8 bit registers.Can't you see that the given CPU is 8-bit? 8 multiplexers transfer 8
bits when selection input is 0 and 1 respectively. During cycle 1, bits in even positions are moved to MAR. During cycle 2, bits
in odd positions are transferred to MAR. We certainly need to move 16-bit SP to 16-bit MAR via a 8-bit bus. So, 2 cycles to get
SP to MAR.
The given data path has a single bus, which requires r to be carried in a separate cycle. For the contents of r to be moved to
MDR during the cycles T1 or T2, address and data bus should be separate. Here, it ain't the case.
Memory read takes 2 more cycles. In total, we need 5 of them clock cycles to execute a push.
https://www.cise.ufl.edu/~mssz/CompOrg/CDA-proc.html
Computer organization pal chaudari page 334-335
Computer architecture by behrooz parahmi exercise 7.6

Correct Answer: D
References
 50 votes -- Rajaneesh Polavarapu (377 points)

Instruction fetch requires two cycles but the question asks for the execution part only!
Now for execution:
1. R1out , Sin S ← R0 − 1st cycle

2. R2out , Tin T ← R1 − 2nd cycle
3. Sout , Tout , Add R0in R0 ← R0 + R1 − 3rd cycle
So, 3 cycles for execution.

As it is asked for only execution cycles, no of cycles required = 3.
Had it been asked for instruction cycles, then the answer will be 5.
Hence, option B is correct.
 58 votes -- Pooja Palod (24.1k points)

MAR ← PC → 1 cycle
S ← PC (Since these two actions are independent they can be done in same cycle)
MDR ← M[MAR] → 2nd cycle (System BUS)
Rn ← S + 1 (ALU Is free and the two actions are independent.) (Internal BUS)
PC ← MDR → 3 rd cycle
Therefore 3 cycles needed.

A rough sketch:

Correct Answer: B
 66 votes -- Riya Roy(Arayana) (5.3k points)
1.8.5 Data Path: GATE CSE 2016 Set 2 | Question: 30 top☝ ☛ https://gateoverflow.in/39627

The same concept is used in pipelining. Bottleneck here is UF as it takes 5 ns while UG takes 3ns only. We have to do 10
such calculations and we have 2 instances of UF and UG respectively. So, UF can be done in 50/2 = 25 nano seconds. For the
start UF needs to wait for UG output for 3 ns and rest all are pipelined and hence no more wait. So, answer is
3 + 25 = 28ns.

3rd followed by 5th are Instruction fetch cycle micro operations and can be elaborated as follows:
t1 : MAR w ← PCr
t2 : MDRw ← Memoryr ∣ PC ← PC + 1
t3 : IRw ← MDRr
Now we need to perform Execute cycle micro operations. Just observe the figure and it will be very easy to identify the sequence
between 1st , 2nd , 4th

2nd is clearly stating that we need to move R1 content to some temporary register named as TEMP1 and it is very clear that
before performing ALU operation we need the content in TEMP1. Hence 2nd will be performed next after 5th .
TEMP1 w ← R1r
Now we can perform ALU operation and can take second operand directly from R2 and the figure clearly shows us that we need
to put the result of ALU back into TEMP2. All these steps are performed in 1st . So 1st will be next.
TEMP2 w ← TEMP1 r +ALUadd R2r
Lastly we need to put the result present in TEMP2 into R0. This step is performed by 4th .
R0w ← TEMP2 r
Correct Answer (C) : 3, 5, 2, 1, 4
 3 votes -- KUSHAGRA गु ा (10.5k points)
1.9 Dma (6) top☝
1.9.1 Dma: GATE CSE 2004 | Question: 68 top☝ ☛ https://gateoverflow.in/1062
A hard disk with a transfer rate of 10 Mbytes/second is constantly transferring data to memory using DMA. The processor
runs at 600 MHz, and takes 300 and 900 clock cycles to initiate and complete DMA transfer respectively. If the size of the transfer
is 20 Kbytes, what is the percentage of processor time consumed for the transfer operation?
A. 5.0%
B. 1.0%
C. 0.5%
D. 0.1%
gate2004-cse dma normal co-and-architecture
Answer ☟
Consider a disk drive with the following specifications:

16 surfaces, 512 tracks/surface, 512 sectors/track, 1 KB/sector, rotation speed 3000 rpm. The disk is operated in cycle stealing
mode whereby whenever one 4 byte word is ready it is sent to memory; similarly, for writing, the disk interface reads a 4 byte word
from the memory in each DMA cycle. Memory cycle time is 40 nsec. The maximum percentage of time that the CPU gets blocked
during DMA operation is:
A. 10
B. 25
C. 40
D. 50
gate2005-cse co-and-architecture disks normal dma
Answer ☟

On a non-pipelined sequential processor, a program segment, which is the part of the interrupt service routine, is given to
transfer 500 bytes from an I/O device to memory.
Initialize the address register
Initialize the count to 500
LOOP: Load a byte from device
Store in memory at address given by address register
Increment the address register
Decrement the count
If count !=0 go to LOOP
Assume that each statement in this program is equivalent to a machine instruction which takes one clock cycle to execute if it is a
non-load/store instruction. The load-store instructions take two clock cycles to execute.
The designer of the system also has an alternate approach of using the DMA controller to implement the same transfer. The DMA
controller requires 20 clock cycles for initialization and other overheads. Each DMA transfer cycle takes two clock cycles to transfer
one byte of data from the device to the memory.
What is the approximate speed up when the DMA controller based design is used in a place of the interrupt driven program based
input-output?
A. 3.4
B. 4.4
C. 5.1
D. 6.7
gate2011-cse co-and-architecture dma normal
Answer ☟
1.9.4 Dma: GATE CSE 2016 Set 1 | Question: 31 top☝ ☛ https://gateoverflow.in/39698
The size of the data count register of a DMA controller is 16bits.

The processor needs to transfer a file of 29, 154 kilobytes from disk to main memory.
The memory is byte addressable. The minimum number of times the DMA
controller needs to get the control of the system bus from the processor to transfer
the file from the disk to main memory is _________-.
gate2016-cse-set1 co-and-architecture dma normal numerical-answers
Answer ☟
Consider a computer system with DMA support. The DMA module is transferring one 8-bit character in one CPU cycle
from a device to memory through cycle stealing at regular intervals. Consider a 2 MHz processor. If 0.5% processor cycles are
used for DMA , the data transfer rate of the device is __________ bits per second.
gate2021-cse-set2 numerical-answers co-and-architecture dma
Answer ☟
1.9.6 Dma: GATE IT 2004 | Question: 51 top☝ ☛ https://gateoverflow.in/3694
The storage area of a disk has the innermost diameter of 10 cm and outermost diameter of 20 cm. The maximum storage
density of the disk is 1400 bits/cm. The disk rotates at a speed of 4200 RPM. The main memory of a computer has 64-bit word
length and 1µs cycle time. If cycle stealing is used for data transfer from the disk, the percentage of memory cycles stolen for
transferring one word is
A. 0.5%
B. 1%
C. 5%
D. 10%
gate2004-it co-and-architecture dma normal
Answer ☟
Answers: Dma

 1
Clock cycle time = [ Frequency = 1/Time]
600×106
(900+300)
For DMA initiation and completion = = 2 microsec .
600×106
Disk Transfer rate = 10 Mbytes/sec

1
1 byte = sec
107
20 Kbytes = 2 milisec = 2000 micro sec
Percentage = ( 2+2000
2
) × 100 = 0.0999 ≃ 0.1%
option (D)
x
% of CPU time consume x+y
Now, when, use x = Data preparation time or Total cycle time used by CPU and y = Data transfer time
x
To calculate the fraction of CPU time to the data transfer time - we use x+y it is burst mode.
512 KB-1/50 sec
4 Byte transfer will take total :4/(512 × 50 × 210 ) = 152.58 ns
DMA will transfer 4B in 40nsec
So, Cpu will be blocked (40/152.58) = 26% of time
Best matching answer is (B).
 55 votes -- Anurag Semwal (6.7k points)
Statement Clock Cycles(s) Needed

Initialize the address register 1
Initialize the count to 500 1
LOOP: Load a byte from device 2
Store in memory at address given by address register 2
Increment the address register 1
Decrement the count 1
If count != 0 go to LOOP 1
Interrupt driven transfer time = 1 + 1 + 500 × (2 + 2 + 1 + 1 + 1) = 3502
DMA based transfer time = 20 + 500 × 2 = 1020
Speedup = 3502/1020 = 3.4
Correct Answer: A
 90 votes -- Manu Thakur (34k points)

' Data count register gives the number of words the DMA can transfer in a single cycle..
Here it is 16 bits.. so max 216 words can be transferred in one cycle..

Since memory is byte addressable.. 1 word = 1 byte
so 216 bytes in 1 cycle..
Now for the given file..
File size = 29154 KB = 29154 × 210 B
1 cylce → DMA transfers 216 B
i.e
1
1 B transfered by DMA → 16 cycles.
2
Now, for full file of size 29154 KB,

(29154 × 210 B)
minimum number of cylces = = 455.53
216
But number of cylces is asked so 455.53 → 456.
 57 votes -- Abhilash Panicker (7.6k points)

Answer is 80, 000 .
1
To complete one cycle at 2 MHz it will take seconds. So the total number of CPU cycles in one second will be 2 × 106 .
2×106
Now 0.5% of these cycles are taken by DMA to transfer the data.
0.5
So total number of cycles taken to transfer the data will be 100 × 2 × 106 = 10, 000 and in each cycle 8 bits are transferred.
So, data transfer rate in bits per second = 8 × 10000 = 80, 000 .
 3 votes -- JATIN MITTAL (2.1k points)
1.9.6 Dma: GATE IT 2004 | Question: 51 top☝ ☛ https://gateoverflow.in/3694

In a disk all tracks have equal capacity and so data density is highest for the innermost track as it has the smallest radius.
Maximum storage density (hence of innermost track) = 1400 bits per cm

Track capacity π × d × 1400 bits = 3.14 × 10 × 1400 = 43960 bits
4200 × 43960
With 4200 rotations per minute, data transfer rate = bits per second.
60
60
Therefore, to transfer 64 bits time required = × 64 = 20.798μs
4200 × 43960
1
With 1μs memory cycle time, the disk will take one memory cycle out of 21 + 1 = × 100 ≈ 5%
21 + 1
(PS: If we consider just one word transfer we add the memory cycle time to the disk transfer time in the denominator but for
continuous DMA transfer, this is not required as when data is transferred to main memory, disk can continue reading new data)
1.10 Dram (1) top☝
1.10.1 Dram: GATE CSE 2019 | Question: 2 top☝ ☛ https://gateoverflow.in/302846
The chip select logic for a certain DRAM chip in a memory system design is shown below. Assume that the memory system
has 16 address lines denoted by A15 to A0 . What is the range of address (in hexadecimal) of the memory system that can get

enabled by the chip select (CS) signal?
A. C800 to CFFF
B. CA00 to CAFF
C. C800 to C8FF
D. DA00 to DFFF
gate2019-cse co-and-architecture dram
Answer ☟
Answers: Dram
1.10.1 Dram: GATE CSE 2019 | Question: 2 top☝ ☛ https://gateoverflow.in/302846

(A15 A14 A13 A12 A11 A10 A9 A8 A7 A6 A5 A4 A3 A2 A1 A0 )
According to question:
A15 = 1, A14 = 1, A13 = 0, A12 = 0, A11 = 1
So the possible range in binary:
(11001 00000000000) to (1100111111111111)
Converting to Hexadecimal:
(C800) to (CFFF)
Option A.
 40 votes -- Ruturaj Mohanty (3.1k points)
' The memory system that can be enabled by the chip select signal
→ From this we can understood that, CS should be 1.
→ Note that it is AND gate, ⟹ all input lines should be 1.

¯¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯¯ ¯¯¯¯¯¯¯¯
→ A15 .A14 .A13 .A12 .A11 = 1 ⟹ (A15 = 1 and A14 = 1 and A13 = 1 and A12 = 1 and A11 = 1)
¯¯¯¯¯¯¯¯
→ A13 = 1 ⟹ A13 = 0
¯¯¯¯¯¯¯¯
→ A12 = 1 ⟹ A12 = 0
∴ The address denoted by A15 to A0 is , (Note A15 as MSB and A0 as LSB)
(1100) (1__ __ __) (__ __ __ __ ) ( __ __ __ __ )
→ For starting address,keep all 0's in the blanks, and for ending address keep all 1's in the blanks.
→ Starting address :- 1100 1000 0000 0000 ⟹ (C800)H
→ Ending address :- 1100 1111 1111 1111 ⟹ (CFFF)H

1.11 Instruction Execution (7) top☝
1.11.1 Instruction Execution: GATE CSE 1990 | Question: 4-iii top☝ ☛ https://gateoverflow.in/85391
State whether the following statements are TRUE or FALSE with reason:
The flags are affected when conditional CALL or JUMP instructions are executed.
gate1990 true-false co-and-architecture instruction-execution
Answer ☟
1.11.2 Instruction Execution: GATE CSE 1992 | Question: 01-iv top☝ ☛ https://gateoverflow.in/548
Many of the advanced microprocessors prefetch instructions and store it in an instruction buffer to speed up processing. This
speed up is achieved because ________
gate1992 co-and-architecture easy instruction-execution fill-in-the-blanks
Answer ☟
1.11.3 Instruction Execution: GATE CSE 1995 | Question: 1.2 top☝ ☛ https://gateoverflow.in/2589
Which of the following statements is true?

A. ROM is a Read/Write memory
B. PC points to the last instruction that was executed
C. Stack works on the principle of LIFO
D. All instructions affect the flags
gate1995 co-and-architecture normal instruction-execution
Answer ☟
Which of the following is not a form of memory

A. instruction cache
B. instruction register
C. instruction opcode
D. translation look-a-side buffer
gate2002-cse co-and-architecture easy instruction-execution
Answer ☟
1.11.5 Instruction Execution: GATE CSE 2006 | Question: 43 top☝ ☛ https://gateoverflow.in/1819
Consider a new instruction named branch-on-bit-set (mnemonic bbs). The instruction “bbs reg, pos, label” jumps to label if bit
in position pos of register operand reg is one. A register is 32 -bits wide and the bits are numbered 0 to 31, bit in position 0 being
the least significant. Consider the following emulation of this instruction on a processor that does not have bbs implemented.
temp ← reg&mask
Branch to label if temp is non-zero. The variable temp is a temporary register. For correct emulation, the variable mask must be
generated by
A. mask ← 0x1 << pos
B. mask ← 0xffffffff << pos
C. mask ← pos
D. mask ← 0xf

gate2006-cse co-and-architecture normal instruction-execution
Answer ☟
1.11.6 Instruction Execution: GATE CSE 2017 Set 1 | Question: 49 top☝ ☛ https://gateoverflow.in/118332
Consider a RISC machine where each instruction is exactly 4 bytes long. Conditional and unconditional branch instructions
use PC-relative addressing mode with Offset specified in bytes to the target location of the branch instruction. Further the Offset is
always with respect to the address of the next instruction in the program sequence. Consider the following instruction sequence
Instr. No. Instruction

i: add R2, R3, R4
i+1: sub R5, R6, R7
i+2: cmp R1, R9, R10
i+3: beq R1, Offset
If the target of the branch instruction is i, then the decimal value of the Offset is ____________ .
gate2017-cse-set1 co-and-architecture normal numerical-answers instruction-execution
Answer ☟
Consider a pipelined processor with 5 stages, Instruction Fetch(IF) , Instruction Decode(ID) , Execute (EX) ,
Memory Access (MEM) , and Write Back (WB) . Each stage of the pipeline, except the EX stage, takes one cycle. Assume that
the ID stage merely decodes the instruction and the register read is performed in the EX stage. The EX stage takes one cycle for
ADD instruction and the register read is performed in the EX stage, The EX stage takes one cycle for ADD instruction and two
cycles for MUL instruction. Ignore pipeline register latencies.
Consider the following sequence of 8 instructions:
ADD, MUL, ADD, MUL, ADD, MUL, ADD, MUL
Assume that every MUL instruction is data-dependent on the ADD instruction just before it and every ADD instruction (except the
first ADD) is data-dependent on the MUL instruction just before it. The speedup defined as follows.
Execution time without operand forwarding

Speedup =
Execution time with operand forearding
The Speedup achieved in executing the given instruction sequence on the pipelined processor (rounded to 2 decimal places) is
_____________
gate2021-cse-set2 co-and-architecture pipelining instruction-execution numerical-answers
Answer ☟
Answers: Instruction Execution
1.11.1 Instruction Execution: GATE CSE 1990 | Question: 4-iii top☝ ☛ https://gateoverflow.in/85391
False. Flags are tested during conditional call and jump not affected or changed
 15 votes -- khush tak (5.9k points)
1.11.2 Instruction Execution: GATE CSE 1992 | Question: 01-iv top☝ ☛ https://gateoverflow.in/548

Because CPU is faster than memory. Fetching instructions from memory would require considerable amount of time
while CPU is much faster. So, prefetching the instructions to be executed can save considerable amount of waiting time.


It is (C).
Only the top of the stack can be accessed at any time. You can imagine a stack to be opened from only one side data structure. So
that if we put one thing over the other, we are able to access the last thing we inserted first. That is Last in First Out (LIFO).
ROM is Read-Only Memory.

PC points to the next instruction to be executed.
Not all instructions affect the flags.

The instruction opcode is a part of the instruction which tells the processor what operation is to be performed so it is not
a form of memory while the others are
 26 votes -- Bhagirathi Nayak (11.7k points)
1.11.5 Instruction Execution: GATE CSE 2006 | Question: 43 top☝ ☛ https://gateoverflow.in/1819

A. mask ← 0x1 << pos
We want to check for a particular bit position say 2 (third from right). Let the number be 0xA2A7 (last 4 bits being 0111 ). Here,
the bit at position 2 from right is 1. So, we have to AND this with 0x0004 as any other flag would give wrong value (may count
other bits or discard the bit at position "pos "). And 0x0004 is obtained by 0x1 << 2 (by shifting 1 "pos " times to the left we
get a flag with 1 being set only for the "pos " bit position).

Answer is −16.
Program Counter is updated with the address of next instruction even before the current instruction is executed.
That is why the question says that the address of the next instruction is updated with next instruction in sequence.
Before executing instruction i + 3 , the current state looks as under:
Please note: BEQ instruction is for Branch Equal
Question says that the target of branch instruction is 'i' which is at 2000 in our example.
So, we need to go to address 2000 from address 2016 (which is currently pointed by PC)
2016 − 2000 = 16
So, we have to specify Offset as −16 which would mean that 16 should be subtracted from next address instruction ( 2016 ).
 51 votes -- Arunav Khare (3.9k points)

Correct Answer: 1.875
Speedup(def in question) =
Time without Operand Forwarding
Speedup(def in question) =
Time with Operand Forwarding
Without Operand Forwarding:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
ADD IF ID EX MEM WB
MUL IF ID EX EX MEM WB
ADD IF ID EX MEM WB
ADD IF ID EX MEM WB
ADD IF ID EX MEM WB
MUL IF ID
Time taken without Operand Forwarding = 30
With Operand Forwarding:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
ADD IF ID EX MEM WB
ADD IF ID EX MEM WB
ADD IF ID EX MEM WB
ADD IF ID EX MEM WB
Time taken with Operand Forwarding = 16
Time without Operand Forwarding 30

Speedup = = = 1.875
Time with Operand Forwarding 16
 7 votes -- Cringe is my middle name... (885 points)
1.12 Instruction Format (7) top☝
1.12.1 Instruction Format: GATE CSE 1988 | Question: 2-ii top☝ ☛ https://gateoverflow.in/91676
Using an expanding opcode encoding for instructions, is it possible to encode all of the following in an instruction format
shown in the below figure. Justify your answer.
14 double address instructions

127 single address instructions
60 no address (zero address) instructions
← 4 bits → ← 6 bits → ← 6 bits →

Opcode Operand 1 Operand 2
Address Address
gate1988 normal co-and-architecture instruction-format descriptive
Answer ☟
1.12.2 Instruction Format: GATE CSE 1992 | Question: 01-vi top☝ ☛ https://gateoverflow.in/551
In an 11-bit computer instruction format, the size of address field is 4-bits. The computer uses expanding OP code technique
and has 5 two-address instructions and 32 one-address instructions. The number of zero-address instructions it can support is
________
gate1992 co-and-architecture machine-instructions instruction-format normal numerical-answers
Answer ☟
1.12.3 Instruction Format: GATE CSE 1994 | Question: 3.2 top☝ ☛ https://gateoverflow.in/2479
State True or False with one line explanation
Expanding opcode instruction formats are commonly employed in RISC. (Reduced Instruction Set Computers) machines.

gate1994 co-and-architecture machine-instructions instruction-format normal true-false
Answer ☟
1.12.4 Instruction Format: GATE CSE 2014 Set 1 | Question: 9 top☝ ☛ https://gateoverflow.in/1767
A machine has a 32 − bit architecture, with 1 − word long instructions. It has 64 registers, each of which is 32 bits long. It
needs to support 45 instructions, which have an immediate operand in addition to two register operands. Assuming that the
immediate operand is an unsigned integer, the maximum value of the immediate operand is ____________
gate2014-cse-set1 co-and-architecture machine-instructions instruction-format numerical-answers normal
Answer ☟
Consider a processor with 64 registers and an instruction set of size twelve. Each instruction has five distinct fields, namely,
opcode, two source register identifiers, one destination register identifier, and twelve-bit immediate value. Each instruction must be
stored in memory in a byte-aligned fashion. If a program has 100 instructions, the amount of memory (in bytes) consumed by the
program text is _________.
gate2016-cse-set2 instruction-format machine-instructions co-and-architecture normal numerical-answers
Answer ☟
1.12.6 Instruction Format: GATE CSE 2018 | Question: 51 top☝ ☛ https://gateoverflow.in/204126
A processor has 16 integer registers (R0, R1, … , R15) and 64 floating point registers (F0, F1, … , F63). It uses a 2- byte
instruction format. There are four categories of instructions: Type-1, Type-2, Type-3, and Type-4. Type-1 category consists of
four instructions, each with 3 integer register operands (3Rs). Type-2 category consists of eight instructions, each with 2 floating
point register operands (2Fs). Type-3 category consists of fourteen instructions, each with one integer register operand and one
floating point register operand (1R+1F). Type-4 category consists of N instructions, each with a floating point register operand
(1F).
The maximum value of N is _________.
gate2018-cse co-and-architecture machine-instructions instruction-format numerical-answers
Answer ☟
A processor has 64 registers and uses 16-bit instruction format. It has two types of instructions: I-type and R-type. Each I-type
instruction contains an opcode, a register name, and a 4-bit immediate value. Each R-type instruction contains an opcode and two
register names. If there are 8 distinct I-type opcodes, then the maximum number of distinct R-type opcodes is _______.
gate2020-cse numerical-answers instruction-format machine-instructions
Answer ☟
Answers: Instruction Format
1.12.1 Instruction Format: GATE CSE 1988 | Question: 2-ii top☝ ☛ https://gateoverflow.in/91676

4 bits are for the opcode so number of 2 address instructions will be 24 = 16− so 14 double instructions are possible.
But out of 16 only 14 are used so 2 are still left which can be used for 1 address instruction. For 1 address instruction we can use
not only the 2 left over but also the 6 bits of operand 1 (to make it one address) − so 6 bits that is 64. So, total 2 × 64 single
address instructions can be supported − So, 127 single instructions are possible
But out of 128, 127 are used so 1 left which can be used for zero-address instruction. To make number of zero address we can
use the operand 2 address (we already included operand 1 address) −6 bits. So, total number possible is 64. So,
total 1 × 64 = 64 zero address instructions are possible.
So, all encoding are possible.

 24 votes -- Pavan Kumar Munnam (7.4k points)
1.12.2 Instruction Format: GATE CSE 1992 | Question: 01-vi top☝ ☛ https://gateoverflow.in/551

No. of possible instruction encoding = 211 = 2048
No. of encoding taken by two-address instructions = 5 × 24 × 24 = 1280
No. of encoding taken by one-address instructions = 32 × 24 = 512
So, no. of possible zero-address instructions = 2048 − (1280 + 512) = 256

1.12.3 Instruction Format: GATE CSE 1994 | Question: 3.2 top☝ ☛ https://gateoverflow.in/2479

I think the answer is TRUE.
RISC systems use fixed length instruction to simplify pipeline.
eg: MIPS, PowerPC: Instructions are 4 bytes long.
CISC systems use Variable-length instructions.
eg: Intel 80X86 : Instructions vary from 1 to 17 bytes long.
Now the challenge is: How to fit multiple sets of instruction types into same (limited) number of bits (Fixed size instruction)?
Here comes Expanding opcode into the picture.
RISC systems commonly uses Expanding opcode technique to have fixed size instructions.
 39 votes -- Sachin Mittal (15.8k points)

64 registers means 6 bits (⌈log2 64⌉ = 6) for a register operand. So, 2 registers operand requires 12 bits. Now, 45
instructions require another 6 bits for opcode (⌈log2 45⌉ = 6) . So, totally 18 bits. So, we have 32 − 18 = 14 bits left for the
immediate operand. So, the max value will be 214 − 1 = 16383 (as the operand is unsigned we do not need a sign bit and with
14 bits we can represent from 0 to 214 − 1 )

Answer: 500 bytes
Number of registers = 64
Number of bits to address register = ⌈log2 64⌉ = 6 − bits
Number of Instructions = 12
Opcode size = ⌈log2 12⌉ = 4
Opcode(4) reg1(6) reg2(6) reg3(6) Immediate(12)
Total bits per instruction = 34

Total bytes per instruction = 4.25
Due to byte alignment we cannot store 4.25 bytes, without wasting 0.75 bytes.
So, total bytes per instruction = 5
Total number of instructions = 100
Total size = Number of instructions × Size of an instruction
= 100 × 5 = 500 bytes


We have 2-byte instruction format. So, total number of instruction encodings = 216
PS: This is not the number of different instructions but different encodings; a single instruction can have different encodings
when the address part differs.
No. of bits taken by an integer operand (16 possible integer registers) = log2 16 = 4.
No. of bits taken by a floating point operand (64 possible floating point registers) = log2 64 = 6.
No. of encodings consumed by Type 1 instructions = 4 × 23×4 = 214 .
No. of encodings consumed by Type 2 instructions = 8 × 22×6 = 215 .
No. of encodings consumed by Type 3 instructions = 14 × 2(4+6) = 14336.
No. of encodings left for Type 4 = 216 − (214 + 215 + 14336) = 2048.
2048
No. of different instructions of Type 4 = 64
= 32.
Answer: 32 Instructions
Explanation:
Given,
16 Integer registers. So, we need 4 bits to address any one of them.
64 Floating point registers. This requires 6 bits to uniquely identify them.
Each instruction is 16 bits long.
Type 1 Instructions:
4 instructions, each with 3 integer operands.
The 3 integers, each requires 4 bits. So, 4 ∗ 3 bits for operands. We are left with 16 − 12 = 4 bits.
With 4 bits, 24 = 16 opcodes are possible. Out of these we used 4 opcodes. i.e 2 bits. Let's say first two bits are fixed to 00 and
next two bits are used for 4 different Type1 instructions.
00 00 . . .
00 01 . . .
00 10 . . .
00 11 . . .
8 instructions, each with 2 floating point register operands.
Here we need 6 ∗ 2 bits for operands, and remaining 16 − 12 = 4 bits are left for opcodes.
So using these 4 bits, we need to get 8 opcodes.
Here we can't use 00 . . . for any opcode since it will not distinguish Type 2 from Type 1. So, we are left with 12 opcodes. And
we are going to use 8 out of these 12 for type 2 instructions.
01 00 . . .
01 01 . . .
01 10 . . .
01 11 . . .
10 00 . . .
10 01 . . .

10 10 . . .
10 11 . . .
14 instructions, with 1 integer and 1 floating type operand.
4 + 6 = 10 bits required for opcodes, remaining 16 − 10 = 6 bits available for use in opcode.
The only valid combination left for this first 2 bits is 11 . . . . Rest have been used in Type1 and Type2 instructions.
So, we are left with 4 bits for opcodes. With these 4 bits we can have 24 = 16 opcodes, out of which 14 are required. So, we use
all except last two opcodes:
11 00 00 . . .
11 . . .
11 11 01 . . .
These two opcodes are still left unassigned.
11 11 10 . . .
11 11 11 . . .
N instructions, each with 1 floating point operand.
We have 16 − 6 = 10 bits for opcode. Out of 10 bits, first 6 bits can be either one of the two left opcodes(above). And any
combination for remaining 4 bits.
So we have 2 ∗ 24 opcodes. So, N = 32 .
 103 votes -- Rishabh Gupta (12.5k points)

Instruction Length: 16 bits
To distinguish among 64 registers, we need log2 (64) = 6 bits
I-type instruction format:
Opcode Register Immediate Value

R-type instruction format:
Opcode Register Register

Maximum possible encodings = 216
It is given that there are 8 I-type instructions. Let's assume the maximum R-type instructions to be x.
Therefore, (8 × 26 × 24 ) + (x × 26 × 26 ) = 216
⟹ x = 16 − 2 = 14
 32 votes -- Debasish Das (1.5k points)
1.13 Interrupts (7) top☝
1.13.1 Interrupts: GATE CSE 1987 | Question: 1-viii top☝ ☛ https://gateoverflow.in/80274
On receiving an interrupt from a I/O device the CPU:
A. Halts for a predetermined time.

B. Hands over control of address bus and data bus to the interrupting device.
C. Branches off to the interrupt service routine immediately.
D. Branches off to the interrupt service routine after completion of the current instruction.
gate1987 co-and-architecture interrupts
Answer ☟

1.13.2 Interrupts: GATE CSE 1995 | Question: 1.3 top☝ ☛ https://gateoverflow.in/2590
In a vectored interrupt:
A. The branch address is assigned to a fixed location in memory

B. The interrupting source supplies the branch information to the processor through an interrupt vector
C. The branch address is obtained from a register in the processor
D. None of the above
gate1995 co-and-architecture interrupts normal
Answer ☟
Which of the following is true?
A. Unless enabled, a CPU will not be able to process interrupts.

B. Loop instructions cannot be interrupted till they complete.
C. A processor checks for interrupts before executing a new instruction.
D. Only level triggered interrupts are possible on microprocessors.
gate1998 co-and-architecture interrupts normal
Answer ☟
A device employing INTR line for device interrupt puts the CALL instruction on the data bus while:
¯¯¯¯¯¯¯¯¯¯¯¯¯¯
A. INT A is active
B. HOLD is active
C. READY is inactive
gate2002-cse co-and-architecture interrupts normal
Answer ☟
1.13.5 Interrupts: GATE CSE 2005 | Question: 69 top☝ ☛ https://gateoverflow.in/1392
A device with data transfer rate 10 KB/sec is connected to a CPU. Data is transferred byte-wise. Let the interrupt overhead be
4μsec. The byte transfer time between the device interface register and CPU or memory is negligible. What is the minimum
performance gain of operating the device under interrupt mode over operating it under program-controlled mode?
A. 15
B. 25
C. 35
D. 45
gate2005-cse co-and-architecture interrupts
Answer ☟
1.13.6 Interrupts: GATE CSE 2009 | Question: 8, UGCNET-June2012-III: 58 top☝ ☛ https://gateoverflow.in/1300
A CPU generally handles an interrupt by executing an interrupt service routine:
A. As soon as an interrupt is raised.

B. By checking the interrupt register at the end of fetch cycle.
C. By checking the interrupt register after finishing the execution of the current instruction.

D. By checking the interrupt register at fixed time intervals.
gate2009-cse co-and-architecture interrupts normal ugcnetjune2012iii
Answer ☟
Consider the following statements.
I. Daisy chaining is used to assign priorities in attending interrupts.

II. When a device raises a vectored interrupt, the CPU does polling to identify the source of interrupt.
III. In polling, the CPU periodically checks the status bits to know if any device needs its attention.
IV. During DMA, both the CPU and DMA controller can be bus masters at the same time.
Which of the above statements is/are TRUE?

A. Ⅰ and Ⅱ only
B. Ⅰ and Ⅳ only
C. Ⅰ and Ⅲ only
D. Ⅲ only
gate2020-cse co-and-architecture interrupts
Answer ☟
Answers: Interrupts
1.13.1 Interrupts: GATE CSE 1987 | Question: 1-viii top☝ ☛ https://gateoverflow.in/80274

Answer should be (D) i.e branches off to ISR after completing current instruction.
CPU checks the status bit of interrupt at the completion of each current instruction running when there is a interrupt it service the
interrupt using ISR.
https://gateoverflow.in/18581/isro2009-78
References
 17 votes -- Gate Mission (2.8k points)

Answer: B
A vectored interrupt is a processing technique in which the interrupting device directs the processor to the appropriate interrupt
service routine. This is in contrast to a polled interrupt system, in which a single interrupt service routine must determine the
source of the interrupt by checking all potential interrupt sources, a slow and relatively laborious process.

Answer is (A).
Options (B) and (D) are obviously false.
A processor checks for the interrupt before FETCHING an instruction, so option (C) is also false.
 44 votes -- Hardi Shah (281 points)


INTR is a signal which if enabled then microprocessor has interrupt enabled it receives high INR signal & activates
INTA signal, so another request can’t be accepted till CPU is busy in servicing interrupt. Hence (A) is correct option.
 22 votes -- Tejas Jaiswal (559 points)

In Programmed I/O, the CPU issues a command and waits for I/O operations to complete.
So here, CPU will wait for 1 sec to transfer 10 KB of data.
overhead in programmed I/O = 1 sec
In Interrupt mode , data is transferred word by word (here word size is 1 byte as mentioned in question
"Data is transferred byte-wise").
So to transfer 1 byte of data overhead is 4 × 10−6 sec
Thus to transfer 10 KB of data overhead is= 4 × 10−6 × 104 sec
1 1
Performance gain = −6 4
= = 25
4 × 10 × 10 4 × 10−2
Thus, (b) is correct answer.

 91 votes -- ashwini anand (231 points)
1.13.6 Interrupts: GATE CSE 2009 | Question: 8, UGCNET-June2012-III: 58 top☝ ☛ https://gateoverflow.in/1300

It will be (C).
After finishing the execution of each instruction the CPU reads the interrupt pins to recognize the interrupts.
INTR = 1 = Interrupt is present.(Service the Interrupt)
= 0 = Interrupt is not present.(Goto next Instruction fetch from user program)

Answer : C
I is true
The daisy-chaining method of establishing priority consists of a serial connection of all devices that request an interrupt.
The device with the highest priority is placed in the first position, followed by lower-priority devices up to the device
with the lowest priority, which is placed last in the chain.
II. is false
Vectored interrupts are achieved by assigning each interrupting device a unique code, typically four to eight bits in length.
When a device interrupts, it sends its unique code over the data bus to the processor, telling the processor which interrupt
service routine to execute.
III. is true
The process of periodically checking status bits to see if it is time for the next I/O operation, is called polling. Polling is
the simplest way for an I/O device to communicate with the processor the processor.
IV. is false
Since CPU release bus only after getting request from DMA and get after DMA release the BUS.

1.14 Io Handling (7) top☝
1.14.1 Io Handling: GATE CSE 1987 | Question: 2a top☝ ☛ https://gateoverflow.in/80572
State whether the following statements are TRUE or FALSE
In a microprocessor-based system, if a bus (DMA) request and an interrupt request arrive sumultaneously, the microprocessor
attends first to the bus request.
gate1987 co-and-architecture interrupts io-handling true-false
Answer ☟
1.14.2 Io Handling: GATE CSE 1987 | Question: 2b top☝ ☛ https://gateoverflow.in/80576
State whether the following statements are TRUE or FALSE:
Data transfer between a microprocessor and an I/O device is usually faster in memory-mapped-I/O scheme than in I/O-mapped -I/O
scheme.
gate1987 co-and-architecture io-handling true-false
Answer ☟
1.14.3 Io Handling: GATE CSE 1990 | Question: 4-ii top☝ ☛ https://gateoverflow.in/85390
State whether the following statements are TRUE or FALSE with reason:
The data transfer between memory and I/O devices using programmed I/O is faster than interrupt-driven I/O.
gate1990 true-false co-and-architecture io-handling interrupts
Answer ☟
1.14.4 Io Handling: GATE CSE 1996 | Question: 1.24 top☝ ☛ https://gateoverflow.in/2728
For the daisy chain scheme of connecting I/O devices, which of the following statements is true?
A. It gives non-uniform priority to various devices

B. It gives uniform priority to all devices
C. It is only useful for connecting slow devices to a processor device
D. It requires a separate interrupt pin on the processor for each device
gate1996 co-and-architecture io-handling normal
Answer ☟
1.14.5 Io Handling: GATE CSE 1996 | Question: 25 top☝ ☛ https://gateoverflow.in/2777
A hard disk is connected to a 50 MHz processor through a DMA controller. Assume that the initial set-up of a DMA transfer
takes 1000 clock cycles for the processor, and assume that the handling of the interrupt at DMA completion requires 500 clock
cycles for the processor. The hard disk has a transfer rate of 2000 Kbytes/sec and average block transferred is 4 K bytes. What
fraction of the processor time is consumed by the disk, if the disk is actively transferring 100% of the time?
gate1996 co-and-architecture io-handling dma numerical-answers normal
Answer ☟
The correct matching for the following pairs is:
(A) DMA I/O (1) High speed RAM

(A) DMA I/O (1) High speed RAM
(B) Cache (2) Disk
(C) Interrupt I/O (3) Printer
(D) Condition Code Register (4) ALU
A. A − 4 B−3 C −1 D−2
B. A − 2 B−1 C −3 D−4
C. A − 4 B−3 C −2 D−1
D. A − 2 B−3 C −4 D−1
gate1997 co-and-architecture normal io-handling
Answer ☟
1.14.7 Io Handling: GATE CSE 2008 | Question: 64, ISRO2009-13 top☝ ☛ https://gateoverflow.in/487
Which of the following statements about synchronous and asynchronous I/O is NOT true?
A. An ISR is invoked on completion of I/O in synchronous I/O but not in asynchronous I/O
B. In both synchronous and asynchronous I/O, an ISR (Interrupt Service Routine) is invoked after completion of the I/O
C. A process making a synchronous I/O call waits until I/O is complete, but a process making an asynchronous I/O call does not
wait for completion of the I/O
D. In the case of synchronous I/O, the process waiting for the completion of I/O is woken up by the ISR that is invoked after the
completion of I/O
gate2008-cse operating-system io-handling normal isro2009
Answer ☟
Answers: Io Handling
1.14.1 Io Handling: GATE CSE 1987 | Question: 2a top☝ ☛ https://gateoverflow.in/80572
The HOLD input has a higher priority than the INTR or NMI interrupt inputs.
So the answer is true.

 5 votes -- mystylecse (1.8k points)
1.14.2 Io Handling: GATE CSE 1987 | Question: 2b top☝ ☛ https://gateoverflow.in/80576
True
it will take extra time in IO mapped IO because of control signal.
 0 votes -- Arnabh Gangwar (307 points)
1.14.3 Io Handling: GATE CSE 1990 | Question: 4-ii top☝ ☛ https://gateoverflow.in/85390

False because in programmed I/O, CPU will check the I/O devices' status according to written program. Suppose
CPU requested 5 I/O devices and the program is written to check sequentially and 5 th device is ready before 2nd device, then
also CPU will come to check at its turn.
So, programmed I/O doesn't care about availability status of devices. it blindly works according to written program. That's why it
is slow.
Interrupt driven I/O : Here, if any device is ready then it won't wait for CPU, it will say to CPU that "I am ready" by sending
interrupt request and the delay here will be only "time taken in servicing the interrupt" which is less than programmed I/O.
So, the answer is FALSE.
 40 votes -- bharti (2.3k points)


Daisy chaining approach tells the processor in which order the interrupt should be handled by providing priority to the
devices.
In daisy-chaining method, all the devices are connected in serial. The device with the highest priority is placed in the first
position, followed by lower priority devices. The interrupt pin is common to all.
So answer is option (A).
 46 votes -- Ravi Singh (11.8k points)
1.14.5 Io Handling: GATE CSE 1996 | Question: 25 top☝ ☛ https://gateoverflow.in/2777

2000 KB is transferred in 1 second
4 KB transfer is (4/2000) ∗ 1000 ms = 2 ms
Total cycle required for locking and handling of interrupts after DMA transfer control
= (1000 + 500) clock cycle = 1500 clock cycle
Now, 50 Mhz = 50 ∗ 106 = 0.02 microsecond
So, (1500 ∗ 0.02) = 30 microsecond
30μs for initialization and termination and 2ms for data transfer.
The CPU time is consumed only for initialization and termination.
30μs
Fraction of CPU time consumed = = 0.015
(30μs + 2 ms)
 45 votes -- Pooja Palod (24.1k points)

Correct Option: B. A − 2, B − 1, C − 3, D − 4
A. DMA I/O 2. Disk

B. Cache 1. High-speed RAM
C. Interrupt I/O 3. Printer
D. Condition Code Register 4. ALU
Reason:
DMA I/O - For high speed, high volume data transfer from disk without affecting the processor(in most cases).
Cache-A high speed & low memory version of a RAM.
Interrupt I/O - The printer sends an interrupt signal when it is ready for use.
Condition Code Register - Part of the ALU, as a special purpose register, to store flag bits.
[Source - Google/Wikipedia]
 25 votes -- Siddharth Mahapatra (1.2k points)
1.14.7 Io Handling: GATE CSE 2008 | Question: 64, ISRO2009-13 top☝ ☛ https://gateoverflow.in/487

Answer is (B).
In synchronous I/O process performing I/O operation will be placed in blocked state till the I/O operation is completed. An ISR
will be invoked after the completion of I/O operation and it will place process from block state to ready state.
In asynchronous I/O, Handler function will be registered while performing the I/O operation. The process will not be placed in

the block state and process continues to execute the remaining instructions. when the I/O operation completed signal mechanism
is used to notify the process that data is available.
 72 votes -- gate_asp (615 points)
1.15 Machine Instructions (19) top☝
1.15.1 Machine Instructions: GATE CSE 1988 | Question: 9i top☝ ☛ https://gateoverflow.in/94384
The following program fragment was written in an assembly language for a single address computer with one accumulator
register:
LOAD B
MULT C
STORE T1
ADD A
STORE T2
MULT T2
ADD T1
STORE Z
Give the arithmetic expression implemented by the fragment.
gate1988 normal descriptive co-and-architecture machine-instructions
Answer ☟
1.15.2 Machine Instructions: GATE CSE 1994 | Question: 12 top☝ ☛ https://gateoverflow.in/2508
A. Assume that a CPU has only two registers R1 and R2 and that only the following instruction is available
XOR Ri , Rj ; {Rj ← Ri ⊕ Rj , for i, j = 1, 2}
Using this XOR instruction, find an instruction sequence in order to exchange the contents of the registers R1 and R2
B. The line p of the circuit shown in figure has stuck at 1 fault. Determine an input test to detect the fault.
gate1994 co-and-architecture machine-instructions normal descriptive
Answer ☟
Consider the following program fragment in the assembly language of a certain hypothetical processor. The processor has
three general purpose registers R1, R2and R3. The meanings of the instructions are shown by comments (starting with ;) after the
instructions.
X: CMP R1, 0; Compare R1 and 0, set flags appropriately in status register
JZ Z; Jump if zero to target Z
MOV R2, R1; Copy contents of R1 to R2
SHR R1; Shift right R1 by 1 bit
SHL R1; Shift left R1 by 1 bit
CMP R2, R1; Compare R2 and R1 and set flag in status register
JZ Y; Jump if zero to target Y
INC R3; Increment R3 by 1;
Y: SHR R1; Shift right R1 by 1 bit
JMP X; Jump to target X
Z:...
A. Initially R1, R2 and R3 contain the values 5,0 and 0 respectively, what are the final values of R1 and R3 when control
reaches Z?
B. In general, if R1, R2 and R3 initially contain the values n, 0, and 0 respectively. What is the final value of R3 when control
reaches Z ?
gate1999 co-and-architecture machine-instructions normal descriptive

Answer ☟
Consider the following assembly language program for a hypothetical processor A, B, and C are 8− bit registers. The
meanings of various instructions are shown as comments.
MOV B, #0 ; B←0
MOV C, #8 ; C←8
Z: CMP C, #0 ; compare C with 0
JZ X ; jump to X if zero flag is set
SUB C, #1 ; C ← C −1
RRC A, #1 ; right rotate A through carry by one bit. Thus:
; If the initial values of A and the carry flag are a7 … a0 and
; c0 respectively, their values after the execution of this
; instruction will be c0 a7 … a1 and a0 respectively.
JC Y ; jump to Y if carry flag is set
JMP Z ; jump to Z
Y: ADD B, #1 ; B ← B+1
JMP Z ; jump to Z
X: ;
If the initial value of register A is A0 the value of register B after the program execution will be
A. the number of 0 bits in A0
B. the number of 1 bits in A0
C. A0
D. 8
gate2003-cse co-and-architecture machine-instructions normal
Answer ☟
Consider the following assembly language program for a hypothetical processor A, B, and C are 8 bit registers. The meanings
of various instructions are shown as comments.
MOV B, #0 ;B ← 0
MOV C, #8 ;C ← 8
Z: CMP C, #0 ; compare C with 0
JZ X ; jump to X if zero flag is set
C←C
SUB C, #1 ;
−1
RRC A, #1 ; right rotate A through carry by one bit. Thus:
; If the initial values of A and the carry flag are a7 . . a0 and
; c0 respectively, their values after the execution of this
; instruction will be c0 a7 . . a1 and a0 respectively.
JC Y ; jump to Y if carry flag is set
JMP Z ; jump to Z
B←B
Y: ADD B, #1 ;
+1
JMP Z ; jump to Z

X:
Which of the following instructions when inserted at location X will ensure that the value of the register A after program execution
is as same as its initial value?
A. RRC A, #1
B. NOP; no operation
C. LRC A,#1; left rotate A through carry flag by one bit
D. ADD A,#1
Answer ☟
Consider the following program segment for a hypothetical CPU having three user registers R1 , R2 and R3 .
Instruction Operation Instruction size

(in words)
MOV R1 , 5000 R1 ← Memory[5000] 2
MOV R2, (R1) R2 ← Memory[(R1 )] 1
ADD R2 , R3 R2 ← R2 + R3 1
MOV 6000, R2 Memory[6000] ← R2 2
HALT Machine Halts 1
Consider that the memory is byte addressable with size 32 bits, and the program has been loaded starting from memory location
1000 (decimal). If an interrupt occurs while the CPU has been halted after executing the HALT instruction, the return address (in
decimal) saved in the stack will be
A. 1007
B. 1020
C. 1024
D. 1028
Answer ☟
Consider the following program segment for a hypothetical CPU having three user registers R1 , R2 and R3 .
Instruction Operation Instruction size (in Words)

MOV R1 , 5000 R1 ← Memory[5000] 2
MOV R2 (R1 ) R2 ← Memory[(R1 )] 1
ADD R2 , R3 R2 ← R2 + R3 1
MOV 6000, R2 Memory[6000] ← R2 2
Halt Machine Halts 1
Let the clock cycles required for various operations be as follows:
Register to/from memory transfer 3 clock cycles

ADD with both operands in register 1 clock cycles
Instruction fetch and decode 2 clock cycles
The total number of clock cycles required to execute the program is

A. 29
B. 24
C. 23
D. 20

Answer ☟
1.15.8 Machine Instructions: GATE CSE 2006 | Question: 09, ISRO2009-35 top☝ ☛ https://gateoverflow.in/888
A CPU has 24-bit instructions. A program starts at address 300 (in decimal). Which one of the following is a legal program
counter (all values in decimal)?
A. 400
B. 500
C. 600
D. 700
gate2006-cse co-and-architecture machine-instructions easy isro2009
Answer ☟
In a simplified computer the instructions are:
OP Rj , Ri Perform Rj OP Ri and store the result in register Rj

OP m, Ri Perform val OP Ri and store the result in register Ri
val denotes the content of the memory location m
MOV m, Ri Moves the content of memory location m to register Ri
MOV Ri , m Moves the content of register Ri to memory location m
The computer has only two registers, and OP is either ADD or SUB. Consider the following basic block:
t1 = a+b
t2 = c+d
t3 = e − t2
t4 = t1 – t3
Assume that all operands are initially in memory. The final value of the computation should be in memory. What is the minimum
number of MOV instructions in the code generated for this basic block?
A. 2
B. 3
C. 5
D. 6
Answer ☟
Consider the following program segment. Here R1, R2 and R3 are the general purpose registers.
Instruction Operation Instruction Size

(no. of words)
MOV R1,(3000) R1← M[3000] 2
LOOP: MOV R2,(R3) R2← M[R3] 1
ADD R2,R1 R2← R1 + R2 1
MOV (R3),R2 M[R3]← R2 1
INC R3 R3← R3 + 1 1
DEC R1 R1← R1 – 1 1
BNZ LOOP Branch on not zero 2
HALT Stop 1

Assume that the content of memory location 3000 is 10 and the content of the register R3 is 2000 . The content of each of the
memory locations from 2000 to 2010 is 100. The program is loaded from the memory location 1000 . All the numbers are in
decimal.
Assume that the memory is word addressable. The number of memory references for accessing the data in executing the program
completely is
A. 10
B. 11
C. 20
D. 21
gate2007-cse co-and-architecture machine-instructions interrupts normal
Answer ☟

(no. of words)
MOV R1,(3000) R1← M[3000] 2
ADD R2,R1 R2← R1 + R2 1
MOV (R3),R2 M[R3]← R2 1
INC R3 R3← R3 + 1 1
DEC R1 R1← R1 – 1 1
HALT Stop 1
Assume that the content of memory location 3000 is 10 and the content of the register R3 is 2000 . The content of each of the
memory locations from 2000 to 2010 is 100. The program is loaded from the memory location 1000 . All the numbers are in
decimal.
Assume that the memory is word addressable. After the execution of this program, the content of memory location 2010 is:
A. 100
B. 101
C. 102
D. 110
Answer ☟

(no. of words)
MOV R1,(3000) R1← M[3000] 2
ADD R2,R1 R2← R1 + R2 1
MOV (R3),R2 M[R3]← R2 1
INC R3 R3← R3 + 1 1
DEC R1 R1← R1 – 1 1
HALT Stop 1
Assume that the content of memory location 3000 is 10 and the content of the register R3 is 2000. The content of each of the
memory locations from 2000 to 2010 is 100. The program is loaded from the memory location 1000. All the numbers are in

decimal.
Assume that the memory is byte addressable and the word size is 32 bits. If an interrupt occurs during the execution of the
instruction “INC R3”, what return address will be pushed on to the stack?
A. 1005
B. 1020
C. 1024
D. 1040
Answer ☟
Which of the following must be true for the RFE (Return From Exception) instruction on a general purpose processor?
I. It must be a trap instruction

II. It must be a privileged instruction
III. An exception cannot be allowed to occur during execution of an RFE instruction
A. I only
B. II only
C. I and II only
D. I, II and III only
Answer ☟
1.15.14 Machine Instructions: GATE CSE 2015 Set 2 | Question: 42 top☝ ☛ https://gateoverflow.in/8215
Consider a processor with byte-addressable memory. Assume that all registers, including program counter (PC) and Program
Status Word (PSW), are size of two bytes. A stack in the main memory is implemented from memory location (0100)16 and it
grows upward. The stack pointer (SP) points to the top element of the stack. The current value of SP is (016E)16 . The CALL
instruction is of two words, the first word is the op-code and the second word is the starting address of the subroutine (one word = 2
bytes). The CALL instruction is implemented as follows:
Store the current value of PC in the stack

Store the value of PSW register in the stack
Load the statring address of the subroutine in PC
The content of PC just before the fetch of a CALL instruction is (5FA0)16 . After execution of the CALL instruction, the value of
the stack pointer is:
A. (016A)16
B. (016C)16
C. (0170)16
D. (0172)16
gate2015-cse-set2 co-and-architecture machine-instructions easy
Answer ☟
A processor has 40 distinct instruction and 24 general purpose registers. A 32-bit instruction word has an opcode, two
registers operands and an immediate operand. The number of bits available for the immediate operand field is_______.
gate2016-cse-set2 machine-instructions co-and-architecture easy numerical-answers
Answer ☟

Consider the following instruction sequence where registers R1, R2 and R3 are general purpose and MEMORY[X] denotes
the content at the memory location X.
Instruction Semantics Instruction Size (bytes)

MOV R1, (5000) R1 ← MEMORY[5000] 4
MOV R2, (R3) R2← MEMORY[R3] 4
ADDR2, R1 R2 ← R1 + R2 2
MOV (R3), R2 MEMORY[R3]← R2 4
INC R3 R3 ← R3 + 1 2
DEC R1 R1 ← R1 − 1 2
BNZ 1004 Branch if not zero to the 2
given absolute address
HALT Stop 1
Assume that the content of the memory location 5000 is 10, and the content of the register R3 is 3000 . The content of each of the
memory locations from 3000 to 3020 is 50. The instruction sequence starts from the memory location 1000 . All the numbers are in
decimal format. Assume that the memory is byte addressable.
After the execution of the program, the content of memory location 3010 is ____________
gate2021-cse-set1 co-and-architecture machine-instructions numerical-answers
Answer ☟
1.15.17 Machine Instructions: GATE IT 2004 | Question: 46 top☝ ☛ https://gateoverflow.in/3689
If we use internal data forwarding to speed up the performance of a CPU (R1, R2 and R3 are registers and M[100] is a
memory reference), then the sequence of operations
R1 → M[100]
M[100] → R2
M[100] → R3
can be replaced by
A. R1 → R3
R2 → M[100]
B. M[100] → R2
R1 → R2
R1 → R3
C. R1 → M[100]
R2 → R3
D. R1 → R2
R1 → R3
R1 → M[100]
gate2004-it co-and-architecture machine-instructions easy
Answer ☟
Following table indicates the latencies of operations between the instruction producing the result and instruction using the
result.
Instruction producing the result Instruction using the result Latency

ALU Operation ALU Operation 2
ALU Operation Store 2
Load ALU Operation 1
Load Store 0
Consider the following code segment:

Load R1, Loc 1; Load R1 from memory location Loc1

Load R2, Loc 2; Load R2 from memory location Loc 2
Add R1, R2, R1; Add R1 and R2 and save result in R1
Dec R2; Decrement R2
Dec R1; Decrement R1
Mpy R1, R2, R3; Multiply R1 and R2 and save result in R3
Store R3, Loc 3; Store R3 in memory location Loc 3
What is the number of cycles needed to execute the above code segment assuming each instruction takes one cycle to execute?
A. 7
B. 10
C. 13
D. 14
gate2007-it co-and-architecture machine-instructions normal
Answer ☟
Assume that EA = (X)+ is the effective address equal to the contents of location X, with X incremented by one word length
after the effective address is calculated; EA = −(X) is the effective address equal to the contents of location X, with X decremented
by one word length before the effective address is calculated; EA = (X)− is the effective address equal to the contents of location X,
with X decremented by one word length after the effective address is calculated. The format of the instruction is (opcode, source,
destination), which means (destination ← source op destination). Using X as a stack pointer, which of the following instructions can
pop the top two elements from the stack, perform the addition operation and push the result back to the stack.
A. ADD (X)−, (X)
B. ADD (X), (X)−
C. ADD −(X), (X)+
D. ADD −(X), (X)
gate2008-it co-and-architecture machine-instructions normal
Answer ☟
Answers: Machine Instructions
1.15.1 Machine Instructions: GATE CSE 1988 | Question: 9i top☝ ☛ https://gateoverflow.in/94384

Z = [BC + A]2 + (BC)
 12 votes -- Yash wadhwani (831 points)

R1 ← R1 XOR R2
R2 ← R2 XOR R1
R1 ← R1 XOR R2
Stuck at 0 fault:: A stuck-at fault is a particular fault model used by fault simulators and automatic test pattern generation
(ATPG) tools to mimic a manufacturing defect within an integrated circuit. Individual signals and pins are assumed to be stuck
at Logical ′ 1′ ,′ 0′ , and ′ X ′ .
Stuck at line 1 means no matter whatever input we provide to the line the output is always 1.
Now, when we take A = 1, B = 1, C = 1 the required output, if no fault is there should be 1.
But, since line p is stuck at logic 1 final output of A NAND B will be 1 only .So, final circuit output becomes 0.(which is wrong
)
Reference : https://www.tutorialspoint.com/digital_electronics/stuckat1_fault_in_logic_circuit.asp
References

 20 votes -- Vidhi Sethi (8.3k points)

SHR R1 (Lower bit is lost and upper bit becomes 0 and all other bits shift right by 1)
SHL R1 (Upper bit is lost and lower bit becomes 0 and all other bits shift left by 1)
These two operations change the value of R1 if its lower bit is 1. So, the given program checks the lowest bit of R1 in each
iteration and if its 1 then only increment R3 and loop terminates when R1 becomes 0. Thus at end, R3 will have the count of the
number of bits set to 1 in R1.
a. R1 = 0, R3 = 2 as 101 has two 1's
b. R3 = #1 in R1.

Option (B). The code is counting the number of 1 bits in A0 . When a 1 is moved to carry, B is incremented.

Option (A)RRC a, #1. As the 8 bit register is rotated via carry 8 times.
a7 a6 a5 a4 a3 a2 a1 a0
c0 a7 a6 a5 a4 a3 a2 a1 , now a0 is the new carry. So, after next rotation,
a0 c0 a7 a6 a5 a4 a3 a2
So, after 8 rotations, a6 a5 a4 a3 a2 a1 a0 c0 and carry is a7 .
Now, one more rotation will restore the original value of A0 .

Option is D.
Word size is 32 bits (4 bytes ). Interrupt occurs after execution of HALT instruction NOT during, So address of next
instruction will be saved on to the stack which is 1028 .
(We have 5 instructions starting from address 1000 , each of size 2, 1, 1, 2, 1 totaling 7 words = 7 ∗ 4 = 28 bytes ).
1000 + 28 = 1028 ,
1028 is the starting address of NEXT Instruction .
After HALT instruction CPU enters a HALT state and if an interrupt happens the return address will be that of the instruction
after the HALT.
References :
1. https://x86.puri.sm/html/file_module_x86_id_134.html [ X86 Instructors Manual ]

2. http://electronics.stackexchange.com/questions/277735/what-happens-if-the-interrupt-occurs-during-the-execution-of-halt-
instruction
3. https://en.wikipedia.org/wiki/HLT_(x86_instruction)

References

B. 24 cycles
Instruction Size Fetch and Decode + Execute

MOV 2 2×2+3 = 7
MOV 1 2×1+3 = 5
ADD 1 2×1+1 = 3
MOV 2 2×2+3 = 7
HALT 1 2×1+0 = 2
Total 24 Cycles
1.15.8 Machine Instructions: GATE CSE 2006 | Question: 09, ISRO2009-35 top☝ ☛ https://gateoverflow.in/888

Option (C). 24 bits = 3 bytes instructions. So, PC will have multiples of 3 in it.
 40 votes -- anshu (2.7k points)

MOV a, R1
ADD b, R1
MOV c, R2
ADD d, R2
SUB e, R2
SUB R1 , R2
MOV R2 , m
Total number of MOV instructions = 3

Correct Answer: B

Loop is executed 10 times and there are two memory reference in the loop (each MOV
is loading 1 word, so 1 memory reference for each MOV inside the loop). So number
of memory reference inside loop is
2(MOV) × 10 (times iteration) × 1(1 word access/ MOV) = 20 memory accesses.
One memory access is outside the loop for the first instruction
MOV R1, (3000)
So, totally 20 + 1 = 21
Correct Answer: D


The loop runs 10 times.
1. When R1 = 10, Memory[2000] = 110,

2. When R1 = 9, Memory[2001] = 109,
3. When R1 = 8, Memory[2002] = 108,
4. When R1 = 7, Memory[2003] = 107,
5. When R1 = 6, Memory[2004] = 106,
6. When R1 = 5, Memory[2005] = 105,
7. When R1 = 4, Memory[2006] = 104,
8. When R1 = 3, Memory[2007] = 103,
9. When R1 = 2, Memory[2008] = 102,
10. When R1 = 1, Memory[2009] = 101,
When R1 = 0 the loop breaks., Memory[2010] = 100

Correct Answer: A
 54 votes -- Arnab Bhadra (3.7k points)

An interrupt is checked for after the execution of the current instruction and the contents of PC
(address of next instruction to be executed) is pushed on to stack.
(2 + 1 + 1 + 1) × 32
Here, address of INC, R3 = 1000 + = 1020 and
8
next instruction address = 1020 + 4 = 1024 which is pushed on to stack.
Reference: http://www.ece.utep.edu/courses/web3376/Notes_files/ee3376-interrupts_stack.pdf
Correct Answer: C
References

RFE (Return From Exception) is a privileged trap instruction that is executed when exception occurs, so an exception is
not allowed to execute. (D) is the correct option.
Reference: http://www.cs.rochester.edu/courses/252/spring2014/notes/08_exceptions
References

First we have to consider here memory is byte-addressable

The CALL instruction is implemented as follows:
Store the current value of PC in the stack
PC is 2 bytes it means when we store pc in stack it will increase by 2

So current value of SP is (016E)16 + 2
Store the value of PSW register in the stack

PSW is 2 byte it means when we store psw in stack it will increase by 2
So current value of SP is (016E)16 + 2 + 2 = (0172)16
Correct Answer: D
 71 votes -- Anoop Sonkar (4.1k points)

Instruction Opcode Size = log2 40 = 6
Register operand size = log2 24 = 5
Total bits available = 32
Bits required for opcode + two register operands = 6 + 2 × 5 = 16
Bits available for immediate operand = 32 − 16 = 16.


The given code is iterating 10 times and incrementing the contents of locations 3000 to 3000 + i by 10 − i, for i < 10.
Location 3010 is left untouched.
So, correct answer: 50.


Data forwarding means if CPU writes to a memory location and subsequently reads from the same memory location, the
second instruction can fetch the value directly from the register used to do the write than waiting for the memory. So, this
increases the performance.
Here, choices A, B and C doesn't really make any sense as the data was in R1 and it must be moved to R2, R3 and M[100]. So,
(D) is the answer.

In the given question there are 7 instructions each of which takes 1 clock cycle to complete. (Pipelining may be used)
If an instruction is in execution phase and any other instructions can not be in the execution phase. So, at least 7 clock cycles will
be taken.
Now, it is given that between two instructions latency or delay should be there based on their operation. Ex- 1st line of the table
says that between two operations in which first is producing the result of an ALU operation and the 2nd is using the result there
should be a delay of 2 clock cycles.
Clock cycle T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13

I1 I2 I3 I4 I5 I6 I7
1. Load R1, Loc 1; Load R1 from memory location Loc1

Takes 1 clock cycle, simply loading R1 on loc1.

2. Load R2, Loc 2; Load R2 from memory location Loc2
Takes 1 clock cycle, simply loading r2 on loc2.
3. (Add R1, R2, R1; Add R1 and R2 and save result in R1

R1=R1+R2;
Hence, this instruction is using the result of R1 and R2, i.e. result of Instruction 1 and Instruction 2.
As instruction 1 is load operation and instruction 3 is ALU operation. So, there should be a delay of 1 clock cycle between
instruction 1 and instruction 3, which is already there due to I2.
As instruction 2 is load operation and instruction 3 is ALU operation. So, there should be a delay of 1 clock cycle between
instruction 2 and instruction 3.
4. Dec R2; Decrement R2

This instruction is dependent on instruction 2 and there should be a delay of one clock cycle between Instruction 2 and
Instruction 4. As instruction 2 is load and 4 is ALU, which is already there due to Instruction 3.
5. Dec R1 Decrement R1
This instruction is dependent on Instruction 3
As Instruction I3 is ALU and I5 is also ALU so a delay of 2 clock cycles should be there between them of which 1 clock
cycle delay is already there due to I4 so one clock cycle delay between I4 and I5.
6. MPY R1, R2, R3; Multiply R1 and R2 and save result in R3

R3=R1*R2;
This instruction uses the result of Instruction 5, as both instruction 5 and 6 are ALU so there should be a delay of 2 clock
cycles.
7. Store R3, Loc 3 Store R3 in memory location Loc3
This instruction is dependent on instruction 6 which is ALU and instruction 7 is store so there should be a delay of 2 clock
cycles between them.
Hence, a total of 13 clock cycles will be there.

Correct Answer: C
 146 votes -- Madhab Paul Choudhury (2.8k points)

It should be A as 998 ← 1000 + 998. ( I am writing only memory locations for sake of brevity )
Lets say SP is 1000 initially then after it calculates the EA of source (which is 1000 as it decrements after the EA) the
destination becomes 998 and that is where we want to store the result as stack is decrementing.
In case of C and D it becomes 998 ← 998 + 998.

 29 votes -- Shaun Patel (6.1k points)
1.16 Memory Interfacing (2) top☝
1.16.1 Memory Interfacing: GATE CSE 2016 Set 1 | Question: 09 top☝ ☛ https://gateoverflow.in/39632
A processor can support a maximum memory of 4GB, where the memory is word-addressable (a word consists of two bytes).
The size of address bus of the processor is at least _________bits.
gate2016-cse-set1 co-and-architecture easy numerical-answers memory-interfacing
Answer ☟
1.16.2 Memory Interfacing: GATE CSE 2018 | Question: 23 top☝ ☛ https://gateoverflow.in/204097
A 32-bit wide main memory unit with a capacity of 1 GB is built using 256 M × 4 − bit DRAM chips. The number of rows
of memory cells in the DRAM chip is 214 . The time taken to perform one refresh operation is 50 nanoseconds . The refresh period
i s 2 milliseconds. The percentage (rounded to the closest integer) of the time available for performing the memory read/write
operations in the main memory unit is_____.
gate2018-cse co-and-architecture memory-interfacing normal numerical-answers
Answer ☟

Answers: Memory Interfacing
1.16.1 Memory Interfacing: GATE CSE 2016 Set 1 | Question: 09 top☝ ☛ https://gateoverflow.in/39632

Size of Memory = No of words (Addresses) × No of bits per word
32
2 B = No of words (Addresses) × 2B
No of words (Addresses) = 231
Number of Address lines = 31
 54 votes -- Praveen Saini (41.9k points)
1.16.2 Memory Interfacing: GATE CSE 2018 | Question: 23 top☝ ☛ https://gateoverflow.in/204097

One refresh operation takes 50ns.
Total number of rows = 214
Total time to refresh all Rows = 214 × 50 ns = 819200 ns = 0.819200 ms

The Refresh Period is 2ms.
Total time to Refresh all Rows

% Time spent in refresh = Refresh period
∗ 100
0.8192ms
= 2.0ms ∗ 100 = 40.96%
% Time spent in Read/Write = 100 − 40.96 = 59.04%
= 59% (Rounded to the closest Integer)

Reference: https://en.wikipedia.org/wiki/Memory_refresh
References
1.17 Microprogramming (12) top☝
1.17.1 Microprogramming: GATE CSE 1987 | Question: 4a top☝ ☛ https://gateoverflow.in/81359
Find out the width of the control memory of a horizontal microprogrammed control unit, given the following specifications:
16 control lines for the processor consisting of ALU and 7 registers.

Conditional branching facility by checking 4 status bits.
Provision to hold 128 words in the control memory.
gate1987 co-and-architecture microprogramming descriptive
Answer ☟
1.17.2 Microprogramming: GATE CSE 1996 | Question: 2.25 top☝ ☛ https://gateoverflow.in/2754
A micro program control unit is required to generate a total of 25 control signals. Assume that during any micro instruction, at
most two control signals are active. Minimum number of bits required in the control word to generate the required control signals
will be:
A. 2
B. 2.5
C. 10
D. 12
gate1996 co-and-architecture microprogramming normal

Answer ☟
A micro instruction is to be designed to specify:
a. none or one of the three micro operations of one kind and

b. none or upto six micro operations of another kind
The minimum number of bits in the micro-instruction is:

A. 9
B. 5
C. 8
Answer ☟
Arrange the following configuration for CPU in decreasing order of operating speeds:
Hard wired control, Vertical microprogramming, Horizontal microprogramming.
A. Hard wired control, Vertical microprogramming, Horizontal microprogramming.

B. Hard wired control, Horizontal microprogramming, Vertical microprogramming.
C. Horizontal microprogramming, Vertical microprogramming, Hard wired control.
D. Vertical microprogramming, Horizontal microprogramming, Hard wired control.
Answer ☟
Horizontal microprogramming:
A. does not require use of signal decoders

B. results in larger sized microinstructions than vertical microprogramming
C. uses one bit for each control signal
D. all of the above
gate2002-cse co-and-architecture microprogramming
Answer ☟
1.17.6 Microprogramming: GATE CSE 2004 | Question: 67 top☝ ☛ https://gateoverflow.in/1061
The microinstructions stored in the control memory of a processor have a width of 26 bits. Each microinstruction is divided
into three fields: a micro-operation field of 13 bits, a next address field (X), and a MUX select field (Y ). There are 8 status bits in
the input of the MUX.

How many bits are there in the X and Y fields, and what is the size of the control memory in number of words?
A. 10, 3, 1024
B. 8, 5, 256
C. 5, 8, 2048
D. 10, 3, 512
gate2004-cse co-and-architecture microprogramming normal
Answer ☟
Consider the following sequence of micro-operations.
MBR ← PC MAR ← X PC ← Y Memory ← MBR
Which one of the following is a possible operation performed by this sequence?

A. Instruction fetch
B. Operand fetch
C. Conditional branch
D. Initiation of interrupt service
gate2013-cse co-and-architecture microprogramming normal
Answer ☟
1.17.8 Microprogramming: GATE IT 2004 | Question: 49 top☝ ☛ https://gateoverflow.in/3692
A CPU has only three instructions I1, I2 and I3, which use the following signals in time steps T 1 − T 5 :
I1 : T 1 : Ain, Bout, Cin
T 2 : PCout, Bin
T 3 : Zout, Ain
T 4 : Bin, Cout
T 5 : End
I2 : T 1 : Cin, Bout, Din
T 2 : Aout, Bin
T 3 : Zout, Ain
T 4 : Bin, Cout
T 5 : End
I3 : T 1 : Din, Aout
T 2 : Ain, Bout
T 3 : Zout, Ain
T 4 : Dout, Ain

T 5 : End
Which of the following logic functions will generate the hardwired control for the signal Ain ?
A. T 1.I1 + T 2.I3 + T 4.I3 + T 3
B. (T 1 + T 2 + T 3). I3 + T 1.I1
C. (T 1 + T 2). I1 + (T 2 + T 4). I3 + T 3
D. (T 1 + T 2). I2 + (T 1 + T 3). I1 + T 3
gate2004-it co-and-architecture microprogramming normal
Answer ☟
A hardwired CPU uses 10 control signals S1 to S10 , in various time steps T1 to T5 , to implement 4 instructions I1 to I4 as
shown below:
T1 T2 T3 T4 T5
I1 S1 , S3 , S5 S2 , S4 , S6 S1 , S7 S10 S3 , S8
I2 S1 , S3 , S5 S8 , S9 , S10 S5 , S6 S7 S6 S10
I3 S1 , S3 , S5 S7 , S8 , S10 S2 , S6 , S9 S10 S1 , S3
I4 S1 , S3 , S5 S2 , S6 , S7 S5 , S10 S6 , S9 S10
Which of the following pairs of expressions represent the circuit for generating control signals S5 and S10 respectively?
((Ij + Ik )Tn indicates that the control signal should be generated in time step Tn if the instruction being executed is Ij or lk )
A. S5 = T1 + I2 ⋅ T3 and
S10 = (I1 + I3 ) ⋅ T4 + (I2 + I4 ) ⋅ T5
B. S5 = T1 + (I2 + I4 ) ⋅ T3 and
S10 = (I1 + I3 ) ⋅ T4 + (I2 + I4 ) ⋅ T5
C. S5 = T1 + (I2 + I4 ) ⋅ T3 and
S10 = (I2 + I3 + I4 ) ⋅ T2 + (I1 + I3 ) ⋅ T4 + (I2 + I4 ) ⋅ T5
D. S5 = T1 + (I2 + I4 ) ⋅ T3 and
S10 = (I2 + I3 ) ⋅ T2 + I4 ⋅ T3 + (I1 + I3 ) ⋅ T4 + (I2 + I4 ) ⋅ T5
Answer ☟
An instruction set of a processor has 125 signals which can be divided into 5 groups of mutually exclusive signals as follows:
Group 1 : 20 signals, Group 2 : 70 signals, Group 3 : 2 signals, Group 4 : 10 signals, Group 5 : 23 signals.
How many bits of the control words can be saved by using vertical microprogramming over horizontal microprogramming?
A. 0
B. 103
C. 22
D. 55
Answer ☟
The data path shown in the figure computes the number of 1s in the 32 − bit input word corresponding to an unsigned even
integer stored in the shift register.
The unsigned counter, initially zero, is incremented if the most significant bit of the shift register is 1.

The microprogram for the control is shown in the table below with missing control words for microinstructions I1 , I2 , … In .
Microinstruction Reset_Counter Shift_left Load_output

BEGIN 1 0 0
I1 ? ? ?
: : : :
In ? ? ?
END 0 0 1
The counter width (k), the number of missing microinstructions (n), and the control word for microinstructions I1 , I2 , … In are,
respectively,
A. 32, 5, 010
B. 5, 32, 010
C. 5, 31, 011
D. 5, 31, 010
Answer ☟
Consider a CPU where all the instructions require 7 clock cycles to complete execution. There are 140 instructions in the
instruction set. It is found that 125 control signals are needed to be generated by the control unit. While designing the horizontal
microprogrammed control unit, single address field format is used for branch control logic. What is the minimum size of the control
word and control address register?
A. 125, 7
B. 125, 10
C. 135, 9
D. 135, 10
Answer ☟
Answers: Microprogramming
1.17.1 Microprogramming: GATE CSE 1987 | Question: 4a top☝ ☛ https://gateoverflow.in/81359
Width of the control memory = size of a control word
Control word = [Condition bits + control signal bits + next address]

Condition bits = 4 ⟹ log2 4 -bits
128 words ⟹ 7-bits for address.
In the case of horizontal micro-programming, there is 1- bit for each control signal. It is given there are 16 control signals.

⟹ Length of control word = 2 + 16 + 7 = 25 - bits.
 18 votes -- agoh (1.2k points)

The best sense I can make of this question is that you want to transmit up to 2 simultaneous signals out of a choice of 25,
and ask how many bits you need for that.
One solution would be to have 2 groups of 5 − bits , each can send one of 31 signals (or the absence of signal). But it is not
optimal. The number of different states is
1(no signal) + 25(one signal) + (25 × 24/2)(two signals) = 326 states.
You can transmit any of these states over 9 − bits . But it is more complex to encode/ decode, adding an extra bit would
probably cost less.
Hence C is correct option.
Reference: https://www.ocf.berkeley.edu/~wwu/cgi-bin/yabb/YaBB.cgi?board=riddles_cs;action=display;num=1354778770
References
 30 votes -- vnc (2.1k points)

Actually the given question incorporates the concept of horizontal μprogramming (also known as decoded form of
control signals) and vertical μprogramming (also known as encoded form of control signals)
The (a) part says :
' none or one of the three micro operations of one kind
This is referred to encoding form of vertical one since at most one signal can be active in vertical microprogramming since it
involves use of external decoder to select one control signal out of the given control signals..
No of bits required for vertical microprogramming given n number of control signals = ⌈(log2 n)⌉
Here, n = 3
So, no of bits required for part (a) = ⌈(log2 3)⌉ = 2
Now coming to (b) part , it says :
' none or upto six micro operations of another kind
at maximum we can have at most 6 microoperations of another kind at a time. To accommodate that we need decoded form of
control signals which is horizontal signals.
So, no of bits required for (b) part
= No of control signals of (b) kind = 6
Therefore overall bits required to accommodate both (a) and (b),
= 2 + 6 = 8 − bits
Besides this, address field, flags etc are also there in a control word. That is why it is asked in the question :

' minimum number of bits in the micro-instruction required
Hence, minimum no of bits required = 8 − bits

C is the correct answer.
 60 votes -- HABIB MOHAMMAD KHAN (67.5k points)

Hard wired control involves only hardware, whereas microprogramming is software approach. So, hardwire control
should be faster than both microprogramming approaches.
Between vertical and horizontal microprogramming. Horizontal is faster because in this control signals are not encoded whereas
in vertical microprogramming to save memory signals are encoded. So, it takes less time in horizontal microprogramming
because decoding of signals is not required. Therefore, final order is :
hard wired control > horizontal microprogramming > vertical microprogramming
Correct Answer: B
 33 votes -- Shikhar Vashishth (3.1k points)

Option (D). All statements are true.
Reference: http://http://www.cs.virginia.edu/~cs333/notes/microprogramming.pdf
References
 27 votes -- Suvojit Mondal (291 points)

x + y + 13 = 26 → (1)
y = 3 (y) is no of bits used to represent 8 different states of multiplexer → (2)
x is no of bits required represent size of control memory
x = 10 from (1) and (2)
∴ Size of control memory = 2x = 210 = 1024
Correct Answer: A

Here PC value is being stored in memory which is done when either CALL RETURN involved or there is Interrupt. As,
we will have to come back to execute current instruction.
So, options (A), (B) are clearly incorrect.
Option (C) is incorrect because conditional branch does not require to save PC contents.
Option (D) is correct as it matches the generic Interrupt Cycle :

Interrupt Cycle:
t1 : MBR ← (PC)
t2 : MAR ← (save-address)
PC ← (routine-address)
t3 : Memory ← (MBR)

We just have to see which all options give 1 whenever Ain is 1 and 0 otherwise.
So, Ain is 1 in T 3 of I1, I2 and I3. Also during T 1 of I1, and T 2 and T 4 of I3. So, answer will be
T 1.I1 + T 2.I3 + T 4.I3 + T 3.I1 + T 3.I2 + T 3.I3
Since CPU is having only 3 instructions, T 3.I1 + T 3.I2 + T 3.I3 can be replaced with T 3 (we don't need to see which
instruction and Ain will be activated in time step 3 of all the instructions).
So, T 1.I1 + T 2.I3 + T 4.I3 + T 3
The answer is Option (A).

D. is the correct option for this question.
If we look at the table, we need to find those time-stamps and instructions which are using these control signals.
For example, S5 = T1 has used control signal S5 for all the instructions, or we can say irrespective of the instructions. Also, S5
is used by instructions I2 and I4 for the time stamp T3 so that comes to:
S5 = T1 + I2 ⋅ T3 + I4 ⋅ T3 = T1 + (I2 + I4 ) ⋅ T3
In the same way, we'll calculate for S10 .

It's an example of Hardwired CU Programming used in RISC processors. It gives accurate result, but isn't good for debugging
since minor change will cause to restructure the control unit.

In horizontal microprogramming we need 1 bit for every control word, therefore total bits in
Horizontal Microprogramming = 20 + 70 + 2 + 10 + 23 = 125
Now lets consider vertical microprogramming, In vertical microprogramming we use Decoder (n to 2n ) and output lines are
equal to number of control words. A input is given according to what control word we have to select.
Now in this question these 5 groups contains mutually exclusive signals, i.e, they can be activated one at a time for a given
group, we can safely use decoder.
group 1 = ⌈log2 20⌉ = 5 (Number of input bits for decoder, given output
is number of control word in given group)
group 2 = ⌈log2 70⌉ = 7
group 3 = ⌈log2 2⌉ = 1
group 4 = ⌈log2 10⌉ = 4
group 5 = ⌈log2 23⌉ = 5
Total bits required in vertical microprogramming = 5 + 7 + 1 + 4 + 5 = 22
So number of control words saved = 125 − 22 = 103
Hence, (B) is answer.

 47 votes -- Prateeksha Keshari (1.7k points)

Answer I1 to In are microinstructions and reset_counter, shift_left and load_output are control signals to activate
corresponding hardware(eg. Shift register or load output).
Counter width (k) is 5 bits as shift register uses 32 bit data Only.
The number of missing micro instructions (n) should be 31 as shift register contain Only unsigned EVEN integer. LSB Will be
always 0 so no need to shift for LSB.
Control word contains:-
1 for active/enable. 0 for inactive or disabled.
Reset counter is to reset the counter so it must be 0 for all microns.
Shift_left CS should be 1 to shift the given data in shift reg.
And load output has no meaning to make output active for all microinstructions as it will be used in the END only so it should
be 0.
 32 votes -- khush tak (5.9k points)

Its answer should be (D) because 140 instructions, each requiring 7 cycles means 980 cycles which will take 10 bits.
Since it is horizontal for control word, 125 control signals + 10 bits = 135 bits will be required.
 38 votes -- nagendra2016 (349 points)
1.18 Pipelining (36) top☝
1.18.1 Pipelining: GATE CSE 1999 | Question: 13 top☝ ☛ https://gateoverflow.in/1512
An instruction pipeline consists of 4 stages – Fetch (F) , Decode field (D), Execute (E) and Result Write (W). The 5
instructions in a certain instruction sequence need these stages for the different number of clock cycles as shown by the table below
Instruction F D E W
1 1 2 1 1
2 1 2 2 1
3 2 1 3 2
4 1 3 2 1
5 1 2 1 2
Find the number of clock cycles needed to perform the 5 instructions.
gate1999 co-and-architecture pipelining normal numerical-answers
Answer ☟
1.18.2 Pipelining: GATE CSE 2000 | Question: 1.8 top☝ ☛ https://gateoverflow.in/631
Comparing the time T1 taken for a single instruction on a pipelined CPU with time T2 taken on a non-pipelined but identical
CPU, we can say that
A. T1 ≤ T2
B. T1 ≥ T2
C. T1 < T2
D. T1 and T2 plus the time taken for one instruction fetch cycle
gate2000-cse pipelining co-and-architecture easy
Answer ☟

An instruction pipeline has five stages where each stage take 2 nanoseconds and all instruction use all five stages. Branch
instructions are not overlapped. i.e., the instruction after the branch is not fetched till the branch instruction is completed. Under
ideal conditions,
A. Calculate the average instruction execution time assuming that 20% of all instructions executed are branch instruction. Ignore
the fact that some branch instructions may be conditional.
B. If a branch instruction is a conditional branch instruction, the branch need not be taken. If the branch is not taken,
the following instructions can be overlapped. When 80% of all branch instructions are conditional branch instructions, and
50% of the conditional branch instructions are such that the branch is taken, calculate the average instruction execution time.
gate2000-cse co-and-architecture pipelining normal descriptive
Answer ☟
Consider a 5− stage pipeline - IF (Instruction Fetch), ID (Instruction Decode and register read), EX (Execute), MEM
(memory), and WB (Write Back). All (memory or register) reads take place in the second phase of a clock cycle and all writes
occur in the first phase. Consider the execution of the following instruction sequence:
I1: sub r2, r3, r4 /* r2 ← r3 − r4 */

I2: sub r4, r2, r3 /* r4 ← r2 − r3 */
I3: sw r2, 100(r1) /* M[r1 + 100] ← r2 */
I4: sub r3, r4, r2 /* r3 ← r4 − r2 */
A. Show all data dependencies between the four instructions.

B. Identify the data hazards.
C. Can all hazards be avoided by forwarding in this case.
gate2001-cse co-and-architecture pipelining normal descriptive
Answer ☟
1.18.5 Pipelining: GATE CSE 2002 | Question: 2.6, ISRO2008-19 top☝ ☛ https://gateoverflow.in/836
The performance of a pipelined processor suffers if:
A. the pipeline stages have different delays

B. consecutive instructions are dependent on each other
C. the pipeline stages share hardware resources
D. All of the above
gate2002-cse co-and-architecture pipelining easy isro2008
Answer ☟
1.18.6 Pipelining: GATE CSE 2003 | Question: 10, ISRO-DEC2017-41 top☝ ☛ https://gateoverflow.in/901
For a pipelined CPU with a single ALU, consider the following situations
I. The j + 1 st instruction uses the result of the jth instruction as an operand

II. The execution of a conditional jump instruction
III. The jth and j + 1 st instructions require the ALU at the same time.
Which of the above can cause a hazard

A. I and II only
B. II and III only
C. III only
D. All the three

gate2003-cse co-and-architecture pipelining normal isrodec2017
Answer ☟
A 4-stage pipeline has the stage delays as 150, 120, 160 and 140 nanoseconds , respectively. Registers that are used between
the stages have a delay of 5 nanoseconds each. Assuming constant clocking rate, the total time taken to process 1000 data items
on this pipeline will be:
A. 120.4 microseconds
B. 160.5 microseconds
C. 165.5 microseconds
D. 590.0 microseconds
gate2004-cse co-and-architecture pipelining normal
Answer ☟
A 5 stage pipelined CPU has the following sequence of stages:
IF – instruction fetch from instruction memory

RD – Instruction decode and register read
EX – Execute: ALU operation for data and address computation
MA – Data memory access – for write access, the register read at RD state is used.
WB – Register write back
Consider the following sequence of instructions:
I1 : L R0, loc 1; R0 ⇐ M[loc1]

I2 : A R0, R0; R0 ⇐ R0 + R0
I3 : S R2, R0; R2 ⇐ R2 − R0
Let each stage take one clock cycle.

What is the number of clock cycles taken to complete the above sequence of instructions starting from the fetch of I1 ?
A. 8
B. 10
C. 12
D. 15
Answer ☟
A CPU has a five-stage pipeline and runs at 1 GHz frequency. Instruction fetch happens in the first stage of the pipeline. A
conditional branch instruction computes the target address and evaluates the condition in the third stage of the pipeline. The
processor stops fetching new instructions following a conditional branch until the branch outcome is known. A program executes
109 instructions out of which 20% are conditional branches. If each instruction takes one cycle to complete on average, the total
execution time of the program is:
A. 1.0 second
B. 1.2 seconds
C. 1.4 seconds
D. 1.6 seconds
Answer ☟

1.18.10 Pipelining: GATE CSE 2007 | Question: 37, ISRO2009-37 top☝ ☛ https://gateoverflow.in/1235
Consider a pipelined processor with the following four stages:
IF: Instruction Fetch

ID: Instruction Decode and Operand Fetch
EX: Execute
WB: Write Back
The IF, ID and WB stages take one clock cycle each to complete the operation. The number of clock cycles for the EX stage
depends on the instruction. The ADD and SUB instructions need 1 clock cycle and the MUL instruction needs 3 clock cycles in the
EX stage. Operand forwarding is used in the pipelined processor. What is the number of clock cycles taken to complete the
following sequence of instructions?
ADD R2, R1, R0 R2 ← R1+R0

MUL R4, R3, R2 R4 ← R3∗R2
SUB R6, R5, R4 R6 ← R5−R4
A. 7
B. 8
C. 10
D. 14
gate2007-cse co-and-architecture pipelining normal isro2009
Answer ☟
Which of the following are NOT true in a pipelined processor?
I. Bypassing can handle all RAW hazards

II. Register renaming can eliminate all register carried WAR hazards
III. Control hazard penalties can be eliminated by dynamic branch prediction
A. I and II only
B. I and III only
C. II and III only
D. I, II and III
gate2008-cse pipelining co-and-architecture normal
Answer ☟
Delayed branching can help in the handling of control hazards

For all delayed conditional branch instructions, irrespective of whether the condition evaluates to true or false,
A. The instruction following the conditional branch instruction in memory is executed

B. The first instruction in the fall through path is executed
C. The first instruction in the taken path is executed
D. The branch takes longer to execute than any other instruction
Answer ☟
Delayed branching can help in the handling of control hazards

The following code is to run on a pipelined processor with one branch delay slot:
I1: ADD R2 ← R7 + R8

I2: Sub R4 ← R5– R6
I3: ADD R1 ← R2 + R3
I4: STORE Memory [R4] ← R1
BRANCH to Label if R1 == 0
Which of the instructions I1, I2, I3 or I4 can legitimately occupy the delay slot without any program modification?
A. I1
B. I2
C. I3
D. I4
Answer ☟
Consider a 4 stage pipeline processor. The number of cycles needed by the four instructions I1, I2, I3, I4 in stages
S1, S2, S3, S4 is shown below:
S1 S2 S3 S4
I1 2 1 1 1
I2 1 3 2 2
I3 2 1 1 3
I4 1 2 2 2
What is the number of cycles needed to execute the following loop?

For (i = 1 to 2) {I1; I2; I3; I4;}
A. 16
B. 23
C. 28
D. 30
Answer ☟
A 5− stage pipelined processor has Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF), Perform Operation
(PO) and Write Operand (WO) stages. The IF, ID, OF and WO stages take 1 clock cycle each for any instruction. The PO stage
takes 1 clock cycle for ADD and SUB instructions, 3 clock cycles for MUL instruction and 6 clock cycles for DIV instruction
respectively. Operand forwarding is used in the pipeline. What is the number of clock cycles needed to execute the following
sequence of instructions?
Instruction Meaning of instruction

t0 : MUL R2 ,R0 ,R1 R2 ← R0 ∗ R1
t1 : DIV R5 , R3 , R4 R5 ← R3 ∕ R4
t2 : ADD R2 , R5 , R2 R2 ← R5 + R2
t3 : SUB R5 , R2 , R6 R5 ← R2 − R6
A. 13
B. 15
C. 17
D. 19

Answer ☟
Consider an instruction pipeline with four stages (S1, S2, S3 and S4) each with combinational circuit only. The pipeline
registers are required between each stage and at the end of the last stage. Delays for the stages and for the pipeline registers are as
given in the figure.
What is the approximate speed up of the pipeline in steady state under ideal conditions when compared to the corresponding non-
pipeline implementation?
A. 4.0
B. 2.5
C. 1.1
D. 3.0
Answer ☟
Register renaming is done in pipelined processors:
A. as an alternative to register allocation at compile time

B. for efficient access to function parameters and local variables
C. to handle certain kinds of hazards
D. as part of address translation
gate2012-cse co-and-architecture pipelining easy isro2016
Answer ☟
Consider an instruction pipeline with five stages without any branch prediction:
Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execute Instruction (EI) and Write Operand (WO). The stage
delays for FI, DI, FO, EI and WO are 5 ns, 7 ns, 10 ns, 8 ns and 6 ns, respectively. There are intermediate storage buffers after
each stage and the delay of each buffer is 1 ns. A program consisting of 12 instructions I1, I2, I3,… , I12 is executed in this
pipelined processor. Instruction I4 is the only branch instruction and its branch target is I9. If the branch is taken during the
execution of this program, the time (in ns) needed to complete the program is
A. 132
B. 165
C. 176
D. 328
gate2013-cse normal co-and-architecture pipelining
Answer ☟

1.18.19 Pipelining: GATE CSE 2014 Set 1 | Question: 43 top☝ ☛ https://gateoverflow.in/1921
Consider a 6-stage instruction pipeline, where all stages are perfectly balanced. Assume that there is no cycle-time overhead of
pipelining. When an application is executing on this 6-stage pipeline, the speedup achieved with respect to non-pipelined execution
if 25% of the instructions incur 2 pipeline stall cycles is ____________
gate2014-cse-set1 co-and-architecture pipelining numerical-answers normal
Answer ☟
An instruction pipeline has five stages, namely, instruction fetch (IF), instruction decode and register fetch (ID/RF),
instruction execution (EX), memory access (MEM), and register writeback (WB) with stage latencies 1 ns, 2.2 ns, 2 ns, 1 ns, and
0.75 ns, respectively (ns stands for nanoseconds). To gain in terms of frequency, the designers have decided to split the ID/RF stage
into three stages (ID, RF1, RF2) each of latency 2.2/3 ns. Also, the EX stage is split into two stages (EX1, EX2) each of latency 1
ns. The new design has a total of eight pipeline stages. A program has 20% branch instructions which execute in the EX stage and
produce the next instruction pointer at the end of the EX stage in the old design and at the end of the EX2 stage in the new design.
The IF stage stalls after fetching a branch instruction until the next instruction pointer is computed. All instructions other than the
branch instruction have an average CPI of one in both the designs. The execution times of this program on the old and the new
design are P and Q nanoseconds, respectively. The value of P /Q is __________.
gate2014-cse-set3 co-and-architecture pipelining numerical-answers normal
Answer ☟
Consider the following processors (ns stands for nanoseconds). Assume that the pipeline registers have zero latency.
P1: Four-stage pipeline with stage latencies 1 ns, 2 ns, 2 ns, 1 ns.
P2: Four-stage pipeline with stage latencies 1 ns, 1.5 ns, 1.5 ns, 1.5 ns.
P3: Five-stage pipeline with stage latencies 0.5 ns, 1 ns, 1 ns, 0.6 ns, 1 ns.
P4: Five-stage pipeline with stage latencies 0.5 ns, 0.5 ns, 1 ns, 1 ns, 1.1 ns.
Which processor has the highest peak clock frequency?

A. P1
B. P2
C. P3
D. P4
gate2014-cse-set3 co-and-architecture pipelining normal
Answer ☟
Consider a non-pipelined processor with a clock rate of 2.5 GHz and average cycles per instruction of four. The same
processor is upgraded to a pipelined processor with five stages; but due to the internal pipeline delay, the clock speed is reduced to 2
GHz. Assume that there are no stalls in the pipeline. The speedup achieved in this pipelined processor is_______________.
gate2015-cse-set1 co-and-architecture pipelining normal numerical-answers
Answer ☟
Consider the sequence of machine instruction given below:
MUL R5, R0, R1

DIV R6, R2, R3
ADD R7, R5, R6
SUB R8, R7, R4

In the above sequence, R0 to R8 are general purpose registers. In the instructions shown, the first register shows the result of the
operation performed on the second and the third registers. This sequence of instructions is to be executed in a pipelined instruction
processor with the following 4 stages: (1) Instruction Fetch and Decode (IF) , (2) Operand Fetch (OF) , (3) Perform Operation
(PO) and (4) Write back the result (WB) . The IF , OF and WB stages take 1 clock cycle each for any instruction. The PO
stage takes 1 clock cycle for ADD and SUB instruction, 3 clock cycles for MUL instruction and 5 clock cycles for DIV instruction.
The pipelined processor uses operand forwarding from the PO stage to the OF stage. The number of clock cycles taken for the
execution of the above sequence of instruction is _________.
Answer ☟
Consider the following reservation table for a pipeline having three stages S1 , S2 and S3 .
Time →
1 2 3 4 5
S1 X X
S2 X X
S3 X
The minimum average latency (MAL) is ______
gate2015-cse-set3 co-and-architecture pipelining difficult numerical-answers
Answer ☟
The stage delays in a 4-stage pipeline are 800, 500, 400 and 300 picoseconds. The first stage (with delay 800 picoseconds) is
replaced with a functionality equivalent design involving two stages with respective delays 600 and 350 picoseconds. The
throughput increase of the pipeline is ___________ percent.
Answer ☟
Consider a 3 GHz (gigahertz) processor with a three stage pipeline and stage
3τ2
latencies τ1 , τ2 and τ3 such that τ1 = = 2τ3 .
4
If the longest pipeline stage is split into two pipeline stages of equal latency ,
the new frequency is __________ GHz , ignoring delays in the pipeline registers.
Answer ☟
Instruction execution in a processor is divided into 5 stages, Instruction Fetch (IF), Instruction Decode (ID), Operand
fetch (OF), Execute (EX), and Write Back (WB). These stages take 5, 4, 20, 10 and 3 nanoseconds (ns) respectively. A pipelined
implementation of the processor requires buffering between each pair of consecutive stages with a delay of 2 ns. Two pipelined
implementation of the processor are contemplated:
i. a naive pipeline implementation (NP) with 5 stages and

ii. an efficient pipeline (EP) where the OF stage is divided into stages OF1 and OF2 with execution times of 12 ns and 8
ns respectively.
The speedup (correct to two decimal places) achieved by EP over NP in executing 20 independent instructions with no hazards is
_________ .

Answer ☟
The instruction pipeline of a RISC processor has the following stages: Instruction Fetch (IF) , Instruction Decode (ID),
Operand Fetch (OF) , Perform Operation (PO) and Writeback (WB) , The IF , ID , OF and WB stages take 1 clock cycle each
for every instruction. Consider a sequence of 100 instructions. In the PO stage, 40 instructions take 3 clock cycles each, 35
instructions take 2 clock cycles each, and the remaining 25 instructions take 1 clock cycle each. Assume that there are no data
hazards and no control hazards.
The number of clock cycles required for completion of execution of the sequence of instruction is _____.
gate2018-cse co-and-architecture pipelining numerical-answers
Answer ☟
Consider a non-pipelined processor operating at 2.5 GHz. It takes 5 clock cycles to complete an instruction. You are going to
make a 5- stage pipeline out of this processor. Overheads associated with pipelining force you to operate the pipelined processor at 2
GHz. In a given program, assume that 30% are memory instructions, 60% are ALU instructions and the rest are branch instructions.
5% of the memory instructions cause stalls of 50 clock cycles each due to cache misses and 50% of the branch instructions cause
stalls of 2 cycles each. Assume that there are no stalls associated with the execution of ALU instructions. For this program, the
speedup achieved by the pipelined processor over the non-pipelined processor (round off to 2 decimal places) is_____________.
gate2020-cse numerical-answers co-and-architecture pipelining
Answer ☟
A five-stage pipeline has stage delays of 150, 120, 150, 160 and 140 nanoseconds. The registers that are used between the
pipeline stages have a delay of 5 nanoseconds each.
The total time to execute 100 independent instructions on this pipeline, assuming there are no pipeline stalls, is _______
nanoseconds.
gate2021-cse-set1 co-and-architecture pipelining numerical-answers
Answer ☟
1.18.31 Pipelining: GATE IT 2004 | Question: 47 top☝ ☛ https://gateoverflow.in/3690
Consider a pipeline processor with 4 stages S1 to S4. We want to execute the following loop:
for (i = 1; i < = 1000; i++)
{I1, I2, I3, I4}
where the time taken (in ns) by instructions I1 to I4 for stages S1 to S4 are given below:
S1 S2 S3 S4
I1 1 2 1 2
I2 2 1 2 1
I3 1 1 2 1
I4 2 1 2 1
The output of I1 for i = 2 will be available after

A. 11 ns
B. 12 ns
C. 13 ns
D. 28 ns
gate2004-it co-and-architecture pipelining normal
Answer ☟

We have two designs D1 and D2 for a synchronous pipeline processor. D1 has 5 pipeline stages with execution times of
3 nsec, 2 nsec, 4 nsec, 2 nsec and 3 nsec while the design D2 has 8 pipeline stages each with 2 nsec execution time How much time
can be saved using design D2 over design D1 for executing 100 instructions?
A. 214 nsec
B. 202 nsec
C. 86 nsec
D. −200 nsec
Answer ☟
A pipelined processor uses a 4− stage instruction pipeline with the following stages: Instruction fetch (IF), Instruction decode
(ID), Execute (EX) and Writeback (WB). The arithmetic operations as well as the load and store operations are carried out in the EX
stage. The sequence of instructions corresponding to the statement X = (S − R ∗ (P + Q))/T is given below. The values of
variables P, Q, R, S and T are available in the registers R0, R1, R2, R3 and R4 respectively, before the execution of the
instruction sequence.
ADD R5, R0, R1 ; R5 ← R0 + R1

MUL R6, R2, R5 ; R6 ← R2 * R5
SUB R5, R3, R6 ; R5 ← R3 - R6
DIV R6, R5, R4 ; R6 ← R5/R4
STORE R6, X ; X ← R6
The number of Read-After-Write (RAW) dependencies, Write-After-Read( WAR) dependencies, and Write-After-Write (WAW)
dependencies in the sequence of instructions are, respectively,
A. 2, 2, 4
B. 3, 2, 3
C. 4, 2, 2
D. 3, 3, 2
Answer ☟
A pipelined processor uses a 4-stage instruction pipeline with the following stages: Instruction fetch (IF), Instruction decode
(ID), Execute (EX) and Writeback (WB). The arithmetic operations as well as the load and store operations are carried out in the EX
stage. The sequence of instructions corresponding to the statement X = (S − R ∗ (P + Q))/T is given below. The values of
variables P, Q, R, S and T are available in the registers R0, R1, R2, R3 and R4 respectively, before the execution of the
instruction sequence.
ADD R5, R0, R1 ; R5 ← R0 + R1

MUL R6, R2, R5 ; R6 ← R2 * R5
SUB R5, R3, R6 ; R5 ← R3 -R6
DIV R6, R5, R4 ; R6 ← R5/R4
STORE R6, X ; X ← R6
The IF, ID and WB stages take 1 clock cycle each. The EX stage takes 1 clock cycle each for the ADD, SUB and STORE
operations, and 3 clock cycles each for MUL and DIV operations. Operand forwarding from the EX stage to the ID stage is used.
The number of clock cycles required to complete the sequence of instructions is
A. 10
B. 12
C. 14
D. 16

Answer ☟
1.18.35 Pipelining: GATE IT 2007 | Question: 6, ISRO2011-25 top☝ ☛ https://gateoverflow.in/3437
A processor takes 12 cycles to complete an instruction I. The corresponding pipelined processor uses 6 stages with the
execution times of 3, 2, 5, 4, 6 and 2 cycles respectively. What is the asymptotic speedup assuming that a very large number of
instructions are to be executed?
A. 1.83
B. 2
C. 3
D. 6
gate2007-it co-and-architecture pipelining normal isro2011
Answer ☟
A non pipelined single cycle processor operating at 100 MHz is converted into a synchronous pipelined processor with five
stages requiring 2.5 nsec, 1.5 nsec, 2 nsec, 1.5 nsec and 2.5 nsec, respectively. The delay of the latches is 0.5 nsec. The
speedup of the pipeline processor for a large number of instructions is:
A. 4.5
B. 4.0
C. 3.33
D. 3.0
Answer ☟
Answers: Pipelining

Answer: 15 cycles are required.
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15

I1 F D D E W
I2 F − D D E E W
I3 F F D − E E E W W
I4 F − D D D E E W
I5 F − − D D E W W
1.18.2 Pipelining: GATE CSE 2000 | Question: 1.8 top☝ ☛ https://gateoverflow.in/631

Here we are comparing the execution time of only a single instruction. Pipelining in no way improves the execution time
of a single instruction (the time from its start to end). It increases the overall performance by splitting the execution to multiple
pipeline stages so that the following instructions can use the finished stages of the previous instructions. But in doing so
pipelining causes some problems also as given in the below link, which might slow some instructions. So, (B) is the answer.
http://www.cs.wvu.edu/~jdm/classes/cs455/notes/tech/instrpipe.html
References


Each stage is 2ns. So, after 5 time units each of 2ns, the first instruction finishes (i.e., after 10ns), in every 2ns after that a
new instruction gets finished. This is assuming no branch instructions. Now, once the pipeline is full, we can assume that the
initial fill time doesn't matter our calculations and average execution time for each instruction is 2ns assuming no branch
instructions.
A. Now, we are given that 20% of instructions are branch (like JMP) and when a branch instruction is executed, no further
instruction enters the pipeline. So, we can assume every 5th instruction is a branch instruction. So, with this assumption,
total time to finish 5 instruction will be 5 ∗ 2 + 8 = 18 ns (as when a branch instruction enters the pipeline and before it
finishes, 4 pipeline stages will be empty totaling 4 ∗ 2 = 8 ns, as it is mentioned in question that the next instruction fetch
starts only when branch instruction completes). And this is the same for every set of 5 instructions, and hence the average
instruction execution time = 18/5 = 3.6 ns
B. This is just a complex statement. But what we need is to identify the % of branch instructions which cause a branch to be
taken as others will have no effect on the pipeline flow.
20% of instructions are branch instructions. 80% of branch instructions are conditional.
That means .2 ∗ .8 = 16% of instructions are conditional branch instructions and it is given that 50% of those result in a
branch being taken.
So, 8% of instructions are conditional branches being taken and we also have 20% of 20% = 4% of unconditional branch
instructions which are always taken.
So, percentage of instructions where a branch is taken is 8 + 4 = 12% instead of 20% in (A) part.
So, in 100 instructions there will be 12 branch instructions. We can do a different calculation here as compared to (A) as 12 is
not a divisor of 100. Each branch instruction causes a pipeline delay of 4 ∗ 2 = 8 ns. So, 12 instructions will cause a delay of
12 ∗ 8 = 96 ns. For 100 instructions, we need 100 ∗ 2 = 200 ns without any delay and with delay we require 200 + 96 = 296
ns for 100 instructions.
So, average instruction execution time = 296/100 = 2.96 ns
(We can also use this method for part (A) which will give 100 ∗ 2 + 20 ∗ 8 = 360 ns for 100 instructions)
if an instruction branches then it takes 2ns × 5 = 10ns , coz if branch is taken then the instruction after that branch instruction is
not fetched until entire current branch instruction is completed, this means it will go through all stages.
if an instruction is non-branch or branching does not happen then, it takes 2ns to get completed.
a) average time taken = 0.8 × 2ns + 0.2 × 10ns = 3.6ns
b)
Average time taken,
= 0.8 × 2ns + 0.2(0.2 × 10ns + 0.8((0.5 × 10ns + 0.5 × 2ns)))
= 2.96ns


RAW dependencies:
1. I1 ← I2
2. I1 ← I3
3. I1 ← I4
4. I2 ← I4
WAR dependencies:
1. I2 ← I1
2. I4 ← I1
3. I4 ← I2
Consider a normal pipeline execution:
t1 t2 t3 t4 t5 t6 t7 t8
I1 IF ID EX MEM WB
I2 IF ID EX MEM WB
I3 IF ID EX MEM WB
I4 IF ID EX MEM WB
So, there are RAW hazards for I2 and I3 with I1 and for I4 with I2 . (Not all dependencies cause a hazard. Only if a dependency
causes a stall in the given pipeline structure, we get a hazard) These hazards cause the following stalls in the pipeline:
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11
I1 IF ID EX MEM WB
I2 IF − − ID EX MEM WB
I3 − − IF ID EX MEM WB
I4 − − IF − ID EX MEM WB
Now, with operand forwarding from EX − EX stage we can do as follows:
t1 t2 t3 t4 t5 t6 t7 t8
I1 IF ID EX MEM WB
1
I2 IF ID EX MEM WB
1
I3 IF ID EX MEM WB
1
3
I4 IF ID EX MEM WB
3
Thus all hazards are eliminated.

Ref: http://cseweb.ucsd.edu/classes/wi05/cse240a/pipe2.pdf
References
1.18.5 Pipelining: GATE CSE 2002 | Question: 2.6, ISRO2008-19 top☝ ☛ https://gateoverflow.in/836

Answer is D.

A: Yes. Total delay = Max (All delays) + Register Delay.
B: Yes, if data forwarding is not there.
C: Yes, like ID and EX shares ID/EX register.
1.18.6 Pipelining: GATE CSE 2003 | Question: 10, ISRO-DEC2017-41 top☝ ☛ https://gateoverflow.in/901

1. Data hazard
2. Control hazard
3. Structural hazard as only one ALU is there
So, (D).
http://www.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/hazards.html
References

Pipelining requires all stages to be synchronized meaning, we have to make the delay of all stages equal to the maximum
pipeline stage delay which here is 160. We also have to add the intermediate register delay which here is 5ns which makes the
clock period as 165ns.
Time for execution of the first instruction = 165 ∗ 4 = 660 ns.
Now, in every 165 ns, an instruction can be completed. So,
Total time for 1000 instructions = 660 + 999 ∗ 165 = 165.495 microseconds
Correct Answer: C

Answer is option A.
Without data forwarding:
13 clock - WB and RD state non overlapping.
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13

IF RD EX MA WB
IF RD EX MA WB
IF RD EX MA WB
Here, WB and RD stage operate in Non-Overlapping mode.

11 clock - WB and RD states overlapping.
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
IF RD EX MA WB
IF RD EX MA WB
IF RD EX MA WB
Split Phase access between WB and RD means:
WB stage produce the output during the rising edge of the clock and RD stage fetch the output during the falling edge.
In Question it is mentioned
' for write access, the register read at RD state is used.
This means that for writing operands back to memory, register read at RD state is used (no operand forward for STORE
instructions).
Note
' As in any question in any subject unless otherwise stated we always consider the best case. So, do overlap - unless
otherwise stated. But this is for only WB/RD
1. Why there is stall for I2 in T3 and T4 ?

RD is instruction decode and register read. IF we execute RD of I2 in T3, data from memory will not get stored to R0
hence proper operands are not available at T3. Perhaps I2 has to wait until I1 write values to memory.
2. WB of I1 and RD of I2 are operating in same clock why it is so ?
If nothing has mentioned in question. This scenario is taken into consideration by default. It is because after MA operands
will be available in register so RD and WB could overlap .
With data forwarding

(Should be the case here as question says no operand forwarding for memory register for STORE instructions)
8 clock cycles
1. Why there is a stall I2 in T4 ?

Data is being forwarded from MA of I1 EX of I2 .MA operation of I1 must complete so that correct data will be available
in register .
2. Why RD of I2 in T3 ? Will it not fetch incorrect information if executed before Operand are forwarded from MA of
I1 ?
Yes. RD of I2 will definitely fetch INCORRECT data at T3 . But don't worry about it Operand Forwarding technique will
take care of it .
3. Why can't RD of I2 be placed in T4 ?

Yes . We can place RD of I2 in T4 as well. But what is the fun in that ? pipeline is a technique used to reduce the execution
time of instructions . Why do we need to make an extra stall ? Moreover there is one more problem which is discussed just
below .After reading the below point Just think if we had created a stall at T3 !
4. Why can't RD of I3 be placed at T4 ?
This cannot be done . I3 cannot use RD because Previous instruction I2 should start next stage (EX) before current (I3)
could utilize that(RD) stage . It is because data will be residing in buffers.
5. Can an operand being forwarded from one clock cycle to same clock cycle ?
No, the previous clock cycle must complete before data being forwarded . Unless split phase technique is used
6. Cant there be a forwarding from EX stage(T3) of I1 to EX stage(T4) of I2 ?
This is not possible . See what is happening in I1 . It is Memory Read .So data will be available in register after memory
read only .So data cannot be forwarded from EX of I1 .
7. In some case data is forwarded from MA and some case data is forwarded from EX Why it is so ?
Data is forwarded when it is ready . It solely depends on the type of instruction .
8. When to use Split-Phase ?
We can use split phase if data is readily available like between WB/RD and also when operand forwarding happens from
EX-ID stage, but not from EX-EX stage. We cannot do split phase access between EX-EX because here the instruction
execution may not be possible in the first phase. (This is not mentioned in any standard resource but said by Arjun Suresh
by considering practical implementation and how previous year GATE questions have been formed)
[Mostly it is given in question that there is operand forwarding from A stage to B stage eg:https://gateoverflow.in/8218/gate2015-
2_44 ]
Split-Phase can be used even when no Operand Forwarding because they aren't related.
References
http://web.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/forward.html
Similar Questions
https://gateoverflow.in/8218/gate2015-2_44
https://gateoverflow.in/2207/gate2010-33
https://gateoverflow.in/34735/pipelining-without-operand-forwarding
Discussions
https://gateoverflow.in/102565/operand-forwarding-in-pipeline
https://gateoverflow.in/113244/doubts-in-pipelining
References
 212 votes -- Akhil Nadh PC (16.5k points)

Delay slots in the pipeline caused due to a branch instruction is 2 as after the 3rd
stage of current instruction (during 4th stage) IF of next begins.
Ideally, this should be during 2nd stage.
So, for total no. of instructions = 109 and 20% branch,

we have 0.2 × 2 × 109 = 4 × 108 cycle penalty.
Since clock speed is 1 GHz and each instruction on average takes 1 cycle,
total execution time in seconds will be
109 108
= 9
+4×
10 109
= 1.4
Correct Answer: C


Answer: option B.
Considering EX to EX data forwarding.
t1 t2 t3 t4 t5 t6 t7 t8
ADD IF ID EX WB
MUL IF ID EX EX EX WB
SUB IF ID − − EX WB
EX to EX data Forwarding:

(B) I and III
I - False Bypassing can't handle all RAW hazard, consider when any instruction depends on the result of LOAD instruction,
now LOAD updates register value at Memory Access Stage (MA), so data will not be available directly on Execute stage.
II - True, register renaming can eliminate all WAR Hazard.
III- False, It cannot completely eliminate, though it can reduce Control Hazard Penalties
 81 votes -- Prateeksha Keshari (1.7k points)

Answer is A. In order to avoid the pipeline delay due to conditional branch instruction, a suitable instruction is placed
below the conditional branch instruction such that the instruction will be executed irrespective of whether branch is taken or not
and won't affect the program behaviour.

What is Delayed Branching ?
One way to maximize the use of the pipeline, is to find an instruction that can be safely executed whether the branch is taken or
not, and execute that instruction. So, when a branch instruction is encountered, the hardware puts the instruction following the
branch into the pipe and begins executing it, just as in predict-not-taken. However, unlike in predict-not-taken, we do not need to
worry about whether the branch is taken or not, we do not need to clear the pipe because no matter whether the branch is taken or
not, we know the instruction is safe to execute.
Moving I1 after branch
I1 is updating the value of R2

R2 which is used to determine branch condition R1

Value of R2 is available after branch
⇒ Cannot be moved
value of R1 is computed in this instruction

R1 is the branch condition
⇒ Cannot be moved
I4 is simple store instruction used to store R1 in memory

program execution will have no effect if this is placed after conditional branch
⇒ Can be moved
It update the memory location to place the storing of conditional branch instruction R1
If moved after branch , when compiler reaches I4 program execution will stop
⇒ Cannot be moved
Hence, option D is answer.
 60 votes -- Akhil Nadh PC (16.5k points)

Here bound of the loop are constants, therefore compiler will do the loop unrolling(If compiler won't then prefetcher will
do) to increase the instruction level parallelism. And after loop unrolling 23 cycles are required for execution. Therefore, correct
answer would be (B).
PS: We assume the buffers between the pipeline stages can store multiple results in the form of a queue.
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20 C21 C22 C23
I1 S1 S1 S2 S3 S4
I2 S1 S2 S2 S2 S3 S3 S4 S4
I3 S1 S1 − S2 − S3 − S4 S4 S4
I4 S1 − S2 S2 S3 S3 − − S4 S4
I1 S1 S1 − S2 − S3 − − − S4
I2 S1 − S2 S2 S2 S3 S3 − S4 S4
I3 S1 S1 − − S2 − S3 − − S4 S4 S4
I4 S1 − − S2 S2 S3 S3 − − − S4 S4

MUL IF ID OF PO PO PO WO
DIV IF ID OF − − PO PO PO PO PO PO WO
ADD IF ID − − OF − − − − − PO WO
SUB IF − − ID − − − − − OF PO WO
Operand forwarding allows an output to be passed for the next instruction. Here from the output of PO stage of DIV instruction
operand is forwarded to the PO stage of ADD instruction and similarly between ADD and SUB instructions. Hence, 15cycles
required.
http://www.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/forward.html
Correct Answer: B
References


Answer is (B) 2.5
In pipeline system, Time taken is determined by the max delay at any stage i.e., 11 ns plus the delay incurred by pipeline stages
i.e., 1 ns = 12 ns. In non-pipeline system,
Delay = 5 ns + 6 ns + 11 ns + 8 ns = 30 ns.
30
∴ The speedup is 12 = 2.5 ns.
 53 votes -- Sona Praneeth Akula (3.4k points)

Register renaming is done to eliminate WAR (Write after Read) and WAW (Write after Write) dependency between
instructions which could have caused pipieline stalls. Hence, (C) is the answer.
Example:
I1: Read A to B
I2: Write C to A
Here, there is a WAR dependency and pipeline would need stalls. In order to avoid it register renaming is done and
Write C to A
will be
Write C to A'
WAR dependency is actually called anti-dependency and there is no real dependency except the fact that both uses same memory
location. Register renaming can avoid this. Similarly WAW also.
http://people.ee.duke.edu/~sorin/ece252/lectures/4.2-tomasulo.pdf
References

After pipelining we have to adjust the stage delays such that no stage will be waiting for another to ensure smooth
pipelining (continuous flow). Since we can not easily decrease the stage delay, we can increase all the stage delays to the
maximum delay possible. So, here maximum delay is 10 ns. Buffer delay given is 1ns. So, each stage takes 11 ns in total.
FI of I9 can start only after the EI of I4. So, the total execution time will be
15 × 11 = 165

I1 FI DI FO EI WO
I2 FI DI FO EI WO
I3 FI DI FO EI WO
I4 FI DI FO EI WO
stall
stall
stall
I9 FI DI FO EI WO
I10 FI DI FO EI WO
I11 FI DI FO EI WO
I12 FI DI FO EI WO
Correct Answer: B

Time without pipeline = 6 stages = 6 cycles
Time with pipeline = 1 + stall freqency × stall cycle
= 1 + .25 × 2
= 1.5
6
Speed up = =4
1.5
 64 votes -- aravind90 (389 points)

Five stages:
(IF), instruction decode and register fetch (ID/RF),
instruction execution (EX),
memory access (MEM), and register writeback (WB)
P old design:
with stage latencies 1 ns, 2.2 ns, 2 ns, 1 ns, and 0.75 ns
MAX( 1 ns, 2.2 ns, 2 ns, 1 ns, and 0.75 ns) = 2.2nsec
AVG instruction execution time is
Tavg=(1+no of stalls×branch penality)×cycle time
= (1 + 0.20 × 2)2.2 { branch peanlity is 2 because the next instruction
pointer at the end of the EX stage in the old design.}
= 3.08 nsec
Q :new DESIGN:
the designers decided to split the ID/RF stage into three stages (ID, RF1, RF2)
2.2
each of latency ns . Also, the EX stage is split into two stages
3
(EX1, EX2) each of latency 1 ns.
The new design has a total of eight pipeline stages.
Time of stages in new design = {1 ns, 0.73ns, 0.73ns, 0.73ns , 1ns,1ns, 1 ns, and 0.75 ns}
(IF), instruction decode
register fetch (ID/RF) → further divided into 3 ie with latency 0.73 of each

instruction execution (EX) → further divided int 1 nsec of each)
memory access (MEM)
register writeback (WB)
MAX( 1 ns, 0.73ns, 0.73ns, 0.73ns , 1ns,1ns, 1 ns, and 0.75 ns) =1 nsec
AVG instruction execution time is
Tavg=(1+no of stalls×branch penality)×cycle time
= (1 + 0.20 × 5)1 { branch penalty is 5 because the next instruction pointer
at the end of the EX2 stage in the new design.}
= 2nsec
final result
P 3.08
= = 1.54
Q 2
 59 votes -- kunal chalotra (13.6k points)
 1
frequency =
max(time in stages)
1
for P3 , it is GHz
1
1
for P1 , it is = 0.5GHz
2
1
for P2 , it is = 0.67GHz
1.5
1
for P4 , it is GHz
1.1
Correct Answer: C
 41 votes -- Arpit Dhuriya (2.9k points)

Answer = 3.2.
To compute cycle time, we know that a 2.5 GHz processor means it completes 2.5 billion cycles in a second. So, for an
4
instruction which on an average takes 4 cycles to get completed, it will take nanoseconds.
2.5
On a perfect pipleline (i.e., one which has no stalls) CPI = 1 as during it an instruction takes just one cycle time to get
completed.
So,
Old Execution Time of an Instruction
Speed Up =
New Execution Time of an Instruction
CP Iold /C Fold
=
CP Inew /C Fnew
4/2.5 GHz
=
1/2 GHz
= 3.2
 87 votes -- naresh1845 (1.1k points)

Old execution time
Speed up =
New execution time
CPI 4
Old execution time = = = 1.6 ns
2.5 2.5
With pipelining,
each instruction needs old execution time × old frequency
=
new frequency (without pipelining)
1.6 × 2.5
= = 2 ns
2
There are 5 stages and when there is no pipeline stall, this can give a speed up of up to 5
(happens when all stages take same number of cycles).
2
In our case this time will be = 0.4 ns .
5
1
But clock frequency being 2 GHz, clock cycle is GHz = 0.5 ns
2
and a pipeline stage cannot be faster than this.
So, average instruction execution time after pipelining = max (0.4, 0.5) = 0.5 ns .
1.6
So, speed up compared to non-pipelined version = = 3.2
0.5

I1 IF OF PO PO PO WB
I2 IF OF − − PO PO PO PO PO WB
I3 IF − − − − − − − OF PO WB
I4 − − − − − − − IF − OF PO WB
It is mentioned in the question that operand forwarding takes place from PO stage to OF stage and not to PO stage. So, 15 clock
cycles.
But since operand forwarding is from PO-OF, we can do like make the PO stage produce the output during the rising edge of the
clock and OF stage fetch the output during the falling edge. This would mean the final PO stage and OF stage can be done in one
clock cycle making the total number of cycles = 13. And 13 is the answer given in GATE key.
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13

I1 IF OF PO PO PO WB
I2 IF OF − − PO PO PO PO PO WB
I3 IF − − − − − − OF PO WB
I4 − − − − − − IF OF PO WB
Reference: http://www.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/forward.html
References


Reference: Page 24 http://www2.cs.siu.edu/~cs401/Textbook/ch3.pdf
S1 is needed at time 1 and 5, so its forbidden latency is 5 − 1 = 4.
S2 is needed at time 2 and 4, so its forbidden latency is 4 − 2 = 2.
So, forbidden latency = (2, 4, 0) ( 0 by default is forbidden)
Allowed latency = (1, 3, 5) (any value more than 5 also).
Collision vector (4, 3, 2, 1, 0) = 10101 which is the initial state as well.
From initial state we can have a transition after “1" or “3" cycles and we reach new states with collision vectors
(10101 >> 1 + 10101 = 11111) and (10101 >> 3 + 10101 = 10111) respectively.
These 2 becomes states 2 and 3 respectively. For “5" cycles we come back to state 1 itself.
From state 2 (11111), the new collision vector is 11111. We can have a transition only when we see the first 0 from the right.
So, here it happens on 5th cycle only which goes to the initial state. (Any transition after 5 or more cycles goes to initial state as
we have 5 time slices).
From state 3 (10111), the new collision vector is 10111. So, we can have a transition on 3, which will give
(10111 >> 3 + 10101 = 10111) third state itself. For 5, we get the initial state. Thus all the transitions are complete.
State\Time 1 3 5
1(10101) 2 3 1
2(11111) - - 1
3(10111) - 3 1
So, minimum length cycle is of length 3 either from 3-3 or from 1-3,3-1 .
Not asked in the question, still.
Pipeline throughput is the number of instructions initiated per unit time.
So, with MAL = 3, we have 2 initiations in 1 + 3 = 4 units of time (one at time unit 1 and another at time unit 4 ). So,
2
throughput = = 0.5 .
4
Pipeline efficiency is the % of time every stage of the pipeline is being used.
For the given question we can extend the reservation table and taking MAL = 3, we can initiate new tasks after every 3 cycles.
So, we can consider the time interval from 4-6 in the below figure. (The red color shows a stage not being used- affects
efficiency).
Time →
1 2 3 4 5 6 7 8 9 10 11
S1 X Y X × Z Y × A Z
S2 X X Y × Y Z Z A
S3 X × × Y Z
2 2 1
Here (during cycles 4 − 6 ), stage 1 is used , stage 2 is used and stage 3 is used .
3 3 3
(2 + 2 + 1) 5 500
So, total stage utilization = = and efficiency = % = 55.55 .
9 9 9
For simulation, Reference: http://www.ecs.umass.edu/ece/koren/architecture/ResTable/SimpRes/

Similar Question
https://gateoverflow.in/77125/advanced-computer-architecture-collision-vector-pipeline
References


In pipeline ideally CPI = 1
So in 1 cycle 1 instruction gets completed
Throughout is instructions in unit time
In pipeline 1, cycle time=max stage delay = 800 psec
In 800 psec, we expect to finish 1 instruction
1
So, in 1ps, instructions are expected to be completed, which is also the throughput for pipeline 1.
800
1
Similarly pipeline 2, throughput=
600
Throughput increase in percentage
new-old
= × 100
old
1 1
−
600 800
= × 100
1
800
200
= × 100
600
= 33.33%
 116 votes -- Anurag Semwal (6.7k points)
' Maximum throughput of a Pipeline i.e in best case without any stalls is equal to Clock Frequency of the pipeline
In first case Clock cycle time = Max Stage Delay = Max(800,500,400 and 300) = 800.
1
So clock Frequency = (Ignore the units as we have to calculate percentage only)
800
In Second Case Clock cycle time = Max(600,350,500,400 and 300) = 600.
1
So clock Frequency = .
600
Percentage increase in throughput of pipeline = percentage in Clock Frequency
1 1
( − )
600 800
= × 100 = 33.33%
1
800
 40 votes -- Mehak Sharma (1.2k points)

Answer is 4 GHz.
Given 3 stage pipeline , with 3 GHz processor.
3e2
Given , e1 = = 2e3
4
Put e1 = 6x
we get, e2 = 8x , e3 = 3x
Now largest stage time is 8x.
1
So, frequency is
8x
1
⇒ = 3GHz
1
⇒ = 3GHz
8x
1
⇒ = 24 GHz … (1)
x
Now, we divide e2 into two stages of 4x & 4x.

New processor has 4 stages -
6x, 4x, 4x, 3x.
Now largest stage time is 6x.
So, new frequency is
1 24
= = 4 GHz (Ans)… from (1)
6x 6

CASE 1
Stages 5, max delay = 22 (after adding buffer delay), number of instructions = 20
CASE 2
Stages 6, (since OF is split), max delay = 14, number of instructions = 20
So, execution time is (K + N − 1) × Max delay
528
Speed Up = = 1.508 (Execution time case 1/Execution time case 2)
350
So, answer is 1.508
 42 votes -- sriv_shubham (2.8k points)

Total Instruction = 100
Number of stages = 5
In normal case total cycles = 100 + 5 − 1 = 104 cycles
Now, For PO stage 40 instructions take 3 cycles, 35 take 2 cycles and rest of the 25 take 1 cycle.
That means all other stages are perfectly fine and working with CPI (Clock Cycle Per Instruction)= 1
PO stage:
40 instructions take 3 cycles i.e. these instructions are suffering from 2 stall cycle,
So, extra cycle would be 40 ∗ 2 + 35 ∗ 1 + 25 ∗ 0 = 80 + 35 = 115 cycle.
Total cycles = 104 + 115 = 219

 5n
Time taken by non-pipelined processor to finish executing the n instructions : 2.5 = 2n ns
Now, for pipelined processor,

Instruction type Number of such instructions %Causing stalls Number of stall cycles
Memory 0.3n 5% of 0.3n 50
ALU 0.6n None 0
Branch 0.1n 50% of 0.1n 2
Therefore, time taken by pipelined processor:
0.6n(1) + 0.3n[0.05(1 + 50) + 0.95(1)] + 0.1n[0.5(1 + 2) + 0.5(1)] cycles
= 1.85n cycles
1.85n
= 2 ns
= 0.925n ns
2n
Speedup = 0.925n = 2.162 ≈ 2.16.
 19 votes -- Debasish Das (1.5k points)

For the given pipelined system:
Total number of stages (k) = 4

Total number of instructions, (n) = 100
Total delay (tp ) = max(stage delay) + buffer delay
⟹ tp = max(150, 120, 150, 160, 140) + 5 ns
⟹ tp = 160 + 5 ns
⟹ tp = 165 ns
ETp = [(k + (n − 1)) ∗ tp ]

⟹ ETp = [(5 + (100 − 1)) ∗ 165] ns
⟹ ETp = (5 + 99) ∗ 165 ns
⟹ ETpipeline = 17160 ns
∴ To execute 100 instructions in the given pipeline, 17160 ns time is required.
 2 votes -- Hira (14.1k points)
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13

I1 s1 s2 s2 s3 s4 s4
I2 s1 s1 s2 s3 s3 s4
I3 s1 s2 -- s3 s3 s4
I4 s1 s1 s2 -- s3 s3 s4
I5 s1 -- s2 s2 s3 s4 s4
So, total time would be 13 ns
Option (c).
 47 votes -- Suvojit Mondal (291 points)

(B) is the correct option for this question.
(K + n − 1) ∗ execution_time

Execution time for Pipeline = (K + n − 1) ∗ execution_time where k = no of stages in pipeline, n = no of instructions
Execution time = max (all stages execution time)
D1 = (5 + 100 − 1) ∗ 4 = 416
D2 = (8 + 100 − 1) ∗ 2 = 214
Time saved using D2 = 416 − 214 = 202


(C) is the correct option for this question:
RAW
1. I1 - I2 (R5)
2. I2 - I3 (R6)
3. I3 - I4 (R5)
4. I4 - I5 (R6)
WAR
1. I2 - I3 (R5)
2. I3 - I4 (R6)
WAW
1. I1 - I3 (R5)
2. I2 - I4 (R6)
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12

ADD IF ID EX WB
1
MUL IF ID EX EX EX WB
1 1
SUB IF − − ID EX WB
1 1
DIV IF ID EX EX EX WB
1 1
STORE IF − − ID EX WB
1
− Stalls
1 Operand forwarding from EX-ID using split phase
So, answer is 12.
Correct Answer: B
1.18.35 Pipelining: GATE IT 2007 | Question: 6, ISRO2011-25 top☝ ☛ https://gateoverflow.in/3437

For non pipeline processor we have n instruction and each instruction take
12 cycle so total 12n instruction.
For pipeline processor we have each stage strict to 6ns so time to complete
6 × 6 + (n − 1) × 6

the n instruction is 6 × 6 + (n − 1) × 6 .
12n 12
limn→∞ = = 2.
36 + (n − 1) × 6 6
Correct Answer: B
 62 votes -- Arpit Dhuriya (2.9k points)

Here we have to keep in mind the phrase:
' A non-pipelined single cycle processor
This signifies that instruction in a non pipelined scenario is incurring only a single cycle to execute entire instruction. Hence no
concept of stage comes in case of single cycle non pipelined system.
The cycle time can be calculated from clock frequency given in non pipelined system = 100 MHz
1
Therefore clock cycle time in non pipelined system = sec
(100 × 106 )
= 10 ns
Now cycle time in pipelined system = max(stage delay + interface delay)
= 2.5 + 0.5 = 3 ns
Therefore,
CP Inon pipeline × Cycle time non pipeline
Speedup =
(CP Ipipeline × Cycle time pipeline )
1 × 10
= = 3.33
(1 × 3)
[Since in case of non pipeline we have single cycle processor, so CP Inon pipeline = 1 and CP Ipipeline by default = 1]
Hence, (C) is the correct answer.
 63 votes -- HABIB MOHAMMAD KHAN (67.5k points)
1.19 Runtime Environments (2) top☝
1.19.1 Runtime Environments: GATE CSE 2001 | Question: 1.10, UGCNET-Dec2012-III: 36 top☝ ☛ https://gateoverflow.in/703
Suppose a processor does not have any stack pointer registers, which of the following statements is true?
A. It cannot have subroutine call instruction

B. It cannot have nested subroutines call
C. Interrupts are not possible
D. All subroutine calls and interrupts are possible
gate2001-cse co-and-architecture normal ugcnetdec2012iii runtime-environments
Answer ☟
1.19.2 Runtime Environments: GATE CSE 2008 | Question: 37, ISRO2009-38 top☝ ☛ https://gateoverflow.in/448
The use of multiple register windows with overlap causes a reduction in the number of memory accesses for:
I. Function locals and parameters

II. Register saves and restores
III. Instruction fetches
A. I only
II

B. II only
C. III only
D. I , II and III
gate2008-cse co-and-architecture normal isro2009 runtime-environments
Answer ☟
Answers: Runtime Environments
1.19.1 Runtime Environments: GATE CSE 2001 | Question: 1.10, UGCNET-Dec2012-III: 36 top☝ ☛ https://gateoverflow.in/703

A stack pointer is a small register that stores the address of the last program request in a stack.
And a nested function (or nested procedure or subroutine) is a function which is defined within another function, the enclosing
function. So if there is no stack pointer register then No nested subroutine call possible, hence option B is correct.
 40 votes -- Bikram (58.4k points)
1.19.2 Runtime Environments: GATE CSE 2008 | Question: 37, ISRO2009-38 top☝ ☛ https://gateoverflow.in/448

I. Functions locals and parameters
this is true because overlapped registers eliminates the need for memory accesses. we here got to use registers instead.
II. Register saves and restores

this is false bc we need to see where memory accesses are reduced here before also we were using register as it says
Register saves... later also (i.e. after using multiple register windows) registers will are referred. So NO memory accesses
are reduced here.
III. Instruction fetches

it has nothing to do with reduction in memory accesses.
Hence, option (A) is correct.
1.20 Speedup (2) top☝
1.20.1 Speedup: GATE CSE 2014 Set 1 | Question: 55 top☝ ☛ https://gateoverflow.in/1935
Consider two processors P1 and P2 executing the same instruction set. Assume that under identical conditions, for the same
input, a program running on P2 takes 25% less time but incurs 20% more CPI (clock cycles per instruction) as compared to the
program running on P1 . If the clock frequency of P1 is 1GHZ , then the clock frequency of P2 (in GHz) is ______.
gate2014-cse-set1 co-and-architecture numerical-answers normal speedup
Answer ☟
1.20.2 Speedup: GATE IT 2004 | Question: 50 top☝ ☛ https://gateoverflow.in/790
In an enhancement of a design of a CPU, the speed of a floating point unit has been increased by 20% and the speed of a fixed
point unit has been increased by 10%. What is the overall speedup achieved if the ratio of the number of floating point operations to
the number of fixed point operations is 2 : 3 and the floating point operation used to take twice the time taken by the fixed point
operation in the original design?
A. 1.155
B. 1.185
C. 1.255
D. 1.285
gate2004-it normal co-and-architecture speedup
Answer ☟
Answers: Speedup

1.20.1 Speedup: GATE CSE 2014 Set 1 | Question: 55 top☝ ☛ https://gateoverflow.in/1935

CPU TIME (T) = No. of Instructions( I ) x No. of Cycles Per Instruction (c) x Cycle Time (t)
OR
No. of Instructions(I) × No. of Cycles Per Instruction (c)

CPU TIME (T) =
Clock frequency (f)
→ T = Ic × CPI × F −1
T ×F
→ = Ic
CPI
P1 & P2 executing same instruction set So,

No. of Instructions same for both = I1 = I2 = I
If P1 takes T1 time,
T2
→ T2 = 0.75 × T1 → = 0.75
T1
If P1 incurs C1 clock cycles per instruction,
C2
→ C2 = 1.2 × C1 → = 1.2
C1
Since I is same for both,
(f1 × T1 ) (f2 × T2 )
→ = and f1 = 1 GHz
c1 c2
C2 T1
→ F2 = ( ) × ( ) × F1
C1 T2
1.2 × 1GHz
= = 1.6 GHz
0.75
Hence, the clock frequency of P2 is = 1.6 GHz .

 85 votes -- Suraj Kaushal (273 points)
Execution time (T) = CPI * #instructions * time for a clock

= CPI * #instructions / clock frequency (F)
Given P1 and P2 execute the same set of instructions and
T2 = 0.75 T 1,
CPI2 = 1.2 CPI1 and
F1 = 1GHz.
So,
T1 T2
CPI1
× F1 = CPI2
× F2
T1 0.75T1
CPI1
× 1GHz = 1.2CPI1
× F2
1.2
⟹ F2 = 0.75
GHz
= 1.6GHz
1.20.2 Speedup: GATE IT 2004 | Question: 50 top☝ ☛ https://gateoverflow.in/790

 Original time taken
SpeedUp =
new time taken
Let x be the time for a fixed point operation,
(3x + 2 × 2x) 7x
Original time taken = =
5 5
3x 4x
( + )
1.1 1.2 8x
New time taken = =
5 1.32 × 5
7 × 1.32
So, SpeedUp = = 1.155
8
Correct Answer: A
1.21 Virtual Memory (3) top☝
1.21.1 Virtual Memory: GATE CSE 1991 | Question: 03,iii top☝ ☛ https://gateoverflow.in/517
The total size of address space in a virtual memory system is limited by:
A. the length of MAR
B. the available secondary storage
C. the available main memory
D. all of the above
E. none of the above
gate1991 co-and-architecture virtual-memory normal multiple-selects
Answer ☟
1.21.2 Virtual Memory: GATE CSE 2004 | Question: 47 top☝ ☛ https://gateoverflow.in/318
Consider a system with a two-level paging scheme in which a regular memory access takes 150 nanoseconds , and servicing
a page fault takes 8 milliseconds . An average instruction takes 100 nanoseconds of CPU time, and two memory accesses. The
TLB hit ratio is 90%, and the page fault rate is one in every 10, 000 instructions. What is the effective average instruction execution
time?
A. 645 nanoseconds
B. 1050 nanoseconds
C. 1215 nanoseconds
D. 1230 nanoseconds
gate2004-cse co-and-architecture virtual-memory normal
Answer ☟
In an instruction execution pipeline, the earliest that the data TLB (Translation Lookaside Buffer) can be accessed is:
A. before effective address calculation has started

B. during effective address calculation
C. after effective address calculation has completed
D. after data cache lookup has completed
gate2008-cse co-and-architecture virtual-memory normal
Answer ☟
Answers: Virtual Memory

1.21.1 Virtual Memory: GATE CSE 1991 | Question: 03,iii top☝ ☛ https://gateoverflow.in/517

The answer is (A) and (B).
Virtual memory concept is independent of size of main memory and depends only on the availability of the secondary storage.
MAR holds the address generated by CPU and this obviously limits the total virtual memory address space.
 34 votes -- Kalpna Bhargav (2.5k points)

Average Instruction execution time
= Average CPU execution time + Average time for getting data(instruction operands from memory for each instruction)
= Average CPU execution time

+ Average address translation time for each instruction
+ Average memory fetch time for each instruction
+ Average page fault time for each instruction
1
= 100 + 2((0.9(0) + 0.1(2 × 150))) + 2 × 150 + × 8 × 106
10000
(Page Fault Rate per 10,000 instruction is directly given in question.Two memory accesses per instruction and hence we need 2
× address translation time for average instruction execution time)
[ TLB access time assumed as 0 and 2 page tables need to be accessed in case of TLB miss as the system uses two-level paging
]
= 100 + 60 + 300 + 800
= 1260ns

C is the answer here.
Effective address is the address after applying the addressing mode like indexed, immediate etc. But this resulting address is still
the virtual address, the physical address is invisible to the CPU and will be given only by the MMU when given the
corresponding virtual address. Virtual address is given for TLB look up. TLB -Translation Lookaside Buffer, here Lookaside
means during Address translation (from Virtual to Physical). But virtual address must be there before we look into TLB.
https://gateoverflow.in/?qa=blob&qa_blobid=15279338060050073946
References
Click here for Subscribing to GO Classes

Answer Keys
1.1.1 C 1.1.2 N/A 1.1.3 N/A 1.1.4 N/A 1.1.5 B
1.1.6 D 1.1.7 A;B;C;D 1.1.8 C 1.1.9 A 1.1.10 B

1.1.11 C 1.1.12 B 1.1.13 C 1.1.14 C 1.1.15 D
1.1.16 D 1.1.17 D 1.1.18 D 1.2.1 N/A 1.2.2 N/A
1.2.3 22 1.2.4 180 1.2.5 N/A 1.2.6 D 1.2.7 D
1.2.8 61.25 1.2.9 N/A 1.2.10 B 1.2.11 B 1.2.12 N/A
1.2.13 N/A 1.2.14 C 1.2.15 A 1.2.16 A 1.2.17 D
1.2.18 C 1.2.19 B 1.2.20 D 1.2.21 C 1.2.22 A
1.2.23 A 1.2.24 D 1.2.25 B 1.2.26 C 1.2.27 D
1.2.28 C 1.2.29 C 1.2.30 D 1.2.31 C 1.2.32 A
1.2.33 A 1.2.34 A 1.2.35 D 1.2.36 D 1.2.37 20
1.2.38 1.68 1.2.39 14 1.2.40 A 1.2.41 24 1.2.42 30
1.2.43 0.05 1.2.44 14 1.2.45 A 1.2.46 4.7 : 4.8 1.2.47 18
1.2.48 B 1.2.49 D 1.2.50 160 1.2.51 13.49:13.59 1.2.52 B
17 :
1.2.53 1.2.54 2:2 1.2.55 A 1.2.56 C 1.2.57 B
17
1.2.58 C 1.2.59 C 1.2.60 A 1.2.61 B 1.2.62 D
1.2.63 A 1.3.1 A;B;C;D 1.3.2 D 1.4.1 N/A 1.4.2 B
1.5.1 76 1.6.1 B 1.7.1 B 1.7.2 B 1.8.1 N/A
1.8.2 D 1.8.3 B 1.8.4 B 1.8.5 28 1.8.6 C
80000 :
1.9.1 D 1.9.2 B 1.9.3 A 1.9.4 456 1.9.5
80000
1.9.6 C 1.10.1 A 1.11.1 False 1.11.2 N/A 1.11.3 C
1.11.4 C 1.11.5 A 1.11.6 -16.0 1.11.7 1.87 : 1.88 1.12.1 N/A
1.12.2 256 1.12.3 True 1.12.4 16383 1.12.5 500 1.12.6 32
1.12.7 14 1.13.1 D 1.13.2 B 1.13.3 A 1.13.4 A
1.13.5 B 1.13.6 C 1.13.7 C 1.14.1 True 1.14.2 True
1.14.3 False 1.14.4 A 1.14.5 1.4 : 1.5 1.14.6 B 1.14.7 B
1.15.1 N/A 1.15.2 N/A 1.15.3 N/A 1.15.4 B 1.15.5 A
1.15.6 D 1.15.7 B 1.15.8 C 1.15.9 B 1.15.10 D
1.15.11 A 1.15.12 C 1.15.13 D 1.15.14 D 1.15.15 16
50 :
1.15.16 1.15.17 D 1.15.18 C 1.15.19 A 1.16.1 31
50
59 :
1.16.2 1.17.1 N/A 1.17.2 C 1.17.3 C 1.17.4 B
60
1.17.5 D 1.17.6 A 1.17.7 D 1.17.8 A 1.17.9 D
1.17.10 B 1.17.11 D 1.17.12 D 1.18.1 15 1.18.2 B
1.18.3 N/A 1.18.4 N/A 1.18.5 D 1.18.6 D 1.18.7 C
1.18.8 A 1.18.9 C 1.18.10 B 1.18.11 B 1.18.12 A
1.18.13 D 1.18.14 B 1.18.15 B 1.18.16 B 1.18.17 C

1.18.18 B 1.18.19 4 1.18.20 1.50 : 1.60 1.18.21 C 1.18.22 3.2
1.50 :
1.18.23 13 1.18.24 3 1.18.25 33.0 : 34.0 1.18.26 4 1.18.27
1.51
17160 :
1.18.28 219 1.18.29 2.161:2.169 1.18.30 1.18.31 C 1.18.32 B
17160
1.18.33 C 1.18.34 B 1.18.35 B 1.18.36 C 1.19.1 X
1.19.2 A 1.20.1 1.6 1.20.2 A 1.21.1 A;B 1.21.2 D
1.21.3 C
CLICK here to order hardcopy of this book

Gate CoA

Uploaded by

Copyright:

Available Formats

Gate CoA

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Gate CoA

Uploaded by

Copyright:

Available Formats

1 CO and Architecture (206)

Mark Distribution in Previous GATE

1.1 Addressing Modes (18) top☝

The most relevant addressing mode to write position-independent codes is:

gate1987 co-and-architecture addressing-modes

gate1988 normal descriptive co-and-architecture addressing-modes

Match the pairs in the following questions:

(A) Base addressing (p) Reentranecy

gate1989 match-the-following co-and-architecture addressing-modes easy

1.1.4 Addressing Modes: GATE CSE 1993 | Question: 10 top☝ ☛ https://gateoverflow.in/2307

The instruction format of a CPU is:

© Copyright GATE Overflow. Some rights reserved.

A. What is the address of the operand?

gate1993 co-and-architecture addressing-modes normal descriptive

Relative mode of addressing is most relevant to writing:

gate1996 co-and-architecture addressing-modes easy isro2016

gate1998 co-and-architecture addressing-modes easy

gate1999 co-and-architecture addressing-modes normal multiple-selects

The most appropriate matching for the following pairs

© Copyright GATE Overflow. Some rights reserved.

gate2000-cse co-and-architecture normal addressing-modes

X. Indirect Addressing I. Array implementation

A. (X, III), (Y, I), (Z, II)

gate2001-cse co-and-architecture addressing-modes normal

In the absolute addressing mode:

A. the operand is inside the instruction

gate2002-cse co-and-architecture addressing-modes easy

1.1.11 Addressing Modes: GATE CSE 2004 | Question: 20 top☝ ☛ https://gateoverflow.in/1017

gate2004-cse co-and-architecture addressing-modes easy

© Copyright GATE Overflow. Some rights reserved.

Consider a three word machine instruction

gate2005-cse co-and-architecture addressing-modes normal

1.1.13 Addressing Modes: GATE CSE 2005 | Question: 66 top☝ ☛ https://gateoverflow.in/1389

(1) A[I] = B[J] (a) Indirect addressing

A. (1, c), (2, b), (3, a)

gate2005-cse co-and-architecture addressing-modes easy

Which of the following is/are true of the auto-increment addressing mode?

I. It is useful in creating self-relocating code

gate2008-cse addressing-modes co-and-architecture normal isro2009

1.1.15 Addressing Modes: GATE CSE 2011 | Question: 21 top☝ ☛ https://gateoverflow.in/2123

© Copyright GATE Overflow. Some rights reserved.

gate2011-cse co-and-architecture addressing-modes easy

Consider the C struct defined below:

A. Post-increment addressing mode, (R1)+

gate2017-cse-set1 co-and-architecture addressing-modes

Which of the following statements about relative addressing mode is FALSE?

A. It enables reduced instruction size

gate2006-it co-and-architecture addressing-modes normal isro2009

1.1.18 Addressing Modes: GATE IT 2006 | Question: 40 top☝ ☛ https://gateoverflow.in/3581

MOVI Rs , 1 ; Move immediate

Which of the statements below is TRUE after the program is executed ?

gate2006-it co-and-architecture addressing-modes normal

Answers: Addressing Modes

© Copyright GATE Overflow. Some rights reserved.

 26 votes -- srestha (85.2k points)

But MOV 2(R0 ), R1

 32 votes -- Prashant Singh (47.2k points)

1.1.4 Addressing Modes: GATE CSE 1993 | Question: 10 top☝ ☛ https://gateoverflow.in/2307

 31 votes -- Arjun Suresh (332k points)