Gate CoA
Machine instructions and addressing modes. ALU, data-path and control unit. Instruction pipelining. Pipeline hazards. Memory
hierarchy: cache, main memory and secondary storage; I/O interface (interrupt and DMA mode)
1.1.1 Addressing Modes: GATE CSE 1987 | Question: 1-V top☝ ☛ https://gateoverflow.in/80194
Answer ☟
1.1.2 Addressing Modes: GATE CSE 1988 | Question: 9iii top☝ ☛ https://gateoverflow.in/94388
In the program scheme given below indicate the instructions containing any operand needing relocation for position
independent behaviour. Justify your answer.
Y = 10
MOV X(R0), R1
MOV X, R0
MOV 2(R0), R1
MOV Y(R0), R5
⋮
X: WORD 0, 0, 0
Answer ☟
1.1.3 Addressing Modes: GATE CSE 1989 | Question: 2-ii top☝ ☛ https://gateoverflow.in/87078
Answer ☟
Answer ☟
1.1.5 Addressing Modes: GATE CSE 1996 | Question: 1.16, ISRO2016-42 top☝ ☛ https://gateoverflow.in/2720
Answer ☟
1.1.6 Addressing Modes: GATE CSE 1998 | Question: 1.19 top☝ ☛ https://gateoverflow.in/1656
Which of the following addressing modes permits relocation without any change whatsoever in the code?
A. Indirect addressing
B. Indexed addressing
C. Base register addressing
D. PC relative addressing
Answer ☟
1.1.7 Addressing Modes: GATE CSE 1999 | Question: 2.23 top☝ ☛ https://gateoverflow.in/1500
A certain processor supports only the immediate and the direct addressing modes. Which of the following programming
language features cannot be implemented on this processor?
A. Pointers
B. Arrays
C. Records
D. Recursive procedures with local variable
Answer ☟
1.1.8 Addressing Modes: GATE CSE 2000 | Question: 1.10 top☝ ☛ https://gateoverflow.in/633
The most appropriate matching for the following pairs
X: Indirect addressing          1: Loops
Y: Immediate addressing         2: Pointers
Z: Auto decrement addressing    3: Constants
is
A. X − 3, Y − 2, Z − 1
B. X − 1, Y − 3, Z − 2
C. X − 2, Y − 3, Z − 1
D. X − 3, Y − 1, Z − 2
Answer ☟
1.1.9 Addressing Modes: GATE CSE 2001 | Question: 2.9 top☝ ☛ https://gateoverflow.in/727
Which is the most appropriate match for the items in the first column with the items in the second column:
Answer ☟
1.1.10 Addressing Modes: GATE CSE 2002 | Question: 1.24 top☝ ☛ https://gateoverflow.in/829
Which of the following addressing modes are suitable for program relocation at run time?
I. Absolute addressing
II. Based addressing
III. Relative addressing
IV. Indirect addressing
A. I and IV
B. I and II
C. II and III
D. I, II and IV
Answer ☟
Answer ☟
Match each of the high level language statements given on the left hand side with the most natural addressing mode from
those listed on the right hand side.
Answer ☟
1.1.14 Addressing Modes: GATE CSE 2008 | Question: 33, ISRO2009-80 top☝ ☛ https://gateoverflow.in/444
Which of the following is/are true of the auto-increment addressing mode?
I. It is useful in creating self-relocating code
II. If it is included in an Instruction Set Architecture, then an additional ALU is required for effective address calculation
III. The amount of increment depends on the size of the data item accessed
A. I only
B. II only
C. III only
D. II and III only
Answer ☟
Consider a hypothetical processor with an instruction of type LW R1, 20(R2), which during execution reads a 32-bit
word from memory and stores it in a 32-bit register R1. The effective address of the memory location is obtained by the addition
of a constant 20 and the contents of register R2. Which of the following best reflects the addressing mode implemented by this
instruction for the operand in memory?
A. Immediate addressing
B. Register addressing
C. Register Indirect Scaled Addressing
D. Base Indexed Addressing
Answer ☟
1.1.16 Addressing Modes: GATE CSE 2017 Set 1 | Question: 11 top☝ ☛ https://gateoverflow.in/118291
Consider the C struct defined below:
struct data {
    int  marks[100];
    char grade;
    int  cnumber;
};
struct data student;
The base address of student is available in register R1. The field student.grade can be accessed efficiently using:
A. Post-increment addressing mode, (R1)+
B. Pre-decrement addressing mode, -(R1)
C. Register direct addressing mode, R1
D. Index addressing mode, X(R1), where X is an offset represented in the instruction
Answer ☟
1.1.17 Addressing Modes: GATE IT 2006 | Question: 39, ISRO2009-42 top☝ ☛ https://gateoverflow.in/3578
Answer ☟
The memory locations 1000, 1001 and 1020 have data values 18, 1 and 16 respectively before the following program is
executed.
Answer ☟
1.1.1 Addressing Modes: GATE CSE 1987 | Question: 1-V top☝ ☛ https://gateoverflow.in/80194
1.1.2 Addressing Modes: GATE CSE 1988 | Question: 9iii top☝ ☛ https://gateoverflow.in/94388
MOV instructions here are problematic for Position Independent behaviour if they use Indexed addressing mode and the
base address is loaded using some Absolute value.
I'm assuming R0 holds some relative address. So, the first two MOV instructions are fine for position independent
behaviour.
1.1.3 Addressing Modes: GATE CSE 1989 | Question: 2-ii top☝ ☛ https://gateoverflow.in/87078
(A) Base addressing (s) Position independent (By changing the value in
base register, location of address can be changed)
(B) Indexed addressing (r) Array
(C) Stack addressing (p) Reentrancy (Whenever code happens to be used
again, the address need not be the same)
(D) Implied addressing (q) Accumulator (If an address is not specified, it is
assumed/implied to be the Accumulator)
A. Address of the operand = content of PC = 2001 as PC holds the address of the next instruction to be executed and
instruction size is 1 − word as given in the diagram.
B. After execution of the current instruction PC will be automatically incremented by 1 when the next instruction is fetched.
Also one extra increment will be done by operand fetch. So, PC = 2003 supposing next instruction is fetched. If we
assume next instruction fetch is not done (this should be the default here), it should be 2002 .
1.1.5 Addressing Modes: GATE CSE 1996 | Question: 1.16, ISRO2016-42 top☝ ☛ https://gateoverflow.in/2720
Answer is (B).
Relative mode addressing is most relevant to writing position-independent code.
Reference: http://en.wikipedia.org/wiki/Addressing_mode#PC-relative
References
1.1.6 Addressing Modes: GATE CSE 1998 | Question: 1.19 top☝ ☛ https://gateoverflow.in/1656
(D) PC relative addressing is the best option. For Base register addressing, we have to change the address in the base
register while in PC relative there is absolutely no change in code needed.
1.1.7 Addressing Modes: GATE CSE 1999 | Question: 2.23 top☝ ☛ https://gateoverflow.in/1500
Pointer access requires indirect addressing which can be simulated with indexed addressing or register indirect
addressing but not with direct and immediate addressing. An array and record access needs a pointer access. So, options (A), (B)
and (C) cannot be implemented on such a processor.
Now, to handle recursive procedures we need to use stack. A local variable inside the stack will be accessed as *(SP + offset)
which is nothing but a pointer access and requires indirect addressing. Usually this is done by moving the SP value to Base
register and then using Base Relative addressing to avoid unnecessary memory accesses for indirect addressing- but not possible
with just direct and immediate addressing.
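As a small illustration (a hedged sketch, not part of the original answer): in the C function below, each recursive activation gets its own copy of the local variable on the stack, so the compiled code must address it relative to the stack/frame pointer, which is exactly the indirect access that a processor with only immediate and direct addressing cannot express. The offset in the comments is hypothetical.
int fact(int n)                /* n lives at, say, SP + 8 in each activation (hypothetical offset) */
{
    if (n <= 1)
        return 1;              /* constant 1: immediate addressing suffices here */
    return n * fact(n - 1);    /* reading n needs something like MOV R1, 8(SP):
                                  register-relative, i.e. indirect addressing    */
}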
1.1.8 Addressing Modes: GATE CSE 2000 | Question: 1.10 top☝ ☛ https://gateoverflow.in/633
(C) is the most appropriate one.
1.1.9 Addressing Modes: GATE CSE 2001 | Question: 2.9 top☝ ☛ https://gateoverflow.in/727
(A) is the answer.
While passing array as parameter we can make use of a pointer (as in (C)) and hence can use Indirect addressing.
Base Register addressing can be used to write relocatable code by changing the content of Base Register.
1.1.10 Addressing Modes: GATE CSE 2002 | Question: 1.24 top☝ ☛ https://gateoverflow.in/829
(B) is the answer. Absolute addressing mode means address of operand is given in the instruction.
Answer: (C)
A displacement type addressing should be preferred. So, (I) is not the answer.
Indirect Addressing leads to extra memory reference which is not preferable at run time. So, (IV) is not the answer.
1 memory read to get the first operand from memory address A + R0 (A is given as part of instruction)
1 memory read to get the address of the second operand (since second uses indirect addressing)
1 memory read to get the second operand from the address given by the previous memory read
1 memory write to store to first operand (which is the destination)
Correct Answer: B
References
C is the answer.
1.1.14 Addressing Modes: GATE CSE 2008 | Question: 33, ISRO2009-80 top☝ ☛ https://gateoverflow.in/444
In auto increment addressing mode, the base address is incremented after operand fetch. This is useful in fetching
elements from an array. But this has no effect in self-relocating code (where code can be loaded to any address) as this works on
the basis of an initial base address.
An additional ALU is desirable for better execution especially with pipelining, but never a necessity.
Amount of increment depends on the size of the data item accessed as there is no need to fetch a part of a data.
So, answer must be C only.
Answer is (D).
Base Index Addressing, as the content of register R2 will serve as the index and 20 will be the Base address.
1.1.16 Addressing Modes: GATE CSE 2017 Set 1 | Question: 11 top☝ ☛ https://gateoverflow.in/118291
Answer is option (D).
Displacement Mode:
Similar to index mode, except instead of an index register a base register is used. The base register contains a pointer to a
memory location, and an integer constant, also referred to as a displacement, is given in the instruction. The address of the operand is
obtained by adding the contents of the base register to the constant. The difference between index mode and displacement mode is in the
number of bits used to represent the constant: when the constant is represented using enough bits to address the whole memory, we have
index mode. Index mode is more appropriate for array accessing; displacement mode is more appropriate for structure (record)
accessing.
Reference:
http://www.cs.iit.edu/~cs561/cs350/addressing/addsclm.html
References
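To make the record case concrete (a sketch using the struct from the question; offsetof is standard C): the field offset is a compile-time constant, so student.grade is naturally fetched as base register plus displacement.
#include <stddef.h>
#include <stdio.h>

struct data {
    int  marks[100];
    char grade;
    int  cnumber;
};

int main(void)
{
    struct data student;
    /* If R1 held &student, student.grade would be addressed as
     * offset(R1), where offset = offsetof(struct data, grade) is a
     * constant displacement known at compile time. */
    printf("grade is at base + %zu bytes\n", offsetof(struct data, grade));
    student.grade = 'A';
    return 0;
}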
1.1.17 Addressing Modes: GATE IT 2006 | Question: 39, ISRO2009-42 top☝ ☛ https://gateoverflow.in/3578
A. is true as instead of absolute address we can use a much smaller relative address in instructions which results in smaller
instruction size.
B. By using the base address of array as displacement and index in a base register (base relative addressing mode) we can
index array elements using relative addressing.
Ref: http://service.scs.carleton.ca/sivarama/asm_book_web/Instructor_copies/ch5_addrmodes.pdf
C. is true as we only need to change the base address in case of relocation- instructions remain the same.
D. is false. Relative addressing cannot be faster than absolute addressing, as the absolute address must be calculated from the relative
address. With a specialized hardware unit, this can perform as well as absolute addressing, but not faster.
Before the execution of the program, the memory locations 1000, 1001 and 1020 hold the data values 18, 1 and 16 respectively.
1. Instruction-01: MOVI Rs, 1
This instruction uses immediate addressing mode.
The instruction is interpreted as Rs ← 1.
Thus, value = 1 is moved to the register Rs.
2. Instruction-02: LOAD Rd, 1000(Rs)
This instruction uses indexed addressing mode.
The instruction is interpreted as Rd ← [1000 + [Rs]].
Value of the operand = [1000 + [Rs]] = [1000 + 1] = [1001] = 1.
Thus, value = 1 is moved to the register Rd.
3. Instruction-03: ADDI Rd, 1000
This instruction uses immediate addressing mode.
The instruction is interpreted as Rd ← [Rd] + 1000.
Value of the operand = [Rd] + 1000 = 1 + 1000 = 1001.
Thus, value = 1001 is moved to the register Rd.
4. Instruction-04: STOREI 0(Rd), 20
This instruction uses indexed addressing mode.
The instruction is interpreted as [0 + [Rd]] ← 20.
Value of the destination address = 0 + [Rd] = 0 + 1001 = 1001.
Thus, value = 20 is moved to the memory location 1001.
Hence, after the program execution is completed, memory location 1001 has a value 20.
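As a quick cross-check (a sketch, not from the original answer), the following C program replays the four-instruction trace on a simulated memory and prints the final value at location 1001:
#include <stdio.h>

int main(void)
{
    static int mem[2048];       /* simulated memory                  */
    int Rs, Rd;
    mem[1000] = 18; mem[1001] = 1; mem[1020] = 16;

    Rs = 1;                     /* MOVI   Rs, 1        (immediate)   */
    Rd = mem[1000 + Rs];        /* LOAD   Rd, 1000(Rs) (indexed)     */
    Rd = Rd + 1000;             /* ADDI   Rd, 1000     (immediate)   */
    mem[0 + Rd] = 20;           /* STOREI 0(Rd), 20    (indexed)     */

    printf("mem[1001] = %d\n", mem[1001]);   /* prints 20 */
    return 0;
}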
Answer ☟
A certain computer system was designed with a cache memory of size 1 Kbytes and main memory size of 256 Kbytes. The
cache implementation was a fully associative cache with 4 bytes per block. The CPU memory data path was 16 bits and the memory
was 2-way interleaved. Each memory read request presents two 16-bit words. A program with the model shown below was run to
evaluate the cache design.
Answer ☟
A block-set associative cache memory consists of 128 blocks divided into four-block sets. The main memory consists of
16,384 blocks and each block contains 256 eight-bit words.
1. How many bits are required for addressing the main memory?
2. How many bits are needed to represent the TAG, SET and WORD fields?
Answer ☟
1.2.4 Cache Memory: GATE CSE 1992 | Question: 5-a top☝ ☛ https://gateoverflow.in/584
The access times of the main memory and the cache memory, in a computer system, are 500 nsec and 50 nsec, respectively.
It is estimated that 80% of the main memory requests are for read and the rest for write. The hit ratio for read accesses only is 0.9 and a
write-through policy (where both main and cache memories are updated simultaneously) is used. Determine the average memory
access time (in ns).
Answer ☟
In the three-level memory hierarchy shown in the following table, pi denotes the probability that an access request will refer to
Mi .
If a miss occurs at level Mi , a page transfer occurs from Mi+1 to Mi and the average time required for such a page swap is Ti .
Calculate the average time tA required for a processor to read one word from this memory system.
Answer ☟
1.2.6 Cache Memory: GATE CSE 1995 | Question: 1.6 top☝ ☛ https://gateoverflow.in/2593
Answer ☟
1.2.7 Cache Memory: GATE CSE 1995 | Question: 2.25 top☝ ☛ https://gateoverflow.in/2638
A computer system has a 4 K word cache organized in block-set-associative manner with 4 blocks per set, 64 words per
block. The number of bits in the SET and WORD fields of the main memory address format is:
A. 15, 40
B. 6, 4
C. 7, 2
D. 4, 6
Answer ☟
A computer system has a three-level memory hierarchy, with access time and hit ratios as shown below:
Level 1 (Cache memory): Access time = 50 nsec/byte
    Size: 8M bytes (hit ratio 0.80), 16M bytes (0.90), 64M bytes (0.95)
Level 2 (Main memory): Access time = 200 nsec/byte
    Size: 4M bytes (hit ratio 0.98), 16M bytes (0.99), 64M bytes (0.995)
Level 3: Access time = 5 μsec/byte
    Size: 260M bytes (hit ratio 1.0)
A. What should be the minimum sizes of level 1 and 2 memories to achieve an average access time of less than 100nsec ?
B. What is the average access time achieved using the chosen sizes of level 1 and level 2 memories?
Answer ☟
Calculate the hit ratio for a loop executed 100 times where the size of the loop is n × b , and n = k × m is a non-zero integer and
1 ≤ m ≤ l.
Answer ☟
1.2.10 Cache Memory: GATE CSE 1999 | Question: 1.22 top☝ ☛ https://gateoverflow.in/1475
The main memory of a computer has 2cm blocks while the cache has 2c blocks. If the cache uses the set associative
mapping scheme with 2 blocks per set, then block k of the main memory maps to the set:
Answer ☟
1.2.11 Cache Memory: GATE CSE 2001 | Question: 1.7, ISRO2008-18 top☝ ☛ https://gateoverflow.in/700
More than one word are put in one cache block to:
A. exploit the temporal locality of reference in a program
B. exploit the spatial locality of reference in a program
C. reduce the miss penalty
D. none of the above
Answer ☟
A CPU has a 32-bit memory address and a 256 KB cache memory. The cache is organized as a 4-way set associative
cache with cache block size of 16 bytes.
Answer ☟
In a C program, an array is declared as float A[2048] . Each array element is 4 Bytes in size, and the starting address of the
array is 0x00000000 . This program is run on a computer that has a direct mapped data cache of size 8 Kbytes , with block (line)
size of 16 Bytes .
A. Which elements of the array conflict with element A[0] in the data cache? Justify your answer briefly.
B. If the program accesses the elements of this array one by one in reverse order i.e., starting with the last element and ending
with the first element, how many data cache misses would occur? Justify your answer briefly. Assume that the data cache is
initially empty and that no other data or instruction accesses are to be considered.
Answer ☟
Consider a small two-way set-associative cache memory, consisting of four blocks. For choosing the block to be replaced, use
the least recently used (LRU) scheme. The number of cache misses for the following sequence of block addresses is:
8, 12, 0, 12, 8 .
A. 2
B. 3
C. 4
D. 5
Answer ☟
Consider a direct mapped cache of size 32 KB with block size 32 bytes . The CPU generates 32 bit addresses. The number
of bits needed for cache indexing and the number of tag bits are respectively,
A. 10, 17
B. 10, 22
C. 15, 17
D. 5, 17
Answer ☟
Consider two cache organizations. First one is 32 KB 2-way set associative with 32 byte block size, the second is of the same
size but direct mapped. The size of an address is 32 bits in both cases. A 2-to-1 multiplexer has a latency of 0.6 ns while a
k-bit comparator has a latency of k/10 ns. The hit latency of the set associative organization is h1 while that of direct mapped is h2.
The value of h1 is:
A. 2.4 ns
B. 2.3 ns
C. 1.8 ns
D. 1.7 ns
Answer ☟
Consider two cache organizations. First one is 32 kB 2-way set associative with 32 byte block size, the second is of the same size
but direct mapped. The size of an address is 32 bits in both cases. A 2-to-1 multiplexer has a latency of 0.6 ns while a
k-bit comparator has a latency of k/10 ns. The hit latency of the set associative organization is h1 while that of direct mapped is h2.
The value of h2 is:
A. 2.4 ns
B. 2.3 ns
C. 1.8 ns
D. 1.7 ns
Answer ☟
A CPU has a 32 KB direct mapped cache with 128 byte block size. Suppose A is a two dimensional array of size 512 × 512
with elements that occupy 8 bytes each. Consider the following two C code segments, P1 and P2.
P1:
for (i=0; i<512; i++) {
    for (j=0; j<512; j++) {
        x += A[i][j];
    }
}
P2:
for (i=0; i<512; i++) {
    for (j=0; j<512; j++) {
        x += A[j][i];
    }
}
P1 and P2 are executed independently with the same initial state, namely, the array A is not in the cache and i, j, x are in
registers. Let the number of cache misses experienced by P1 be M1 and that for P2 be M2.
The value of M1 is:
A. 0
B. 2048
C. 16384
D. 262144
Answer ☟
A CPU has a 32 KB direct mapped cache with 128 byte block size. Suppose A is a two dimensional array of size 512 × 512
with elements that occupy 8 bytes each. Consider the following two C code segments, P1 and P2.
P1:
for (i=0; i<512; i++) {
    for (j=0; j<512; j++) {
        x += A[i][j];
    }
}
P2:
for (i=0; i<512; i++) {
    for (j=0; j<512; j++) {
        x += A[j][i];
    }
}
P1 and P2 are executed independently with the same initial state, namely, the array A is not in the cache and i, j, x are in registers.
Let the number of cache misses experienced by P1 be M1 and that for P2 be M2.
The value of the ratio M1/M2 is:
A. 0
B. 1/16
C. 1/8
D. 16
Answer ☟
Consider a 4-way set associative cache consisting of 128 lines with a line size of 64 words. The CPU generates a 20-bit
address of a word in main memory. The number of bits in the TAG, LINE and WORD fields are respectively:
A. 9, 6, 5
B. 7, 7, 6
C. 7, 5, 8
D. 9, 5, 6
Answer ☟
Consider a machine with a byte addressable main memory of 2^16 bytes. Assume that a direct mapped data cache consisting of
32 lines of 64 bytes each is used in the system. A 50 × 50 two-dimensional array of bytes is stored in the main memory starting
from memory location 1100H . Assume that the data cache is initially empty. The complete array is accessed twice. Assume that the
contents of the data cache do not change in between the two accesses.
How many data misses will occur in total?
A. 48
B. 50
C. 56
D. 59
Answer ☟
Answer ☟
For inclusion to hold between two cache levels L1 and L2 in a multi-level cache hierarchy, which of the following are
necessary?
A. IV only
B. I and IV only
C. I, II and IV only
D. I, II, III and IV
Answer ☟
Consider a machine with a 2-way set associative data cache of size 64 Kbytes and block size 16 bytes . The cache is
managed using 32 bit virtual addresses and the page size is 4 Kbytes . A program to be run on this machine begins as follows:
double ARR[1024][1024];
int i, j;
/*Initialize array ARR to 0.0 */
for(i = 0; i < 1024; i++)
for(j = 0; j < 1024; j++)
ARR[i][j] = 0.0;
The size of double is 8 bytes. Array ARR is located in memory starting at the beginning of virtual page 0xFF000 and stored in
row major order. The cache is initially empty and no pre-fetching is done. The only data memory references made by the program
are those to array ARR .
The total size of the tags in the cache directory is:
A. 32 Kbits
B. 34 Kbits
C. 64 Kbits
D. 68 Kbits
Answer ☟
Consider a machine with a 2-way set associative data cache of size 64 Kbytes and block size 16 bytes. The cache is managed
using 32 bit virtual addresses and the page size is 4 Kbytes. A program to be run on this machine begins as follows:
double ARR[1024][1024];
int i, j;
/*Initialize array ARR to 0.0 */
The size of double is 8 bytes. Array ARR is located in memory starting at the beginning of virtual page 0xFF000 and stored in
row major order. The cache is initially empty and no pre-fetching is done. The only data memory references made by the program
are those to array ARR .
Which of the following array elements have the same cache index as ARR[0][0]?
A. ARR[0][4]
B. ARR[4][0]
C. ARR[0][5]
D. ARR[5][0]
Answer ☟
Consider a machine with a 2-way set associative data cache of size 64 Kbytes and block size 16 bytes . The cache is
managed using 32 bit virtual addresses and the page size is 4 Kbytes . A program to be run on this machine begins as follows:
double ARR[1024][1024];
int i, j;
/*Initialize array ARR to 0.0 */
for(i = 0; i < 1024; i++)
for(j = 0; j < 1024; j++)
ARR[i][j] = 0.0;
The size of double is 8 bytes. Array ARR is located in memory starting at the beginning of virtual page 0xFF000 and stored in
row major order. The cache is initially empty and no pre-fetching is done. The only data memory references made by the program
are those to array ARR .
The cache hit ratio for this initialization loop is:
A. 0%
B. 25%
C. 50%
D. 75%
Answer ☟
Consider a 4-way set associative cache (initially empty) with total 16 cache blocks. The main memory consists of 256 blocks
and the request for memory blocks are in the following order:
0, 255, 1, 4, 3, 8, 133, 159, 216, 129, 63, 8, 48, 32, 73, 92, 155.
Which one of the following memory block will NOT be in cache if LRU replacement policy is used?
A. 3
B. 8
C. 129
D. 216
Answer ☟
A computer system has an L1 cache, an L2 cache, and a main memory unit connected as shown below. The block size in L1
cache is 4 words. The block size in L2 cache is 16 words. The memory access times are 2 nanoseconds, 20 nanoseconds and 200
nanoseconds for L1 cache, L2 cache and the main memory unit respectively.
Answer ☟
A computer system has an L1 cache, an L2 cache, and a main memory unit connected as shown below. The block size in L1
cache is 4 words. The block size in L2 cache is 16 words. The memory access times are 2 nanoseconds, 20 nanoseconds and 200
nanoseconds for L1 cache, L2 cache and the main memory unit respectively.
When there is a miss in both L1 cache and L2 cache, first a block is transferred from main memory to L2 cache, and then a block is
transferred from L2 cache to L1 cache. What is the total time taken for these transfers?
A. 222 nanoseconds
B. 888 nanoseconds
C. 902 nanoseconds
D. 968 nanoseconds
Answer ☟
An 8 KB direct-mapped write-back cache is organized as multiple blocks, each of size 32 bytes. The processor generates
32-bit addresses. The cache controller maintains tag information for each cache block, comprising the following.
1 valid bit
1 modified bit
As many bits as the minimum needed to identify the memory block mapped in the cache.
What is the total size of memory needed at the cache controller to store meta-data (tags) for the cache?
A. 4864 bits
B. 6144 bits
C. 6656 bits
D. 5376 bits
Answer ☟
A computer has a 256-KByte, 4-way set associative, write back data cache with block size of 32-Bytes . The processor sends
32-bit addresses to the cache controller. Each cache tag directory entry contains, in addition to address tag, 2 valid bits, 1 modified
bit and 1 replacement bit.
The number of bits in the tag field of an address is
A. 11
B. 14
C. 16
D. 27
Answer ☟
A computer has a 256-KByte , 4-way set associative, write back data cache with block size of 32 Bytes . The processor sends
32 bit addresses to the cache controller. Each cache tag directory entry contains, in addition to address tag, 2 valid bits, 1 modified
bit and 1 replacement bit.
The size of the cache tag directory is:
A. 160 Kbits
B. 136 Kbits
C. 40 Kbits
D. 32 Kbits
Answer ☟
In a k -way set associative cache, the cache is divided into v sets, each of which consists of k lines. The lines of a set are
placed in sequence one after another. The lines in set s are sequenced before the lines in set (s + 1) . The main memory blocks are
numbered 0 onwards. The main memory block numbered j must be mapped to any one of the cache lines from
A. (j mod v) ∗ k to (j mod v) ∗ k + (k − 1)
B. (j mod v) to (j mod v) + (k − 1)
C. (j mod k) to (j mod k) + (v − 1)
D. (j mod k) ∗ v to (j mod k) ∗ v + (v − 1)
Answer ☟
1.2.34 Cache Memory: GATE CSE 2014 Set 1 | Question: 44 top☝ ☛ https://gateoverflow.in/1922
An access sequence of cache block addresses is of length N and contains n unique block addresses. The number of unique
block addresses between two consecutive accesses to the same block address is bounded above by k . What is the miss ratio if the
access sequence is passed through a cache of associativity A ≥ k exercising least-recently-used replacement policy?
A. (n/N)
B. (1/N)
C. (1/A)
D. (k/n)
1.2.35 Cache Memory: GATE CSE 2014 Set 2 | Question: 43 top☝ ☛ https://gateoverflow.in/2009
In designing a computer's cache system, the cache block (or cache line) size is an important parameter. Which one of the
following statements is correct in this context?
Answer ☟
1.2.36 Cache Memory: GATE CSE 2014 Set 2 | Question: 44 top☝ ☛ https://gateoverflow.in/2010
If the associativity of a processor cache is doubled while keeping the capacity and block size unchanged, which one of the
following is guaranteed to be NOT affected?
A. Width of tag comparator
B. Width of set index decoder
C. Width of way selection multiplexer
D. Width of processor to main memory data bus
Answer ☟
1.2.37 Cache Memory: GATE CSE 2014 Set 2 | Question: 9 top☝ ☛ https://gateoverflow.in/1963
A 4-way set-associative cache memory unit with a capacity of 16 KB is built using a block size of 8 words. The word length is
32 bits. The size of the physical address space is 4 GB. The number of bits for the TAG field is ____
Answer ☟
1.2.38 Cache Memory: GATE CSE 2014 Set 3 | Question: 44 top☝ ☛ https://gateoverflow.in/2078
The memory access time is 1 nanosecond for a read operation with a hit in cache, 5 nanoseconds for a read operation with a
miss in cache, 2 nanoseconds for a write operation with a hit in cache and 10 nanoseconds for a write operation with a miss in
cache. Execution of a sequence of instructions involves 100 instruction fetch operations, 60 memory operand read operations and
40 memory operand write operations. The cache hit-ratio is 0.9 . The average memory access time (in nanoseconds) in executing the
sequence of instructions is ______.
Answer ☟
1.2.39 Cache Memory: GATE CSE 2015 Set 2 | Question: 24 top☝ ☛ https://gateoverflow.in/8119
Assume that for a certain processor, a read request takes 50 nanoseconds on a cache miss and 5 nanoseconds on a cache
hit. Suppose while running a program, it was observed that 80% of the processor's read requests result in a cache hit. The average
read access time in nanoseconds is ______.
Answer ☟
1.2.40 Cache Memory: GATE CSE 2015 Set 3 | Question: 14 top☝ ☛ https://gateoverflow.in/8410
Consider a machine with a byte addressable main memory of 2^20 bytes, block size of 16 bytes and a direct mapped cache
having 2^12 cache lines. Let the addresses of two consecutive bytes in main memory be (E201F)16 and (E2020)16. What are the
tag and cache line address (in hex) for main memory address (E201F)16?
A. E, 201
B. F, 201
C. E, E20
D. 2, 01F
Answer ☟
1.2.41 Cache Memory: GATE CSE 2016 Set 2 | Question: 32 top☝ ☛ https://gateoverflow.in/39622
The width of the physical address on a machine is 40 bits. The width of the tag field in a 512 KB 8-way set associative cache
is ________ bits.
Answer ☟
1.2.42 Cache Memory: GATE CSE 2016 Set 2 | Question: 50 top☝ ☛ https://gateoverflow.in/39592
A file system uses an in-memory cache to cache disk blocks. The miss rate of the cache is shown in the figure. The latency to
read a block from the cache is 1 ms and to read a block from the disk is 10 ms. Assume that the cost of checking whether a block
exists in the cache is negligible. Available cache sizes are in multiples of 10 MB.
The smallest cache size required to ensure an average read latency of less than 6 ms is _________ MB.
Answer ☟
1.2.43 Cache Memory: GATE CSE 2017 Set 1 | Question: 25 top☝ ☛ https://gateoverflow.in/118305
Consider a two-level cache hierarchy with L1 and L2 caches. An application incurs 1.4 memory accesses per instruction on
average. For this application, the miss rate of L1 cache is 0.1 ; the L2 cache experiences, on average, 7 misses per 1000
instructions. The miss rate of L2 expressed correct to two decimal places is ________.
Answer ☟
1.2.44 Cache Memory: GATE CSE 2017 Set 1 | Question: 54 top☝ ☛ https://gateoverflow.in/118748
A cache memory unit with capacity of N words and block size of B words is to be designed. If it is designed as a direct
mapped cache, the length of the TAG field is 10 bits. If the cache unit is now designed as a 16 − way set-associative cache, the
length of the TAG field is ____________ bits.
1.2.45 Cache Memory: GATE CSE 2017 Set 2 | Question: 29 top☝ ☛ https://gateoverflow.in/118371
In a two-level cache system, the access times of L1 and L2 caches are 1 and 8 clock cycles, respectively. The miss penalty
from the L2 cache to main memory is 18 clock cycles. The miss rate of L1 cache is twice that of L2 . The average memory access
time (AMAT) of this cache system is 2 cycles. The miss rates of L1 and L2 respectively are
A. 0.111 and 0.056
B. 0.056 and 0.111
C. 0.0892 and 0.1784
D. 0.1784 and 0.0892
Answer ☟
1.2.46 Cache Memory: GATE CSE 2017 Set 2 | Question: 45 top☝ ☛ https://gateoverflow.in/118597
The read access times and the hit ratios for different caches in a memory hierarchy are as given below:
The read access time of main memory is 90 nanoseconds. Assume that the caches use the referred-word-first read policy and the
write-back policy. Assume that all the caches are direct mapped caches. Assume that the dirty bit is always 0 for all the blocks in the
caches. In execution of a program, 60% of memory reads are for instruction fetch and 40% are for memory operand fetch. The
average read access time in nanoseconds (up to 2 decimal places) is _________
Answer ☟
1.2.47 Cache Memory: GATE CSE 2017 Set 2 | Question: 53 top☝ ☛ https://gateoverflow.in/118613
Consider a machine with a byte addressable main memory of 2^32 bytes divided into blocks of size 32 bytes. Assume that a
direct mapped cache having 512 cache lines is used with this machine. The size of the tag field in bits is _______
Answer ☟
The size of the physical address space of a processor is 2^P bytes. The word length is 2^W bytes. The capacity of cache memory
is 2^N bytes. The size of each cache block is 2^M words. For a K-way set-associative cache memory, the length (in number of bits) of
the tag field is
A. P − N − log2 K
B. P − N + log2 K
C. P − N − M − W − log2 K
D. P − N − M − W + log2 K
Answer ☟
A certain processor uses a fully associative cache of size 16 kB. The cache block size is 16 bytes. Assume that the main
memory is byte addressable and uses a 32-bit address. How many bits are required for the Tag and the Index fields respectively in
the addresses generated by the processor?
Answer ☟
A certain processor deploys a single-level cache. The cache block size is 8 words and the word size is 4 bytes. The memory
system uses a 60-MHz clock. To service a cache miss, the memory controller first takes 1 cycle to accept the starting address of the
block, it then takes 3 cycles to fetch all the eight words of the block, and finally transmits the words of the requested block at the rate
of 1 word per cycle. The maximum bandwidth for the memory system when the program running on the processor issues a series of
read operations is ______×106 bytes/sec
Answer ☟
A direct mapped cache memory of 1 MB has a block size of 256 bytes. The cache has an access time of 3 ns and a hit rate of
94%. During a cache miss, it takes 20 ns to bring the first word of a block from the main memory, while each subsequent word
takes 5 ns. The word size is 64 bits. The average memory access time in ns (round off to 1 decimal place) is______.
Answer ☟
A computer system with a word length of 32 bits has a 16 MB byte- addressable main memory and a 64 KB, 4-way set
associative cache memory with a block size of 256 bytes. Consider the following four physical addresses represented in
hexadecimal notation.
A1 = 0x42C8A4 ,
A2 = 0x546888 ,
A3 = 0x6A289C ,
A4 = 0x5E4880
Answer ☟
1.2.53 Cache Memory: GATE CSE 2021 Set 1 | Question: 22 top☝ ☛ https://gateoverflow.in/357429
Consider a computer system with a byte-addressable primary memory of size 2^32 bytes. Assume the computer system has a
direct-mapped cache of size 32 KB (1 KB = 2^10 bytes), and each cache block is of size 64 bytes.
Answer ☟
Consider a set-associative cache of size 2 KB (1 KB = 2^10 bytes) with cache block size of 64 bytes. Assume that the cache is
byte-addressable and a 32-bit address is used for accessing the cache. If the width of the tag field is 22 bits, the associativity of the
cache is _________
Answer ☟
1.2.55 Cache Memory: GATE CSE 2021 Set 2 | Question: 27 top☝ ☛ https://gateoverflow.in/357513
Assume a two-level inclusive cache hierarchy, L1 and L2, where L2 is the larger of the two. Consider the following
statements.
S1 : Read misses in a write through L1 cache do not result in writebacks of dirty lines to the L2
S2 : Write allocate policy must be used in conjunction with write through caches and no-write allocate policy is used with
writeback caches.
Answer ☟
1.2.56 Cache Memory: GATE IT 2004 | Question: 12, ISRO2016-77 top☝ ☛ https://gateoverflow.in/3653
Consider a system with 2 level cache. Access times of Level 1 cache, Level 2 cache and main memory are 1 ns, 10 ns, and
500 ns respectively. The hit rates of Level 1 and Level 2 caches are 0.8 and 0.9 , respectively. What is the average access time of
the system ignoring the search time within the cache?
A. 13.0
B. 12.8
C. 12.6
D. 12.4
Answer ☟
Consider a fully associative cache with 8 cache blocks (numbered 0 − 7 ) and the following sequence of memory block
requests:
4, 3, 25, 8, 19, 6, 25, 8, 16, 35, 45, 22, 8, 3, 16, 25, 7
If LRU replacement policy is used, which cache block will have memory block 7?
A. 4
B. 5
C. 6
D. 7
Answer ☟
Consider a 2-way set associative cache memory with 4 sets and total 8 cache blocks (0 − 7) and a main memory with 128
blocks (0 − 127) . What memory blocks will be present in the cache after the following sequence of memory block references if
LRU policy is used for cache block replacement. Assuming that initially the cache did not have any memory block from the current
Answer ☟
A cache line is 64 bytes. The main memory has latency 32 ns and bandwidth 1 GBytes/s . The time required to fetch the
entire cache line from the main memory is:
A. 32 ns
B. 64 ns
C. 96 ns
D. 128 ns
Answer ☟
A computer system has a level-1 instruction cache (I-cache), a level-1 data cache (D-cache) and a level-2 cache (L2-cache)
with the following specifications:
The length of the physical address of a word in the main memory is 30 bits. The capacity of the tag memory in the I-cache,
D-cache and L2-cache is, respectively,
A. 1 K x 18-bit, 1 K x 19-bit, 4 K x 16-bit
B. 1 K x 16-bit, 1 K x 19-bit, 4 K x 18-bit
C. 1 K x 16-bit, 512 x 18-bit, 1 K x 16-bit
D. 1 K x 18-bit, 512 x 18-bit, 1 K x 18-bit
Answer ☟
Consider a Direct Mapped Cache with 8 cache blocks (numbered 0 − 7 ). If the memory block requests are in the following
order
3, 5, 2, 8, 0, 63, 9, 16, 20, 17, 25, 18, 30, 24, 2, 63, 5, 82, 17, 24.
Which of the following memory blocks will not be in the cache at the end of the sequence ?
A. 3
B. 18
C. 20
D. 30
Answer ☟
Consider a computer with a 4-way set-associative mapped cache of the following characteristics: a total of 1 MB of main
memory, a word size of 1 byte, a block size of 128 words and a cache size of 8 KB.
The number of bits in the TAG, SET and WORD fields, respectively are:
A. 7, 6, 7
B. 8, 5, 7
C. 8, 6, 6
D. 9, 4, 7
Answer ☟
Consider a computer with a 4-way set-associative mapped cache of the following characteristics: a total of 1 MB of main
memory, a word size of 1 byte, a block size of 128 words and a cache size of 8 KB.
While accessing the memory location 0C795H by the CPU, the contents of the TAG field of the corresponding cache line is:
A. 000011000
B. 110001111
C. 00011000
D. 110010101
Answer ☟
A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time
or energy) to access data from the main memory.
A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main
memory locations. Most CPUs have different independent caches, including instruction and data caches, where the data cache is
usually organized as a hierarchy of more cache levels.
Source :- https://en.wikipedia.org/wiki/CPU_cache
References
All the blocks above and below the loop region can be assumed to be cache misses.
It is given that the cache is fully associative but the replacement policy is not mentioned. Let's assume it is FIFO. Also, for
simplicity let's assume that there's no reuse within a cache block.
In the first loop access every cache access will be a miss. Since the loop body size is 1028 and we assumed FIFO, the last cache
block will replace the first one. In the second iteration, first cache block access will be a miss and this will replace the second
cache block. Going like this every cache block access will be a miss.
(ii) By just reducing the loop body size by 4 bytes, all the loop accesses from second iteration of the loop will be a hit. So, we can
get a hit ratio of (99 × 1024)/(101 × 1024) ≈ 98.02%
0 votes -- gatecse (63.3k points)
For main memory, there are 2^14 blocks and each block size is 2^8 bytes (a byte is an eight-bit word).
1. Size of main memory = 2^14 × 2^8 = 4 MB (22 bits are required for addressing the main memory).
2. For the WORD field, we require 8 bits, as each block contains 2^8 words.
As there are 4 blocks in 1 set, 32 sets will be needed for 128 blocks. Thus the SET field requires 5 bits.
Then, the TAG field requires 22 − (5 + 8) = 9 bits.
1.2.4 Cache Memory: GATE CSE 1992 | Question: 5-a top☝ ☛ https://gateoverflow.in/584
Average memory access time = Time spent for read + Time spent for write
= Read time when cache hit + Read time when cache miss + Write time when cache hit + Write time when cache miss
= 36 + 44 + 90 + 10 = 180 ns.
Reference: http://www.howardhuang.us/teaching/cs232/24-Cache-writes-and-examples.pdf
References
We are given the probability of access being a hit in each level (clear since their sum adds to 1).
So, we can get the average access time as:
tA = 0.99 × 10^−6 + 0.00998 × (10^−6 + 10^−5 + 0.001)
+ 0.00002 × (10^−6 + 10^−5 + 10^−4 + 0.1 + 0.001)
≈ (0.99 + 10 + 2) × 10^−6 s ≈ 13 μs.
We can also use the following formula: for 100% of accesses M1 is accessed,
whenever M1 is a miss, M2 is accessed and when both miss, only M3 is accessed.
So, average memory access time,
tA = 10^−6 + (1 − 0.99) × (10^−5 + 0.001) + 0.00002 × (10^−4 + 0.1) = (1 + 10.01 + 2) μs ≈ 13.01 μs.
1.2.6 Cache Memory: GATE CSE 1995 | Question: 1.6 top☝ ☛ https://gateoverflow.in/2593
Answer is (D).
Locality of reference is actually the frequent accessing of any storage location or some value. We can say in simple language that
whatever things are used more frequently, they are stored in the locality of reference. So we have cache memory for the purpose.
1.2.7 Cache Memory: GATE CSE 1995 | Question: 2.25 top☝ ☛ https://gateoverflow.in/2638
Number of sets = 4K/(64 × 4) = 16
The equation for access time can be written as follows (assuming a, b are the hit ratios of level 1 and level 2
respectively).
T = T1 + (1 − a)T2 + (1 − a) × (1 − b)T3
Here T ≤ 100, T1 = 50 ns, T2 = 200 ns and T3 = 5000 ns. On substituting a, b for the candidate sizes we get
T = 95 ns for a = 0.8 and b = 0.995, i.e., L1 = 8M and L2 = 64M.
T = 80 ns for a = 0.9 and b = 0.98, i.e., L1 = 16M and L2 = 4M.
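A quick check of the substitutions above (a sketch; T1, T2, T3 taken from the question):
#include <stdio.h>

/* T = T1 + (1-a)T2 + (1-a)(1-b)T3 with T1 = 50 ns, T2 = 200 ns, T3 = 5000 ns */
static double access_time(double a, double b)
{
    const double T1 = 50.0, T2 = 200.0, T3 = 5000.0;
    return T1 + (1.0 - a) * T2 + (1.0 - a) * (1.0 - b) * T3;
}

int main(void)
{
    printf("a = 0.8, b = 0.995 -> %.0f ns\n", access_time(0.8, 0.995));  /* 95 */
    printf("a = 0.9, b = 0.98  -> %.0f ns\n", access_time(0.9, 0.98));   /* 80 */
    return 0;
}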
B.
Here, size of the loop is smaller than size of cache as m ≤ l . So we are guaranteed that the entire loop is in cache without any
replacement. (Here we assumed that the memory size of n × b used by the loop is contiguous, or else we could have a scenario
where the entire loop size gets mapped to the same set causing cache replacement).
In the first iteration of the loop:
No. of accesses = n × b
No. of misses = n, as each new block access is a miss and the loop body has n blocks each of size b for a total size of n × b.
In each of the remaining 99 iterations:
No. of accesses = n × b
No. of misses = 0
Total no. of hits = 100nb − n
So, hit ratio = (100nb − n)/(100nb) = 1 − 1/(100b)
The hit ratio is independent of l, so for l = 1 also we have hit ratio = 1 − 1/(100b)
1.2.10 Cache Memory: GATE CSE 1999 | Question: 1.22 top☝ ☛ https://gateoverflow.in/1475
Number of cache blocks = 2c
Number of sets in cache = 2c/2 = c, since each set has 2 blocks. Now, a block of main memory gets mapped to a set
(associativity of 2 just means there is space for 2 memory blocks in a cache set), and we have 2cm blocks being mapped to c
sets. So, in each set 2m different main memory blocks can come, and block k of main memory will be mapped to set (k mod c).
Correct Answer: B.
39 votes -- Arjun Suresh (332k points)
1.2.11 Cache Memory: GATE CSE 2001 | Question: 1.7, ISRO2008-18 top☝ ☛ https://gateoverflow.in/700
Exploiting the spatial locality of reference in a program: if a nearby location is addressed soon after, it will already be in
the cache.
Consider a scenario similar to cooking: when an ingredient is taken from the cupboard, you also take the nearby ingredients
along with it, hoping that they will be needed in the near future.
Correct Answer: B
84 votes -- Arjun Suresh (332k points)
What is the number of sets in the cache?
Number of sets = cache memory size/(set associativity × cache block size) = 256 KB/(4 × 16 B) = 4096
What is the size (in bits) of the tag field per cache block?
Memory address size = 32 bits
Number of bits required to identify a particular set = 12 (number of sets = 4096)
Number of bits required to identify a particular location in a cache line = 4 (cache block size = 16 B)
Size of tag field = 32 − 12 − 4 = 16 bits
What is the number and size of comparators required for tag matching?
We use a 4-way set associative cache. So, we need 4 comparators, each of size 16 bits.
http://ecee.colorado.edu/~ecen2120/Manual/caches/cache.html
How many address bits are required to find the byte offset within a cache block?
Cache block size is 16 bytes, so 4 bits are required to find the byte offset within a cache block.
What is the total amount of extra memory (in bytes) required for the tag bits?
Size of tag = 16 bits
Number of sets = 4096
Set associativity = 4
Extra memory required to store the tag bits = 16 × 4096 × 4 bits = 2^18 bits = 2^15 bytes.
References
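The same arithmetic generalizes. As a sketch (the parameter values below reproduce the example above; the helper lg2 assumes power-of-2 inputs):
#include <stdio.h>

static unsigned lg2(unsigned long long x)      /* log2 for powers of 2 */
{
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void)
{
    unsigned long long cache_bytes = 256 * 1024;   /* 256 KB cache     */
    unsigned block = 16, ways = 4, addr_bits = 32; /* bytes, -, bits   */

    unsigned long long sets = cache_bytes / (ways * block);
    unsigned off = lg2(block), idx = lg2(sets);
    unsigned tag = addr_bits - idx - off;

    printf("sets = %llu, offset = %u, index = %u, tag = %u\n",
           sets, off, idx, tag);                        /* 4096, 4, 12, 16 */
    printf("tag memory = %llu bits\n", tag * sets * ways);  /* 2^18 bits  */
    return 0;
}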
A. Data cache size = 8 KB.
Since each array element occupies 4 B, four consecutive array elements occupy a cache block (elements are aligned, as the
starting address is 0).
Number of cache blocks = 8 KB/16 B = 512.
Number of cache blocks needed for the array = 2048/4 = 512.
So, every array element has its own cache block and there is no collision; no element of the array conflicts with A[0] in the data cache.
Here, the last 4 bits are used as OFFSET bits and the next 9 bits are used as SET bits. Since the array (2048 × 4 B = 8 KB) does not
extend beyond these 13 bits, all cache accesses are to different sets.
B. In reverse order too, each block of 4 consecutive elements is fetched exactly once, on the first access to it, so the number of data
cache misses is 2048/4 = 512.
We have 4 blocks and 2 blocks in a set, so there are 2 sets.
Since the lowest bit of the block address is used for indexing into the set, 8, 12 and 0 all map to the same set: they first miss in the
cache, with 0 replacing 8 (there are two slots in each set due to the 2-way set associativity); then 12 hits in the cache and 8 again
misses. So, there are 4 misses in total.
Correct Answer: C
25 votes -- Arjun Suresh (332k points)
Cache size is 32 KB and cache block size is 32 B. So,
Number of sets = cache size/(no. of blocks in a set × block size)
= 32 KB/(2 × 32 B) = 512
So, number of index bits needed = 9 (since 2^9 = 512). Number of offset bits = 5 (since 2^5 = 32 B is the block size and
assuming byte addressing). So, number of tag bits = 32 − 9 − 5 = 18 (as the memory address is of 32 bits).
So, hit latency h1
= Time to compare the tag bits + Time to select the block in the set = 18/10 + 0.6 ns = 2.4 ns.
(Two comparisons of tag bits need to be done for each block in a set, but they can be carried out in parallel and the matching
one multiplexed as the output.)
Reference: https://courses.cs.washington.edu/courses/cse378/09au/lectures/cse378au09-19.pdf
Correct Answer: A
References
Number of sets = cache size/(no. of blocks in a set × block size)
= 32 KB/(1 × 32 B) = 1024
So, number of index bits = 10 and number of offset bits = 5, giving number of tag bits = 32 − 10 − 5 = 17.
So, h2 = 17/10 = 1.7 ns
Correct Answer: D
20 votes -- Arjun Suresh (332k points)
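A one-line verification of both hit latencies (a sketch; tag widths as derived above):
#include <stdio.h>

int main(void)
{
    double h1 = 18 / 10.0 + 0.6;   /* 2-way: 18-bit comparator + 2-to-1 mux */
    double h2 = 17 / 10.0;         /* direct mapped: 17-bit comparator only */
    printf("h1 = %.1f ns, h2 = %.1f ns\n", h1, h2);   /* 2.4 and 1.7 */
    return 0;
}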
Code being C implies array layout is row-major.
http://en.wikipedia.org/wiki/Row-major_order
When A[0][0] is fetched, 128 consecutive bytes are moved to the cache. So, for the next 128/8 − 1 = 15 memory references there
won't be a cache miss.
For the next iteration of the i loop also the same thing happens, as there is no temporal locality in the code. So, the number of cache
misses for P1 is
= (512/16) × 512
= 32 × 512
= 2^14 = 16384
Correct Answer: C
References
Number of cache lines = 2^15 B/128 B = 256
Elements in 1 cache line = 128 B/8 B = 16
M1 = total elements in array/elements in a cache line = (512 × 512)/16 = 2^14 = 16384
M2 = 512 × 512 = 2^18
M1/M2 = 16384/(512 × 512) = 2^(14−18) = 2^−4 = 1/16
It is so because for P1, for every line there is a miss, and once a miss is processed we get 16 elements in the cache.
So, another miss happens only after 16 elements.
For P2, for every element there is a miss, because storage is in row-major order (by default in C) and we are accessing column-wise.
Hence, the answer is option B.
Number of cache misses for P1, M1 = (512 × 512)/16 = 32 × 512 = 2^14 = 16384.
In the case of P2, the memory references are not consecutive. After A[0][0], the next access is A[1][0], which is 512 × 8
memory locations away. Since our cache block can hold only 128 contiguous memory locations, A[1][0] won't be in the cache after
A[0][0] is accessed. Now, the next location after A[0][0] is A[0][1], which will be accessed only after 512 iterations of the inner loop,
i.e., after 512 distinct memory block accesses. In our cache we have space for only 32 KB/128 B = 256 memory blocks. So, by the
time A[0][1] is accessed, its cache block would have been replaced. So, each of the memory accesses in P2 results in a cache miss. Total
number of cache misses, M2 = 512 × 512.
So, M1/M2 = (32 × 512)/(512 × 512) = 1/16
References
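Both miss counts can be reproduced mechanically. Below is a sketch that simulates the 32 KB direct-mapped cache (128 B blocks, hence 256 lines) over the access patterns of P1 and P2:
#include <stdio.h>

#define LINES 256
#define BLOCK 128

static long run(int rowmajor)
{
    long tags[LINES];
    int valid[LINES];
    long misses = 0;
    for (int l = 0; l < LINES; l++) { tags[l] = -1; valid[l] = 0; }

    for (int i = 0; i < 512; i++)
        for (int j = 0; j < 512; j++) {
            long addr = rowmajor ? ((long)i * 512 + j) * 8    /* A[i][j] */
                                 : ((long)j * 512 + i) * 8;   /* A[j][i] */
            long blk = addr / BLOCK;
            int line = (int)(blk % LINES);
            if (!valid[line] || tags[line] != blk) {          /* miss    */
                misses++;
                valid[line] = 1;
                tags[line] = blk;
            }
        }
    return misses;
}

int main(void)
{
    long m1 = run(1), m2 = run(0);
    printf("M1 = %ld, M2 = %ld, M1/M2 = %g\n", m1, m2, (double)m1 / m2);
    return 0;
}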
Number of sets = cache size/(size of a block × no. of blocks in a set)
= (128 × 64 words)/(64 words × 4) (4-way set associative means 4 blocks in a set)
= 32.
So, number of index (LINE) bits = 5 and number of WORD bits = 6, since the cache block (line) size is 64 words.
So, number of TAG bits = 20 − 6 − 5 = 9.
Bits used to represent the address = log2(2^16) = 16
Each cache line size = 64 bytes, which means the offset requires 6 bits
Total number of lines in cache = 32, which means the line number requires 5 bits
So, tag bits = 16 − 6 − 5 = 5
We have a 2D array, each of its elements of size = 1 byte;
Total size of this array = 50 × 50 × 1 byte = 2500 bytes
So, the total number of cache lines the array needs = 2500 B/64 B = 39.0625 ≈ 40
Starting address of array = 1100H = 00010 00100 000000 (tag | line | offset)
The group of bits in the middle represents the cache line number ⟹ the array starts from cache line number 4.
We require 40 cache lines to hold all array elements, but we have only 32 cache lines.
Let's group/partition our 2500 array elements into those 40 array lines; we call the first array line A0, which will have 64 of the
elements.
This line (group of 64 elements) of the array will be mapped to cache line number 4, as found by the analysis of the starting address
of the array above.
This all means that among those 40 array lines some array lines will be mapped to the same cache line, because there are just 32
cache lines but 40 array lines.
This is how the mapping is (cache line: array lines):
0: A28
1: A29
2: A30
3: A31
4: A0, A32
5: A1, A33
6: A2, A34
7: A3, A35
8: A4, A36
9: A5, A37
10: A6, A38
11: A7, A39
12: A8
⋮
30: A26
31: A27
So, if we access the complete array twice, we get 32 + 8 + 8 + 8 = 56 misses: in the first pass all 40 array lines miss (32 + 8), and in
the second pass only the 16 array lines that share cache lines 4 to 11 (A0–A7 and A32–A39) miss again (8 + 8); the rest are hits (not
counted).
Hence, answer is option (C).
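The 56-miss count can be confirmed with a small simulation (a sketch of the direct-mapped cache above: 32 lines × 64 B, a 2500-byte array starting at 1100H, swept twice):
#include <stdio.h>

int main(void)
{
    int tag[32], valid[32] = {0};
    long misses = 0;

    for (int pass = 0; pass < 2; pass++)
        for (int e = 0; e < 50 * 50; e++) {
            unsigned addr = 0x1100 + e;         /* byte address   */
            unsigned line = (addr / 64) % 32;   /* 5 index bits   */
            unsigned t = addr / (64 * 32);      /* 5 tag bits     */
            if (!valid[line] || tag[line] != (int)t) {
                misses++;
                valid[line] = 1;
                tag[line] = (int)t;
            }
        }
    printf("total misses = %ld\n", misses);     /* 56 */
    return 0;
}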
2^16 = 64 KB main memory is mapped to 32 lines of 64 bytes. So, number of offset bits = 6 (to identify a byte in a line) and
number of indexing bits = 5 (to identify the line).
Size of array = 50 × 50 = 2500 B. If the array is stored in row-major order (first row, second row, ...), and if elements are also accessed
in row-major order (or stored and accessed in column-major order), there will be only 1 cache miss for every 64 accesses, and
for 2500 accesses, there will be ⌈2500/64⌉ = 40 cache misses during the first iteration.
We have 5 index bits and 6 offset bits. So, for 2^11 (5 + 6 = 11) continuous memory addresses there won't be any cache conflicts,
as the least significant bits are offset bits followed by index bits.
So, the number of array elements that can be accessed without cache conflict = 2048 (as element size is a byte). The next 452
elements conflict with the first 452 elements. This means ⌈452/64⌉ = 8 cache blocks are replaced. (We used ceil, as even if one
element from a new block is accessed, the whole block must be fetched.)
Cache Organization:
We need to find the starting block = 4352 B/64 B = 68th block in main memory, from where the array starts storing elements.
Array size = 50 × 50 B = 2500 B, so the number of blocks needed = 2500 B/64 B = 39.0625 ≈ 40 blocks.
Starting block is 68 mod 32 = 4th cache block, and after that the blocks will be accessed in sequence.
As shown in the table above, cache line numbers 4 to 11 get replaced by the array during the second access.
Statement I is not correct, as the data need not be exactly the same at the same point of time, and so a write-back policy can be
used here.
Number of sets = cache size/size of a set = 64 KB/(16 B × 2) (two blocks per set) = 2 K = 2^11
So, we need 11 bits for set indexing.
Number of WORD bits required = 4, as a cache block consists of 16 bytes and we need 4 bits to address each of them.
ARR[0][0] is located at virtual address 0xFF000 000 (FF000 is the page address and 000 is the page offset).
So, its index bits are 00000000000.
Address of ARR[0][5] = 0xFF000 000 + 5 × sizeof(double) = 0xFF000 000 + 5 × 8 = 0xFF000 028 (40 = 28 in hex)
(index bits differ)
Address of ARR[5][0] = 0xFF000 000 + 5 × 1024 × sizeof(double) [since we use row-major storage]
= 0xFF000 000 + 5120 × 8 = 0xFF000 000 + 0xA000 = 0xFF00A 000 (index bits differ)
Address of ARR[4][0] = 0xFF000 000 + 4 × 1024 × sizeof(double) = 0xFF000 000 + 0x8000 = 0xFF008 000 (index bits match)
So, only ARR[4][0] and ARR[0][0] have the same cache index.
The inner loop is iterating from 0 to 1023, so consecutive memory locations are accessed in sequence. Since the cache block size is
only 16 bytes and our element, being a double, is of size 8 bytes, during a memory access only the next element gets filled in the
cache; i.e., every alternate memory access is a cache miss, giving a hit ratio of 50%. (If loops i and j are reversed, all accesses will
be misses and the hit ratio will become 0.)
Number of sets = cache size/size of a set = 64 KB/(16 B × 2) (two blocks per set) = 2 K = 2^11
So, we need 11 bits for set indexing.
Number of WORD bits required = 4, as a cache block consists of 16 bytes and we need 4 bits to address each of them.
Number of tag bits = 32 − 11 − 4 = 17.
Total size of the tags = 17 bits × 2^11 sets × 2 blocks per set = 17 × 2^12 bits
= 68 Kbits
We use the top 17 bits for the tag, the next 11 bits for indexing and the next 4 for the offset. So, for two addresses to have the same
cache index, their 11 address bits after the 4 offset bits from the right must be the same.
ARR[0][0] is located at virtual address 0xFF000 000 (FF000 is the page address and 000 is the page offset). So, its index bits are
00000000000.
Address of ARR[0][4] = 0xFF000 000 + 4 × sizeof(double) = 0xFF000 000 + 4 × 8 = 0xFF000 020 (32 = 20 in hex)
(index bits differ)
Address of ARR[4][0] = 0xFF000 000 + 4 × 1024 × sizeof(double) [since we use row-major storage]
= 0xFF000 000 + 4096 × 8 = 0xFF000 000 + 0x8000 = 0xFF008 000 (index bits match those of ARR[0][0], as both
read 000 0000 0000)
Address of ARR[0][5] = 0xFF000 000 + 5 × sizeof(double) = 0xFF000 000 + 5 × 8 = 0xFF000 028 (40 = 28 in hex)
(index bits differ)
Address of ARR[5][0] = 0xFF000 000 + 5 × 1024 × sizeof(double) [since we use row-major storage]
= 0xFF000 000 + 5120 × 8 = 0xFF000 000 + 0xA000 = 0xFF00A 000 (index bits differ)
So, only ARR[4][0] and ARR[0][0] have the same cache index.
The inner loop is iterating from 0 to 1023, so consecutive memory locations are accessed in sequence. Since the cache block size is
only 16 bytes and our element, being a double, is of size 8 bytes, during a memory access only the next element gets filled in the
cache; i.e., every alternate memory access is a cache miss, giving a hit ratio of 50%. (If loops i and j are reversed, all accesses
will be misses and the hit ratio will become 0.)
Correct Answer: B
53 votes -- Arjun Suresh (332k points)
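The index comparison can also be checked directly (a sketch; the set index is bits 4..14 of the address, and 0xFF000000 stands for the address of ARR[0][0]):
#include <stdio.h>

int main(void)
{
    unsigned base = 0xFF000000u;               /* ARR[0][0]            */
    unsigned addrs[] = {
        base,                                  /* ARR[0][0]            */
        base + 4 * 8,                          /* ARR[0][4]            */
        base + 4 * 1024 * 8,                   /* ARR[4][0]            */
        base + 5 * 8,                          /* ARR[0][5]            */
        base + 5 * 1024 * 8                    /* ARR[5][0]            */
    };
    const char *name[] = {"ARR[0][0]", "ARR[0][4]", "ARR[4][0]",
                          "ARR[0][5]", "ARR[5][0]"};
    for (int i = 0; i < 5; i++)     /* 11 index bits after 4 offset bits */
        printf("%-9s index = %u\n", name[i], (addrs[i] >> 4) & 0x7FF);
    return 0;
}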
Block size = 16 B and one element = 8 B.
So, in one block 2 elements will be stored.
For the 1024 × 1024 elements, the number of blocks required = (1024 × 1024)/2 = 2^19 blocks.
In one block the first element will be a miss and the second one a hit (since we are transferring two elements at a time).
⇒ hit ratio = Total hits/Total references = 2^19/2^20 = 1/2 = 0.5
Correct Answer: C
26 votes -- asutosh kumar Biswal (8k points)
16 blocks and sets with 4 blocks each means there are 4 sets. So, the lower 2 bits of the block address are used
for selecting a set, and 4-way associative means a set can hold only the 4 most recently used blocks that map to it.
0, 255, 1, 4, 3, 8, 133, 159, 216, 129, 63, 8, 48, 32, 73, 92, 155
mod 4 gives,
0, 3, 1, 0, 3, 0, 1, 3, 0, 1, 3, 0, 0, 0, 1, 0, 3
Set 0 receives blocks 0, 4, 8, 216, 8 (hit), 48, 32, 92 in that order; with 4 ways and LRU, 92 evicts 216. So, 216 will not be in the cache.
Correct Answer: D
60 votes -- Arjun Suresh (332k points)
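For completeness, a sketch that replays the sequence through a 4-set × 4-way LRU cache and prints the final contents (216 ends up evicted):
#include <stdio.h>

#define SETS 4
#define WAYS 4

int main(void)
{
    int block[SETS][WAYS], age[SETS][WAYS];
    int reqs[] = {0, 255, 1, 4, 3, 8, 133, 159, 216, 129, 63, 8,
                  48, 32, 73, 92, 155};
    int n = sizeof reqs / sizeof reqs[0];

    for (int s = 0; s < SETS; s++)
        for (int w = 0; w < WAYS; w++) { block[s][w] = -1; age[s][w] = -1; }

    for (int t = 0; t < n; t++) {
        int s = reqs[t] % SETS, victim = 0;
        for (int w = 0; w < WAYS; w++)              /* hit: refresh age */
            if (block[s][w] == reqs[t]) { victim = -1; age[s][w] = t; break; }
        if (victim == -1) continue;
        for (int w = 1; w < WAYS; w++)              /* miss: pick LRU   */
            if (age[s][w] < age[s][victim]) victim = w;
        block[s][victim] = reqs[t];
        age[s][victim] = t;
    }
    for (int s = 0; s < SETS; s++) {                /* final contents   */
        printf("set %d:", s);
        for (int w = 0; w < WAYS; w++) printf(" %d", block[s][w]);
        printf("\n");
    }
    return 0;
}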
Ideally the answer should be 20 ns, as that is the time to transfer a block from L2 to L1, and only this time is asked in the
question. But there is confusion regarding the access time of L2, as this usually means the time to read data from L2 up to the CPU,
while here we need the time till L1 only. So, I assume the following is what is meant by the question.
A block is transferred from L2 to L1. And L1 block size being 4 words (since L1 is requesting we need to consider L1 block
size and not L2 block size) and data width being 4 bytes, it requires one L2 access (for read) and one L1 access (for store). So,
time = 20 + 2 = 22 ns.
Correct Answer: C
106 votes -- Arjun Suresh (332k points)
The transfer time should be 4 ∗ 200 + 20 = 820 ns. But this is not in option. So, I assume the following is what is meant
by the question.
L2 block size being 16 words and data width between memory and L2 being 4 words, we require 4 memory accesses(for read)
and 4 L2 accesses (for store). Now, we need to send the requested block to L1 which would require one more L2 access (for
read) and one L1 access (for store). So, total time
= 880 + 22
= 902 ns
Correct Answer: C
119 votes -- Arjun Suresh (332k points)
Number of cache blocks = cache size/size of a block = 8 KB/32 B = 256
So, we need 8 bits for indexing the 256 blocks of the cache. And since a block is 32 bytes, we need 5 WORD bits to address
each byte. So, the remaining 19 bits (32 − 8 − 5) are the tag bits.
Adding the 1 valid bit and 1 modified bit, the meta-data per block is 19 + 1 + 1 = 21 bits, so the total memory needed at the
cache controller is 256 × 21 = 5376 bits.
Correct Answer: D
46 votes -- Arjun Suresh (332k points)
Total cache size = 256 KB
Cache block size = 32 Bytes
So, number of cache entries = 256 K/32 = 8 K
Number of sets in cache = 8 K/4 = 2 K, as the cache is 4-way associative.
So, log2(2048) = 11 bits are needed for accessing a set. Inside a set we need to identify the cache entry.
No. of memory blocks possible = Memory size/Cache block size = 2^32/32 = 2^27.
No. of memory blocks that can map to a given set = 2^27/2^11 = 2^16.
So, we need 16 tag bits along with each cache entry to identify which of the possible 2^16 blocks is being mapped there.
Correct Answer: C
41 votes -- Arjun Suresh (332k points)
Total cache size = 256 KB
Cache block size = 32 Bytes
So, number of cache entries = 256 K/32 = 8 K
Number of sets in cache = 8 K/4 = 2 K, as the cache is 4-way associative.
So, log2(2048) = 11 bits are needed for accessing a set. Inside a set we need to identify the cache entry.
Total number of distinct cache entries = Memory size/cache entry size = 2^32/32 = 2^27
Out of these 2^27, each set will be getting only 2^27/2^11 = 2^16 possible distinct cache entries, as we use 11 bits to identify a
set. So, we need 16 bits to identify a cache entry in a set, which is the number of bits in the tag field.
Size of cache tag directory = Size of tag entry × Number of tag entries
= (16 + (2 + 1 + 1)) bits (2 valid, 1 modified, 1 replacement, as given in the question) × 8 K
= 20 bits × 8 K = 160 Kbits
Valid bit: Tells if the memory referenced by the cache entry is valid. Initially, when a process is loaded all entries are invalid.
Only when a page is loaded, its entry becomes valid.
Modified bit: When processor writes to a cache location its modified bit is made 1. This information is used when a cache entry
is replaced- entry 0 means no update to main memory needed. Entry 1 means an update is needed.
Replacement bit: This is needed for the cache replacement policy. Explained in the below link:
https://www.seas.upenn.edu/~cit595/cit595s10/handouts/LRUreplacementpolicy.pdf
Correct Answer: A
References
Number of sets in cache = v .
The question gives a sequencing for the cache lines. For set 0, the cache lines are numbered 0, 1, …, k − 1. For set 1, the cache lines are numbered k, k + 1, …, 2k − 1, and so on.
So, main memory block j will be mapped to set (j mod v), i.e., to any one of the cache lines from (j mod v) × k to (j mod v) × k + (k − 1).
(Associativity plays no role in mapping- k -way associativity means there are k spaces for a block and hence reduces the chances
of replacement.)
Correct Answer: A
80 votes -- Arjun Suresh (332k points)
1.2.34 Cache Memory: GATE CSE 2014 Set 1 | Question: 44 top☝ ☛ https://gateoverflow.in/1922
There are N accesses to cache.
Out of these n are unique block addresses.
Now, we need to find the number of misses. (min. n misses are guaranteed whatever be the access sequence due to n unique
block addresses).
We are given that between two consecutive accesses to the same block, there can be only k unique block addresses. So, for a block to get replaced we can assume that all of the next k block addresses go to the same set (since the cache is set-associative), which is the worst-case scenario (they may also go to a different set, but then there is less chance of a replacement). Now, if the associativity is ≥ k and we use the LRU (Least Recently Used) replacement policy, we can guarantee that these k accesses won't throw out our previously accessed cache entry (for that, at least k other accesses would be needed). So, this means we are at the best-case scenario for cache replacement -- out of N accesses we miss only n (which are unique and cannot be helped from getting missed, and there is no block replacement in the cache). So, the miss ratio is n/N.
PS: In the question it is given "bounded above by k", which should mean k unique block accesses as k is an integer, but to ensure no replacement this must be 'k − 1'. Guess, a mistake in the question.
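The effect of the associativity bound can be seen in a small LRU experiment (a Python sketch under the assumptions above; the trace and names are mine):

def lru_misses(trace, ways):
    cache, misses = [], 0          # most recently used block kept at the end
    for b in trace:
        if b in cache:
            cache.remove(b)        # hit: refresh its LRU position
        else:
            misses += 1
            if len(cache) == ways:
                cache.pop(0)       # evict the least recently used block
        cache.append(b)
    return misses

trace = [1, 2, 3, 1, 2, 3, 1, 2, 3]   # at most 2 unique blocks between repeats
print(lru_misses(trace, ways=3))      # 3: only the compulsory misses
print(lru_misses(trace, ways=2))      # 9: every access misses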
Correct Answer: A
122 votes -- Arjun Suresh (332k points)
1.2.35 Cache Memory: GATE CSE 2014 Set 2 | Question: 43 top☝ ☛ https://gateoverflow.in/2009
A. A smaller block size means that during a memory access only a smaller range of nearby addresses is brought to the cache, so spatial locality is exploited less -- (A) is false.
B. A smaller block size means more number of blocks (assuming cache size constant) and hence index bits go up and offset
bits go down. But the tag bits remain the same.
C. A smaller block size implying larger cache tag is true, but this can't lower cache hit time in any way.
D. A smaller block size incurs a lower cache miss penalty. This is because during a cache miss, an entire cache block is fetched from the next lower level of memory. So, a smaller block size means only a smaller amount of data needs to be fetched, which reduces the miss penalty. (The cache block size can go up to the width of the data bus to the next level of memory; beyond this, increasing the cache block size only increases the cache miss penalty.)
Correct Answer: D
1.2.36 Cache Memory: GATE CSE 2014 Set 2 | Question: 44 top☝ ☛ https://gateoverflow.in/2010
If associativity is doubled, keeping the capacity and block size constant, then the number of sets gets halved. So, width of
set index decoder can surely decrease - (B) is false.
Width of way-selection multiplexer must be increased as we have to double the ways to choose from- (C) is false
As the number of sets gets decreased, the number of possible cache block entries that a set maps to gets increased. So, we need
more tag bits to identify the correct entry. So, (A) is also false.
(D) is the correct answer- main memory data bus has nothing to do with cache associativity- this can be answered without even
looking at other options.
1.2.37 Cache Memory: GATE CSE 2014 Set 2 | Question: 9 top☝ ☛ https://gateoverflow.in/1963
Number of sets = cache size / size of a set = 16 KB / 128 B = 128
Now, we can divide the physical address space equally between these 128 sets.
So, the number of bytes each set can access
= 4 GB / 128 = 32 MB
= 32 M / 4 = 8 M words = 1 M blocks (2^20 blocks, with 8-word blocks)
4
1.2.38 Cache Memory: GATE CSE 2014 Set 3 | Question: 44 top☝ ☛ https://gateoverflow.in/2078
The question asks for the average time taken over 100 instruction fetch operations, 60 memory operand read operations and 40 memory operand write operations (200 operations in total), where 1 ns is the time taken for a read (or fetch) on a cache hit and 5 ns on a miss, 2 ns and 10 ns are the times taken for a write on a cache hit and miss respectively, and the hit ratio is 0.9.
Fetches: 100 × (0.9 × 1 + 0.1 × 5) = 140 ns
Operand reads: 60 × (0.9 × 1 + 0.1 × 5) = 84 ns
Operand writes: 40 × (0.9 × 2 + 0.1 × 10) = 112 ns
Total time = 140 + 84 + 112 = 336 ns
Average memory access time = 336 / 200 = 1.68 ns
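The whole computation in one place (a Python sketch; the 0.9 hit ratio and the 1/5 ns and 2/10 ns hit/miss times are taken from the question as used above):

ops = [(100, 1, 5), (60, 1, 5), (40, 2, 10)]   # (count, hit ns, miss ns)
total = sum(n * (0.9 * h + 0.1 * m) for n, h, m in ops)
print(total, total / 200)                      # 336.0 ns, 1.68 ns per operation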
52 votes -- Divya Bharti (8.8k points)
1.2.39 Cache Memory: GATE CSE 2015 Set 2 | Question: 24 top☝ ☛ https://gateoverflow.in/8119
Answer is: 14 ns = 0.8(5) + 0.2(50)
PS: Here instead of cache and main memory access times, time taken on a cache hit and miss are directly given in question. So,
Average Access Time = Hit Rate × Hit Time + Miss Rate × Miss Time
1.2.40 Cache Memory: GATE CSE 2015 Set 3 | Question: 14 top☝ ☛ https://gateoverflow.in/8410
Block size of 16 bytes means we need 4 offset bits. (The lowest 4 bits of the memory address are offset bits.)
Number of sets in cache (cache lines) = 212 so the next lower 12 bits are used for set indexing.
The top 4 bits (out of 20) are tag bits.
So, answer is A.
1.2.41 Cache Memory: GATE CSE 2016 Set 2 | Question: 32 top☝ ☛ https://gateoverflow.in/39622
Physical address = 40 bits
We have: Cache Size = number of sets × blocks per set × Block size
Second way:
Cache size = 2^19, MM size = 2^40.
This means we need to map 2^40 / 2^19 = 2^21 blocks to one line, and a set contains 2^3 lines.
Therefore, 2^21 × 2^3 = 2^24 blocks are mapped to one set.
Using the tag field, we need to identify which one block out of these 2^24 blocks is present in the set.
Hence, 24 bits are needed in the tag field.
In the question the block size has not been given, so we can assume a block size of 2^x bytes.
Number of sets = 2^(19−x) / 8 = 2^(16−x), so 16 − x bits are needed to index a set.
And since we assumed a block size of 2^x bytes, the number of bits for the block offset is x.
Finally,
T + (16 − x) + x = 40
T + 16 = 40
T = 24.
54 votes -- ajit (2.5k points)
1.2.42 Cache Memory: GATE CSE 2016 Set 2 | Question: 50 top☝ ☛ https://gateoverflow.in/39592
Look-aside cache latency = 1 ms
Main memory latency = 10 ms
Next, take 30 MB
1.2.43 Cache Memory: GATE CSE 2017 Set 1 | Question: 25 top☝ ☛ https://gateoverflow.in/118305
Answer = 0.05.
1.2.44 Cache Memory: GATE CSE 2017 Set 1 | Question: 54 top☝ ☛ https://gateoverflow.in/118748
In the set-associative case, 1 set = 16 lines, so the number of index bits will be 4 fewer than in the direct-mapped case.
So, the tag bits increase (by 4) to 14 bits.
1.2.45 Cache Memory: GATE CSE 2017 Set 2 | Question: 29 top☝ ☛ https://gateoverflow.in/118371
In two-level memory system (hierarchical), it is clear that the second level is accessed only when first level access is a
miss. So, we must include the first level access time in all the memory access calculations. Continuing this way for any level, we
must include that level access time (without worrying about the hit rate in that level), to all memory accesses coming to that level
(i.e., by just considering the miss rate in the previous level). So, for the given question, we can get the following equation:
2 = 1 + x × 8 + 0.5x^2 × 18
⟹ 9x^2 + 8x − 1 = 0
⟹ x = (−8 ± √(64 + 36)) / 18 = 2/18 = 0.111
1.2.46 Cache Memory: GATE CSE 2017 Set 2 | Question: 45 top☝ ☛ https://gateoverflow.in/118597
L2 cache is shared between instruction and data (is it always? see below).
Now, why must L2 be shared? Because otherwise we would have to dedicate it to either instructions or data alone, and it is not logical to use it for only one. Ideally this should have been mentioned in the question, but it can also be safely assumed (not enough merit for marks-to-all). Some more points in the question:
" Assume that the caches use the referred-word-first read policy and the writeback policy
Writeback policy is irrelevant for solving the given question as we do not care for writes. Referred-word-first read policy means
there is no extra time required to get the requested word from the fetched cache line.
"Assume that all the caches are direct mapped caches."
"Assume that the dirty bit is always 0 for all the blocks in the caches."
Dirty bits are for cache replacement- which is not asked in the given question. But this can mean that there is no more delay
when there is a read miss in the cache leading to a possible cache line replacement. (In a write-back cache when a cache line is
replaced, if it is dirty then it must be written back to main memory).
1.2.47 Cache Memory: GATE CSE 2017 Set 2 | Question: 53 top☝ ☛ https://gateoverflow.in/118613
No. of blocks of main memory = 2^32 / 2^5 = 2^27
Number of cache lines = 512 = 2^9
Tag bits tell us how many memory blocks one line in the cache can point to:
1 cache line points to 2^27 / 2^9 = 2^18 blocks
So, 18 bits are required as TAG bits.
48 votes -- Manish Joshi (20.5k points)
Physical address space = 2^P bytes, i.e., P bits in a physical address.
Cache size = 2^N bytes.
Number of tag bits = X, i.e., the tag distinguishes 2^X candidates.
The cache is K-way set associative.
A memory byte maps to a fixed (set, byte-within-block) position in the cache, and there are (number of sets × block size) = 2^N / K such positions; the tag distinguishes the memory bytes competing for each position. So:
2^X × (2^N / K) = 2^P
⟹ 2^(X+N−log2 K) = 2^P
⟹ 2^X = 2^(P−N+log2 K)
i.e., X = P − N + log2 K tag bits.
Correct Answer: B
34 votes -- Digvijay (44.9k points)
Given that cache is Fully Associative.
There are no index bits in fully associative cache because every main memory block can go to any location in the cache
⟹ Index bits = 0.
Given that memory is byte addressable and uses 32-bit address.
Cache Block size is 16 Bytes ⟹ Number of bits required for Block Offset = ⌈log2 16⌉ = 4 bits
∴ Number of Tag bits = 32 − 4 = 28.
Answer is (D).
Time to transfer a cache block = 1 + 3 + 8 = 12 cycles.
Answer: 160 × 10^6 bytes/sec
Explanation:
∴ One cycle is completed in 1 / (60 × 10^6) seconds.
Note: the total data is the data transmitted for the words of the requested block at the rate of 1 word per cycle = 8 words × 4 bytes (size of each word) = 32 bytes.
∴ Bandwidth = Total data / (12 cycles × cycle time)
= 32 bytes / (12 × 1/(60 × 10^6) s)
= (32 × 60 × 10^6) / 12
= 5 × 32 × 10^6
= 160 × 10^6 bytes/sec
∴ 160 × 10^6 bytes/sec is the correct answer.
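The arithmetic, checked in Python (a sketch; names are mine):

clock_hz = 60 * 10**6
block_bytes = 8 * 4            # 8 words x 4 bytes each
cycles = 1 + 3 + 8             # 12 cycles to transfer one cache block
print(block_bytes * clock_hz / cycles)   # 160000000.0 bytes/sec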
16 votes -- Jeet (15.3k points)
Block size is 256 bytes; word size is 64 bits, or 8 bytes. So the block size in words is 32 words.
Time to fetch a block from main memory to cache is 20 + 31 × 5 = 175 ns, because the first word takes 20 ns and each of the remaining 31 words takes 5 ns.
Block size is 256 bytes and the number of sets in the cache = 2^6, so there are 6 set bits and 8 offset bits (the last two hex digits of an address are the byte offset within the block).
So, to find the set, look at the rightmost 4 hex digits of each physical address and drop the offset byte:
A1 = C8A4 → C8 = 11001000
A2 = 6888 → 68 = 01101000
A3 = 289C → 28 = 00101000
A4 = 4880 → 48 = 01001000
Now look at the low-order 6 bits of each remaining byte (these give the set number): the low nibbles 8 and 8 match, and the two low-order bits of 6 = 0110 and 2 = 0010 also match, so A2 and A3 go to the same set.
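The set numbers can be extracted mechanically (a Python sketch; 8 offset bits and 6 set bits as derived above):

for name, addr in [("A1", 0xC8A4), ("A2", 0x6888),
                   ("A3", 0x289C), ("A4", 0x4880)]:
    print(name, (addr >> 8) & 0x3F)   # set = low 6 bits after dropping the offset
# A1 -> 8, A2 -> 40, A3 -> 40, A4 -> 8: A2 and A3 share a set (as do A1 and A4)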
1.2.53 Cache Memory: GATE CSE 2021 Set 1 | Question: 22 top☝ ☛ https://gateoverflow.in/357429
Tag bits = physical address bits − log2(cache size) + log2(K), where K is the associativity
= 32 − 15 + 0 = 17 bits
5 votes -- Nikhil Dhama (2.5k points)
1.2.54 Cache Memory: GATE CSE 2021 Set 2 | Question: 19 top☝ ☛ https://gateoverflow.in/357521
32 bit address is used for accessing the cache.
It is given that cache is Set-Associative.
The address bits get split as follows:
1.2.55 Cache Memory: GATE CSE 2021 Set 2 | Question: 27 top☝ ☛ https://gateoverflow.in/357513
S1 : Read Miss in a write through L1 cache results in read allocate. No write back is done here, as in a write through
L1 cache, both L1 and L2 caches are updated during a write operation (no dirty blocks and hence no dirty bits as in a write back
cache). So during a Read miss it will simply bring in the missed block from L2 to L1 which may replace one block in L1 (this
replaced block in L1 is already updated in L2 and so needs no write back). So, S1 is TRUE.
S2 : This statement is FALSE. Both write-through and write-back policies can use either of these write-miss policies, but usually
they are paired in this way.
No write allocation during write-through, as L1 and L2 are accessed on each write operation (subsequent writes to the same location give no advantage even if the location is in the L1 cache).
In write-back we tend to do write allocate in L1 after a write operation, hoping for subsequent writes to the same location, which will then hit in L1 and thus avoid a more expensive L2 access.
Correct Answer: A.
Cache Writing Policies
References
1.2.56 Cache Memory: GATE IT 2004 | Question: 12, ISRO2016-77 top☝ ☛ https://gateoverflow.in/3653
By default we consider hierarchical access, because that is the common implementation and a simultaneous-access cache has great practical difficulty. But here the question is a bit ambiguous -- it says to ignore the search time within the cache. Usually search is applicable for an associative cache, but no such information is given here. So maybe they mean to ignore the search time for L1 and just consider the time for L2 on an L1 miss, and similarly just consider the memory access time on an L2 miss. This is nothing but simultaneous access.
When 45 comes, the cache contents are:
4 3 25 8 19 6 16 35
[4 3 19 6 25 8 16 35] .
So, 45 replaces 4.
45 3 25 8 19 6 16 35 [3 19 6 25 8 16 35 45]
45 22 25 8 19 6 16 35 [19 6 25 8 16 35 45 22]
8 hits in cache.
45 22 25 8 19 6 16 35 [19 6 25 16 35 45 22 8]
3 replaces 19
45 22 25 8 3 6 16 35 [6 25 16 35 45 22 8 3]
45 22 25 8 3 6 16 35 [6 35 45 22 8 3 16 25]
128 main memory blocks are mapped to 4 sets in cache. So, each set maps 32 blocks each. And in each set there is place
for two blocks (2-way set).
So, based on the two index bits (lower 2 bits) blocks will be going to sets as follows:
Since, each set has only 2 places, 3 will be thrown out as it is the least recently used block. So, final content of cache will be
0 5 7 9 16 55
(C) choice.
Answer is C.
For 1 sec it is 10^9 bytes. (Note: by looking at the options, it can be decided that 1 GB should be considered as 10^9 and not 2^30; in general, if bandwidth is given, we consider 1 Giga = 10^9.)
So, for 64 bytes the transfer takes (64 × 1) / 10^9 s = 64 ns, but the main memory latency is 32 ns.
So, the total time required to place the cache line is 64 + 32 = 96 ns.
1. I-cache
Number of blocks in cache = 4 K / 4 = 2^10 blocks.
Bits to represent blocks = 10
Number of words in a block = 4 = 22 words.
Bits to represent words = 2.
tag bits = 30 − (10 + 2) = 18.
Each block will have it's own tag bits. So total tag bits = 1K × 18 bits.
2. D-cache
Number of blocks in cache = 4 K / 4 = 2^10 blocks.
Number of sets in cache = 2^10 / 2 = 2^9 sets.
Bits to represent sets = 9.
Number of words in a block = 4 = 22 words.
Bits to represent words = 2
tag bits = 30 − (9 + 2) = 19
Each block will have it's own tag bits. So total tag bits = 1K × 19 bits.
3. L2 cache
Number of blocks in cache = 64 K / 16 = 2^12 blocks.
Number of sets in cache = 2^12 / 4 = 1024 sets.
Bits to represent sets = 10
Number of words in a block = 16 / 4 = 4 words.
Bits to represent words = 2.
Option (A).
Answer is (B).
Cache location (memory block) = block req mod number of cache blocks. Since each block has only one location (associativity is
1) the last mod 8 request will be in cache (no need of any replacement policy as mapping is direct).
3, 5, 2, 8, 0, 63, 9, 16, 20, 17, 25, 18, 30, 24, 2, 63, 5, 82, 17, 24
Number of cache blocks = 8 KB / (128 × 1 B) = 64
So, the number of SET bits required = 4 (as 2^4 = 16, and with 4 bits we can index the 16 sets).
We can now straight away choose (D) as answer but for confirmation can proceed further.
Since, only physical memory information is given we can assume cache is physically tagged (which is anyway the common case
even in case of virtual memory).
So, we can divide the physical memory into 16 regions so that, each set maps into only its assigned region.
So, the size of the region a set can address = 1 MB / 16 = 2^16 bytes = 2^16 / 128 = 2^9 cache blocks (as the cache block size is 128 words = 128 bytes).
So, when an access comes to a cache entry, it must be able to determine which of the 2^9 possible physical blocks it is. In short, it needs 9 bits for TAG.
Now, cache block size is 128 words and so to identify a word we need 7 bits for WORD.
As shown in https://gateoverflow.in/3403/gate2008-it_80
We have 16 sets in cache and correspondingly 16 regions in physical memory to which each set is mapped.
Now, WORD bit size is 7 as we need 7 bits to address 128 possible words in a cache block.
Of these bits, the lower 4 are used for addressing the 16 possible sets, giving us the tag bits: 0000 1100 0 in (A) choice.
References
1.3.1 Cisc Risc Architecture: GATE CSE 1999 | Question: 2.22 top☝ ☛ https://gateoverflow.in/1499
The main difference(s) between a CISC and a RISC processor is/are that a RISC processor typically
A. has fewer instructions
B. has fewer addressing modes
C. has more registers
D. is easier to implement using hard-wired logic
Answer ☟
1.3.2 Cisc Risc Architecture: GATE CSE 2018 | Question: 5 top☝ ☛ https://gateoverflow.in/204079
Which of the characteristics above are used in the design of a RISC processor?
A. I and II only
B. II and III only
C. I and III only
D. I, II and III
Answer ☟
1.3.1 Cisc Risc Architecture: GATE CSE 1999 | Question: 2.22 top☝ ☛ https://gateoverflow.in/1499
All are properties of the RISC processor.
http://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/whatis/index.html
http://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/risccisc/index.html
http://homepage3.nifty.com/alpha-1/computer/Control_E.html
References
(D) All of these
"Hardwired control units are implemented through use of combinational logic units, featuring a finite number of gates that can generate specific results based on the instructions that were used to invoke those responses. Their design uses a fixed architecture: it requires changes in the wiring if the instruction set is modified or changed. This architecture is preferred in reduced instruction set computers (RISC) as they use a simpler instruction set."
Instruction length cannot vary in RISC; it is usually 32 bits. For CISC it can be between 16 and 64 bits.
Register-to-register operations are always possible in RISC. CISC can also have memory-to-memory instructions.
References: https://www-cs-faculty.stanford.edu/~eroberts/courses/soco/projects/2000-01/risc/risccisc/
https://en.wikipedia.org/wiki/Control_unit#Hardwired_control_unit
References
1.4.1 Clock Frequency: GATE CSE 1992 | Question: 01-iii top☝ ☛ https://gateoverflow.in/547
Many microprocessors have a specified lower limit on clock frequency (apart from the maximum clock frequency limit)
because _____
Answer ☟
The floating point unit of a processor using a design D takes 2t cycles compared to t cycles taken by the fixed point unit.
There are two more design suggestions D1 and D2 . D1 uses 30% more cycles for fixed point unit but 30% less cycles for floating
point unit as compared to design D. D2 uses 40% less cycles for fixed point unit but 10% more cycles for floating point unit as
compared to design D. For a given program which has 80% fixed point operations and 20% floating point operations, which of the
following ordering reflects the relative performances of three designs?
(Di > Dj denotes that Di is faster than Dj)
A. D1 > D > D2
B. D2 > D > D1
C. D > D2 > D1
D. D > D1 > D2
Answer ☟
1.4.1 Clock Frequency: GATE CSE 1992 | Question: 01-iii top☝ ☛ https://gateoverflow.in/547
Clock frequency becoming low means the time period of the clock becomes high. When this time period increases beyond the period within which the volatile memory contents must be refreshed, we lose those contents. So, the clock frequency can't go below this minimum limit.
Reference: https://gateoverflow.in/261/microprocessors-specified-frequency-frequency-__________
References
(B) is correct.
Execution time T = 0.8 × time taken for fixed point + 0.2 × time taken for floating point.
For D: 0.8 × t + 0.2 × 2t = 1.2t
For D1: 0.8 × 1.3t + 0.2 × 0.7 × 2t = 1.04t + 0.28t = 1.32t
For D2: 0.8 × 0.6t + 0.2 × 1.1 × 2t = 0.48t + 0.44t = 0.92t
So, D2 is the best design for the given program, followed by D and then D1. Option B.
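The three execution times can be tabulated directly (a Python sketch; the 0.8/0.2 weights and cycle counts are from the question, taking t = 1):

designs = {"D":  (1.0, 2.0),          # (fixed-point, floating-point) cycles
           "D1": (1.3, 2.0 * 0.7),    # 30% more fixed, 30% less floating
           "D2": (0.6, 2.0 * 1.1)}    # 40% less fixed, 10% more floating
for name, (fx, fl) in designs.items():
    print(name, 0.8 * fx + 0.2 * fl)  # D: 1.2, D1: 1.32, D2: 0.92 (lower = faster)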
1.5.1 Conflict Misses: GATE CSE 2017 Set 1 | Question: 51 top☝ ☛ https://gateoverflow.in/118745
Consider a 2− way set associative cache with 256 blocks and uses LRU replacement. Initially the cache is empty. Conflict
misses are those misses which occur due to the contention of multiple blocks for the same cache set. Compulsory misses occur due
to first time access to the block. The following sequence of access to memory blocks :
{0, 128, 256, 128, 0, 128, 256, 128, 1, 129, 257, 129, 1, 129, 257, 129}
is repeated 10 times. The number of conflict misses experienced by the cache is _________ .
Answer ☟
1.5.1 Conflict Misses: GATE CSE 2017 Set 1 | Question: 51 top☝ ☛ https://gateoverflow.in/118745
{0, 128, 256, 128, 0, 128, 256, 128, 1, 129, 257, 129, 1, 129, 257, 129}
1st iteration: for {0, 128, 256, 128, 0, 128, 256, 128} (all mapping to set 0 of the 128 sets), the first accesses to 0, 128 and 256 are compulsory misses; the second access to 0 and the second access to 256 miss because the block was evicted in between, so the total number of conflict misses in set 0 = 2.
Similarly for {1, 129, 257, 129, 1, 129, 257, 129}, the total number of conflict misses in set 1 = 2.
2nd iteration: now every access to 0 and to 256 is a conflict miss, giving 4 conflict misses in set 0.
Similarly for {1, 129, 257, 129, 1, 129, 257, 129}, the total number of conflict misses in set 1 = 4.
Note that the content of each set is the same before and after the 2nd iteration. Therefore each of the remaining iterations will also have 8 conflict misses, giving 4 + 9 × 8 = 76 in total.
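The count can be verified by replaying the trace (a Python sketch; following the question's definition, a miss to a block that was accessed before is counted as a conflict miss):

def conflict_misses(trace, num_sets=128, ways=2):
    sets = [[] for _ in range(num_sets)]   # per-set LRU lists, MRU at the end
    seen, conflicts = set(), 0
    for block in trace:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)                # hit: refresh LRU position
        else:
            if block in seen:
                conflicts += 1             # re-access miss => conflict miss
            seen.add(block)
            if len(s) == ways:
                s.pop(0)                   # evict the least recently used
        s.append(block)
    return conflicts

seq = [0, 128, 256, 128, 0, 128, 256, 128,
       1, 129, 257, 129, 1, 129, 257, 129]
print(conflict_misses(seq * 10))           # 76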
1.6.1 Control Unit: GATE CSE 1987 | Question: 1-vi top☝ ☛ https://gateoverflow.in/80199
Answer ☟
1.6.1 Control Unit: GATE CSE 1987 | Question: 1-vi top☝ ☛ https://gateoverflow.in/80199
A. is wrong. A microprogrammed control unit (CU) can never be faster than a hardwired CU: a microprogrammed CU has an extra layer on top of the hardwired CU and hence can only be slower.
B. is a suitable answer as we can add new instruction by changing the content of control memory.
C. is not correct as when only small programs are there, hardwired control makes more sense.
D. control unit can also be hardwired, so this is also not correct.
Reference: Slides
Correct Answer: B
References
1.7.1 Data Dependences: GATE CSE 2015 Set 3 | Question: 47 top☝ ☛ https://gateoverflow.in/8556
Consider the following code sequence having five instructions from I1 to I5 . Each of these instructions has the following
format.
OP Ri, Rj, Rk
Where operation OP is performed on contents of registers Rj and Rk and the result is stored in register Ri.
I1 : ADD R1, R2, R3
I2 : MUL R7, R1, R3
I3 : SUB R4, R1, R5
I4 : ADD R3, R2, R4
I5 : MUL R7, R8, R9
Consider the following three statements.
S1: There is an anti-dependence between instructions I2 and I5
S2: There is an anti-dependence between instructions I2 and I4
S3: Within an instruction pipeline an anti-dependence always creates one or more stalls
Which one of the above statements is/are correct?
A. Only S1 is true
B. Only S2 is true
C. Only S1 and S3 are true
D. Only S2 and S3 are true
Answer ☟
Data forwarding techniques can be used to speed up the operation in presence of data dependencies. Consider the following
replacements of LHS with RHS.
Answer ☟
1.7.1 Data Dependences: GATE CSE 2015 Set 3 | Question: 47 top☝ ☛ https://gateoverflow.in/8556
Answer should be (B).
Anti-dependence can be overcome in pipeline using register renaming. So, "always" in S3 makes it false. Also, if I2 is completed
before I4 (execution stage of MUL), then also there won't be any stall.
i. is true. Both LOC and R2 get the value of R1 in LHS and RHS.
ii. is false, because R2 gets the correct data in both LHS and RHS, but LOC is not updated in RHS.
iii. is wrong because it is R2 that is written last (not R1) in LHS, but not in RHS.
iv. is true. The first write to LOC in LHS is useless as it is overwritten by the next write.
A single bus CPU consists of four general purpose register, namely, R0, … , R3, ALU, MAR, MDR, PC, SP and IR
(Instruction Register). Assuming suitable microinstructions, write a microroutine for the instruction, ADD R0, R1 .
Answer ☟
1.8.2 Data Path: GATE CSE 2001 | Question: 2.13 top☝ ☛ https://gateoverflow.in/731
Consider the following data path of a simple non-pipelined CPU. The registers A, B, A1 , A2 , MDR , the bus and the ALU are
8-bit wide. SP and MAR are 16-bit registers. The MUX is of size 8 × (2 : 1) and the DEMUX is of size 8 × (1 : 2) . Each
memory operation takes 2 CPU clock cycles and uses MAR (Memory Address Register) and MDR (Memory Date Register). SP
can be decremented locally.
M[SP] ← r
SP ← SP − 1
How many CPU clock cycles are required to execute the "push r" instruction?
A. 2
B. 3
C. 4
D. 5
Answer ☟
The ALU, the bus and all the registers in the data path are of identical size. All operations including incrementation of the PC and
the GPRs are to be carried out in the ALU. Two clock cycles are needed for memory read operation – the first one for loading
address in the MAR and the next one for loading data from the memory bus into the MDR.
The instruction ‘‘add R0, R1” has the register transfer interpretation R0 ⇐ R0 + R1. The minimum number of clock cycles
needed for execution cycle of this instruction is:
A. 2
B. 3
C. 4
D. 5
Answer ☟
The ALU, the bus and all the registers in the data path are of identical size. All operations including incrementation of the PC and
the GPRs are to be carried out in the ALU. Two clock cycles are needed for memory read operation – the first one for loading
address in the MAR and the next one for loading data from the memory bus into the MDR.
The instruction "call Rn, sub” is a two word instruction. Assuming that PC is incremented during the fetch cycle of the first word of
the instruction, its register transfer interpretation is
Rn ← PC + 1 ;
PC ← M[PC] ;
The minimum number of CPU clock cycles needed during the execution cycle of this instruction is:
A. 2
B. 3
C. 4
D. 5
Answer ☟
1.8.5 Data Path: GATE CSE 2016 Set 2 | Question: 30 top☝ ☛ https://gateoverflow.in/39627
Suppose the functions F and G can be computed in 5 and 3 nanoseconds by functional units UF and UG , respectively. Given
two instances of UF and two instances of UG , it is required to implement the computation F(G(Xi )) for 1 ≤ i ≤ 10 . Ignoring all
other delays, the minimum time required to complete this computation is ____________ nanoseconds.
Answer ☟
Which one of the following is the correct order of execution of the above steps?
A. 2, 1, 4, 5, 3
B. 1, 2, 4, 3, 5
C. 3, 5, 2, 1, 4
D. 3, 5, 1, 2, 4
Answer ☟
MAR ← PC           ; address of the instruction
PC ← PC + 3        ; advance PC by 3
MDR ← MEM[MAR]     ; fetch the instruction word
IR ← MDR
MAR ← MAR + 1
MDR ← MEM[MAR]     ; fetch the next word
R0 ← MDR
MAR ← MAR + 1
MDR ← MEM[MAR]     ; fetch the following word
R1 ← MDR
R0 ← R0 + R1       ; perform the addition
1.8.2 Data Path: GATE CSE 2001 | Question: 2.13 top☝ ☛ https://gateoverflow.in/731
A microinstruction cannot be further broken down into two or more. It can take more than a cycle if it involves a memory
access. The first instruction given here is not a microinstruction. It is an assembly language instruction.
T1, T2: MAR ← SP (the 16-bit SP moves over the 8-bit bus in two cycles, as explained below)
T3: MDR ← r, SP ← SP − 1 (it is not mandatory to decrement it in this cycle; anyway, it can be decremented locally)
T4, T5: M[MAR] ← MDR
The problem says 8-bit MDR, 8-bit data bus, 8-bit registers: the given CPU is 8-bit. The 8 × (2:1) MUX transfers 8 bits when the selection input is 0 and another 8 bits when it is 1: during cycle 1, the bits in even positions are moved to MAR, and during cycle 2, the bits in odd positions. We need to move the 16-bit SP to the 16-bit MAR via an 8-bit bus, so it takes 2 cycles to get SP into MAR.
The given data path has a single bus, which requires r to be carried in a separate cycle. For the contents of r to be moved to MDR during T1 or T2, the address and data buses would have to be separate, which is not the case here.
The memory write takes 2 more cycles. In total, we need 5 clock cycles to execute a push.
https://www.cise.ufl.edu/~mssz/CompOrg/CDA-proc.html
Computer Organization, Pal Chaudhuri, pages 334-335
Instruction fetch requires two cycles but the question asks for the execution part only!
Now for execution:
MAR ← PC → 1st cycle
S ← PC (since these two actions are independent, they can be done in the same cycle)
MDR ← M[MAR] → 2nd cycle (system bus)
Rn ← S + 1 (the ALU is free and the two actions are independent) (internal bus)
PC ← MDR → 3rd cycle
1.8.5 Data Path: GATE CSE 2016 Set 2 | Question: 30 top☝ ☛ https://gateoverflow.in/39627
The same concept is used in pipelining. The bottleneck here is UF, as it takes 5 ns while UG takes only 3 ns. We have to do 10 such calculations and we have 2 instances each of UF and UG. So, the UF work can be done in (10 × 5)/2 = 25 nanoseconds. At the start, UF needs to wait 3 ns for the first UG output; the rest is pipelined, with no further waiting. So, the answer is
3 + 25 = 28 ns.
3rd followed by 5th are Instruction fetch cycle micro operations and can be elaborated as follows:
t1 : MAR w ← PCr
t2 : MDRw ← Memoryr ∣ PC ← PC + 1
t3 : IRw ← MDRr
Now we need to perform Execute cycle micro operations. Just observe the figure and it will be very easy to identify the sequence
between 1st , 2nd , 4th
A hard disk with a transfer rate of 10 Mbytes/second is constantly transferring data to memory using DMA. The processor
runs at 600 MHz, and takes 300 and 900 clock cycles to initiate and complete DMA transfer respectively. If the size of the transfer
is 20 Kbytes, what is the percentage of processor time consumed for the transfer operation?
A. 5.0%
B. 1.0%
C. 0.5%
D. 0.1%
Answer ☟
Answer ☟
On a non-pipelined sequential processor, a program segment, which is the part of the interrupt service routine, is given to
transfer 500 bytes from an I/O device to memory.
Initialize the address register
Initialize the count to 500
LOOP: Load a byte from device
Store in memory at address given by address register
Increment the address register
Decrement the count
If count !=0 go to LOOP
Assume that each statement in this program is equivalent to a machine instruction which takes one clock cycle to execute if it is a
non-load/store instruction. The load-store instructions take two clock cycles to execute.
The designer of the system also has an alternate approach of using the DMA controller to implement the same transfer. The DMA
controller requires 20 clock cycles for initialization and other overheads. Each DMA transfer cycle takes two clock cycles to transfer
one byte of data from the device to the memory.
What is the approximate speed up when the DMA controller based design is used in a place of the interrupt driven program based
input-output?
A. 3.4
B. 4.4
C. 5.1
D. 6.7
Answer ☟
Answer ☟
Consider a computer system with DMA support. The DMA module is transferring one 8-bit character in one CPU cycle
from a device to memory through cycle stealing at regular intervals. Consider a 2 MHz processor. If 0.5% processor cycles are
used for DMA , the data transfer rate of the device is __________ bits per second.
Answer ☟
The storage area of a disk has the innermost diameter of 10 cm and outermost diameter of 20 cm. The maximum storage
density of the disk is 1400 bits/cm. The disk rotates at a speed of 4200 RPM. The main memory of a computer has 64-bit word
length and 1µs cycle time. If cycle stealing is used for data transfer from the disk, the percentage of memory cycles stolen for
transferring one word is
A. 0.5%
B. 1%
C. 5%
D. 10%
Answer ☟
Answers: Dma
Clock cycle time = 1 / (600 × 10^6) s [Frequency = 1/Time]
CPU time for DMA initiation and completion = (900 + 300) / (600 × 10^6) s = 2 microseconds.
Transferring the 20 KB at 10 MB/s takes 2000 microseconds.
Percentage = (2 / (2 + 2000)) × 100 = 0.0999 ≃ 0.1%
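A quick check of the arithmetic (a Python sketch; variable names are mine, values from the question):

overhead = (300 + 900) / (600 * 10**6)         # set-up + completion cycles = 2e-6 s
transfer = (20 * 10**3) / (10 * 10**6)         # 20 KB at 10 MB/s = 2e-3 s
print(100 * overhead / (overhead + transfer))  # ~0.0999 %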
option (D)
% of CPU time consumed = x / (x + y), where x = data preparation time (the total cycle time used by the CPU) and y = data transfer time.
We use x / (x + y) to calculate the fraction of CPU time when it is burst mode.
Correct Answer: A
90 votes -- Manu Thakur (34k points)
The data count register gives the number of words the DMA can transfer in a single cycle.
Answer is 80,000.
To complete one cycle at 2 MHz it will take 1 / (2 × 10^6) seconds. So the total number of CPU cycles in one second will be 2 × 10^6.
Now 0.5% of these cycles are used by the DMA to transfer the data.
So the total number of cycles used to transfer the data will be (0.5/100) × 2 × 10^6 = 10,000, and in each cycle 8 bits are transferred.
So, the data transfer rate in bits per second = 8 × 10,000 = 80,000.
3 votes -- JATIN MITTAL (2.1k points)
In a disk all tracks have equal capacity and so the data density is highest for the innermost track, as it has the smallest radius. Capacity of a track = circumference × density = π × 10 × 1400 ≈ 43960 bits (taking π = 3.14).
With 4200 rotations per minute, data transfer rate = (4200 × 43960) / 60 bits per second.
Therefore, to transfer 64 bits, the time required = (60 / (4200 × 43960)) × 64 = 20.798 μs ≈ 21 μs.
With 1 μs memory cycle time, the disk takes one memory cycle out of 21 + 1, i.e., (1 / (21 + 1)) × 100 ≈ 5%.
(PS: If we consider just one word transfer we add the memory cycle time to the disk transfer time in the denominator but for
continuous DMA transfer, this is not required as when data is transferred to main memory, disk can continue reading new data)
The chip select logic for a certain DRAM chip in a memory system design is shown below. Assume that the memory system has 16 address lines denoted by A15 to A0. What is the range of addresses (in hexadecimal) of the memory system that can get enabled by the chip select (CS) signal?
A. C800 to CFFF
B. CA00 to CAFF
C. C800 to C8FF
D. DA00 to DFFF
Answer ☟
Answers: Dram
(A15 A14 A13 A12 A11 A10 A9 A8 A7 A6 A5 A4 A3 A2 A1 A0 )
According to question:
Converting to Hexadecimal:
(C800) to (CFFF)
Option A.
40 votes -- Ruturaj Mohanty (3.1k points)
The memory system that can be enabled by the chip select signal: for the starting address, keep all 0's in the unspecified (don't care) address bits, and for the ending address keep all 1's.
1.11.1 Instruction Execution: GATE CSE 1990 | Question: 4-iii top☝ ☛ https://gateoverflow.in/85391
State whether the following statements are TRUE or FALSE with reason:
The flags are affected when conditional CALL or JUMP instructions are executed.
Answer ☟
1.11.2 Instruction Execution: GATE CSE 1992 | Question: 01-iv top☝ ☛ https://gateoverflow.in/548
Many of the advanced microprocessors prefetch instructions and store it in an instruction buffer to speed up processing. This
speed up is achieved because ________
Answer ☟
1.11.3 Instruction Execution: GATE CSE 1995 | Question: 1.2 top☝ ☛ https://gateoverflow.in/2589
Answer ☟
1.11.4 Instruction Execution: GATE CSE 2002 | Question: 1.13 top☝ ☛ https://gateoverflow.in/817
Answer ☟
Consider a new instruction named branch-on-bit-set (mnemonic bbs). The instruction “bbs reg, pos, label” jumps to label if bit
in position pos of register operand reg is one. A register is 32 -bits wide and the bits are numbered 0 to 31, bit in position 0 being
the least significant. Consider the following emulation of this instruction on a processor that does not have bbs implemented.
temp ← reg&mask
Branch to label if temp is non-zero. The variable temp is a temporary register. For correct emulation, the variable mask must be
generated by
A. mask ← 0x1 << pos
B. mask ← 0xffffffff << pos
C. mask ← pos
D. mask ← 0xf
Answer ☟
1.11.6 Instruction Execution: GATE CSE 2017 Set 1 | Question: 49 top☝ ☛ https://gateoverflow.in/118332
Consider a RISC machine where each instruction is exactly 4 bytes long. Conditional and unconditional branch instructions
use PC-relative addressing mode with Offset specified in bytes to the target location of the branch instruction. Further the Offset is
always with respect to the address of the next instruction in the program sequence. Consider the following instruction sequence
If the target of the branch instruction is i, then the decimal value of the Offset is ____________ .
Answer ☟
1.11.7 Instruction Execution: GATE CSE 2021 Set 2 | Question: 53 top☝ ☛ https://gateoverflow.in/357484
Consider a pipelined processor with 5 stages, Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB). Each stage of the pipeline, except the EX stage, takes one cycle. Assume that the ID stage merely decodes the instruction and the register read is performed in the EX stage. The EX stage takes one cycle for an ADD instruction and two cycles for a MUL instruction. Ignore pipeline register latencies.
Assume that every MUL instruction is data-dependent on the ADD instruction just before it and every ADD instruction (except the
first ADD) is data-dependent on the MUL instruction just before it. The speedup is defined as follows.
The Speedup achieved in executing the given instruction sequence on the pipelined processor (rounded to 2 decimal places) is
_____________
Answer ☟
1.11.1 Instruction Execution: GATE CSE 1990 | Question: 4-iii top☝ ☛ https://gateoverflow.in/85391
False. Flags are tested during conditional CALL and JUMP instructions; they are not affected or changed.
15 votes -- khush tak (5.9k points)
1.11.2 Instruction Execution: GATE CSE 1992 | Question: 01-iv top☝ ☛ https://gateoverflow.in/548
Because CPU is faster than memory. Fetching instructions from memory would require considerable amount of time
while CPU is much faster. So, prefetching the instructions to be executed can save considerable amount of waiting time.
23 votes -- Arjun Suresh (332k points)
1.11.3 Instruction Execution: GATE CSE 1995 | Question: 1.2 top☝ ☛ https://gateoverflow.in/2589
Only the top of the stack can be accessed at any time. You can imagine a stack to be opened from only one side data structure. So
that if we put one thing over the other, we are able to access the last thing we inserted first. That is Last in First Out (LIFO).
1.11.4 Instruction Execution: GATE CSE 2002 | Question: 1.13 top☝ ☛ https://gateoverflow.in/817
The instruction opcode is a part of the instruction which tells the processor what operation is to be performed so it is not
a form of memory while the others are
26 votes -- Bhagirathi Nayak (11.7k points)
A. mask ← 0x1 << pos
We want to check a particular bit position, say 2 (third from the right). Let the number be 0xA2A7 (last 4 bits being 0111). Here, the bit at position 2 from the right is 1. So, we have to AND this with 0x0004, as any other mask would give a wrong value (it may count other bits or discard the bit at position "pos"). And 0x0004 is obtained by 0x1 << 2 (by shifting 1 "pos" times to the left we get a mask with 1 set only at the "pos" bit position).
1.11.6 Instruction Execution: GATE CSE 2017 Set 1 | Question: 49 top☝ ☛ https://gateoverflow.in/118332
Answer is −16.
The Program Counter is updated with the address of the next instruction even before the current instruction is executed; that is why the question says the Offset is with respect to the address of the next instruction in the program sequence.
Before executing instruction i + 3, the current state looks as under:
(Please note: the BEQ instruction is Branch if Equal.)
The question says that the target of the branch instruction is i, which is at 2000 in our example.
So, we need to go to address 2000 from address 2016 (which is currently pointed to by PC).
2016 − 2000 = 16
So, we have to specify the Offset as −16, which means that 16 should be subtracted from the address of the next instruction (2016).
1.11.7 Instruction Execution: GATE CSE 2021 Set 2 | Question: 53 top☝ ☛ https://gateoverflow.in/357484
Correct Answer: 1.875
Speedup (as defined in the question) = Time without Operand Forwarding / Time with Operand Forwarding
Without Operand Forwarding:
(Pipeline timing diagram, cycles 1 to 26: the ADD and MUL instructions alternate; each MUL spends two cycles in EX, and because the register read happens in EX, each instruction's EX must wait for the result of the previous instruction, stalling the pipeline.)
1.12.1 Instruction Format: GATE CSE 1988 | Question: 2-ii top☝ ☛ https://gateoverflow.in/91676
Using an expanding opcode encoding for instructions, is it possible to encode all of the following in an instruction format
shown in the below figure. Justify your answer.
Answer ☟
1.12.2 Instruction Format: GATE CSE 1992 | Question: 01-vi top☝ ☛ https://gateoverflow.in/551
In an 11-bit computer instruction format, the size of address field is 4-bits. The computer uses expanding OP code technique
and has 5 two-address instructions and 32 one-address instructions. The number of zero-address instructions it can support is
________
Answer ☟
1.12.3 Instruction Format: GATE CSE 1994 | Question: 3.2 top☝ ☛ https://gateoverflow.in/2479
Expanding opcode instruction formats are commonly employed in RISC. (Reduced Instruction Set Computers) machines.
Answer ☟
1.12.4 Instruction Format: GATE CSE 2014 Set 1 | Question: 9 top☝ ☛ https://gateoverflow.in/1767
A machine has a 32 − bit architecture, with 1 − word long instructions. It has 64 registers, each of which is 32 bits long. It
needs to support 45 instructions, which have an immediate operand in addition to two register operands. Assuming that the
immediate operand is an unsigned integer, the maximum value of the immediate operand is ____________
Answer ☟
1.12.5 Instruction Format: GATE CSE 2016 Set 2 | Question: 31 top☝ ☛ https://gateoverflow.in/39601
Consider a processor with 64 registers and an instruction set of size twelve. Each instruction has five distinct fields, namely,
opcode, two source register identifiers, one destination register identifier, and twelve-bit immediate value. Each instruction must be
stored in memory in a byte-aligned fashion. If a program has 100 instructions, the amount of memory (in bytes) consumed by the
program text is _________.
Answer ☟
A processor has 16 integer registers (R0, R1, … , R15) and 64 floating point registers (F0, F1, … , F63). It uses a 2- byte
instruction format. There are four categories of instructions: Type-1, Type-2, Type-3, and Type-4. Type-1 category consists of
four instructions, each with 3 integer register operands (3Rs). Type-2 category consists of eight instructions, each with 2 floating
point register operands (2Fs). Type-3 category consists of fourteen instructions, each with one integer register operand and one
floating point register operand (1R+1F). Type-4 category consists of N instructions, each with a floating point register operand
(1F).
Answer ☟
A processor has 64 registers and uses 16-bit instruction format. It has two types of instructions: I-type and R-type. Each I-type
instruction contains an opcode, a register name, and a 4-bit immediate value. Each R-type instruction contains an opcode and two
register names. If there are 8 distinct I-type opcodes, then the maximum number of distinct R-type opcodes is _______.
Answer ☟
1.12.1 Instruction Format: GATE CSE 1988 | Question: 2-ii top☝ ☛ https://gateoverflow.in/91676
4 bits are for the opcode so number of 2 address instructions will be 24 = 16− so 14 double instructions are possible.
But out of 16 only 14 are used so 2 are still left which can be used for 1 address instruction. For 1 address instruction we can use
not only the 2 left over but also the 6 bits of operand 1 (to make it one address) − so 6 bits that is 64. So, total 2 × 64 single
address instructions can be supported − So, 127 single instructions are possible
But out of 128, 127 are used so 1 left which can be used for zero-address instruction. To make number of zero address we can
use the operand 2 address (we already included operand 1 address) −6 bits. So, total number possible is 64. So,
total 1 × 64 = 64 zero address instructions are possible.
1.12.2 Instruction Format: GATE CSE 1992 | Question: 01-vi top☝ ☛ https://gateoverflow.in/551
No. of possible instruction encodings = 2^11 = 2048
The 5 two-address instructions use 5 × 2^8 = 1280 encodings, and the 32 one-address instructions use 32 × 2^4 = 512 encodings. So the number of zero-address instructions it can support is 2048 − 1280 − 512 = 256.
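The leftover-encoding count, as a quick check (a Python sketch of the arithmetic above):

total = 2 ** 11                      # all 11-bit instruction encodings
two_addr = 5 * (2 ** 8)              # 5 two-address instructions x two 4-bit addresses
one_addr = 32 * (2 ** 4)             # 32 one-address instructions x one 4-bit address
print(total - two_addr - one_addr)   # 256 zero-address instructions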
1.12.3 Instruction Format: GATE CSE 1994 | Question: 3.2 top☝ ☛ https://gateoverflow.in/2479
I think the answer is TRUE.
RISC systems use fixed length instruction to simplify pipeline.
eg: MIPS, PowerPC: Instructions are 4 bytes long.
CISC systems use Variable-length instructions.
eg: Intel 80X86 : Instructions vary from 1 to 17 bytes long.
Now the challenge is: How to fit multiple sets of instruction types into same (limited) number of bits (Fixed size instruction)?
Here comes Expanding opcode into the picture.
RISC systems commonly uses Expanding opcode technique to have fixed size instructions.
1.12.4 Instruction Format: GATE CSE 2014 Set 1 | Question: 9 top☝ ☛ https://gateoverflow.in/1767
64 registers means 6 bits (⌈log2 64⌉ = 6) for a register operand. So, 2 registers operand requires 12 bits. Now, 45
instructions require another 6 bits for opcode (⌈log2 45⌉ = 6) . So, totally 18 bits. So, we have 32 − 18 = 14 bits left for the
immediate operand. So, the max value will be 214 − 1 = 16383 (as the operand is unsigned we do not need a sign bit and with
14 bits we can represent from 0 to 214 − 1 )
82 votes -- Arjun Suresh (332k points)
1.12.5 Instruction Format: GATE CSE 2016 Set 2 | Question: 31 top☝ ☛ https://gateoverflow.in/39601
Answer: 500 bytes
Number of registers = 64
Number of bits to address register = ⌈log2 64⌉ = 6 − bits
Number of Instructions = 12
Opcode size = ⌈log2 12⌉ = 4
Instruction size = 4 (opcode) + 6 + 6 + 6 (three register identifiers) + 12 (immediate) = 34 bits. Since each instruction must be stored byte-aligned, it occupies ⌈34/8⌉ = 5 bytes.
So, memory consumed by the program text = 100 × 5 = 500 bytes.
We have 2-byte instruction format. So, total number of instruction encodings = 216
PS: This is not the number of different instructions but different encodings; a single instruction can have different encodings
when the address part differs.
No. of bits taken by an integer operand (16 possible integer registers) = log2 16 = 4.
No. of bits taken by a floating point operand (64 possible floating point registers) = log2 64 = 6.
No. of encodings left for Type 4 = 216 − (214 + 215 + 14336) = 2048.
No. of different instructions of Type 4 = 2048 / 64 = 32.
83 votes -- Arjun Suresh (332k points)
Answer: 32 Instructions
Explanation:
Given,
16 Integer registers. So, we need 4 bits to address any one of them.
64 Floating point registers. This requires 6 bits to uniquely identify them.
Each instruction is 16 bits long.
Type 1 Instructions:
4 instructions, each with 3 integer operands.
The 3 integers, each requires 4 bits. So, 4 ∗ 3 bits for operands. We are left with 16 − 12 = 4 bits.
With 4 bits, 24 = 16 opcodes are possible. Out of these we used 4 opcodes. i.e 2 bits. Let's say first two bits are fixed to 00 and
next two bits are used for 4 different Type1 instructions.
00 00 . . .
00 01 . . .
00 10 . . .
00 11 . . .
Type 2 Instructions:
8 instructions, each with 2 floating point register operands.
Here we need 6 ∗ 2 bits for operands, and remaining 16 − 12 = 4 bits are left for opcodes.
So using these 4 bits, we need to get 8 opcodes.
Here we can't use 00 . . . for any opcode since it will not distinguish Type 2 from Type 1. So, we are left with 12 opcodes. And
we are going to use 8 out of these 12 for type 2 instructions.
01 00 . . .
01 01 . . .
01 10 . . .
01 11 . . .
10 00 . . .
10 01 . . .
Instruction length: 16 bits.
To distinguish among 64 registers, we need log2(64) = 6 bits per register name.
I-type instruction format: opcode (6 bits) + register (6 bits) + 4-bit immediate. R-type instruction format: opcode (4 bits) + two register names (12 bits).
Total encodings = 2^16. The 8 distinct I-type opcodes consume 8 × 2^10 = 2^13 encodings, and each R-type opcode consumes 2^12 encodings. So the maximum number of distinct R-type opcodes = (2^16 − 2^13) / 2^12 = 16 − 2 = 14.
Answer ☟
In a vectored interrupt:
Answer ☟
Answer ☟
A device employing INTR line for device interrupt puts the CALL instruction on the data bus while:
A. INTA (the active-low interrupt acknowledge signal) is active
B. HOLD is active
C. READY is inactive
D. None of the above
Answer ☟
A device with data transfer rate 10 KB/sec is connected to a CPU. Data is transferred byte-wise. Let the interrupt overhead be
4μsec. The byte transfer time between the device interface register and CPU or memory is negligible. What is the minimum
performance gain of operating the device under interrupt mode over operating it under program-controlled mode?
A. 15
B. 25
C. 35
D. 45
Answer ☟
Answer ☟
Answer ☟
Answers: Interrupts
Answer should be (D), i.e., the CPU branches off to the ISR after completing the current instruction.
The CPU checks the interrupt status bit at the completion of each instruction; when there is an interrupt, it services it using the ISR.
https://gateoverflow.in/18581/isro2009-78
References
Answer: B
A vectored interrupt is a processing technique in which the interrupting device directs the processor to the appropriate interrupt
service routine. This is in contrast to a polled interrupt system, in which a single interrupt service routine must determine the
source of the interrupt by checking all potential interrupt sources, a slow and relatively laborious process.
33 votes -- Rajarshi Sarkar (27.9k points)
Answer is (A).
A processor checks for the interrupt before FETCHING an instruction, so option (C) is also false.
INTR is a maskable interrupt request signal: when the microprocessor (with interrupts enabled) receives a high INTR signal, it activates the INTA (interrupt acknowledge) signal, and another request can't be accepted till the CPU is done servicing the current interrupt. Hence (A) is the correct option.
In Programmed I/O, the CPU issues a command and waits for I/O operations to complete.
In Interrupt mode , data is transferred word by word (here word size is 1 byte as mentioned in question
"Data is transferred byte-wise").
So, to transfer 1 byte of data the overhead is 4 × 10^−6 sec; thus to transfer 10 KB of data the overhead is 4 × 10^−6 × 10^4 = 4 × 10^−2 sec.
In program-controlled mode the CPU is busy for the entire 1 second in which the device delivers the 10 KB.
Performance gain = 1 / (4 × 10^−6 × 10^4) = 1 / (4 × 10^−2) = 25
It will be (C).
After finishing the execution of each instruction the CPU reads the interrupt pins to recognize the interrupts.
Answer : C
I is true
The daisy-chaining method of establishing priority consists of a serial connection of all devices that request an interrupt.
The device with the highest priority is placed in the first position, followed by lower-priority devices up to the device
with the lowest priority, which is placed last in the chain.
II. is false
Vectored interrupts are achieved by assigning each interrupting device a unique code, typically four to eight bits in length.
When a device interrupts, it sends its unique code over the data bus to the processor, telling the processor which interrupt
service routine to execute.
III. is true
The process of periodically checking status bits to see if it is time for the next I/O operation is called polling. Polling is the simplest way for an I/O device to communicate with the processor.
IV. is false
Since the CPU releases the bus only on getting a request from the DMA controller, and gets it back only after the DMA releases the bus.
In a microprocessor-based system, if a bus (DMA) request and an interrupt request arrive simultaneously, the microprocessor attends first to the bus request.
Answer ☟
Data transfer between a microprocessor and an I/O device is usually faster in memory-mapped-I/O scheme than in I/O-mapped -I/O
scheme.
Answer ☟
State whether the following statements are TRUE or FALSE with reason:
The data transfer between memory and I/O devices using programmed I/O is faster than interrupt-driven I/O.
Answer ☟
For the daisy chain scheme of connecting I/O devices, which of the following statements is true?
Answer ☟
A hard disk is connected to a 50 MHz processor through a DMA controller. Assume that the initial set-up of a DMA transfer
takes 1000 clock cycles for the processor, and assume that the handling of the interrupt at DMA completion requires 500 clock
cycles for the processor. The hard disk has a transfer rate of 2000 Kbytes/sec and average block transferred is 4 K bytes. What
fraction of the processor time is consumed by the disk, if the disk is actively transferring 100% of the time?
Answer ☟
A. A − 4 B−3 C −1 D−2
B. A − 2 B−1 C −3 D−4
C. A − 4 B−3 C −2 D−1
D. A − 2 B−3 C −4 D−1
Answer ☟
1.14.7 Io Handling: GATE CSE 2008 | Question: 64, ISRO2009-13 top☝ ☛ https://gateoverflow.in/487
Which of the following statements about synchronous and asynchronous I/O is NOT true?
A. An ISR is invoked on completion of I/O in synchronous I/O but not in asynchronous I/O
B. In both synchronous and asynchronous I/O, an ISR (Interrupt Service Routine) is invoked after completion of the I/O
C. A process making a synchronous I/O call waits until I/O is complete, but a process making an asynchronous I/O call does not
wait for completion of the I/O
D. In the case of synchronous I/O, the process waiting for the completion of I/O is woken up by the ISR that is invoked after the
completion of I/O
Answer ☟
Answers: Io Handling
The HOLD input has a higher priority than the INTR or NMI interrupt inputs.
True
It takes extra time in I/O-mapped I/O because of the separate control signals involved, so memory-mapped I/O is usually faster.
0 votes -- Arnabh Gangwar (307 points)
False, because in programmed I/O the CPU checks the I/O devices' status according to the written program. Suppose the CPU requested 5 I/O devices and the program is written to check them sequentially; even if the 5th device is ready before the 2nd device, the CPU will only come to check it at its turn.
So, programmed I/O doesn't care about the availability status of the devices; it blindly works according to the written program. That's why it is slow.
Interrupt-driven I/O: here, if any device is ready it won't wait for the CPU; it tells the CPU "I am ready" by sending an interrupt request, and the delay is only the time taken in servicing the interrupt, which is less than in programmed I/O.
Daisy chaining approach tells the processor in which order the interrupt should be handled by providing priority to the
devices.
In daisy-chaining method, all the devices are connected in serial. The device with the highest priority is placed in the first
position, followed by lower priority devices. The interrupt pin is common to all.
2000 KB is transferred in 1 second, so the average 4 KB block takes 2 ms to transfer.
Set-up and interrupt handling take 1000 + 500 = 1500 processor cycles per block, which at 50 MHz is 30 μs.
Fraction of CPU time consumed = 30 μs / (30 μs + 2 ms) = 0.015
45 votes -- Pooja Palod (24.1k points)
Correct Option: B. A − 2, B − 1, C − 3, D − 4
Reason:
DMA I/O - For high speed, high volume data transfer from disk without affecting the processor(in most cases).
Cache - a high-speed, small-capacity memory, faster than main RAM.
Interrupt I/O - The printer sends an interrupt signal when it is ready for use.
Condition Code Register - Part of the ALU, as a special purpose register, to store flag bits.
[Source - Google/Wikipedia]
1.14.7 Io Handling: GATE CSE 2008 | Question: 64, ISRO2009-13 top☝ ☛ https://gateoverflow.in/487
Answer is (B).
In synchronous I/O, the process performing the I/O operation is placed in the blocked state till the I/O operation is completed. An ISR is invoked after the completion of the I/O operation, and it moves the process from the blocked state to the ready state.
In asynchronous I/O, a handler function is registered when the I/O operation is initiated. The process is not placed in the blocked state and continues execution; the ISR invoked on completion of the I/O calls the registered handler.
The following program fragment was written in an assembly language for a single address computer with one accumulator
register:
LOAD B
MULT C
STORE T1
ADD A
STORE T2
MULT T2
ADD T1
STORE Z
Answer ☟
A. Assume that a CPU has only two registers R1 and R2 and that only the following instruction is available
XOR Ri , Rj ; {Rj ← Ri ⊕ Rj , for i, j = 1, 2}
Using this XOR instruction, find an instruction sequence in order to exchange the contents of the registers R1 and R2
B. The line p of the circuit shown in figure has stuck at 1 fault. Determine an input test to detect the fault.
Answer ☟
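For part A, the standard three-XOR exchange works with the given semantics (Rj ← Ri ⊕ Rj). A quick Python check of the idea (a sketch; the register values are arbitrary examples, not from the question):

r1, r2 = 0b1010, 0b0110
r2 = r1 ^ r2     # XOR R1, R2 : R2 = R1 xor R2
r1 = r2 ^ r1     # XOR R2, R1 : R1 now holds the original R2
r2 = r1 ^ r2     # XOR R1, R2 : R2 now holds the original R1
print(bin(r1), bin(r2))   # 0b110 0b1010 -- contents exchanged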
Consider the following program fragment in the assembly language of a certain hypothetical processor. The processor has
three general purpose registers R1, R2and R3. The meanings of the instructions are shown by comments (starting with ;) after the
instructions.
X: CMP R1, 0; Compare R1 and 0, set flags appropriately in status register
JZ Z; Jump if zero to target Z
MOV R2, R1; Copy contents of R1 to R2
SHR R1; Shift right R1 by 1 bit
SHL R1; Shift left R1 by 1 bit
CMP R2, R1; Compare R2 and R1 and set flag in status register
JZ Y; Jump if zero to target Y
INC R3; Increment R3 by 1;
Y: SHR R1; Shift right R1 by 1 bit
JMP X; Jump to target X
Z:...
A. Initially R1, R2 and R3 contain the values 5,0 and 0 respectively, what are the final values of R1 and R3 when control
reaches Z?
B. In general, if R1, R2 and R3 initially contain the values n, 0, and 0 respectively. What is the final value of R3 when control
reaches Z ?
Consider the following assembly language program for a hypothetical processor A, B, and C are 8− bit registers. The
meanings of various instructions are shown as comments.
MOV B, #0 ; B←0
MOV C, #8 ; C←8
Z: CMP C, #0 ; compare C with 0
JZ X ; jump to X if zero flag is set
SUB C, #1 ; C ← C −1
RRC A, #1 ; right rotate A through carry by one bit. Thus:
; If the initial values of A and the carry flag are a7 … a0 and
; c0 respectively, their values after the execution of this
; instruction will be c0 a7 … a1 and a0 respectively.
JC Y ; jump to Y if carry flag is set
JMP Z ; jump to Z
Y: ADD B, #1 ; B ← B+1
JMP Z ; jump to Z
X: ;
If the initial value of register A is A0 the value of register B after the program execution will be
A. the number of 0 bits in A0
B. the number of 1 bits in A0
C. A0
D. 8
Answer ☟
Consider the following assembly language program for a hypothetical processor A, B, and C are 8 bit registers. The meanings
of various instructions are shown as comments.
MOV B, #0     ; B ← 0
MOV C, #8     ; C ← 8
Z: CMP C, #0  ; compare C with 0
JZ X          ; jump to X if zero flag is set
SUB C, #1     ; C ← C − 1
RRC A, #1     ; right rotate A through carry by one bit
JC Y          ; jump to Y if carry flag is set
JMP Z         ; jump to Z
Y: ADD B, #1  ; B ← B + 1
JMP Z         ; jump to Z
X:
Answer ☟
Consider the following program segment for a hypothetical CPU having three user registers R1 , R2 and R3 .
Consider that the memory is byte addressable with size 32 bits, and the program has been loaded starting from memory location
1000 (decimal). If an interrupt occurs while the CPU has been halted after executing the HALT instruction, the return address (in
decimal) saved in the stack will be
A. 1007
B. 1020
C. 1024
D. 1028
Answer ☟
Consider the following program segment for a hypothetical CPU having three user registers R1 , R2 and R3 .
Answer ☟
1.15.8 Machine Instructions: GATE CSE 2006 | Question: 09, ISRO2009-35 top☝ ☛ https://gateoverflow.in/888
A CPU has 24-bit instructions. A program starts at address 300 (in decimal). Which one of the following is a legal program
counter (all values in decimal)?
A. 400
B. 500
C. 600
D. 700
Answer ☟
The computer has only two registers, and OP is either ADD or SUB. Consider the following basic block:
t1 = a+b
t2 = c+d
t3 = e − t2
t4 = t1 – t3
Assume that all operands are initially in memory. The final value of the computation should be in memory. What is the minimum
number of MOV instructions in the code generated for this basic block?
A. 2
B. 3
C. 5
D. 6
Answer ☟
Consider the following program segment. Here R1, R2 and R3 are the general purpose registers.
Answer ☟
Consider the following program segment. Here R1, R2 and R3 are the general purpose registers.
Assume that the content of memory location 3000 is 10 and the content of the register R3 is 2000 . The content of each of the
memory locations from 2000 to 2010 is 100. The program is loaded from the memory location 1000 . All the numbers are in
decimal.
Assume that the memory is word addressable. After the execution of this program, the content of memory location 2010 is:
A. 100
B. 101
C. 102
D. 110
Answer ☟
Consider the following program segment. Here R1, R2 and R3 are the general purpose registers.
Assume that the content of memory location 3000 is 10 and the content of the register R3 is 2000. The content of each of the
memory locations from 2000 to 2010 is 100. The program is loaded from the memory location 1000. All the numbers are in
Answer ☟
Which of the following must be true for the RFE (Return From Exception) instruction on a general purpose processor?
A. I only
B. II only
C. I and II only
D. I, II and III only
Answer ☟
1.15.14 Machine Instructions: GATE CSE 2015 Set 2 | Question: 42 top☝ ☛ https://gateoverflow.in/8215
Consider a processor with byte-addressable memory. Assume that all registers, including program counter (PC) and Program
Status Word (PSW), are size of two bytes. A stack in the main memory is implemented from memory location (0100)16 and it
grows upward. The stack pointer (SP) points to the top element of the stack. The current value of SP is (016E)16 . The CALL
instruction is of two words, the first word is the op-code and the second word is the starting address of the subroutine (one word = 2
bytes). The CALL instruction is implemented as follows:
The content of PC just before the fetch of a CALL instruction is (5FA0)16 . After execution of the CALL instruction, the value of
the stack pointer is:
A. (016A)16
B. (016C)16
C. (0170)16
D. (0172)16
Answer ☟
1.15.15 Machine Instructions: GATE CSE 2016 Set 2 | Question: 10 top☝ ☛ https://gateoverflow.in/39547
A processor has 40 distinct instructions and 24 general purpose registers. A 32-bit instruction word has an opcode, two
register operands and an immediate operand. The number of bits available for the immediate operand field is_______.
Answer ☟
Consider the following instruction sequence where registers R1, R2 and R3 are general purpose and MEMORY[X] denotes
the content at the memory location X.
Assume that the content of the memory location 5000 is 10, and the content of the register R3 is 3000 . The content of each of the
memory locations from 3000 to 3020 is 50. The instruction sequence starts from the memory location 1000 . All the numbers are in
decimal format. Assume that the memory is byte addressable.
After the execution of the program, the content of memory location 3010 is ____________
Answer ☟
If we use internal data forwarding to speed up the performance of a CPU (R1, R2 and R3 are registers and M[100] is a
memory reference), then the sequence of operations
R1 → M[100]
M[100] → R2
M[100] → R3
can be replaced by
A. R1 → R3
R2 → M[100]
B. M[100] → R2
R1 → R2
R1 → R3
C. R1 → M[100]
R2 → R3
D. R1 → R2
R1 → R3
R1 → M[100]
Answer ☟
Following table indicates the latencies of operations between the instruction producing the result and instruction using the
result.
What is the number of cycles needed to execute the above code segment assuming each instruction takes one cycle to execute?
A. 7
B. 10
C. 13
D. 14
Answer ☟
Assume that EA = (X)+ is the effective address equal to the contents of location X, with X incremented by one word length
after the effective address is calculated; EA = −(X) is the effective address equal to the contents of location X, with X decremented
by one word length before the effective address is calculated; EA = (X)− is the effective address equal to the contents of location X,
with X decremented by one word length after the effective address is calculated. The format of the instruction is (opcode, source,
destination), which means (destination ← source op destination). Using X as a stack pointer, which of the following instructions can
pop the top two elements from the stack, perform the addition operation, and push the result back to the stack?
A. ADD (X)−, (X)
B. ADD (X), (X)−
C. ADD −(X), (X)+
D. ADD −(X), (X)
Answer ☟
Z = (B × C + A)² + B × C
12 votes -- Yash wadhwani (831 points)
R1 ← R1 XOR R2
R2 ← R2 XOR R1
R1 ← R1 XOR R2
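A quick Python check of the XOR-exchange sequence (a minimal sketch; plain variables stand in for the registers):

# XOR swap: exchange two registers without a temporary.
r1, r2 = 0b1010, 0b0110       # example contents of R1 and R2

r1 = r1 ^ r2                  # R1 <- R1 XOR R2
r2 = r2 ^ r1                  # R2 <- R2 XOR R1 (R2 now holds the old R1)
r1 = r1 ^ r2                  # R1 <- R1 XOR R2 (R1 now holds the old R2)

assert (r1, r2) == (0b0110, 0b1010)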
Stuck-at fault: a stuck-at fault is a particular fault model used by fault simulators and automatic test pattern generation
(ATPG) tools to mimic a manufacturing defect within an integrated circuit. Individual signals and pins are assumed to be stuck
at logical '1', '0', or 'X'.
A line stuck at 1 means that no matter what input we provide to the line, its value is always 1.
Now, when we take A = 1, B = 1, C = 1, the required output, if no fault is present, should be 1.
But since line p is stuck at logic 1, the output of A NAND B stays 1, so the final circuit output becomes 0 (which is wrong), and the fault is detected.
Reference : https://www.tutorialspoint.com/digital_electronics/stuckat1_fault_in_logic_circuit.asp
References
SHR R1 (Lower bit is lost and upper bit becomes 0 and all other bits shift right by 1)
SHL R1 (Upper bit is lost and lower bit becomes 0 and all other bits shift left by 1)
These two operations change the value of R1 only if its lowest bit is 1. So, the given program checks the lowest bit of R1 in each
iteration, increments R3 only when that bit is 1, and the loop terminates when R1 becomes 0. Thus, at the end, R3 holds the count of the
number of bits set to 1 in R1.
A. For R1 = 5 (binary 101), the final values are R1 = 0 and R3 = 2.
B. R3 = the number of 1 bits in R1.
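For verification, here is a small Python simulation of the loop (an illustrative sketch; the helper name count_set_bits is made up):

def count_set_bits(r1):
    # Simulates the loop; returns the final (R1, R3).
    r3 = 0
    while r1 != 0:                 # X: CMP R1, 0 / JZ Z
        r2 = r1                    # MOV R2, R1
        r1 = (r1 >> 1) << 1        # SHR R1 then SHL R1: clears the lowest bit
        if r2 != r1:               # CMP R2, R1 / JZ Y
            r3 += 1                # INC R3 (the lowest bit was 1)
        r1 >>= 1                   # Y: SHR R1
    return r1, r3                  # Z: ...

assert count_set_bits(5) == (0, 2)     # part A: R1 = 0, R3 = 2
assert count_set_bits(7)[1] == 3       # 7 = 111 has three set bits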
27 votes -- Arjun Suresh (332k points)
Option (B). The code is counting the number of 1 bits in A0 . When a 1 is moved to carry, B is incremented.
30 votes -- Arjun Suresh (332k points)
Option (A) RRC A, #1: the 8-bit register is rotated through carry 8 times.
a7 a6 a5 a4 a3 a2 a1 a0
c0 a7 a6 a5 a4 a3 a2 a1 , now a0 is the new carry. So, after next rotation,
a0 c0 a7 a6 a5 a4 a3 a2
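The rotation and the count can be checked with a short Python sketch (a hypothetical model of the 8-bit RRC-through-carry loop; run_program is a made-up name):

def run_program(a0, carry=0):
    # B counts how many 1s of A0 pass through the carry flag in 8 rotations.
    a, b = a0 & 0xFF, 0
    for _ in range(8):                # C counts down from 8 to 0
        out = a & 1                   # bit rotated out into the carry flag
        a = (carry << 7) | (a >> 1)   # RRC A, #1: old carry enters from the left
        carry = out
        if carry:                     # JC Y -> ADD B, #1
            b += 1
    return b

assert run_program(0b10110010) == 4   # four 1 bits in A0

Each original bit of A0 exits exactly once as the carry over the 8 rotations, so B ends up holding the number of 1 bits in A0.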
Option is (D).
Word size is 32 bits (4 bytes). The interrupt occurs after execution of the HALT instruction, NOT during it, so the address of
the next instruction will be saved on to the stack, which is 1028.
(We have 5 instructions starting from address 1000, of sizes 2, 1, 1, 2 and 1 words, totaling 7 words = 7 × 4 = 28 bytes.)
1000 + 28 = 1028, which is the starting address of the NEXT instruction.
After the HALT instruction the CPU enters a halt state, and if an interrupt happens, the return address will be that of the instruction
after the HALT.
References :
B. 24 cycles
1.15.8 Machine Instructions: GATE CSE 2006 | Question: 09, ISRO2009-35 top☝ ☛ https://gateoverflow.in/888
Option (C). 24-bit instructions are 3 bytes each, so starting from 300 the PC always holds 300 + 3k, a multiple of 3. Among the options, only 600 is a multiple of 3.
40 votes -- anshu (2.7k points)
MOV a, R1
ADD b, R1
MOV c, R2
ADD d, R2
SUB e, R2
SUB R1 , R2
MOV R2 , m
The loop is executed 10 times and there are two memory references in the loop (each MOV
loads 1 word, so 1 memory reference for each MOV inside the loop). So the number
of memory references inside the loop is
2 (MOVs) × 10 (iterations) × 1 (word access per MOV) = 20 memory accesses.
One memory access is outside the loop for the first instruction
MOV R1, (3000)
So, totally 20 + 1 = 21
Correct Answer: D
The loop runs 10 times.
An interrupt is checked for after the execution of the current instruction and the contents of PC
(address of next instruction to be executed) is pushed on to stack.
Here, address of INC R3 = 1000 + ((2 + 1 + 1 + 1) × 32)/8 = 1020, and the
next instruction address = 1020 + 4 = 1024, which is pushed on to the stack.
Reference: http://www.ece.utep.edu/courses/web3376/Notes_files/ee3376-interrupts_stack.pdf
Correct Answer: C
References
RFE (Return From Exception) must be a privileged trap instruction, and an exception cannot be allowed to occur during its
execution. So, (D) is the correct option.
Reference: http://www.cs.rochester.edu/courses/252/spring2014/notes/08_exceptions
References
1.15.14 Machine Instructions: GATE CSE 2015 Set 2 | Question: 42 top☝ ☛ https://gateoverflow.in/8215
First we have to consider here memory is byte-addressable
Correct Answer: D
1.15.15 Machine Instructions: GATE CSE 2016 Set 2 | Question: 10 top☝ ☛ https://gateoverflow.in/39547
Instruction opcode size = ⌈log₂ 40⌉ = 6 bits. Each register operand needs ⌈log₂ 24⌉ = 5 bits. So, the number of bits available for the immediate operand = 32 − 6 − 2 × 5 = 16.
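The same arithmetic in a few lines of Python (variable names are illustrative):

import math

opcode_bits    = math.ceil(math.log2(40))   # 40 distinct instructions -> 6 bits
register_bits  = math.ceil(math.log2(24))   # 24 registers -> 5 bits per operand
immediate_bits = 32 - opcode_bits - 2 * register_bits

print(immediate_bits)                       # 16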
1.15.16 Machine Instructions: GATE CSE 2021 Set 1 | Question: 55 top☝ ☛ https://gateoverflow.in/357397
The given code is iterating 10 times and incrementing the contents of locations 3000 to 3000 + i by 10 − i, for i < 10.
Location 3010 is left untouched.
Data forwarding means that if the CPU writes to a memory location and subsequently reads from the same memory location, the
second instruction can fetch the value directly from the register used to do the write rather than waiting for the memory. This
increases performance.
Here, choices A, B and C don't really make sense, as the data is in R1 and it must be moved to R2, R3 and M[100]. So,
(D) is the answer.
53 votes -- Arjun Suresh (332k points)
In the given question there are 7 instructions, each of which takes 1 clock cycle to complete (pipelining may be used), and no
two instructions can be in the execution phase at the same time. So, at least 7 clock cycles will be taken.
Now, it is given that between two instructions a latency or delay must be inserted based on their operations. For example, the
first line of the table says that between two operations, where the first produces the result of an ALU operation and the second
uses that result, there should be a delay of 2 clock cycles.
5. DEC R1 (decrement R1)
This instruction depends on instruction 3.
As instruction I3 is an ALU operation and I5 is also an ALU operation, a delay of 2 clock cycles is needed between them, of
which 1 clock cycle is already there due to I4, so one more clock cycle delay is needed between I4 and I5.
It should be (A), as 998 ← 1000 + 998 (writing only memory locations for the sake of brevity).
Let's say SP is 1000 initially. After the EA of the source is calculated (which is 1000, as X is decremented after the EA), the
destination becomes 998, and that is where we want to store the result, as the stack grows toward lower addresses.
1.16.1 Memory Interfacing: GATE CSE 2016 Set 1 | Question: 09 top☝ ☛ https://gateoverflow.in/39632
A processor can support a maximum memory of 4GB, where the memory is word-addressable (a word consists of two bytes).
The size of address bus of the processor is at least _________bits.
Answer ☟
A 32-bit wide main memory unit with a capacity of 1 GB is built using 256 M × 4 − bit DRAM chips. The number of rows
of memory cells in the DRAM chip is 214 . The time taken to perform one refresh operation is 50 nanoseconds . The refresh period
i s 2 milliseconds. The percentage (rounded to the closest integer) of the time available for performing the memory read/write
operations in the main memory unit is_____.
Answer ☟
1.16.1 Memory Interfacing: GATE CSE 2016 Set 1 | Question: 09 top☝ ☛ https://gateoverflow.in/39632
Size of memory = number of words (addresses) × number of bytes per word
2³² B = number of words (addresses) × 2 B
Number of words (addresses) = 2³¹
Number of address lines = 31
One refresh operation takes 50 ns and the total number of rows = 2¹⁴, so refreshing the whole chip takes 2¹⁴ × 50 ns = 0.8192 ms in every 2 ms refresh period.
Percentage of time available for read/write operations = (2 − 0.8192)/2 × 100 ≈ 59%.
Find out the width of the control memory of a horizontal microprogrammed control unit, given the following specifications:
Answer ☟
A micro program control unit is required to generate a total of 25 control signals. Assume that during any micro instruction, at
most two control signals are active. Minimum number of bits required in the control word to generate the required control signals
will be:
A. 2
B. 2.5
C. 10
D. 12
Answer ☟
Arrange the following configuration for CPU in decreasing order of operating speeds:
Hard wired control, Vertical microprogramming, Horizontal microprogramming.
Answer ☟
Horizontal microprogramming:
Answer ☟
The microinstructions stored in the control memory of a processor have a width of 26 bits. Each microinstruction is divided
into three fields: a micro-operation field of 13 bits, a next address field (X), and a MUX select field (Y ). There are 8 status bits in
the input of the MUX.
Answer ☟
Answer ☟
A CPU has only three instructions I1, I2 and I3, which use the following signals in time steps T1 − T5:
I1: T1: Ain, Bout, Cin
T2: PCout, Bin
T3: Zout, Ain
T4: Bin, Cout
T5: End
I2: T1: Cin, Bout, Din
T2: Aout, Bin
T3: Zout, Ain
T4: Bin, Cout
T5: End
I3: T1: Din, Aout
T2: Ain, Bout
T3: Zout, Ain
T4: Dout, Ain
Answer ☟
A hardwired CPU uses 10 control signals S1 to S10 , in various time steps T1 to T5 , to implement 4 instructions I1 to I4 as
shown below:
     T1          T2           T3          T4      T5
I1   S1, S3, S5  S2, S4, S6   S1, S7      S10     S3, S8
I2   S1, S3, S5  S8, S9, S10  S5, S6      S7      S6, S10
I3   S1, S3, S5  S7, S8, S10  S2, S6, S9  S10     S1, S3
I4   S1, S3, S5  S2, S6, S7   S5, S10     S6, S9  S10
Which of the following pairs of expressions represent the circuit for generating control signals S5 and S10 respectively?
((Ij + Ik )Tn indicates that the control signal should be generated in time step Tn if the instruction being executed is Ij or lk )
A. S5 = T1 + I2 ⋅ T3 and
S10 = (I1 + I3 ) ⋅ T4 + (I2 + I4 ) ⋅ T5
B. S5 = T1 + (I2 + I4 ) ⋅ T3 and
S10 = (I1 + I3 ) ⋅ T4 + (I2 + I4 ) ⋅ T5
C. S5 = T1 + (I2 + I4 ) ⋅ T3 and
S10 = (I2 + I3 + I4 ) ⋅ T2 + (I1 + I3 ) ⋅ T4 + (I2 + I4 ) ⋅ T5
D. S5 = T1 + (I2 + I4 ) ⋅ T3 and
S10 = (I2 + I3 ) ⋅ T2 + I4 ⋅ T3 + (I1 + I3 ) ⋅ T4 + (I2 + I4 ) ⋅ T5
Answer ☟
An instruction set of a processor has 125 signals which can be divided into 5 groups of mutually exclusive signals as follows:
Group 1 : 20 signals, Group 2 : 70 signals, Group 3 : 2 signals, Group 4 : 10 signals, Group 5 : 23 signals.
How many bits of the control words can be saved by using vertical microprogramming over horizontal microprogramming?
A. 0
B. 103
C. 22
D. 55
Answer ☟
The data path shown in the figure computes the number of 1s in the 32 − bit input word corresponding to an unsigned even
integer stored in the shift register.
The unsigned counter, initially zero, is incremented if the most significant bit of the shift register is 1.
The counter width (k), the number of missing microinstructions (n), and the control word for microinstructions I1 , I2 , … In are,
respectively,
A. 32, 5, 010
B. 5, 32, 010
C. 5, 31, 011
D. 5, 31, 010
Answer ☟
Consider a CPU where all the instructions require 7 clock cycles to complete execution. There are 140 instructions in the
instruction set. It is found that 125 control signals are needed to be generated by the control unit. While designing the horizontal
microprogrammed control unit, single address field format is used for branch control logic. What is the minimum size of the control
word and control address register?
A. 125, 7
B. 125, 10
C. 135, 9
D. 135, 10
Answer ☟
Answers: Microprogramming
In the case of horizontal microprogramming, there is 1 bit for each control signal. It is given that there are 16 control signals.
The best sense I can make of this question is that you want to transmit up to 2 simultaneous signals out of a choice of 25,
and ask how many bits you need for that.
One solution would be to have 2 groups of 5 bits; each group can send one of 31 signals (or the absence of a signal). But it is not
optimal. The number of different states is 1 + 25 + C(25, 2) = 1 + 25 + 300 = 326.
You can transmit any of these states over 9 bits, but it is more complex to encode/decode; adding an extra bit would
probably cost less.
Hence, (C) is the correct option.
Reference: https://www.ocf.berkeley.edu/~wwu/cgi-bin/yabb/YaBB.cgi?board=riddles_cs;action=display;num=1354778770
References
Actually, the given question combines the concepts of horizontal μprogramming (also known as the decoded form of
control signals) and vertical μprogramming (also known as the encoded form of control signals).
The (a) part says:
This refers to the encoded (vertical) form, since at most one signal can be active at a time in vertical microprogramming; it
involves the use of an external decoder to select one control signal out of the given control signals.
No of bits required for vertical microprogramming given n number of control signals = ⌈(log2 n)⌉
Here, n = 3
So, no of bits required for part (a) = ⌈(log2 3)⌉ = 2
Now coming to the (b) part, it says:
we can have at most 6 microoperations of another kind active at a time. To accommodate that we need the decoded form of
control signals, i.e., horizontal signals.
So, the number of bits required for the (b) part
= number of control signals of the (b) kind = 6
Therefore, the overall bits required to accommodate both (a) and (b)
= 2 + 6 = 8 bits
Besides this, an address field, flags etc. are also present in a control word. That is why it is asked in the question:
Hardwired control involves only hardware, whereas microprogramming is a software approach. So, hardwired control
should be faster than both microprogramming approaches.
Between vertical and horizontal microprogramming, horizontal is faster because its control signals are not encoded, whereas
in vertical microprogramming signals are encoded to save memory. Horizontal microprogramming takes less time
because no decoding of signals is required. Therefore, the final order is:
Hardwired control > Horizontal microprogramming > Vertical microprogramming
Correct Answer: B
33 votes -- Shikhar Vashishth (3.1k points)
Option (D). All statements are true.
Reference: http://www.cs.virginia.edu/~cs333/notes/microprogramming.pdf
References
x + y + 13 = 26 → (1)
y = 3, since y is the number of bits needed to select among the 8 status inputs of the multiplexer → (2)
x is the number of bits in the next address field (the control memory address size)
x = 10, from (1) and (2)
Correct Answer: A
48 votes -- Digvijay (44.9k points)
Here the PC value is being stored in memory, which is done when either CALL/RETURN is involved or there is an interrupt, as
we will have to come back to execute the current instruction.
Option (C) is incorrect because a conditional branch does not require saving the PC contents.
We just have to see which option gives 1 whenever Ain is 1 and 0 otherwise.
Ain is 1 in T3 of I1, I2 and I3, and also during T1 of I1 and during T2 and T4 of I3. So, the answer will be
T1·I1 + T2·I3 + T4·I3 + T3·I1 + T3·I2 + T3·I3
Since the CPU has only 3 instructions, T3·I1 + T3·I2 + T3·I3 can be replaced with T3 (we don't need to check which
instruction it is, as Ain is activated in time step 3 of all the instructions).
So, T1·I1 + T2·I3 + T4·I3 + T3
The answer is Option (A).
D. is the correct option for this question.
If we look at the table, we need to find the time steps and instructions that use these control signals.
For example, S5 is used in T1 for all the instructions, i.e., irrespective of the instruction. Also, S5
is used by instructions I2 and I4 in time step T3, so that gives:
S5 = T1 + I2 ⋅ T3 + I4 ⋅ T3 = T1 + (I2 + I4 ) ⋅ T3
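As a cross-check, here is a short Python sketch that scans the signal table (as reconstructed above from the extraction) and lists, for each of S5 and S10, the (time step, instruction) pairs that assert it:

# Control signals per (instruction, time step), per the table as read above.
signals = {
    ("I1", 1): {1, 3, 5}, ("I1", 2): {2, 4, 6},  ("I1", 3): {1, 7},
    ("I1", 4): {10},      ("I1", 5): {3, 8},
    ("I2", 1): {1, 3, 5}, ("I2", 2): {8, 9, 10}, ("I2", 3): {5, 6},
    ("I2", 4): {7},       ("I2", 5): {6, 10},
    ("I3", 1): {1, 3, 5}, ("I3", 2): {7, 8, 10}, ("I3", 3): {2, 6, 9},
    ("I3", 4): {10},      ("I3", 5): {1, 3},
    ("I4", 1): {1, 3, 5}, ("I4", 2): {2, 6, 7},  ("I4", 3): {5, 10},
    ("I4", 4): {6, 9},    ("I4", 5): {10},
}

for s in (5, 10):
    terms = sorted((t, i) for (i, t), sigs in signals.items() if s in sigs)
    print("S%d:" % s, terms)
# S5 fires at T1 for all four instructions and at T3 for I2, I4.
# S10 fires at T2 for I2, I3; T3 for I4; T4 for I1, I3; T5 for I2, I4 -> option D.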
In horizontal microprogramming we need 1 bit for every control signal, therefore the total bits in
horizontal microprogramming = 20 + 70 + 2 + 10 + 23 = 125.
Now let's consider vertical microprogramming. In vertical microprogramming we use a decoder (n to 2ⁿ) whose output lines
equal the number of control signals; an input is given according to which control signal we have to select.
Now, in this question the 5 groups contain mutually exclusive signals, i.e., within a group only one signal can be active at a
time, so we can safely use a decoder per group.
group 1 = ⌈log₂ 20⌉ = 5 (number of input bits for the decoder, whose outputs
are the control signals of the group)
group 2 = ⌈log₂ 70⌉ = 7
group 3 = ⌈log₂ 2⌉ = 1
group 4 = ⌈log₂ 10⌉ = 4
group 5 = ⌈log₂ 23⌉ = 5
Total bits required in vertical microprogramming = 5 + 7 + 1 + 4 + 5 = 22
So, the number of control word bits saved = 125 − 22 = 103
Hence, (B) is the answer.
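The same counting in a few lines of Python (a minimal sketch of the calculation above):

import math

groups = [20, 70, 2, 10, 23]       # mutually exclusive control signal groups

horizontal = sum(groups)                                    # 1 bit per signal
vertical   = sum(math.ceil(math.log2(g)) for g in groups)   # decoder inputs

print(horizontal, vertical, horizontal - vertical)          # 125 22 103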
Answer: I1 to In are microinstructions, and reset_counter, shift_left and load_output are control signals that activate the
corresponding hardware (e.g., the shift register or the output load).
The counter width (k) is 5 bits: the shift register holds a 32-bit unsigned even integer, so at most 31 bits can be 1, and 31 fits in 5 bits.
The number of missing microinstructions (n) should be 31, as the shift register contains only an unsigned EVEN integer: the LSB will
always be 0, so there is no need to shift for the LSB.
The control word contains:
1 for active/enabled, 0 for inactive/disabled.
reset_counter must be 0 for all these microinstructions, since asserting it would reset the counter.
shift_left must be 1 to shift the given data in the shift register.
load_output need not be active for these microinstructions, as it is used only at the END; so it should be 0.
Its answer should be (D), because 140 instructions, each requiring 7 cycles, means up to 980 microinstructions, which needs a
10-bit control address register (⌈log₂ 980⌉ = 10).
Since the control word is horizontal, 125 control signal bits + 10 next address bits = 135 bits are required.
38 votes -- nagendra2016 (349 points)
An instruction pipeline consists of 4 stages – Fetch (F) , Decode field (D), Execute (E) and Result Write (W). The 5
instructions in a certain instruction sequence need these stages for the different number of clock cycles as shown by the table below
Instruction F D E W
1 1 2 1 1
2 1 2 2 1
3 2 1 3 2
4 1 3 2 1
5 1 2 1 2
Answer ☟
Comparing the time T1 taken for a single instruction on a pipelined CPU with time T2 taken on a non-pipelined but identical
CPU, we can say that
A. T1 ≤ T2
B. T1 ≥ T2
C. T1 < T2
D. T1 is T2 plus the time taken for one instruction fetch cycle
Answer ☟
An instruction pipeline has five stages where each stage take 2 nanoseconds and all instruction use all five stages. Branch
instructions are not overlapped. i.e., the instruction after the branch is not fetched till the branch instruction is completed. Under
ideal conditions,
A. Calculate the average instruction execution time assuming that 20% of all instructions executed are branch instruction. Ignore
the fact that some branch instructions may be conditional.
B. If a branch instruction is a conditional branch instruction, the branch need not be taken. If the branch is not taken,
the following instructions can be overlapped. When 80% of all branch instructions are conditional branch instructions, and
50% of the conditional branch instructions are such that the branch is taken, calculate the average instruction execution time.
Answer ☟
Consider a 5− stage pipeline - IF (Instruction Fetch), ID (Instruction Decode and register read), EX (Execute), MEM
(memory), and WB (Write Back). All (memory or register) reads take place in the second phase of a clock cycle and all writes
occur in the first phase. Consider the execution of the following instruction sequence:
Answer ☟
1.18.5 Pipelining: GATE CSE 2002 | Question: 2.6, ISRO2008-19 top☝ ☛ https://gateoverflow.in/836
Answer ☟
1.18.6 Pipelining: GATE CSE 2003 | Question: 10, ISRO-DEC2017-41 top☝ ☛ https://gateoverflow.in/901
For a pipelined CPU with a single ALU, consider the following situations
Answer ☟
A 4-stage pipeline has the stage delays as 150, 120, 160 and 140 nanoseconds , respectively. Registers that are used between
the stages have a delay of 5 nanoseconds each. Assuming constant clocking rate, the total time taken to process 1000 data items
on this pipeline will be:
A. 120.4 microseconds
B. 160.5 microseconds
C. 165.5 microseconds
D. 590.0 microseconds
Answer ☟
Answer ☟
A CPU has a five-stage pipeline and runs at 1 GHz frequency. Instruction fetch happens in the first stage of the pipeline. A
conditional branch instruction computes the target address and evaluates the condition in the third stage of the pipeline. The
processor stops fetching new instructions following a conditional branch until the branch outcome is known. A program executes
109 instructions out of which 20% are conditional branches. If each instruction takes one cycle to complete on average, the total
execution time of the program is:
A. 1.0 second
B. 1.2 seconds
C. 1.4 seconds
D. 1.6 seconds
Answer ☟
The IF, ID and WB stages take one clock cycle each to complete the operation. The number of clock cycles for the EX stage
depends on the instruction. The ADD and SUB instructions need 1 clock cycle and the MUL instruction needs 3 clock cycles in the
EX stage. Operand forwarding is used in the pipelined processor. What is the number of clock cycles taken to complete the
following sequence of instructions?
A. 7
B. 8
C. 10
D. 14
Answer ☟
A. I and II only
B. I and III only
C. II and III only
D. I, II and III
Answer ☟
Answer ☟
A. I1
B. I2
C. I3
D. I4
Answer ☟
Consider a 4 stage pipeline processor. The number of cycles needed by the four instructions I1, I2, I3, I4 in stages
S1, S2, S3, S4 is shown below:
S1 S2 S3 S4
I1 2 1 1 1
I2 1 3 2 2
I3 2 1 1 3
I4 1 2 2 2
A. 16
B. 23
C. 28
D. 30
Answer ☟
A 5− stage pipelined processor has Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF), Perform Operation
(PO) and Write Operand (WO) stages. The IF, ID, OF and WO stages take 1 clock cycle each for any instruction. The PO stage
takes 1 clock cycle for ADD and SUB instructions, 3 clock cycles for MUL instruction and 6 clock cycles for DIV instruction
respectively. Operand forwarding is used in the pipeline. What is the number of clock cycles needed to execute the following
sequence of instructions?
A. 13
B. 15
C. 17
D. 19
Consider an instruction pipeline with four stages (S1, S2, S3 and S4) each with combinational circuit only. The pipeline
registers are required between each stage and at the end of the last stage. Delays for the stages and for the pipeline registers are as
given in the figure.
What is the approximate speed up of the pipeline in steady state under ideal conditions when compared to the corresponding non-
pipeline implementation?
A. 4.0
B. 2.5
C. 1.1
D. 3.0
Answer ☟
1.18.17 Pipelining: GATE CSE 2012 | Question: 20, ISRO2016-23 top☝ ☛ https://gateoverflow.in/52
Answer ☟
Consider an instruction pipeline with five stages without any branch prediction:
Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execute Instruction (EI) and Write Operand (WO). The stage
delays for FI, DI, FO, EI and WO are 5 ns, 7 ns, 10 ns, 8 ns and 6 ns, respectively. There are intermediate storage buffers after
each stage and the delay of each buffer is 1 ns. A program consisting of 12 instructions I1, I2, I3,… , I12 is executed in this
pipelined processor. Instruction I4 is the only branch instruction and its branch target is I9. If the branch is taken during the
execution of this program, the time (in ns) needed to complete the program is
A. 132
B. 165
C. 176
D. 328
Answer ☟
Consider a 6-stage instruction pipeline, where all stages are perfectly balanced. Assume that there is no cycle-time overhead of
pipelining. When an application is executing on this 6-stage pipeline, the speedup achieved with respect to non-pipelined execution
if 25% of the instructions incur 2 pipeline stall cycles is ____________
Answer ☟
An instruction pipeline has five stages, namely, instruction fetch (IF), instruction decode and register fetch (ID/RF),
instruction execution (EX), memory access (MEM), and register writeback (WB) with stage latencies 1 ns, 2.2 ns, 2 ns, 1 ns, and
0.75 ns, respectively (ns stands for nanoseconds). To gain in terms of frequency, the designers have decided to split the ID/RF stage
into three stages (ID, RF1, RF2) each of latency 2.2/3 ns. Also, the EX stage is split into two stages (EX1, EX2) each of latency 1
ns. The new design has a total of eight pipeline stages. A program has 20% branch instructions which execute in the EX stage and
produce the next instruction pointer at the end of the EX stage in the old design and at the end of the EX2 stage in the new design.
The IF stage stalls after fetching a branch instruction until the next instruction pointer is computed. All instructions other than the
branch instruction have an average CPI of one in both the designs. The execution times of this program on the old and the new
design are P and Q nanoseconds, respectively. The value of P /Q is __________.
Answer ☟
Consider the following processors (ns stands for nanoseconds). Assume that the pipeline registers have zero latency.
P1: Four-stage pipeline with stage latencies 1 ns, 2 ns, 2 ns, 1 ns.
P2: Four-stage pipeline with stage latencies 1 ns, 1.5 ns, 1.5 ns, 1.5 ns.
P3: Five-stage pipeline with stage latencies 0.5 ns, 1 ns, 1 ns, 0.6 ns, 1 ns.
P4: Five-stage pipeline with stage latencies 0.5 ns, 0.5 ns, 1 ns, 1 ns, 1.1 ns.
Answer ☟
Consider a non-pipelined processor with a clock rate of 2.5 GHz and average cycles per instruction of four. The same
processor is upgraded to a pipelined processor with five stages; but due to the internal pipeline delay, the clock speed is reduced to 2
GHz. Assume that there are no stalls in the pipeline. The speedup achieved in this pipelined processor is_______________.
Answer ☟
Answer ☟
Consider the following reservation table for a pipeline having three stages S1 , S2 and S3 .
Time → 1   2   3   4   5
S1     X               X
S2         X       X
S3             X
Answer ☟
The stage delays in a 4-stage pipeline are 800, 500, 400 and 300 picoseconds. The first stage (with delay 800 picoseconds) is
replaced with a functionality equivalent design involving two stages with respective delays 600 and 350 picoseconds. The
throughput increase of the pipeline is ___________ percent.
Answer ☟
Consider a 3 GHz (gigahertz) processor with a three-stage pipeline and stage latencies τ1, τ2 and τ3 such that
τ1 = 3τ2/4 = 2τ3.
If the longest pipeline stage is split into two pipeline stages of equal latency,
the new frequency is __________ GHz, ignoring delays in the pipeline registers.
Answer ☟
Instruction execution in a processor is divided into 5 stages, Instruction Fetch (IF), Instruction Decode (ID), Operand
fetch (OF), Execute (EX), and Write Back (WB). These stages take 5, 4, 20, 10 and 3 nanoseconds (ns) respectively. A pipelined
implementation of the processor requires buffering between each pair of consecutive stages with a delay of 2 ns. Two pipelined
implementations of the processor are contemplated:
The speedup (correct to two decimal places) achieved by EP over NP in executing 20 independent instructions with no hazards is
_________ .
Answer ☟
The instruction pipeline of a RISC processor has the following stages: Instruction Fetch (IF) , Instruction Decode (ID),
Operand Fetch (OF) , Perform Operation (PO) and Writeback (WB) , The IF , ID , OF and WB stages take 1 clock cycle each
for every instruction. Consider a sequence of 100 instructions. In the PO stage, 40 instructions take 3 clock cycles each, 35
instructions take 2 clock cycles each, and the remaining 25 instructions take 1 clock cycle each. Assume that there are no data
hazards and no control hazards.
The number of clock cycles required for completion of execution of the sequence of instruction is _____.
Answer ☟
Consider a non-pipelined processor operating at 2.5 GHz. It takes 5 clock cycles to complete an instruction. You are going to
make a 5- stage pipeline out of this processor. Overheads associated with pipelining force you to operate the pipelined processor at 2
GHz. In a given program, assume that 30% are memory instructions, 60% are ALU instructions and the rest are branch instructions.
5% of the memory instructions cause stalls of 50 clock cycles each due to cache misses and 50% of the branch instructions cause
stalls of 2 cycles each. Assume that there are no stalls associated with the execution of ALU instructions. For this program, the
speedup achieved by the pipelined processor over the non-pipelined processor (round off to 2 decimal places) is_____________.
Answer ☟
A five-stage pipeline has stage delays of 150, 120, 150, 160 and 140 nanoseconds. The registers that are used between the
pipeline stages have a delay of 5 nanoseconds each.
The total time to execute 100 independent instructions on this pipeline, assuming there are no pipeline stalls, is _______
nanoseconds.
Answer ☟
Consider a pipeline processor with 4 stages S1 to S4. We want to execute the following loop:
for (i = 1; i < = 1000; i++)
{I1, I2, I3, I4}
where the time taken (in ns) by instructions I1 to I4 for stages S1 to S4 are given below:
S1 S2 S3 S4
I1 1 2 1 2
I2 2 1 2 1
I3 1 1 2 1
I4 2 1 2 1
Answer ☟
We have two designs D1 and D2 for a synchronous pipeline processor. D1 has 5 pipeline stages with execution times of
3 nsec, 2 nsec, 4 nsec, 2 nsec and 3 nsec, while design D2 has 8 pipeline stages each with 2 nsec execution time. How much time
can be saved using design D2 over design D1 for executing 100 instructions?
A. 214 nsec
B. 202 nsec
C. 86 nsec
D. −200 nsec
Answer ☟
A pipelined processor uses a 4− stage instruction pipeline with the following stages: Instruction fetch (IF), Instruction decode
(ID), Execute (EX) and Writeback (WB). The arithmetic operations as well as the load and store operations are carried out in the EX
stage. The sequence of instructions corresponding to the statement X = (S − R ∗ (P + Q))/T is given below. The values of
variables P, Q, R, S and T are available in the registers R0, R1, R2, R3 and R4 respectively, before the execution of the
instruction sequence.
The number of Read-After-Write (RAW) dependencies, Write-After-Read( WAR) dependencies, and Write-After-Write (WAW)
dependencies in the sequence of instructions are, respectively,
A. 2, 2, 4
B. 3, 2, 3
C. 4, 2, 2
D. 3, 3, 2
Answer ☟
A pipelined processor uses a 4-stage instruction pipeline with the following stages: Instruction fetch (IF), Instruction decode
(ID), Execute (EX) and Writeback (WB). The arithmetic operations as well as the load and store operations are carried out in the EX
stage. The sequence of instructions corresponding to the statement X = (S − R ∗ (P + Q))/T is given below. The values of
variables P, Q, R, S and T are available in the registers R0, R1, R2, R3 and R4 respectively, before the execution of the
instruction sequence.
The IF, ID and WB stages take 1 clock cycle each. The EX stage takes 1 clock cycle each for the ADD, SUB and STORE
operations, and 3 clock cycles each for MUL and DIV operations. Operand forwarding from the EX stage to the ID stage is used.
The number of clock cycles required to complete the sequence of instructions is
A. 10
B. 12
C. 14
D. 16
Answer ☟
A processor takes 12 cycles to complete an instruction I. The corresponding pipelined processor uses 6 stages with the
execution times of 3, 2, 5, 4, 6 and 2 cycles respectively. What is the asymptotic speedup assuming that a very large number of
instructions are to be executed?
A. 1.83
B. 2
C. 3
D. 6
Answer ☟
A non pipelined single cycle processor operating at 100 MHz is converted into a synchronous pipelined processor with five
stages requiring 2.5 nsec, 1.5 nsec, 2 nsec, 1.5 nsec and 2.5 nsec, respectively. The delay of the latches is 0.5 nsec. The
speedup of the pipeline processor for a large number of instructions is:
A. 4.5
B. 4.0
C. 3.33
D. 3.0
Answer ☟
Answers: Pipelining
Answer: 15 cycles are required.
Here we are comparing the execution time of only a single instruction. Pipelining in no way improves the execution time
of a single instruction (the time from its start to its end). It increases overall performance by splitting the execution into multiple
pipeline stages so that following instructions can use the finished stages of the previous instructions. But in doing so,
pipelining also introduces some overheads, as described in the link below, which might slow down individual instructions. So, (B) is the answer.
http://www.cs.wvu.edu/~jdm/classes/cs455/notes/tech/instrpipe.html
References
Each stage is 2ns. So, after 5 time units each of 2ns, the first instruction finishes (i.e., after 10ns), in every 2ns after that a
new instruction gets finished. This is assuming no branch instructions. Now, once the pipeline is full, we can assume that the
initial fill time doesn't matter our calculations and average execution time for each instruction is 2ns assuming no branch
instructions.
A. Now, we are given that 20% of instructions are branch (like JMP) and when a branch instruction is executed, no further
instruction enters the pipeline. So, we can assume every 5th instruction is a branch instruction. So, with this assumption,
total time to finish 5 instruction will be 5 ∗ 2 + 8 = 18 ns (as when a branch instruction enters the pipeline and before it
finishes, 4 pipeline stages will be empty totaling 4 ∗ 2 = 8 ns, as it is mentioned in question that the next instruction fetch
starts only when branch instruction completes). And this is the same for every set of 5 instructions, and hence the average
instruction execution time = 18/5 = 3.6 ns
B. This is just a complex statement. But what we need is to identify the % of branch instructions which cause a branch to be
taken as others will have no effect on the pipeline flow.
20% of instructions are branch instructions. 80% of branch instructions are conditional.
That means .2 ∗ .8 = 16% of instructions are conditional branch instructions and it is given that 50% of those result in a
branch being taken.
So, 8% of instructions are conditional branches being taken and we also have 20% of 20% = 4% of unconditional branch
instructions which are always taken.
So, percentage of instructions where a branch is taken is 8 + 4 = 12% instead of 20% in (A) part.
So, in 100 instructions there will be 12 branch instructions. We can do a different calculation here as compared to (A) as 12 is
not a divisor of 100. Each branch instruction causes a pipeline delay of 4 ∗ 2 = 8 ns. So, 12 instructions will cause a delay of
12 ∗ 8 = 96 ns. For 100 instructions, we need 100 ∗ 2 = 200 ns without any delay and with delay we require 200 + 96 = 296
ns for 100 instructions.
(We can also use this method for part (A) which will give 100 ∗ 2 + 20 ∗ 8 = 360 ns for 100 instructions)
If an instruction branches, it takes 2 ns × 5 = 10 ns, because if the branch is taken, the instruction after the branch instruction is
not fetched until the entire current branch instruction is completed; the branch effectively occupies the pipeline through all stages alone.
If an instruction is a non-branch, or the branch is not taken, it takes 2 ns to complete.
a) average time taken = 0.8 × 2 ns + 0.2 × 10 ns = 3.6 ns
b) average time taken = 0.88 × 2 ns + 0.12 × 10 ns = 2.96 ns
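The same averages in a few lines of Python (a minimal sketch of the calculation above; names are illustrative):

stage_time = 2.0                    # ns per stage, 5 stages
branch_penalty = 4 * stage_time     # 4 empty stage slots behind a taken branch

# Part (a): all 20% branch instructions flush the pipeline.
avg_a = 0.8 * stage_time + 0.2 * (stage_time + branch_penalty)

# Part (b): only taken branches flush it:
# 4% unconditional + 20% * 80% * 50% = 8% taken conditional = 12%.
taken = 0.2 * 0.2 + 0.2 * 0.8 * 0.5
avg_b = (1 - taken) * stage_time + taken * (stage_time + branch_penalty)

print(round(avg_a, 2), round(avg_b, 2))   # 3.6 2.96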
RAW dependencies:
1. I1 ← I2
2. I1 ← I3
3. I1 ← I4
4. I2 ← I4
WAR dependencies:
1. I2 ← I1
2. I4 ← I1
3. I4 ← I2
      t1  t2  t3  t4  t5  t6  t7  t8
I1    IF  ID  EX  MEM WB
I2        IF  ID  EX  MEM WB
I3            IF  ID  EX  MEM WB
I4                IF  ID  EX  MEM WB
So, there are RAW hazards for I2 and I3 with I1 and for I4 with I2 . (Not all dependencies cause a hazard. Only if a dependency
causes a stall in the given pipeline structure, we get a hazard) These hazards cause the following stalls in the pipeline:
      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11
I1    IF  ID  EX  MEM WB
I2        IF  −   −   ID  EX  MEM WB
I3            −   −   IF  ID  EX  MEM WB
I4                −   −   IF  −   ID  EX  MEM WB
t1 t2 t3 t4 t5 t6 t7 t8
I1 IF ID EX MEM WB
1
I2 IF ID EX MEM WB
1
I3 IF ID EX MEM WB
1
3
I4 IF ID EX MEM WB
3
1.18.5 Pipelining: GATE CSE 2002 | Question: 2.6, ISRO2008-19 top☝ ☛ https://gateoverflow.in/836
Answer is D.
1.18.6 Pipelining: GATE CSE 2003 | Question: 10, ISRO-DEC2017-41 top☝ ☛ https://gateoverflow.in/901
1. Data hazard
2. Control hazard
3. Structural hazard as only one ALU is there
So, (D).
http://www.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/hazards.html
References
Pipelining requires all stages to be synchronized, meaning we have to make the delay of all stages equal to the maximum
pipeline stage delay, which here is 160 ns. We also have to add the intermediate register delay of 5 ns, which makes the
clock period 165 ns.
Total time for 1000 data items = 660 + 999 × 165 ns = 165.495 microseconds
Correct Answer: C
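The fill-and-drain arithmetic in a few lines of Python (a minimal sketch; names are illustrative):

stage_delays   = [150, 120, 160, 140]   # ns
register_delay = 5                      # ns between stages
n = 1000                                # data items

clock = max(stage_delays) + register_delay              # 165 ns
total_ns = len(stage_delays) * clock + (n - 1) * clock  # fill + one per item after
print(total_ns / 1000, "microseconds")                  # 165.495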
57 votes -- Arjun Suresh (332k points)
Answer is option A.
Without data forwarding:
13 clock - WB and RD state non overlapping.
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
IF RD EX MA WB
IF RD EX MA WB
IF RD EX MA WB
The WB stage produces the output during the rising edge of the clock and the RD stage fetches it during the falling edge.
In Question it is mentioned
This means that for writing operands back to memory, register read at RD state is used (no operand forward for STORE
instructions).
Note
As in any question in any subject, unless otherwise stated, we consider the best case. So, do overlap WB and RD unless
stated otherwise; but this split-phase assumption applies only to WB/RD.
[Mostly the question states explicitly that there is operand forwarding from stage A to stage B, e.g., https://gateoverflow.in/8218/gate2015-2_44]
Split-phase access can be used even when there is no operand forwarding, because the two are not related.
References
http://web.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/forward.html
Similar Questions
https://gateoverflow.in/8218/gate2015-2_44
https://gateoverflow.in/2207/gate2010-33
https://gateoverflow.in/34735/pipelining-without-operand-forwarding
Discussions
https://gateoverflow.in/102565/operand-forwarding-in-pipeline
https://gateoverflow.in/113244/doubts-in-pipelining
References
The delay slots caused by a branch instruction number 2, since the IF of the next instruction begins only after the 3rd
stage of the current instruction (during its 4th stage), whereas ideally it would begin during the 2nd stage.
Since the clock speed is 1 GHz and each instruction on average takes 1 cycle,
the total execution time in seconds will be
10⁹/10⁹ + (4 × 10⁸)/10⁹
= 1 + 0.4 = 1.4
Correct Answer: C
62 votes -- Arjun Suresh (332k points)
Answer: option B.
Considering EX to EX data forwarding.
      t1  t2  t3  t4  t5  t6  t7  t8
ADD   IF  ID  EX  WB
MUL       IF  ID  EX  EX  EX  WB
SUB           IF  ID  −   −   EX  WB
EX to EX data Forwarding:
(B) I and III
I is false: bypassing can't handle all RAW hazards. Consider an instruction that depends on the result of a LOAD instruction:
LOAD produces the register value only at the Memory Access (MA) stage, so the data is not available to the Execute stage in time.
III is false: bypassing cannot completely eliminate control hazard penalties, though it can reduce them.
81 votes -- Prateeksha Keshari (1.7k points)
Answer is A. In order to avoid the pipeline delay due to conditional branch instruction, a suitable instruction is placed
below the conditional branch instruction such that the instruction will be executed irrespective of whether branch is taken or not
and won't affect the program behaviour.
60 votes -- Arjun Suresh (332k points)
What is Delayed Branching ?
One way to maximize the use of the pipeline, is to find an instruction that can be safely executed whether the branch is taken or
not, and execute that instruction. So, when a branch instruction is encountered, the hardware puts the instruction following the
branch into the pipe and begins executing it, just as in predict-not-taken. However, unlike in predict-not-taken, we do not need to
worry about whether the branch is taken or not, we do not need to clear the pipe because no matter whether the branch is taken or
not, we know the instruction is safe to execute.
It updates the memory location on which the conditional branch instruction depends (via R1).
If moved after the branch, program execution would no longer be correct when control reaches I4
⇒ it cannot be moved
Here the bounds of the loop are constants, so the compiler will do loop unrolling (if the compiler doesn't, the prefetcher
will) to increase the instruction level parallelism. After loop unrolling, 23 cycles are required for execution. Therefore, the correct
answer is (B).
PS: We assume the buffers between the pipeline stages can store multiple results in the form of a queue.
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20 C21 C22 C23
I1 S1 S1 S2 S3 S4
I2 S1 S2 S2 S2 S3 S3 S4 S4
I3 S1 S1 − S2 − S3 − S4 S4 S4
I4 S1 − S2 S2 S3 S3 − − S4 S4
I1 S1 S1 − S2 − S3 − − − S4
I2 S1 − S2 S2 S2 S3 S3 − S4 S4
I3 S1 S1 − − S2 − S3 − − S4 S4 S4
I4 S1 − − S2 S2 S3 S3 − − − S4 S4
      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15
MUL   IF  ID  OF  PO  PO  PO  WO
DIV       IF  ID  OF  −   −   PO  PO  PO  PO  PO  PO  WO
ADD           IF  ID  −   −   OF  −   −   −   −   −   PO  WO
SUB               IF  −   −   ID  −   −   −   −   −   OF  PO  WO
Operand forwarding allows an output to be passed directly to a following instruction. Here the output of the PO stage of the DIV instruction
is forwarded to the PO stage of the ADD instruction, and similarly between the ADD and SUB instructions. Hence, 15 cycles
are required.
http://www.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/forward.html
Correct Answer: B
References
Answer is (B) 2.5.
In the pipelined system, the clock period is determined by the maximum delay at any stage, i.e., 11 ns, plus the delay of a pipeline
register, i.e., 1 ns, giving 12 ns. In the non-pipelined system,
delay = 5 ns + 6 ns + 11 ns + 8 ns = 30 ns.
∴ The speedup is 30/12 = 2.5.
1.18.17 Pipelining: GATE CSE 2012 | Question: 20, ISRO2016-23 top☝ ☛ https://gateoverflow.in/52
Register renaming is done to eliminate WAR (Write After Read) and WAW (Write After Write) dependencies between
instructions, which could otherwise cause pipeline stalls. Hence, (C) is the answer.
Example:
I1: Read A to B
I2: Write C to A
Here, there is a WAR dependency and pipeline would need stalls. In order to avoid it register renaming is done and
Write C to A
will be
Write C to A'
WAR dependency is actually called anti-dependency: there is no real data dependency except the fact that both instructions use the same
location. Register renaming can avoid this, and similarly for WAW.
http://people.ee.duke.edu/~sorin/ece252/lectures/4.2-tomasulo.pdf
References
After pipelining we have to adjust the stage delays such that no stage will be waiting for another to ensure smooth
pipelining (continuous flow). Since we can not easily decrease the stage delay, we can increase all the stage delays to the
maximum delay possible. So, here maximum delay is 10 ns. Buffer delay given is 1ns. So, each stage takes 11 ns in total.
FI of I9 can start only after the EI of I4. So, the total execution time will be
15 × 11 = 165
Correct Answer: B
122 votes -- gatecse (63.3k points)
Time without pipeline = 6 stages = 6 cycles per instruction
Time with pipeline = 1 + 0.25 × 2 = 1.5 cycles per instruction
Speedup = 6/1.5 = 4
64 votes -- aravind90 (389 points)
Five stages:
(IF), instruction decode and register fetch (ID/RF),
instruction execution (EX),
memory access (MEM), and register writeback (WB)
P old design:
with stage latencies 1 ns, 2.2 ns, 2 ns, 1 ns, and 0.75 ns
MAX( 1 ns, 2.2 ns, 2 ns, 1 ns, and 0.75 ns) = 2.2nsec
Average instruction execution time is
Tavg = (1 + stall fraction × branch penalty) × cycle time
= (1 + 0.20 × 2) × 2.2 {the branch penalty is 2 because the next instruction
pointer is produced at the end of the EX stage, the third stage, in the old design}
= 3.08 nsec
Q: new design:
The designers decided to split the ID/RF stage into three stages (ID, RF1, RF2), each of latency 2.2/3 ≈ 0.73 ns. Also, the EX
stage is split into two stages (EX1, EX2), each of latency 1 ns.
The new design has a total of eight pipeline stages.
Stage latencies in the new design = {1 ns, 0.73 ns, 0.73 ns, 0.73 ns, 1 ns, 1 ns, 1 ns, 0.75 ns},
so the new cycle time is the maximum of these = 1 ns. The next instruction pointer is now produced at the end of EX2, the sixth
stage, so the branch penalty is 5.
Q = (1 + 0.20 × 5) × 1 = 2 nsec
Final result:
P/Q = 3.08/2 = 1.54
frequency = 1 / max(stage latency)
For P1, it is 1/2 = 0.5 GHz
For P2, it is 1/1.5 = 0.67 GHz
For P3, it is 1/1 = 1 GHz
For P4, it is 1/1.1 = 0.91 GHz
Correct Answer: C
41 votes -- Arpit Dhuriya (2.9k points)
Answer = 3.2.
To compute cycle time, we know that a 2.5 GHz processor completes 2.5 billion cycles in a second. So, an
instruction which on average takes 4 cycles to complete will take 4/2.5 = 1.6 nanoseconds.
On a perfect pipeline (i.e., one which has no stalls) CPI = 1, as an instruction then takes just one cycle to
complete.
So,
Speedup = Old execution time of an instruction / New execution time of an instruction
= (CPI_old / CF_old) / (CPI_new / CF_new)
= (4 / 2.5 GHz) / (1 / 2 GHz)
= 3.2
With pipelining,
each instruction needs (old execution time × old frequency) / new frequency (without pipelining)
= (1.6 × 2.5)/2 = 2 ns.
There are 5 stages, and when there is no pipeline stall this can give a speedup of up to 5
(which happens when all stages take the same number of cycles).
In our case this time would be 2/5 = 0.4 ns.
But the clock frequency being 2 GHz, the clock cycle is 1/(2 GHz) = 0.5 ns,
and a pipeline stage cannot be faster than this.
So, average instruction execution time after pipelining = max(0.4, 0.5) = 0.5 ns.
So, speedup compared to the non-pipelined version = 1.6/0.5 = 3.2
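The same speedup in a few lines of Python (a minimal sketch of the calculation above):

cpi_old, f_old = 4, 2.5e9    # non-pipelined: 4 cycles per instruction at 2.5 GHz
cpi_new, f_new = 1, 2.0e9    # pipelined, no stalls: CPI = 1 at 2 GHz

t_old = cpi_old / f_old      # 1.6 ns per instruction
t_new = cpi_new / f_new      # 0.5 ns per instruction
print(round(t_old / t_new, 2))   # 3.2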
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15
I1 IF OF PO PO PO WB
I2 IF OF − − PO PO PO PO PO WB
I3 IF − − − − − − − OF PO WB
I4 − − − − − − − IF − OF PO WB
It is mentioned in the question that operand forwarding takes place from the PO stage to the OF stage, not to the PO stage. So, 15 clock
cycles.
But since operand forwarding is from PO to OF, we can make the PO stage produce the output during the rising edge of the
clock and the OF stage fetch it during the falling edge. The final PO stage and OF stage can then share one
clock cycle, making the total number of cycles = 13. And 13 is the answer given in the GATE key.
Reference: http://www.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/forward.html
References
Reference: Page 24 http://www2.cs.siu.edu/~cs401/Textbook/ch3.pdf
S1 is needed at time 1 and 5, so its forbidden latency is 5 − 1 = 4.
S2 is needed at time 2 and 4, so its forbidden latency is 4 − 2 = 2.
So, forbidden latency = (2, 4, 0) ( 0 by default is forbidden)
Allowed latency = (1, 3, 5) (any value more than 5 also).
Collision vector (4, 3, 2, 1, 0) = 10101 which is the initial state as well.
From initial state we can have a transition after “1" or “3" cycles and we reach new states with collision vectors
(10101 >> 1 + 10101 = 11111) and (10101 >> 3 + 10101 = 10111) respectively.
These 2 becomes states 2 and 3 respectively. For “5" cycles we come back to state 1 itself.
From state 2 (11111), the new collision vector is 11111. We can have a transition only when we see the first 0 from the right.
So, here it happens on 5th cycle only which goes to the initial state. (Any transition after 5 or more cycles goes to initial state as
we have 5 time slices).
From state 3 (10111), the new collision vector is 10111. So, we can have a transition on 3, which will give
(10111 >> 3 + 10101 = 10111) third state itself. For 5, we get the initial state. Thus all the transitions are complete.
State\Time 1 3 5
1(10101) 2 3 1
2(11111) - - 1
3(10111) - 3 1
So, minimum length cycle is of length 3 either from 3-3 or from 1-3,3-1 .
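The forbidden latencies and the initial collision vector can be derived mechanically; here is a small Python sketch using the stage-usage times from the reservation table in the question:

# Stage usage times from the reservation table in the question.
usage = {"S1": [1, 5], "S2": [2, 4], "S3": [3]}

forbidden = {b - a for times in usage.values()
             for a in times for b in times if b > a} | {0}
allowed = [l for l in range(1, 6) if l not in forbidden]
print(sorted(forbidden), allowed)          # [0, 2, 4] [1, 3, 5]

# Collision vector over latencies (4, 3, 2, 1, 0):
cv = sum(1 << l for l in forbidden if l <= 4)
print(format(cv, "05b"))                   # 10101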
Not asked in the question, still.
Pipeline throughput is the number of instructions initiated per unit time.
So, with MAL = 3, we have 2 initiations in 1 + 3 = 4 units of time (one at time unit 1 and another at time unit 4). So,
throughput = 2/4 = 0.5.
Pipeline efficiency is the % of time every stage of the pipeline is being used.
For the given question we can extend the reservation table and taking MAL = 3, we can initiate new tasks after every 3 cycles.
So, we can consider the time interval from 4-6 in the below figure. (The red color shows a stage not being used- affects
efficiency).
Time → 1   2   3   4   5   6   7   8   9   10  11
S1     X           Y   X   ×   Z   Y   ×   A   Z
S2         X       X   Y   ×   Y   Z       Z   A
S3             X   ×   ×   Y           Z
Here (during cycles 4 − 6), stage 1 is used 2/3 of the time, stage 2 is used 2/3 and stage 3 is used 1/3.
So, total stage utilization = (2 + 2 + 1)/9 = 5/9 and efficiency = 500/9 % = 55.55%.
https://gateoverflow.in/77125/advanced-computer-architecture-collision-vector-pipeline
References
In a pipeline, ideally CPI = 1,
so in 1 cycle 1 instruction gets completed.
Throughput is the number of instructions completed per unit time.
In pipeline 1, cycle time = max stage delay = 800 ps.
In 800 ps we expect to finish 1 instruction,
so in 1 ps, 1/800 instructions are expected to complete, which is also the throughput of pipeline 1.
Similarly, for pipeline 2, throughput = 1/600.
Throughput increase in percentage
= (new − old)/old × 100
= ((1/600 − 1/800) / (1/800)) × 100
= (200/600) × 100
= 33.33%
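The same percentage in a few lines of Python (a minimal sketch of the calculation above):

old_clock = max(800, 500, 400, 300)        # ps, original 4-stage pipeline
new_clock = max(600, 350, 500, 400, 300)   # ps, after splitting the first stage

old_tp, new_tp = 1 / old_clock, 1 / new_clock
print(round((new_tp - old_tp) / old_tp * 100, 2))   # 33.33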
116 votes -- Anurag Semwal (6.7k points)
Maximum throughput of a pipeline, i.e., in the best case without any stalls, equals the clock frequency of the pipeline.
In the first case, clock cycle time = max stage delay = max(800, 500, 400, 300) = 800.
So clock frequency = 1/800 (ignore the units, as we only have to calculate a percentage).
In the second case, clock cycle time = max(600, 350, 500, 400, 300) = 600.
So clock frequency = 1/600.
Percentage increase in throughput of pipeline = percentage increase in clock frequency
= ((1/600 − 1/800) / (1/800)) × 100 = 33.33%
Answer is 4 GHz.
Given a 3-stage pipeline on a 3 GHz processor, with τ1 = 3τ2/4 = 2τ3.
Put τ1 = 6x; we get τ2 = 8x, τ3 = 3x.
Now the largest stage time is 8x, so the frequency is 1/(8x):
1/(8x) = 3 GHz ⇒ 1/x = 24 GHz … (1)
Splitting the longest stage (8x) into two stages of 4x each leaves 6x as the largest stage time, so the new
frequency = 1/(6x) = 24/6 = 4 GHz, using (1).
CASE 1 (NP) and CASE 2 (EP):
Speedup = execution time in case 1 / execution time in case 2 = 528/350 = 1.508
Total instructions = 100
Number of stages = 5
In the ideal case, total cycles = 100 + 5 − 1 = 104 cycles.
Now, for the PO stage, 40 instructions take 3 cycles, 35 take 2 cycles and the remaining 25 take 1 cycle;
all other stages are perfectly fine and work with CPI (clock cycles per instruction) = 1.
PO stage:
40 instructions take 3 cycles, i.e., these instructions suffer 2 stall cycles each;
35 instructions take 2 cycles, i.e., these instructions suffer 1 stall cycle each;
25 instructions take 1 cycle, i.e., these instructions suffer 0 stall cycles.
So, total cycles = 104 + 40 × 2 + 35 × 1 = 219.
Time taken by the non-pipelined processor to execute n instructions = 5n/2.5 = 2n ns.
For the pipelined processor, stall cycles per instruction = 0.30 × 0.05 × 50 + 0.10 × 0.50 × 2 = 0.85,
so n instructions need (1 + 0.85)n = 1.85n cycles
= 1.85n/2 ns = 0.925n ns.
Speedup = 2n/0.925n = 2.162 ≈ 2.16.
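The stall accounting in a few lines of Python (a minimal sketch of the calculation above):

# Stall cycles per instruction from the instruction mix.
stalls = 0.30 * 0.05 * 50 + 0.10 * 0.50 * 2   # memory misses + branches = 0.85

t_np = 5 / 2.5                # ns/instruction: 5 cycles at 2.5 GHz
t_p  = (1 + stalls) / 2       # ns/instruction: (1 + 0.85) cycles at 2 GHz
print(round(t_np / t_p, 2))   # 2.16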
19 votes -- Debasish Das (1.5k points)
For the given pipelined system:
Option (c).
47 votes -- Suvojit Mondal (291 points)
(B) is the correct option for this question.
Time = (K + n − 1) × cycle time
D1 = (5 + 100 − 1) × 4 = 416 nsec
D2 = (8 + 100 − 1) × 2 = 214 nsec
Time saved = 416 − 214 = 202 nsec
(C) is the correct option for this question:
RAW
1. I1 - I2 (R5)
2. I2 - I3 (R6)
3. I3 - I4 (R5)
4. I4 - I5 (R6)
WAR
1. I2 - I3 (R5)
2. I3 - I4 (R6)
WAW
1. I1 - I3 (R5)
2. I2 - I4 (R6)
MUL IF ID EX EX EX WB
1 1
SUB IF − − ID EX WB
1 1
DIV IF ID EX EX EX WB
1 1
STORE IF − − ID EX WB
1
− Stalls
1 Operand forwarding from EX-ID using split phase
So, answer is 12.
Correct Answer: B
For the non-pipelined processor we have n instructions, each taking 12 cycles, so 12n cycles in total.
For the pipelined processor each stage is stretched to 6 cycles (the maximum stage time), so the time to complete
n instructions is 6 × 6 + (n − 1) × 6 cycles.
limₙ→∞ 12n / (36 + (n − 1) × 6) = 12/6 = 2.
Correct Answer: B
62 votes -- Arpit Dhuriya (2.9k points)
Here we have to keep in mind the phrase:
This signifies that instruction in a non pipelined scenario is incurring only a single cycle to execute entire instruction. Hence no
concept of stage comes in case of single cycle non pipelined system.
The cycle time can be calculated from the clock frequency given for the non-pipelined system, 100 MHz.
Therefore, clock cycle time in the non-pipelined system = 1/(100 × 10⁶) sec = 10 ns.
Now, cycle time in the pipelined system = max(stage delay) + latch delay = 2.5 + 0.5 = 3 ns.
Therefore,
Speedup = (CPI_non-pipeline × cycle time_non-pipeline) / (CPI_pipeline × cycle time_pipeline)
= (1 × 10)/(1 × 3) = 3.33
[Since in the non-pipelined case we have a single cycle processor, CPI_non-pipeline = 1, and CPI_pipeline = 1 by default.]
Hence, (C) is the correct answer.
1.19.1 Runtime Environments: GATE CSE 2001 | Question: 1.10, UGCNET-Dec2012-III: 36 top☝ ☛ https://gateoverflow.in/703
Suppose a processor does not have any stack pointer registers, which of the following statements is true?
Answer ☟
1.19.2 Runtime Environments: GATE CSE 2008 | Question: 37, ISRO2009-38 top☝ ☛ https://gateoverflow.in/448
The use of multiple register windows with overlap causes a reduction in the number of memory accesses for:
A. I only
II
Answer ☟
1.19.1 Runtime Environments: GATE CSE 2001 | Question: 1.10, UGCNET-Dec2012-III: 36 top☝ ☛ https://gateoverflow.in/703
A stack pointer is a small register that stores the address of the top of the stack (the last program request).
A nested function (or nested procedure/subroutine) is a function defined within another, enclosing function. If there is no stack
pointer register, no nested subroutine calls are possible; hence option B is correct.
1.19.2 Runtime Environments: GATE CSE 2008 | Question: 37, ISRO2009-38 top☝ ☛ https://gateoverflow.in/448
I. Function locals and parameters:
this is true because overlapped register windows eliminate the need for memory accesses; registers are used instead.
Consider two processors P1 and P2 executing the same instruction set. Assume that under identical conditions, for the same
input, a program running on P2 takes 25% less time but incurs 20% more CPI (clock cycles per instruction) as compared to the
program running on P1. If the clock frequency of P1 is 1 GHz, then the clock frequency of P2 (in GHz) is ______.
Answer ☟
In an enhancement of a design of a CPU, the speed of a floating point unit has been increased by 20% and the speed of a fixed
point unit has been increased by 10%. What is the overall speedup achieved if the ratio of the number of floating point operations to
the number of fixed point operations is 2 : 3 and the floating point operation used to take twice the time taken by the fixed point
operation in the original design?
A. 1.155
B. 1.185
C. 1.255
D. 1.285
Answer ☟
Answers: Speedup
CPU Time (T) = No. of Instructions (Ic) × No. of Cycles Per Instruction (CPI) × Cycle Time (t)
or, writing the cycle time as the inverse of the clock frequency F:
→ T = Ic × CPI / F
→ Ic = (T × F) / CPI
If P1 takes T1 time,
→ T2 = 0.75 × T1 → T2/T1 = 0.75
→ C2 = 1.2 × C1 → C2/C1 = 1.2
Since both processors execute the same number of instructions,
→ (f1 × T1)/C1 = (f2 × T2)/C2, and f1 = 1 GHz
→ f2 = (C2/C1) × (T1/T2) × f1 = (1.2 × 1 GHz) / 0.75 = 1.6 GHz
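A one-line check of the derivation (a minimal sketch; the two ratios are the ones established above):

f1 = 1.0                 # GHz
f2 = (1.2 / 0.75) * f1   # (C2/C1) × (T1/T2) × f1
print(round(f2, 2))      # 1.6 GHz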
Let a fixed point operation take time x, so a floating point operation takes 2x; out of every 5 operations, 2 are floating point and 3 are fixed point.
Original time taken = (3x + 2 × 2x) / 5 = 7x / 5
New time taken = (3x/1.1 + 4x/1.2) / 5 = 8x / (1.32 × 5)
So, Speedup = (7x/5) / (8x/(1.32 × 5)) = (7 × 1.32) / 8 = 1.155
Correct Answer: A
56 votes -- gatecse (63.3k points)
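The weighted-time speedup can be verified numerically (a minimal sketch; x cancels, so any positive value works):

x = 1.0                                      # time of one fixed point operation
t_old = (3 * x + 2 * 2 * x) / 5              # 7x / 5
t_new = (3 * x / 1.1 + 2 * 2 * x / 1.2) / 5  # 8x / (1.32 × 5)
print(round(t_old / t_new, 3))               # 1.155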
1.21.1 Virtual Memory: GATE CSE 1991 | Question: 03,iii top☝ ☛ https://gateoverflow.in/517
The total size of address space in a virtual memory system is limited by:
A. the length of MAR
B. the available secondary storage
C. the available main memory
D. all of the above
E. none of the above
Answer ☟
Consider a system with a two-level paging scheme in which a regular memory access takes 150 nanoseconds, and servicing a page fault takes 8 milliseconds. An average instruction takes 100 nanoseconds of CPU time, and two memory accesses. The TLB hit ratio is 90%, and the page fault rate is one in every 10,000 instructions. What is the effective average instruction execution time?
A. 645 nanoseconds
B. 1050 nanoseconds
C. 1215 nanoseconds
D. 1230 nanoseconds
Answer ☟
In an instruction execution pipeline, the earliest that the data TLB (Translation Lookaside Buffer) can be accessed is:
A. before effective address calculation has started
B. during effective address calculation
C. after effective address calculation has completed
D. after data cache lookup has completed
Answer ☟
The answer is (A) and (B).
The virtual memory concept is independent of the size of main memory and depends only on the availability of secondary storage.
The MAR holds the address generated by the CPU, and its length obviously limits the total virtual memory address space.
Average instruction execution time
= average CPU execution time + average time for getting data (instruction operands) from memory for each instruction
= 100 + 2 × (0.9 × 0 + 0.1 × (2 × 150)) + 2 × 150 + (1/10000) × 8 × 10^6
= 100 + 60 + 300 + 800
= 1260 ns
(The page fault rate per 10,000 instructions is given directly in the question. There are two memory accesses per instruction, hence we need 2 × the address translation time in the average instruction execution time. TLB access time is assumed to be 0, and 2 page tables must be accessed on a TLB miss since the system uses two-level paging.)
140 votes -- Arjun Suresh (332k points)
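The same computation in code (a minimal sketch; times in ns, with TLB access time taken as 0 as in the answer above):

cpu = 100                                     # ns of CPU time per instruction
mem = 150                                     # ns per memory access
translate = 0.9 * 0 + 0.1 * (2 * mem)         # two page-table accesses on a TLB miss
fault = (1 / 10000) * 8e6                     # one 8 ms page fault per 10,000 instructions
print(cpu + 2 * translate + 2 * mem + fault)  # 1260.0 ns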
C is the answer here.
The effective address is the address obtained after applying the addressing mode (indexed, immediate, etc.). But this resulting address is still the virtual address; the physical address is invisible to the CPU and is produced only by the MMU when it is given the corresponding virtual address. It is the virtual address that is used for the TLB lookup (TLB = Translation Lookaside Buffer, where "lookaside" refers to the lookup done alongside the virtual-to-physical address translation). So the virtual address must exist before we can look into the TLB, and the earliest the data TLB can be accessed is after the effective address calculation.
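The ordering described above can be sketched in code (all names here are hypothetical; the point is only that the effective, still-virtual address must exist before the data TLB can be consulted):

def translate(va, tlb, page_table):
    if va in tlb:                # TLB hit
        return tlb[va]
    pa = page_table[va]          # TLB miss: walk the page table
    tlb[va] = pa
    return pa

def load_word(base, offset, regs, tlb, page_table, memory):
    va = regs[base] + offset     # 1. effective address calculation (virtual)
    pa = translate(va, tlb, page_table)  # 2. earliest possible data-TLB access
    return memory[pa]            # 3. memory access with the physical address

regs = {"R2": 100}
page_table = {120: 2148}
memory = {2148: 42}
print(load_word("R2", 20, regs, {}, page_table, memory))  # 42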
References
https://gateoverflow.in/?qa=blob&qa_blobid=15279338060050073946
1.2.53 17 : 17  1.2.54 2:2  1.2.55 A  1.2.56 C  1.2.57 B
1.9.1 D  1.9.2 B  1.9.3 A  1.9.4 456  1.9.5 80000 : 80000
1.15.16 50 : 50  1.15.17 D  1.15.18 C  1.15.19 A  1.16.1 31
1.16.2 59 : 60  1.17.1 N/A  1.17.2 C  1.17.3 C  1.17.4 B
1.18.23 13  1.18.24 3  1.18.25 33.0 : 34.0  1.18.26 4  1.18.27 1.50 : 1.51
1.18.28 219  1.18.29 2.161:2.169  1.18.30 17160 : 17160  1.18.31 C  1.18.32 B
1.21.3 C