Quiz Sols

Download as pdf or txt
Download as pdf or txt
You are on page 1of 17

MASSACHUSETTS INSTITUTE OF TECHNOLOGY DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

6.375 Complex Digital Systems Spring 2006 - Quiz - March 24, 2006 80 Minutes

NAME:

SCORE:

Please write your name on every page of the quiz. Not all questions are of equal diculty, so look over the entire quiz and budget your time carefully. Please carefully state any assumptions you make. Enter your answers in the spaces provided below. If you need extra room for an answer or for scratch work, you may use the back of each page but please clearly indicate where your answer is located. A list of useful equations is printed at the end of this quiz. You can detach this sheet for reference and do not have to hand this in. We will not grade anything written on the equation sheet. You will also receive a separate handout containing a copy of the relevant Bluespec lecture slides and code. We will not grade anything written on the Bluespec slides. You must not discuss the quizs contents with other students who have not yet taken the quiz. If, prior to taking it, you are inadvertently exposed to material in a quiz by whatever means you must immediately inform the instructor or a TA.

Points Problem 1 Problem 2 Problem 3 Problem 4 25 25 25 25

Score

6.375 Quiz, Spring 2006

Name:

Problem 1 : Logical Eort for Incrementer Carry Chain (25 total points)
The following diagram illustrates two dierent incrementer architectures. For all parts of this question you should assume that the delay unit ( ) for this process is 10 ps and that the parasitic delay of a minimum-sized inverter (Pinv ) is 1.
in[7] in[6] in[5] in[4] in[3] in[2] in[1] in[0]

sum[7]

sum[6]

sum[5]

sum[4]

sum[3]

sum[2]

sum[1]

sum[0]

Ripple-Carry Architecture There was an error in the ripple-carry gate topology given in the quiz. The last two NOR/NAND gates were accidently swaped. The correct topology is shown above. The error would not have aected your answers to Part 1.A or Part 1.B. It would have changed your answer to Part 1.C but full credit was awarded if you calculated the input caps using the incorrect gate topology.

in[6]

in[5]

in[4]

in[3]

in[2]

in[1]

in[0]

in[7]

in[6]

in[5]

in[4]

in[3]

in[2]

in[1]

sum[7]

sum[6]

sum[5]

sum[4]

sum[3]

sum[2]

sum[1]

sum[0]

Parallel-Prex Architecture Part 1.A : Critical paths for the adder architectures (5 points) Draw a line through the critical path for both the ripple-carry and the parallel-prex architectures. When determining the critical path you can assume that XOR gates are slower than NAND/NOR gates which are slower than inverters.

6.375 Quiz, Spring 2006

Name:

Part 1.B : Optimal delay of the adder architectures (10 points) Use logical eort to calculate the optimal delay of the critical path for both architectures in picoseconds. You should ignore all gates which are not on the critical path! Do not using branching eort. Ignore the fact that some gates have a fanout greater than one. The desired input capacitance of the isolated carry chain is 6 fF (since we are ignoring gates which are not on the critical path this is the input capacitance for a single gate). The load capacitance of every sum output is 60 fF. Show all your work. Ripple-Carry Architecture F P = GH = (4/3)3 (5/3)3 (4) (60/6) = 43.89 10 = 438.9 = (3 2pinv ) + (3 2pinv ) + 4pinv = 16

fopt = F 1/N = 438.91/7 = 2.385 Dopt = N fopt + P = 7 2.385 + 16 = 32.69 Dabs = Dopt = 326.9ps

Parallel-Prex Architecture F P = GH = (4/3)2 (5/3)2 (4) (60/6) = 19.75 10 = 197.5 = (2 2pinv ) + (2 2pinv ) + 4pinv = 12

fopt = F 1/N = 197.51/5 = 2.878 Dopt = N fopt + P = 5 2.878 + 12 = 26.39 Dabs = Dopt = 263.9ps

Part 1.C : Gate sizing for the adder architectures (10 points) Identify the optimum gate sizes for each gate in the critical path for both architectures. The gate sizes should be in femtofarads of input capacitance. Ripple-Carry (correct topology) Gate NAND2 NOR2 NAND2 NOR2 NAND2 NOR2 XOR2 Cout Cin (fF) 6.0 10.7 15.4 27.5 39.3 70.3 100.6 60 Ripple-Carry (incorrect topology) Gate NAND2 NOR2 NAND2 NOR2 NOR2 NAND2 XOR2 Cout Cin (fF) 6.0 10.7 15.4 27.5 39.3 56.3 100.6 60

Parallel-Prex Gate NAND2 NOR2 NAND2 NOR2 XOR2 Cout Cin (fF) 6.0 13.0 22.4 48.3 83.4 60

6.375 Quiz, Spring 2006

Name:

Problem 2 : RC Modeling of Register File Write Bitline (25 total points)


In this problem we will be revisiting the register le write bitline you analyzed in Lab 2. Remember that the write bitline must drive the D input port of 32 ip-ops. The combined gate capacitance of these ip-ops can be a signicant load on the write bitline. The load on the write bitline is further increased by wire capacitance, since ip-ops are usually large and thus often spread apart. The following gure illustrates the write bitline including a reasonable nal stage of the bitline driver. For this problem we will only consider this nal stage even though the real driver might include many stages. As you determined in the lab assignment, each bitcell is 15.68 m wide and the input capacitance of the bitcells D port is 3 fF. The following gure illustrates the register le write bitline. The bitline is routed on Metal 2. You can ignore any via resistance or capacitance. Remember that the driver PMOS/NMOS sizes are in units of minimum NMOS transistor width (0.36 m). For example, the NMOS for the last stage of the bitline driver is 0.36 m 32 = 11.52 m.
Input (D) Port (3fF)

15.68um 64 32

15.68um

31

Last Driver Stage

Bit-Cell

Write Bit-Line

The following table lists various parameters for a 0.18 m technology which you may nd useful when solving this problem. Remember that there is a list of equations at the end of this quiz.

Transistor Process Parameters Desired ratio of PMOS/NMOS widths PMOS gate capacitance per m of transistor width NMOS gate capacitance per m of transistor width PMOS drain capacitance per m of transistor width NMOS drain capacitance per m of transistor width PMOS eective on resistance NMOS eective on resistance Parameters for Metal 2 Wire Wire resistance per unit length Wire capacitance per unit length

Value 2 1.5 fF/m 1.5 fF/m 0.3 fF/m 0.3 fF/m 6.6 km 3.3 km Value 0.4 /m 0.2 fF/m

6.375 Quiz, Spring 2006

Name:

Part 2.A : Delay calculation with end-of-line driver (10 points) Draw a simple RC model for the register le write bitline. Only include the nal stage of the driver. Use a lumped wire model. Use the RC model to determine the delay of the write bitline. Express your answer in RC time constants. This part is very similar to the question asked in Lab 2. Students received 5 points for correctly drawing the RC model and 5 points for correctly performing the delay calculation. Students were not penalized if they reported their delay as a unit-less delay instead of reporting it in picoseconds.

Rdriver Rwire Cdriver,drain Cbitline 2 Cbitline 2 Cload

Rdriver = 3.3km/11.52m = 0.286k Cdriver,drain = 96 0.36m 0.3fF/m = 10.4fF Lwire = 32 15.68m = 501.75m Rwire = 501.75m 0.4/m = 0.2k Cbitline,wire = 501.75m 0.2fF/m = 100.4fF Cbitline,gate = 31 3fF = 93fF Cbitline = Cbitline,wire + Cbitline,gate = 193.4fF Cload = 3fF Time Constant = Rdriver (Cdriver,drain + Cbitline /2) + (Rdriver + Rwire ) (Cbitline /2 + Cload ) = 30.6ps + 48.6ps = 79ps

6.375 Quiz, Spring 2006

Name:

Part 2.B : Delay calculation with middle-line driver (15 points) There is no reason we have to position the write bitline driver at one end of the bitline. In this part we will evaluate moving the driver to the middle of the bitline. The following gure illustrates the new design.
Last Driver Stage 64 32 15.68um

15

16

31

Draw a new RC model for the register le write bitline. Use the RC model to determine the delay of the write bitline. Express your answer in RC time constants. How does this new design compare to the baseline design evaluated in Part 2.A? Does this approach help mitigate wire resistance, wire capacitance, or both? Students received 5 points for correctly drawing the RC model, 7 points for correctly performing the delay calculation, and 3 points for correctly answering whether this approach helps mitigate wire resistance, wire capacitance, or both. It was not terribly clear in the problem text, but the intent was for the students to calculate the time constant for driving the last bit of the bitline (as was the case in Lab 2 and in the previous part). After studying this problem, the sta have determined that we provided the students with insucient information to correctly solve this problem. We did not explain in lecture how to deal with branching when using the Peneld-Rubinstein approximation. All the students who attempted this question either signicantly over-estimated or under-estimated the time constant. In this solution we rst show how to correctly estimate the time constant, then we show the two (incorrect) approaches used by the students. Although incorrect, as long as the student attempted the delay calculation in a reasonable way we awarded the full 7 points.

Rwire 2 Cload Cbitline 4 Cbitline 4

Rdriver Cdriver, drain

Rwire 2 Cbitline 4 Cbitline 4 Cload

This is a bit of an approximation since we are using the same Cbitline as in the previous part, but it should still be a good approximation.

6.375 Quiz, Spring 2006

Name:

We can simplify this RC model as follows where R1 = Rdriver , R2 = Rwire /2, C1 = Cbitline /2 + Cdriver,drain , and C2 = Cbitline /4 + Cload .

R1 R2 C2 C1 R2 C2

There are two approaches to estimate the time constant. The rst uses the PeneldRubinstein approximation. In lecture, we only illustrated using the Peneld-Rubinstein approximation for RC chains, but in this example we must work with an RC tree. The classic paper by Rubinstein, Peneld, and Horowitz entitled Signal Delay in RC Tree Networks explains how to estimate the delay of these type of RC trees. This paper is available in the course locker (/mit/6.375/doc/rubinstein-penfield.pdf). One of the primary results from the paper is that the time constant for a specic path from an input node to an output node i can be roughly approximated with the following equation. Time Constant =
k

Rki Ck

where the summation is over all nodes in the tree and Rki is dened as the resistance of the portion of the path between the input and node i that is common with the path between the input and node k. The key points is that we do include capacitances which are o the path of interest but we do not include resistances which are o the path of interest. So using the Peneld-Rubinstein equation, the time constant for driving the nal bitcell is approximately as follows. Time Constant = R1 C1 + (R1 + R2 )C2 + R1 C2 = R1 C1 + (2R1 + R2 )C2 An alternative method exploits the fact that both branches of the RC tree are symmetric. Each branch gets exactly half of the drive current. Essentially it is as if we were driving each half of the bitline with a driver whose eective resistance is twice the original eective resistance. The following gure illustrates an equivalent RC model.

2 x R1 C1 2

R2 C2

The approximated time constant for this equivalent circuit is the same as for the PeneldRubinstein approximation. Time Constant = 2R1 C1 /2 + (2R1 + R2 )C2 = R1 C1 + (2R1 + R2 )C2

6.375 Quiz, Spring 2006

Name:

For both of these solution we can rewrite our result to look more like the result we found in Part 2.A. Time Constant = R1 C1 + (2R1 + R2 )C2 Rdriver (Cdriver,drain + Cbitline /2) + (Rdriver + Rwire /4) (Cbitline /2 + Cload ) = 30.1ps + 33.5ps = 63.6ps Driving the bitline from the middle helps mitigate the impact of wire resistance. We can also determine this intuitively by considering two scenarios. In the rst scenario, the wire capacitance is very large relative to the wire resistance. In this scenario you should be able to qualitatively see that driving the bitline from the middle will not really help the delay. Since the wire resistance is small, we can just lump all of the wire capacitance together - it does not matter where we drive the load from. In the second scenario, the wire resistance is very large relative to the wire capacitance. Now you should be able to see that driving the bitline from the middle will denitely help reduce the delay since it helps reduce the resistive path through which the driver charges up the load capacitance. The intuition then is that distributed drivers help mitigate wire resistance and are not really useful for mitigating wire capacitance. Let us now consider the two solutions most students provided and explain why they are incorrect. The rst solution includes the time constant for the left-hand branch in the approximation. Time Constant = R1 C1 + (R1 + R2 )C2 + (R1 + R2 )C2 = R1 C1 + 2(R1 + R2 )C2 Rdriver (Cdriver,drain + Cbitline /2) + (Rdriver + Rwire /2) (Cbitline /2 + Cload ) = 30.1ps + 38.5ps = 68.6ps This is an overly pessimistic approximation. Imagine if the branches were not symmetric and the resistance of the left-hand branch was very large. This would result in a large time constant for driving the right-hand branch since we are factoring in how long it takes to charge up the left-hand branch. The second solution provided by students completely ignored the left-hand branch. Time Constant = R1 C1 + (R1 + R2 )C2 Rdriver (Cdriver,drain + Cbitline /2) + (Rdriver + Rwire /2) (Cbitline /4 + Cload ) = 30.1ps + 19.8ps = 49.9ps This is an overly optimistic approximation. This approach ignores the fact that some of the current to charge up the right-hand branch is being diverted to charge up the left-hand branch. The Peneld-Rubinstein approximation helps properly account for the left-hand branch without being overly pessimistic or optimistic.

6.375 Quiz, Spring 2006

Name:

Problem 3 : Bluespec Synthesis (25 total points)


Consider the algorithm for binary multiplication presented in Lecture 7 (Introduction to Bluespec): 1001 x 0101 --------1001 0000 1001 0000 --------0101101 // d = 4d9 // r = 4d5 // // // // d 0 d 0 << << << << 0 1 2 3 (since (since (since (since r[0] r[1] r[2] r[3] == == == == 1) 0) 1) 0)

// product (sum of above) = 45

This algorithm is actually quite similar to the software multiplication algorithm you implemented for SMIPS in Lab 1. For this problem we will explore implementing this as a hardware module in Bluespec. The following module implements this algorithm using two shifters to form an iterative multiplier: interface I_mult; method Action start( Bit#(16) x,Bit#(16) y ); method Bit#(32) result(); endinterface module mkMult ( I_mult ); Reg#(Bit#(32)) product <- mkReg(0); Reg#(Bit#(32)) d <- mkReg(0); Reg#(Bit#(16)) r <- mkReg(0); rule cycle ( r != 0 ); if ( r[0] == 1 ) product <= product + d; d <= d << 1; r <= r >> 1; endrule method Action start( Bit#(16) x, Bit#(16) y ) if ( r == 0 ); d <= zeroExtend(x); r <= y; product <= 0; endmethod method Bit#(32) result() if ( r == 0 ); return product; endmethod endmodule

6.375 Quiz, Spring 2006

Name:

10

Diagram the hardware that the Bluespec compiler should produce for this module, including interface ports. Clearly circle and label which parts correspond to the rule, the scheduler, the start and result methods. Label which wire or wires correspond to CAN FIRE cycle and WILL FIRE cycle, as well as all ports corresponding to method ready and enable signals. Some students included the CLK and RST as inputs. This was not required, but of course, not penalized. The CLK and RST are implicit in the diagram solution. Other students decided to represent register enables as an extra level of muxing. This is also acceptable, and in many cases easier to reason about. The compiler found a small optimization on the product enable signal that no student found, but of course this is unnecessary.

6.375 Quiz, Spring 2006

Name:

11

Problem 4 : Rule Scheduling in Bluespec (25 total points)


In this problem we will explore the behavior of the pipeline used in Lab 3 and presented in class. The reference code has been included in a separate handout. In order to gain ne-grained control over the scheduling, it is often desirable to split large rules with case statements into multiple rules. Consider the execute rule. It only interacts with the dataReqQ on a memory operation, so one natural partitioning is to create an execMem rule which handles Load and Store operations. Similarly the execute rule only interacts with the pc when the current instruction is a branch. Therefore one design choice might be to separate the handling of branch instructions into a separate rule. However this choice is actually too restrictive. In point of fact, the execute stage only sets pc on a taken branch. Consider the design where execute is split into four rules, execALU, execMem, execBr NotTaken, and execBr Taken. For reference, here is the code for the execBr NotTaken and the execBr Taken function. function Bool isBranch( Instr i ); // Returns True if i is a Branch endfunction function Bool branchTaken( Instr i ); // If given a branch instruction, returns True if the branch is taken, // otherwise returns False. // Note that in some cases this involves reading the RegFile. endfunction rule execBr_NotTaken ( instRespQ.first() matches tagged LoadResp .ld &&& ld.tag == epoch &&& unpack(ld.data) matches .inst &&& !stallfunc(inst) &&& isBranch(inst) &&& !branch_taken(inst) ); pcQ.deq(); instRespQ.deq(); endrule

6.375 Quiz, Spring 2006

Name:

12

rule execBr_Taken ( instRespQ.first() matches tagged LoadResp .ld &&& ld.tag == epoch &&& unpack(ld.data) matches .inst &&& !stallfunc(inst) &&& isBranch(inst) &&& branch_taken(inst) ); Addr next_pc; case (inst) matches tagged J .it : next_pc = { pcQ.first()[31:28], it.target, 2b0 }; tagged JR .it : next_pc = rf.rd1(it.rsrc); tagged JAL .it : begin wbQ.enq( WB_ALU {dest: 31, data: pcQ.first()} ); next_pc = { pcQ.first()[31:28], it.target, 2b0 }; end tagged JALR .it : begin wbQ.enq( WB_ALU {dest: it.rdst, data: pcQ.first()} ); next_pc = rf.rd1(it.rsrc); end //BLEZ, BGTZ, BTZ, BGEZ, BEQ, BNE default: next_pc = pcQ.first() + (sext(it.offset) << 2); endcase pc <= next_pc; epoch <= epoch + 1; pcQ.deq(); instRespQ.deq(); endrule

6.375 Quiz, Spring 2006

Name:

13

After splitting this system the rules have the following resource usage. (Note that the FIFO clear methods are unused.)

pcGen pc.read epoch.read pc.write pcQ.enq instReqQ.enq

discard epoch.read pcQ.deq instRespQ.deq

execALU epoch.read instRespQ.rst instRespQ.deq pcQ.deq wbQ.enq wbQ.nd1,2 rf.rd1,2

execMem epoch.read instRespQ.rst instRespQ.deq pcQ.rst pcQ.deq wbQ.enq wbQ.nd1,2 rf.rd1,2 dataReqQ.enq

execBr Taken epoch.read instRespQ.rst instRespQ.deq pcQ.rst pcQ.deq wbQ.enq wbQ.nd1,2 rf.rd1,2 dataReqQ.enq epoch.write pc.write

execBr NotTaken epoch.read instRespQ.rst instRespQ.deq pcQ.deq wbQ.nd1,2

writeback wbQ.rst wbQ.deq dataRespQ.rst dataRespq.deq rf.wr

6.375 Quiz, Spring 2006

Name:

14

Part 4.A : Method scheduling 1 (6 points) Suppose you want your system to have the following scheduling behavior when multiple rules execute in the same clock cycle: pcGen < execBr Taken < writeback These rules interact through various modules such as the pc and pcQ. For each of these modules, give the method relationship necessary to meet the above scheduling behavior. For modules where the order is irrelevent or determined by factors outside of the processor write N/A. Weve done pc, you do the rest. pc: read < write epoch: read < write rf: rd{0,1} < wr

pcQ: enq < first, deq

instReqQ: N/A

instRespQ: N/A

wbQ: find{1,2}, enq < first, deq

dataReqQ: N/A

dataRespQ: N/A

Part 4.B : Method scheduling 2 (7 points) Perform the same reasoning, but for the following scheduling property: writeback < execBr Taken < pcGen pc: write < read epoch: write < read rf: wr < rd{1,2}

pcQ: first, deq < enq

instReqQ: N/A

instRespQ: N/A

wbQ: first, deq < find{1,2}, enq

dataReqQ: N/A

dataRespQ: N/A

6.375 Quiz, Spring 2006

Name:

15

Part 4.C : Dynamic Behavior (12 points) Consider the following three variants of a processor: Behaves as if: pcGen < execBr NotTaken, execBr Taken < writeback Behaves as if: writeback < execBr NotTaken, execBr Taken < pcGen Behaves as if: writeback < execBr NotTaken < pcGen < execBr Taken While running a program these processors reach the following state:

6.375 Quiz, Spring 2006

Name:

16

For each variant, answer the following. A) What rules (of those shown) will the scheduler choose to re and why. B) What is the longest combinational path in the system (including the parts not shown)? pcGen < execBr NotTaken, execBr Taken < writeback A) The execBr rules will not be able to re because of the stall signal, since wbQ.nd < wbQ.deq. The writeback rule will re, so the wbQ will be empty and the execBr Taken rule will re on the next cycle. B) If you work with non-combinational memories, as we did in the lab, then the paths are as follows. pc->pcGen->instReqQ, instRespQ/RF->execs->dataReqQ, and dataRespQ->rf. If, on the other hand you assume the memories are combinational, then this will act like an unpipelined machine, with the longest path going through the whole machine, including the memories. writeback < execBr NotTaken, execBr Taken < pcGen A) The writeback rule will re rst, and dequeue the wbQ. Therefore there is no stall for the execBr rules, and the execBr Taken rule will re, since it will read r4 == r5 == 64. B) The longest path in this system ows from the writeback rule, through the register le and through the execute rules. Then, since the execute rule updates the PC, the path continues through the pcGen rule and to the instReqQ and pcQ. writeback < execBr NotTaken < pcGen < execBr Taken A) This situation is the same as the previous, since we are not considering the pcGen rule (it is not in the diagram). B) The longest path goes from the writeback rule, through the register le and through the execute rules. However, since the new PC is no longer bypassed to the pcGen rule, that part of the path is no longer present. In fact in this system it would not be surprising if the ALU were the longest path.

6.375 Quiz, Spring 2006

Name:

17

Equation Sheet
Equation or Symbol g h = Cout /Cin f = gh p pinv d=f +p dabs = d G= gi H = Cout /Cin F = GH D= di = gi hi + pi fopt = F 1/N Dopt = N fopt + P Cin,opt,i = Cout,i gi /fopt Delay =
n i=0 j =i j =0 Rj

Description Gate logical eort Gate electrical eort Gate eort Gate parasitic delay Parasitic delay of minimum-sized inverter Delay unit Delay in units of Absolute delay in seconds Path logical eort Path electrical eort Path eort Path delay Optimal stage eort Optimal path delay Optimal input capacitance for stage i Peneld-Rubenstein wire-delay model Eective driver resistance Total wire resistance Total wire capacitance Simple lumped model

Ci

Rd Rw Cw Delay Rd Cw /2 + (Rd + Rw ) (Cw /2 + Cload )

Gate Type Inverter Logical Eort NAND Logical Eort NOR Logical Eort XOR/XNOR Logical Eort Inverter Parasitic Delay NAND Parasitic Delay NOR Parasitic Delay XOR/XNOR Parasitic Delay

1 1

2 4/3 5/3 4

Number of inputs 3 4 5 5/3 7/3 12 3pinv 3pinv 4pinv 6/3 9/3 32 4pinv 4pinv 4pinv 7/3 11/3

n (n + 2)/3 (2n + 1)/3

pinv 2pinv 2pinv 4pinv 5pinv 5pinv 4pinv npinv npinv 4pinv

You might also like