Quiz Sols
Quiz Sols
Quiz Sols
6.375 Complex Digital Systems Spring 2006 - Quiz - March 24, 2006 80 Minutes
NAME:
SCORE:
Please write your name on every page of the quiz. Not all questions are of equal diculty, so look over the entire quiz and budget your time carefully. Please carefully state any assumptions you make. Enter your answers in the spaces provided below. If you need extra room for an answer or for scratch work, you may use the back of each page but please clearly indicate where your answer is located. A list of useful equations is printed at the end of this quiz. You can detach this sheet for reference and do not have to hand this in. We will not grade anything written on the equation sheet. You will also receive a separate handout containing a copy of the relevant Bluespec lecture slides and code. We will not grade anything written on the Bluespec slides. You must not discuss the quizs contents with other students who have not yet taken the quiz. If, prior to taking it, you are inadvertently exposed to material in a quiz by whatever means you must immediately inform the instructor or a TA.
Score
Name:
Problem 1 : Logical Eort for Incrementer Carry Chain (25 total points)
The following diagram illustrates two dierent incrementer architectures. For all parts of this question you should assume that the delay unit ( ) for this process is 10 ps and that the parasitic delay of a minimum-sized inverter (Pinv ) is 1.
in[7] in[6] in[5] in[4] in[3] in[2] in[1] in[0]
sum[7]
sum[6]
sum[5]
sum[4]
sum[3]
sum[2]
sum[1]
sum[0]
Ripple-Carry Architecture There was an error in the ripple-carry gate topology given in the quiz. The last two NOR/NAND gates were accidently swaped. The correct topology is shown above. The error would not have aected your answers to Part 1.A or Part 1.B. It would have changed your answer to Part 1.C but full credit was awarded if you calculated the input caps using the incorrect gate topology.
in[6]
in[5]
in[4]
in[3]
in[2]
in[1]
in[0]
in[7]
in[6]
in[5]
in[4]
in[3]
in[2]
in[1]
sum[7]
sum[6]
sum[5]
sum[4]
sum[3]
sum[2]
sum[1]
sum[0]
Parallel-Prex Architecture Part 1.A : Critical paths for the adder architectures (5 points) Draw a line through the critical path for both the ripple-carry and the parallel-prex architectures. When determining the critical path you can assume that XOR gates are slower than NAND/NOR gates which are slower than inverters.
Name:
Part 1.B : Optimal delay of the adder architectures (10 points) Use logical eort to calculate the optimal delay of the critical path for both architectures in picoseconds. You should ignore all gates which are not on the critical path! Do not using branching eort. Ignore the fact that some gates have a fanout greater than one. The desired input capacitance of the isolated carry chain is 6 fF (since we are ignoring gates which are not on the critical path this is the input capacitance for a single gate). The load capacitance of every sum output is 60 fF. Show all your work. Ripple-Carry Architecture F P = GH = (4/3)3 (5/3)3 (4) (60/6) = 43.89 10 = 438.9 = (3 2pinv ) + (3 2pinv ) + 4pinv = 16
fopt = F 1/N = 438.91/7 = 2.385 Dopt = N fopt + P = 7 2.385 + 16 = 32.69 Dabs = Dopt = 326.9ps
Parallel-Prex Architecture F P = GH = (4/3)2 (5/3)2 (4) (60/6) = 19.75 10 = 197.5 = (2 2pinv ) + (2 2pinv ) + 4pinv = 12
fopt = F 1/N = 197.51/5 = 2.878 Dopt = N fopt + P = 5 2.878 + 12 = 26.39 Dabs = Dopt = 263.9ps
Part 1.C : Gate sizing for the adder architectures (10 points) Identify the optimum gate sizes for each gate in the critical path for both architectures. The gate sizes should be in femtofarads of input capacitance. Ripple-Carry (correct topology) Gate NAND2 NOR2 NAND2 NOR2 NAND2 NOR2 XOR2 Cout Cin (fF) 6.0 10.7 15.4 27.5 39.3 70.3 100.6 60 Ripple-Carry (incorrect topology) Gate NAND2 NOR2 NAND2 NOR2 NOR2 NAND2 XOR2 Cout Cin (fF) 6.0 10.7 15.4 27.5 39.3 56.3 100.6 60
Parallel-Prex Gate NAND2 NOR2 NAND2 NOR2 XOR2 Cout Cin (fF) 6.0 13.0 22.4 48.3 83.4 60
Name:
15.68um 64 32
15.68um
31
Bit-Cell
Write Bit-Line
The following table lists various parameters for a 0.18 m technology which you may nd useful when solving this problem. Remember that there is a list of equations at the end of this quiz.
Transistor Process Parameters Desired ratio of PMOS/NMOS widths PMOS gate capacitance per m of transistor width NMOS gate capacitance per m of transistor width PMOS drain capacitance per m of transistor width NMOS drain capacitance per m of transistor width PMOS eective on resistance NMOS eective on resistance Parameters for Metal 2 Wire Wire resistance per unit length Wire capacitance per unit length
Value 2 1.5 fF/m 1.5 fF/m 0.3 fF/m 0.3 fF/m 6.6 km 3.3 km Value 0.4 /m 0.2 fF/m
Name:
Part 2.A : Delay calculation with end-of-line driver (10 points) Draw a simple RC model for the register le write bitline. Only include the nal stage of the driver. Use a lumped wire model. Use the RC model to determine the delay of the write bitline. Express your answer in RC time constants. This part is very similar to the question asked in Lab 2. Students received 5 points for correctly drawing the RC model and 5 points for correctly performing the delay calculation. Students were not penalized if they reported their delay as a unit-less delay instead of reporting it in picoseconds.
Rdriver = 3.3km/11.52m = 0.286k Cdriver,drain = 96 0.36m 0.3fF/m = 10.4fF Lwire = 32 15.68m = 501.75m Rwire = 501.75m 0.4/m = 0.2k Cbitline,wire = 501.75m 0.2fF/m = 100.4fF Cbitline,gate = 31 3fF = 93fF Cbitline = Cbitline,wire + Cbitline,gate = 193.4fF Cload = 3fF Time Constant = Rdriver (Cdriver,drain + Cbitline /2) + (Rdriver + Rwire ) (Cbitline /2 + Cload ) = 30.6ps + 48.6ps = 79ps
Name:
Part 2.B : Delay calculation with middle-line driver (15 points) There is no reason we have to position the write bitline driver at one end of the bitline. In this part we will evaluate moving the driver to the middle of the bitline. The following gure illustrates the new design.
Last Driver Stage 64 32 15.68um
15
16
31
Draw a new RC model for the register le write bitline. Use the RC model to determine the delay of the write bitline. Express your answer in RC time constants. How does this new design compare to the baseline design evaluated in Part 2.A? Does this approach help mitigate wire resistance, wire capacitance, or both? Students received 5 points for correctly drawing the RC model, 7 points for correctly performing the delay calculation, and 3 points for correctly answering whether this approach helps mitigate wire resistance, wire capacitance, or both. It was not terribly clear in the problem text, but the intent was for the students to calculate the time constant for driving the last bit of the bitline (as was the case in Lab 2 and in the previous part). After studying this problem, the sta have determined that we provided the students with insucient information to correctly solve this problem. We did not explain in lecture how to deal with branching when using the Peneld-Rubinstein approximation. All the students who attempted this question either signicantly over-estimated or under-estimated the time constant. In this solution we rst show how to correctly estimate the time constant, then we show the two (incorrect) approaches used by the students. Although incorrect, as long as the student attempted the delay calculation in a reasonable way we awarded the full 7 points.
This is a bit of an approximation since we are using the same Cbitline as in the previous part, but it should still be a good approximation.
Name:
We can simplify this RC model as follows where R1 = Rdriver , R2 = Rwire /2, C1 = Cbitline /2 + Cdriver,drain , and C2 = Cbitline /4 + Cload .
R1 R2 C2 C1 R2 C2
There are two approaches to estimate the time constant. The rst uses the PeneldRubinstein approximation. In lecture, we only illustrated using the Peneld-Rubinstein approximation for RC chains, but in this example we must work with an RC tree. The classic paper by Rubinstein, Peneld, and Horowitz entitled Signal Delay in RC Tree Networks explains how to estimate the delay of these type of RC trees. This paper is available in the course locker (/mit/6.375/doc/rubinstein-penfield.pdf). One of the primary results from the paper is that the time constant for a specic path from an input node to an output node i can be roughly approximated with the following equation. Time Constant =
k
Rki Ck
where the summation is over all nodes in the tree and Rki is dened as the resistance of the portion of the path between the input and node i that is common with the path between the input and node k. The key points is that we do include capacitances which are o the path of interest but we do not include resistances which are o the path of interest. So using the Peneld-Rubinstein equation, the time constant for driving the nal bitcell is approximately as follows. Time Constant = R1 C1 + (R1 + R2 )C2 + R1 C2 = R1 C1 + (2R1 + R2 )C2 An alternative method exploits the fact that both branches of the RC tree are symmetric. Each branch gets exactly half of the drive current. Essentially it is as if we were driving each half of the bitline with a driver whose eective resistance is twice the original eective resistance. The following gure illustrates an equivalent RC model.
2 x R1 C1 2
R2 C2
The approximated time constant for this equivalent circuit is the same as for the PeneldRubinstein approximation. Time Constant = 2R1 C1 /2 + (2R1 + R2 )C2 = R1 C1 + (2R1 + R2 )C2
Name:
For both of these solution we can rewrite our result to look more like the result we found in Part 2.A. Time Constant = R1 C1 + (2R1 + R2 )C2 Rdriver (Cdriver,drain + Cbitline /2) + (Rdriver + Rwire /4) (Cbitline /2 + Cload ) = 30.1ps + 33.5ps = 63.6ps Driving the bitline from the middle helps mitigate the impact of wire resistance. We can also determine this intuitively by considering two scenarios. In the rst scenario, the wire capacitance is very large relative to the wire resistance. In this scenario you should be able to qualitatively see that driving the bitline from the middle will not really help the delay. Since the wire resistance is small, we can just lump all of the wire capacitance together - it does not matter where we drive the load from. In the second scenario, the wire resistance is very large relative to the wire capacitance. Now you should be able to see that driving the bitline from the middle will denitely help reduce the delay since it helps reduce the resistive path through which the driver charges up the load capacitance. The intuition then is that distributed drivers help mitigate wire resistance and are not really useful for mitigating wire capacitance. Let us now consider the two solutions most students provided and explain why they are incorrect. The rst solution includes the time constant for the left-hand branch in the approximation. Time Constant = R1 C1 + (R1 + R2 )C2 + (R1 + R2 )C2 = R1 C1 + 2(R1 + R2 )C2 Rdriver (Cdriver,drain + Cbitline /2) + (Rdriver + Rwire /2) (Cbitline /2 + Cload ) = 30.1ps + 38.5ps = 68.6ps This is an overly pessimistic approximation. Imagine if the branches were not symmetric and the resistance of the left-hand branch was very large. This would result in a large time constant for driving the right-hand branch since we are factoring in how long it takes to charge up the left-hand branch. The second solution provided by students completely ignored the left-hand branch. Time Constant = R1 C1 + (R1 + R2 )C2 Rdriver (Cdriver,drain + Cbitline /2) + (Rdriver + Rwire /2) (Cbitline /4 + Cload ) = 30.1ps + 19.8ps = 49.9ps This is an overly optimistic approximation. This approach ignores the fact that some of the current to charge up the right-hand branch is being diverted to charge up the left-hand branch. The Peneld-Rubinstein approximation helps properly account for the left-hand branch without being overly pessimistic or optimistic.
Name:
This algorithm is actually quite similar to the software multiplication algorithm you implemented for SMIPS in Lab 1. For this problem we will explore implementing this as a hardware module in Bluespec. The following module implements this algorithm using two shifters to form an iterative multiplier: interface I_mult; method Action start( Bit#(16) x,Bit#(16) y ); method Bit#(32) result(); endinterface module mkMult ( I_mult ); Reg#(Bit#(32)) product <- mkReg(0); Reg#(Bit#(32)) d <- mkReg(0); Reg#(Bit#(16)) r <- mkReg(0); rule cycle ( r != 0 ); if ( r[0] == 1 ) product <= product + d; d <= d << 1; r <= r >> 1; endrule method Action start( Bit#(16) x, Bit#(16) y ) if ( r == 0 ); d <= zeroExtend(x); r <= y; product <= 0; endmethod method Bit#(32) result() if ( r == 0 ); return product; endmethod endmodule
Name:
10
Diagram the hardware that the Bluespec compiler should produce for this module, including interface ports. Clearly circle and label which parts correspond to the rule, the scheduler, the start and result methods. Label which wire or wires correspond to CAN FIRE cycle and WILL FIRE cycle, as well as all ports corresponding to method ready and enable signals. Some students included the CLK and RST as inputs. This was not required, but of course, not penalized. The CLK and RST are implicit in the diagram solution. Other students decided to represent register enables as an extra level of muxing. This is also acceptable, and in many cases easier to reason about. The compiler found a small optimization on the product enable signal that no student found, but of course this is unnecessary.
Name:
11
Name:
12
rule execBr_Taken ( instRespQ.first() matches tagged LoadResp .ld &&& ld.tag == epoch &&& unpack(ld.data) matches .inst &&& !stallfunc(inst) &&& isBranch(inst) &&& branch_taken(inst) ); Addr next_pc; case (inst) matches tagged J .it : next_pc = { pcQ.first()[31:28], it.target, 2b0 }; tagged JR .it : next_pc = rf.rd1(it.rsrc); tagged JAL .it : begin wbQ.enq( WB_ALU {dest: 31, data: pcQ.first()} ); next_pc = { pcQ.first()[31:28], it.target, 2b0 }; end tagged JALR .it : begin wbQ.enq( WB_ALU {dest: it.rdst, data: pcQ.first()} ); next_pc = rf.rd1(it.rsrc); end //BLEZ, BGTZ, BTZ, BGEZ, BEQ, BNE default: next_pc = pcQ.first() + (sext(it.offset) << 2); endcase pc <= next_pc; epoch <= epoch + 1; pcQ.deq(); instRespQ.deq(); endrule
Name:
13
After splitting this system the rules have the following resource usage. (Note that the FIFO clear methods are unused.)
execMem epoch.read instRespQ.rst instRespQ.deq pcQ.rst pcQ.deq wbQ.enq wbQ.nd1,2 rf.rd1,2 dataReqQ.enq
execBr Taken epoch.read instRespQ.rst instRespQ.deq pcQ.rst pcQ.deq wbQ.enq wbQ.nd1,2 rf.rd1,2 dataReqQ.enq epoch.write pc.write
Name:
14
Part 4.A : Method scheduling 1 (6 points) Suppose you want your system to have the following scheduling behavior when multiple rules execute in the same clock cycle: pcGen < execBr Taken < writeback These rules interact through various modules such as the pc and pcQ. For each of these modules, give the method relationship necessary to meet the above scheduling behavior. For modules where the order is irrelevent or determined by factors outside of the processor write N/A. Weve done pc, you do the rest. pc: read < write epoch: read < write rf: rd{0,1} < wr
instReqQ: N/A
instRespQ: N/A
dataReqQ: N/A
dataRespQ: N/A
Part 4.B : Method scheduling 2 (7 points) Perform the same reasoning, but for the following scheduling property: writeback < execBr Taken < pcGen pc: write < read epoch: write < read rf: wr < rd{1,2}
instReqQ: N/A
instRespQ: N/A
dataReqQ: N/A
dataRespQ: N/A
Name:
15
Part 4.C : Dynamic Behavior (12 points) Consider the following three variants of a processor: Behaves as if: pcGen < execBr NotTaken, execBr Taken < writeback Behaves as if: writeback < execBr NotTaken, execBr Taken < pcGen Behaves as if: writeback < execBr NotTaken < pcGen < execBr Taken While running a program these processors reach the following state:
Name:
16
For each variant, answer the following. A) What rules (of those shown) will the scheduler choose to re and why. B) What is the longest combinational path in the system (including the parts not shown)? pcGen < execBr NotTaken, execBr Taken < writeback A) The execBr rules will not be able to re because of the stall signal, since wbQ.nd < wbQ.deq. The writeback rule will re, so the wbQ will be empty and the execBr Taken rule will re on the next cycle. B) If you work with non-combinational memories, as we did in the lab, then the paths are as follows. pc->pcGen->instReqQ, instRespQ/RF->execs->dataReqQ, and dataRespQ->rf. If, on the other hand you assume the memories are combinational, then this will act like an unpipelined machine, with the longest path going through the whole machine, including the memories. writeback < execBr NotTaken, execBr Taken < pcGen A) The writeback rule will re rst, and dequeue the wbQ. Therefore there is no stall for the execBr rules, and the execBr Taken rule will re, since it will read r4 == r5 == 64. B) The longest path in this system ows from the writeback rule, through the register le and through the execute rules. Then, since the execute rule updates the PC, the path continues through the pcGen rule and to the instReqQ and pcQ. writeback < execBr NotTaken < pcGen < execBr Taken A) This situation is the same as the previous, since we are not considering the pcGen rule (it is not in the diagram). B) The longest path goes from the writeback rule, through the register le and through the execute rules. However, since the new PC is no longer bypassed to the pcGen rule, that part of the path is no longer present. In fact in this system it would not be surprising if the ALU were the longest path.
Name:
17
Equation Sheet
Equation or Symbol g h = Cout /Cin f = gh p pinv d=f +p dabs = d G= gi H = Cout /Cin F = GH D= di = gi hi + pi fopt = F 1/N Dopt = N fopt + P Cin,opt,i = Cout,i gi /fopt Delay =
n i=0 j =i j =0 Rj
Description Gate logical eort Gate electrical eort Gate eort Gate parasitic delay Parasitic delay of minimum-sized inverter Delay unit Delay in units of Absolute delay in seconds Path logical eort Path electrical eort Path eort Path delay Optimal stage eort Optimal path delay Optimal input capacitance for stage i Peneld-Rubenstein wire-delay model Eective driver resistance Total wire resistance Total wire capacitance Simple lumped model
Ci
Gate Type Inverter Logical Eort NAND Logical Eort NOR Logical Eort XOR/XNOR Logical Eort Inverter Parasitic Delay NAND Parasitic Delay NOR Parasitic Delay XOR/XNOR Parasitic Delay
1 1
2 4/3 5/3 4
Number of inputs 3 4 5 5/3 7/3 12 3pinv 3pinv 4pinv 6/3 9/3 32 4pinv 4pinv 4pinv 7/3 11/3
pinv 2pinv 2pinv 4pinv 5pinv 5pinv 4pinv npinv npinv 4pinv