Static Pipelining #2 and Goodbye To Computer Architecture: Prof. Lawrence Rauchwerger

CPSC 614: Graduate Computer Architecture

Static Pipelining #2 and Goodbye to Computer Architecture
Prof. Lawrence Rauchwerger

Based on Lectures by
Prof. David A. Patterson
UC Berkeley
Review #1: Hardware versus
Software Speculation Mechanisms
• To speculate extensively, must be able to
disambiguate memory references
– Much easier in HW than in SW for code with pointers
• HW-based speculation works better when control
flow is unpredictable, and when HW-based branch
prediction is superior to SW-based branch prediction
done at compile time
– Mispredictions mean wasted speculation
• HW-based speculation maintains precise exception
model even for speculated instructions
• HW-based speculation does not require compensation
or bookkeeping code
Review #2: Hardware versus Software
Speculation Mechanisms cont’d
• Compiler-based approaches may benefit from the
ability to see further in the code sequence,
resulting in better code scheduling
• HW-based speculation with dynamic scheduling
does not require different code sequences to
achieve good performance for different
implementations of an architecture
– may be the most important in the long run?
Review #3: Software Scheduling

• Instruction Level Parallelism (ILP) found either by
compiler or hardware.
• Loop level parallelism is easiest to see
– SW dependencies/compiler sophistication determine if compiler can
unroll loops
– Memory dependencies hardest to determine => Memory disambiguation
– Very sophisticated transformations available
• Trace Scheduling to Parallelize If statements
• Superscalar and VLIW: CPI < 1 (IPC > 1)
– Dynamic issue vs. Static issue
– More instructions issue at same time => larger hazard penalty
– Limitation is often number of instructions that you can successfully
fetch and decode per cycle
VLIW in Embedded Designs

• VLIW: greater parallelism under programmer, compiler
control vs. hardware in superscalar
• Used in DSPs, Multimedia processors as well
as IA-64
• What about code size?
• Effectiveness, Quality of compilers for
these applications?
Example VLIW for multimedia:
Philips Trimedia CPU
• Every instruction contains 5 operations
• Predicated with single register value;
if 0 => all 5 operations are canceled
• 128 64-bit registers, which contain either
integer or floating point data
• Partitioned ALU (SIMD) instructions to
compute on multiple instances of narrow data
• Offers both saturating arithmetic (DSPs) and
2’s complement arithmetic (desktop)
• Delayed Branch with 3 branch slots
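The difference between the saturating and 2's complement arithmetic mentioned above can be sketched in C for 16-bit values. This is only an illustration: the function names add_wrap16 and add_sat16 are made up for this sketch and are not Trimedia operation names.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdint.h>

/* 2's complement (desktop-style) 16-bit add: overflow wraps around. */
int16_t add_wrap16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;   /* exact sum, cannot overflow */
    if (s > INT16_MAX)
        s -= 65536;                        /* wrap past the top */
    else if (s < INT16_MIN)
        s += 65536;                        /* wrap past the bottom */
    return (int16_t)s;
}

/* Saturating (DSP-style) 16-bit add: overflow clamps to the nearest limit. */
int16_t add_sat16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;   /* exact sum, cannot overflow */
    if (s > INT16_MAX)
        return INT16_MAX;
    if (s < INT16_MIN)
        return INT16_MIN;
    return (int16_t)s;
}
```

For example, add_sat16(30000, 10000) clamps to 32767, while add_wrap16(30000, 10000) wraps to -25536. Clamping is usually the better failure mode for audio and video samples, which is why DSPs offer it.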
Trimedia Operations
Operation | Examples | No. | Comment
Load/store ops | ld8, ld16, ld32, ld64, limm, st8, st16, st32, st64 | 39 | SIMD, signed, unsigned, register indirect, indexed, scaled addressing
Byte shuffles | shift right 1-, 2-, 3-bytes, select byte, merge, pack | 67 | SIMD type convert
Bit shifts | asl, asr, lsl, lsr, rol | 48 | round, fields, SIMD
Multiplies | mul, sum of products, sum-of- | 54 | round, saturate, 2’s comp, SIMD
Integer arithmetic | add, sub, min, max, abs, average, bitand, bitor, bitxor, bitinv, bitandinv, eql, neq, gtr, geq, les, leq, sign extend, zero extend, sum of absolute differences | 104 | saturate, 2’s comp, unsigned, immediate, SIMD
Floating | add, sub, neg, mul, div, sqrt, eql, neq, gtr, geq, les, leq, IEEE flags | 59 | scalar and SIMD
Lookup | SIMD gather load using registers as addresses | 6 | SIMD
Special ops | alloc, prefetch block, invalidate block, copy block back, read tag, read cache status, read counter | 23 | MMU, cache, special
Branch | jmpt, jmpf | 10 | (un)interruptible, trap
Total | | 410 |
• large number of ops because used retargetable
compilers, multiple machine descriptions, and
die size estimators to explore the space to find
the best cost-performance design
– Verification time, manufacturing test, design time?
Trimedia Functional Units, Latency, Instruction Slots
F.U. | Latency | No. of operation slots (of 5) | Typical operations performed by functional unit
ALU | 0 | 5 | Integer add/subtract/compare, logicals
DMem | 2 | 2 | Loads and stores
DMemSpec | 2 | 1 | Cache invalidate, prefetch, allocate
Shifter | 0 | 2 | Shifts and rotates
DSPALU | 1 | 2 | Simple DSP arithmetic ops
DSPMul | 2 | 2 | DSP ops with multiplication
Branch | 3 | 3 | Branches and jumps
FALU | 2 | 2 | FP add, subtract
IFMul | 2 | 2 | Integer and FP multiply
FComp | 0 | 1 | FP compare
FTough | 16 | 1 | FP divide, square root
• 23 functional units of 11 types
• how many of the 5 slots can issue to each unit type
(and hence number of functional units)
Philips Trimedia CPU

• Compiler responsible for including no-ops
– both within an instruction (when an operation field
cannot be used) and between dependent instructions
– processor does not detect hazards, which if present
will lead to incorrect execution
• Code size? compresses the code (~
– decompresses after fetched from instruction cache

Quiz
• Using MIPS notation, look at code for

void sum (int a[], int b[], int c[], int n)
{ int i;
  for (i=0; i<n; i++)
    c[i] = a[i]+b[i];
}
• MIPS code for loop
Loop: LD R11,0(R4) # R11 = a[i]
LD R12,0(R5) # R12 = b[i]
DADDU R17,R11,R12 # R17 = a[i]+b[i]
SD R17,0(R6) # c[i] = a[i]+b[i]
DADDIU R4,R4,8 # R4 = next a[] addr
DADDIU R5,R5,8 # R5 = next b[] addr
DADDIU R6,R6,8 # R6 = next c[] addr
BNE R4,R7,Loop # if not last go to Loop
• Then unroll 4 times and schedule
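Before any scheduling, the unrolling step can be pictured at the C level. The sketch below is an illustration, not the quiz answer: the name sum_unrolled4 is made up, and it assumes n is a multiple of 4 so that no cleanup loop is needed.

```c
#include <assert.h>

/* The body of sum() repeated 4 times per iteration.
   Assumption for this sketch: n is a non-negative multiple of 4,
   so no cleanup loop for leftover elements is required. */
void sum_unrolled4(const int a[], const int b[], int c[], int n)
{
    assert(n >= 0 && n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        /* Four independent adds: no iteration reads another's result,
           so a scheduler is free to interleave their loads and stores. */
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
}
```

Unrolling pays for itself twice: the three pointer increments and the branch are executed once per four elements instead of once per element, and the four independent bodies give a VLIW compiler enough operations to fill slots.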
Trimedia Version
Slot 1 Slot 2 Slot 3 Slot 4 Slot 5
LD R11,0(R4) LD R12,0(R5)
DADDIU R25,R6,32 LD R14,8(R4) LD R15,8(R5)
SETEQ R25,R25,R7 LD R19,16(R4) LD R20,16(R5)
DADDU R17,R11,R12 DADDIU R4,R4,32 LD R22,24(R4) LD R23,24(R5)
DADDU R18,R14,R15 JMPF R25,R30 SD R17, 0(R6)
DADDU R21,R19,R20 DADDIU R5,R5,32 SD R18, 8(R6)
DADDU R24,R22,R23 SD R21,16(R6)
DADDIU R6,R6,32 SD R24, 24(R6)

• Loop address in register 30

• Conditional jump (JMPF) so that only jump is conditional, not whole
instruction predicated
• DADDIU (1st slot, 2nd instr) and SETEQ (1st slot, 3rd instr) compute loop
termination test
– Duplicate last add early enough to schedule 3 instruction branch delay
• 24/40 slots used (60%) in this example
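Because the processor does not detect hazards, a schedule like this one is only correct if every consumer issues late enough after its producer. The following is a minimal sketch of that check. It assumes that a latency of L means L instructions must separate producer and consumer, which is an interpretation that matches the schedule above (LD has latency 2, and the DADDU that consumes R11 issues in instruction 4). The function names are illustrative.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdbool.h>

/* True if an operation in instruction `consumer_instr` may safely read a
   result produced in instruction `producer_instr` by a functional unit
   with the given latency (assumed meaning: `latency` instructions must
   lie between the two). Instructions are numbered 1, 2, 3, ... */
bool schedule_ok(int producer_instr, int consumer_instr, int latency)
{
    return consumer_instr - producer_instr > latency;
}

/* Number of no-op instructions the compiler must insert between a
   producer and its consumer when only `useful_between` other
   instructions are available to fill the gap. */
int nops_needed(int latency, int useful_between)
{
    int gap = latency - useful_between;
    return gap > 0 ? gap : 0;
}
```

With the latencies from the functional unit table, schedule_ok(1, 4, 2) holds for LD R11 feeding DADDU R17, and schedule_ok(4, 5, 0) holds for DADDU R17 feeding SD R17. Had the DADDU been placed in instruction 3, schedule_ok(1, 3, 2) would fail and the hardware would silently use a stale R11.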

[Figure: Clock cycles to execute 2D iDCT (y-axis: clock cycles) for Trimedia CPU64, PowerPC w. AltiVec, PA-8000 w. MAX2, Trimedia TM-1000, TI 320C620x, and Pentium II w. MMX]
• Note that the Trimedia results are based on compilation, unlike many of the others.
The year 2000 clock rate of the CPU64 is 300 MHz. The 1999 clock rates of the
others are about 400 MHz for the PowerPC, PA-8000, and Pentium II, with the
TM-1000 at 100 MHz and the TI 320C620x at 200 MHz.
Transmeta Crusoe MPU

• 80x86 instruction set compatibility through
a software system that translates from the
x86 instruction set to VLIW instruction set
implemented by Crusoe
• VLIW processor designed for the low-power
Crusoe processor: Basics

• VLIW with in-order execution

• 64 Integer registers
• 32 floating point registers
• Simple in-order, 6-stage integer pipeline:
2 fetch stages, 1 decode, 1 register read,
1 execution, and 1 register write-back
• 10-stage pipeline for floating point, which has 4
extra execute stages
• Instructions in 2 sizes: 64 bits (2 ops) and 128
bits (4 ops)
Crusoe processor: Operations

• 5 different types of operation slots:

• ALU operations: typical RISC ALU operations
• Compute: this slot may specify any integer ALU
operation (2 integer ALUs), a floating point operation,
or a multimedia operation
• Memory: a load or store operation
• Branch: a branch instruction
• Immediate: a 32-bit immediate used by another
operation in this instruction
• For 128-bit instr: 1st 3 are Memory, Compute, ALU;
last field either Branch or Immediate
80x86 Compatibility
• Initially, and for lowest latency to start
execution, the x86 code can be interpreted on
an instruction by instruction basis
• If a code segment is executed several times, it is
translated into an equivalent Crusoe code
sequence, and the translation is cached
– The unit of translation is at least a basic block, since we
know that if any instruction is executed in the block, they
will all be executed
– Translating an entire block both improves the translated
code quality and reduces the translation overhead, since
the translator need only be called once per basic block
• Assumes 16MB of main memory for cache
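The interpret-first, translate-when-hot policy described above can be sketched as a small state machine per basic block. This is a hedged illustration only, not Transmeta's implementation: HOT_THRESHOLD, CACHE_SLOTS, and all function and type names are assumptions made for the sketch (the slides only say "several times"), and the translation itself is reduced to setting a flag.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdbool.h>

#define HOT_THRESHOLD 3    /* assumed; the slides only say "several times" */
#define CACHE_SLOTS   64   /* assumed size, tiny for illustration */

typedef enum { RAN_INTERPRETED, RAN_TRANSLATED } run_mode;

typedef struct {
    unsigned long block_addr;  /* x86 address of the block's first instruction */
    unsigned      exec_count;  /* interpreted executions seen so far */
    bool          translated;  /* true once a translation is cached */
    bool          in_use;
} block_entry;

static block_entry cache[CACHE_SLOTS];   /* zero-initialized: all slots free */

/* Direct-mapped lookup; a conflicting block simply evicts the old entry. */
static block_entry *lookup(unsigned long addr)
{
    block_entry *e = &cache[addr % CACHE_SLOTS];
    if (!e->in_use || e->block_addr != addr) {
        e->block_addr = addr;
        e->exec_count = 0;
        e->translated = false;
        e->in_use = true;
    }
    return e;
}

/* Execute one basic block and report how it ran. */
run_mode execute_block(unsigned long addr)
{
    block_entry *e = lookup(addr);
    if (e->translated)
        return RAN_TRANSLATED;         /* reuse the cached translation */
    e->exec_count++;
    if (e->exec_count >= HOT_THRESHOLD)
        e->translated = true;          /* stand-in for translating the block once */
    return RAN_INTERPRETED;            /* this execution was still interpreted */
}
```

With these assumed values, the first three executions of a block run interpreted and the fourth uses the cached translation. A block that is executed only once never pays the translation cost, which is the point of starting with the interpreter.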
Exception Behavior during Speculation

• Crusoe support for speculative reordering consists
of 4 major parts:
1. shadowed register file
– Shadow discarded only when x86 instruction has no exception
2. program-controlled store buffer
– Only store when no exception; keep until OK to store
3. memory alias detection hardware with
speculative loads
4. conditional move instruction (called select) that
is used to do if-conversion on x86 code sequences
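If-conversion with a select operation (item 4) can be illustrated in C. The sketch below is an assumption-laden model: select32 stands in for the hardware select instruction, and a C compiler is free to compile its ternary to a branch, so the code shows only the shape of the transformation.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdint.h>

/* Model of a select (conditional move): both inputs are already computed,
   and the condition only chooses which one is kept. */
static int32_t select32(int cond, int32_t if_true, int32_t if_false)
{
    return cond ? if_true : if_false;
}

/* Before if-conversion: the assignment is control dependent on a branch. */
int32_t max_branch(int32_t a, int32_t b)
{
    int32_t r = b;
    if (a > b)
        r = a;
    return r;
}

/* After if-conversion: the branch becomes a data dependence on `cond`,
   leaving straight-line code that a VLIW scheduler can reorder freely. */
int32_t max_ifconv(int32_t a, int32_t b)
{
    int cond = (a > b);
    return select32(cond, a, b);
}
```

Both versions return the same value for every input, for instance 7 for (3, 7) and for (7, 3). The converted form has no branch to mispredict and no basic block boundary to limit scheduling.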
Crusoe Performance?
• Crusoe depends on realistic behavior to tune the
code translation process, so it will not perform in a
predictable manner when benchmarked using simple,
but unrealistic scripts
– Needs idle time to translate
– Profiling to find hot spots
• To remedy this factor, Transmeta has proposed a
new set of benchmark scripts
– Unfortunately, these scripts have not been released and
endorsed by either a group of vendors or an independent entity
Real Time, so comparison is Energy
Energy consumption for the workload (watt-hours)
Workload description | Mobile Pentium III @ 500 MHz | TM 3200 @ 400 MHz, 1.5V | Relative consumption TM 3200 / Mobile Pentium III
MP3 | 0.672 | 0.214 | 0.32
DVD | 1.13 | 0.479 | 0.42
Crusoe Applications?

• Notebook: Sony, others

• Compact Servers: RLX Technologies
VLIW Readings
• Josh Fisher 1983 Paper + 1998 Retrospective
• What are characteristics of VLIW?
• Is ELI-512 the first VLIW?
– How many bits in instruction of ELI-512?
• What is breakthrough?
• What expected speedup over RISC?
• What is wrong with vector?
• What benchmark results on code size,
• What limited speedups to 5X to 10X?
• What other problems faced ELI-512?
• In retrospect, what wished changed?
• In retrospect, what naïve about?