Advanced Computer Architecture: Tran Ngoc Thinh HCMC University of Technology

1
BK
TP.HCM
2010
dce
Advanced Computer
Architecture
Tran Ngoc Thinh
HCMC University of Technology
http://www.cse.hcmut.edu.vn/~tnthinh/aca
2010
dce
Advanced Computer Architecture
2
Class
Time and venue: Thursdays, 6:30am - 09:00am, 605B4
Web page:
http://www.cse.hcmut.edu.vn/~tnthinh/aca
Textbook:
John Hennessy, David Patterson, Computer Architecture: A
Quantitative Approach, 3rd edition, Morgan Kaufmann Publisher, 2003
Stallings, William, Computer Organization and Architecture, 7th
edition, Prentice Hall International, 2006
Kai Hwang, Advanced Computer Architecture : Parallelism,
Scalability, Programmability, McGraw-Hill, 1993
Kai Hwang & F. A. Briggs, Computer Architecture and Parallel
Processing, McGraw-Hill, 1989
Research papers on Computer Design and Architecture from IEEE and
ACM conferences, transactions and journals
Administrative Issues
2
2010
dce
3
Grades
10% homeworks
20% presentations
20% midterm exam
50% final exam
Administrative Issues (cont.)
2010
dce
4
Administrative Issues (cont.)
Personnel
Instructor: Dr. Tran Ngoc Thinh
Email: tnthinh@cse.hcmut.edu.vn
Phone: 8647256 (5843)
Office: A3 building
Office hours: Thursdays, 09:00-11:00
TA: Mr. Tran Huy Vu
Email:vutran@cse.hcmut.edu.vn
Phone: 8647256 (5843)
Office: A3 building
Office hours:
3
2010
dce
Course Coverage
Introduction
Brief history of computers
Basic concepts of computer architecture.
Instruction Set Principle
Classifying Instruction Set Architectures
Addressing Modes,Type and Size of Operands
Operations in the Instruction Set, Instructions for Control
Flow, Instruction Format
The Role of Compilers
5
2010
dce
Course Coverage
Pipelining: Basic and Intermediate Concepts
Organization of pipelined units,
Pipeline hazards,
Reducing branch penalties, branch prediction strategies.
Instructional Level Parallelism
Temporal partitioning
List-scheduling approach
Integer Linear Programming
Network Flow
Spectral methods
Iterative improvements
6
4
2010
dce
Course Coverage
Memory Hierarchy Design
Memory hierarchy
Cache memories
Virtual memories
Memory management.
SuperScalar Architectures
Instruction level parallelism and machine parallelism
Hardware techniques for performance enhancement
Limitations of the superscalar approach
Vector Processors
7
2010
dce
Course Requirements
Computer Organization & Architecture
Comb./Seq. Logic, Processor, Memory, Assembly
Language
Data Structures / Algorithms
Complexity analysis, efficient implementations
Operating Systems
Task scheduling, management of processors,
memory, input/output devices
8
5
2010
dce
1950s to 1960s: Computer Architecture Course: Computer
Arithmetic
1970s to mid 1980s: Computer Architecture Course:
Instruction Set Design, especially ISA appropriate for
compilers
1990s: Computer Architecture Course:
Design of CPU, memory system, I/O system,
Multiprocessors, Networks
2000s: Multi-core design, on-chip networking, parallel
programming paradigms, power reduction
2010s: Computer Architecture Course: Self adapting
systems? Self organizing structures?
DNA Systems/Quantum Computing?
Computer Architectures Changing Definition
9
2010
dce
Role of a computer architect:
To design and engineer the various levels
of a computer system to maximize
performance and programmability within
limits of technology and cost
Computer Architecture
10
6
2010
dce
S/W and H/W consists of hierarchical layers of abstraction,
each hides details of lower layers from the above layer
The instruction set arch. abstracts the H/W and S/W
interface and allows many implementation of varying cost
and performance to run the same S/W
Levels of Abstraction
Instruction Set Architecture
Applications
Operating System
Firmware
Compiler
Instruction Set Processor I/O System
Datapath & Control
Digital Design
Circuit Design
Layout
11
2010
dce
The Task of Computer Designer
determine what attribute are important for a
new machine
design a machine to maximize cost
performance
What are these Task?
instruction set design
function organization
logic design
implementation
IC design, packaging, power, cooling.

12
7
2010
dce
History
Big Iron Computers:
Used vacuum tubes, electric relays and bulk magnetic
storage devices. No microprocessors. No memory.
Example: ENIAC (1945), IBM Mark 1 (1944
13
2010
dce
History
Von Newmann:
Invented EDSAC (1949).
First Stored Program Computer. Uses Memory.
Importance: We are still using The same basic
design.
14
8
2010
dce
The Processor Chip
15
2010
dce
Advanced Computer Architecture 16
Intel 4004 Die Photo
Introduced in 1970
First microprocessor
2,250 transistors
12 mm
2
108 KHz
9
2010
dce
Intel 8086 Die Scan
29,0000 transistors
33 mm
2
5 MHz
Introduced in 1979
Basic architecture of
the IA32 PC
2010
dce
Intel 80486 Die Scan
1,200,000
transistors
81 mm
2
25 MHz
Introduced in 1989
1
st
pipelined
implementation of
IA32
10
2010
dce
Pentium Die Photo
3,100,000
transistors
296 mm
2
60 MHz
Introduced in 1993
1
st
superscalar
implementation of
IA32
2010
dce
Pentium III
9,5000,000
transistors
125 mm
2
450 MHz
Introduced in 1999
11
2010
dce
Moores Law
Cramming More Components onto Integrated Circuits
Gordon Moore, Electronics, 1965
# on transistors on cost-effective integrated circuit double every 18 months
2010
dce
Performance Trend
In general,
tradeoffs
should
improve
performance
The natural
idea here
HW cheaper,
easier to
manufacture
can make
our processor
do more
things
22
12
2010
dce
Price Trends (Pentium III)
23
2010
dce
Price Trends (DRAM memory)
24
13
2010
dce
Technology constantly on the move!
Num of transistors not limiting factor
Currently ~ 1 billion transistors/chip
Problems:
Too much Power, Heat, Latency
Not enough Parallelism
3-dimensional chip technology?
Sandwiches of silicon
Through-Vias for communication
On-chip optical connections?
Power savings for large packets
The Intel Core i7
microprocessor (Nehalem)
4 cores/chip
45 nm, Hafnium hi-k dielectric
731M Transistors
Shared L3 Cache - 8MB
L2 Cache - 1MB (256K x 4)
Nehalem
25
2010
dce
1
10
100
1000
10000
1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006
P
e
r
f
o
r
m
a
n
c
e

(
v
s
.

V
A
X
-
1
1
/
7
8
0
)

25%/year
52%/year
??%/year
Crossroads: Uniprocessor Performance
VAX : 25%/year 1978 to 1986
RISC + x86: 52%/year 1986 to 2002
RISC + x86: ??%/year 2002 to present
From Hennessy and Patterson,
Computer Architecture: A Quantitative
Approach, 4th edition, October, 2006
26
14
2010
dce
Limiting Force: Power Density
27
2010
dce
Old Conventional Wisdom: Power is free, Transistors expensive
New Conventional Wisdom: Power wall Power expensive, Xtors free
(Can put more on chip than can afford to turn on)
Old CW: Sufficiently increasing Instruction Level Parallelism via compilers,
innovation (Out-of-order, speculation, VLIW, )
New CW: ILP wall law of diminishing returns on more HW for ILP
Old CW: Multiplies are slow, Memory access is fast
New CW: Memory wall Memory slow, multiplies fast
(200 clock cycles to DRAM memory, 4 clocks for multiply)
Old CW: Uniprocessor performance 2X / 1.5 yrs
New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
Uniprocessor performance now 2X / 5(?) yrs
Sea change in chip design: multiple cores
(2X processors per chip / ~ 2 years)
More power efficient to use a large number of simpler processors
rather than a small number of complex processors
Crossroads: Conventional Wisdom in Comp. Arch
28
15
2010
dce
Sea Change in Chip Design
Intel 4004 (1971):
4-bit processor,
2312 transistors, 0.4 MHz,
10 m PMOS, 11 mm
2
chip
RISC II (1983):
32-bit, 5 stage
pipeline, 40,760 transistors, 3 MHz,
3 m NMOS, 60 mm
2
chip
125 mm
2
chip, 65 nm CMOS
= 2312 RISC II+FPU+Icache+Dcache
RISC II shrinks to ~ 0.02 mm
2
at 65 nm
Caches via DRAM or 1 transistor SRAM (www.t-ram.com) ?
Proximity Communication via capacitive coupling at > 1 TB/s ?
(Ivan Sutherland @ Sun / Berkeley)
Processor is the new transistor?
29
2010
dce
ManyCore Chips: The future is here
ManyCore refers to many processors/chip
64? 128? Hard to say exact boundary
How to program these?
Use 2 CPUs for video/audio
Use 1 for word processor, 1 for browser
76 for virus checking???
Something new is clearly needed here
Intel 80-core multicore chip (Feb 2007)
80 simple cores
Two FP-engines / core
Mesh-like network
100 million transistors
65nm feature size
Intel Single-Chip Cloud
Computer (August 2010)
24 tiles with two IA
cores per tile
24-router mesh network
with 256 GB/s bisection
4 integrated DDR3 memory controllers
Hardware support for message-passing
30
16
2010
dce
The End of the Uniprocessor Era
Single biggest change in the history of
computing systems
31
2010
dce
The End of the Uniprocessor Era
Multiprocessors imminent in 1970s, 80s, 90s,
todays processors are nearing an impasse as technologies
approach the speed of light..
David Mitchell, The Transputer: The Time Is Now (1989)
Custom multiprocessors strove to lead uniprocessors
Procrastination rewarded: 2X seq. perf. / 1.5 years
We are dedicating all of our future product development to
multicore designs. This is a sea change in computing
Paul Otellini, President, Intel (2004)
Difference is all microprocessor companies switch to multicore
(AMD, Intel, IBM, Sun; all new Apples 2-4 CPUs)
Procrastination penalized: 2X sequential perf. / 5 yrs
Biggest programming challenge: 1 to 2 CPUs
32
17
2010
dce
Problems with Sea Change
Algorithms, Programming Languages, Compilers,
Operating Systems, Architectures, Libraries, not ready
to supply Thread Level Parallelism or Data Level
Parallelism for 1000 CPUs / chip
Need whole new approach
People have been working on parallelism for over 50 years without
general success
Architectures not ready for 1000 CPUs / chip
Unlike Instruction Level Parallelism, cannot be solved by just by
computer architects and compiler writers alone, but also cannot be
solved without participation of computer architects
PARLab: Berkeley researchers from many backgrounds
meeting since 2005 to discuss parallelism
Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John
Kubiatowicz, Edward Lee, George Necula, Dave Patterson, Koushik
Sen, John Shalf, John Wawrzynek, Kathy Yelick,
Circuit design, computer architecture, massively parallel computing,
computer-aided design, embedded hardware and software,
programming languages, compilers, scientific programming, and
numerical analysis
33
2010
dce
Computer Design Cycle
Performance
Technology
and Cost
Evaluate Existing
Systems for
Bottlenecks
Simulate New
Designs and
Organizations
Implement Next
Generation System
Benchmarks
Workloads
Implementation
Complexity
34
18
2010
dce
Evaluate Existing
Systems for
Bottlenecks
Benchmarks
Performance
Technology and cost
The computer design is evaluated for bottlenecks using
certain benchmarks to achieve the optimum performance..
1
35
2010
dce
Performance (Metric)
Time/Latency: The wall clock or CPU elapsed
time.
Throughput: The number of results per second.
Other measures such as MIPS, MFLOPS, clock frequency
(MHz), cache size do not make any sense.
36
19
2010
dce
Performance (Measuring Tools)
Benchmarks:
Hardware: Cost, delay, area, power
consumption
Simulation (at levels - ISA, RT, Gate,
Circuit)
Queuing Theory
Fundamental Laws/Principles
37
2010
dce
Evaluate Existing Systems for Bottlenecks
using Benchmarks
1: Performance
Simulate New Designs
and Organizations
Workloads
2: Technology
The Technology Trends motivate new designs. These designs are
simulated to evaluate the performance for different levels of
workloads. Simulation helps in keeping the result verification
38
20
2010
dce
Technology Trends: Computer Generations
Vacuum tube 1946-1957 1
st
Gen.
Transistor - 1958-1964 2
nd
Gen.
Small scale integration 1965-1968
Up to 100 devices/chip
Medium scale integration 1969-1971 3
rd
Gen.
100-3,000 devices/chip
Large scale integration 1972-1977
3,000 - 100,000 devices/chip
Very large scale integration 1978 on.. 4
th
Gen.
100,000 - 100,000,000 devices/chip
Ultra large scale integration
Over 100,000,000 devices/chip
39
2010
dce
Implement Next Generation System Implement Next Generation System
Implementation Complexity
3: Cost
1: Performance
2: Technology
The systems are implemented using the
latest technology to obtain cost effective,
high performance solution - the
implementation complexities are given due
consideration
40
21
2010
dce
Price Verses Cost
The relationship between cost and price is
complex one
The cost is the total amount spends to produce a
product
The price is the amount for which a finished good
is sold.
The cost passes through different stages before it
becomes price.
A small change in cost may have a big impact on
price
41
2010
dce
Price vs. Cost
Manufacturing Costs: Total amount spent to produce a
component
- Component Cost: Cost at which the components are
available to the designer. - It ranges from 40% to 50% of
the list price of the product.
- Direct cost (Recurring costs): Labor, purchasing
scrap, warranty 4% - 16 % of list price
- Gross margin Non-recurring cost: R&D,
marketing, sales, equipment, rental, maintenance,
financing cost, pre-tax profits, taxes
42
22
2010
dce
Price vs. Cost
List Price:
Amount for which the finished good is sold;
it includes Average Discount of 15% to 35% of the as
volume discounts and/or retailer markup
0%
20%
40%
60%
80%
100%
Mini W/S PC
Average Di scount
Gross Margi n
Di rect Costs
Component Costs
43
2010
dce
Cost-effective IC Design: Price-Performance Design
Yield: Percentage of manufactured components
surviving testing
Volume: increases manufacturing hence decreases
the list price and improves the purchasing efficiency
Feature Size: the minimum size of a transistor or wire
in either x or y direction
44
23
2010
dce
Reduction in feature size from 10 microns in
1971 and 0.045 in 2008 has resulted in:
- Quadratic rise in transistor count
- Linear increase in performance
- 4-bit to 64-bit microprocessor
- Desktops have replaced time-sharing
machines
Cost-effective IC Design: Price-Performance Design
45
2010
dce
Cost of Integrated Circuits
Manufacturing Stages:
The Integrated circuit manufacturing passes
through many stage:
Wafer growth and testing
Wafer chopping it into dies
Packaging the dies to chips
Testing a chip.
46
24
2010
dce
Die: is the square area of the wafer containing the
integrated circuit
See that while fitting dies on the wafer the small wafer area
around the periphery goes waist
Cost of a die: The cost of a die is determined from cost of
a wafer; the number of dies fit on a wafer and the
percentage of dies that work, i.e., the yield of the die.
47
2010
dce
The cost of integrated circuit can be determined as ratio of
the total cost; i.e., the sum of the costs of die, cost of testing
die, cost of packaging and the cost of final testing a chip; to
the final test yield.
Cost of IC=
die cost + die testing cost + packaging cost + final testing cost
final test yield
The cost of die is the ratio of the cost of the wafer to the
product of the dies per wafer and die yield
Die cost = Cost of wafer
dies per wafer x die yield
48
25
2010
dce
The number of dies per wafer is determined by the dividing
the wafer area (minus the waist wafer area near the round
periphery) by the die area
Dies per wafer =
(wafer diameter/2)
2
(wafer diameter)
die area 2 x die area
Example: For die of 0.7 cm on a side, find the number
of dies per wafer of 30 cm diameter
Answer:
[Wafer area / Die Area] - Wafer Waist area
= (30/2)2 / 0.49 - (30) / (2 x 0.49)
= 1347 dies
49
2010
dce
Calculating Die Yield
Die yield is the fraction or percentage of good dies on a
wafer number
Wafer yield accounts for completely bad wafers so need not
be tested
Wafer yield corresponds to on defect density by which
depends on number of masking levels good estimate for
CMOS is 4.0
Example:
The yield of a die, 0.7cm on a side, with defect density of 0.6/cm2
= (1+[0.6x0.49]/4.0)
-4
= 0.75
50
o
o

)
`

+ =
Area Die Area) t Defect/Uni (
1 Yield Wafer DieYield
26
2010
dce
Volume vs. Cost
Rule of thumb on applying learning curve to
manufacturing:
When volume doubles, costs reduce 10%
A DEC View of Computer Engineering by C. G. Bell, J. C. Mudge, and
J. E. McNamara, Digital Press, Bedford, MA., 1978.
51
1990 1992 1994 1997
PC 23,880,898 33,547,589 44,006,000 65,480,000
WS 407,624 584,544 679,320 978,585
Ratio 59 57 65 67
2
x
= 65 => X = 6.0
Since doubling value reduces cost by 10%, costs reduces to

(0.9)
6.0
= 0.53 of the original price.
PC costs are 47% less than workstation costs for whole market.
2010
dce
High Margins on High-End Machines
R&D considered return on investment (ROI) 10%
Every $1 R&D must generate $7 to $13 in sales
High end machines need more $ for R&D
Sell fewer high end machines
Fewer to amortize R&D
Much higher margins
Cost of 1 MB Memory (January 1994):
PC $40 (Mac Quadra)
WS $42 (SS-10)
Mainframe $1920 (IBM 3090)
Supercomputer $600 (M90 DRAM)
$1375 (C90 15 ns SRAM)
52
27
2010
dce
Recouping Development Cost on Low Volume
Microprocessors?
Hennessy says MIPS R4000 cost $30M to develop
Intel rumored to invest $100M on 486
SGI/MIPS sells 300,000 R4000s over product lifetime?
Intel sells 50,000,000 486s?
Intel must get $100M from chips ($2/chip)
SGI/MIPS can get $30M from margin of workstations vs.
chips vs. $100/chip
Alternative: SGI buys chips vs. develops them
53
2010
dce
Metrics of Performance
MIPS: Millions of Instructions per second
MFLOPS: millions of FP operations per sec.
Cycles per second (clock rate)
Megabytes per second
Compiler
Programming
Language
Application
Instruction Set Architecture
Answers per month
Operations per second
Datapath
Control
Transistors
Wire I/O Pins/
Function Units
54
28
2010
dce
Does Anybody Really Know What Time it is?
User CPU Time (Time spent in program): 90.7 sec
System CPU Time (Time spent in OS): 12.9 sec
Elapsed Time (Response Time = 2 min 39 sec =159
Sec.)
(90.7+12.9)/159 * 100 = 65%, % of lapsed time that is
CPU time. 45% of the time spent in I/O or running
other programs
UNIX Time Command: 90.7u 12.9s 2:39 65%
55
2010
dce
Time
CPU time
time the CPU is computing
not including the time waiting for I/O or running other
program
User CPU time
CPU time spent in the program
System CPU time
CPU time spent in the operating system performing
task requested by the program decrease execution
time
CPU time = User CPU time + System CPU time
56
29
2010
dce
Performance
System Performance
elapsed time on unloaded system
CPU performance
user CPU time on an unloaded system
57
2010
dce
Two notions of performance
Time to do the task (Execution Time)
execution time, response time, latency
Tasks per day, hour, week, sec, ns. ..
throughput, bandwidth
Response time and throughput often are in opposition
DC to Paris
6.5 hours
3 hours
Plane
Boeing 747
BAD/Sud
Concorde
Speed
610 mph
1350 mph
Passengers
470
132
Throughput
286,700
178,200
Which has higher performance?
58
30
2010
dce
Example
Time of Concorde vs. Boeing 747?
Concord is 1350 mph / 610 mph = 2.2 times faster
= 6.5 hours / 3 hours
Throughput of Concorde vs. Boeing 747 ?
Concord is 178,200 pmph / 286,700 pmph = 0.62
times faster
Boeing is 286,700 pmph / 178,200 pmph = 1.6 times
faster
Boeing is 1.6 times (60%)faster in terms of throughput
Concord is 2.2 times (120%) faster in terms of flying time
We will focus primarily on execution time for a single job
59
2010
dce
Computer Performance Measures: Computer Performance Measures: Program Program
Execution Time Execution Time ((11//22))
For a specific program compiled to run on a
specific machine (CPU) A, the following
parameters are provided:
The total instruction count of the program.
The average number of cycles per instruction
(average CPI).
Clock cycle of machine A
60
31
2010
dce
Computer Performance Measures: Computer Performance Measures: Program Program
Execution Time Execution Time ((22//22))
How can one measure the performance of
this machine running this program?
Intuitively the machine is said to be faster or has better
performance running this program if the total execution time is
shorter.
Thus the inverse of the total measured program execution
time is a possible performance measure or metric:
Performance
A
= 1 / Execution Time
A
How to compare performance of different machines?
What factors affect performance? How to improve
performance?!!!!
61
2010
dce
Comparing Computer Performance Using Execution
Time
To compare the performance of two machines (or CPUs)
A, B running a given specific program:
Performance
A
A
Performance
B
B
Machine A is n times faster than machine B means (or
slower? if n < 1) :
Speedup = n = =
Performance
A
Performance
B
Execution Time
B
Execution Time
A
62
32
2010
dce
Example
For a given program:
Execution time on machine A: Execution
A
= 1 second
Execution time on machine B: Execution
B
= 10 seconds
The performance of machine A is 10 times the performance of
machine B when running this program, or: Machine A is said to be
10 times faster than machine B when running this program.
The two CPUs may target different ISAs provided
the program is written in a high level language (HLL)
10
1
10

A
Time Execution
B
Time Execution

B
e Performanc
A
e Performanc
Speedup
= =
= =
63
2010
dce
CPU Execution Time: The CPU Equation
A program is comprised of a number of instructions executed , I
Measured in: instructions/program
The average instruction executed takes a number of cycles per
instruction (CPI) to be completed.
Measured in: cycles/instruction, CPI
CPU has a fixed clock cycle time C = 1/clock rate
Measured in: seconds/cycle
CPU execution time is the product of the above three parameters as
follows:
CPU time = Seconds = Instructions x Cycles x Seconds CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
T = I x CPI x C
execution Time
per program in seconds
Number of
instructions executed
Average CPI for program CPU Clock Cycle
64
33
2010
dce
CPU Execution Time: Example
A Program is running on a specific machine with the
following parameters:
Total executed instruction count: 10,000,000 instructions
Average CPI for the program: 2.5 cycles/instruction.
CPU clock rate: 200 MHz. (clock cycle = 5x10
-9
seconds)
What is the execution time for this program:
CPU time = Instruction count x CPI x Clock cycle
= 10,000,000 x 2.5 x 1 / clock rate
= 10,000,000 x 2.5 x 5x10
-9
= .125 seconds
65
2010
dce
Factors Affecting CPU Performance Factors Affecting CPU Performance
CPI Clock Cycle C
Instruction
Count I
Program
Compiler
Organization
(CPU Design)
Technology
(VLSI)
Instruction Set
Architecture (ISA)
X
X
X
X
X
X
X X
X
66
34
2010
dce
Performance Comparison: Example
From the previous example: A Program is running on a
specific machine with the following parameters:
Total executed instruction count, I: 10,000,000 instructions
Average CPI for the program: 2.5 cycles/instruction.
CPU clock rate: 200 MHz.
Using the same program with these changes:
A new compiler used: New instruction count 9,500,000
New CPI: 3.0
Faster CPU implementation: New clock rate = 300 MHZ
What is the speedup with the changes?
Speedup = (10,000,000 x 2.5 x 5x10
-9
) / (9,500,000 x 3 x 3.33x10
-9
)
= .125 / .095 = 1.32
or 32 % faster after changes.
Speedup = Old Execution Time = I x CPI x Clock cycle Speedup = Old Execution Time = I
old
x CPI
old
x Clock cycle
old
New Execution Time I
new
x CPI
new
x Clock Cycle
new
67
2010
dce
Instruction Types & CPI
Given a program with n types or classes of instructions
executed on a given CPU with the following characteristics:
C
i
= Count of instructions of type
i
CPI
i
= Cycles per instruction for type
i
Then: Then:
CPI = CPU Clock Cycles / Instruction Count I
Where:
Instruction Count I = E C
i
( )
CPU clock cycles
i i
i
n
CPI C
=
=
1
i = 1, 2, . n
35
2010
dce
Instruction Types & CPI: An Example
An instruction set has n= three instruction classes:
Two code sequences have the following instruction counts:
CPU cycles for sequence 1 = 2 x 1 + 1 x 2 + 2 x 3 = 10 cycles
CPI for sequence 1 = clock cycles / instruction count = 10 /5 = 2
CPU cycles for sequence 2 = 4 x 1 + 1 x 2 + 1 x 3 = 9 cycles
CPI for sequence 2 = 9 / 6 = 1.5
Instruction class CPI
A 1
B 2
C 3
Instruction counts for instruction class
Code Sequence A B C
1 2 1 2
2 4 1 1
For a specific
CPU design
69
2010
dce
Instruction Frequency & CPI
Given a program with n types or classes of instructions
with the following characteristics:
C
i
= Count of instructions of type
i
CPI
i
= Average cycles per instruction of type
i
F
i
= Frequency or fraction of instruction type
i
executed
= C
i
/ total executed instruction count = C
i
/ I
Then:
( )
=
=
n
i
i i F CPI
CPI
1
Fraction of total execution time for instructions of type i =
CPI
i
x F
i
CPI
70
36
2010
dce
Instruction Type Frequency & CPI: Instruction Type Frequency & CPI: A A RISC Example RISC Example
Typical Mix
Base Machine (Reg / Reg)
Op Freq, F
i
CPI
i
CPI
i
x F
i
% Time
ALU 50% 1 .5 23% = .5/2.2
Load 20% 5 1.0 45% = 1/2.2
Store 10% 3 .3 14% = .3/2.2
Branch 20% 2 .4 18% = .4/2.2
( )
=
=
n
i
i i F CPI
CPI
1
CPI
i
x F
i
CPI
Sum = 2.2
Program Profile or Executed Instructions Mix
71
2010
dce
Performance Terminology
X is n% faster than Y means:
ExTime(Y) Performance(X) n
--------- = -------------- = 1 + -----
ExTime(X) Performance(Y) 100
n = 100(Performance(X) - Performance(Y))
Performance(Y)
n = 100(ExTime(Y) - ExTime(X))
ExTime(X)
Example: Y takes 15 seconds to complete a task,
X takes 10 seconds. What % faster is X?
n = 100(15 - 10) = 50%
10
72
37
2010
dce
Speedup
Speedup due to enhancement E:
ExTime w/o E Performance w/ E
Speedup(E) = ------------- = -------------------
ExTime w/ E Performance w/o E
Suppose that enhancement E accelerates a fraction
enhanced
of the task by a factor Speedup
enhanced
, and the
remainder of the task is unaffected, then what is
ExTime(E) = ?
Speedup(E) = ?
73
2010
dce
Amdahls Law
States that the performance improvement to be gained
from using some faster mode of execution is limited by
the fraction of the time faster mode can be used
Speedup = Performance for entire task using the enhancement
Performance for the entire task without using the enhancement
or Speedup = Execution time without the enhancement
Execution time for entire task using the enhancement
ExTime
new
= ExTime
old
x (1 - Fraction
enhanced
) + Fraction
enhanced
Speedup
overall
=
ExTime
old
ExTime
new
Speedup
enhanced
=
1
(1 - Fraction
enhanced
) + Fraction
enhanced
Speedup
enhanced
74
38
2010
dce
Example of Amdahls Law
Floating point instructions improved to run 2X; but
only 10% of actual instructions are FP
Speedup
overall =
1
0.95
= 1.053
ExTime
new
= ExTime
old
x (0.9 + .1/2) = 0.95 x ExTime
old
75
2010
dce
Performance Enhancement Calculations: Amdahl's Performance Enhancement Calculations: Amdahl's
Law Law
The performance enhancement possible due to a given design
improvement is limited by the amount that the improved feature is
used Amdahls Law:
Performance improvement or speedup due to enhancement E:
Execution Time without E Performance with E
Speedup(E) = -------------------------------------- = ---------------------------------
Execution Time with E Performance without E
Suppose that enhancement E accelerates a fraction F of the execution
time by a factor S and the remainder of the time is unaffected then:
Execution Time with E = ((1-F) + F/S) X Execution Time without E
Hence speedup is given by:
Execution Time without E 1
Speedup(E) = --------------------------------------------------------- = --------------------
((1 - F) + F/S) X Execution Time without E (1 - F) + F/S
76
39
2010
dce
Pictorial Depiction of Amdahls Law Pictorial Depiction of Amdahls Law
Before:
Execution Time without enhancement E: (Before enhancement is applied)
After:
Execution Time with enhancement E:
Enhancement E accelerates fraction F of original execution time by a factor of S
Unaffected fraction: (1- F) Affected fraction: F
Unaffected fraction: (1- F) F/S
Unchanged
Execution Time without enhancement E 1
Speedup(E) = ------------------------------------------------------ = ------------------
Execution Time with enhancement E (1 - F) + F/S
shown normalized to 1 = (1-F) + F =1
77
2010
dce
Performance Enhancement Example Performance Enhancement Example
For the RISC machine with the following instruction mix given earlier:
Op Freq Cycles CPI(i) % Time
ALU 50% 1 .5 23%
Load 20% 5 1.0 45%
Store 10% 3 .3 14%
Branch 20% 2 .4 18%
If a CPU design enhancement improves the CPI of load instructions
from 5 to 2, what is the resulting performance improvement from this
enhancement:
Fraction enhanced = F = 45% or .45
Unaffected fraction = 100% - 45% = 55% or .55
Factor of enhancement = 5/2 = 2.5
Using Amdahls Law:
1 1
Speedup(E) = ------------------ = --------------------- = 1.37
(1 - F) + F/S .55 + .45/2.5
78
40
2010
dce
An Alternative Solution Using CPU Equation
Op Freq Cycles CPI(i) % Time
ALU 50% 1 .5 23%
Load 20% 5 1.0 45%
Store 10% 3 .3 14%
Branch 20% 2 .4 18%
If a CPU design enhancement improves the CPI of load instructions from 5 to 2,
what is the resulting performance improvement from this enhancement:
Old CPI = 2.2
New CPI = .5 x 1 + .2 x 2 + .1 x 3 + .2 x 2 = 1.6
Original Execution Time Instruction count x old CPI x clock cycle
Speedup(E) = ----------------------------------- = ----------------------------------------------------------------
New Execution Time Instruction count x new CPI x clock cycle
old CPI 2.2
= ------------ = --------- = 1.37
new CPI 1.6
Which is the same speedup obtained from Amdahls Law in the first solution.
79
2010
dce
Extending Amdahl's Law To Multiple Enhancements
Suppose that enhancement E
i
accelerates a fraction F
i
of the execution time by a factor S
i
and the remainder
of the time is unaffected then:

+
=
i i
i
i
i
X
S
F
F
Speedup
Time Execution Original ) 1
Time Execution Original
) ((

+
=
i i
i
i
i
S
F
F
Speedup
) ( ) 1
1
(
Note: All fractions F
i
refer to original execution time before the
enhancements are applied.
.
80
41
2010
dce
Amdahl's Law With Multiple Enhancements: Example
Three CPU performance enhancements are proposed with the following
speedups and percentage of the code execution time affected:
Speedup
1
= S
1
= 10 Percentage
1
= F
1
= 20%
Speedup
2
= S
2
= 15 Percentage
1
= F
2
= 15%
Speedup
3
= S
3
= 30 Percentage
1
= F
3
= 10%
While all three enhancements are in place in the new design, each
enhancement affects a different portion of the code.
What is the resulting overall speedup?
Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30)]
= 1 / [ .55 + .0333 ]
= 1 / .5833 = 1.71

+
=
i i
i
i
i
S
F
F
Speedup
) ( ) 1
1
(
81
2010
dce
Pictorial Depiction of Example
Before:
Execution Time with no enhancements: 1
After:
Execution Time with enhancements: .55 + .02 + .01 + .00333 = .5833
Speedup = 1 / .5833 = 1.71
Note: All fractions (F
i
, i = 1, 2, 3) refer to original execution time.
Unaffected, fraction: .55
Unchanged
Unaffected, fraction: .55
F
1
= .2 F
2
= .15 F
3
= .1
S
1
= 10 S
2
= 15 S
3
= 30
/ 10
/ 30 / 15
82
42
2010
dce
Computer Performance Measures: MIPS Rating (1/3)
For a specific program running on a specific CPU the MIPS rating is a
measure of how many millions of instructions are executed per
second:
MIPS Rating = Instruction count / (Execution Time x 10
6
)
= Instruction count / (CPU clocks x Cycle time x 10
6
)
= (Instruction count x Clock rate) / (Instruction
count x CPI x 10
6
)
= Clock rate / (CPI x 10
6
)
Major problem with MIPS rating: As shown above the MIPS rating
does not account for the count of instructions executed (I).
A higher MIPS rating in many cases may not mean higher
performance or better execution time. i.e. due to compiler design
variations.
85
2010
dce
In addition the MIPS rating:
Does not account for the instruction set architecture
(ISA) used.
Thus it cannot be used to compare computers/CPUs with
different instruction sets.
Easy to abuse: Program used to get the MIPS rating is
often omitted.
Often the Peak MIPS rating is provided for a given CPU which
is obtained using a program comprised entirely of instructions
with the lowest CPI for the given CPU design which does not
represent real programs.
86
43
2010
dce
Under what conditions can the MIPS rating be used to compare
performance of different CPUs?
The MIPS rating is only valid to compare the performance of
different CPUs provided that the following conditions are satisfied:
1 The same program is used
(actually this applies to all performance metrics)
2 The same ISA is used
3 The same compiler is used
(Thus the resulting programs used to run on the CPUs and obtain
the MIPS rating are identical at the machine code level including the
same instruction count)
87
2010
dce
A MIPS Example (1)
Consider the following computer:
Code from: A B C
Compiler 1 5 1 1
Compiler 2 10 1 1
Instruction counts (in millions) for each
instruction class
The machine runs at 100MHz.
Instruction A requires 1 clock cycle, Instruction B requires 2
clock cycles, Instruction C requires 3 clock cycles.
E CPI
i
x C
i
i =1
n
CPI =
Instruction Count
CPU Clock Cycles
Instruction Count
= !
Note
important
formula!
88
44
2010
dce
A MIPS Example (2)
CPI
1
=
(5 + 1 + 1) x 10
6
[(5x1) + (1x2) + (1x3)] x 10
6
10/7 = 1.43 =
MIPS
1
=
1.43
100 MHz
69.9 =
CPI
2
=
(10 + 1 + 1) x 10
6
[(10x1) + (1x2) + (1x3)] x 10
6
15/12 = 1.25 =
MIPS
2
=
1.25
100 MHz
80.0 =
So, compiler 2 has a higher
MIPS rating and should be
faster?
count cycles
cycles
89
2010
dce
A MIPS Example (3)
Now lets compare CPU time:
CPU Time =
Clock Rate
Instruction Count x CPI
= 0.10 seconds CPU Time
1
=
100 x 10
6
7 x 10
6
x 1.43
= 0.15 seconds
CPU Time
2
=
100 x 10
6
12 x 10
6
x 1.25
Therefore program 1 is faster despite a lower MIPS!
!
Note
important
formula!
90
45
2010
dce
Computer Performance Measures :MFLOPS (1/2)
A floating-point operation is an addition, subtraction,
multiplication, or division operation applied to numbers
represented by a single or a double precision floating-
point representation.
MFLOPS, for a specific program running on a specific
computer, is a measure of millions of floating point-
operation (megaflops) per second:
MFLOPS = Number of floating-point operations /
(Execution time x 10
6
)
MFLOPS rating is a better comparison measure between
different machines (applies even if ISAs are different)
than the MIPS rating.
Applicable even if ISAs are different
91
2010
dce
Computer Performance Measures :MFLOPS (2/2)
Program-dependent: Different programs have
different percentages of floating-point operations
present. i.e compilers have no floating- point
operations and yield a MFLOPS rating of zero.
Dependent on the type of floating-point
operations present in the program.
Peak MFLOPS rating for a CPU: Obtained using a program
comprised entirely of the simplest floating point instructions (with
the lowest CPI) for the given CPU design which does not
represent real floating point programs.
92
46
2010
dce
CPU Benchmark Suites
Performance Comparison: the execution time of the
same workload running on two machines without running
the actual programs
Benchmarks: the programs specifically chosen to
measure the performance.
Five levels of programs: in the decreasing order of
accuracy
Real Applications
Modified Applications
Kernels
Toy benchmarks
Synthetic benchmarks
93
2010
dce
SPEC: System Performance Evaluation Cooperative
SPECCPU: popular desktop benchmark suite
CPU only, split between integer and floating point programs
First Round 1989: 10 programs yielding a single number SPECmarks
Second Round 1992: SPECInt92 (6 integer programs) and SPECfp92 (14
floating point programs)
Third Round 1995
new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10
floating point)
benchmarks useful for 3 years
Single flag setting for all programs: SPECint_base95, SPECfp_base95
SPECint2000 has 12 integer, SPECfp2000 has 14 integer pgms
SPECCPU2006 to be announced Spring 2006
SPECSFS (NFS file server) and SPECWeb (WebServer) added
as server benchmarks
94

Advanced Computer Architecture: Tran Ngoc Thinh HCMC University of Technology

Uploaded by

Copyright:

Available Formats

Advanced Computer Architecture: Tran Ngoc Thinh HCMC University of Technology

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Advanced Computer Architecture: Tran Ngoc Thinh HCMC University of Technology

Uploaded by

Copyright:

Available Formats

1

You might also like