
Unit 1

Faculty of Engineering, D.E.I EEM 513 – Computer Architecture


1
Introduction
Introduction
Administrivia
Computer Architecture – definition

2
The HW/SW Interface

Application software        a[i] = b[i] + c;

        |  Compiler
        v
Systems software            lw   $15, 0($2)
(OS, compiler)              add  $16, $15, $14
                            add  $17, $15, $13
                            lw   $18, 0($12)
                            lw   $19, 0($17)
                            add  $20, $18, $19
                            sw   $20, 0($16)

        |  Assembler
        v
Hardware                    000000101100000
                            110100000100010
3

Administrivia

Components of Evaluation
Class Assignments (two per unit)
Home Assignments (“Daily”)
Maintain a DHA notebook, to be evaluated at random.
Properly indexed, with page numbers.
CTs (two: mid-sem and end-sem)
Surprise Quizzes (best n − 1 of n, n ≈ 2–3) and/or Additional HA
Term Paper
Attendance

Textbook: Computer Organization and Design: The
Hardware/Software Interface, Patterson & Hennessy, 3rd Ed.,
Morgan Kaufmann Publishers.
4
Introduction
Elements of Computing Systems
Von Neumann vs Harvard Model
ISA and Microarchitecture

5
Review

• Greater compute capability than a few years ago


• Emergence of new classes of computing systems
• Dominance of microprocessor-based systems
• Software implications – performance versus
productivity
• Challenges: RIP Moore’s Law, Limits of parallelism…

6
Key Elements of a Computing System

• Processing: control (sequencing) + datapath
• Memory (program and data)
• I/O

7
The Von Neumann Model/Architecture
AKA stored program computer (instructions in memory).

Key properties:
1. Stored program
Instructions stored in memory
Unified memory: instructions and data stored together
How can we interpret the contents of memory correctly?
Depends on the control signals issued

2. Sequential instruction processing


One instruction processed at a time
(fetch – decode – (fetch operands) – execute – writeback)
Program Counter (PC) / Instruction Pointer identifies current/next instruction
• Advances sequentially except in case of conditional/jump instructions

All major instruction set architectures today are based on this model:
x86, ARM, MIPS, SPARC, Alpha…

8
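A toy sketch (not from the slides; the opcodes and encoding are invented) of the fetch-decode-execute loop in C:

    /* Toy stored-program machine: the PC fetches the next instruction,
       which is decoded and executed; the PC advances sequentially.
       Opcodes and operand encoding are invented for illustration. */
    #include <stdio.h>

    enum { LOAD, ADD, STORE, HALT };          /* toy opcodes          */
    typedef struct { int op, a, b; } Insn;

    int main(void) {
        int mem[8] = {0, 0, 5, 7};            /* data memory          */
        Insn prog[] = {                       /* program "memory"     */
            {LOAD, 2, 0}, {LOAD, 3, 1},       /* r0=mem[2], r1=mem[3] */
            {ADD,  0, 1},                     /* r0 += r1             */
            {STORE, 4, 0},                    /* mem[4] = r0          */
            {HALT, 0, 0},
        };
        int r[2] = {0, 0}, pc = 0, running = 1;
        while (running) {
            Insn i = prog[pc++];              /* fetch; PC advances   */
            switch (i.op) {                   /* decode + execute     */
            case LOAD:  r[i.b] = mem[i.a]; break;
            case ADD:   r[i.a] += r[i.b]; break;
            case STORE: mem[i.a] = r[i.b]; break;
            case HALT:  running = 0; break;
            }
        }
        printf("mem[4] = %d\n", mem[4]);      /* prints 12            */
        return 0;
    }

(For brevity the sketch keeps code and data in separate arrays; in a true von Neumann machine both share one unified memory.)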
Von-Neumann Model

(Datapath diagram: PC, IR, Register File, MAR, MDR)

9
Harvard Architecture
▪ Separate storage and datapath for instructions and data
▪ Originated from the Harvard Mark I computer
▪ CPU can both read an instruction and access data memory simultaneously
▪ Can be faster, since instruction fetches and data accesses do not contend for a single memory pathway, as in the Von Neumann model
▪ Distinct code and data address spaces
▪ Applications in DSPs and microcontrollers like PIC
▪ Modern high-performance CPU designs incorporate aspects of both the Harvard and von Neumann architectures: split I/D caches

10
ISA vs. Micro-architecture
ISA: Specifies how the programmer sees instructions to be
executed
Example: Programmer sees a sequential flow of execution
Micro-architecture: How underlying implementation actually
executes instructions
Example: Micro-architecture may execute instructions in any order as long as it obeys the
semantics specified by the ISA when making the instruction results visible to software

Micro-architecture-level execution models vary across implementations:
Pipelined instruction execution: Intel 80486 uarch
Multiple instructions at a time: Intel Pentium uarch
Out-of-order execution: Intel Pentium Pro uarch
Separate instruction and data caches

(None of these is exposed to the programmer)

11
ISA vs. Microarchitecture
Various micro-architectures can (and do) exist for the same ISA
Add instruction (ISA) vs. Adder implementation (Micro-architecture)
Several possible implementations: ripple-carry vs. carry-lookahead vs. carry-save adders, …
x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro,
Pentium 4, Core, …

Q. Which of the two changes faster? ISA or Microarchitecture?


A. The microarchitecture: there are few ISAs (x86, ARM, SPARC, MIPS, Alpha) but many microarchitectures

12
What comprises the ISA?
Instructions
Opcodes, Data Types
Instruction Types and Formats
Registers, Addressing Modes
Memory
Address space, Addressability, Alignment
Virtual memory management
Function Calls, Interrupt/Exception Handling
Access Control, Device Priorities
I/O mapping: memory-mapped vs. I/O mapped
Task/thread Management

Should the programmer/software layer be able to access these?


13
What comprises the Micro-architecture?
Micro-architecture: A particular implementation of the ISA,
guided by some set of design constraints and objectives

Pipelining
In-order versus out-of-order instruction execution
Memory access scheduling policy
Speculative execution (e.g., cache prefetching, branch prediction)
Superscalar
Caching: number of levels, cache size, associativity,
replacement…

Should the programmer/software layer have access to these?


14
Tradeoffs in ISA and MicroArch

Design considerations include:

Cost
Performance
Energy/Battery life
Reliability and Correctness
Time to market

15
Computer Architecture
Architecture
The art and science of designing and constructing
buildings

Computer Architecture
The art and science of selecting, designing
and interconnecting hardware components
and designing the hardware/software
interface to create a computing system that
meets the required system design goals

16
Course Contents in brief

Unit 1. Introduction, Performance metrics


Unit 2. MIPS (RISC) ALP
Computer Arithmetic: algorithms & hardware
Unit 3 Memory. Cache basics. Virtual memory.
Unit 4 Input Output
Unit 5 CPU Design and pipelining

17
Performance
Response time & throughput; Factors that affect performance
The Performance Equation
Some Examples

18
Performance and Cost
• Which of the following aircraft has
the best performance?

Airplane        Passengers   Range (mi)   Speed (mph)
Boeing 737         101           630          598
Boeing 747         470          4150          610
Concorde           132          4000         1350
Douglas DC-8       146          8720          544

• Concorde is the fastest
• DC-8 has the longest range
• 747 can carry the max. no. of passengers

19
Performance and Cost
• Which of the following aircraft has
the best performance?

Airplane        Passengers   Range (mi)   Speed (mph)
Boeing 737         101           630          598
Boeing 747         470          4150          610
Concorde           132          4000         1350
Douglas DC-8       146          8720          544

Clearly, we must first define what we mean by “performance”

20
Defining Performance
• Why study hardware performance?
– Hardware performance is often the key to the
effectiveness of the entire system
• Not an easy task
– Different aspects of a computing system can
assume significance in determining the overall
performance
– Different performance metrics appropriate for
different types of applications

21
Defining Performance
• What is important to whom?
– Individual Computer User/Programmer
• Minimize Response Time for the program
= completion time – start time
• Also called elapsed time, wall clock time
– Data Center Manager (what are data centers?)
• Maximize Throughput
– number of jobs completed in given time
• Measured as #jobs/unit time

22
Q. Defining Performance
• What does each of the following improve?
(a) response time or (b) throughput?
– Replacing the existing CPU with a faster CPU
– Adding more processor cores in the same
system

23
Defining Performance
• Depending on the type of application, we
might need to optimize
– Throughput
– Response time
– Some combination of both

25
Response Time vs. Throughput
• How can we relate the two?
– E.g. Simple uniprocessor system
• processes executed sequentially
– Assume 5 processes
– Each process needs 2 minutes of CPU time
• One process starts after every 2 minutes
OR
• Processes start every 10 seconds
– What’s the throughput in each case?
– Average response time in each case?
• Is throughput = 1/average response time?
– Yes, BUT only if there is NO overlap of jobs
– Otherwise, throughput > 1/average response time
26
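A small C sketch (not on the slide; it just automates the arithmetic above) comparing the two cases:

    /* Uniprocessor, 5 jobs, 120 s of CPU each, run strictly one at
       a time. Case 1: arrivals every 120 s. Case 2: every 10 s.   */
    #include <stdio.h>

    int main(void) {
        const double burst = 120.0;        /* CPU seconds per job  */
        const int jobs = 5;
        double gaps[2] = {120.0, 10.0};    /* arrival interval, s  */

        for (int c = 0; c < 2; c++) {
            double total_resp = 0, finish = 0;
            for (int i = 0; i < jobs; i++) {
                double arrive = i * gaps[c];
                double start  = arrive > finish ? arrive : finish;
                finish = start + burst;
                total_resp += finish - arrive;  /* completion - arrival */
            }
            printf("case %d: avg response %.0f s, throughput %.4f jobs/s,"
                   " 1/avg response %.4f\n",
                   c + 1, total_resp / jobs, jobs / finish,
                   jobs / total_resp);
        }
        return 0;
    }

In case 1 the jobs never overlap and throughput = 1/average response time (both 0.0083 jobs/s). In case 2 jobs wait in a queue, average response time grows to 340 s, and throughput (still 0.0083 jobs/s) exceeds 1/340 s.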
Performance according to the
Computer Architect
• From the perspective of a computer designer
– CPU time = time spent actually running a program
• Intuitively, faster should be better, so:

      Performance = 1 / Execution time

• CPU time is a major performance metric

27
CPU Execution Time
• Time spent by CPU in running a program
– System CPU Time
– User CPU Time
• (Try) Unix/Linux time command
– Gives user and system CPU time
– Also gives elapsed time
• CPU time is a major performance metric:
– Specifically, we’ll view CPU performance in terms of
user CPU time

28
Performance Comparison
• Machine A is n times faster than machine B iff
perf(A)/perf(B) = Exec time(B) / Exec time(A) = n
• Machine A is x% faster than machine B iff
– perf(A)/perf(B) = time(B)/time(A) = 1 + x/100

• Example: for some program run on machines A


and B, time(A) = 10s, time(B) = 15s
– 15/10 = 1.5 => A is 1.5 times faster than B
– 15/10 = 1.5 => A is 50% faster than B

29
CPU Performance Factors
• Computer equipped with a clock
– that runs at a fixed rate, and
– determines when events will take place in the hardware
• Discrete time intervals called clock cycles/ticks
• Example:
– A 1 GHz processor runs 10^9 cycles/sec, 1 cycle every 1 ns
– A 2.5 GHz Core i7 runs 2.5×10^9 cycles/sec
• So one clock tick every 0.4 ns
• The interval between ticks (0.4 ns here) is the clock cycle time
• 2.5 GHz is the clock rate

30
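A two-line check of the relation T = 1/f, using the slide's numbers (illustrative only):

    /* Clock cycle time is the reciprocal of the clock rate. */
    #include <stdio.h>

    int main(void) {
        double rate_hz[2] = {1e9, 2.5e9};         /* 1 GHz, 2.5 GHz */
        for (int i = 0; i < 2; i++)
            printf("%.1f GHz -> cycle time %.2f ns\n",
                   rate_hz[i] / 1e9, 1e9 / rate_hz[i]);
        return 0;
    }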
CPU Performance Factors
• CPU time can be related to
  – Number of CPU clock cycles
  – Clock cycle time

      CPU time = CPU clock cycles × Clock cycle time

• Alternatively:

      CPU time = CPU clock cycles / Clock rate
31
Implication: The hardware designer can improve
performance by reducing either the length of the
clock cycle or the number of clock cycles
required for a program

Trade-off: Many techniques that decrease


the number of clock cycles also increase the clock
cycle time.

32
• This equation does not include any reference to the
number of instructions needed for the program.

• However, the execution time does depend on the


number of instructions in the program also.

• A simple way to think about execution time is that


it equals the number of instructions executed
multiplied by the average time per instruction.

33
CPU Performance Equation
The performance equation separates the three key factors
that affect performance.

CPU time = IC x Average CPI x Clock Cycle Time

Can be used
❑ To compare two different implementations
❑ To evaluate a design alternative if we know its impact on
these three parameters

How can we determine the values of these three key factors


in the performance equation?

35
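A hedged sketch of using the equation to compare two design alternatives; the instruction counts, CPIs, and clock rates below are invented for illustration:

    /* CPU time = IC x CPI x clock cycle time. */
    #include <stdio.h>

    double cpu_time(double ic, double cpi, double cycle_s) {
        return ic * cpi * cycle_s;
    }

    int main(void) {
        /* Two hypothetical implementations of the same ISA. */
        double t1 = cpu_time(2e9, 2.0, 1.0 / 1e9);    /* CPI 2.0 at 1 GHz   */
        double t2 = cpu_time(2e9, 1.2, 1.0 / 0.8e9);  /* CPI 1.2 at 0.8 GHz */
        printf("impl 1: %.2f s, impl 2: %.2f s\n", t1, t2);  /* 4 s vs 3 s */
        return 0;
    }

Here the second design wins despite its slower clock, because the CPI improvement outweighs the longer cycle time.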
Three key factors
affecting performance
• Instruction count
– Determined by algorithm, compiler, ISA
– Measured using software that simulates the ISA, hardware
counters present in many systems
• Clock cycle time
– Determined by technology, organization, efficient circuit design
– Usually published in processor documentation
• Average cycles per instructions (CPI)
– Determined by ISA, CPU organization, instruction mix
– Measured by detailed simulations of implementation

36
CPU Performance Equation (contd.)

37
CPU Performance Equation
• It is also possible to obtain the number of clock
cycles by looking at the different types of
instructions and their individual clock cycle
counts:

      CPU clock cycles = Σ over classes i of (CPI_i × C_i)

where C_i is the number of instructions of class i
executed

38
Example
A compiler designer is trying to decide between two code
sequences for a particular machine. The hardware designers have
supplied the following facts:
Instruction class    CPI for this instruction class
A                                1
B                                2
C                                3

For a particular high-level language statement, the compiler
designer is considering two code sequences that require the
following instruction counts:

                   Instruction counts per class
Code Sequence         A        B        C
1                     2        1        2
2                     4        1        1
Which code sequence (a) executes more instructions? (b) has lower
average CPI (c) will be slower (in terms of execution time)?
39
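A worked check of the answers (not on the slide; data taken from the tables above):

    /* cycles = sum over classes of (CPI_i x C_i) */
    #include <stdio.h>

    int main(void) {
        int cpi[3] = {1, 2, 3};                /* classes A, B, C */
        int counts[2][3] = {{2, 1, 2},         /* code sequence 1 */
                            {4, 1, 1}};        /* code sequence 2 */
        for (int s = 0; s < 2; s++) {
            int ic = 0, cycles = 0;
            for (int i = 0; i < 3; i++) {
                ic     += counts[s][i];
                cycles += cpi[i] * counts[s][i];
            }
            printf("sequence %d: %d instructions, %d cycles, CPI %.1f\n",
                   s + 1, ic, cycles, (double)cycles / ic);
        }
        return 0;
    }

So sequence 2 executes more instructions (6 vs. 5), sequence 2 has the lower average CPI (1.5 vs. 2.0), and sequence 1 is slower (10 cycles vs. 9 at the same clock).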
Takeaways

Considering only one factor (instruction count, in
this case) to assess performance can mislead.

When comparing two computers, we must look at
all three components, which combine to form
execution time.

If some of the factors are identical, like the clock rate
in the previous example, performance can be
determined by comparing all the non-identical
factors.
40
Summary
• We looked at some fundamental ways of
quantifying and comparing the
performance of computing systems
• Next topic: benchmarking

Readings:
HP3E Ch.1. (1.1-1.3, 1.5); Ch.4. (4.1,4.2)

41
Performance
Evaluating Performance

42
Performance & Benchmarks
• Measuring performance of a computer system
is not easy.
• What we need: a simple yet representative
metric that captures the capabilities of the
computer system for our usage.
– metric that takes into account the kind of
applications, or the specific app-mix that we
would run on our system.
• Scientific
• Word processing
• Gaming
And so on…
Performance & Benchmarks

• If we plan to run scientific applications


involving lots of floating-point computations,
knowing how many integer calculations a
given machine can perform is not important
• Similarly, if the applications mostly do
character manipulation, we don’t care how
many FP calculations per second the system
can do.
Performance & Benchmarks

• So it is important to take into account the


expected mix of applications – the workload
– and derive metrics that make sense for the
target user group.
• Workload: suite of representative programs
which can be executed to measure the time.
– If this suite of applications represents the target
user application mix reasonably, then we can
compare the performance of different systems by
comparing execution times
Performance & Benchmarks

• Obviously, if machine X executes the


workload in 300 seconds and machine Y
takes 330 seconds, we say that machine X
is better for this workload.
– Note, however, that if we change the
workload, it is quite possible that machine Y
performs better than machine X for the new
workload.
• Workload is important in comparing the
performance of different machines.
Performance & Benchmarks

• Standard bodies have tried to define a set


of benchmark programs that approximate
the intended real-world applications.
✓ Benchmarks can be real programs taken from
sample applications or they can be synthetic.

• In synthetic benchmarks, artificial


programs are created to exercise the
system in a specific way.
Synthetic Benchmarks

• Programs specifically written for performance


testing.
• Two examples are the Whetstone and Dhrystone benchmarks.

• Whetstone benchmark, named after the


Whetstone Algol compiler (Algol, later Fortran)
– was developed in the mid-1970s to
measure floating-point performance
• Dhrystone benchmark (Ada, later C)
– developed in 1984 to measure integer
performance.
Synthetic Benchmarks

• Both Whetstone and Dhrystone


benchmarks are small programs
• Drawbacks with synthetic benchmarks:
– No user would use them as applications:
they don’t do anything of use
– Not real programs, so they do not reflect
program behavior
– They encouraged excessive optimization
by compilers to distort performance
results
Real Benchmarks

• SPEC: System Performance Evaluation Cooperative


– SPEC CPU2006
• Benchmark for measuring processor performance,
memory, and compiler
• 12 integer + 17 floating-point applications, written in three or four different programming languages
• Integer programs: compilers, compression, chess, CAD placement & routing programs, etc.
• Floating-point programs: FEM, CFD simulations, ANN, 3D graphics, image-processing programs, etc.
• Performance of a reference (REF) machine is given (a Sun SPARC workstation).
Real Benchmarks

• Others
– SPECmail, SPECweb, SPECjvm etc.
Performance Metrics

• Computer system performance can be


measured by several performance
metrics.
• The metrics we use depend on the
purpose as well as the component of the
system in which we are interested.
– For example, to benchmark a network device,
we’d use network bandwidth, which tells us the
number of bits the component can transmit
per second.
Other Performance Metrics I: MIPS
• MIPS stands for millions of instructions
per second.
– simple metric but not very useful for
expressing system performance

      CPU time = (CPI × Instruction count) / Clock rate

      => Clock rate / CPI = Instruction count / CPU time

      MIPS = Instruction count / (CPU time × 10^6)
           = Clock rate / (CPI × 10^6)
Other Performance Metrics I: MIPS
• MIPS: drawbacks.
• Instruction counts can vary widely among processors
– Complex instructions take more clocks than simple instructions.
• MIPS does not capture the actual work done by these
instructions.
• MIPS can vary inversely to performance!
• Can we use MIPS to compare:
Machines with different instruction sets?
Programs with different instruction mixes?
At best, MIPS may be useful for comparing various versions of
processors derived from the same instruction set.
Uncorrelated with performance: more a marketing metric
“Meaningless Indicator of Processor Speed”
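A hedged numeric illustration (all numbers invented) of MIPS ranking code backwards:

    /* Same 1 GHz machine, two compiled versions of one program:
       version 1 uses many simple instructions, version 2 fewer
       but more complex ones.                                   */
    #include <stdio.h>

    int main(void) {
        double clock = 1e9;
        double ic[2]  = {10e9, 3e9};   /* instruction counts */
        double cpi[2] = {1.0, 2.0};    /* average CPI        */
        for (int v = 0; v < 2; v++) {
            double t    = ic[v] * cpi[v] / clock;   /* CPU time, s */
            double mips = ic[v] / (t * 1e6);
            printf("version %d: %.0f s, %.0f MIPS\n", v + 1, t, mips);
        }
        return 0;   /* version 2 finishes in 6 s vs 10 s, yet scores
                       500 MIPS against version 1's 1000 MIPS       */
    }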
Other Performance Metrics II: MFLOPS
• MFLOPS: popular metric often used in the scientific
computing area.
– Millions of floating-point operations per second.
      MFLOP/s = Number of FP operations / (CPU time × 10^6)
• Better metric than MIPS as it captures the number of
operations performed, rather than instructions

• Popular in supercomputing community


• Not all FP operations are equal: need for normalization
• Can magnify performance differences
• A better algorithm (e.g., with better data reuse) can run
faster even with a higher FLOP count
Amdahl’s Law

A motivating example
A program runs in 100 seconds on a
computer, with multiply operations
responsible for 80 seconds of this time. By
how much do I have to improve the speed of
multiplication if I want my program to run
five times faster?

56
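A quick check of why this is a trick question (this working is not on the slide). Let n be the factor by which multiplication must speed up. The 20 s not spent in multiplies is unaffected, so we need

      80/n + 20 = 100/5 = 20  =>  80/n = 0

No finite n satisfies this: even an infinitely fast multiplier leaves the other 20 s, capping the overall speedup at 100/20 = 5×. The target is unattainable.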
Amdahl’s Law
“Validity of the single processor approach to achieving large scale computing capabilities”, G. M. Amdahl,
AFIPS Conference Proceedings, pp. 483-485, April 1967

• Historical context
  – Amdahl was demonstrating “the continued validity of the
    single processor approach and of the weaknesses of the
    multiple processor approach” (!)
      “..A fairly obvious conclusion which can be drawn at this
      point is that the effort expended on achieving high parallel
      performance rates is wasted unless it is accompanied by
      achievements in sequential processing rates of very nearly
      the same magnitude..”
• Nevertheless, Amdahl’s Law has widespread
  applicability in many situations

57
Amdahl’s Law

      Exec_time_new = Exec_time_old × [ (1 − F_enhanced) + F_enhanced / Speedup_enhanced ]

      Speedup_overall = Exec_time_old / Exec_time_new
                      = 1 / [ (1 − F_enhanced) + F_enhanced / Speedup_enhanced ]

60
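A minimal C sketch of the formula, applied to the earlier multiply example (F_enhanced = 0.8):

    /* Amdahl's Law: overall speedup from enhancing a fraction F. */
    #include <stdio.h>

    double amdahl(double f, double speedup) {
        return 1.0 / ((1.0 - f) + f / speedup);
    }

    int main(void) {
        for (double s = 2; s <= 1024; s *= 2)
            printf("multiply %4.0fx faster -> overall %.2fx\n",
                   s, amdahl(0.8, s));
        /* Overall speedup approaches, but never reaches, 1/(1-0.8) = 5x. */
        return 0;
    }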
Implications of Amdahl’s Law
• The performance improvements provided by a feature are
  limited by how often that feature is used
• Bottleneck is the most promising target for improvements
  – “Make the common case fast”
  – Infrequent events, even if they consume a lot of time, will make
    little difference to performance
• Typical use: Change only one parameter of system, and
  compute effect of this change
  – The same program, with the same input data, should run on the
    machine in both cases

61
Making the Common Case Fast: Examples
• All instructions require an instruction fetch; only a
  fraction require a data fetch/store
  – Optimize instruction access over data access

• Programs exhibit locality
  – Spatial locality: items with addresses near one another tend
    to be referenced close together in time
  – Temporal locality: recently accessed items are likely to be
    accessed in the near future

• Access to small memories is faster
  – Provide a storage hierarchy such that the most frequent
    accesses are to the smallest (closest) memories:

      Registers → Cache → Memory → Disk/Tape

62
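A standard illustration (code invented for this note, not from the slides) of spatial locality: both loops below compute the same sum, but the first walks addresses sequentially while the second strides across a full row between accesses.

    /* Row-major traversal uses each fetched cache line fully;
       column-major traversal strides N*8 bytes between reads.  */
    #include <stdio.h>

    #define N 1024
    static double a[N][N];

    int main(void) {
        double s1 = 0, s2 = 0;

        for (int i = 0; i < N; i++)        /* good spatial locality */
            for (int j = 0; j < N; j++)
                s1 += a[i][j];

        for (int j = 0; j < N; j++)        /* poor spatial locality */
            for (int i = 0; i < N; i++)
                s2 += a[i][j];

        printf("%f %f\n", s1, s2);         /* same sums, very different
                                              cache behaviour          */
        return 0;
    }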


“Make The Common Case Fast” (2)
• What is the common case?
  – The case on which the system spends most of its time
  – The “bottleneck”

• Exactly what does this statement mean?
  – Make the common case faster, rather than making some
    other case faster
  – Make the common case faster by a certain amount, rather
    than making some other case faster by the same amount

63
“Means” of Performance

• Matter of interest: a single summarizing metric to get


an idea of performance
– Less information, but preferred by marketers and users
• Even if we conduct several experiments, once an
appropriate workload has been identified and the
performance metric has been selected, we need to find a
way to get a single value for the metric.
• Several ways of obtaining such a metric (e.g., SPECRatio)
• Suppose we run two programs to evaluate a system.
Individual execution times of 100 seconds (for Program
1) and 80 seconds (for Program 2).
“Means” of Performance

– Arithmetic mean = 90 seconds


– The implicit assumption in our arithmetic
mean calculation
• Both programs are equally likely in the target
workload.
• What if they are not?
– Weighted Arithmetic Mean
• Example: P2 appears three times more often than
P1
• Weighted Mean = (3 x 80 + 1 x 100)/4 = 85 seconds
• AM is a special case of the weighted
arithmetic mean with equal weights
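A quick check of the weighted-mean arithmetic (values from the example above):

    /* Weighted AM with P2 three times as frequent as P1. */
    #include <stdio.h>

    int main(void) {
        double t[2] = {100, 80};   /* P1, P2 run times (s) */
        double w[2] = {1, 3};      /* relative frequencies */
        printf("weighted AM = %.0f s\n",
               (w[0]*t[0] + w[1]*t[1]) / (w[0] + w[1]));   /* 85 */
        return 0;
    }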
Comparing relative performance

• Simplest approach: use execution time


– Perf (A) / Perf (B) = ET (B) / ET(A)

                  Computer A    Computer B
Program 1 (s)          1            10
Program 2 (s)       1000           100
Total time (s)      1001           110
Comparing relative performance
• Weighted AM is fine for metrics such as response
time when assessing the performance of a single system
• But we normalize to a reference machine when
comparing A and B: A/REF; B/REF
• AM can be inconsistent in this case, as it depends
on the REF machine. See example below.
        Response time (s)            Normalized to REF
     REF     M1      M2     Ratio    M1      M2      Ratio
P1    10     11      12              1.1     1.2
P2    40     49.5    60              1.24    1.5
AM           30.25   36     1.19     1.17    1.35    1.16
GM           23.33   26.83  1.15     1.167   1.342   1.15

(Ratio = M2/M1. Note that AM’s ratio changes under normalization while GM’s does not.)
“Means” of Performance
• How about Geometric Mean?
• GM has the following property:
– GM(Xi) / GM(Yi) = GM(Xi/Yi)
– Advantage: Independent of running
times of individual programs and REF
machine

• Example:
                 Time on A    Time on B
      P1              1           10
      P2           1000          100
      AM            500.5         55
      GM             31.6         31.6
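A short C check (values from the table above) that the arithmetic mean separates the two machines while the geometric mean does not:

    /* AM vs. GM of the two program run times on machines A and B. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double a[2] = {1, 1000}, b[2] = {10, 100};     /* P1, P2 times */
        printf("AM: A = %.1f   B = %.1f\n",
               (a[0] + a[1]) / 2, (b[0] + b[1]) / 2);  /* 500.5 vs 55  */
        printf("GM: A = %.1f   B = %.1f\n",
               sqrt(a[0] * a[1]), sqrt(b[0] * b[1]));  /* 31.6 vs 31.6 */
        return 0;
    }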
“Means” of Performance

• AM values tell us that B is about nine times
faster than A, but GM suggests that
both machines perform the same. (Why?)
• Because GM tracks the performance ratio, not
execution time (that’s its key drawback)
o Since Program 1 runs 10 times faster on A
and Program 2 runs 10 times faster on B, by
using GM we erroneously conclude that the
average performance of the two machines is
the same.
Summary: “Means” of Performance

• Execution Time is the best measure of


performance. Other metrics have
limitations/drawbacks.
• Any measure that summarizes
performance should reflect execution
time (eg. weighted AM)
– GM does not do this
• GM is good for throughput: SPEC
benchmarks use GM for this purpose.
Readings
• References: HP3E Ch. 1 (1.1–1.3, 1.5); Ch. 4 (4.1–4.6)
• Readings:
  – Gordon E. Moore, “Cramming more components onto ICs”, Electronics,
    April 19, 1965
  – David Patterson, “Why Latency Lags Bandwidth, and What it Means to
    Computing”, Proceedings of ICCD, 2005
