Unit 1
The HW/SW Interface

Systems software (OS, compiler)

Compiler
  lw  $15, 0($2)
  add $16, $15, $14
  add $17, $15, $13
  lw  $18, 0($12)
  lw  $19, 0($17)
  add $20, $18, $19
  sw  $20, 0($16)

Assembler

Hardware
  000000101100000
  110100000100010
Administrivia
Components of Evaluation
Class Assignments (two per unit)
Home Assignments (“Daily”)
Maintain a DHA notebook, to be evaluated at random.
Properly indexed, with page numbers.
CTs (Two – mid-sem and end-sem)
Surprise Quizzes (best n − 1 of n, n ≈ 2–3) and/or Additional Home Assignments
Term Paper
Attendance
Key Elements of a Computing System
[Figure: processing unit = control (sequencing) + datapath; memory (program and data); I/O]
The Von Neumann Model/Architecture
AKA the stored-program computer (instructions in memory).
Key properties:
1. Stored program
Instructions stored in memory
Unified memory: instructions and data stored together
How can we interpret the contents of memory correctly?
Depends on the control signals issued
[Figure: key state elements include the PC, IR, Register File, MAR and MDR]
Harvard Architecture
▪Separate storage and datapath
for instructions and data
▪Originated from the Harvard Mark I computer
▪ CPU can both read an instruction and access data
memory simultaneously.
▪ Can be faster since instruction fetches and data accesses do not
contend for a single memory pathway, as in von Neumann
▪ Distinct code and data address spaces
▪Applications in DSP, microcontrollers like PIC
▪Modern high performance CPU chip designs
incorporate aspects of both Harvard and von Neumann
architecture.
▪Split I/D caches
ISA vs. Micro-architecture
ISA: Specifies how the programmer sees instructions to be
executed
Example: Programmer sees a sequential flow of execution
Micro-architecture: How underlying implementation actually
executes instructions
Example: Micro-architecture may execute instructions in any order as long as it obeys the
semantics specified by the ISA when making the instruction results visible to software
ISA vs. Microarchitecture
Various micro-architectures can (and do) exist for the same ISA
Add instruction (ISA) vs. Adder implementation (Micro-architecture)
Several possible implementations: ripple carry v/s carry lookahead v/s carry save..
x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro,
Pentium 4, Core, …
What comprises the ISA?
Instructions
Opcodes, Data Types
Instruction Types and Formats
Registers, Addressing Modes
Memory
Address space, Addressability, Alignment
Virtual memory management
Function Calls, Interrupt/Exception Handling
Access Control, Device Priorities
I/O mapping: memory-mapped vs. I/O mapped
Task/thread Management
Components of the microarchitecture (not part of the ISA):
Pipelining
In-order versus out-of-order instruction execution
Memory access scheduling policy
Speculative execution (e.g. cache prefetching, branch prediction)
Superscalar
Caching: number of levels, cache size, associativity,
replacement…
Design goals and constraints:
Cost
Performance
Energy/Battery life
Reliability and Correctness
Time to market
Computer Architecture
Architecture
The art and science of designing and constructing
buildings
Computer Architecture
The art and science of selecting, designing
and interconnecting hardware components
and designing the hardware/software
interface to create a computing system that
meets the required system design goals
Course Contents in brief
Performance
Response time & throughput; Factors that affect performance
The Performance Equation
Some Examples
Performance and Cost
• Which of the following aircrafts has
the best performance?
[Table: Airplane | Passengers | Range (mi) | Speed (mph)]
Defining Performance
• Why study hardware performance?
– Hardware performance is often the key to the
effectiveness of the entire software system
• Not an easy task
– Different aspects of a computing system can
assume significance in determining the overall
performance
– Different performance metrics appropriate for
different types of applications
Defining Performance
• What is important to whom?
– Individual Computer User/Programmer
• Minimize Response Time for the program
= completion time – start time
• Also called elapsed time, wall clock time
– Data Center Manager (what are data centers?)
• Maximize Throughput
– number of jobs completed in given time
• Measured as #jobs/unit time
Q. Defining Performance
• What does each of the following improve?
(a) response time or (b) throughput?
– Replacing the existing CPU with a faster CPU
– Adding more processor cores in the same
system
Defining Performance
• Depending on the type of application, we
might need to optimize
– Throughput
– Response time
– Some combination of both
Response Time vs. Throughput
• How can we relate the two?
– E.g. Simple uniprocessor system
• processes executed sequentially
– Assume 5 processes
– Each process needs 2 minutes of CPU time
• One process starts after every 2 minutes
OR
• Processes start every 10 seconds
– What’s the throughput in each case?
– Average response time in each case?
• Is throughput = 1/average response time?
– Yes, BUT only if there is NO overlap of jobs
– Otherwise, throughput > 1/average response time
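The two cases above can be worked out with a little arithmetic. A sketch, assuming the "every 10 seconds" case means ideal round-robin time-sharing with a 10 s quantum (one reading of the slide):

```python
# Case 1: no overlap -- each 2-minute job starts when the previous one ends.
n, burst = 5, 120                      # five jobs, 120 s of CPU time each
total = n * burst                      # 600 s to finish all five
throughput = n / total                 # jobs completed per second
resp1 = burst                          # each job waits only for itself: 120 s

# Case 2: all five jobs time-share the CPU (round-robin, 10 s quantum),
# so a new job gets the CPU every 10 seconds.
q = 10
quanta = burst // q                    # 12 quanta per job
cycle = n * q                          # one full round over the 5 jobs = 50 s
# Job i's final quantum ends after (quanta - 1) full rounds plus its own slot:
finish = [(quanta - 1) * cycle + (i + 1) * q for i in range(n)]
avg_resp2 = sum(finish) / n            # (560+570+580+590+600)/5 = 580 s
# Throughput is identical in both cases (5 jobs in 600 s), but average
# response time is 120 s without overlap and 580 s with overlap:
# throughput == 1/resp1, while throughput > 1/avg_resp2.
```

With overlap, jobs pile up in flight, so the reciprocal relationship breaks exactly as the slide states.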
Performance according to the
Computer Architect
• From the perspective of a computer designer
– CPU time = time spent actually running a program
• Intuitively, faster should be better, so:
Performance = 1 / Execution time
CPU Execution Time
• Time spent by CPU in running a program
– System CPU Time
– User CPU Time
• (Try) Unix/Linux time command
– Gives user and system CPU time
– Also gives elapsed time
• CPU time is a major performance metric:
– Specifically, we’ll view CPU performance in terms of
user CPU time
Performance Comparison
• Machine A is n times faster than machine B iff
perf(A)/perf(B) = Exec time(B) / Exec time(A) = n
• Machine A is x% faster than machine B iff
– perf(A)/perf(B) = time(B)/time(A) = 1 + x/100
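As a quick sanity check of both definitions (the times below are invented for illustration):

```python
# Hypothetical measurements for the same program on two machines.
time_A, time_B = 10.0, 15.0            # seconds

n = time_B / time_A                    # A is n = 1.5 times faster than B
x = (time_B / time_A - 1) * 100        # A is x = 50% faster than B
```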
CPU Performance Factors
• Computer equipped with a clock
– that runs at a fixed rate, and
– determines when events will take place in the hardware
• Discrete time intervals called clock cycles/ticks
• Example:
– 1 GHz processor runs 10^9 cycles/sec, 1 cycle every 1 ns
– 2.5 GHz Core i7 runs 2.5×10^9 cycles/sec
• So one clock tick every 0.4 ns
• Clock tick is a.k.a. clock cycle time
• 2.5GHz is the clock rate
CPU Performance Factors
• CPU time can be related to
– Number of CPU clock cycles
– Clock cycle time
CPU time = CPU clock cycles × Clock cycle time
• Alternatively:
CPU time = CPU clock cycles / Clock rate
Implication: The hardware designer can improve
performance by reducing either the length of the
clock cycle or the number of clock cycles
required for a program
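A quick illustration with assumed numbers (the cycle count here is invented):

```python
# CPU time = CPU clock cycles * clock cycle time = cycles / clock rate
clock_rate = 2.5e9                     # 2.5 GHz -> 0.4 ns cycle time
cycle_time = 1 / clock_rate            # seconds per cycle

cycles = 10e9                          # assume the program needs 10^10 cycles
cpu_time = cycles * cycle_time         # 4 seconds
# Halving either factor halves CPU time:
faster = (cycles / 2) * cycle_time     # 2 seconds
```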
• This equation does not include any reference to the
number of instructions needed for the program.
CPU Performance Equation
CPU time = Instruction count × CPI × Clock cycle time
The performance equation separates the three key factors
that affect performance.
Can be used
❑ To compare two different implementations
❑ To evaluate a design alternative if we know its impact on
these three parameters
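The three factors multiply out directly. A sketch with assumed values (instruction count, CPI and clock rate are all invented for illustration):

```python
# CPU time = instruction count * CPI * clock cycle time
instr_count = 2_000_000                # instructions executed (assumed)
cpi = 2.0                              # average clock cycles per instruction
clock_rate = 1e9                       # 1 GHz -> 1 ns cycle time

cpu_time = instr_count * cpi / clock_rate      # 4 ms
# Evaluating a design alternative that lowers CPI to 1.5, same clock:
cpu_time_alt = instr_count * 1.5 / clock_rate  # 3 ms
speedup = cpu_time / cpu_time_alt              # 4/3x faster
```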
Three key factors
affecting performance
• Instruction count
– Determined by algorithm, compiler, ISA
– Measured using software that simulates the ISA, hardware
counters present in many systems
• Clock cycle time
– Determined by technology, organization, efficient circuit design
– Usually published in processor documentation
• Average cycles per instructions (CPI)
– Determined by ISA, CPU organization, instruction mix
– Measured by detailed simulations of implementation
CPU Performance Equation (contd.)
CPU Performance Equation
• It is also possible to obtain the number of clock
cycles by looking at the different types of
instructions and their individual clock cycle
counts:
CPU clock cycles = Σ over instruction classes i of (CPI_i × C_i)
where C_i is the number of instructions of class i executed and CPI_i is
the cycles per instruction for that class.
Example
A compiler designer is trying to decide between two code
sequences for a particular machine. The hardware designers have
supplied the following facts:
Instruction class | CPI for this instruction class
A                 | 1
B                 | 2
C                 | 3
For a particular high-level language statement, the compiler
designer is considering two code sequences that require the
following instruction counts:
Instruction counts for instruction class
Code Sequence | A | B | C
1             | 2 | 1 | 2
2             | 4 | 1 | 1
Which code sequence (a) executes more instructions? (b) has lower
average CPI (c) will be slower (in terms of execution time)?
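A sketch that works the example through the performance equation, using the CPI values and instruction counts from the tables above:

```python
cpi = {"A": 1, "B": 2, "C": 3}                 # cycles per instruction class
seq1 = {"A": 2, "B": 1, "C": 2}                # instruction counts, sequence 1
seq2 = {"A": 4, "B": 1, "C": 1}                # instruction counts, sequence 2

def totals(counts):
    n = sum(counts.values())                   # total instructions executed
    cycles = sum(cpi[c] * counts[c] for c in counts)
    return n, cycles, cycles / n               # count, cycles, average CPI

n1, cyc1, cpi1 = totals(seq1)                  # 5 instructions, 10 cycles, CPI 2.0
n2, cyc2, cpi2 = totals(seq2)                  # 6 instructions,  9 cycles, CPI 1.5
# (a) sequence 2 executes more instructions (6 vs 5)
# (b) sequence 2 has the lower average CPI (1.5 vs 2.0)
# (c) sequence 1 is slower: 10 cycles vs 9 at the same clock rate
```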
Takeaways
Readings:
HP3E Ch.1. (1.1-1.3, 1.5); Ch.4. (4.1,4.2)
Performance
Evaluating Performance
Performance & Benchmarks
• Measuring performance of a computer system
is not easy.
• What we need: a simple yet representative
metric that captures the capabilities of the
computer system for our usage.
– metric that takes into account the kind of
applications, or the specific app-mix that we
would run on our system.
• Scientific
• Word processing
• Gaming
And so on…
Performance & Benchmarks
• Others
– SPECmail, SPECweb, SPECjvm etc.
Performance Metrics
A motivating example
A program runs in 100 seconds on a
computer, with multiply operations
responsible for 80 seconds of this time. By
how much do I have to improve the speed of
multiplication if I want my program to run
five times faster?
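Working the numbers through (a sketch): a 5× overall speedup means finishing in 20 s, but the non-multiply part alone already takes 20 s, so no finite improvement to multiplication is enough.

```python
total, mult = 100.0, 80.0              # seconds; multiply's share of runtime
other = total - mult                   # 20 s not spent in multiplies
target = total / 5                     # 20 s needed for a 5x overall speedup

# New time = other + mult / n, where n is the multiply improvement factor.
# Even with infinitely fast multiplies (n -> infinity) we only approach:
best = other                           # 20 s -- the target itself, never below
for n in (2, 10, 100, 10_000):
    print(n, other + mult / n)         # approaches 20 s but never reaches it
```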
Amdahl’s Law
“Validity of the single processor approach to achieving large scale computing capabilities”, G. M. Amdahl,
AFIPS Conference Proceedings, pp. 483-485, April 1967
Historical context
⚫ Amdahl was demonstrating “the continued validity of the
single processor approach and of the weaknesses of the
multiple processor approach” (!)
➢“..A fairly obvious conclusion which can be drawn at this
point is that the effort expended on achieving high parallel
performance rates is wasted unless it is accompanied by
achievements in sequential processing rates of very nearly
the same magnitude..”
Nevertheless, Amdahl’s Law has widespread
applicability in many situations
Amdahl’s Law
Exec_time_new = Exec_time_old × [ (1 − F_enhanced) + F_enhanced / Speedup_enhanced ]

Speedup_overall = Exec_time_old / Exec_time_new
                = 1 / [ (1 − F_enhanced) + F_enhanced / Speedup_enhanced ]
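Amdahl's Law as a small helper (a sketch; `f` is the enhanced fraction F_enhanced of the old execution time, and the 0.8/10× numbers are illustrative):

```python
def speedup_overall(f, speedup_enhanced):
    """Overall speedup when fraction f of the old execution time
    is accelerated by the given factor (Amdahl's Law)."""
    return 1.0 / ((1.0 - f) + f / speedup_enhanced)

# Accelerating 80% of a program by 10x yields well under 10x overall:
s = speedup_overall(0.8, 10)           # 1 / (0.2 + 0.08), about 3.57
# As the enhancement gets arbitrarily fast, the limit is 1 / (1 - f):
limit = speedup_overall(0.8, 1e12)     # approaches 1 / 0.2 = 5
```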
Implications of Amdahl’s Law
The performance improvements provided by a feature are
limited by how often that feature is used
Bottleneck is the most promising target for improvements
⚫ “Make the common case fast”
⚫ Infrequent events, even if they consume a lot of time, will make
little difference to performance
Typical use: Change only one parameter of system, and
compute effect of this change
⚫ The same program, with the same input data, should run on the
machine in both cases
Making the Common Case Fast: Examples
All instructions require an instruction fetch, only a
fraction require a data fetch/store
⚫ Optimize instruction access over data access
“Means” of Performance
                      Computer A   Computer B
Program 1 (seconds)        1           10
Program 2 (seconds)     1000          100
Total time (seconds)    1001          110
Comparing relative performance
• Weighted AM okay for metrics such as the response
time to look at the performance of a single system
• But we normalize to a reference machine when
comparing A and B: A/REF; B/REF
• AM can be inconsistent in this case, as it depends
on the REF machine. See example below.
          Response Time                 Normalized Values
      REF    M1      M2     Ratio     M1      M2      Ratio
P1     10    11      12       –      1.1     1.2       –
P2     40    49.5    60       –      1.24    1.5       –
AM      –    30.25   36     1.19     1.17    1.35     1.16
GM      –    23.33   26.83  1.15     1.167   1.342    1.15
“Means” of Performance
• How about Geometric Mean?
• GM has the following property:
– GM (Xi) / GM (Yi) = GM (Xi/Yi)
– Advantage: Independent of running
times of individual programs and REF
machine
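The property can be checked numerically with the running times from the table above (a sketch):

```python
from math import prod

def gm(xs):
    return prod(xs) ** (1 / len(xs))   # geometric mean

m1 = [11, 49.5]                        # M1's times on P1, P2 (seconds)
m2 = [12, 60]                          # M2's times on P1, P2 (seconds)
ref = [10, 40]                         # reference machine's times

# GM of ratios equals the ratio of GMs:
lhs = gm([a / b for a, b in zip(m2, m1)])
rhs = gm(m2) / gm(m1)
# Normalizing to REF first gives the same ratio, so the comparison of
# M1 and M2 is independent of the reference machine:
norm = gm([a / r for a, r in zip(m2, ref)]) / gm([a / r for a, r in zip(m1, ref)])
```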