Chapter 2-Part 12 1
Intensive Computing
Presented by Dr. A. Djenadi
CHAPTER 2: COMPREHENSIVE PERFORMANCE
ASSESSMENT ACROSS ARCHITECTURES
Chapter objectives
The knowledge provided in this chapter will prove valuable to you, whether you are tasked with choosing a
new system or aiming to enhance the performance of an existing one.
By the end of this chapter, you will have a clear understanding of what to examine in system tuning
reports and how each piece of information contributes to the broader perspective of overall system
performance.
Introduction
The word "architecture" covers all three aspects of computer design: software, instruction set architecture,
and hardware.
[Figure: optimization targets: programming language, compiler, microarchitecture, transistor]
Computer architects must design a computer considering the following aspects:
• Functional requirements
• Trends in technology
• Performance
• Power
• Availability
• Price/Cost goals
Functional requirements
Definition: This refers to the intended functionality and capabilities of the computer system.
• Application area:
o Personal mobile device (Real-time performance, graphics, videos and audio, energy efficiency.)
o Desktop computer (Real-time performance, graphics, videos and audio)
o Servers (Support for databases and transaction processing; enhancements for reliability and
availability; support for scalability).
o Cluster computers (Throughput performance for many independent tasks; error correction for
memory; energy proportionality)
o Internet of things / Embedded computing (Special support for graphics or video, or other
application-specific extensions; power limitations and power control may be required; real-time
constraints)
• Level of software compatibility:
• Floating-point format and arithmetic: IEEE 754 standard, special arithmetic for graphics or signal
processing
• I/O interfaces for I/O devices: Serial ATA, Serial Attached SCSI, PCI Express
• Programming languages: the languages used (ANSI C, C++, Java, Fortran) affect the instruction set
Trends in Technology
Computer architects must stay up to date on rapidly changing implementation technologies, including:
• Integrated circuit logic technology: Transistor density and Increases in die size. However, this increase
• Network technology.
Performance measurement and analysis
Question 1: What does it mean when we say that computer X has better performance than computer Y?
Answer 1: It depends on the perspectives of the users and on both external and internal considerations of
the machine.
The user of a desktop computer may say a computer is faster when a program runs in less time, while a
computer center administrator may say a computer is faster when it completes more transactions per
hour.
• The desktop computer user wants to reduce the response time (execution time) which is defined as the
time between the start and the completion of an event.
• The administrator wants to increase throughput, which is defined as the total amount of
work done in a given time.
• In both cases, the metric used to assess performance is time.
Important: The primary, consistent, and reliable measure of performance is the execution time of
real programs.
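The two viewpoints can be sketched on a small task log. The log below is hypothetical (the times are invented for illustration), and the function names are assumptions rather than part of the course material.

```python
# Hypothetical log of (start, end) times in seconds for four tasks.
tasks = [(0.0, 2.0), (1.0, 4.0), (3.0, 5.0), (5.0, 8.0)]


def response_times(log):
    """Response time of each task: completion time minus start time."""
    return [end - start for start, end in log]


def throughput(log):
    """Tasks completed per second over the whole observation window."""
    window = max(end for _, end in log) - min(start for start, _ in log)
    return len(log) / window


# The desktop user cares about the first number, the administrator about the second.
mean_response = sum(response_times(tasks)) / len(tasks)
print(f"mean response time: {mean_response:.2f} s")
print(f"throughput: {throughput(tasks):.2f} tasks/s")
```

Both quantities are derived from the same time measurements, which is why time remains the underlying metric.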
Time & computer: The clock system
The actions carried out by a processor, such as retrieving an instruction, interpreting the instruction, loading
and storing data and executing arithmetic operations, are controlled by a system clock.
At the most fundamental level, the speed of a processor is dictated by the pulse frequency produced by the
clock, measured in cycles per second, or Hertz (Hz).
Clock signal generation
[Figure: quartz crystal → analog-to-digital conversion]
• The rate of pulses is known as the clock rate, or clock speed (frequency).
• The time between pulses is the cycle time, also called the clock period, clock, or cycle.
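Clock rate and cycle time are reciprocals of each other. A minimal sketch of the conversion, with example frequencies chosen only for illustration:

```python
def cycle_time(clock_rate_hz):
    """Clock period in seconds for a clock rate given in Hz."""
    return 1.0 / clock_rate_hz


# Example clock rates (illustrative values, not taken from the slides).
for rate in (1e6, 1e9, 2.5e9):
    print(f"{rate:.3g} Hz -> {cycle_time(rate) * 1e9:.3g} ns per cycle")
```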
CPU time (Execution time): The Processor Performance Equation
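CPU time (execution time) for a program can be expressed in seconds in two ways:
\[ \text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time (period)} \]
\[ \text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}} \]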
• CPU Time (execution time): This is the total time the CPU spends executing a specific program. It is
often measured in seconds.
• CPU Clock Cycles for a Program: This refers to the number of clock cycles the CPU takes to execute
all the instructions in the program.
• Clock Cycle Time (period): This is the duration of a single clock cycle, measured in seconds. It
represents the time it takes for the CPU to complete one clock cycle.
• Clock rate: This is the clock frequency (the number of clock cycles per second).
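A short sketch of the CPU time calculation from the definitions above, once with the cycle time and once with the clock rate. The cycle count and clock rate are hypothetical values, and the function names are assumptions.

```python
def cpu_time_from_period(clock_cycles, cycle_time_s):
    """CPU time = clock cycles x clock cycle time."""
    return clock_cycles * cycle_time_s


def cpu_time_from_rate(clock_cycles, clock_rate_hz):
    """CPU time = clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


cycles = 4_000_000   # hypothetical cycle count for a program
rate_hz = 2e9        # hypothetical 2 GHz clock

t1 = cpu_time_from_period(cycles, 1 / rate_hz)
t2 = cpu_time_from_rate(cycles, rate_hz)
print(f"{t1 * 1e3:.3f} ms, {t2 * 1e3:.3f} ms")
```

Both calls describe the same quantity, since the cycle time is the reciprocal of the clock rate.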
If we know the number of clock cycles and the instruction count (IC), we can calculate the average
number of clock cycles per instruction (CPI).
Thus, we can use the CPI in the execution time formula (CPU time):
\[ \text{CPU time} = \text{Instruction count} \times \text{Cycles per instruction} \times \text{Clock cycle time} \]
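The intermediate step, written out as a sketch using the symbols IC and CPI defined above:

\[ CPI = \frac{\text{CPU clock cycles for a program}}{IC} \quad\Longrightarrow\quad \text{CPU clock cycles for a program} = IC \times CPI \]
\[ \text{CPU time} = IC \times CPI \times \text{Clock cycle time} = \frac{IC \times CPI}{\text{Clock rate}} \]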
Example 2:
• A program P1 consists of 30 instructions.
• Clock frequency = 1 GHz.
• Number of cycles per instruction = 3 cycles.
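The worked solution is not reproduced in the text. A minimal sketch of the calculation, assuming every one of the 30 instructions takes exactly 3 cycles:

```python
instruction_count = 30   # program P1
clock_rate_hz = 1e9      # 1 GHz
cpi = 3                  # cycles per instruction (assumed identical for all instructions)

clock_cycles = instruction_count * cpi       # total cycles for P1
cpu_time_s = clock_cycles / clock_rate_hz    # CPU time in seconds

print(f"clock cycles = {clock_cycles}")
print(f"CPU time = {cpu_time_s * 1e9:.0f} ns")
```

Under these assumptions, P1 needs 90 cycles, which at 1 ns per cycle gives 90 ns.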
Expressing the formula in terms of units of measurement shows how its components fit together:
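The original figure is not available in the text. The usual dimensional reading of the formula, given here as a sketch, is:

\[ \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Seconds}}{\text{Program}} = \text{CPU time} \]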
Remarks:
• Executing an instruction involves multiple steps, such as retrieving it from memory, decoding, and
performing operations. Thus, most instructions on most processors require multiple clock cycles to
complete. Some instructions may take only a few cycles, while others require dozens.
• On any given processor, the number of clock cycles required varies for different types of instructions, such
as load, store, branch, and so on.
• A straight comparison of clock speeds on different processors does not tell the whole story about
performance.
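The total number of CPU clock cycles can be computed per instruction type:
\[ \text{CPU clock cycles} = \sum_{i=1}^{n} IC_i \times CPI_i \]
where $IC_i$ represents the number of times instruction $i$ is executed in a program and $CPI_i$ represents the
average number of clock cycles per instruction for instruction $i$.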
\[ \text{Global\_CPI} = \frac{\sum_{i=1}^{n} IC_i \times CPI_i}{\text{Instruction count}} = \sum_{i=1}^{n} \frac{IC_i}{\text{Instruction count}} \times CPI_i \]
The overall version of the CPI calculation considers each specific $CPI_i$ and its frequency in a program (i.e.,
$IC_i \div \text{Instruction count}$).
Because it must include pipeline effects, cache misses, and any other memory system inefficiencies, $CPI_i$
should be measured and not just calculated from a table in the back of a reference manual.
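A small sketch of the global CPI calculation, in both of its forms. The instruction mix below is hypothetical (invented counts and per-class CPI values), and the function names are assumptions.

```python
# Hypothetical instruction mix: class -> (instruction count IC_i, CPI_i).
mix = {
    "alu": (50, 1),
    "load": (20, 4),
    "store": (10, 3),
    "branch": (20, 2),
}


def global_cpi(mix):
    """Total clock cycles divided by total instruction count."""
    total_ic = sum(ic for ic, _ in mix.values())
    total_cycles = sum(ic * cpi for ic, cpi in mix.values())
    return total_cycles / total_ic


def global_cpi_by_frequency(mix):
    """Sum of each CPI_i weighted by its frequency IC_i / instruction count."""
    total_ic = sum(ic for ic, _ in mix.values())
    return sum((ic / total_ic) * cpi for ic, cpi in mix.values())


print(global_cpi(mix), global_cpi_by_frequency(mix))
```

With this mix, 100 instructions take 200 cycles, so both forms give a global CPI of 2 (up to floating-point rounding in the second form).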
Performance comparison
We often compare the performance of two different computers, X and Y, by using the assessment “X is faster
than Y”, which means that execution time is lower on X than on Y for the given task.
In particular, "X is n times as fast as Y" means:
\[ \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n \]
Since execution time is the reciprocal of performance, we have the following relationship:
\[ n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{1 / \text{Performance}_Y}{1 / \text{Performance}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y} \]
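The relationship can be checked numerically. The execution times below are hypothetical, and the function names are assumptions made for this sketch.

```python
def times_faster(exec_time_x, exec_time_y):
    """Return n such that X is n times as fast as Y."""
    return exec_time_y / exec_time_x


def performance(exec_time):
    """Performance taken as the reciprocal of execution time."""
    return 1.0 / exec_time


time_x, time_y = 10.0, 15.0   # hypothetical execution times in seconds

n = times_faster(time_x, time_y)
perf_ratio = performance(time_x) / performance(time_y)
print(f"n = {n}, performance ratio = {perf_ratio:.3f}")
```

Here X is 1.5 times as fast as Y, whether the ratio is taken over execution times or over performances.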
The execution time can be replaced by the throughput metric to compare the performance of X and Y
in terms of the amount of work done in a given time. Because higher throughput means better performance,
the ratio is inverted with respect to the execution-time ratio:
\[ n = \frac{\text{throughput}_X}{\text{throughput}_Y} \]
Example:
"The throughput of X is 5.2 times that of Y" means that the number of tasks completed per unit time
on computer X is 5.2 times the number completed on Y.
Remarks:
➢ Execution time is expressed in seconds. It may or may not include instruction processing, memory access,
I/O, interrupts, and operating system overhead.
➢ Throughput is expressed as the number of instructions per second (for a processor) or the number of
queries processed per hour (for a server). Common processor units are MIPS (Million Instructions Per
Second) and MFLOPS (Million Floating-point Operations Per Second).
\[ MIPS = \frac{IC}{\text{CPU time (execution time)} \times 10^6} = \frac{\text{Clock frequency}}{CPI \times 10^6} \]
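Both forms of the MIPS formula can be compared on the same data. The instruction count, CPI, and clock rate below are hypothetical, and the function names are assumptions.

```python
def mips_from_time(instruction_count, cpu_time_s):
    """MIPS = IC / (CPU time x 10^6)."""
    return instruction_count / (cpu_time_s * 1e6)


def mips_from_cpi(clock_rate_hz, cpi):
    """MIPS = clock frequency / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


ic, cpi, rate_hz = 6_000_000, 2.0, 3e9   # hypothetical program and processor
cpu_time_s = ic * cpi / rate_hz          # CPU time from the performance equation

print(mips_from_time(ic, cpu_time_s), mips_from_cpi(rate_hz, cpi))
```

The two results agree because substituting CPU time = IC × CPI / clock rate into the first form cancels the instruction count.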
Reliable benchmarks play a crucial role in cutting through marketing exaggerations and statistical
manipulations. In essence, effective benchmarks help pinpoint systems that deliver optimal performance at a
reasonable cost.
Benchmark type
Flaws and limitations
• The compiler writer and architect can manipulate the test results by making the computer appear faster
on these surrogate programs than on real applications.
• Benchmark-specific compiler flags can be used to improve the performance of a benchmark. These flags
often cause transformations that would be illegal on many programs or would slow down performance
on others.
• In some benchmarks, no source code modifications are allowed.
• In others, source modifications are allowed, as long as the altered version produces the same output.
Better benchmarking solution: benchmark suites
An accepted solution for performance assessment is the use of collections of benchmark applications, called
benchmark suites.
A key advantage of such suites is that the weakness of any one benchmark is lessened by the presence of
the other benchmarks.
SPEC: Standard Performance Evaluation Corporation
The most widely recognized standardized benchmark application suites are those of SPEC (Standard
Performance Evaluation Corporation).
The first version of the benchmark suite was developed in the late 1980s to benchmark workstations. Currently,
there are SPEC benchmarks covering many application classes. All the SPEC benchmark suites and their reported
results are found at http://www.spec.org.
[Figure: active benchmarks from SPEC as of 2017]
Reporting Performance Results
The guiding principle in reporting performance measurements should be reproducibility: report everything
another experimenter would need to replicate the results.
A SPEC benchmark report requires an extensive description of the computer and the compiler flags, as well
as the publication of both the baseline and the optimized results.
Alongside hardware, software, and baseline tuning details, a SPEC report includes performance times
displayed in tables and graphs.
SPEC results comparison: SPECRatio
SPEC normalizes execution times to a reference computer by dividing the time on the reference
computer by the time on the computer being rated, yielding a ratio proportional to performance. This ratio
is called the SPECRatio.
For example, suppose that the SPECRatio of computer A on a benchmark is 2.56 times that of computer
B; then we know
\[ 2.56 = \frac{SPECRatio_A}{SPECRatio_B} = \frac{\text{Execution time}_{reference} / \text{Execution time}_A}{\text{Execution time}_{reference} / \text{Execution time}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A} = \frac{\text{Performance}_A}{\text{Performance}_B} \]
Note: The choice of the reference computer is irrelevant when the comparisons are made as a ratio.
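The note can be illustrated numerically. The execution times and reference times below are hypothetical, and `spec_ratio` is an illustrative helper, not an official SPEC tool.

```python
def spec_ratio(reference_time, measured_time):
    """Time on the reference computer divided by time on the rated computer."""
    return reference_time / measured_time


time_a, time_b = 100.0, 256.0   # hypothetical execution times in seconds

# Two different (hypothetical) reference machines give the same comparison.
ratios = []
for reference_time in (1000.0, 5000.0):
    ratio = spec_ratio(reference_time, time_a) / spec_ratio(reference_time, time_b)
    ratios.append(ratio)
    print(f"reference = {reference_time:.0f} s -> SPECRatio_A / SPECRatio_B = {ratio:.2f}")
```

The reference time cancels in the quotient, so only the ratio of the two measured times remains.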
After choosing a benchmark suite, the performance results of the suite are summarized in a single number:
the geometric mean of the SPECRatios of the programs in the suite.
\[ \text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{Sample}_i} \]
where $\text{Sample}_i$ is the SPECRatio of program $i$.
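A minimal sketch of the geometric mean. The SPECRatio values are hypothetical, and the logarithm-based formulation is an implementation choice for this sketch (it assumes all samples are positive).

```python
import math


def geometric_mean(samples):
    """n-th root of the product, computed through logarithms so the raw product cannot overflow."""
    return math.exp(sum(math.log(s) for s in samples) / len(samples))


spec_ratios = [2.0, 8.0, 4.0]   # hypothetical SPECRatios of three programs
gm = geometric_mean(spec_ratios)
print(f"geometric mean = {gm:.3f}")
```

For these values the product is 64 and its cube root is 4. Recent Python versions also provide `statistics.geometric_mean` for the same purpose.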
Why use the geometric mean:
1. The geometric mean of the ratios is the same as the ratio of the geometric means.
2. The ratio of the geometric means is equal to the geometric mean of the performance ratios, which implies
that the choice of the reference computer is irrelevant.
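Both properties follow from the fact that an n-th root distributes over a quotient. A short derivation, using the subscript $i$ for the $i$-th program of the suite (notation introduced here for illustration):

\[ \sqrt[n]{\prod_{i=1}^{n} \frac{SPECRatio_{A,i}}{SPECRatio_{B,i}}} = \frac{\sqrt[n]{\prod_{i=1}^{n} SPECRatio_{A,i}}}{\sqrt[n]{\prod_{i=1}^{n} SPECRatio_{B,i}}} = \sqrt[n]{\prod_{i=1}^{n} \frac{\text{Execution time}_{B,i}}{\text{Execution time}_{A,i}}} \]

The reference execution times appear in both the numerator and the denominator of each factor and cancel, so they do not influence the comparison.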
Performance enhancement: Amdahl’s Law
Objective: enhancing the performance by improving a portion of a computer.
Definition: Amdahl’s Law states that the performance improvement to be gained from using some faster
mode of execution is limited by the fraction of the time the faster mode can be used.
Speedup: Amdahl’s Law defines the speedup that can be gained by using a particular feature. Speedup is
the ratio given by:
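\[ \text{Speedup} = \frac{\text{Performance for entire task using the enhancement when possible}}{\text{Performance for entire task without using the enhancement}} \]
Or, as a function of the execution times:
\[ \text{Speedup} = \frac{\text{Execution time for entire task without using the enhancement}}{\text{Execution time for entire task using the enhancement when possible}} \]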
Amdahl’s Law factors:
$Fraction_{enhanced}$: The fraction of the computation time in the original computer that can be converted to take
advantage of the enhancement. This value is always less than or equal to 1.
Example: If 20 seconds of the execution time of a program that takes 60 seconds in total can use an
enhancement, the fraction is 20/60.
$Speedup_{enhanced}$: The improvement gained by the enhanced execution mode. This value is the time of the
original mode over the time of the enhanced mode. This value is always greater than 1.
Example: If the enhanced mode takes 4 seconds for a portion of the program, while it takes 40 seconds in the
original mode, the improvement is 40/4 or 10.
The new enhanced execution time
The execution time using the original computer with the enhanced mode will be the time spent using the
unenhanced portion of the computer plus the time spent using the enhancement:
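\[ \text{Execution time}_{new} = \text{Execution time}_{old} \times \left( (1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}} \right) \]
\[ Speedup_{overall} = \frac{\text{Execution time}_{old}}{\text{Execution time}_{new}} = \frac{1}{(1 - Fraction_{enhanced}) + \dfrac{Fraction_{enhanced}}{Speedup_{enhanced}}} \]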
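A minimal sketch of Amdahl's Law as two functions. The function names and the sample values (a 60-second program, half of which can be made twice as fast) are assumptions chosen for illustration.

```python
def new_execution_time(old_time, fraction_enhanced, speedup_enhanced):
    """Unenhanced part of the time plus the enhanced part divided by its speedup."""
    return old_time * ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Old execution time over new execution time, per Amdahl's Law."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


old_time = 60.0                                    # hypothetical, in seconds
new_time = new_execution_time(old_time, 0.5, 2.0)  # half the time runs twice as fast
print(f"new time = {new_time:.1f} s, overall speedup = {overall_speedup(0.5, 2.0):.3f}")
```

In this case the new time is 45 seconds: doubling the speed of half the program yields an overall speedup of only about 1.33.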
Example: Amdahl’s Law
Suppose that we want to enhance the processor used for web serving. The new processor is 10 times faster
on computation in the web serving application than the old processor. Assume that the original processor is
busy with computation 40% of the time and is waiting for I/O 60% of the time. Only the computation benefits
from the enhancement, so:
\[ Fraction_{enhanced} = 0.4, \qquad Speedup_{enhanced} = 10 \]
\[ Speedup_{overall} = \frac{1}{0.6 + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56 \]
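A useful corollary, added here as a sketch: even if the computation became arbitrarily fast, the 60% of the time spent waiting for I/O would remain, which bounds the overall speedup:

\[ \lim_{Speedup_{enhanced} \to \infty} \frac{1}{0.6 + \dfrac{0.4}{Speedup_{enhanced}}} = \frac{1}{0.6} \approx 1.67 \]

A tenfold faster processor therefore already achieves most of the gain available for this workload.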