Microprocessor Book
TALLINN UNIVERSITY OF TECHNOLOGY
Department of Computer Engineering
MICROPROCESSOR SYSTEMS I
Microprocessor Systems Architecture
IAY 0021
Lecture Notes
Arvo Toomsalu
Tallinn
© Arvo Toomsalu
All rights reserved, except as permitted under the Estonian Copyright Act.
MICROPROCESSOR SYSTEM ARCHITECTURE
System Description (AIR Principle)
[Figure: system description levels of a microprocessor system (MPS). The architecture level answers the question WHAT, for example:]
Instruction set;
Memory management and protection;
Interrupts;
Floating-point standard, etc.
[Figure: development of microprocessor speed (V) over time (t).]
Forces on the microprocessor system (computer) architecture are:
1. Applications;
2. Technology;
3. Operating systems;
4. Programming languages;
5. History.
► Microprocessor Families
[Figure: Intel microprocessor families.
Architecture (instruction set definition and compatibility): IA-32 (Instruction Architecture), EPIC (Explicitly Parallel Instruction Computing), IXA (Internet Exchange Architecture).
Microarchitecture (hardware implementation maintaining instruction set compatibility with the high-level architecture): P5, P6, NetBurst, Mobile.]
The AIR principle uses only three system description levels, but for a more precise description more
description levels are used, for example:
Moving from the first description level to the last, we can see how the behavior of the
microprocessor system is transformed into the real hardware and software structure.
Architecture Taxonomies
Classical Architectures
Characteristic features:
1. Integral processing unit;
2. Unified data and instruction memory;
3. Unified bus between CPU and memory;
4. Centralized control;
5. Low-level programming languages;
6. Linear memory addressing.
Harvard Architecture
Characteristic features:
1. Instruction and data memories occupy different address spaces;
2. Instruction and data memories have separate buses to the CPU (A);
3. Instruction and data memories can be implemented in different ways.
Multi-microprocessor system - a system that uses more than one microprocessor to perform
a desired application.
Multiprocessor systems were designed for at least one of two reasons:
1. Speed-up program;
2. Fault tolerance.
The classification of computer architectures based on the notions of instruction and data streams is
Michael Flynn's (1972) classification. Flynn's classification stresses the architectural relationship
at the memory-processor level. Other architectural levels are overlooked.
Flynn-Johnson Classification:
SISD – single instruction stream, single data stream;
MISD – multiple instruction streams, single data stream;
SIMD – single instruction stream, multiple data streams => data-level parallelism;
MIMD – multiple instruction streams, multiple data streams => thread-level parallelism.
Instruction Set Architecture
In the past, the term computer architecture referred only to instruction set design.
The term instruction set architecture (ISA) refers to the actual programmer-visible instruction set.
ISA enables different implementations of the same architecture, but it may prevent the use of new
innovations.
[Figure: operand locations in different ISA structures – register file (RGF), ALU, stack top, operands from memory.]
Architectures Taxonomy by Milutinovic
(1989)
Summary
Comparison of Different ISA Structures
Number of memory accesses   Maximum number of operands allowed   Architecture      Examples
3                           3                                    memory-memory     VAX
Advantages and Disadvantages of Different ISA Structures
Operating System Management Functions
Operating system (OS) controls the execution of programs on a processor and manages the
processor’s resources.
[Figure: software/hardware levels – application programs; high-level languages (machine-independent HLL); assembly language (machine-specific LLL); machine language; microprogram control; hardware.]
The main functions of OS are process (task) scheduling //plaanimine// and memory management
//mäluhaldus//.
The OS determines which process should run at any given time.
During the lifetime of a process, its state will change a number of times:
New > Ready > Running > Waiting > Halted.
The main tasks in memory management are swapping //saalimine// and partitioning
//sektsioneerimine// (fixed-size partitions or variable-size partitions).
Generic Architecture of Operating System
The kernel is the basic set of computing functions needed for an operating system. The kernel
contains the interrupt handler, the task manager, and the inter-process communication manager.
It may also contain the virtual memory manager and the network subsystem manager.
The services provided by an operating system can be organized in categories:
1. Task control,
2. File manipulation,
3. Device control,
4. Information maintenance.
Architecture of UNIX
A process may start threads or other processes, but a thread cannot start a process.
The information needed to keep track of a process when switching is kept in a data package
called a process control block (PCB). The PCB contains information about:
1. An ID number that identifies the process;
2. The priority of the process;
3. Pointers to the locations in the program and its data where processing last
occurred;
4. Pointers to the upper and lower bounds of memory required for the process;
5. States of flags and switches;
6. A list of files opened by the process;
7. Register contents;
8. The status of all I/O devices needed by the process.
In many OSs processes can subdivide themselves into more easily managed subunits called
threads.
A thread is a portion of a process that can be scheduled and executed independently.
The thread can deal with all the CPU-intensive work of a normal process, but generally does not
deal with the various types of I/O operations and does not require the extensive process
control block of a regular process.
Thread levels
• User-level threads or ULT (thread libraries)
• Kernel-level threads or KLT (system calls)
[Figure: thread states – a thread that has requested a resource is blocked until the resource is allocated.]
real-time scheduling).
b. Once dispatched, a thread has entered the running state and retains control of the CPU until one
of the following events:
1. The thread or its parent process terminates normally;
2. An interrupt occurs.
c. The most common method is for the process or thread to execute an exit service call. The exit
service call triggers a software interrupt.
d. When any interrupt is received, the CPU automatically suspends the currently executing thread,
pushes current register values onto the stack, and transfers control to the OS master interrupt
handler. The suspended thread remains on the stack until interrupt processing is completed.
e. During this period of time the thread is blocked. Once the interrupt has been processed, the OS
can:
1. Leave the suspended thread in the blocked state,
2. Move it to the ready state,
3. Return it to the running state.
A process or program that divides itself into multiple threads is called multithreaded.
OS capabilities
MS-DOS supports a single user process and a single thread;
UNIX supports multiple user processes but only one thread per process;
Solaris supports multiple threads (multithreading).
Compiler
A compiler is a program that translates a higher-level programming language into machine language.
Compiler
1. Minimizes the number of operations;
2. Replaces expensive operations with simpler ones (see the sketch after this list);
3. Minimizes cache misses;
4. Minimizes object code size.
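As an illustration of point 2, a strength-reduction transformation can be sketched in C. The functions and values below are hypothetical examples, not code from these notes:

    /* Before optimization: one multiplication per loop iteration. */
    void fill_naive(int *a, int n, int step)
    {
        for (int i = 0; i < n; i++)
            a[i] = i * step;            /* multiply on every iteration */
    }

    /* After strength reduction: the multiply is replaced by a running addition. */
    void fill_reduced(int *a, int n, int step)
    {
        int value = 0;
        for (int i = 0; i < n; i++) {
            a[i] = value;               /* same values as i * step */
            value += step;              /* cheaper add replaces the multiply */
        }
    }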
Compilers usually decompose programs into their basic blocks (the first step in the analysis
process).
[Figure: structure of an optimizing compiler. Source code (Fortran, C (C++), Verilog) is parsed by the front end into an abstract syntax tree; a machine-independent optimizer works on the intermediate form (high-level optimizations: parallelization, loop transformations; low-level optimizations: redundancy elimination); the back end emits object code for the target (IA-64, SPARC, POWER).]
The general translation and interpretation steps that have to be taken in order to execute a program
written in a HLL are:
• Frontend compilation;
• Determine dependences and produce data and control dependence graphs;
• Partitioning the graph;
• Bind partitions to nodes;
• Bind operands to locations;
• Bind operands to time slots;
• Bind operations to FUs;
• Bind transports to buses;
• Execute the operations and perform the transports.
The compiler significantly affects the performance of a microprocessor system. A modern
optimizing compiler consists of a number of passes. Three classes of optimizations are performed:
1. Local optimizations work within a single basic block;
2. Global optimizations work across multiple basic blocks;
3. Global register allocation allocates variables to registers for regions of the code.
PERFORMANCE MEASURING
Comparing design alternatives is often related to performance (P). In comparing systems X
and Y we can state that “Y is n times faster than X” if:
n = teX / teY
The execution time is reciprocal to the performance, i.e.
te = 1 / P,
P = 1 / te,
n = (1 / PX) / (1 / PY) = PY / PX , or
PY = n × PX
The number of tasks completed per unit time on system Y is n times the number completed on X.
does not yet exist. A secondary application is the projection of the performance of a given system
on a new workload //koormus//.
For performance projection, different performance modeling methods are used – simulation and
analytical modeling.
3. Performance monitoring provides data on the actual performance of an existing system.
No single measure of performance can give a truly accurate measure of MPS’s performance!
In real terms it can be defined only together with the user program.
The most commonly quoted measure of performance is peak performance – the maximum rate at
which some operation can be executed.
Peak performance is based on the clock rate of the processor and on the minimum number of
cycles (clocks) per instruction (CPI) attainable.
The peak performance ignores the fact that there is a mix of CPI values that depend on the
instruction set, the cache behavior and the proportions in which instructions are executed.
The clock rate is the second most often quoted performance figure.
It is only a bit more meaningful than peak performance.
PCPU = IPC × IC^(-1) × f
PCPU = CPI^(-1) × IC^(-1) × f
The clock rate (f) depends on the technology used to implement the processor.
The average IPC depends on processor microarchitecture.
The instruction count (IC) depends on instruction set architecture, compiler, and the
operating system.
Average IPC reflects the average instructions throughput achieved by the processor and is a
key measure of microarchitecture effectiveness.
The use of CPI was popular during the days of scalar pipelined processors.
For superscalar processors, it becomes more convenient to use IPC.
An ideal scalar processor pipeline would execute instructions at a rate of one per cycle, resulting
in a core CPI = 1. This CPI assumes that all memory references hit in the cache.
Remark
1. MIPS = Clock rate (f) / (CPI × 10^6)
2. MIPS are misleading because the work done by an instruction varies.
3. FLOPS, LIPS, COPS are more meaningful because they specify a particular
operation, but they are often calculated.
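A small C sketch of these relations; the clock rate, CPI and instruction count below are assumed values for illustration only:

    #include <stdio.h>

    int main(void)
    {
        double f   = 2.0e9;    /* clock rate, Hz (assumed)                   */
        double cpi = 1.25;     /* average clocks per instruction (assumed)   */
        double ic  = 5.0e9;    /* instruction count of the program (assumed) */

        double p_cpu = f / (cpi * ic);     /* PCPU = CPI^-1 * IC^-1 * f      */
        double mips  = f / (cpi * 1e6);    /* MIPS = f / (CPI * 10^6)        */
        double t_e   = (ic * cpi) / f;     /* execution time te = IC*CPI / f */

        printf("PCPU = %g 1/s, MIPS = %g, te = %g s\n", p_cpu, mips, t_e);
        return 0;
    }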
Rating Technology
The application ratings //reiting, tootlus// are based on a comparison of workload run times
between the system being tested and a fixed calibration platform.
A rating of 100 indicates that the system has performance equal to that of the calibration platform,
200 indicates twice the performance of the calibration platform.
Performance-analysis Technologies
o Bottleneck //kitsaskoht// - the bottleneck is the resource with the lowest maximum performance.
o Latency //latentsusaeg// - delay between start and end of operation
o Response time //reaktsiooniaeg// - delay between request and response.
o Throughput //läbilaskevõime, jõudlus// – the amount of work done per unit time,
(Capacity) or the rate at which new results arrive.
o Bandwidth //ribalaius// - used to describe the theoretical maximum throughput
of a data link or other device.
o Availability //käideldavus// - fraction of time that a resource is available for use.
o Utilization //kasutatavus// - fraction of time a resource is busy.
Metric is used only inside a system.
o Accuracy //täpsus// - indicates how closely performance measurement results obtained when using
the technique correspond to the results that would have been obtained on
a real system.
Benchmarks
Benchmark – a program or program fragment and related data inputs that are executed to test
the performance of a hardware component, a group of hardware or software
components, or an entire computer system or network.
The goal of benchmarking is to predict the performance of a workload on a particular
platform.
Workload – the amount of work which a computer has to do.
There are two ways in which workloads can be constructed:
1. Model workloads
2. Synthetic workloads
Benchmark Types
Benchmarks can be designed to focus on performance of individual components (CPU,
cache, memory, I/O, graphics, and network resources), or on application (or system)
performance:
1. Component benchmarks (microbenchmarks)
They measure the performance of a single aspect of a system.
2. Application benchmarks (macrobenchmarks)
They evaluate the performance of many system components working together to produce
the throughput that a user perceives.
Real programs (representative or real workload);
Kernels (representative program fragments);
Mixes (instruction frequency of occurrence);
Synthetic benchmarks (programs intended to provide a specific mix).
Benchmark sources
1. Vendor specific benchmarks (iCOMP)
2. User organizations who share their benchmarks (Linpack)
3. Industry organizations (SPEC, EEMBC)
(EEMBC – Embedded Microprocessor Benchmark Consortium)
Benchmark Examples
1. MIPS
MIPS is closely related to the clock rate of the processor. Using MIPS as a tool to
determine the relative performance of different computers is very dangerous, so:
MIPS are only useful for comparison between two processors from the same vendor that
support the same instruction set with the same compilers.
2. Linpack
The Linpack benchmark uses a set of high performance library routines for linear algebra.
These routines are used to measure the time it takes to solve a dense system of linear
equations.
The benchmark reports average megaflops rates by dividing the total number of floating-
point operations by time. There are several versions of the Linpack benchmark, differing in
size, numerical precision and ground rules.
3. Whetstone
The first synthetic benchmark program that was designed for performance testing.
The program was designed primarily to measure floating-point performance.
Performance is quoted in MWIPS (millions of Whetstone instructions per second).
4. Dhrystone
It is a synthetic integer performance benchmark that was developed in 1984. Benchmark
contains less than 100 HLL statements, compiling to 1-1,5 kB of code.
5. HINT
HINT (Hierarchical INTegration) is a computer benchmark that ranks a computer
system as a whole.
Most benchmarks measure either the number of operations in a given time period, or the
time required to perform a given fixed calculations. HINT does neither; rather it performs
a particular calculation with ever-increasing accuracy.
The accuracy of the result at any given time is called the “Quality”. HINT measures the
improvement of quality at any given time as “Quality Improvements Per Second” or
QUIPS.
HINT rating is a function of raw CPU processing power, L1 and L2 cache size and
speed, and main-memory bandwidth. HINT allows comparisons over extreme variations in
computer architecture, absolute performance, storage capacity and precision.
6. SPEC Benchmarks
A group called the Systems Performance Evaluation Co-operative (SPEC) or Standard
Performance Evaluation Corporation formed to establish and maintain a standard set of
relevant benchmarks that can be applied to the newest generation of high
performance computers. The first SPEC benchmark suite was called SPEC89.
It consisted of the geometric mean of the runtimes for 10 programs in the suite.
The benchmark did not differentiate between integer and floating-point performance.
The SPEC92 benchmark has 20 programs in the CPU benchmark suite.
SPEC92 is reported in two summary figures – floating-point (SPECfp92) and integer
(SPECint92).
The SPEC95 is an improvement of the SPEC92 benchmark.
In addition to the SPECfp95 and SPECint95 numbers, there are several related
benchmark suites (SPECint_base95, SPECint_rate95).
SPEC CPU2000 benchmark suite
The SPEC CPU 2000 suite consists of integer (SPECint2000) and floating-point
(SPECfp2000) benchmark suites. The first one consists of 12 tests in C and C++. The
second one has 14 tests (Fortran-90, Fortran-77, and C). The CPU tests are designed to test
three parameters: the CPU, the memory hierarchy and the compilers.
The performance is stated relative to a reference machine, a 300-MHz Sun Ultra5-10,
which gets a score of 100.
SPEC CPU 2006
SPEC CPU2006 includes two benchmark suites: CINT2006 for measuring compute-
intensive integer performance and CFP2006 for compute-intensive floating point
performance.
The CINT2006 suite includes 12 application-based benchmarks written in C and C++
languages. CFP2006 includes 17 CPU-intensive benchmarks written in C, C++,
FORTRAN and a mixture of C and FORTRAN.
Performance metrics within SPEC CPU2006 measure system speed and throughput.
The speed metrics, SPECint2006 and SPECfp2006, compare the ability of a computer to
complete single tasks.
The throughput metrics, SPECint_rate2006 and SPECfp_rate2006, measure the rate at
which a computer can complete a number of tasks in a given amount of time.
The reference machine for SPEC CPU2006 benchmark is a Sun Ultra Enterprise 2
workstation with a 296-MHz UltraSPARC II processor.
TPC-C
TPC-C is a benchmark for OLTP. It models simple order-entry applications. TPC-C
performance is measured in new-order transactions per minute (tpmC).
TPC-H
The TPC-H is a decision support system (DSS) benchmark. It consists of a suite of business-oriented queries and
concurrent data modifications. The performance metric is the TPC-H Composite Query-
per-Hour Performance Metric (QphH@Size), which reflects multiple aspects of the
capability of the system to process queries. The TPC-H Price/Performance metric is
expressed as $/QphH@Size.
Performance Metrics
Not-so-good performance metrics: clock rate, MIPS, MFLOPS.
Better performance metrics: SPEC, QUIPS, execution time.
Popular benchmarks typically reflect yesterday’s programs, but microprocessor systems need to be
designed for tomorrow’s programs.
Example
We are going to run three benchmark programs A, B, C on brands X, Y and Z. The results are
given in time units.
Accumulated Benchmark Time
Brand X
1,04 x 0,65 + 1,00 x 0,25 + 1,00 x 0,1 = 1,03
Brand Y
1,20 x 0,65 + 1,16 x 0,25 + 1,10 x 0,1 = 1,18
Brand Z
1,00 x 0,65 + 1,03 x 0,25 + 1,00 x 0,1 = 1,01
Beware pitfall!
Tam = (1/n) × ∑ tei, i = 1 … n.
The result will depend on the choice of the computer we use as the reference.
For example in the next case the program execution times are normalized to both A and B computers, and
the arithmetic mean is computed.
When we normalize to A, the arithmetic mean indicates that A is faster than B by 5.05 times, which is the
inverse ratio of the execution times.
When we normalize to B, we can conclude, that B is faster by exactly the same ratio.
[Table: program execution times normalized to computers A and B; only the first row survives: Program 1 – 1, 10, 1, 10, 0,1, 1.]
Because we want to determine the computers' relative performance, we should use an average
that is based on multiplicative operations. The geometric mean (Tgm) is such an average:
Tgm = (∏ tei)^(1/n), i = 1 … n;
Tam ≥ Tgm.
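A short C sketch of both means. The execution times are assumed values chosen to be consistent with the 5,05 ratio mentioned above (1 and 1000 time units on A, 10 and 100 on B); it shows that only the geometric mean gives the same ranking regardless of the reference computer:

    #include <math.h>
    #include <stdio.h>

    /* Arithmetic mean Tam = (1/n) * sum(te_i). */
    static double t_am(const double *te, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += te[i];
        return s / n;
    }

    /* Geometric mean Tgm = (prod(te_i))^(1/n). */
    static double t_gm(const double *te, int n)
    {
        double p = 1.0;
        for (int i = 0; i < n; i++)
            p *= te[i];
        return pow(p, 1.0 / n);
    }

    int main(void)
    {
        double ta[] = { 1.0, 1000.0 };   /* times on computer A (assumed) */
        double tb[] = { 10.0, 100.0 };   /* times on computer B (assumed) */

        /* Times of B normalized to A, and of A normalized to B. */
        double b_over_a[] = { tb[0] / ta[0], tb[1] / ta[1] };   /* 10, 0.1 */
        double a_over_b[] = { ta[0] / tb[0], ta[1] / tb[1] };   /* 0.1, 10 */

        printf("Tam(B/A)=%g  Tam(A/B)=%g\n",
               t_am(b_over_a, 2), t_am(a_over_b, 2));  /* both 5.05 - contradictory */
        printf("Tgm(B/A)=%g  Tgm(A/B)=%g\n",
               t_gm(b_over_a, 2), t_gm(a_over_b, 2));  /* both 1.0 - consistent     */
        return 0;
    }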
Performance Measurement
Performance measurement can be done via the following means:
If the absolute drop-dead project performance goal is 1,3× some existing target, then the early
product concept microarchitecture ought to be capable of some much higher number like 1,8×.
Dave Sager, one of the principal architects of the Pentium 4.
Some Benchmarking Mistakes
1. Only average behavior represented in test workload;
2. Caching effects ignored;
3. Buffer sizes not appropriate;
4. Ignoring monitoring overhead;
5. Not ensuring same initial conditions;
6. Not measuring transient (cold start) performance;
7. Using device utilizations for performance comparisons;
8. Collecting too much data but doing too little analysis, etc.
Summary
Purpose of evaluation:
selection – new HW, new SW;
performance evaluation – design new HW, design new SW;
performance projection – reconfigure HW;
performance monitoring – change SW.

Evaluation technique    New HW   New SW   Design new HW   Design new SW   Reconfigure HW   Change SW
Instruction mixes       C        ─        C               ─               ─                ─
Kernel programs         B        C        B               C               ─                C
Analytic models         B        C        B               C               B                ─
Benchmarks              A        A        ─               B               B                B
Synthetic programs      A        A        B               B               B                B
Simulation              A        A        A               A               A                A
HW & SW monitoring      B        B        B               B               A                A
A – satisfactory;
B – provides some assistance but is insufficient. It should be used in conjunction with
other techniques;
C – has been used but is inadequate;
“─“ the technique is not applicable.
AMDAHL’S LAW
S => (Performance for entire task using the enhancement) / (Performance for entire
task without using the enhancement)
S => (Execution time for entire task without using the enhancement) / (Execution time
for entire task using the enhancement)
Speedup Factors
1. Fraction enhanced – FE or fx
FE (fx) is the fraction of the computation time in the original system that can be converted to take
advantage of the enhancement¹.
1 ≥ fx
2. Speedup enhanced – SE or Sx
Sx > 1
Speedup overall – SU
Execution Time
Execution time old
Execution time new
The new execution time is equal to the time spent using the unenhanced portion of task plus the
time spent using the enhancement.
¹ Enhancement – a substantial increase in the capabilities of hardware or software.
Execution time new = Execution time old × [(1 − Fraction enhanced) + Fraction enhanced / Speedup enhanced] =
= (Execution time old × Fraction old) + (Execution time old × Fraction enhanced / Speedup enhanced)
SU = [(1-fx)+fx/Sx]^(-1)
Sx = (fx×SU)/(SU×(fx-1)+1)
fx = (Sx×(1-SU))/(SU×(1-Sx))
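A minimal C sketch of these three forms; fx and Sx below are assumed example values:

    #include <stdio.h>

    /* Overall speedup SU from the enhanced fraction fx and its speedup Sx. */
    static double su(double fx, double sx)
    {
        return 1.0 / ((1.0 - fx) + fx / sx);
    }

    int main(void)
    {
        double fx = 0.6, sx = 4.0;                 /* assumed example values */
        double s  = su(fx, sx);                    /* SU = [(1-fx)+fx/Sx]^-1 */

        /* The inverted forms above reproduce the inputs. */
        double sx_back = (fx * s) / (s * (fx - 1.0) + 1.0);
        double fx_back = (sx * (1.0 - s)) / (s * (1.0 - sx));

        printf("SU = %.4f (Sx back = %.4f, fx back = %.4f)\n", s, sx_back, fx_back);
        return 0;
    }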
Conclusions
Amdahl’s Law can serve as a guide to how much an enhancement will improve performance
and how to distribute resources to improve cost-performance.
The comparison is valid only if:
a. Both microprocessors (A and B) have the same clock cycle time, and
b. Both microprocessors run the same number of instructions (the same instructions in test
program).
This assumption generally holds for the complete execution of single-threaded, uniprocessor, user-
level programs. Multithreaded programs present a more complicated problem.
Speedup = 1 / ((1 − ∑i fi) + ∑i (fi / Si))
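The same idea extends to several enhancements, as in the formula above; a small C sketch with illustrative fractions and speedups (the fi must refer to non-overlapping parts of the original execution time):

    #include <stdio.h>

    /* Speedup = 1 / ((1 - sum(fi)) + sum(fi / Si)) */
    static double su_multi(const double *f, const double *s, int n)
    {
        double f_sum = 0.0, slowed = 0.0;
        for (int i = 0; i < n; i++) {
            f_sum  += f[i];
            slowed += f[i] / s[i];
        }
        return 1.0 / ((1.0 - f_sum) + slowed);
    }

    int main(void)
    {
        double f[] = { 0.3, 0.2 };   /* illustrative fractions        */
        double s[] = { 5.0, 2.0 };   /* speedups of those fractions   */
        printf("Speedup = %.3f\n", su_multi(f, s, 2));   /* 1/(0.5+0.06+0.1) = 1.515 */
        return 0;
    }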
MICROPROCESSORS TAXONOMY
[Figure: microprocessor evolution from the 4004 (1971) through CISC and RISC families, microcontrollers (MCU), to superscalar designs (~1995).]
I During the first decade (1971-1980), the advent of the 4-bit microprocessors led to the
introduction of the 8-bit microprocessors. The 8-bit microprocessors became the heart of the
simple personal computers.
II The second decade (1981-1990) witnessed major advances in the architecture and
microarchitecture of 16-bit and 32-bit microprocessors. Instruction set design issues became the
focus of research. Instruction pipelining and cache memories became standard microarchitecture
techniques.
III The third decade (1991-2000) was characterized by extremely aggressive microarchitecture
techniques to achieve very high levels of performance. Deeply pipelined machines capable of
achieving extremely high clock frequencies and sustaining multiple instructions executed per cycle
became popular. Out-of-order execution of instructions and aggressive branch prediction
techniques were introduced to avoid or reduce the number of pipeline stalls.
IV In the fourth decade (2001-2010?) the focus on instruction-level parallelism (ILP) will
expand to include thread-level parallelism (TLP) as well as memory-level parallelism (MLP).
Architectural features that historically belong to large systems (multiprocessors and memory
hierarchies), will be implemented on a single chip. Many traditional macroarchitecture issues will
now become microarchitecture issues.
Power consumption will become a dominant performance impediment.
Evolution of Microprocessors
(by J. P. Shen and M. H. Lipasti)
There are different ways to describe microprocessor structures. Each classification
holds an abundant variety of processor organizations.
The simplest way to classify microprocessor structures may be the following:
By input signals
1. Digital MP;
2. Analog MP:
DSP;
Media processor;
By timing
1. Synchronous MP;
2. Asynchronous MP (internally).
By purpose
1. Universal MP;
2. Special-purpose MP:
[Microcontrollers].
By resources
1. One level;
2. Multilevel.
Comment
The MCU (microprogrammed control unit) scheme is more flexible than the HCU (hardwired control unit).
The meaning of an instruction can be changed by changing the microinstruction sequence corresponding to the instruction. The instruction set can be
extended simply by including a new ROM (a new ROM content) containing the corresponding
microoperation sequences. Hardware changes to the CU are minimal.
In an HCU any change to the instruction set requires substantial changes to the hardwired logic,
but HCUs are generally faster than MCUs and are used where the CU must be fast.
Internal Microprocessor Implementation Models
[Figure: combinations of execution units (EU 1 … EU n), control units (CU 1 … CU n), functional units (FUx) and input/output units (IOU1 … IOUn) connected by an internal bus; multi-core microprocessors such as the quad-core Intel Xeon and the eight-core Sun OpenSPARC T2.]
CPU microarchitecture
Internal bus organization
[Figure: internal data bus organization alternatives.
A – a single n-bit data bus (DB1), instruction executed in 3 phases (CISC microprocessors: Intel 8080, Zilog Z80, Motorola M6800); ACC, RG (TEMP), RGF, RGf, operation code (OC) with one memory operand (ADDR /mem/), I/O data.
B – two data buses (DB1, DB2), 2 phases (CISC microprocessors: Intel 8086, Motorola 68000); two-address format OC ADDR1 ADDR2 (/RG/, /RG or mem/), RGF, RGf, ACC, I/O data.
C – three data buses (DB1, DB2, DB3, width n or 2n) with a multiported register file; three-address format OC ADDR1 ADDR2 ADDR3 (RD, SR1, SR2).]
Multiport register file organization
[Figure: a k × n-bit register file connected to the data buses DB1–DB3 through multiplexers (Mux) and demultiplexers (DMux) with control inputs C and outputs Y1–Y5.]
Example
The speed-up (%Su) to be expected from the decrease in CPI period due to the different data
path’s bus-system organization is:
%Su = ((Tbi − Tb(i+1)) / Tb(i+1)) × 100, where
Tb = IC × CPI × t, where
IC – instruction count;
t – clock cycle time;
Assume that IC will not change (IC = const.) when going from the i-bus to the (i+1)-bus system
in the microprocessor’s microarchitecture.
CPI count: 1-bus – CPI = 3; 2-bus – CPI = 2; 3-bus – CPI = 1.
%Su i>(i+1) = ((IC × CPIi × ti − IC × CPI(i+1) × ti) / (IC × CPI(i+1) × ti)) × 100
%Su 1>2 = ((3 × t1 − 2 × t1) / (2 × t1)) × 100 = (1/2) × 100 = 50%
%Su 2>3 = ((2 × t1 − 1 × t1) / (1 × t1)) × 100 = 1 × 100 = 100%
%Su 1>3 = ((3 × t1 − 1 × t1) / (1 × t1)) × 100 = 2 × 100 = 200%
As the 2-bus or 3-bus structure increases the clock cycle period (approximately by 10%), as a result of
increased signal propagation time over the additional buses, then
t2 = 1,1 × t1,
t3 = 1,1 × t2 = 1,21 × t1.
and
%Su 1>2 = ((3 × t1 − 2 × t2) / (2 × t2)) × 100 = ((3 × t1 − 2 × 1,1 × t1) / (2 × 1,1 × t1)) × 100 = 36,3%
%Su 2>3 = ((2 × t2 − 1 × t3) / (1 × t3)) × 100 = ((2 × t2 − 1 × 1,1 × t2) / (1 × 1,1 × t2)) × 100 = 81,8%
%Su 1>3 = ((3 × t1 − 1 × t3) / (1 × t3)) × 100 = ((3 × t1 − 1,21 × t1) / (1,21 × t1)) × 100 = 147,9%
Earliest microprocessor's architectural model
[Figure: bus interface unit with an n-bit data bus and a k(n)-bit address bus, instruction decoder feeding the control unit's kernel, control lines to the functional units, register file, and an ADD/SUB unit.]
MEMORY-STORAGE DEVICES
MSD
An Ideal Memory
1. Infinite capacity;
2. Infinite bandwidth (for rapidly streaming large data sets and programs);
3. Zero latency (to prevent the processor from stalling while waiting for data or program code);
4. Nonvolatility (to allow data and programs to survive even when the power supply is cut off);
5. Zero or very low implementation cost.
Examples of Memories Internal Organization
which may hold the contents of a whole row. SAM cells are used as a shift register and are usually made
with either standard SRAM cells or multiple-transistor DRAM cells.
It is possible to transfer data from a whole row of the memory array to the register in a single access cycle.
Because the register is based on fast, static cells, access to it is very fast, usually several times faster
than to the memory array. In most typical applications VRAM is used as screen buffer memory.
The parallel port is used by the host processor, and the serial port is used for sending pixel data to the
display. The DRAM and SAM ports can be independently accessed at any time except during
internal transfers between the two memories.
WRAM - window RAM
WRAM is a variation on dual-ported memory. WRAM is optimized for acceleration and can transfer
blocks and supports text and pattern fills. Its two ports allow input of graphic drawing and output of
screen refresh data to be processed simultaneously, resulting in much higher bandwidth than
conventional single-ported memory types.
Fast block copies and fills are called window operations. WRAM is accessed in blocks or windows,
which makes it slightly faster than VRAM. Now WRAMs are replaced by SGRAMs.
3D-RAM
3D-RAMs are used in video cards in tandem with a 3D graphics accelerator. The memory contains a
CDRAM (cached DRAM) array and an ALU block, which allows some image manipulation
operations to be performed.
Literature
1. Bruce Jacob, Spencer W. Ng, David T. Wang. Memory Systems: Cache, DRAM, Disk.
Elsevier, 2008.
2. Betty Prince. Emerging Memories. Technologies and Trends. Kluwer Academic Publishers,
2002.
MEMORY SYSTEM HIERARCHY
[Figure: memory system levels – CPU registers; main memory (DRAM); external memory (flash drive, HDD) (DAM*); network (remote storages).]
Each memory level of the hierarchy is usually a subset of the level below (lower level) - data
found in a level is also found in the level below. The storage devices get slower, larger and cheaper
as we move from higher to lower levels. At any given time, data is copied between only two
adjacent memory levels.
Access time – time to get the first word. Access time is composed of address setup time
and latching time (the time it takes to initiate a request for data and prepare
access).
An average (effective) access time (tavg) or AMAT (average memory access time) is equal to:
tavg = thit + mR × tpenalty (hit time plus miss rate times miss penalty).
Memory System Management Goals
1. Protection (protect processes from each other and protect the OS and the users);
2. Utilization (ensuring full use of memory);
3. Allocation (each process should get enough memory space);
4. Address mapping.
Terminology
1. Block (page) - maximum unit that may be present (usually has fixed length).
2. Hit – block is found in upper level.
3. Hit rate (ph) - number of hits / total number of memory accesses.
4. Miss – block not found in upper level.
5. Miss rate (mR) – fraction of references that miss.
6. Hit time (thit) – time to access the upper level.
7. Miss penalty (tmiss, tpenalty) - time to replace block in upper level,
(plus the time to deliver the block to the CPU).
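Using the terms just defined, the average (effective) access time can be sketched in C; the numbers are assumed for illustration:

    #include <stdio.h>

    int main(void)
    {
        double t_hit     = 2.0;    /* hit time, clock cycles (assumed)     */
        double miss_rate = 0.05;   /* mR, fraction of accesses that miss   */
        double t_penalty = 60.0;   /* miss penalty, clock cycles (assumed) */

        /* tavg (AMAT) = thit + mR * tpenalty */
        double t_avg = t_hit + miss_rate * t_penalty;

        printf("AMAT = %.2f clock cycles\n", t_avg);   /* 2 + 0.05*60 = 5.0 */
        return 0;
    }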
Inclusion
The inclusion property implies that all information items are originally stored in the outermost
(in the lowest) memory level.
Mi-1 ⊂ Mi ⊂ Mi+1
Coherence
A memory is coherent if the value returned by a read operation is always the same as the value
written by the most recent write operation to the same memory address.
The coherence property requires that all copies of the same information item at successive memory
levels be consistent. The concept is to balance the system by adjusting the sizes of the memory hierarchy
components.
Locality
Memory hierarchy exploits locality to create illusion of large and fast memory.
The memory hierarchy was developed based on a program behavior known as locality of references.
Cache Controller
A cache controller is a processor that manages cache content. A cache controller can be
implemented in:
a. a storage device controller or communication channel, as a special–purpose processor
controlling RAM.
b. the operating system, as a program that uses part of primary storage to implement the cache.
Secondary Storage Cache
Disk caching is common in modern systems, particularly in file and database servers.
The OS is the best source of file access information because it updates the information
dynamically as it services file access requests. Because the OS executes on the CPU, it is difficult
to implement access-based control if the cache controller is a special–purpose processor within a
disk controller.
Cache Pipelining
Cache pipelining is a technique in which memory loads the requested memory contents into a
small cache composed of SRAM, then immediately begins fetching the next memory contents.
This creates a two-stage pipeline, where data is read from or written to cache in one stage, and data
is read or written to memory in the other stage.
Level        1 (highest level)   2          3                  4 (lowest level)
Managed by   Compiler            Hardware   Operating system   Operating system / operator
The efficiency with which memory space is being used at any time is defined as the ratio of the
memory space occupied by active user programs and data to the total amount of memory space
available.
The “wasted” memory space can be attributed to several sources:
1. Empty regions – the blocks of instructions and data occupying memory at any time are
generally of different length (memory fragmentation);
2. Inactive regions – data may be transferred to main memory and may subsequently be transferred
back to external memory, without ever being referenced by a processor;
3. System regions – these regions are occupied by the memory-management software.
The memory consists of multiple memory modules. The memory module is the main building
block of the main memory system.
Each memory module is capable of performing one memory access at a time.
Each memory module contains a subset of the total physical address space.
Each memory module has two important parameters:
a. Module access time - the time required to retrieve a word into the
taccm memory module output buffer register.
b. Module cycle time - the minimum time between requests
tcycm directed at the same module.
A. In low-order interleaving the modules are organized such that consecutive memory
addresses lie in consecutive physical modules.
B. In high-order interleaving (banking) the consecutive addresses lie in the same physical
module. The high-order address bits select a physical block and the remaining bits select a
word within that block.
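A small C sketch of the two address decompositions, assuming a word-addressed memory with m = 4 modules; the constants are illustrative, not taken from the notes:

    #include <stdio.h>
    #include <stdint.h>

    #define MODULE_BITS 2u                      /* m = 4 modules (assumed)       */
    #define M           (1u << MODULE_BITS)

    /* Low-order interleaving: consecutive addresses hit consecutive modules. */
    static void low_order(uint32_t addr, uint32_t *module, uint32_t *offset)
    {
        *module = addr % M;                     /* low-order bits select module  */
        *offset = addr / M;                     /* remaining bits: word in module */
    }

    /* High-order interleaving (banking): consecutive addresses stay in one bank. */
    static void high_order(uint32_t addr, uint32_t addr_bits,
                           uint32_t *module, uint32_t *offset)
    {
        *module = addr >> (addr_bits - MODULE_BITS);          /* high bits select bank */
        *offset = addr & ((1u << (addr_bits - MODULE_BITS)) - 1u);
    }

    int main(void)
    {
        uint32_t mo, of;
        for (uint32_t a = 0; a < 8; a++) {
            low_order(a, &mo, &of);
            printf("addr %u -> low-order: module %u, word %u\n", a, mo, of);
        }
        high_order(0x00000C05u, 32, &mo, &of);
        printf("addr 0xC05 -> high-order: bank %u, word %u\n", mo, of);
        return 0;
    }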
Memory Modules
[Figure: memory modules m-1, m-2, … 0 grouped into banks (Bankj, Bankj+1) selected by the address.]
Within a memory bank only one memory module is able to begin or complete a memory operation
during any given bus cycle. The memory bank organization is meaningful only if the memory
cycle time is greater than the bus cycle time.
Address Space
Address space – the set of identifiers, that may be used by a program
to reference information.
qadr
Memory address space – the set of physical main memory locations
in which information items may be stored.
qmadr
System address space – The address space that contains memory (qmadr)
and I/O-spaces (qIOadr).
qsadr
Physical (real or direct) addressing space
qpadr
qsadr > qpadr
Virtual address is an address that corresponds to a location in virtual space and translated by
address mapping to a physical address. The virtual address space is divided into pages.
The maximum physical address size can be larger or smaller than the virtual address size.
If a physical address is larger than the virtual address, then it means that a microprocessor system
could have more physical memory than any one program could access. This could be useful for
running multiple programs simultaneously.
If a physical address is smaller than the virtual address, then a program cannot have all of its
virtual pages in main memory at the same time.
Memory Wall
Memory Bus Speed versus RAM Speed
The memory speed is a function of a memory bus (sometimes of a system bus speed) speed.
The increase in microprocessor performance places significant demands on the memory system.
Typical on-chip L2 cache performance degradation is now more than 2× the ideal cache.
For multiprocessors, inter-cache latencies increase this degradation to 3× or more for 4 processors
and up. A perfect memory system is one that can immediately supply any datum that the CPU
requests.
The ideal memory is not practically implementable as the three general factors of memory
capacity,
speed, and
cost
are in direct opposition.
Example
The memory wall problem arises because the difference between two competing trends, each of
which is an exponential function but with a different base, grows at an exponential rate.
Overcoming Memory Wall
An economical solution is a memory hierarchy organized into several levels, each smaller, faster
and more expensive per byte than the next.
The concept of memory hierarchy takes advantage of the principle of locality.
Traditional techniques
Memory-level parallelism (MLP) is the ability to perform multiple memory transactions at once,
for example a read and a write operation at once, or multiple read operations. Performing multiple
write operations at once is risky (address conflicts).
[Figure: CPU – interface – cache – interface – main memory.]
Suppose that it takes 1 clock cycle to transfer an address or a data word in the memory system. Each access to
main memory takes 20 clock cycles. One cache memory line contains four main memory words.
The total transfer time (Ttot) of one cache line's content is equal to:
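The formula itself is lost to the page break here. A plausible reconstruction, assuming a one-word-wide, non-interleaved memory where the address is sent once and each word is then accessed and transferred separately, is sketched below; this is an assumption, not necessarily the author's intended count:

    #include <stdio.h>

    int main(void)
    {
        int t_addr   = 1;    /* cycles to transfer the address          */
        int t_access = 20;   /* cycles for one main-memory access       */
        int t_data   = 1;    /* cycles to transfer one data word        */
        int words    = 4;    /* words per cache line                    */

        /* Address sent once, then four sequential accesses, each word
           returned over the bus in one cycle.                          */
        int t_tot = t_addr + words * t_access + words * t_data;

        printf("Ttot = %d clock cycles\n", t_tot);   /* 1 + 4*20 + 4*1 = 85 */
        return 0;
    }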
Though many techniques have been suggested to circumvent the memory wall problem, most of
them provide one-time boosts of either bandwidth or latency.
Examples
Wider Memory Organization (Alpha 21064)
[Figure: wider memory organization – CPU, interface, multiplexer (MUX), 4-way wide cache, interface, wide main memory.]
[Figure: interleaved memory organization – CPU, interface, cache, interface, main memory organized as four banks.]
Memory Capacity Planning
The optimization problem for a storage hierarchy concerns costs, and a selection among design
trade-offs. It is important to achieve as high a hit ratio as possible at memory level M1.
In optimizing the cost-effectiveness of a storage hierarchy design, two approaches have been used:
1. In the simpler approach, only the storage hierarchy itself is optimized and the CPU is
ignored.
The figure-of-merit //kvaliteedinäitaja// employed is the product of the mean access time
and the total cost of the storage system.
2. In the more complex approach, the storage hierarchy is treated as a component of computer
system, and the cost and performance of the CPU are included in the study.
∑ fi = 1, i = 1 … n
f1 >> f2 >> f3 >> … >> fn, where
ASSOCIATIVE MEMORY
Content Addressable Memory
Associative memories are referred also as:
1. Catalog memory;
2. Parallel search memory;
3. Data-addressed memory;
4. Content-addressable memory or CAM.
• Traditional memories have been organized conventionally as sets of words, each word
comprising a fixed number of binary digits.
• Associated with each word is an address, which corresponds in some manner with the physical
location in memory at which the word value is stored. Traditional memories store data
spatially.
• Access to a word in memory requires designation of an address in the form of binary number.
• In the associative memory (AM) a data word is not obtained by supplying an address that
specifies the location of that word in memory.
• An identifying descriptor (A) or key is provided to memory.
• The memory is then searched until an exact match is found between the submitted descriptor
(AS) or search key and a descriptor (AA) or key associated with a data word (DA).
• The AA descriptor may be part of each data word or the descriptor may be stored separately
(sometimes AA = DA).
• A true associative memory, in the sense of completely simultaneous interrogation of all bits in
all words of a block of memory, requires that each elementary memory cell be furnished with a
logic operation capability.
• Many implementations of content addressability can be viewed as hash addressing.
• Another approach is TRIE memory, which is a kind of hash coding but does not suffer from
collisions.
Hash coding (key transformation) is a process in which a search key is transformed, through the
use of a hash function, to an actual address for the associated data.
A hash function is a mathematical function for turning some kind of data into a small integer
that may serve as an index into an array. The values returned by a hash function are called hash
values. The data is usually converted into the index by taking a modulo.
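A toy C illustration of such a key transformation; the hash function and table size are arbitrary choices, not ones prescribed by the notes:

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_SIZE 1024u     /* number of slots in the data array (assumed) */

    /* Turn a string key into a small integer, then take it modulo the
       table size so that it can serve as an index into the array.       */
    static uint32_t hash_index(const char *key)
    {
        uint32_t h = 2166136261u;            /* FNV-1a style mixing       */
        while (*key) {
            h ^= (uint8_t)*key++;
            h *= 16777619u;
        }
        return h % TABLE_SIZE;               /* hash value -> array index */
    }

    int main(void)
    {
        printf("index(\"CAM\") = %u\n", hash_index("CAM"));
        return 0;
    }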
[Figure: hash transformation – the key from the key register is passed through the hash function (HF) to address the CAM cells.]
Age-addressable memory is a special-purpose associative memory which uses the date and time
when the data was written as a retrieval key.
CAM Operation
The entire word line in the CAM is w elements wide, where the first k bits are used for the tag and
d bits are used for the data (w = k + d).
The entire row or word can be viewed as a single entity; the key is compared only against the first k bits
within a word, essentially masking off the d data bits.
The operation of a CAM is like that of the tag portion of a fully associative cache.
But unlike CAMs, caches do not use priority encoders since only a single match occurs.
CAM is not a RAM where a particular address is permanently tied to a specific physical location.
CAM Architecture Model
[Data storage capacity is n × d bits]
[Figure: word lines n-1 … 0, the hit register marking the matching line j, and the data output.]
There are four basic operations which CAM performs:
1. Reads,
2. Writes,
3. Selection of a line,
4. Multiple response resolution.
Reads
During a read operation a key is first loaded into the search tag register and then, if needed, the mask
register is loaded with whatever is desired. Typically the mask register allows all bits of the search key
to be used to scan the directory, but it can be set to compare only certain bit positions in the key.
Selection
The key is compared to all the tags of the words in the tag area, and when a match occurs, it sets a
bit in the hit register. This indicates that the unmasked bits in the key were the same as their
corresponding bits for a particular tag.
Once all the comparing is completed and the hit register is stable, the cell in the hit register
containing a “1” indicates which line in the data storage area is to be output (line j).
Multiple responses
This possibility exists, and the hit register may indicate it. Several hits may occur, but only
one line of data can be output at a time, so some method of resolving this conflict must be
incorporated into the system.
This is the job of the priority logic (response processor), to choose a single line of data to output
based on the state of the hit register.
Writes
Writing to a CAM is not simply the converse of reading, as it is in a traditional RAM.
Once the decision has been made to write to the CAM, the tag write register is loaded with
the intended tag and the data write register is loaded with the data to be linked to the tag.
There is the question of how to choose the line in the CAM to be replaced with the new values for tag and
data, purging the old values from the CAM. This is known as a replacement strategy and
includes policies such as FIFO, LIFO and others.
CAMs Taxonomy
CAMs are divided into classes, which define how the search is done within the CAM.
Since the tags are words of a certain bit length, the classifications are:
1. Bit-serial-word-parallel,
2. Bit-parallel-word-serial,
3. Fully parallel (all-parallel).
Bit-serial-word parallel
A search is accomplished by comparing the MSB of each AS tag (key) to the MSB of the AA tag,
in all words simultaneously. Each bit of every tag is compared sequentially to the key.
Bit-parallel-word-serial
It is just opposite of the previous method. The entire key is compared to each tag in succession
until a match occurs. This is the easiest to envision since it is just a linear search.
Fully parallel
In this case, every tag of every word is compared to the entire key simultaneously. This is the
hardest to implement and offers the highest performance, but is the most costly method.
CAM Drawbacks
1. Functional and design complexity of the associative subsystems;
2. High cost for reasonable storage capacity
3. Relatively poor storage density compared to conventional memory;
4. Slow access time;
5. A lack of software to properly use the associative power of the memory system.
Using CAMs
VIRTUAL MEMORY SYSTEM
Virtual memory (VM) deals with the main memory size limitations. The size of virtual storage is
limited by
• the addressing scheme of the computer system,
• the amount of auxiliary storage available (not by the actual number of main storage locations).
Original motivation was to make a small physical memory look large and permit common software
on a wide product line. Before virtual memory, overlaying //ülekattuvus// was used. Virtual
memory automates the overlay process. Virtual memory organization principles are similar to the
cache memory system.
Virtual memory is automatic address translation that decouples a program’s name space
from its physical location and protects the name space from interference by other tasks.
Virtual Memory Main Principles
The working set //töökogum, -komplekt// is the group of physical memory pages currently
dedicated to a specific process.
The efficiency of the address translation process affects the performance of the virtual memory.
5. In simpler computer architectures (8-bit systems) bank switching.
There are two control registers in MMU – base register and bound register.
The base register //baasiregister// specifies memory location starting address.
The bound register //piiriregister// specifies size of address space.
The content of base and bound registers are changed by OS.
[Figure: the CPU issues a logical address (LA); the MMU with its TLB translates it to a physical address (PA) for main memory over the address (ADR), data and control (CNT) buses, and signals an exception to the CPU on a fault.]
• A logical address (LA) is the location of a word relative to the beginning of the program.
• A logical address consists of a page number and a relative address within page (offset).
• A physical address (PA) consists of a frame (physical memory page) number and
a relative address within frame (offset).
When a problem arises, as for page fault, the MMU signals an exception so the OS can intervene
and solve the problem.
The MMU contains a page table (PT). Each page table entry (PTE) gives the physical page number
corresponding to the virtual one. This is combined with the page offset to give the complete
physical address. Usually there is one page table per process. Page tables are usually so long that
they are stored in main memory, and sometimes are paged themselves.
A PTE may also include additional information about whether the page has been written to, when
it was last used (for a least recently used replacement algorithm), what kind of processes may read
and write it, and whether it should be cached.
1. Valid/Present (V/P) bit
Set if the page being pointed to is resident in memory;
2. Modified/Dirty (M/D) bit
Set if at least one word in page/segment has been modified;
3. Reference (A/R) bit
Set if page or segment has been referenced;
4. Protection (PR) bits
They are used to restrict accesses.
Usually a 2-level page table format is used (Page Directory (Catalog) + Page Table(s)).
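A schematic C sketch of a two-level translation using the PTE bits listed above; the field widths, names and the 4 KiB page size are assumptions for illustration, not a description of any particular MMU:

    #include <stdint.h>

    /* One page table entry with the control bits listed above (layout assumed). */
    typedef struct {
        uint32_t frame : 20;   /* physical frame number         */
        uint32_t valid : 1;    /* V/P - page resident in memory */
        uint32_t dirty : 1;    /* M/D - page has been modified  */
        uint32_t refd  : 1;    /* A/R - page has been referenced */
        uint32_t prot  : 2;    /* PR  - access rights           */
    } pte_t;

    /* 2-level format: page directory -> page table -> frame, 4 KiB pages. */
    uint32_t translate(pte_t **directory, uint32_t vaddr, int *fault)
    {
        uint32_t dir_idx = (vaddr >> 22) & 0x3FFu;    /* top 10 bits           */
        uint32_t tab_idx = (vaddr >> 12) & 0x3FFu;    /* next 10 bits          */
        uint32_t offset  =  vaddr        & 0xFFFu;    /* offset within page    */

        pte_t *table = directory[dir_idx];            /* page directory lookup */
        if (table == 0 || !table[tab_idx].valid) {    /* page not resident     */
            *fault = 1;                               /* -> page fault         */
            return 0;
        }
        *fault = 0;
        return ((uint32_t)table[tab_idx].frame << 12) | offset;
    }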
When segmentation is used with paging, a virtual address has three components:
A segment index (SI), a page index (PI) and a displacement (D).
In this case every memory address generated by a program goes through a two-stage
translation process:
Replacing a block:
Paging – trivial (all blocks are the same size);
Segmentation – hard (must find a contiguous, variable-size, unused portion of main memory).
Address Translation Styles in Virtual Memory
[Figure: the logical address (segment base address, page, offset) is combined with the page i base address; a range check signals an error.]
1. OS brings into main memory some pieces of the program (resident set).
Resident set – a portion of process that is in main memory.
2. An interrupt is generated when data (an address) is needed that is not in main memory.
3. OS places the process in a Blocking State.
4. New data is brought into main memory from external (secondary) memory (disk).
Another process is dispatched to run while the disk I/O takes place.
5. An interrupt is issued when the disk I/O completes; it causes the OS to place the affected process
in the Ready State.
6. Each process has its own page table. The entire page table may take up too much main
memory.
7. Page tables are also stored in virtual memory. When a process is running, part of its page
table is maintained in main memory.
Accelerating Address Translation
The first hardware cache used in a computer system did not cache the contents of main memory
but rather translations between virtual memory addresses and physical addresses.
This cache is known as Translation Lookaside Buffer (TLB) or Translation Buffer.
TLB: Transfer Lookaside Buffer or
Transfer Lookahead Buffer or
Address Cache Memory (ACM) or
Address Translation Cache (ATC).
The TLB is used in the memory control unit for fast physical address generation; it holds recently
used page or segment table entries.
The TLB is a high-speed look-up table which stores the most recently referenced page entries.
The first step of the translation is to use the virtual page number as a key to search through the
TLB for a match. In case of a match (a hit) in the TLB, the page frame number is retrieved from
the matched page entry.
In case the match cannot be found (a miss) in the TLB, a pointer is used to identify one of the page
tables where the desired page frame number can be retrieved.
Classical TLB is fully associative.
In a TLB entry the tag portion holds portions of the virtual memory address and the data portion
holds a physical page frame number (address), control bits (valid (V), use (A), dirty (M)) and
protection information.
[Figure: TLB lookup – entries hold valid (V), use (A) and dirty (M) bits; on a hit the translation goes to the MMU, on a miss the page tables are consulted or a page fault is signalled to the MMU.]
Newer processor architectures are supporting TLBs that allow each entry to be independently
configured to map variable-size superpages.
The most serious problems associated with general utilization of superpages are – the requirement
that superpages be used only to map regions of physical memory that are appropriately sized,
aligned and contiguous.
The use of superpages is restricted: they are used for mapping a small number of large non-paged
segments, such as the frame buffer and the non-pageable portions of the OS kernel.
The high order bits of the virtual address are used to access the TLB and the low order bits are
used as index into cache memory.
3. LIFO - last in first out;
4. Circular FIFO;
5. Random;
6. Belady`s optimal replacement policy
The theoretically optimal replacement algorithm works as follows:
when a page needs to be swapped in, the OS swaps out the page whose next use will occur
farthest in the future.
CACHE MEMORY SYSTEM
Thesis
A: Modern CPUs have one concern – keeping the processor pipelines filled with
instructions and data.
B: This is becoming an increasingly difficult task because:
The speed of CPU performance doubles every 18 – 24 months,
the speed of the DRAM chips that constitute main memory increases by only a few
(5 - 7) percent each year.
C: The high-speed cache memory that acts as a buffer between main memory and CPU is an
increasingly significant factor in overall performance.
D: Cache consistency //vahemälu konsistentsus, ~kooskõlalisus//
Cache consistency – in presence of a cache, reads and writes behave no differently than if the
cache were not there.
The cache consistency reflects the fact that a datum must have one value and only one value.
If two requests to the same datum at the same time return different values, then the correctness of
the memory system is in question. Cache consistency aspects are:
1. Consistency with backing store
2. Consistency with itself
3. Consistency with other clients
The principle of memory coherence indicates that the memory system
behaves rationally.
Memory consistency defines rational behavior; the consistency model indicates how long
and in what way the system is allowed to behave irrationally with respect to a given set of
references.
E: The property of locality which justifies the success of cache memory has two aspects:
1. Locality in time (temporal locality) – a hit that occurs on a word already in cache
that has been previously accessed.
2. Locality in space (spatial locality) – a hit that occurs on a word already in cache
that has not been previously accessed.
Spatial locality for code
Low ⇒ lots of jumps to far away places;
High ⇒ no branches or jumps at all.
Scache = Tm / ((1 − h) × Tm + h × Tc) = 1 / (1 − h × (1 − Tc/Tm)), where
Tm – main memory access time;
Tc – cache memory access time;
h – hit ratio,
a. Tc = Tm ⇒ Scache = 1.
b. Tc << Tm ⇒ Scache = 1 / (1 − h),
h↑ ⇒ Scache↑
h↓ ⇒ Scache↓
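A quick C check of the Scache formula; the access times Tm and Tc are assumed values:

    #include <stdio.h>

    /* Scache = Tm / ((1-h)*Tm + h*Tc) = 1 / (1 - h*(1 - Tc/Tm)) */
    static double s_cache(double tm, double tc, double h)
    {
        return tm / ((1.0 - h) * tm + h * tc);
    }

    int main(void)
    {
        double tm = 60.0, tc = 2.0;                 /* ns, assumed values */
        for (double h = 0.80; h <= 1.0001; h += 0.05)
            printf("h = %.2f  Scache = %.2f\n", h, s_cache(tm, tc, h));
        return 0;
    }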
Direct Mapped Cache
Since there are more memory frames than lines in cache, an individual line cannot be uniquely and
permanently dedicated to a particular frame. Therefore, each line includes a tag that identifies
which particular frame of main memory is currently occupying that line of cache.
o Address length (a): a = {(m+n)+w} bits;
o Number of addressable units in main memory: 2^a words or bytes;
o Block size (line size): 2^w words or bytes;
o Number of blocks in main memory: 2^a / 2^w = 2^(n+m);
o Number of blocks (lines, rows) in cache (r): r = 2^n;
The mapping function gives the correspondence between main memory frames and cache memory
lines. Each line is shared between several main memory frames.
Direct mapping cache maps each block of main memory into one possible cache line.
The mapping can be expressed as:
l = f mod r, where
l – cache line number,
f – main memory block(frame) number.
The tags contain the address information required to identify whether a word (data block) in the
cache line corresponds to the requested word (data block). The tag uses the upper portion of the
address.
No two blocks that map into the same line have the same tag number.
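The address split and the mapping l = f mod r can be sketched in C; the field widths w and n below are example values, not prescribed by the notes:

    #include <stdint.h>
    #include <stdio.h>

    #define W 2u    /* block size 2^w words  (assumed) */
    #define N 10u   /* r = 2^n cache lines   (assumed) */

    int main(void)
    {
        uint32_t addr = 0x00ABCDEFu;                  /* example (m+n+w)-bit address */

        uint32_t word  = addr & ((1u << W) - 1u);     /* word within the block  */
        uint32_t frame = addr >> W;                   /* main memory block f    */
        uint32_t line  = frame & ((1u << N) - 1u);    /* l = f mod r            */
        uint32_t tag   = frame >> N;                  /* remaining m bits: tag  */

        printf("frame %u -> line %u, tag %u, word %u\n", frame, line, tag, word);
        return 0;
    }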
Set Associative Cache
[Figure: two-way set-associative cache – the set field selects a set, the tags of ways M1 and M2 are compared (CMP) with the address tag, and the matching line's data is gated out to the CPU.]
Cache Address Formats
Global replacement allows a fully associative cache to choose the best possible victim every time,
limited only by the intelligence of the replacement algorithm.
Accessing a fully associative cache is impractical, as it requires a tag comparison with every tag-store
entry (this prohibitively increases the access latency and power consumption).
Dirty Data – when data is modified within cache but not modified in main memory, then
the data in the cache is called “dirty data”.
NB! The dirty bit is needed only in write-back caches.
Stale Data – when data is modified within main memory but not modified in the cache,
the data in the cache is called “stale data”.
a. In asynchronous cache
Data transfers are not tied to the system clock.
b. In synchronous cache
Data transfers are tied to the memory bus clock.
c. In pipelined burst cache
Cache has special circuitry that allows the four data transfers occur in a burst.
The second transfer begins before the first one is finished.
Transfer formula
Cache type (w1, w2, w3, w4)
Asynchronous 3-2-2-2
Synchronous 3-2-2-2
Pipelined burst 3-1-1-1
Higher associativity means lower miss rates; smaller associativity means lower cost and faster
access (hit) time, but a higher miss rate.
Higher levels of associativity (s ≥ 32) require hardware overhead that slows down the cache; it is
often the case that low levels of associativity with a larger capacity provide better overall
performance.
Cache Memory System's Historical Development
[Figure: cache memory system configurations A–H, built from the instruction unit (IU), execution unit (EU), MMU, system bus (SBUS), caches and main memory (MEM):
A – no cache;
B – off-chip unified L1 cache (CM L1);
C – on-chip L1 instruction cache (ICM L1);
D – on-chip L1 instruction cache plus off-chip unified L2 cache (IC/DC);
E – separate on-chip L1 instruction (ICM L1) and data (DCM L1) caches;
… – on-chip split L1 instruction (IC) and data (DC) caches with an off-chip L2 cache;
H – advanced MPS with on-chip multilevel (L1, L2, (L3)) instruction (IC) and data (DC) caches and off-chip multilevel separated or united instruction/data caches.]
Cache Memory Miss Types
Cache miss can occur during read and write operations.
2. Capacity miss
[misses due to size of cache] The cache cannot contain all the data blocks needed during execution
of a program. They will occur due to data blocks being discarded and later retrieved.
[Figure: miss rate versus cache capacity – conflict misses (direct mapped), capacity misses (fully associative) and compulsory misses.]
Reducing Cache Misses
1. Large block size (reduces compulsory and capacity misses);
2. Large cache size (reduces capacity and conflict misses);
3. Higher associativity (reduces conflict misses)
• Multi-ported cache is a cache implementation where the cache provides for more than one
Read or Write port for providing high bandwidth. Because of these multiple ports it results
in servicing multiple requests per cycle.
• Multi-banked cache is a cache implementation where the cache is implemented as a
banked structure for providing high bandwidth by providing the illusion of multiple ports.
It results in servicing multiple requests per cycle if there are no bank conflicts.
• Prefetching is a technique that predicts soon-to-be-used instructions or data and loads them into the cache before they are accessed.
Prefetch is used to avoid or reduce the time the processor waits for data to arrive in the registers. A commonly used method is sequential prefetching, where it is predicted that data or instructions immediately following those currently accessed will be needed in the near future, and these are prefetched. Prefetching, especially aggressive prefetching, may even reduce performance.
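The following is a minimal sketch of sequential (next-line) prefetching on a highly simplified direct-mapped cache model; the cache size, the access pattern and all names are assumptions made only to illustrate the idea.

#include <stdio.h>
#include <stdbool.h>

#define NUM_LINES 64          /* assumed direct-mapped cache with 64 lines */

static long tags[NUM_LINES];
static bool valid[NUM_LINES];

/* Bring one block into the cache (models a memory fetch). */
static void fill(long block)
{
    int line = (int)(block % NUM_LINES);
    tags[line]  = block;
    valid[line] = true;
}

/* Access a block; on a miss, also prefetch the next sequential block. */
static bool access_block(long block)
{
    int line = (int)(block % NUM_LINES);
    if (valid[line] && tags[line] == block)
        return true;               /* hit */
    fill(block);                   /* demand fetch */
    fill(block + 1);               /* sequential prefetch of the next block */
    return false;                  /* miss */
}

int main(void)
{
    int hits = 0, misses = 0;
    for (long b = 0; b < 16; b++) {   /* purely sequential access pattern */
        if (access_block(b)) hits++; else misses++;
    }
    printf("hits=%d misses=%d (every second block was prefetched)\n", hits, misses);
    return 0;
}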
Non-blocking cache is a cache that allows the processor to make references to the cache while the
cache is handling an earlier miss. Non-blocking cache is commonly used to hide the cache miss
latency by using out-of-order processors. The processor continues to operate until one of the
following events takes place:
1. Two cache misses are outstanding and a third load/store instruction appears;
2. A subsequent instruction requires data from one of the instructions that caused
a cache miss.
The write traffic into L2-level cache primarily depends on whether the L1-level cache is write-
through (store-through) or write-back (store-in or copy-back).
² Not needed for direct-mapped caches.
80
Write-through and Write-back Caches Advantages and Disadvantages
Ability to handle burst writes: write-through – the write buffer can overflow; write-back – usually OK, unless a miss with dirty victims occurs.
What should be done on a write operation that does not hit in the cache (write miss)?
There are two common options on a write miss:
1. Write-allocate policy - the cache line to which the write miss occurred is brought into the cache, and the block in main memory is also updated.
The method decreases read misses, but it requires additional bus bandwidth.
2. No-write-allocate policy - the write is propagated to main memory without changing the contents of the cache.
The method generates potentially more read misses, but it needs less bus bandwidth.
Write-back caches generally use the write-allocate policy, and write-through caches often use the no-write-allocate policy.
Write stall – occurs if the CPU has to wait for a write to complete during write-through.
To reduce write stalls, a write buffer is used, allowing the CPU to proceed while memory is being updated.
The write buffer holds data awaiting write-through to the lower-level memory, and writes are sent from it in bursts. To eliminate RAW hazards, the write buffer must either be drained before the next read, or it must be checked:
On reads, the CPU checks the cache and the write buffer.
On writes, the CPU stalls only when the write buffer is full.
81
In a write-through cache it helps with store misses; in a write-back cache it helps with dirty misses. It allows the read to be done first: the dirty block to be written is sent to the write buffer, the new data block is read from memory into the cache, and then the contents of the write buffer are sent to memory.
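A minimal sketch of a write buffer is given below, assuming a small FIFO of pending write-through stores; reads check the buffer first so that a RAW hazard is resolved by forwarding. The buffer depth, the toy memory and all names are illustrative assumptions.

#include <stdio.h>
#include <stdint.h>

#define WB_DEPTH 4                         /* assumed write-buffer depth */

typedef struct { uint32_t addr, data; } WBEntry;

static WBEntry  wb[WB_DEPTH];
static int      wb_count = 0;
static uint32_t memory[256];               /* toy main memory */

/* Drain one buffered write to memory (called when the bus is free). */
static void wb_drain_one(void)
{
    if (wb_count == 0) return;
    memory[wb[0].addr] = wb[0].data;
    for (int i = 1; i < wb_count; i++) wb[i - 1] = wb[i];
    wb_count--;
}

/* CPU write: stall (drain) only if the buffer is full, then buffer the write. */
static void cpu_write(uint32_t addr, uint32_t data)
{
    while (wb_count == WB_DEPTH) wb_drain_one();   /* write stall */
    wb[wb_count].addr = addr;
    wb[wb_count].data = data;
    wb_count++;
}

/* CPU read: check the write buffer first to resolve RAW hazards. */
static uint32_t cpu_read(uint32_t addr)
{
    for (int i = wb_count - 1; i >= 0; i--)        /* newest entry wins */
        if (wb[i].addr == addr) return wb[i].data;
    return memory[addr];
}

int main(void)
{
    cpu_write(10, 111);
    printf("read 10 -> %u (forwarded from the write buffer)\n", (unsigned)cpu_read(10));
    wb_drain_one();
    printf("read 10 -> %u (now from memory)\n", (unsigned)cpu_read(10));
    return 0;
}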
A. Victim Cache
Victim cache is a small, fully-associative cache inserted between cache memory and its refill path.
It reduces need to fetch blocks from higher-level memory by keeping recent victims, blocks
discarded from cache memory. Victim cache reduces conflict misses.
B. Pseudo-associative Cache
The pseudo-associative cache tries to obtain the miss rate of set-associative caches together with the hit speed of a direct-mapped cache. When a hit occurs, the cache works like a direct-mapped cache. On a miss, before going to the next lower level of the memory hierarchy, another cache entry is checked to see if the block matches there.
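Below is a minimal sketch of a pseudo-associative lookup, assuming the common scheme of re-probing the line whose most significant index bit is inverted; the array size, the stored tag representation and the demo accesses are illustrative assumptions.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 64u             /* assumed direct-mapped array of 64 lines */

static uint32_t tagstore[NUM_LINES];   /* stores the full block number */
static bool     valid[NUM_LINES];

/* Returns 0 = fast hit, 1 = slower hit in the alternate line, 2 = miss. */
static int pseudo_lookup(uint32_t block)
{
    uint32_t idx = block % NUM_LINES;

    if (valid[idx] && tagstore[idx] == block)
        return 0;                                   /* direct-mapped fast hit */

    uint32_t alt = idx ^ (NUM_LINES >> 1);          /* invert the MSB of the index */
    if (valid[alt] && tagstore[alt] == block)
        return 1;                                   /* pseudo-associative hit */

    valid[idx]    = true;                           /* refill on a real miss */
    tagstore[idx] = block;
    return 2;
}

int main(void)
{
    printf("first access : %d (miss)\n",     pseudo_lookup(5));
    printf("second access: %d (fast hit)\n", pseudo_lookup(5));
    return 0;
}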
Hit Ratio versus Cache Size, Cache Type, Associativity and Block Size
30 Percent Rule
82
Misses per 1000 Instructions for Instruction, Data and Unified Caches
Associativity (↑): decreases conflict misses (+); increases thit (-).
Block size (↑): decreases compulsory misses (+); increases or decreases capacity misses (±); increases conflict misses (-); no effect on thit (~).
Capacity (↑): decreases capacity misses (+); increases thit (-).
83
In region A of the figure, larger blocks improve the miss rate, which provides a performance gain
that overcomes the performance drop due to increased channel contention, causing overall
performance to rise.
In region B, the miss ratio continues to drop with larger blocks, but performance also deteriorates
due to increased contention. The performance point resides at the boundary between regions A and
B.
In region C, pollution causes the miss rate to begin to increase with larger blocks, causing a
sharper drop in performance than in region B. The pollution point resides at the boundary between
regions B and C.
Virtual Indexing
Virtual indexing has lower latency, because the physical address is available only after the TLB
has performed its translation. Most L1-level caches are virtually indexed.
84
(Figure: cache placement with respect to virtual and physical memory – CPU, cache and main memory connected by the address bus (AB) and data bus (DB); the page offset is unchanged by address translation.)
Virtually indexed and/or tagged caches keep aliased //rööpnimesus// virtual addresses coherent by
guaranteeing that only one of them will be in the cache at any given time.
It is guaranteed by OS that no virtual aliases are simultaneously resident in the cache.
Whenever a new entry is added to a virtually indexed cache, the processor searches for any virtual
aliases already resident and evicts them first. This special handling happens only during a cache
miss. No special work is necessary during a cache hit.
Virtual Tagging
A physically tagged cache does not need to keep virtual tags. It is useful to distinguish the
two functions of tags in an associative cache: they are used to determine which way of the entry set
to select, and they are used to determine if the cache hit or missed.
If the TLB translation is slower than the L1-level cache access, then a virtually tagged cache will
be faster than a physically tagged cache. It is the reason why some L1-level caches are virtually
tagged.
The advantage of virtual tags is that, for associative caches, they allow the tag match to proceed
before the virtual to physical translation is done.
85
Cache Memory Performance
The access latency of cache memories is not fixed; it depends on the delay and frequency of cache misses. The average cache memory access time (effective latency) tavg is equal to:
tavg = thit + miss rate × miss penalty
L2-Level Cache
L2-level cache contents are always a superset of the L1-level cache contents.
• We want large caches to reduce the frequency of the more costly misses.
• Large caches are too slow for the processor.
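Extending the effective-latency formula above to two cache levels can be checked with a short calculation; all of the hit times, miss rates and memory latency below are assumed example values, not figures from the text.

#include <stdio.h>

int main(void)
{
    /* Assumed example parameters (cycles and miss rates are illustrative). */
    double t_hit_L1 = 1.0,  miss_L1 = 0.05;
    double t_hit_L2 = 10.0, miss_L2 = 0.20;
    double t_mem    = 100.0;

    /* tavg = thit(L1) + mL1 * (thit(L2) + mL2 * tmem) */
    double t_avg = t_hit_L1 + miss_L1 * (t_hit_L2 + miss_L2 * t_mem);

    printf("average access time = %.2f cycles\n", t_avg);   /* 1 + 0.05*30 = 2.50 */
    return 0;
}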
86
Cache Coherency Mechanisms
Snooping – the process where individual caches monitor the address bus for accesses to memory locations that they have cached.
Snarfing – the process where individual caches monitor the address and data buses in an attempt to update their own copy of a memory location when it is modified in main memory.
Cache Memory versus Virtual Memory
Address mapping: cache memory maps a 25-45-bit physical address to a 14-20-bit cache address; virtual memory maps a 32-64-bit virtual address to a 25-45-bit physical address.
A larger main memory or cache size gives fewer capacity misses but a longer access time.
87
Summary
88
Summary
Access type    Cache reads    Cache writes    Main memory reads    Main memory writes
Read hit       1              0               0                    0
Read miss      1              n               n                    0
Write hit      0              1               0                    1
(n = line size in words)
89
INPUT-OUTPUT SYSTEM
Input/Output Bottleneck
The input/output bottleneck is the discrepancy between the high speed of the processing elements in the system and the lower speed of the input/output elements of the system.
The main problem is that I/O transfers potentially have side-effects, such as initiating a DMA
transfer, clearing status bits, or removing data from a receive queue. This forces I/O transfers to be
done:
Input/Output Units
Input/Output units are the computing system's means of communication with the world outside
of primary (system) memory.
Features:
Functions:
1. Information transferring.
2. Sharing the I/O resources.
3. Handling and recovering errors in I/O operations.
90
The I/O Devices Organization Ways
1. Program-controlled;
(Figure: polling – the CPU reads the status from the I/O device and waits in a gadfly loop, with or without special HW support, until the device is READY; the I/O operation is then performed. A polling-loop sketch is given after this list.)
2. Interrupt-driven;
(incl. I/O processors)
(Figure: interrupts – the I/O device is active and requests service from the CPU over the Int line; data moves over the data bus.)
3. DMA-managed.
(The DMA controller manages single block transfers.)
(Figure: DMA channel – the DMA controller (DMA CH) transfers data between the I/O device and RAM, interrupting (Int) the CPU when the transfer is done.)
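Referring back to method 1, a minimal polling (gadfly) loop sketch follows. The device registers are simulated as plain variables here; the register names and the READY bit are assumptions, and on real hardware they would be volatile pointers to memory-mapped I/O addresses.

#include <stdio.h>
#include <stdint.h>

/* Simulated device registers (assumed layout, for illustration only). */
static volatile uint8_t dev_status = 0;
static volatile uint8_t dev_data   = 0;
#define READY_BIT 0x01u

/* Busy-wait (gadfly loop) until the device reports READY, then read one byte. */
static uint8_t poll_read_byte(void)
{
    while ((dev_status & READY_BIT) == 0)
        ;                            /* no special HW support: just re-read status */
    dev_status &= (uint8_t)~READY_BIT;
    return dev_data;                 /* the I/O operation proper */
}

int main(void)
{
    dev_data    = 0x5A;              /* pretend the device produced a byte ... */
    dev_status |= READY_BIT;         /* ... and set its READY flag             */
    printf("polled byte: 0x%02X\n", poll_read_byte());
    return 0;
}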
91
I/O Performance Measurement
I/O system performance common metrics are:
1. Throughput (I/O bandwidth) ⇒ useful for file servicing and transaction processing ;
2. Response time (latency).
Response time = Controller time + Wait time + (Number_of_bytes / Bandwidth) + CPU time - Overlapping time
Little`s Law
The mean number of tasks in a stable queuing system (over some time interval) is equal to the
mean arrival rate (λ) of new tasks to that system, multiplied by their mean time spent in that
system (T), i.e. mean response time:
N=λ×T
NB! Arrival rate (λ) and the response time (T) must use the same time unit.
Arrival rate (λ) is measured in messages per second.
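A small numeric illustration of Little's law follows; the arrival rate and response time are assumed example values.

#include <stdio.h>

int main(void)
{
    double lambda = 50.0;          /* assumed arrival rate: 50 tasks per second    */
    double T      = 0.2;           /* assumed mean response time: 0.2 seconds/task */
    double N      = lambda * T;    /* mean number of tasks in the system           */

    printf("N = lambda * T = %.1f tasks in the system on average\n", N);   /* 10 */
    return 0;
}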
92
TQueue – average time per task in the queue;
TServer – average time to service a task;
There are two types of communication that are used during I/O operation with input-output
devices:
1. Control flow;
2. Data flow.
A. Control flow
Control flow can be broken down into commands which flow from the processor to the I/O
device (outbound control flow) and back (inbound control flow).
93
D. I/O software functions
1. Device independence;
2. Buffering;
3. Error handling;
4. Synchronous and asynchronous data transfers.
OS kernel => I/O subsystem => Device driver => Device controller => Device
94
Interfaces and Buses
Interface – a physical /conceptual/ structure for channeling and facilitating communications
(a boundary between systems or devices).
Taxonomy
1. By topology:
Magistral (Bus),
Radial (Star),
Daisy chain,
Mixed.
2. By transmission mode:
Serial,
Parallel,
Mixed.
3. By transmission direction:
Simplex,
Half-duplex,
Duplex.
4. By synchronization:
Asynchronous,
Isochronous,
Synchronous.
5. By functionality:
System interface,
I/O interface,
Peripheral interface,
Local and Distributed network.
95
Buses
Bus ⇒ a shared communication link between subsystems.
Bus System ⇒ a collection of wires and connections for data transactions among
processors, memory modules and peripheral devices.
The bus is used for only one transaction at a time between a source (master) and a destination (slave). In case of multiple requests the bus arbitration logic allocates (and de-allocates) the bus, servicing the requests one at a time. For this reason the digital bus is called a contention bus or a time-sharing bus.
Bus Types
• System bus or internal bus represents any bus within a microprocessor system. System
bus connects system components together.
• External bus is used to interface with the devices outside a microprocessor system.
• Processor-memory bus connects processor and main memory (no direct I/O interface).
• I/O bus connects I/O devices (no direct processor-memory interface).
• Backplane bus – processor, memory and I/O devices are connected to same bus.
It is an industry standard, but the processor-memory performance is compromised.
96
Data Transactions via a Bus-system
Simple bus
Address cycle
The master device asserts the request line and drives the address bus when the request is granted (RQ → BG).
Data cycle
The master and the selected slave device communicate for one or more cycles.
(Timing: address cycles A0, A1, A2 are each immediately followed by the corresponding data cycles D0, D1, D2 – one transaction at a time, latency time Td = 0.)
Pipelined bus
(Timing: on a pipelined bus the addresses of later transactions (A1, A2, …) are issued while earlier data (Dn, D0, D1, …) is still being transferred; data D0 becomes ready a latency Td after address A0.)
Split-phase bus
1. Bus bandwidth increased by allowing out-of-order completion of requests,
2. Low-latency device can respond sooner,
97
3. Requires completely separate arbitration of address bus and data bus,
4. Devices recognize the appropriate data cycle by appropriate bus transaction tags,
5. Bus transaction tag size limits the number of outstanding transactions.
(Timing: on a split-phase bus the data phases may complete out of order – here D1 is ready before D0, and D3 before D2; each transaction has its own latency Td0, Td1 = 0, Td2, ….)
Bus Options
Bus width: separate address and data lines vs. multiplexed address and data lines.
Data width: wider is faster vs. narrower is cheaper.
Transfer size: multiple words (less bus overhead) vs. single-word transfer.
Protocol: pipelined vs. serial.
Bus masters: multiple (requires bus arbitration) vs. single master.
Clocking: synchronous vs. asynchronous.
Bus Arbiter
Arbitration – the process of assigning control of the data bus to a requester.
The requester is a master, and the receiving end is called a slave.
Arbiter – a functional unit that accepts bus requests (RQ) from the requester unit
and grants control of the data bus (BG) to one requester at a time.
Bus timer – measures the time each data transfer takes on the data bus and terminates
the data bus cycle if a transfer takes too long.
98
Interrupts and Exceptions
The terms interrupt //katkestus// and exception //eriolek// are not used in a consistent fashion.
There is an operating system service routine, called interrupt handler (Interrupt Service Routine
(ISR)), to process each possible interrupt. Each interrupt handler is a separate program stored in a
separate part of primary memory.
Processor Interrupt Sequence
(Figure: processor interrupt sequence, from the interrupt request at t0 to the start of the interrupt service routine (ISR) at t1 – recognize the interrupt, wait for the current instruction to complete, store internal information to the external stack, and get the interrupt vector; the individual steps take 2, 2–200, 12–200 and 3 clocks.)
99
Response deadline – the maximum time that an interrupt handler (ISR) can take between
when a request is issued and when the device must be serviced.
Interrupt density – the maximum number of the interrupts which can be activated at the
same time.
100
B. Absolute Priority System
(Figure: absolute priority system – IP2 handling is stopped when the higher-priority request IP4 arrives; IP4 is handled, after which IP2 handling resumes and IP3 is then started.)
Hardware interrupts
1. Normal interrupts (IRQ);
2. Non-maskable interrupts (NMI);
3. Fast interrupts (FIRQ)
101
Exception type                           Synchronous or   User request   Within or between   Resume or
                                         asynchronous     or coerced     instructions        terminate
I/O device request                       Asynchronous     Coerced        Between             Resume
Invoke operating system                  Synchronous      User request   Between             Resume
Tracing instruction execution            Synchronous      User request   Between             Resume
Breakpoint                               Synchronous      User request   Between             Resume
Arithmetic overflow or underflow         Synchronous      Coerced        Within              Resume
Page fault or misaligned memory access   Synchronous      Coerced        Within              Resume
Memory protection violation              Synchronous      Coerced        Within              Resume
Hardware malfunction                     Asynchronous     Coerced        Within              Terminate
Power failure                            Asynchronous     Coerced        Within              Terminate
Vectored interrupts
If the hardware devices are smart enough, then the CPU responds to each interrupt request with an
interrupt acknowledge sequence.
102
During the acknowledge cycle the CPU indicates that it is responding to an interrupt request at a
particular priority level, and then waits for the interrupting device to identify itself by placing its
device number on the system bus.
Upon receiving this device number, the CPU uses it to index into an interrupt vector table in
memory. The entries in the interrupt vector table are generally not the interrupt service routines
themselves, but the starting address of the service routine.
The interrupting device's number is used as an offset into the interrupt vector table.
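A minimal sketch of such a dispatch follows: the device number indexes a table whose entries are the starting addresses of the service routines. The table size, device numbers and handler names are assumptions for illustration.

#include <stdio.h>

typedef void (*isr_t)(void);                   /* an ISR is just a function */

static void timer_isr(void)    { printf("timer ISR\n"); }
static void uart_isr(void)     { printf("uart ISR\n");  }
static void spurious_isr(void) { printf("spurious\n");  }

/* Interrupt vector table: entries are ISR starting addresses, not the ISRs
   themselves, indexed by the device number returned during the acknowledge. */
static isr_t vector_table[8] = {
    spurious_isr, timer_isr, uart_isr, spurious_isr,
    spurious_isr, spurious_isr, spurious_isr, spurious_isr
};

static void dispatch(unsigned device_number)
{
    vector_table[device_number & 7u]();        /* offset into the vector table */
}

int main(void)
{
    dispatch(1);   /* device 1 identified itself on the bus -> timer ISR */
    dispatch(2);   /* device 2 -> uart ISR                               */
    return 0;
}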
Auto-vectored interrupts
If “dumb” devices are used in the system, then a variation of the vectored interrupt scheme can be
used – auto-vectoring.
In a system with auto-vectored interrupts, a device that is not capable of providing an interrupt
vector number via the system bus, simply requests a given priority level interrupt, while activating
another signal to trigger the auto-vectoring process.
The CPU internally generates a device number based on the interrupt priority level that was
requested.
Vectored and auto-vectored interrupts can be used in the same system concurrently.
103
Recognition of the interrupt source
Serial polling
A. Software polling
(Figure: software polling – the CPU and peripheral devices PD1–PD3 share the system bus (SBUS); the devices' interrupt requests I1–I3 are combined into a single Int line, and the CPU polls the devices to find the requester.)
B. Hardware polling
(Figure: hardware polling – a hardware scanner (counter CT2, decoder DC, clock generator GEN and gating logic) polls the peripheral devices PD1–PD3 in turn; when the requesting device is found, its address code is returned to the CPU over the system bus (SBUS).)
104
Parallel arbitration
(Figure: parallel arbitration – the peripheral devices PD1–PD3 place their priority codes E0…E(n-1) on the wired-OR data bus lines D0…D(n-1); the interrupt line Int is also wired-OR, and each device's arbitration logic (latches LEi and gating) decides whether that device may continue.)
Comments:
Tarb = 4·tn
Tarb ≠ f(the number of PDs)
Tarb < Tclk(CPU)

Ei   Di   Ai   Activity
0    0    1    continue
1    1    1    continue
0    1    0    fail i
1    0    -    impossible!
105
Direct Memory Access (DMA)
DMA – a direct rapid link between a peripheral and a computer’s main memory,
which avoids accessing routines for each item of data read.
Link – a communications path or channel between two components or devices.
Q-total = Q-page × n
(Figure: DMA controller between the system bus (SBUS) and the peripheral device PD – source and destination registers (RGS, RGD), end-address register (RGAe), current-address register (RGAc) with an incrementer (+1), comparator (CMP), synchronization (SYNC) and control unit, and a data buffer (Q-buff); the controller uses physical or virtual addresses and signals the CPU over INT/INTA.)
If the processor wishes to read or write a block of I/O data, it issues a command to the DMA unit and then continues with other work. The DMA unit transfers the entire block of data, one word at a time, directly to or from memory, without going through the processor.
When the transfer is complete, the DMA unit sends an interrupt signal to the processor.
The processor is involved only at the beginning and at the end of the transfer.
106
The DMA unit can interrupt the processor's instruction cycle only at certain points.
(Figure: instruction cycle timeline showing the DMA breakpoints and the interrupt breakpoint.)
A DMA breakpoint is not an interrupt: the processor does not save a context and do something else; it merely pauses for one bus cycle.
Sometimes this requires performing atomic updates to the shared critical sections (regions).
Critical section – it is the part of the program where shared data is accessed.
Some processors also support disabling DMA operations by using locked bus cycles.
The processor could execute lock instruction to disable external bus grants (BG).
When critical region updates have been completed, the unlock instruction is used
to allow bus grants.
When the DMA runs in register mode, the DMA controller uses the values contained in the DMA channel's registers.
In descriptor mode, the DMA controller looks in memory for its configuration values.
107
A. In register-based DMA, the processor programs the DMA control registers to initiate a transfer. Register-based DMA gives better DMA controller performance, because the registers don't need to keep being reloaded from descriptors in memory. Register-based DMA consists of two sub-modes:
1. Autobuffer mode (autobuffer DMA);
2. Stop mode.
In autobuffer mode, when one block transfer completes, the control registers are automatically reloaded to their original setup values and the same DMA process restarts, with zero overhead. Autobuffer DMA especially suits performance-sensitive applications.
Stop mode works identically to autobuffer DMA, except that the registers don't reload after the DMA completes the entire block transfer, so the DMA transfer takes place only once.
B. The descriptor contains all of the same parameters normally programmed into the DMA control register set. Descriptors allow the chaining together of multiple DMA sequences.
In descriptor-based DMA operation, a DMA channel can be programmed to automatically set up and start another DMA transfer after the current sequence completes.
The descriptor-based mode provides the most flexibility in managing a system's DMA transfers.
Solutions:
a. Route all DMA I/O accesses through the cache;
b. Disallow caching of I/O data;
c. Use hardware cache coherence mechanisms: hardware at the cache invalidates or updates the data as the DMA operation is done (expensive!).
Virtual DMA
Virtual DMA allows the use of virtual addresses that are mapped to physical addresses during the DMA operation. Virtual DMA can specify large data transfers that cross page boundaries.
The DMA controller performs the address translation internally. For this it contains a small translation buffer, whose contents are initialized by the OS when it requests an I/O transfer.
108
Connection DMA Channel to the System
The main task is to balance the system's hardware and software. The DMA logic can be organized in a system in two ways:
1. Each I/O device has its own interface with DMA capability (fig. A);
2. DMA controller can be separate from one or more devices for which it performs
block transfers. This separate DMA controller is called channel (fig. B).
Channel Types
1. If a channel can do block transfers for several devices, but only one at a time,
then it is a selector channel;
2. If it can control several block transfers at once, it is a multiplexer channel.
3. Channels can be made more complex and capable.
In the limit, the channel becomes a special purpose I/O processor (IOP), which can fetch
operations from memory and execute them in sequence.
(Figures A and B, referenced above: the two ways of organizing the DMA logic.)
109
DMA-channel Connection Schemes
(Figure: DMA-channel connection schemes – the CPU, main memory and the DMA channel / I/O channel share the system bus; the peripheral device PD is attached through the DMA channel.)
Connection variations:
1. Burst mode transfer,
2. Shared mode transfer (cycle stealing)
110
Input-output Operations and Cached Data Consistency
(Figure: three ways, A–C, of connecting the peripheral device PD for input and output – in case A the cache and main memory stay COHERENT; in cases B and C the cache and main memory become INCOHERENT.)
111
INPUT-OUTPUT SYSTEM CONTROLLERS
(PROCESSORS)
1. Recognition of addresses;
2. Resolving the differences in characteristics.
Evolution Stages
112
I/O processors are also called peripheral processors or sometimes
front-end processors (FEP)
(graphics PR - 82786, DMA PR - 82259, I/O RISC-PR - i960)
6. Transputers
High-performance pipelined parallel I/O and network processors.
(IMS T222, T800, T9000).
113
I/O Controller
(Figures: I/O controller (IOC) placement between the MPS core's system bus (S-bus) and the I/O bus – variant A: the CPU, main memory (MEM) and IOC sit on the S-bus and the peripheral device PD on the I/O bus, with data traffic CPU – IOC – PD; in the variant with DMA capability the data traffic goes MEM – DMA – PD.)
(Figure: CPU–IOP organization – main memory holds the CPU programs, the IOP programs and a communication area; the IOP drives the I/O bus and exchanges Attention, I/O Done and Bus Request signals with the CPU.)
The communication area in main memory is used for passing information between CPU and IOP
in the form of messages.
Typically the CPU has three I/O-oriented instructions (each of which will be a CW) to handle the IOP:
TEST I/O, START I/O, STOP I/O.
The instruction set of an IOP consists of data transfer instructions (READ, WRITE), address
manipulation instructions and IOP program control instructions.
Coprocessors
Accelerator – a separate architectural substructure that is architected using a different set of objectives than the base processor, where these objectives are derived from the needs of a special class of applications.
The accelerator is tuned to provide higher performance at lower cost, or at lower power consumption.
115
Coprocessor – a secondary processor that is used to speedup operations by taking over
a specific part of the main processor’s work.
RISC or post-RISC micros with integrated (CPU + FPU + MMU + Cache) on one chip.
(Pentium; 68040; 68060; Super SPARC, M10000; Alpha)
Graphics Processor**
Graphics – the creation and manipulation of picture images in the computer.
116
higher-level description of its components.
Translating vertices into different coordinate systems;
Geometric calculations;
Texture mapping (texel ⇒ texture pixel);
Programmable shading to determine the final surface properties of an object or image.
Features:
1. Pix-bit operations;
2. Universal software;
3. GP is the universal device;
4. Universal information transferring path:
CPU <···> GP < = > Display Unit
117
Specific features:
a. Microprogrammed control;
b. Different text displaying modes;
c. Window overlapping;
d. Generating high quality moving objects;
e. Effective resolution is 1024×2048 or more pixels;
(Figure: graphics processor organization – a graphics processing unit and a memory interface unit connected by an internal bus to the system bus; a memory controller serves the video memory (DRAM/VRAM) and the frame buffer (VRAM); a frame controller, a synchronization unit and a RAMDAC form the display controller.)
Graphic Processor Units (GPU) have become an essential part of every computer system
available today.
GPU has been evolving faster than the CPU, at least doubling performance every six months.
GPUs are stream processors that can operate in parallel by running a single kernel on many records
in a stream at once.
118
(Figure: memory bandwidth and computational performance of the CPU versus the GPU.)
The vertex processor (vertex shader) affects only vertices, which are less relevant to overall performance than the pixel shader.
The vertex shader program processes and alters the attributes of each vertex, on a vertex-by-vertex basis, before they are passed by the vertex processing hardware to the next step in the rendering process.
The vertex shader is used to transform the attributes of vertices such as color, texture, position and
direction from the original color space to the display space.
Vertex processor allows the original objects to be reshaped in any manner.
The output of a vertex shader, along with texture maps, goes to an interpolation stage.
Pixel processor (pixel shader) is devoted exclusively to pixel shader programs. Pixel shaders do
calculations regarding pixels. They are used for all sorts of graphical effects.
As vertex and pixel shader code structures are functionally similar but have dedicated roles, it is possible to create unified shaders.
119
Texture mapping units (TMU) work in conjunction with the vertex and pixel shaders. The TMUs' job is to apply texture operations to pixels.
The process of rasterization takes the geometry processed by the vertex processor and converts it into screen pixels to be processed by the pixel shader //pikslivarjusti// or pixel fragment hardware (fragment processor).
Rasterization is the conversion of geometry into screen pixel fragments, generated by walking over the geometry lists and analyzing them to see where they lie on the screen.
Each pixel can be treated independently from all other pixels.
The actual color of each pixel can be taken directly from the lighting calculations, but for added
realism, images called textures are often draped over the geometry to give the illusion of detail.
GPUs store these textures in high-speed memory (texture buffer).
Rasterization units (ROP) are responsible for writing pixel data to memory.
Processed pixel fragments are stored in frame buffer. The content of frame buffer is converted into
binary representation and directed to the monitor unit.
The Frame Buffer Controller (FBCNT) interfaces to the physical memory used to hold the actual
pixel values displayed on the display unit screen.
The Frame Buffer Memory is often used to store graphics commands, textures and other attributes
associated with each pixel.
Graphics Pipeline
The GPU uses a hardwired implementation of the graphics pipeline. Within a graphics processor, all pipeline stages work in parallel.
There are different pipelines within a graphics processor. Today the term “pipeline” no longer accurately describes the newer graphics processor architectures. The newer graphics processors have a fragmented structure – pixel processors are no longer attached to single TMUs.
(Figure: graphics pipeline model with a texture buffer.)
120
Whereas CPUs are optimized for low latency, GPUs are optimized for high throughput.
Example
There are various stages, which all work in parallel, in the typical pipeline of a GPU:
121
(Figure: graphics pipeline – from the peripheral bus (AGP) to the rendered 3D image.)
122
Transputer**
TRANSistor + comPUTER ⇒ TRANSPUTER
123
19. The Occam language (a trademark of the INMOS Group of Companies) implements a simple syntax for parallel processes and for communication between them. The architecture of the Transputer is defined by reference to Occam. Multiple processes can be implemented on one Transputer or on multiple Transputers; the communication method at the software level is the same.
20. The Occam language was developed to program concurrent, distributed systems.
21. Occam enables a system to be described as a collection of concurrent processes which
communicate with each other and with peripheral devices through channels.
22. Concurrent processes do not communicate via shared variables, and the Occam is a suitable
language for programming systems where there is no memory shared between processors
in the system.
23. Concurrent programs can be expressed with channels, inputs, and outputs combined in
parallel and alternative constructs.
24. Each Occam channel provides a communication path between two concurrent processes.
Communication is synchronized and takes place when both the inputting and outputting
processes are ready.
25. Data to be output is then copied from the outputting process to the inputting process, and
both processes continue. Data is sent over one of the two wires forming the link.
A transmitted byte is framed by service bits.
26. The Transputer does have an assembly language, but its use is discouraged.
27. A process can be active or inactive:
a. An active process can be executing or awaiting execution;
b. An inactive process may be waiting for I/O resources (ready to input, ready to
output), or waiting for a particular time to execute.
28. Interrupts or exceptions are termed events in the Transputer.
29. The most powerful Transputer was the T9000 (1991) [200 MIPS or 25 MFLOPS at 50 MHz]. The T9000's hardware supports the virtual link mechanism, which makes it possible to use one physical link to conduct exchanges between any number of process pairs taking place in different Transputers.
30. The Transputers are used for image processing, pattern recognition, artificial intelligence,
systems simulation, supercomputers, and Transputer networks.
The nearest modern equivalent to Transputer-like technology is the HyperTransport (2001) processor interconnection fabric designed by AMD.
124
Literature
V. Korneev, K. Kiselev. Modern Microprocessors. Third Edition. Charles River Media, Inc., Hingham, Massachusetts, 2004 [e-book].
125
MICROPROCESSOR SYSTEM'S PERFORMANCE
The CPU time is the time the CPU is computing, not including the time waiting for I/O or running
other programs. The CPU time can be divided into the CPU time spent in the program (user CPU
time), and the CPU time spent in the operating system performing tasks requested by the program
(system CPU time).
The CPU performance refers to user CPU time, while the system performance refers to elapsed time
on an unloaded system. System's performance is limited by the slowest part of the path between
CPU and I/O devices. The system's performance can be limited by the speed of:
1. The CPU,
2. The cache memory system,
3. The main memory and I/O bus,
4. The I/O controller (I/O channel),
5. The I/O device,
6. The speed of the I/O software,
7. The efficiency of the used software.
If the system is not balanced, the high performance of some components may be lost
due to the low performance of one link in the chain!
CPU Performance
The CPU performance is determined by several factors as for:
1. Instruction cache miss cycles - DIC,
2. Data cache miss cycles - DDC,
3. Bad branch prediction penalty cycles - DBR,
4. Scheduling unit stall cycles - DSU,
5. Loss of potential instructions due to fetch inefficiency - LFH.
It is mainly caused by instruction cache misses and by branch instructions disturbing the control flow.
In multiple-issue processors the instruction buffer is then empty or only partly filled with instructions.
The instructions per cycle (IPC) is related to the size of the issued instruction block (SIB)
and utilization percentage (UT) //rakendatuse määr// by:
126
The average UT is approximately 64% (where DIC~2,1%, DDC~1,8%, DBR~8,6%, DSU~5,3%,
and LFH~18,6%).
But there are the bottlenecks – the memory delay (TMem) and input-output system delay (TI/O).
127
Ssys = 1 / (fCPU+Mem / SCPU+Mem + fI/O / SI/O), where
fCPU+Mem ─ fraction of the total execution time during which an instruction is executing in the CPU and the main memory, including the time for any overhead in these subsystems;
fI/O ─ fraction of the total execution time during which I/O takes place, including the time for I/O overhead (fI/O = 1 ─ fCPU+Mem).
Overlapping
Overlapping is the phenomenon of concurrent processing.
In a microprocessor system there can be overlap between the CPU and I/O unit activities.
Within the CPU there can be overlap between instructions fetching and execution.
There are two tasks, A and B. Both tasks take 10 seconds to run. Task A needs very little I/O, so its I/O time can be neglected. Task B keeps the I/O devices busy for 4 seconds; this time is completely overlapped with CPU activity. The old CPU is replaced by a newer model with 5× the performance. The replacement has the following effect:
(Figure: CPU time of task A with the OLD and NEW CPU – the CPU time drops from 10 seconds to 2 seconds; time axis 0…10 seconds.)
128
System balance is the ability of the system to maximize processor productivity.
To maximize processor performance, the exchange of data must be fast enough to prevent the
compute processors from sitting idle, waiting for data.
The faster the processors, the greater the bandwidth required and the lower the latency that can be
tolerated.
Balanced systems take into account the needs of the applications and match memory, I/O and interconnect performance with computing power to maximize processor utilization.
Tt = TCPU+TIO-TOL, where
TCPU - CPU is busy,
TIO - I/O system is busy,
TOL - overlap time.
Balanced System
(Figure: balanced system example – OLD: CPU time Ta + Tb with the I/O time partly overlapped (OL); NEW: the same tasks after the CPU replacement; time axis 0…8.)
129
TtCPU = TCPU/SCPU + TIO - TOL/SCPU, where
SCPU - the CPU speedup.
TCPU = Ta+Tb,
(Ta/Tb)OLD = (Ta/Tb)NEW
TtCPU = 6/3 + 4 - 2/3 = 5,34 sec. (TOL is now 2/3s, where Ta=1,34sec and Tb=0,66sec)
Example
The task takes 60 seconds to run. The CPU is busy 40 seconds and the I/O system is busy 30
seconds. How much time will the task take if the CPU is replaced with one that has two times the
performance?
Tto=60 sec
TCPU=40 sec
TIO=30 sec
SCPU=2
Ttn=?
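One possible solution is sketched below, assuming (as in the previous example) that the overlap time scales with the CPU speedup; the result is only as reliable as that assumption.

#include <stdio.h>

int main(void)
{
    /* Given values of the exercise. */
    double Tto = 60.0, Tcpu = 40.0, Tio = 30.0, Scpu = 2.0;

    double Tol = Tcpu + Tio - Tto;                 /* 10 s of overlap          */
    double Ttn = Tcpu / Scpu + Tio - Tol / Scpu;   /* formula from the text    */

    printf("Ttn = %.0f sec\n", Ttn);               /* prints 45                */
    return 0;
}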
130
At a system level bandwidths and capacities should be in balance. Each functional unit in the MPS
is capable of demanding bandwidth and supplying bandwidth.
131
INSTRUCTION FETCH AND EXECUTION
PRINCIPLES
Von Neumann conceived a program as comprising two orthogonal sets of operations that worked
in conjunction:
1. A dataflow which did the physical manipulation of the data values,
2. A control flow, which dynamically determined the sequence in which these manipulations
were done.
The principle of orthogonality specifies that each instruction should perform a unique task without duplicating or overlapping the functionality of other instructions.
Instruction — a coded program step that tells the processor what to do for a single operation in a
program.
Instruction cycle - the process of fetching and executing an instruction.
Instruction cycle time (TICY).
Instruction types:
1. Data movement instructions;
2. Data processing instructions;
3. Branch instructions;
4. Environmental instructions.
Instruction set – the collection of instructions that a CPU can process.
132
Example
Categories of Instruction Operators
Arithmetic/logical Integer arithmetic and logical operations: add, subtract, and, or, multiply, divide
133
An n-stage-deep pipelined functional unit has the same degree of parallelism as n functional units, since both of them need n independent operations to use the resources fully.
To deliver the same performance, a pipelined processor must run at n times the clock speed of the parallel processor.
Besides the degree of parallelism, the degree of specialization of the available functional units is also important.
The more specialized the resources are, the harder it is to use them all efficiently.
Specialization explains why the same degree of parallelism is harder to use in a pipelined processor than in a non-pipelined processor: while an n-stage pipelined unit has the same degree of parallelism as n parallel units, the hardware for each pipeline stage is specialized, whereas each of the n parallel units is capable of performing the entire function.
A. If a processor has a higher degree of parallelism, it needs a large number of independent operations to keep its resources fully utilized.
B. If it also has a higher degree of specialization, not all resources may be fully occupied even if there are many independent operations; the processor is slowed down by its busiest type of resource.
Instruction-level parallelism (ILP) refers to the degree to which (on average) the instructions of a program can be executed in parallel.
Instruction-level parallelism exists when instructions in a (program) sequence are
independent and thus can be executed in parallel by overlapping.
Machine parallelism is a measure of the ability of the processor to take advantage of instruction-
level parallelism.
Machine parallelism is determined by the number of instructions that can be fetched and executed
at the same time and by the speed that the processor uses to find independent instructions.
Instruction Fetching
The instruction fetch unit is an important resource. Increasing the width of the instruction fetch and
decode unit does not necessarily translate to a similar increase in the effective instruction
bandwidth. This is due to low dynamic run lengths in programs.
Dynamic run length is the number of consecutive instructions executed up to and including the
next taken branch.
Since in each cycle the processor only fetches the continuous instructions in parallel, the
effective fetch bandwidth is limited by the dynamic run lengths in the execution.
134
The front-end includes the fetch unit, the decode unit, the rename unit and the support structures.
The aim of a high-performance front-end is to keep the later stages of the processing pipeline busy
by providing them with a sufficient number of instructions every cycle.
To improve fetch throughput, the mechanism must fetch a large number of instructions that are not
consecutive in the static program representation. This can be accomplished in several ways:
1. Rearranging the static code so that basic blocks are consecutive in the static program. The
rearrangement may be done statically or dynamically.
2. Using hardware that can read multiple cache lines simultaneously.
3. Observing the dynamic execution order as the program executes and caching instructions in
their dynamic execution order.
The main problem during instruction fetch cycle is control transfer formed by branch, call, jump,
return instructions or by interrupts.
A. There is a problem with target instruction addresses that are not aligned (misaligned) to the cache line addresses.
Misalignment occurs when a jump address is in the middle of a cache block. If a fetch group extends beyond the end of a cache block, then another cache block must be read.
A self-aligned instruction cache //isejoondav käsuvahemälu// reads and concatenates two consecutive lines within one cycle so that it can always return the full fetch bandwidth.
135
B. The control-transfer problem is alleviated by storing branch target addresses in a branch target buffer (BTB) or in a branch target address cache (BTAC) and by an immediate reload of the program counter after a BTAC match.
(Figure: branch target buffer – each entry holds the branch instruction address (BIA), the branch target address (BTA) and branch history; the fetch address sent to the I-cache is compared against the BIA field.)
BTB is a small cache memory accessed during the instruction fetch stage using the instruction
fetch address. Each entry of the BTB contains two fields:
a. Branch instruction address (BIA) ;
b. Branch target address (BTA).
When a static branch instruction is executed for the first time, an entry in the BTB is allocated for it. The BTB is accessed concurrently with the I-cache.
When the current program counter matches the BIA of an entry in the BTB, a BTB hit results. This implies that the current instruction being fetched from the I-cache has been executed before and is a branch instruction. When a hit in the BTB occurs, the BTA field of the hit entry is accessed and can be used as the next instruction fetch address.
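A minimal BTB sketch with BIA/BTA fields follows, assuming a small direct-indexed table, 4-byte instructions and illustrative names and sizes.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 64u

/* One BTB entry: branch instruction address (BIA) and branch target address (BTA). */
typedef struct { uint32_t bia, bta; bool valid; } btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Allocate/update an entry when a branch is executed. */
static void btb_update(uint32_t pc, uint32_t target)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    e->bia = pc; e->bta = target; e->valid = true;
}

/* Looked up in parallel with the I-cache: a hit means "this is a branch we
   have seen before"; the BTA is then used as the next fetch address. */
static bool btb_lookup(uint32_t pc, uint32_t *next_fetch)
{
    const btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->bia == pc) { *next_fetch = e->bta; return true; }
    *next_fetch = pc + 4;                    /* fall-through fetch */
    return false;
}

int main(void)
{
    uint32_t next;
    btb_lookup(0x1000, &next);  printf("miss: next fetch 0x%X\n", (unsigned)next);
    btb_update(0x1000, 0x2000);              /* branch executed for the first time */
    btb_lookup(0x1000, &next);  printf("hit : next fetch 0x%X\n", (unsigned)next);
    return 0;
}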
There are two main methods for branch prediction: static prediction and dynamic prediction.
Static branch prediction always predicts the same direction for the same branch during the whole program execution.
Dynamic branch prediction uses special hardware for the prediction.
The simplest form of static prediction is to design the fetch hardware to be biased for not taken: when a branch instruction is encountered, prior to its resolution, the fetch stage continues fetching down the fall-through path without stalling.
This is not a very effective method!
The most common branch condition speculation technique is based on the history of
previous branch execution.
History-based branch prediction makes a prediction of the branch direction, whether taken or not
taken, based on previously observed branch directions.
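A minimal sketch of history-based dynamic prediction follows, using a table of 2-bit saturating counters (one common branch-history-table organization); the table size, indexing and the test branch behaviour are illustrative assumptions.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define BHT_SIZE 256u

/* 2-bit saturating counters: 0,1 -> predict not taken; 2,3 -> predict taken. */
static uint8_t bht[BHT_SIZE];

static bool predict(uint32_t pc)
{
    return bht[(pc >> 2) % BHT_SIZE] >= 2;
}

static void update(uint32_t pc, bool taken)
{
    uint8_t *c = &bht[(pc >> 2) % BHT_SIZE];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void)
{
    uint32_t pc = 0x4000;
    int correct = 0;
    /* A loop-closing branch: taken 9 times, then not taken once. */
    for (int i = 0; i < 10; i++) {
        bool taken = (i < 9);
        if (predict(pc) == taken) correct++;
        update(pc, taken);
    }
    printf("correct predictions: %d of 10\n", correct);
    return 0;
}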
136
Branch prediction is effective only if the branch is predictable.
Hardware techniques
a. Speculative execution of instructions;
b. Register renaming;
c. Out-of-order issue with instruction lookahead.
137
Software techniques (usually incorporated into the compilers as optimizations)
a. Trace scheduling;
b. Loop unrolling;
c. Software pipelining;
d. Optimal register allocation algorithms;
e. Static branch prediction.
Technique Reduces
138
INSTRUCTION PIPELINING
(Figure: unpipelined versus pipelined execution in a synchronous system – instructions I1…I3 pass through stages S1–S3.)
Latency => instruction total execution time
Throughput [instructions per clock cycle]
1/Throughput [clock cycles per instruction => CPI]
Machine cycle – the time required between moving an instruction one step down the pipeline.
The length of a machine cycle is determined by the time required for the slowest pipeline stage.
The motivation for an n-stage pipeline is to achieve an n-fold increase in throughput.
139
The ideal pipeline is based on three idealized assumptions (pipelining idealism):
1. Uniform subcomputations
2. Identical computations
3. Independent computations
Pipelines Taxonomy
1. Arithmetic pipeline
Instruction pipeline
4. Synchronous pipeline
(tstage = const)
Asynchronous pipeline
(tstage = var.)
5. Scalar pipeline
Vector pipeline
140
Pipeline Models
(Figure: pipeline stage model – the functional unit's output is captured by a latch built from gates D1–D4 driven by CLK and /CLK.)
tstage = fixed
tstage = tFU + tlatch
tclock = φ(tstage)
141
Linear and non-linear (folded-back) pipelines
A folded-back pipeline is a way of avoiding synchronization overhead by using the same pipeline
stage for more than one step in the processing. This strategy will only partially succeed because
partitioning is imperfect, and even a perfect partition may have variance due to data-dependency.
(Figure: linear and non-linear (folded-back) pipeline structures.)
1. Increased performance
2. Power saving
3. No clock problems
Reservation Table
In a reservation table, each row corresponds to a stage in the pipeline and each column corresponds
to a pipeline cycle.
The intersection of the ith row (stage) and jth column (clock) indicates that the stage i would be
busy performing a subtask at a cycle j; where the cycle 1 corresponds to the initiation of the task in
the pipeline. That is, stage i is reserved (not available for any other task) at cycle j.
142
Linear pipeline S1 → S2 → S3 → S4 and its reservation table:

Stage \ Clock    1    2    3    4
S1               X
S2                    X
S3                         X
S4                              X
The reservation table shows that each stage completes its task in one clock cycle time, and hence,
an instruction cycle requires four clock cycles to be completed. The number of clock cycles that
elapse between two initiations is referred as latency.
The problem is to properly schedule queued tasks awaiting initiation in order to avoid collisions
and to achieve high (maximum) throughput.
Collision ⇒ if two or more initations attempt to use the same pipeline stage at the same time.
(Reservation table of a non-linear pipeline S1 → S2 → S3 → S4 over six clock cycles: stage S1 is used in one cycle, while stages S2, S3 and S4 are each used in two cycles.)
The last two stages in the pipeline are used twice by each operand. Stage 3 (S3) produces a partial
result and passes it on to stage 4 (S4). While stage 4 is processing the partial result, stage 3
produces the remaining part of the result and passes it to stage 4. Stage 4 passes the complete result
to stage2 (S2). Because stages 3 and 4 are used in subsequent cycles by the same set of operands,
the operand input rate cannot be as fast as one per cycle.
143
Pipelining Efficiency
We assume that:
Ti ≈ tST
tST = n × tcy
Total un-pipelined processing time (Ttotun ) is:
Ttotun = Ni × Ti = Ni × tcy × n
Total pipelined processing time (Ttotp ) is:
Ttotp = tST + (Ni - 1) × tcy = tcy × (Ni + n - 1)
(after the pipeline is filled, it produces a result for one instruction in each tcy)
SNi = Ttotun / Ttotp = (Ni × n) / (Ni + n - 1)
lim SNi = n as Ni → ∞
In the limit, the pipeline of n stages will be n times as fast as the corresponding
non-pipelined unit.
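The speedup expression above can be checked numerically; the pipeline depth and instruction counts below are assumed example values.

#include <stdio.h>

int main(void)
{
    double n = 5.0;                      /* pipeline stages (assumed)      */
    double Ni[] = {5, 50, 500, 5000};    /* numbers of instructions        */

    for (int i = 0; i < 4; i++) {
        double s = (Ni[i] * n) / (Ni[i] + n - 1.0);   /* SNi = Ttotun/Ttotp */
        printf("Ni=%6.0f  speedup=%.2f\n", Ni[i], s);
    }
    return 0;                            /* speedup approaches n = 5       */
}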
144
Pipeline Clocking
(Figure: pipeline clock waveform showing the cycle time tc, clock pulse width tw, register delay td, maximum stage delay Pmax and setup time tg.)
tc – cycle time
tw – clock pulse width
td – register output delay after data and clock are enable
tg – register data setup time
Pmax – maximum delay in the FEU (without clock overhead C)
C – clock overhead
C = tg+td
Stage cycle time — the difference between the time that an instruction enters a pipeline
stage and the time that it leaves the stage.
Computation time - the difference between the time that an instruction enters a computation
block and the time that all computed outputs are stable.
Idle time — the time that an instruction waits (at stage i), after computation has been
completed, before it moves to the next stage (i+1).
Synchronization time - the time used to synchronize two adjacent stages which, for the
clock case, includes clock skew and latching time.
145
Stage clocking circuit
(Figure: stage clocking circuit – AND-OR latch structures on either side of the functional element unit (FEU), driven by Clock and /Clock; the t(max) and t(min) path delays and the clock HI/LOW phases determine when data may move from one stage to the next.)
Clock cycle overhead time is a portion of the cycle used for clock skew, jitter //taktimpulsi
värin//, latching //lukustusviide// and other pipeline overheads.
If we assume, that the overhead per clock cycle is constant in a given circuit technology, then we
can increase the processor frequency by reducing the useful time per clock cycle.
146
Generic Instructions Pipeline
IF
DC
OF
EX
WB
IF – Instruction Fetch
DC – Instruction Decode
OF – Operand Fetch
EX – Execute
WB (OS) – Write Back (Operand Store)
147
Amdahl`s Law Applied to the Pipelined Processor
Spip = 1 / ((1 − q) + q / n), where
Spip – speedup of the pipeline
q – fraction of the time when the pipeline is filled
(1-q) – the fraction of time when the pipeline is stalled
n – the number of the pipeline stages
It is assumed that whenever the pipeline is stalled, there is only one instruction in the pipeline or it
becomes a sequential nonpipelined processor.
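A quick numeric illustration of the formula follows; the values of q and n are assumptions chosen only to show the shape of the result.

#include <stdio.h>

int main(void)
{
    double n = 8.0;     /* pipeline depth (assumed)                    */
    double q = 0.9;     /* fraction of time the pipeline is filled     */
    double s = 1.0 / ((1.0 - q) + q / n);   /* Amdahl's law applied to the pipeline */

    printf("Spip = %.2f (the ideal value would be %.0f)\n", s, n);   /* ~4.71 */
    return 0;
}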
Deep Pipelines
Processor performance can monotonically increase with increased pipeline depth, but due to
unpredictable nature of code and data streams, the pipeline cannot always be filled correctly and
the flushing of the pipeline exposes the latency.
These flushes are inevitable, and pipeline exposures decrease IPC as the pipeline depth increases.
The branch misprediction latency is the single largest contributor to performance degradation as
pipelines are stretched.
The overall performance degrades when the decrease in IPC outweighs the increase in
frequency.
The higher performance cores, implemented with longer (deeper) pipelines, will put more pressure
on the memory system and require large on-chip caches.
Increasing the pipeline depth divides the computation among more cycles, allowing the time for
each cycle to be less.
Ideally, doubling the pipeline depth (n) would allow twice the frequency.
In reality, some amount of delay overhead (Toverhead) is added with each new pipeline stage, and this overhead limits the frequency improvement obtainable by increasing the pipeline depth.
f = 1 / Tcycle = 1 / (Tlogic / n + Toverhead) = n / (Tlogic + n × Toverhead), where
Tcycle = tCY
Tlogic – delay in stage logic circuits
The equation shows how frequency is improved by dividing the logic delay of the instruction
among a deeper pipeline.
148
Increasing the pipeline depth increases the number of pipeline stalls per instruction and reduces
the average instructions per cycle.
Example
Old pipeline: nold = 2 stages (IF, IE), tCYold = 4, fold = 0,25.
New pipeline: nnew = 6 stages (F, D, OF, EX1, EX2, WB), tCYnew = 1, fnew = 1.
nnew = nold + 4, fnew / fold = 4.
Ttotold = tCYold × (Ni + nold - 1) = 4 × (Ni + 2 - 1) = 4Ni + 4
Ttotnew = tCYnew × (Ni + nnew - 1) = 1 × (Ni + 6 - 1) = Ni + 5
Speedup = Ttotold / Ttotnew = (4Ni + 4) / (Ni + 5) = (4 + 4/Ni) / (1 + 5/Ni)
Speedup → 4 as Ni → ∞
149
In the figures the relative metric is tied to the Pentium 4 microprocessor's technical characteristics.
150
NB! It is assumed that the instructions are processed with no branches.
151
The real improvement depends upon how much the circuit design minimizes the delay overhead per pipestage and how much microarchitectural improvements offset the reduction in IPC.
(Figure: cost C, productivity P and the cost/productivity ratio C/P as functions of the number of pipeline stages n; the C/P curve has a minimum at the optimum depth n-opt.)
The larger the number of pipeline stages, the greater the potential for speedup. Deep pipelines,
which implement a fine-grained decomposition of a task, only perform well on long, uninterrupted
sequences of the task iteration.
The optimum pipeline depth {the number of pipeline stages} (nopt) for microprocessors can be
expressed as:
nopt = sqrt( (NI × tp) / (α × γ × NH × to) ), where
NI – the number of instructions;
NH – the number of hazards (each hazard causes a full pipeline stall);
tp – the total delay of the pipeline (processor);
to – the latch (between pipeline stages) overhead for the technology being used;
α – an average degree of superscalar processing (whenever the execution unit is busy).
The α varies with the workload running at the processor.
For scalar processors α = 1 and for superscalar processors α > 1;
γ – the weighted average of the fraction of the pipeline stalled by hazards.
0≤γ≤1
152
Some observations
1. The optimum pipeline depth increases for workloads with few hazards (NH↓; nopt↑).
2. As technology reduces the latch overhead (to), relative to the total logic path (tp), the optimum
pipeline depth increases (to↓;nopt↑).
3. As the latch overhead increases relative to the total logic delay, the optimum pipeline depth
decreases (tp/to↓; nopt↓).
4. As the degree of superscalar processing (α) increases, the optimum pipeline depth decreases
(α↑; nopt↓).
5. As the fraction of the pipeline that hazards stall (γ) decreases, the optimum pipeline depth
increases (γ↓;nopt↑).
o In the deep–and–narrow model additional pipeline stages can reduce the amount of work
performed at each stage and can increase the clock speed of the processor.
The deep-and-narrow model is well suited to linear code with few branches that performs a lot of repetitive operations.
“Branchy” code will perform poorly, regardless of clock speed.
153
Base-, Super- and Superscalar Pipelines
(Figure: base pipeline – instructions I1…I5 pass through F, D, EX, W; one instruction is issued per cycle.)
(Figure: superpipeline of degree m = 4 – one instruction is issued per minor cycle; simple instruction latency measured in cycles is m.)
(Figure: superscalar pipeline of degree n = 4 – n instructions can be issued per cycle; here 4 instructions are issued per cycle and simple instruction latency measured in cycles is 1.)
Correct operation of a pipeline requires that operation performed by a stage must not depend on
the operation(s) performed by other stage(s).
Dependencies are a property of programs. Presence of dependence indicates potential for hazard
(risk), but actual hazard and length of any stall is a property of the pipeline.
Hazard – the situation that prevents the next instruction in the instruction stream from
executing during its designated clock cycle.
Classes of Hazards
1. Structural hazards (resource conflicts)
The hardware cannot support all possible combinations of instructions in simultaneous
overlapped execution.
Multiple copies of the same resource (replication).
Instructions prefetch (forming instructions queue)
Starvation – the result of conservative allocation of resources in which a single process is
prevented from execution because it’s kept waiting for resources that never
become available.
Critical region ─ the parts of a program that must complete execution before other processes can have access to the resources being used.
Deadlock – a problem occurring when the resources needed by some processes to finish execution are held by other processes which, in turn, are waiting for other resources to become available.
Deadlock conditions:
1. Mutual exclusion //vastastikune välistamine// - only one process is allowed to have access
to a dedicated resource.
2. Resource holding //ressursihõive// - it’s an opposed to resource sharing.
3. No preemption //puudub väljasaalimise// - the lack of temporary reallocation of resources.
4. Circular wait //ringootus// - a process involved in the impasse is waiting for another to
voluntarily release the resource.
All four conditions are required for the deadlock to occur, and as long as all four conditions are present the deadlock will continue; but if one condition can be removed, the deadlock will be resolved.
155
3. Data hazards
Hazards arise when an instruction depends on the results of a previous instruction in a
way that is exposed by the overlapping of instructions in the pipeline.
Data hazards due to register operands can be determined at the decode stage.
Data hazards due to memory operands can be determined only after computing
the effective address.
(Example: ADD writes R1 in its W stage while the following SUB needs R1 earlier in the pipeline – a RAW hazard on R1.)
156
WAR: RR(a) ∩ WR(b) ≠ Ø;
RAW: WR(a) ∩ RR(b) ≠ Ø;
WAW: WR(a) ∩ WR(b) ≠ Ø.
Example
Execution of instructions I1, I2 and I3 on an out-of-order CPU that has two execution units:
I1: R1 = R3 / R2 F D E E E E W
RAW
I2: R4 = R1 * R1 F D D D D E E W
WAW WAR
I3: R1 = R5 + R30 F D E W
Clock
Cycles
157
Generally, the CPI for a pipelined processor is equal to:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control hazard stalls
Data speculation
Instructions or threads whose operands are not yet available are executed with predicted data.
What can be speculated on?
a. Register operand;
b. Memory address;
c. Memory operand.
A. Hardware operand forwarding allows the result of one ALU operation to be available to another ALU operation in the cycle that immediately follows.
To reduce pipeline stalls caused by data hazards, register forwarding (bypassing) is used to handle RAW hazards.
I1: R1 := R2 + R3
I2: R4 := R1 + R5
158
(Diagrams: without forwarding, I2 stalls one extra cycle in its decode stage waiting for I1's result; with forwarding, I2 proceeds without a stall.)
(Table fragment: modified instruction sequences – Store R2, (R3) with Move R2, R4; Load (R3), R2 with Move R2, R4; Store R4, (R3) giving M[R3] ← R4 and R4 ← R2.)
159
Solution Strategies for Pipeline Hazards Caused by Control (Procedural)
Dependences
1. Branch prediction //hargnemise prognoosimine//;
2. Delayed branches //viidatud siire//; Control
3. Traces (threads) //lõimed//; speculation
4. Multipath techniques //mitmiklõime meetodid//.
Branch Prediction
Branch prediction – a method whereby a processor guesses the outcome of a branch
instruction so that it can prepare in advance to carry out the
instructions that follow the predicted outcome.
Static branch prediction is a prediction that uses information that was gathered before the
execution of the program.
Common static strategies used to predict whether a branch will be taken or not are: predict never taken, predict always taken, and predict by opcode. The overall probability that a branch is taken is about 60-70%, but the probability for a backward branch is ~90% and for a forward branch ~50%.
Dynamic branch prediction uses information about taken or not-taken branches gathered at run-
time to predict the outcome of a branch.
Branch history tables (BHT) are widely used in dynamic prediction strategy.
Branch penalty (BP) can be evaluated as:
160
Delayed branch is a type of branch where the instruction following the branch is always executed, independent of whether the branch condition is true or false.
The delay slot may be formed also after a load instruction (Load Delay Slot).
I1 F D E W Load instruction
I2 F D E E W
1. The original instruction sequence
I1 F D E W Load instruction
Ix F D E W
Data is ready
I2 F D E W
2. Modified instruction sequence
Processors with very deep pipelines could have two or more delay slots.
The more delay slots that exist after a conditional branch instruction, the more difficult it will be
for the compiler to find useful, independent instructions with which to fill them.
Assume single cycle (tcy = 1) execution for all instructions except loads, stores and branches.
The average number of cycles per instruction (CPIave) is given by:
T. R. Gross and J. L. Hennessy developed an algorithm for optimizing delayed branches and have
shown that the first branch delay slot can be filled with useful instructions more than half the time,
while the subsequent delay slots are increasingly harder to fill.
161
Handling Conditional Branches in Modern Processors
The conditional branches are worse. Early pipelined CPUs stalled until it was known whether the
branch would be taken or not.
There are different possibilities to handle conditional branches in modern processors.
162
Reservation station is a buffer within a functional unit that holds the operands and operation.
As soon as the buffer contains all its operands and the proper functional unit is ready to execute, the result is calculated.
When the result is completed, it is sent to any RSs waiting for this particular result as well as to
the commit unit (CMU). Commit unit buffers the result until it is safe to put the result into the
register file or for store into memory.
The special buffer in the CMU, the reorder buffer (ROB) //ümberjärjestamise puhver//, is used
to supply operands.
[Figure: floating-point section with an FP operations queue, FP register file (FPRGF), reorder buffer (ROB) and reservation stations (RS) in front of FPU1 and FPU2.]
C. Speculative Execution
163
Speculative execution requires:
Checking mechanisms to see if the prediction was correct;
Recovery mechanisms to cancel the effects of instructions that were issued under
false assumptions (e.g., branch mispredictions);
Restart mechanisms to re-establish the correct instruction sequence.
Speculation may be applied to control (branches), to addresses and to data values.
1. Each speculative instruction is tagged with a speculative bit which is carried through all
pipeline stages;
2. The speculative instruction is stalled in the decode stage if a second branch is encountered
before the previous speculative branch resolves;
3. When the speculative branch resolves, then if:
164
Data Speculation
Data dependencies are a major limitation to the amount of ILP that processors can achieve.
Data value speculation can eliminate the ordering imposed by data dependencies.
Data dependencies cause a serialization on the execution of program instructions.
Data value speculation refers to the mechanisms that deal with data dependencies by predicting the
values that flow through them and speculatively executing the instructions that consume the
predicted values.
Data dependence speculation refers to the techniques that are based on predicting dependencies
among instructions and executing speculatively the code by enforcing the predicted dependences.
Data value speculation is based on the observation that inputs and outputs of many instructions
sometimes follow a predictable pattern.
Load value prediction is more powerful than load address prediction, since with address prediction
the memory access still has to be performed to obtain the value, even if the memory address is
correctly predicted.
Speculatively issued loads must be verified.
This is done by issuing them to the address computation unit when their source operands are
available. The actual address is compared to the predicted one and in the case of a misprediction,
the predicted load and those instructions dependent on it are re-executed.
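A minimal sketch (not from the notes) of the last-value idea behind load value prediction; the table organization, confidence threshold and field names are illustrative assumptions.

```python
# Last-value load prediction with a small confidence counter per load PC.
class LoadValuePredictor:
    def __init__(self):
        self.table = {}                       # load PC -> (last value, confidence 0..3)

    def predict(self, pc):
        last, conf = self.table.get(pc, (None, 0))
        return last if conf >= 2 else None    # predict only when confident enough

    def update(self, pc, actual):
        last, conf = self.table.get(pc, (None, 0))
        if actual == last:
            conf = min(3, conf + 1)           # value repeats: gain confidence
        else:
            conf = max(0, conf - 1)           # value changed: lose confidence
        self.table[pc] = (actual, conf)       # always remember the latest value

# A load that keeps returning the same value becomes predictable after a few runs;
# a speculatively used prediction must still be verified against the real access.
lvp = LoadValuePredictor()
for value in (7, 7, 7, 7):
    print(lvp.predict(0x100), end=" ")
    lvp.update(0x100, value)
```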
165
[Figure: instruction flow — in-order issue to the reservation stations (RS), out-of-order issue to the execution units (EU), in-order issue to the commit unit (CMU).]
Example (clock cycles 1-15):
In-order:     I1 I2 ─ ─ I1 I2 I3 I4 ─ I3 I5 I4 I6 I5 I6
Out-of-order: I1 I2 I2 I3 I1 I4 I3 I5 I5 I4 I6 I6
166
Memory Hazards and OOO
Memory hazards can occur during store and load operations. Store (address) buffers are used to make
sure memory operations don’t violate hazard conditions. Store addresses are buffered in a FIFO queue.
Store address buffers contain the addresses of all pending store operations.
Store addresses remain buffered until:
1. Their data is available;
2. The Store instruction is ready to be committed.
[Figure: out-of-order core — PC/Fetch → Decode & Rename → Reorder Buffer (ROB) → Commit, with the register file, functional units FU1...FUn, a storage buffer and memory (MEM).]
167
Appendix
[Figure: pipelined datapath with hazard handling — hazard detection unit, forwarding unit and IF Flush logic; pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB; control fields (EX, M, WB) carried along the pipeline; instruction memory (Icache), register file (RGF), ALU with input multiplexers, and data memory (Dcache).]
IF/ID - pipeline register between the IF (instruction fetch) and ID (instruction decode and register file read) stages.
ID/EX - pipeline register between ID and EX (execution or address calculation) stages.
EX/MEM - pipeline register between EX and MEM (data memory stages).
MEM/WB - pipeline register between MEM and WB (write back) stages.
168
Pipeline Interrupts
Precise interrupt – it is required that the system state, when the interrupt (exception)
occurs, is the same as that in a nonpipelined CPU that executes
instructions in sequential order.
In that case an interrupt (exception) occurring during the execution of instruction Ij is precise if
the following conditions are met:
The most direct is to make all interrupts (exceptions) precise by forcing all instructions
to complete in the order in which they are issued.
Interrupts (exceptions) in pipelined processor that are not associated with exact instruction that was
the cause of the interrupt (exception) are called imprecise interrupts (exceptions).
When delayed branching is used, the instructions in the branch-delay slot are not sequentially
related. If an interrupt occurs during the execution of an instruction in the branch-delay slot and
the branch is taken, the instructions in the branch-delay slot and the branch-target instruction
must be restarted after the interrupt is processed.
In the case of OOO completion we must provide a mechanism to recover the precise state or
context of the processor at the time of the interrupt.
A small register set, the history buffer (HB), is used to temporarily store the initial state of
every register that is overwritten by each executing instruction Ij.
If an interrupt occurs during Ij’s execution, the corresponding precise CPU state can be
recovered from the values stored in HB, even if a second conflicting interrupt is generated
by a still-completing instruction.
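A minimal sketch (not from the notes) of precise-state recovery with a history buffer; the class and register names are illustrative assumptions.

```python
# Before an instruction overwrites a register, the old value is saved in the HB;
# on an interrupt the saved values are restored youngest-first.
class HistoryBuffer:
    def __init__(self, regfile):
        self.regfile = regfile
        self.entries = []                        # (destination register, old value)

    def execute(self, dest, new_value):
        self.entries.append((dest, self.regfile[dest]))   # save the old value first
        self.regfile[dest] = new_value

    def retire_oldest(self):
        self.entries.pop(0)                      # instruction completed in order: discard

    def recover(self):
        while self.entries:                      # undo in reverse order
            dest, old = self.entries.pop()
            self.regfile[dest] = old

regs = {"R1": 10, "R2": 20}
hb = HistoryBuffer(regs)
hb.execute("R1", 99)
hb.execute("R2", 77)
hb.recover()                                     # interrupt: roll back to R1=10, R2=20
print(regs)
```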
169
Example
CPU parameters: TCPU = IC × CPI / F (here IC = 1000 instructions and F = 10^9 Hz); the weights
0,2 / 0,1 / 0,2 / 0,1 / 0,4 are the frequencies of branch (B), control (C), load (L), store (S)
and arithmetic (A) instructions.

Pipelined, 4-stage pipeline (F, D, E, W) — in every clock cycle only one instruction is retired
and each instruction takes one clock cycle to execute:
TCPU = (0,2 CPIB + 0,1 CPIC + 0,2 CPIL + 0,1 CPIS + 0,4 CPIA) × 1000 / 10^9
     = (0,2 × 1 + 0,1 × 1 + 0,2 × 1 + 0,1 × 1 + 0,4 × 1) × 10^-6 = 1 × 10^-6 s

Pipelined, 3-way superscalar — in a given clock cycle one of the following instruction groups can
be retired: a. 3 arithmetic instructions; b. 3 branch instructions; c. 2 control instructions;
d. 1 load instruction; e. 1 store instruction:
TCPU = (0,2 CPIB + 0,1 CPIC + 0,2 CPIL + 0,1 CPIS + 0,4 CPIA) × 1000 / 10^9
     = (0,2 × (1/3) + 0,1 × (1/2) + 0,2 × 1 + 0,1 × 1 + 0,4 × (1/3)) × 10^-6 ≈ 0,5 × 10^-6 s
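The example above can be checked numerically. A minimal sketch (not from the notes), assuming IC = 1000 instructions and F = 10^9 Hz as implied by the × 1000 / 10^9 factor:

```python
# Reproduce the two TCPU figures of the example.
IC = 1000           # instruction count
F  = 1e9            # clock frequency, Hz
mix = {"branch": 0.2, "control": 0.1, "load": 0.2, "store": 0.1, "arith": 0.4}

def t_cpu(cpi_per_class):
    """TCPU = IC * CPI_avg / F, with CPI_avg a frequency-weighted sum."""
    cpi_avg = sum(mix[c] * cpi_per_class[c] for c in mix)
    return IC * cpi_avg / F

# 4-stage pipeline: every class retires one instruction per cycle (CPI = 1).
scalar = {c: 1.0 for c in mix}
# 3-way superscalar: up to 3 arithmetic, 3 branch, 2 control, 1 load, 1 store per cycle.
superscalar = {"branch": 1/3, "control": 1/2, "load": 1.0, "store": 1.0, "arith": 1/3}

print(t_cpu(scalar))        # 1e-06 s
print(t_cpu(superscalar))   # 5.5e-07 s, which the notes round to about 0.5e-06 s
```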
170
Type                       Number of pipelines   Issue order     Instructions per cycle   Scheduling
Simple scalar              1                     in-order        1                        static
Scalar                     >1                    in-order        1                        static
Superscalar in-order       >1                    in-order        >1                       dynamic
VLIW                       >1                    in-order        >1                       static
Superscalar out-of-order   >1                    out-of-order    >1                       dynamic
Comment
Timing Anomalies
Appendix
Operational Unit (Arithmetic) Pipeline
Floating-point ADD operation pipeline
[Figure: floating-point ADD pipeline — operands A and B flow through the stages Equalize Exponents → Add Mantissas → Normalize, producing SUM, with buffer registers (RG) between the stages.]
Buffer registers (RG) are needed when the processing times of the pipeline stages are not equal. Flag bits
that indicate the completion of processing at the output of each stage may also be needed. In addition to the
pipelining within the functional units, it is possible to pipeline arithmetic operations between the functional
units. This is called chaining //aheldus//.
171
SUPERSCALAR PROCESSORS FUNDAMENTALS
172
CPII reflects the temporal parallelism of instruction processing.
[Figure: issue timeline for instructions I1...I9 — CPII is the issue interval between successive instructions.]
The benefit of using IPII and CPII instead of IPC or CPI is being able to identify two sources of
parallelism separately.
Taking into account the average number of data operations performed per cycle (DOPC), we get:
173
DOPC = IPC × OPI, where
OPI – operations per instruction (average).
OPI reveals intrainstruction parallelism.
Practically OPI<1.
Both factors, increasing issue width and pipeline depth, place increasing pressure on the instruction
fetch unit. Increasing the superscalar issue width requires a corresponding increase in the
instruction fetch bandwidth.
174
Pipelined ILP Machine
Question: how much instruction-level parallelism (ILP) is required to keep the processor pipelines
busy?
T = 8; Ltot = [(4 × 1) + (2 × 3) + (2 × 4)] / 8 = 9/4; P = T × Ltot = 8 × 9/4 = 18 instructions
(with T = 8 pipelines and Ltot the average operation latency, P = T × Ltot independent instructions
are needed to keep all pipelines busy).
In contrast to simple pipelining, ILP exploitation is based on the idea of a multiple-issue
processor (MIP). An MIP has multiple pipelined datapaths for instruction execution.
175
Buses
• Widening increases the width of the data bus, but not the control and address bus.
• Replication increases the number of buses between the register file and the L1 cache.
• Widening requires no additional address translations, while replication requires several
address translated per cycle.
Register File
• Applying replication increases the number of ports per bit.
• Both techniques increase the register file area on chip and clock cycle time.
Code size
• Widening can reduce the total number of instructions.
• The reduction of the code size can reduce the miss rate of the instruction cache and
improve performance.
For a given technology, the best performance is obtained when combining a certain degree of
replication and widening in the hardware resources.
176
Instruction Fetch
In the instruction fetch stage s instructions are getting out of the instruction cache during each
clock cycle. These instructions are called fetch group //käsuvõtu grupp//.
Let p be the probability that an instruction is a taken branch or an unconditional branch (jump).
Assume that the fetch stage can predict whether a branch is taken.
If the first instruction in the fetch group is a taken branch or an unconditional jump, then the rest of
the instructions in the fetch group should not be executed.
The probability P1 that only one executable instruction (the branch or jump) is fetched is:
P1 = p
The probability P2 of fetching two executable instructions is:
P2 = (1 - p) × p
The probability that all s instructions in the fetch group are executable (the last one may or may not
be a taken branch or an unconditional branch) is:
Ps = (1 - p)^(s-1)
[Figure: expected number of executable instructions SE in the fetch group versus fetch width s — SE ≤ s and saturates near 1/p.]
The limit on the number of executable instructions in fetch group would seem to be a serious
limitation for large superscalar processors.
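A minimal sketch (not from the notes) that evaluates the probabilities above and the expected number of executable instructions in a fetch group; the value of p is an illustrative assumption.

```python
# P_k = (1-p)^(k-1) * p for k < s, P_s = (1-p)^(s-1); the expected number of
# executable instructions SE approaches 1/p for large fetch widths s.
def expected_executable(s, p):
    prob = [(1 - p) ** (k - 1) * p for k in range(1, s)]    # P_1 .. P_{s-1}
    prob.append((1 - p) ** (s - 1))                         # P_s
    return sum(k * pk for k, pk in enumerate(prob, start=1))

p = 0.2                           # assumed probability of a taken branch/jump
for s in (2, 4, 8, 16):
    print(s, round(expected_executable(s, p), 2))
# SE grows with the fetch width s but saturates near 1/p = 5 instructions.
```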
177
Instruction Window
Instruction window – the set of instructions that is examined for simultaneous executing.
Since each instruction in the window must be kept in the processor, and the number of comparisons
required to execute any instruction in the window grows quadratically with the window size, real
window sizes are likely to be small.
The number of comparisons required every clock cycle is equal to:
(maximum instruction completion rate) × (window size) × (number of operands per instruction)
The window size limits the maximum number of instructions that may issue.
The notion of the instruction window comprises all the waiting stations between the decode
(rename) and execute states (FUs).
The instruction window isolates the decode/rename stage from the execution stages of the
instruction pipeline. The decode stage continues to decode instructions regardless of whether they
can be executed or not. The decode stage places the decoded instructions in the instruction window
as long as there is room in the window.
The instructions in the instruction window are free from control dependences, which are removed
by branch prediction and free from antidependences or output dependences, which are removed by
renaming. Only data dependences (RAW) and resource dependences remain to be taken into
consideration.
Aligned issue – the following group of instructions is not decoded/issued until all n instructions
of the current group have completed their execution.
178
Superscalar Processor’s General Model
[Figure: general model of a superscalar processor — instructions from the static program pass through instruction fetch & branch prediction, are dispatched in order into the instruction window (reservation stations, RS), issued out of order (OoO) to the functional units (FU), executed and written back out of order, and finally reordered and committed in order. OoO = out-of-order, RS = reservation station, FU = functional unit.]
179
The total window size is limited by the required storage, the comparison logic, and a limited issue
rate, which makes large windows less helpful.
The instruction window size (WS) directly limits the number of instructions that begin execution
in a given cycle.
In practice, real processors have a more limited number of FUs, as well as a limited number of
buses and register access ports, which serve as limits on the number of instructions initiated per
clock. The maximum number of instructions that may issue, begin execution, or commit in the
same clock cycle is usually much smaller than the window size.
Instruction Shelving
Shelving //käsuladustus// decouples instruction issue and data dependency checking. This
technique presumes special instruction buffers in front of the execution units.
With shelving, instructions are issued first to the RSs with essentially no dependency checks.
Shelving delays dependency checking to a later step of processing. During dispatching the
processor checks the instructions held in the RSs for data dependencies and forwards dependency-
free instructions to available execution units.
180
[Figure: pipeline stage sequence Fetch → Decode → Dispatch → Execute → Memory → Retire, illustrating front-end contraction (a shorter front end reduces the branch penalty) and back-end optimization (an Optimize stage before Retire).]
There is no longer the centralized control performed by the control path. Instead, a form of
distributed control via propagation of the control signals through the pipeline stages is used.
Not only the data path is pipelined, but also the control path is pipelined.
The traditional data path and the control path are integrated into the same pipeline.
181
Optimization Types
Optimization type | Explanation | Percentage of the total number of optimizing transforms
182
Superscalar Pipeline Models
Superscalar pipelines are parallel pipelines [A], being able to initiate the processing of
multiple instructions in every machine cycle.
Superscalar pipelines are diversified pipelines [B] employing multiple and heterogeneous
functional units in their execution stage.
Superscalar pipelines can be implemented as dynamic pipelines [C] in order to achieve the
best possible performance without requiring reordering of instructions by the compiler.
[Figure: parallel pipeline of width k — stages IF, ID, RD, EX, MEM, WB replicated across pipes 1...k.]
183
Parallel pipeline is a combination of pipelining and parallel processing.
For parallel pipelines or superscalar pipelines the speedup is measured with respect to a scalar
pipeline and is primarily determined by the width (wp) of the parallel pipeline.
A parallel pipeline with width wp can concurrently process up to wp instructions in each of its
pipeline stages.
B. Diversified Pipeline
Diversified pipelines are parallel pipelines that have been specialized to process different
instructions types.
In diversified pipelines, multiple different functional units (FU) are used in the execution stage.
Instead of implementing p identical pipes in a p-wide parallel pipeline, diversified execution pipes
can be implemented in the execution portion of the parallel pipeline.
184
2. Each instruction type incurs only the necessary latency and makes use of all stages of
an execution pipe;
3. If all instruction dependencies between different instruction types are resolved prior to
dispatching, then once instructions are issued into the individual execution pipes, no
further stalling can occur due to instructions in other pipes.
The idea of diversified pipelines is not new. The supercomputer CDC 6600 (1965) had 10 parallel
diversified execution pipelines.
C. Dynamic Pipeline
In any pipelined design, buffers are required between pipeline stages. The buffers hold all essential
control and data bits for the instruction that has just traversed stage i of the pipeline and is ready to
traverse stage (i+1) in the next machine cycle. Single-entry buffers are quite easy to control.
In every machine cycle, the buffer’s current content is used as input to stage (i+1), and at the end
of cycle, the buffer latches in the result produced by stage i. The buffer clocked each machine
cycle. The exception occurs when the instruction in the buffer must be held back and prevented
from traversing stage (i+1). In that case, the clocking on the buffer is disabled, and the instruction
is stalled in the buffer.
Dynamic instruction scheduling – the hardware rearranges the instruction execution order to reduce
pipeline stalls. Dynamic scheduling offers several advantages:
1. It enables handling some cases when dependencies are unknown at compile time;
2. It simplifies the compiler.
These advantages are gained at a cost of a significant increase in hardware complexity.
In introducing out-of-order execution, it must be split the instruction decode pipeline stage into
two stages:
1. Issue – decode instructions, check for structural hazards;
2. Read operands – wait until no data hazards, and then read operands.
In parallel pipeline multiple buffers are needed between two consecutive pipeline stages.
Multiple instructions can be latched into each multientry buffer in every machine cycle.
In the next cycle, these instructions can traverse the next pipeline stage.
185
If all of the instructions are in a multientry buffer are required to advance simultaneously in a
lockstep fashion, then the control of the multientry buffer is similar to that of the single-entry
buffer. Each entry of the simple multientry buffer is hardwired to one write port and one read port
of this buffer.
The entire multientry buffer is either clocked or stalled in each machine cycle. Such operation of
the parallel pipeline may induce unnecessary stalling of some of the instructions in a multientry
buffer. Simple multientry buffer enhancements could be:
Adding special connectivity between entries to facilitate movement of data between entries.
Providing a mechanism for independent accessing of each entry in the buffer.
Superscalar pipelines differ from scalar pipelines in one key aspect, which is the use of
complex multientry buffers for buffering instructions in flight.
In order to minimize unnecessary stalling of instructions in parallel pipeline, trailing instructions
must be allowed to bypass stalled leading instructions.
Such bypassing can change the order of execution of instructions from the original sequential order
of the static code. With out-of-order execution of instructions, there is the potential of approaching
the data flow limit of instruction execution.
A dynamic pipeline achieves out-of-order execution via the use of complex multientry buffers that
allow instructions to enter and leave the buffer in different orders.
The execution portion of the dynamic diversified pipelines is usually bracketed by two reordering
multientry buffers:
186
2. The second buffer is the completion buffer (reorder buffer - ROB).
It buffers the instructions that may have finished execution out of order and retires the instructions
in order by outputting instructions to the final write-back stage in program order.
The operation of the ROB is as follows:
1. When an instruction is decoded, it is allocated an entry at the top of the ROB so that the result
value of this instruction can be written into the allocated entry when it completes.
2. When the value reaches the bottom of the ROB, it is written into the register file.
If the instruction is not complete when its entry reaches the bottom of the ROB,
the ROB does not advance until the instruction completes.
[Figure: dynamically scheduled pipeline — Fetch → Decode → Rename → instruction window (wakeup & select) → RGF read → execute with bypass → data cache.]
187
Example
A 5-stage dynamic pipeline (wp = 3) with four execution pipes in the execution stage, bracketed by
two reordering multientry buffers (dispatch buffer and reorder buffer).
[Figure: IF, ID and RD proceed in order into the dispatch buffer; the execution pipes (including MEM and FP pipes, e.g. MEM2, FP2, FP3) operate out of order; the reorder buffer restores program order before WB.]
188
Example 2
189
Dynamic Instruction Scheduling in Superscalar Processors
The most straightforward way to improve the IPC of a modern superscalar processor is to increase:
1. The instruction issue width (IW)
Not all types of instructions can be issued together.
2. The instruction window size (WS), [WS ≥ IW].
Restrictions
a. The IW is restricted mainly by the clock speed.
b. The WS is restricted by the number of permitted transistors.
c. The delays of almost all components of a superscalar processor except execution units are
increasing functions of IW and WS.
Dynamic instruction schedulers can be:
1. With data capture;
2. Without data capture.
[Figure: register update paths for the two scheduler organizations — with data capture, operands are copied into the data-captured scheduling window (reservation stations) and results are forwarded to it for wake-up; without data capture, the non-data-capturing scheduling window holds only tags and the RGF is read after issue, just before the FUs.]
Instruction scheduler with data capture: when an instruction is dispatched, the operands that are
ready are copied from the RGF into the instruction window. For the operands that are not ready,
tags are copied into the instruction window and used to latch in the operands when they are
forwarded by the FUs.
Instruction scheduler without data capture: the register read is performed after the instructions
have been issued to the FUs. At instruction dispatch no operands are copied into the instruction
window; only tags for the operands are loaded into the window. The scheduler performs tag
matching to wake up ready instructions. All ready instructions that are issued obtain their operands
directly from the RGF just prior to execution.
190
Example
[Figure: two example superscalar organizations — a simple design with external bus lines, I-cache, instruction decoder (IDEC), register file (RGF), fixed-point unit (FXU) and D-cache access; and a wider design with four decoders (IDEC 1-4), instruction issue logic, two fixed-point units (FXU 1, FXU 2), two three-stage floating-point pipelines (FPU11-FPU13, FPU21-FPU23), a load/store unit, a branch target buffer (BTB), RGF, I-cache and D-cache.]
191
Register Renaming and Reservation Stations
The dynamic instructions scheduling scheme was invented by Robert Tomasulo. Superscalar
processors include huge register sets, so register renaming and initial dependency checks are
performed before instructions are issued to the FUs.
Consider a 7-stage superscalar pipeline model for a dynamically scheduled processor.
Instructions are scheduled without data capturing; in the pipeline the stages are connected in the
order: “Rename” (RS) => “Wakeup/Select” => “Register Read” => “Execute/Bypass” (RS).
[Figure: instruction format and processor pipeline — Fetch → Decode → Rename → instruction window (wakeup logic, issue) → RGF read → Execute (instructions wait in reservation stations; results bypass) → Commit.]
Comment
1. In the fetch stage, instructions are fetched from the instruction cache.
2. Then instructions are decoded and their register operands renamed.
3. Instructions in instruction window are free from control dependencies, and free from name
dependencies.
4. Next instructions are written into the reservation stations where they wait for their source
operands and a functional unit to become available.
Register renaming is a process whereby each new instruction is assigned a unique destination
register from a pool of physical registers.
Register renaming is used to ensure each instruction is given correct operands and is done by
dynamically mapping the logical (architectural) register numbers (names) used in program to
processor’s internal physical registers.
During register renaming processor removes false data dependencies (WAR, WAW) by writing the
results of the instruction first into dynamically allocated rename buffers (ROB), rather than into the
specified destination registers.
192
Before renaming           After renaming
I1   ADD BX, AX           ADD R2, R1      RAW!
I2   MUL CX, BX           MUL R3, R2
I3   MOV BX, DX           MOV R4, R5
I4   ADD BX, #5           ADD R6, #5
The architectural destination register BX is renamed in instructions I1, I3 and I4.
The number of physical registers can be greater than the number of architectural registers, allowing
registers to be reused less often. The mapping from architectural to physical registers is very
similar to the mapping of virtual to physical memory addresses.
There are two main ways to implement the register renaming scheme in the superscalar processor:
1. The first method provides a larger number of physical registers than the logical registers.
2. The second method uses a reorder buffer (ROB).
A common method to implement register renaming is to use a separate register rename file (RRF)
or physical register file in addition to the architectural register file (ARF).
In a typical load/store ISA an instruction may have up to one destination register (ARd or Dest)
and up to two source registers (ARs1 and ARs2 or Src1 and Src2). The register names as used by
the instruction refer to the architectural register name space.
A straightforward way to implement the register renaming is to duplicate the ARF and use the
RRF as a shadow version of the ARF. This will allow each architectural register to be renamed
once, but this is not an efficient way to use the registers in RRF.
An alternative is to use a mapping table to perform renaming of the ARF. The core renaming
activity takes place in the register alias table (RAT). The RAT holds the current mapping of
architectural to physical registers.
Register renaming involves looking up a map table to pick up the current mappings for architectural
registers and, in parallel, true-dependence checking, followed by a modification of the mappings
for source registers that were produced by earlier instructions in the current rename group.
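A minimal sketch (not from the notes) of RAT-based renaming with a free list of physical registers; the register names and sizes are illustrative assumptions.

```python
# Rename destination registers to fresh physical registers; sources use the
# current RAT mapping, so true (RAW) dependences are preserved while WAR/WAW
# dependences on the architectural names disappear.
class Renamer:
    def __init__(self, arch_regs, num_phys):
        self.rat = {r: f"P{i}" for i, r in enumerate(arch_regs)}    # current mapping
        self.free = [f"P{i}" for i in range(len(arch_regs), num_phys)]

    def rename(self, dest, src1, src2):
        ps1, ps2 = self.rat[src1], self.rat[src2]    # sources: current mappings
        pd = self.free.pop(0)                        # new physical reg for dest
        self.rat[dest] = pd                          # later readers of dest see pd
        return pd, ps1, ps2

r = Renamer(["AX", "BX", "CX", "DX"], num_phys=8)
print(r.rename("BX", "BX", "AX"))   # I1: ADD BX, AX  -> BX gets a new physical reg
print(r.rename("CX", "CX", "BX"))   # I2: MUL CX, BX  -> reads I1's mapping (RAW kept)
print(r.rename("BX", "DX", "DX"))   # I3: MOV BX, DX  -> WAR/WAW on BX removed
```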
193
[Figure: register alias table (RAT) — the architectural registers ARd, ARs1, ARs2 of an instruction are looked up in the RAT; the sources are mapped to PRs1 and PRs2 (sent to the scheduler) and the destination is assigned a new physical register PRdnew (sent to the ROB).]
After registers are renamed, the instruction is dispatched into out-of-order processing core. This is
accomplished by establishing the entries for the instruction in two queues: the issue queue (IQ) and
the ROB. The IQ is also known as the dispatch buffer or the reservation stations in distributed
implementation.
[Figure: stand-alone RRF organization — the map table and the ARF (Data, Busy bit B) are indexed by the register specifier; a set busy bit points via a tag to an RRF entry (Tag, Data, Valid bit V) that supplies the operand; alternatively the RRF is incorporated into the ROB.]
194
When a separate RRF is used for register renaming, there are implementation choices where to
place the RRF – stand-alone RRF or incorporated into ROB RRF.
In both options a busy field (B) is added to the ARF along with mapping table.
If the busy field of the selected ARF entry is set (B = 1), indicating that the architectural register
has been renamed, the corresponding entry of the map table is accessed to obtain a tag (pointer) to
the RRF entry.
If the RRF is in the ROB, then the tag specifies entry in a reorder buffer (ROB).
If the RRF is incorporated as part of the ROB, then every entry of the ROB contains an additional
field that functions as a rename register.
The task of register update takes place in the back end of the machine. It does not have direct
impact on the operation of the RRF.
When a register-updating instruction finishes execution, its result is written into the entry of
the RRF indicated by the tag.
When this instruction is completed its result is copied from the RRF into the ARF.
Once a rename register has been copied to its corresponding architectural register, its busy bit is
reset (B = 0) and it can be used again to rename another architectural register.
The ROB allows instructions to complete only in program order by permitting an instruction to
complete only if it has finished its execution and all preceding instructions are already completed.
195
An advantage of the register renaming approach is that instruction commitment is simplified,
since it requires only two actions:
a. Record that the mapping between an architectural register number and a physical register
number is no longer speculative;
b. Free up any physical register being used to hold the “older” value of the architectural
register.
REM
1. The physical registers contain values of completed but not yet retired instructions.
2. The architectural registers store the committed values.
3. After committing of an instruction, copying its result from the physical register to the
architectural register is required.
196
Example: Register Renaming, Step 2
Program fragment:
LD R1, Op(R3)
ADD R3, R1, #7f    (step 2)
SUB R6, R7, R6
[Figure: rename map table (architectural registers R0-R7 mapped onto physical registers P0-P31) and ROB contents after step 2 — the LD and ADD have been renamed; R1 is remapped from P8 to P0 (LD destination) and R3 from P7 to P1 (ADD destination); the physical registers holding <R1>, <R3>, <R6>, <R7> are marked present (p).]
Example: Register Renaming, Step 3
Program fragment:
LD R1, Op(R3)
ADD R3, R1, #7f
SUB R6, R7, R6    (step 3)
[Figure: rename map table and ROB contents after step 3 — in addition to the LD (R1: P8 → P0) and ADD (R3: P7 → P1), the SUB has been renamed and its destination R6 is remapped from P5 to P2; the ROB now holds the LD, ADD and SUB.]
[Figure: rename/dispatch datapath — a decoded instruction (OC, Rd, Rs1, Rs2) is renamed via the mapping table; ready source operands are read from the architectural register file (ARF) or the rename register file (RRF), and the instruction (OC, Rd*, OP1, OP2) is dispatched to the reservation stations (RS); an entry (instruction address, tag, finished bit F, Rd*, result value, valid bit V) is allocated in the reorder buffer (ROB); execution-unit (EU) results update the RS via bypassing, the RRF and the ROB, and the committed result is written back for Rd*.]
201
Reservation Stations
The basic idea is that a reservation station fetches and buffers an operand as soon as it is available,
eliminating the need to get the operand from a register.
A RS can potentially support multiple instruction issues per cycle.
o When an instruction completes execution and produces a result, the result’s name is
compared to the operands names in the RS.
o If an instruction is waiting for this result in the RS, the data is written into the
corresponding position and its availability indicator is set (V=1).
o When during dispatch an instruction is placed into the RS, all available operands
from the register file are copied into this instruction’s field.
o When all operands are ready, the instruction is executed.
After an instruction completes execution, it waits in the reservation station until all earlier
instructions have completed execution. After this condition is satisfied, the instruction is committed.
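A minimal sketch (not from the notes) of a data-capturing reservation station entry with tag-broadcast wakeup; the field and tag names are illustrative assumptions.

```python
# Ready operands are copied at dispatch; missing ones are represented by the
# producing instruction's tag and latched when that result is broadcast.
class RSEntry:
    def __init__(self, op, operands):
        # operands: {name: (value, tag)} – exactly one of value/tag is meaningful
        self.op = op
        self.val = {}                      # captured operand values
        self.tag = {}                      # producer tags of pending operands
        for name, (value, tag) in operands.items():
            if tag is None:
                self.val[name] = value     # ready at dispatch: copy from the RGF
            else:
                self.tag[name] = tag       # pending: remember the producer's tag

    def broadcast(self, tag, value):
        """Called when an FU broadcasts (tag, result) on the forwarding bus."""
        for name, t in list(self.tag.items()):
            if t == tag:                   # tag match: latch the forwarded value
                self.val[name] = value
                del self.tag[name]

    def ready(self):
        return not self.tag                # wakes up when no operand is pending

# ADD has one ready operand (5); the other is produced by the FU broadcasting "ROB7".
e = RSEntry("ADD", {"src1": (5, None), "src2": (None, "ROB7")})
print(e.ready())            # False
e.broadcast("ROB7", 12)
print(e.ready(), e.val)     # True {'src1': 5, 'src2': 12}
```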
A centralized RS allows all instruction types to share the same reservation station.
Distributed RSs can be single-ported buffers, each with only a small number of entries.
Idle reservation station entries cannot be used by instructions destined for execution in another
functional unit.
202
Distributed and Centralized Reservation Stations
[Figure: A. distributed reservation stations — separate RS groups in front of their FUs (issue → execute → finish → complete); B. centralized reservation station (dispatch buffer) shared by all FUs (dispatch & issue → execute → finish → complete).]
203
Dispatching implies the associating of instruction types with functional unit types after
instructions have been decoded.
Issuing means the initiation of execution in functional units.
[Figure: C. reservation station organization — select logic per reservation station (RS1...RSn, entries RSE1...RSEm) chooses ready entries; operands come from the register file, destination tags are broadcast on the tag bus, results from FU1...FUn are driven on the result bus and written to the reorder buffer.]
204
Reservation Station Entries
Typically the dispatching of an instruction requires three steps:
1. Select a free RS entry;
2. Load operands and/or tags into the selected entry;
3. Set the busy bit of that entry.
Each reservation station entry contains information about each of the instruction's sources:
whether the source is ready and the number of cycles it takes the producer of the source's value to
execute.
RS entry format (each source field can be written from the dispatch slots or from the forwarding buses):
SRC tag | M | Shift | R | Delay || SRC tag | M | Shift | R | Delay || DEST tag
When an RSE is waiting for a pending operand, it must continuously monitor the tag bus.
When both operand ready fields (M, R) are valid, the instruction is said to wake up.
205
Pipelined scheduling with speculative wakeup is accomplished by monitoring each instruction's
parents and grandparents. The wakeup logic is part of the reservation stations.
[Figure: wakeup logic — the destination tags Tag1...Tagm broadcast on the tag bus are compared with the source tag; a match (OR of the comparators) loads the delay field and marks the source as ready.]
206
The inputs to the select logic are the request signals from each of the FU’s RSEs, plus additional
information needed for scheduling heuristics such as priority information.
Because instructions cannot wake up until all instructions they depend on have been selected,
wakeup and select form a critical loop. Generally, wakeup and select together constitute an
atomic operation: if they are divided into multiple pipeline stages, dependent instructions cannot
execute in consecutive cycles.
Conventional Wakeup
[Figure: conventional wakeup timing — I1: wakeup → select/broadcast → reg read → execute/bypass; I2 waits one cycle and I3 waits two cycles, since each can wake up only after its parent's tag has been broadcast.]
The scheduling scheme assumes that each instruction has a one-cycle latency.
207
Speculative Wakeup
[Figure: speculative wakeup timing — I2 and I3 wake up speculatively, overlapping with the select/broadcast of their parents, so the dependent instructions I1, I2, I3 are selected in consecutive cycles even with two-cycle scheduling logic (speculative execution).]
For scheduling logic pipelined over 2 cycles, the child can assume that, when the tags of its
grandparent pair have been received, the parent (I2) is probably being selected for execution and
will broadcast its tag in the following cycle.
The child (I3) can then speculatively wake up and be selected the cycle after its grandparent (I1) is
selected.
208
Reservation S tation
Address Generation
S tore Address Generation Load
Unit Unit
Address Translation
Address Translation
Tag match
Data Address
Data Address
LOAD BY PASSING
(Finished) If no match:
S tore buffer Update destination register
Data Cache
(Completed) Match
S tore buffer
LOAD FORWARDING
Match / No match If match:
Forward to desstination register
The execution core responsible for processing load/store instructions has one store unit and one
load unit. Both are fed by a common reservation station.
We assume that load and store instructions are issued from the RS in program order.
The store buffer operates as a queue and has two portions – finished and completed.
The finished portion contains those stores that have finished execution but are not yet
architecturally completed.
The completed portion of the store buffer contains those stores that are completed
architecturally but waiting to update the memory.
When a finished store is completed by the ROB, it changes from the finished state to the
completed state. A store (instruction) does not really finish its execution until it is retired.
Key issue in implementing load bypass is the need to check for possible aliasing with preceding
stores.
The load forwarding enhances and complements the load bypassing technique.
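A minimal sketch (not from the notes) of load bypassing and load forwarding against a store buffer; the data structures are illustrative assumptions.

```python
# The load address is compared with the buffered store addresses; on a match the
# youngest matching store's data are forwarded, otherwise the cache is accessed.
def execute_load(addr, store_buffer, data_cache):
    """store_buffer: list of (address, data) for pending stores, oldest first."""
    for st_addr, st_data in reversed(store_buffer):    # youngest matching store wins
        if st_addr == addr:
            return st_data, "forwarded from store buffer"
    return data_cache.get(addr, 0), "bypassed pending stores, read from cache"

cache = {0x100: 11, 0x200: 22}
pending = [(0x100, 99)]                     # a not-yet-committed store to 0x100
print(execute_load(0x100, pending, cache))  # (99, 'forwarded from store buffer')
print(execute_load(0x200, pending, cache))  # (22, 'bypassed pending stores, ...')
```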
209
Instruction Execution Phase
Execution Units Mix
The instruction frequencies of each type of instruction determine the optimal number of execution
units (FU) of each type:
pi – probability that any instruction is of type i (depends on what program is running);
IW – superscalar width;
IW×pi – average number of instructions of type i in fetch group;
ni – the number of execution units (FU) for instruction type i.
The total number of execution units (FU) must satisfy:
Σi ni ≥ IW × Σi pi ≅ IW, with ni > IW × pi for each instruction type i.
The total number of execution units must be larger than the superscalar width.
Integer ALU 1 1
Integer Multiply 1 2
Store 1 —
FP Add 1 2
FP Multiply 1 4
FP Divide 12 12
FP Convert 1 2
(X10 XIEEE 754)
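A minimal sketch (not from the notes) applying ni > IW × pi to size the execution unit mix; the instruction-type frequencies are illustrative assumptions, not measured data.

```python
# Round IW * p_i up to get the number of units of each type; the total exceeds IW.
import math

IW = 4                                           # superscalar (fetch/issue) width
p = {"int_alu": 0.45, "load_store": 0.30, "branch": 0.15, "fp": 0.10}

n = {t: math.ceil(IW * pi) for t, pi in p.items()}   # units needed per type
print(n, "total =", sum(n.values()))                 # total > IW, as stated above
```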
210
Superscalar Processor Performance Degradation due to Dependencies
[Figure: instructions executed per 10000 cycles for different processor configurations — 10000, 20765, 26797, 32459 and 40000.]
Renaming completely eliminates the false data dependencies and improves performance by about 25%.
Branch prediction provides 90% accuracy and improves performance by about 28%.
Shelving provides full processor speed despite the presence of true data dependencies and improves
performance by about 25%.
211
There are two main approaches to implementing correct behavior under control speculation.
1. The first mechanism relies on taking snapshots, or checkpoints, of the processor state at
appropriate points and reverting to the corresponding snapshot when a control mis-
speculation is detected.
2. The second approach involves mechanisms that reconstruct the desired state by sequentially
processing the in-flight instructions in program order until the deviation from the program
order is reached.
When a branch misprediction is detected in a superscalar processor, the processor performs a series
of steps to ensure correct execution.
Typically, branch misprediction recovery requires stalling the front-end processor, repairing the
architectural state, and restarting the process of fetching and renaming instructions from the correct
path.
Instructions after the mispredicted branch are squashed and all resources they hold are freed. In
addition, the mapping of physical registers is backed up to the point of the mispredicted branch.
The instruction fetch unit is also backed up to the point of the mispredicted branch and the
processor begins sequencing on the correct path.
212
Finalizing Instruction Execution
Instruction Completion, Commitment and Retirement
Instruction finishes execution //käsk on töödeldud// when it exits the FU and enters
completion buffer.
An instruction is completed //lõpetatud// when the functional unit has finished the
execution of the instruction and the result is made available for forwarding or buffering.
Instruction committing //püsitatud// means that the instruction results have been made
permanent and the instruction retired from the instruction scheduler.
A result is made permanent //püsitulem// by:
a. Making the mapping of architectural to physical register permanent
b. Copying the result value from the rename register to the architectural register
[Only the architectural registers are renamed. The RRF can be a separate stand-alone structure
similar to the ARF, or it can be incorporated into the reorder buffer.]
Instruction retiring //erustatatud// means instruction removal from the instruction
scheduler with or without the commitment of operation results.
213
Typical ROB entry fields:
Busy (B) | Issued (I) | Finished (F) | Instruction Address (IA) | Rename Register (RR) | Speculative (S) | Valid (V)
An instruction can be in one of several states: waiting execution, in execution, and finished
execution. The status bits are updated as an instruction traverses from one state to the next.
A special bit (S) is used to indicate whether an instruction is speculative or not.
If speculation can cross multiple branches, additional bits can be employed to identify which
speculative basic block an instruction belongs to. When a branch is resolved, a speculative
instruction can become nonspeculative or invalid.
[Figure: ROB and RAT contents for the JNZ / SUB / ADD example — each ROB entry records whether it is ready to retire and its physical and architectural destination registers (R3 → AX for ADD, R6 → BX for SUB, R18 → BX for the older committed value); the RAT lists, for each architectural register, the physical register holding the most recent committed in-order result; the oldest pointer marks the next instruction to retire. When ADD completes execution it becomes the oldest ready entry and is retired.]
214
The ADD instruction writes result to register R3, which has been mapped to architectural register
AX. The subtract instruction SUB writes result to register R6, which has been mapped to
architectural register BX.
The conditional jump instruction JNZ does not write to register.
At the right the RAT maintains a list of the physical registers holding the most recent committed
in-order results for each architectural register.
When the ADD instruction completes execution, as the oldest instruction, it is allowed to retire.
The ROB oldest pointer is incremented to show that the most recent committed results for
AX are stored in register R3.
Upon execution of the conditional jump (JNZ), it is discovered to have been mispredicted.
The jump is now the oldest entry in the ROB, and it is allowed to retire.
When SUB retires, its result is discarded because the earlier retired jump instruction was
mispredicted. The RAT is not updated and the latest committed result for BX is still in register
R18.
The registers used by the subtract instruction will be recycled for use by other instructions.
The result of the subtract instruction will be overwritten without ever having been read.
Often the ROB is implemented as a circular FIFO buffer with a head pointer and a tail pointer.
The tail pointer is advanced when ROB entries are allocated at instruction dispatch.
The number of entries that can be allocated per cycle is limited by the dispatch bandwidth.
Instructions are completed from the head of the queue.
Reorder buffer entries are allocated in the first issue stage and deallocated serially when the
instruction retires.
During the retire stage a number of instructions at the head of the FIFO queue are scanned and an
instruction committed if all previous instructions are committed or can be committed in the same
cycle.
In the case of instructions that are on a misspeculated path, the instructions are removed from ROB
and the physical registers freed without making the result permanent or copying back results.
When an instruction is completed its rename register and its reorder buffer entry are deallocated.
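A minimal sketch (not from the notes) of the ROB as a circular FIFO with in-order retirement; sizes and entry fields are illustrative assumptions.

```python
# Entries are allocated at the tail at dispatch and retired from the head only
# when the oldest instruction has finished execution.
from collections import deque

class ReorderBuffer:
    def __init__(self, size=8):
        self.size = size
        self.fifo = deque()                     # oldest entry at the left (head)

    def dispatch(self, instr):
        if len(self.fifo) == self.size:
            return None                         # ROB full: dispatch stalls
        entry = {"instr": instr, "finished": False, "result": None}
        self.fifo.append(entry)                 # allocate at the tail
        return entry

    def finish(self, entry, result):
        entry["finished"] = True                # out-of-order completion
        entry["result"] = result

    def retire(self):
        retired = []
        while self.fifo and self.fifo[0]["finished"]:
            retired.append(self.fifo.popleft()) # commit strictly in program order
        return retired

rob = ReorderBuffer()
e1, e2 = rob.dispatch("I1"), rob.dispatch("I2")
rob.finish(e2, 42)                              # I2 finishes first ...
print([e["instr"] for e in rob.retire()])       # ... but nothing retires until I1 is done
rob.finish(e1, 7)
print([e["instr"] for e in rob.retire()])       # now both retire in order: ['I1', 'I2']
```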
215
Typically the processor's retire bandwidth is the same as its issue bandwidth.
Store Buffer
Store instructions are processed differently than Load instructions.
Unlike a Load instruction, the data for a Store instruction are kept in the ROB. At the time the
Store is being completed, these data are written out to memory. The reason for this delayed
access to memory is to prevent a premature and potentially erroneous update of the memory in
case the Store instruction has to be flushed.
For a Store instruction it is possible to move the data to the Store buffer at completion.
The purpose of the Store buffer is to allow stores to be retired when the memory bus is not busy,
giving priority to Load instructions that need to access the memory.
Execute stage: 1. Execute instructions; 2. Arbitrate for result buses.
216
Performance (IPC) for Superscalar Architecture
Superscalar processors with parallel pipelines complete many instructions every clock cycle.
If there are s parallel pipelines which do not stall, then s instructions can complete every clock
cycle.
When stalls occur, the superscalar processor may still be able to complete a smaller number of
instructions (from (s-1) to zero).
IPC(c) – the number of instructions completed during clock cycle c.
s ≥ IPC(c) ≥ 0
The average IPC can be determined from
IPC = (1 / Nc) × Σc IPC(c), where Nc is the total number of clock cycles.
The complexity of superscalar processor usually makes it very difficult to calculate IPC(c) with
simple formulas. Simulation of the hardware running a set of benchmark programs is usually used.
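A minimal sketch (not from the notes) of the IPC formula above; the per-cycle completion counts are an assumed trace, as they would normally come from a simulator.

```python
# Average IPC over a simulated run: IPC = (1/Nc) * sum over cycles of IPC(c).
completions = [3, 0, 2, 3, 1, 0, 3, 2]         # IPC(c) for each clock cycle c
ipc = sum(completions) / len(completions)       # Nc = number of cycles
print(ipc)                                      # 1.75 for this trace (s = 3 pipelines)
```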
Comment
At least two circumstances limit the superscalar processor efficiency:
217
Doubling issue rates in modern superscalar processors above today's 3-6 instructions
per clock, say to 6-12 instructions, probably requires a processor to:
Issue 3 or 4 data memory accesses per cycle;
Resolve 2 or 3 branches per cycle;
Rename and access more than 20 registers per cycle;
Fetch 12 to 24 instructions per cycle.
David Patterson
[Figure: superscalar processor organization — the I-cache and branch predictor drive FETCH (instruction flow); the instruction buffer feeds DECODE and EXECUTE (register data flow); the reorder buffer (ROB) feeds COMMIT; the store queue and D-cache handle the memory data flow.]
218
First-, Second- and Third-Generation Superscalar Processors
Issue width
Single ported data cache Dual ported data cache Dual ported data cache
219
RISC ARCHITECTURE**
CISC – Complex Instruction Set Computer
RISC – Reduced Instruction Set Computer
//kärbikprotsessor//
The term RISC was first used by David Patterson in 1980 (University of California, Berkeley).
Design target:
max P,
min C.
IBM 801 was a high-performance minicomputer which was never marketed; it was designed
by John Cocke.
IBM 801 architecture main features:
1. The trivial instruction set (RISC-like),
2. HLL orientation (special compiler),
3. HW control,
4. One instruction per clock cycle,
5. 32-bit CPU with 32 registers,
6. ECL-technology.
Increases in architectural complexity catered to the belief that a significant barrier to better
computer performance is the semantic gap – the gap between the meanings of high-level language
statements and the meanings of machine-level instructions.
The studies for CISC computers show that:
1. About 80% of the instructions generated and executed use only 20% of an instruction set.
220
Conclusion 1
Conclusion 2
If only the simpler instructions are required, the processor hardware needed to implement them
could be reduced in complexity.
It should be possible to design a higher-performance processor with fewer transistors and at lower cost.
Conclusion 3
With a simpler instruction set, it should be possible for a processor to execute its instructions in a
single clock cycle and synthesise complex operations from sequences of instructions.
221
9. The procedure call-return is the most time-consuming operation in typical HLL programs.
The percentage of time spent on handling local variables and constants turned out to be
highest, compared to other variables.
10. The number of instructions that have to access memory during the execution stage must be
minimized as much as possible.
a. Memory access, during the execution stage, is done by
LOAD and STORE instructions only.
b. All operations, except LOAD and STORE, are
register-to-register, within the processor.
c. Systems featuring these rules are said to be adhering to
a load/store memory access architecture.
11. Uniform handling of instructions can be achieved by using hardwired control.
12. The central processor unit is provided with a large register set (register file).
13. RISC architecture exploits ILP in the instruction pipeline. As a rule, the instruction pipeline
is supported by advanced cache memory system.
The original MIPS, SPARC and Motorola 88000 CPUs had classic scalar RISC pipelines,
which contained five pipeline stages:
1. Instruction fetch;
2. Decode;
3. Execute;
4. Access;
5. Write back.
14. Exceptions (interrupts) have a great influence on the RISC processor pipeline's productivity.
The most common kind of software-visible exception on one of the classic RISC processors
is a TLB miss. Exceptions are different from branches and jumps, because branches and jumps,
which change the control flow, are resolved in the decode stage.
Exceptions are resolved in the writeback stage.
When an exception is detected, the following instructions in the pipeline are marked as
invalid, and as they flow to the end of the pipe their results are discarded. The program
counter is set to the address of a special exception handler.
The exceptions handling is based on the precise exception model.
A precise exception means that all instructions up to the excepting instruction have been
executed, and the excepting instruction and everything afterwards have not been executed.
222
[Figure: overlapping register windows — the physical register file is divided into banks (A, B, C); each procedure sees a virtual window of 32 registers consisting of In, Local and Out registers plus the global registers 0-9 shared by all windows; the Out registers of procedure A overlap the In registers of procedure B (and likewise B/C), and the active register bank pointer selects the current window.]
223
Hardwired Control Unit's Model
[Figure: hardwired control unit — the master clock and clocking logic (Start, Wait, Stop) drive a step generator/counter producing timing steps T1...Tm; the instruction decoder (ld, add, br, sto), the timing-step decoder, ALU condition signals, interrupts and other external signals feed a control-signal encoder that generates control signals such as PCout, PCin, ADD, SUB and Wait; the counter can be loaded with a preloadable code and reset.]
224
The Second Generation RISC Microprocessors
Clock (MHz) 20 40 33 25
225
Examples of the Second Generation RISC Microprocessors
Intel 80860 (1989)
[Figure: i860 block diagram — 64-bit system bus (160 MB/s) with bus and MMU control; a 4-stage RISC pipeline CPU core with a 32 × 32-bit register file, a 2-way 4 kB instruction cache and a 2-way 8 kB data cache (64/128-bit internal paths); a floating-point unit with its own register file, a 3-stage pipelined FP adder/subtractor and FP multiplier (MAC instructions); and a 3-D graphics unit.]
Motorola 88000
[Figure: MC88100/MC88200 system — the CPU's internal source 1, source 2 and destination buses, 32-bit processor (P) and memory (M) buses connecting to 88200 cache/MMU chips and the memory bus.]
[Table: second-generation RISC processors — pipeline stages: 7, 5, 4, 5, 8; cache sizes (kB): IC 8 + DC 8, 32, IC 20 + DC 16, IC 16 + DC 16; performance: SPECint92 70, 70, 77, 135, 96; SPECfp92 105, 105, 98, 200, 105.]
228
RISC versus CISC
Advantages Disadvantages
In recent years, the RISC versus CISC controversy has died down to great extent.
RISC Problem
Today superscalar micros can issue 4 – 5 instructions per cycle, but execute only 1,5 – 2
instructions per cycle.
Cause
1. The code and data are not on-chip when needed.
2. A cache miss in a high-speed superscalar RISC is a real performance downer.
229
Some potential solutions
1. Data prefetching,
2. Branch prediction,
3. Speculative execution,
4. Large, multilevel cache-system,
5. Smart memory controllers,
6. Multiple thread execution,
7. Code optimization, code preprocessing,
Another approach to RISC was the zero-address instruction set (since the majority of the space in
an instruction is used to identify the operands of the instruction).
In zero-address machines the operands are placed on a push-down (LIFO-type) stack.
Summary
One important development in architecture has been the focus on architecture’s role in efficiently
utilizing the underlying physical technology. The RISC breakthrough is a key example of this idea.
RISC architects recognized that by scaling the complexity and cost of the microprocessor architecture
down to the point at which an effective pipeline could be implemented on a single IC, a large
increase in performance was possible.
They recognized that a good match between architecture and the implementation technology is
crucial to achieving maximum performance and cost efficiency.
Today’s RISC computers are typically complicated and feature-laden relative to the original RISC
implementations, but they still reflect careful trade-offs by designers between architectural
complexity, implementation cost and clock speed.
Literature
1. Arvo Toomsalu. RISC-mikroprotsessorite arhitektuur. Tallinn, 1995.
2. IEEE Standard 1754-1999. IEEE Standard for a 32-bit Microprocessor Architecture.
[The standard is derived from SPARC, which was formulated at Sun Microsystems in 1985.
IEEE 1754 is designed to be a target for optimizing compilers and easily pipelined hardware
implementations.]
230
MODERN SUPERSCALAR ARCHITECTURE
CISC philosophy
If added hardware can result in an overall increase in speed, it is good.
The ultimate goal is to map every HLL statement onto a single CPU instruction.
RISC philosophy
Always ask whether hardware is necessary, or whether it can be replaced by software.
Higher instruction bandwidth is usually offset by a simpler chip that can run at a higher clock
speed, and more available optimizations for the compiler.
1. Speculative Execution
In a pipelined processor, branch instructions in the execute stage affect the instruction fetch stage.
Advanced speculative techniques are used, as for branch predictions, data prediction, address
prediction, memory dependencies and latencies prediction. To preserve program semantics, a
speculative movement of instructions should only result in speculative execution that is safe and
legal.
Conventional superscalar processors employ the strong-dependence model //jäik sõltuvuste mudel// for
program execution. In the strong-dependence model two instructions are identified as either dependent or
independent; when in doubt, dependences are pessimistically assumed to exist.
Superspeculative processors apply the weak-dependence model //lõtv sõltuvuste mudel//. In this case
dependences can be temporarily violated during instruction execution as long as recovery can be performed
prior to affecting the permanent machine state.
The term data value speculation refers to mechanisms that predict the operands of an instruction,
either source or destination, and execute speculatively the instructions dependent on it before the
231
actual value is computed, allowing the processor to avoid the ordering imposed by data
dependences.
Example
Predictor for arithmetic instructions stores the last result in the last value field.
Load address predictors store the last effective address.
Load value predictors store the last value read from memory.
Store predictor uses two tables: one for predicting the effective address and the other for predicting
the value to be written.
When an instruction is to be predicted, the prediction table is accessed and the predicted value is
computed by adding the stride to the previous last value. If the most significant bit of the confidence
field is set, the predicted value can be used in place of the actual value when it is available earlier;
the prediction must still be verified against the actual value.
232
2. Instructions Predecoding
Instruction decoding involves the identification of the instruction types and detection of
instructions dependences among the group of instructions that have been fetched but not yet
dispatched.
The complexity of the instruction decoding task is influenced by the ISA and the width of the
parallel pipeline.
o For a RISC scalar pipeline, instruction decoding is trivial. Usually decode stage is used
for accessing the register operands and is merged with the register read stage.
o For a RISC parallel pipeline with multiple instructions being simultaneously decoded,
the decode stage must identify dependences between instructions and determine
independent instructions that can be dispatched parallel.
o For a CISC parallel pipeline the instruction decoding stage is more complex and usually
requires multiple pipeline stages.
o The modern CISC pipelines have an additional task – the decoder must translate the
architected instructions into internal low-level operations that can be directly executed
by the HW.
These internal operations resemble RISC instructions and can be viewed as vertical
microinstructions.
3. Threading
In virtual memory systems a distinction is made between the multitasking of processes and threads.
Thread //lõim, haru// - a sequential execution stream within a task (process).
A thread state is entirely stored in the CPU, while a process includes extra state information –
mainly operating system support to protect processes from unexpected and unwanted interferences.
Threads are sometimes called lightweight processes. Switching between threads does not involve
as much overhead as conventional processes.
[Figure: threading taxonomy — one process/one thread (OPOT), one process/multiple threads (OPMT), multiple processes/one thread (MPOT), multiple processes/multiple threads (MPMT); each combination corresponds to a different instruction trace.]
233
4. Multithreading
Contemporary superscalar microprocessors are able to issue 4-6 instructions at each clock cycle
from a conventional linear instruction stream. New technology will allow microprocessors with an
issue bandwidth of 8-32 instructions per cycle.
As the issue rate of microprocessors increases, the compiler or the hardware will have to extract
more instruction-level parallelism from a sequential program.
ILP found in conventional instruction stream is limited.
Most techniques for increasing performance increase power consumption.
The multithreaded approach may be easier to design and may have higher power efficiency, but its
utility is limited to specific, highly parallel applications.
Multithreading does not improve the performance of an individual thread, because it does not
increase the ILP. It improves the overall throughput of the processor.
Multithreading exploits a different set of solutions by utilizing coarse-grained parallelism.
A multithread processor is able to concurrently execute instructions of different threads of control
within a single pipeline. Multithreading replicates only the program counter (PC) and register file,
but not the number of execution units and memories.
Multithreading effectively divides the resources of a single processor into two or more logical
processors, with each logical processor executing instructions from a single thread. It means that
the processor must provide:
a) Two or more independent program counters,
b) An internal tagging mechanism to distinguish instructions of different threads within the pipeline
c) A mechanism that triggers a thread switch.
[Figure: programs, processes and threads — each process (Task 1...Task n, created from programs 1...m) has its own registers and stack; within a process, threads 1...k each execute part of the program (e.g. basic blocks BB1-BB4) and have their own register state; the OS and the programs reside in memory (MEM).]
234
An explicit multithreaded processor //ilmutatud lõim- e hargtöötlusega protsessor// uses
coarse-grained parallelism. Multiple program counters are available in the fetch unit and the
multiple contexts are often stored in different register sets. The execution units are multiplexed
between the thread contexts.
Simultaneous multithreading (SMT) allows multiple threads to execute different instructions in the
same clock cycle. In an SMT processor the multithreading technique is combined with a wide-issue
superscalar processor. The number of concurrent threads is limited in practice, usually to between
2 and 8.
235
Multithreaded Categories
[Figure: issue-slot (S = 4) utilization by threads 1-5 over time — completely idle cycles represent vertical waste; alternatively, the slots can be divided between two processors with S = 2 each, running threads 1/4 and 2/5.]
5. Predicated Instructions
Predicated instructions are executed only if a condition is true, i.e. if bits in a condition code
register have an appropriate value. This eliminates some branches, and in a superscalar processor it
can allow both paths of certain branches to be executed in parallel, with the incorrect one discarded
at no branch penalty.
6. Branch Prediction
All pipelined processors do branch prediction of some form. There are many different branch
prediction methods in use - trivial prediction, next line prediction, bimodal branch prediction, local
branch prediction, global branch prediction, combined branch prediction, agree prediction,
overriding branch prediction.
Branch path //hargnemistee// – consists of the dynamic code between two conditional branches
with no intervening unconditional branches.
236
Split branch
//tükeldatud hargnemine//
Multipath execution
Multipath execution is the execution of code down both paths of one or more branches, discarding
incorrect results when a branch is resolved.
In typical speculative execution a branch’s final state, taken or not taken, is predicted when the
branch is encountered.
Multipath execution differs from unipath execution in that both paths of a branch may be
speculatively executed simultaneously. Thus, any branch misprediction penalty normally occurring
in unipath execution is reduced or eliminated, improving performance.
Speculative code can consist of both multipath and unipath sections; that is, a section may proceed
down one or both paths of a branch, and this situation may change over time.
Most forms of multipath execution use some form of branch prediction.
7. Prefetching
A technique which attempts to minimize the time a processor spends waiting for instructions to be
fetched from memory. Instructions following the one currently being executed are loaded into a
prefetch queue when the processor's external bus is otherwise idle. If the processor executes a
branch instruction or receives an interrupt then the queue must be flushed and reloaded from the
new address.
8. Trace Cache
The trace cache is a mechanism for increasing the instruction fetch bandwidth by storing traces
of instructions that have already been fetched, and possibly even executed.
The trace cache line is filled as instructions are fetched from the instruction cache.
Each line in the trace cache stores a trace (a snapshot) of the dynamic instruction stream.
The same dynamic sequences of basic blocks that appear non-contiguous in the instruction cache
are contiguous in the trace cache. Trace caches are caches that store instructions either after they
have been decoded, or as they are retired.
237
A trace is a sequence of at most n instructions and at most m basic blocks starting at any point in
the dynamic instruction stream.
Trace lines stored in the trace cache are fully specified by a starting address and a sequence of
branch outcomes which describe the path followed.
[Figure: a trace built from basic blocks BB3, BB4 and BB5.]
In the instruction fetch stage of a pipeline, the current program counter along with a set of branch
predictions is checked in the trace cache for a hit. If there is a hit, a trace line is supplied to fetch.
The trace cache continues to feed the fetch unit until the trace line ends or until there is a
misprediction in the pipeline. Because traces also contain different branch paths, a good multiple
branch predictor is essential to the success rate of trace caches.
The limit n is the trace cache line size and m is the branch predictor throughput.
• A trace cache consists of a control and a data area. The control part of each trace cache entry holds
a trace tag composed of the trace's starting address, the number of basic blocks, and the branch
outcome of each block.
238
• The start PC and branch outcomes are collectively called the trace identifier or trace ID.
Looking up a trace in the trace cache is similar to looking up instructions or data in
conventional caches, except the trace ID is used instead of an address.
• A subset of the trace ID forms an index into the trace cache and the remaining bits form a tag.
• One or more traces and their identifying tags are accessed at that index. If one of the tags
matches the tag of the supplied trace ID, there is a trace cache hit. Otherwise, there is a
trace cache miss (a C sketch of this lookup is given after this list).
• New traces are constructed and written into the trace cache either speculatively or non-
speculatively.
• A new predicted trace ID is supplied to the trace cache each cycle.
• In addition to typical parameters such as size, set-associativity and replacement policy, the
trace cache design includes:
1. Indexing methods;
2. Path associativity;
3. Partial matching;
4. Trace selection;
5. Trace cache fill policy;
6. Parallel or sequential accessing of the trace, etc.
• A problem of trace caches is that they necessarily use storage less effectively than
instruction caches.
• Longer traces improve trace prediction accuracy.
• Overall performance is not as sensitive to trace cache size and associativity.
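Example (illustrative C sketch)
The lookup referred to in the list above can be sketched as follows: the trace ID (start PC combined with the predicted branch outcomes) is split into an index and a tag, and a tag match at the indexed entry is a trace cache hit. The sizes, the direct-mapped organization and the way outcomes are packed into the ID are assumptions for illustration only.

#include <stdint.h>
#include <stdbool.h>

#define TC_ENTRIES 256                     /* trace cache entries (assumed)        */
#define TRACE_LEN  16                      /* up to n instructions per trace       */

typedef struct {
    bool     valid;
    uint32_t tag;                          /* remaining bits of the trace ID       */
    uint32_t insts[TRACE_LEN];             /* the stored trace (snapshot)          */
} tc_entry_t;

static tc_entry_t trace_cache[TC_ENTRIES];

/* Build the trace ID from the start PC and the predicted branch outcomes. */
static uint32_t make_trace_id(uint32_t start_pc, uint32_t branch_outcomes)
{
    return (start_pc >> 2) ^ (branch_outcomes << 29);   /* outcomes in high bits   */
}

/* Look up a trace: a subset of the ID forms the index, the rest forms the tag. */
bool trace_cache_lookup(uint32_t start_pc, uint32_t outcomes, tc_entry_t **hit)
{
    uint32_t id    = make_trace_id(start_pc, outcomes);
    uint32_t index = id % TC_ENTRIES;
    uint32_t tag   = id / TC_ENTRIES;

    if (trace_cache[index].valid && trace_cache[index].tag == tag) {
        *hit = &trace_cache[index];
        return true;                       /* trace cache hit                      */
    }
    return false;                          /* miss: fall back to the I-cache       */
}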
The trace cache fetch mechanism consists of the trace cache (TC), the fill unit, a trace predictor,
and a conventional instruction cache with a branch predictor.
239
The trace predictor treats traces as basic units and explicitly predicts sequences of traces.
The output of the trace predictor is an identifier (ID) indicating a starting instruction address of
a trace. The identifier is used to index the trace cache and the trace is provided if the access is hit.
The index into the trace cache can be derived from the starting PC (program counter state), or a
combination of PC and branch outcomes.
When a trace cache miss occurs, the instruction cache works as a backup and supplies instructions
with the help of the branch predictor.
The fill unit collects the traces which are stored in the trace cache. It constructs the trace using the
instructions supplied from the instruction cache, and updates the trace cache according to branch
outcomes provided by the execution engine.
Each trace is dispatched to both the execution engine and an outstanding trace buffer (OSTB).
A trace line begins with a target and ends with a control transfer instruction.
Each bit in the branch flags encodes whether or not the corresponding embedded branch is taken.
The branch mask is a sequence of leading 1's followed by trailing 0's. Only the internal branches, not
the terminal branch, distinguish the trace and are considered by the hit logic.
If the maximum branch limit is m branches overall, then the maximum number of embedded
branches in a trace is no more than (m-1). There are (m-1) branch flags, and the branch mask is
(m-1) bits long.
The m-th branch predicted by the branch predictor terminates the trace.
The branch mask is used by the hit logic to compare the correct number of branch predictions with
branch flags, according to the actual number of branches within the trace.
The branch mask determines how many of the predictions supplied by the multiple-branch
predictor are actually “considered” by the matching procedure.
If the terminal instruction is a conditional branch, then the two address fields are equal to the
branch’s fall-through and target address respectively.
If the terminal branch is predicted taken, then the next PC is set equal to the trace's target address.
If the terminal branch is predicted not taken, then the next PC is set equal to the trace's
fall-through address.
The trace construction finishes when:
1. The trace contains n instructions;
2. The trace contains m conditional branches;
3. The trace contains a single indirect jump, return or trap instruction;
4. Merging the incoming block would result in a segment larger than n instructions.
A basic block cannot be split across two traces.
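Example (illustrative C sketch)
The trace construction rules above can be expressed as a small decision function. The limits n and m and the basic block descriptor below are hypothetical and purely for illustration.

#include <stdbool.h>

#define N_MAX 16      /* n: maximum instructions per trace                    */
#define M_MAX 3       /* m: maximum conditional branches per trace            */

typedef struct {
    int  num_insts;              /* instructions in the incoming basic block  */
    bool ends_in_indirect;       /* indirect jump, return or trap instruction */
} basic_block_t;

/* Decide whether the trace being built by the fill unit must be finalized. */
bool trace_is_finished(int insts_so_far, int branches_so_far,
                       const basic_block_t *next_bb)
{
    if (insts_so_far >= N_MAX)                      return true;  /* rule 1    */
    if (branches_so_far >= M_MAX)                   return true;  /* rule 2    */
    if (next_bb->ends_in_indirect)                  return true;  /* rule 3    */
    if (insts_so_far + next_bb->num_insts > N_MAX)  return true;  /* rule 4:
            merging would exceed n, and a basic block cannot be split          */
    return false;
}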
240
The performance of the TC is strongly dependent on trace selection, the algorithm used to divide
the dynamic instruction stream into traces.
Sometimes a partial trace cache hit occurs. When such an event occurs, either the inter-trace prediction
is incorrect or at least one of the intra-trace predictions is incorrect: at least one of the branches in
the trace results in a miss, or the entire trace is invalid.
The CTC has the same instruction fetch organization as the STC, except that the backing
instruction cache and the trace cache are probed in parallel.
241
STC and CTC Model
[Figure: STC and CTC model — the trace predictor supplies a trace ID to the trace cache; on a TC hit the selected trace feeds the fetch stage, on a TC miss the backing instruction cache supplies instructions; the outstanding trace buffer (OSTB) and the execution engine update the trace predictor and trace cache.]
242
Example
[Figure: example post-RISC processor — main memory feeds a predecoder and instruction cache backed by a trace cache; a fetch/flow unit and a decode/branch unit feed the instruction dispatch and reorder buffers; execution units EU1–EU4, a load/store unit with data cache, and a branch unit write into the completed instructions buffer, which is drained by the retirement unit.]
243
Fetch / Flow Prediction
Decode / Branch
3. Dispatched, but not ready, instructions are buffered in the appropriate Reservation Stations.
4. Rename Registers hold results until the instruction retires.
5. At retirement, the Rename Register values are either copied to the architectural registers or discarded.
6. When a branch is resolved, all of the Rename Registers allocated for the wrong path are freed,
and only instructions on the correct path are allowed to retire.
Execution Units
1. Functional Units classification:
a. Single-cycle.
b. Multiple-cycle:
1. Multi-cycle units that are pipelined and accept a new operation each cycle;
2. Multi-cycle units that are pipelined but do not accept a new operation each cycle;
3. Multi-cycle units that are not pipelined.
c. Variable cycle time units.
2. The Branch Unit communicates the results of a branch evaluation to the other units:
• The Fetch/Flow Unit must be informed of mispredictions so that it can update the
flow history in the instruction cache and begin fetching at the correct target
address;
• The Branch/Decode Unit must be informed so that it can update the branch history
table;
• The Instruction Dispatch and Reorder Buffer must be informed so that it can discard
any instruction that should not be executed.
244
• The Completed Instruction Buffer must be informed to discard instructions that
will never be retired;
• If there are multi-cycle Execution Units working on instructions that will never be
executed because of a branch, they also must be notified.
3. Completed Instruction Buffer and Retire Unit
1. The Completed Instruction Buffer holds instructions that have been speculatively executed.
2. The Retire Unit removes executed instructions in program order from the Completed
Instruction Buffer.
3. The Retire Unit updates the architectural registers with the computed results from the Rename
Registers.
4. By retiring instructions in order, this stage can maintain precise exceptions.
5. In a post-RISC architecture the important performance metric is the number of instructions
retired in parallel.
The next step is advanced superscalar processors, which aim to issue from 16 up to 32 instructions
per cycle.
Barrel processor is a CPU that switches between threads of execution on every cycle.
Summary
Multiple-Issue Architectures
[Table fragment: superscalar — dynamic grouping, hardware functional-unit assignment, dynamic (some out-of-order) execution; example: IBM Power 2.]
245
Post-RISC Processor – Generalized Model of a Superscalar Pipeline
[Figure: the instruction cache (filled from main memory via a predecoder) and a branch predictor feed the fetch unit; instructions flow through the decode buffer, decode unit, dispatch buffer and dispatch unit into distributed reservation stations (RS) in front of the functional units; an architected register file and a rename register file handle register writeback; results enter the completion (reorder) buffer and the complete unit, pass through a store buffer, and are finally retired by the retire unit. Branch target addressing modes shown: PC-relative, register indirect, and register indirect with offset.]
246
Summary
2. The number of instructions that can be concurrently scheduled is determined by data and
control dependencies. When scheduling in software it is only possible to use static
information.
5. Increasing the size of the instruction window increases the probability that the functional units
can be fully utilized.
7. Data dependencies can occur via registers or via the memory subsystem.
8. Out-of-order execution by itself limits the instruction window to a single basic block,
which averages only about five instructions.
9. To maintain an instruction window large enough to keep a number of functional units busy,
instructions from multiple basic blocks must be included.
10. One way to do this is to speculate on the outcome of branches by executing both paths
through a branch and then discarding the untaken branch.
12. Speculative execution necessitates some method of recovery from data dependence
conflicts, usually by returning to an earlier state and re-executing.
13. While speculation increases the instruction window size it places additional requirements
on the CPU.
247
15. Other possibilities to handle branches are:
a. Loop buffers;
b. Multiple instruction streams;
c. Prefetch branch target;
d. Data fetch target;
e. Delayed branch;
f. Look ahead resolution;
g. Branch folding target buffers.
16. A different style of speculation is possible by noticing that some blocks within a program
are guaranteed to execute no matter what combination of branches precede them.
Literature
Kaeli David R., Yew Pen-Chung. Speculative Execution in High Performance Computer
Architectures. Chapman & Hall/CRC Computer and Information Science Series. Taylor & Francis
Group. LLC. CRC Press, 2005.
248
VLIW ARCHITECTURE
VLIW – Very Long Instruction Word
There are four major classes of ILP architectures, which can be differentiated by whether these
tasks are performed by the hardware or by the compiler:
1. Superscalar processor architecture,
2. EPIC architecture,
3. VLIW architecture,
4. Dynamic VLIW architecture.
VLIW is an architectural approach that relies on compiler technology; it exploits well-scheduled
code. This architecture generalizes two concepts:
horizontal microcoding and superscalar processing.
[Figure: a horizontal microinstruction — the address of the next microinstruction (ADR) followed by micro-operation fields MO1, MO2, MO3 … MOn across a wide microinstruction word.]
In VLIW technology several instructions are issued during each clock cycle, as in the superscalar
case.
Programs written in conventional short instruction words must be compacted together to form
the VLIW instructions.
The compiler reveals parallelism in the program and notifies the hardware about which operations
do not depend on each other.
Different fields of the long instruction word carry the opcodes to be dispatched to different
functional units.
249
VLIW versus Superscalar (I)
In VLIW processor the compiler guarantees that there are no dependencies between instructions
that issue at the same time and that there are sufficient HW resources to execute them.
In superscalar processor the HW buffers and dynamically schedules instructions during operation.
Pipelining utilizes temporal parallelism, whereas VLIW and superscalar techniques also utilize
spatial parallelism.
[Figure: VLIW processor — an instruction bundle fetched from the instruction cache (backed by main memory) feeds a branch unit, a load/store unit and functional units F-unit(1) … F-unit(s-1) in parallel.]
250
VLIW Compiler
Techniques used to extract more ILP from the sequential instruction stream when preparing code
for a VLIW processor are:
1. Loop Unrolling
Loop unrolling is a basic technique to expose more ILP in a sequential instruction stream (a sketch
is given after this list).
2. Software Pipelining
Software pipelining is a technique which enables the later iterations of a loop to start before the
previous iterations have finished the loop.
3. Trace Scheduling
The trace scheduling technique has been developed to find more ILP across basic block
boundaries. A trace is a large extended basic block along a control path that is frequently
taken. This technique treats the branch instructions inside a trace as jump instructions, and it
schedules the instructions inside the trace as one large basic block.
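Example (illustrative C sketch)
The loop unrolling sketch referred to in technique 1: the fragment below unrolls a simple accumulation loop by a factor of four, exposing four independent additions per iteration that a VLIW compiler could pack into the slots of one long instruction word. The loop and the unroll factor are illustrative only.

/* Original loop: one addition per iteration, very little ILP exposed.  */
float sum_rolled(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by four: the four partial sums are independent, so the
   compiler can schedule the four additions into parallel slots.        */
float sum_unrolled(const float *a, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)                   /* clean-up loop for the remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}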
251
All three techniques can be used to increase the amount of parallelism by extending the
instruction pool. They all depend strongly on appropriate and accurate control flow
prediction.
The template field T //malliväli// indicates which of the instructions in the bundle or
in the neighbouring bundles can be used in parallel on different functional units.
Predicated Instructions
Predication is a form of compiler-controlled speculation in which branches are removed
in favour of fetching multiple paths of control.
Predication is the conditional execution of instructions based on the value of a Boolean source
operand, referred to as the predicate. A qualifying predicate is held in a 1-bit predicate register.
252
The values in the predicate register file are associated with each instruction in the extended
instruction set through the use of an additional source operand. This operand specifies which
predicate register will determine whether the operation should modify processor state.
The values of the predicate registers are set by the results of instructions such as compare and
test bit.
1. If the value in the predicate source register is true (“1”), a predicated instruction is
allowed to execute normally.
2. Otherwise (“0”), the predicated instruction is nullified, preventing it from modifying
the processor state.
Example
if (x > y) // branch Bi //
a = b+c
else
d = e+f
pT, pF = compare (x > y) // branch Bi: pT is set to TRUE if x > y, else pF is set to TRUE //
if (pT) a = b+c
if (pF) d = e+f
g = h/j-k;
The branch is eliminated and the compiler can schedule the instructions under pT and pF to
execute in parallel.
Predication Features
253
VLIW Pipeline
[Figure: VLIW pipeline timing over clock cycles 1–8.]
VLIW architecture uses instruction bundles. For example, IA-64 defines a 128-bit bundle that
contains three instructions, called syllables //silp//, and a template field.
The processor can fetch one or more bundles at a time.
The interpretation of the template field is not confined to a single bundle.
The template field bits, placed there by the compiler, tell the CPU which instructions can be
executed in parallel. Template bits in each bundle specify dependencies both within bundles as
well as between sequential bundles.
When the CPU finds a predicated branch, it begins to execute the code for every possible branch
outcome. When the outcome of the branch is known, the processor discards the results along the
“wrong path” and commits the results along the “right path”. In effect there is no branch at the
processor level.
If the compiler cannot predict a branch, then the CPU will behave like a conventional
processor: it will try to predict which way the branch will go.
A compiler scans the source code to find upcoming loads from main memory. It will add a
speculative load instruction and a speculative check instruction.
At run time, the first instruction loads the data from memory before the program needs it.
The second instruction verifies the load before letting the program use the data.
254
VLIW versus Superscalar (II)
1. The code density of the superscalar processor is better than in the VLIW processor.
2. The decoding of the VLIW instructions is easier than the superscalar instructions.
3. VLIW processors that exploit different amounts of parallelism require different
instruction sets.
4. VLIW architecture supports only in-order issue of instructions.
5. VLIW is an architectural technique, whereas superscalar is a microarchitectural technique.
255
2. A template in the instruction bundle identifies the instruction type and any architectural stops
between groups of instructions.
The length of a bundle does not define the limits of parallelism.
3. To avoid conditional branches, each instruction can be conditioned (predicated) on a true/false
value in a predicate register.
4. IF-THEN-ELSE sequences can be compiled without branches.
5. Multiple targets can be specified and instructions can be prefetched from those paths.
6. The change of the program counter can be delayed until an explicit branch instruction.
7. EPIC incorporates:
1. Speculation (data and control);
2. Predication;
3. Explicit parallelism;
4. Scalability with respect to the number of FUs.
8. EPIC`s basic features are:
1. Separate large register files (64-bit general purpose registers (GR0-GR127), 82-bit floating
point registers (FR0-FR127), 1-bit predicate registers (PR0-PR63), 64-bit branch registers
(BR0-BR7) – used to specify the target addresses of indirect branches) for execution units.
2. Parallel instruction execution in separate execution units.
3. Multi-way branches – by bundling 1-3 branches in a bundle, the execution may jump to
any one of the three branches or may fall through to the next instruction. The first true
branch will be taken.
4. 128-bit instruction formats. Instructions are bundled in the groups of three instructions.
5. Instruction set is optimised to address the needs of cryptography, video encoding and other
functions needed by the servers and workstations.
The Itanium 10-stage pipeline:
IPG – Instruction Pointer Generation; FET – Fetch; ROT – Rotate; EXP – Expand; REN – Rename;
WLD – Word-Line Decode; REG – Register Read; EXE – Execute; DET – Exception Detect;
WRB – Write-Back.
The EXP and REN stages form Instruction Delivery; the WLD and REG stages form Operand Delivery.
256
Instruction delivery (EXP, REN)
1. Distributes up to 6 instructions to the 9 functional units;
2. Implements register renaming for both rotation and register stacking.
Operand delivery (WLD, REG)
1. Accesses register file, performs register bypassing, accesses and updates a register scoreboard,
and checks predicate dependences.
Execution (EXE, DET, WRB)
1. Executes instructions through four ALUs and two load/store units, detects exceptions, posts NaTs,
retires instructions and performs write-back
2. Deferred exception handling for speculative instructions is supported by providing the
NaTs (Not a Thing), for the GPRs, and NaT Val (Not a Thing Value) for FPRs.
Bundle format: three 41-bit instructions plus a 5-bit template (41 + 41 + 41 + 5 = 128 bits).
41-bit instruction format (field widths in bits):
Major opcode (4) | Other modifying bits (10) | GR3 (7) | GR2 (7) | GR1 (7) | PR (6).
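Example (illustrative C sketch)
The field widths above add up to 41 bits per instruction and 128 bits per bundle (3 × 41 + 5). The bit-field sketch below merely mirrors those documented widths; it is not a real IA-64 encoder, and the in-memory layout of C bit-fields is implementation-defined.

#include <stdint.h>

/* One 41-bit IA-64 instruction (syllable).                              */
typedef struct {
    unsigned int pr        : 6;    /* qualifying predicate register       */
    unsigned int gr1       : 7;    /* first  general register specifier   */
    unsigned int gr2       : 7;    /* second general register specifier   */
    unsigned int gr3       : 7;    /* third  general register specifier   */
    unsigned int modifiers : 10;   /* other modifying bits                */
    unsigned int opcode    : 4;    /* major opcode                        */
} ia64_syllable_t;                 /* 6+7+7+7+10+4 = 41 bits used         */

/* One 128-bit bundle: a 5-bit template plus three 41-bit syllables.      */
typedef struct {
    uint8_t         template_field;    /* only the low 5 bits are used    */
    ia64_syllable_t slot[3];           /* 3 x 41 = 123 bits                */
} ia64_bundle_t;                       /* 5 + 123 = 128 bits of content   */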
257
Stops may appear within and/or at the end of the bundle. The absence of stops indicates that some
bundles could be executed in parallel.
4. Each basic template has two versions – one with a stop after the third slot and
one without.
5. Instructions must be placed in slots corresponding to their types based on the template
specification, except for A-type instructions that can go in either I or M slots.
258
The Execution Unit Slots and Instruction Types they may hold
259
Itanium
[Figure: Itanium block diagram — fetch and prefetch engine with L1 I-cache, ITLB, branch prediction and an 8-bundle instruction queue; IA-32 decode & control; issue ports B B B M M I I F F; register stack engine (remapping) with scoreboard, predicate and exception handling, and branch prediction; 128 integer registers (RGI) and 128 floating-point registers (RGF); advanced load address table; L2 cache and bus controller, both with error checking and correction.]
260
Itanium Register Model
261
Comment
Superscalar:
- Instruction stream reordering and optimization at run time.
- Branch prediction, speculative execution of one path.
- Multiple parallel running execution units.
- Data is loaded from memory only when needed; tries to find the data in the caches first.
IA-64:
- Instruction stream is reordered and optimized at compile time.
- Speculative execution along both paths of a branch.
- Multiple parallel running execution units.
- Data is speculatively loaded before it is needed; tries to find the data in the caches first.
Summary
[Table: comparison of architectures by grouping, functional unit assignment and initiation.]
262
Superscalar Architecture
[Figure: instructions from the I-cache are dynamically scheduled onto functional units FU1–FU4, which access the D-cache.]
VLIW Architecture
[Figure: a long instruction word from the I-cache is placed in the instruction register; its fields drive functional units FU1–FU4 in parallel, sharing a common register file (RGF) and the D-cache.]
263
[Figure: division of code-generation work between compiler and hardware — moving from superscalar through EPIC and dynamic VLIW to VLIW, more of the scheduling work shifts from the hardware to the compiler.]
Additional Readings
LM: On the EPIC (IA-64) architecture //EPIC (IA-64)-arhitektuurist//
264
265
PROCESSOR ARCHITECTURE DEVELOPMENT
TRENDS
The main directions in future processor architecture principles
266
[Figure: superspeculative processor organization — a fill unit and two-level trace cache, a multi-hybrid branch predictor, I-cache and D-cache, reservation stations in front of functional units FU1 … FUn, and L2 caches.]
Superspeculative Processor
The basis for the superspeculative approach is that producer instructions generate many highly
predictable values in real programs. Consumer instructions can frequently and successfully
speculate on their source operand values and begin their execution without results from the
producer instructions.
The theoretical basis for superspeculative microarchitecture rests on the weak dependence model.
A superspeculative microarchitecture must maximize:
Instruction flow
What matters is the rate at which useful instructions are fetched, decoded, and dispatched to the
execution core. Instruction flow is improved by using a trace cache.
Data flow
What matters is the rate at which results are produced and register values become available.
As many dependencies as possible are eliminated or bypassed, using a register renaming
mechanism.
Memory data flow
What matters is the rate at which data values are stored to or retrieved from data memory.
267
Simultaneous Multithreading (SMT) Processor
Main features:
RAW Processor
A highly parallel RAW processor is constructed from multiple identical tiles //klots//.
Each tile contains 16 kB of instruction memory (IMEM), 32 kB of data memory (DMEM), an ALU,
registers, configurable logic (CL), and a programmable switch associated with its own 16 kB of
instruction memory.
268
Trace Processor
The trace processor consists of multiple superscalar processing cores, each one executing a trace
issued by a shared instruction issue unit. It also employs trace and data prediction and shared
caches.
Instruction fetch hardware fetches instructions from the I-cache and simultaneously generates traces
of 8 to 32 instructions, including predicted conditional branches.
Traces are stored in a trace cache. A trace fetch unit reads traces from the trace cache and sends
them to the parallel processing elements (PE).
1. Multicores - integrating a set of cores, each of which preserves the same "single thread"
performance as the previous generation;
2. Many-cores - integrating a large number of cores, trading single-threaded performance for
MTL (Multi-Threaded Level) performance;
3. Special cores - integrating multi-cores or many-cores with fixed-function logic and/or
programmable logic such as FPGAs.
269
Appendix
270
Summary
271
DEVELOPMENT TRENDS IN TECHNOLOGY
[Figure: microprocessor system design is shaped by technology, the environment and architecture.]
272
Computer Technology Development
* 5000 additions per second, 537 multiplications per second, 38 divisions per second.
** Approximately 50 tubes had to be replaced a day!
Motor-car:
273
6. Hardware improvements since 1946 (approximately):
a. CPU clock speed ⇒ 10^4,5 times;
b. Bus clock speed ⇒ 10^3 times;
c. RAM latency ⇒ >10^5 times.
[Figure: duration of computer technology generations (in years) — mechanical & relay (1930–1946), vacuum tube (1946–1959), transistor (1959–1964), integrated circuits, SSI/MSI (1964–1971), microprocessor, VLSI/ULSI (1971– ...?).]
274
IC Technology Development
Intel Corporation
Year | Resolution (µm) | Wafer size (mm) | Memory (DRAM) | Processor
1973 | 8       | 75  | 1 kbit   | 8080 (1974)
1975 | 5       | 75  | 4 kbit   | 80186
1978 | 2       | 100 | 16 kbit  |
1982 | 1,8     | 100 | 64 kbit  | 80286
1985 | 1,5     | 150 | 1 Mbit   | 80386
1989 | 0,8-1,0 | 150 | 4 Mbit   | 80486
1994 | 0,8     | 200 | 16 Mbit  | Pentium
1996 | 0,4     | 200 | 64 Mbit  | Pentium Pro
1998 | 0,2     | 300 | 64 Mbit  | Celeron
2000 | 0,18    | 300 | 256 Mbit | Pentium 4
Chip complexity (as defined by the number of active elements on a single semiconductor
chip) will double about every device generation (usually taken as about 18 months).
For Moore’s Law to remain valid, feature size must continue to be reduced, but since the reduction
is insufficient in and of itself, chip size must continue to increase.
275
Microprocessor Development Trends
Scaling in General
Scaling theory calls for reducing transistor dimensions by a scale factor S (where S < 1).
The result would be a transistor whose gate delay is reduced by a factor S, whose area is reduced
by S², and whose energy used each time a gate switches ON or OFF is reduced by S³.
Reduced feature sizes have provided two benefits:
Since transistors are smaller, more can be placed on a single die, providing area for
more complex microarchitectures.
Technology scaling reduces the transistor's gate length and hence the transistor's switching
time.
To support the transistor density improvements, minimum feature sizes have had to be reduced
by ~0,7× every three years.
New-generation process technologies generally come out every three years.
276
Using 0,7× as the scale factor S gives a transistor area improvement of 0,49×, a gate delay
improvement of 0,7×, and an energy reduction of 0,34× for every new process generation.
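Example (illustrative C sketch)
The per-generation factors quoted above follow directly from S = 0,7, as the short check below shows (illustration only):

#include <stdio.h>

int main(void)
{
    double S = 0.7;                              /* linear scale factor per generation */

    printf("gate delay factor : %.2f\n", S);             /* ~0.70x */
    printf("area factor       : %.2f\n", S * S);         /* ~0.49x */
    printf("energy factor     : %.2f\n", S * S * S);     /* ~0.34x */
    return 0;
}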
CMOS technologies allow digital circuits to be designed with signals that have full voltage
swing and low standby (leakage) current.
Transistor leakage is the greatest problem facing continued scaling.
The average power dissipation (P) of a digital CMOS circuit consists of static (Pstat) and
dynamic (Pdyn) components:
P = Pstat + Pdyn
The static power (Pstat) characterizes circuits that have a constant source current between the
power supplies.
The dynamic power Pdyn is the dominant part of the power dissipation in CMOS circuits, and
it is composed of three terms:
f is the clocking frequency; C is the gate-oxide capacitance; Ioff is the total transistor leakage
current and U is the power supply voltage.
Interconnections Scaling
1. Interconnects don't have the same scaling properties as transistors. Interconnects are
quickly becoming as critical as transistors in determining overall circuit performance.
2. The average width of interconnections is decreasing to provide the smaller dimensions needed to
pack in more transistors and wires.
3. The delay of a signal propagating down a wire is proportional to the product of the
wire resistance (R) and the wire capacitance (C); this is known as the wire's RC
delay. Increased R or C results in increased wire delay and slower signal propagation.
277
[Figure: delay time (ns) versus feature size (µm) — as feature size shrinks below about 1 µm, the interconnect (RC) delay grows while the intrinsic gate delay keeps decreasing.]
Reducing the feature sizes has caused wire width and height to decrease, resulting in larger wire
resistance due to the smaller wire cross-sectional area; unfortunately, wire capacitance has not
decreased proportionally.
4. By convention the wiring layers are subdivided into three categories:
1. Local, for connection within a cell;
2. Intermediate or mid-level, for connection across a module;
3. Global or top-level metal layers, for chip-wide communications.
5. To reduce communication delays, wires are both wider and taller in the mid-level and top-level
metal layers.
6. New interconnect materials are needed to improve performance without increasing the number
of metal layers.
7. Another way to improve interconnect performance is to reduce the capacitance between wires.
278
Shrinking linewidths not only enable more components to fit onto an IC (typically 2× per linewidth
generation) but also lower costs (typically 30% per linewidth generation).
Shrinking linewidths have slowed the rate of growth in die size to 1,14× per year versus
1,38× to 1,58× per year for transistor counts, and since the mid-nineties accelerating linewidth
shrinks have halted and even reversed the growth in die size.
The goal of a modern processor is to deliver as much performance as possible while keeping power
consumption within reasonable limits.
The power efficiency of a processor can be measured by Energy per Instruction (EPI).
{P [W] = E [J] / t [s]; power (P) is the rate of using energy (E).}
EPI is the amount of energy expended by the processor for each executed instruction.
EPI is measured in Joules per Instruction (Joules / Instruction).
EPI is related to other commonly used power-efficiency metrics such as performance / Watt or
MIPS / Watt.
279
EPI = Joules / Instruction = (Joules / second) / (Instructions / second) = Watts / IPS,
where IPS is the number of Instructions per Second.
EPI and MIPS / Watt do not consider the amount of time needed to process an instruction from
start to finish. The EPI is a function of energy (E) expended per instruction when the instruction is
processed.
E ≈ C × U², where
C is the switching capacitance and U is the power supply voltage.
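Example (illustrative C sketch)
A small worked example of the EPI definition; the power and instruction-rate figures below are invented purely for illustration and do not describe any particular processor.

#include <stdio.h>

int main(void)
{
    double power_watts = 65.0;        /* assumed average processor power          */
    double ips         = 4.0e9;       /* assumed instruction rate (IPS)           */

    double epi           = power_watts / ips;    /* EPI = Watts / IPS = J/instr   */
    double mips_per_watt = (ips / 1.0e6) / power_watts;

    printf("EPI         = %.2e Joules/instruction\n", epi);
    printf("MIPS / Watt = %.1f\n", mips_per_watt);
    return 0;
}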
Each new processor generation has reduced both C and U compared to the prior generation.
The combination of limited instruction parallelism suitable for superscalar issue, practical limits to
pipelining, and power limits has constrained future speed increases within conventional processor
cores. Microprocessor performance increases will slow dramatically in the future.
Multi-threaded software that can take advantage of chip multiprocessors (CMPs) will inherently
have phases of sequential execution.
During phases of limited parallelism (low IPS) the CMP will spend more EPI.
During phases of high parallelism the CMP will spend less EPI.
In both scenarios power is fixed.
When a two-way CMP replaces a uniprocessor, it is possible to achieve essentially the same or
better throughput on server-oriented workloads with just half of the original clock speed.
It is because request processing time is more often limited by memory or disk performance than by
processor performance.
The lower clock rate allows designing the system with a significantly lower power supply
voltage, often a nearly linear reduction.
To take advantage of a CMP, it is necessary to parallelize most latency-critical software into multiple
parallel threads of execution. CMPs make this process easier than conventional multiprocessors do.
The goal is to minimize the execution times of multi-threaded programs, while keeping the CMP
total power consumption within a fixed budget.
280
Summary
Energy consumption in general is a sum of two components: active energy and idle energy.
Minimizing the idle energy consumption is relatively simple: the processor enters a deep-sleep
power state, stops the clocks, and lowers the operating voltage to the minimum.
Minimizing the active energy consumption is more complex. A very slow execution consumes less
power for a longer period of time, while heavy parallelism reduces the active time but increases the
active power.
The processor’s performance benefit must be greater than the additional power consumed.
Literature
Vasanth Venkatachalam, Michael Franz. Power Reduction Techniques for Microprocessor
Systems. ACM Computing Surveys, Vol. 37, No. 3, September 2005, pp.195-237.
281
Dual and Multicore Processors
Core – the core set of architectural, computational processing elements that provide the
functionality of a CPU.
Advantages:
1. The processor uses slightly less power than two coupled single-core processors (because of the
increased power required to drive signals external to the chip and because the smaller silicon
process geometry allows the cores to operate at lower voltage).
2. The cache coherency circuitry can operate at a much higher clock rate than is possible if the
signals have to travel off-chip.
3. The design requires much less printed circuit board (PCB) space than multi-chip designs.
Disadvantages:
1. The processor requires operating system support to make optimal use of the second computing
resource.
2. The higher integration of the dual-core chip drives production yields down, and the chip is more
difficult to manage thermally.
3. Scaling efficiency is largely dependent on the application (applications that require processing
large amounts of data with low computational overhead may hit an I/O bottleneck).
4. If a dual-core processor has only one memory bus, then the available memory bandwidth per
core is half that available in a dual-processor mono-core system.
282
Dual core SMT allows multiple threads to execute different instructions in the same clock cycle.
It means that the processor has the ability to fetch instructions from multiple threads in a cycle and
it has a larger register file to hold data from multiple threads.
On single-core platforms, assumptions may be made by the developer to simplify writing and
debugging a multi-threaded application. These assumptions may not be valid on multi-core
platforms. There are two areas that highlight these differences:
1. Memory caching,
2. Thread priority.
In the case of memory caching, each processor core may have its own cache. At any point in time,
the cache on one processor core may be out of sync with the cache on the other processor core.
On multi-core platforms, the scheduler may schedule both threads on separate cores. Both threads
may run simultaneously and independently.
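Example (illustrative C sketch)
The caching and scheduling differences above are the reason latent data races surface on multi-core platforms. In the hedged POSIX-threads sketch below (iteration count chosen arbitrarily), two threads increment a shared counter without synchronization; on a single core the interleaving may hide the error, while on two cores running truly simultaneously the final value is usually wrong unless the commented-out mutex calls are enabled.

#include <pthread.h>
#include <stdio.h>

#define INCREMENTS 1000000

static long shared_counter = 0;
/* static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        /* pthread_mutex_lock(&lock);      -- needed for a correct result      */
        shared_counter++;                  /* unsynchronized read-modify-write */
        /* pthread_mutex_unlock(&lock); */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Expected 2 * INCREMENTS; on separate cores the two threads really do
       run at the same time and increments are frequently lost.               */
    printf("counter = %ld (expected %d)\n", shared_counter, 2 * INCREMENTS);
    return 0;
}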
283
Only when a program is mostly parallelized does adding more processors help more than
parallelizing the remaining code.
To make Amdahl’s Law reflect the reality of multi-core systems, rather than the theoretical
maximum, the system overhead from adding threads should be included:
Speedup(n) = 1 / (S + (1 − S)/n + H(n)),
where S is the time spent executing the serial portion of the parallelized version,
n is the number of processor cores and H(n) is the overhead.
The overhead consists of two portions:
The operating system overhead (5);
Inter-thread activities.
If the overhead is big enough, it offsets the benefits of the parallelized portion.
The important implication is that the overhead introduced by threading must be kept to a
minimum.
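A short C sketch of the modified speedup formula above; the serial fraction S and the overhead model H(n) are invented purely for illustration.

#include <stdio.h>

/* Assumed overhead model: a small fixed cost for every additional core. */
static double H(int n) { return 0.01 * (n - 1); }

/* Speedup with serial portion S, n cores and threading overhead H(n).  */
static double speedup(double S, int n)
{
    return 1.0 / (S + (1.0 - S) / n + H(n));
}

int main(void)
{
    double S = 0.2;                       /* 20 % of the work stays serial */
    for (int n = 1; n <= 8; n *= 2)
        printf("n = %d  speedup = %.2f\n", n, speedup(S, n));
    return 0;
}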
Example
IBM's dual-core, single-die POWER6 processor runs at speeds between 4 and 5 GHz with a total of
8 Mbytes of L2 cache and a 75 GB/s peak bandwidth to main memory. The processor is built in a
65-nm process using IBM's silicon-on-insulator (SOI) technology. IBM applies new technology in
variable gate lengths and variable threshold voltages to squeeze out maximum performance per Watt.
The chip can be fully operated at 0,8 V.
In 2006 Intel developed a prototype 80-core chip that can process some 1,8 trillion operations per
second. There is a cache memory under each core. The chip's power consumption is handled by
dividing each tile //protsessortuum// into separate regions that can be powered individually.
5
Overhead //ballast// – a. Extra code that has to be stored to organize the program.
b. The time a processor spends doing computations that do not contribute
directly to the progress of any user task in the system.
284
The 80-core chip achieves over a teraflop of performance while consuming about
62 Watts of power.
285
Computer Architecture Formulas
1. Amdahl’s Law
Speedup = (Execution time old) / (Execution time new) = 1 / ((1 − f) + f / s)
2. CPUtime = Instruction count × Clock cycles per instruction × Clock cycle time
CPUtime = IC × CPI × tCLK
CPI = 1 / IPC
4. Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty
7. Cache index size: 2^index = Cache size / [(Block size) × (Set associativity)]
8. Pipeline speedup = Pipeline depth / (1 + (Branch frequency) × (Branch penalty))
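Example (illustrative C sketch)
The formulas above can be checked with a short C program; all of the input values below are made-up examples, not measurements.

#include <stdio.h>

int main(void)
{
    /* 1. Amdahl's Law: f = enhanced fraction, s = speedup of that fraction.   */
    double f = 0.6, s = 4.0;
    printf("Amdahl speedup   = %.2f\n", 1.0 / ((1.0 - f) + f / s));

    /* 2. CPU time = IC x CPI x tCLK (here a 2 GHz clock is assumed).          */
    double IC = 1.0e9, CPI = 1.5, tCLK = 0.5e-9;
    printf("CPU time         = %.3f s\n", IC * CPI * tCLK);

    /* 4. AMAT = hit time + miss rate x miss penalty (in clock cycles).        */
    double hit = 1.0, miss_rate = 0.05, penalty = 100.0;
    printf("AMAT             = %.1f cycles\n", hit + miss_rate * penalty);

    /* 8. Pipeline speedup = depth / (1 + branch frequency x branch penalty).  */
    double depth = 5.0, bfreq = 0.2, bpen = 2.0;
    printf("Pipeline speedup = %.2f\n", depth / (1.0 + bfreq * bpen));
    return 0;
}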
286
GLOSSARY
Atomic instruction --- an instruction which cannot be interrupted (cannot be split up).
The instruction must be performed entirely or not at all (e.g.
"test-and-set" instructions). These instructions usually are
not atomic at the bus level, only at the CPU instruction level.
Associativity - the number of locations where a single memory address may be stored in a
cache memory.
Basic block --- a sequence of instructions without branches (except possibly at the end)
and without branch targets or branch labels (except possibly at the beginning).
Benchmark --- program or program fragment and related data inputs that are executed to test
the performance of a hardware component, a group of hardware or software
components or an entire microprocessor system (MPS).
Bus --- communication channel shared by multiple devices within a MPS or network.
Bus cycle --- the period of time required to perform one data transfer operation on a bus.
Cache --- area of high-speed memory that holds portions of data also held within another
storage device.
Card --- a printed-circuit panel (usually multilayer) that provides the interconnection and power
distribution to the electronics on the panel, and provides interconnect capability to the
next level package. It plugs into a mother board printed-circuit board.
Central Processing Unit (CPU) --- computer processor responsible for fetching, interpreting,
and executing program instructions.
Chipset - a pair of chips responsible for communication between the processor and the other
components on the motherboard (one chip handling the high-speed components and
the other the lower-speed components).
Clock cycle --- the interval between two successive timing pulses of the system clock.
287
Clock jitter – variation in clock frequency, causing some clock cycles to be longer or shorter
//taktivärin// than others.
Clock rate --- a. rate at which clock pulses are generated by a clock.
b. with respect to a CPU, the time required to fetch and execute the simplest instruction.
Control unit --- component of a CPU that is responsible for moving data,
access instructions and controlling the ALU.
In a processor the part that retrieves instructions in proper sequence,
interprets each instruction, and applies the proper signals to the
arithmetic and logic unit and other parts in accordance with this interpretation.
Die --- integrated circuit chip as cut (diced) from finished wafer.
Embedded computer -– a computer inside another device used for running one predetermined
application or collection of software.
Extensibility --- the ease with which a third party is able to add capabilities to software or
hardware.
//laiendatavus//
Granularity --- refers to the number of instructions that can be performed concurrently before
some form of synchronization needs to take place. It is associated with the
ratio between computation and communication.
Fetch cycle --- portion of CPU cycle in which an instruction is loaded into the instruction
register and decoded.
Instruction --- the specification of an operation and the identification of any associated operands.
Instruction set --- the complete set of instructions recognized by a given computer or provided
by a given programming language.
Instruction window – the set of instructions that is examined for simultaneous execution.
Interoperability --- the ability of two or more computer systems to exchange information
//koostalitlusvõime// and make use of it transparently.
288
Gate --- a processing device that implements a primitive Boolean operation or processing
function (AND, OR) by physically transforming one or more input signals.
Instruction level parallelism – the ability of individual machine language instructions from a
ILP single computer program to be executed in parallel.
The amount of ILP determines the maximum instructions per
cycle possible for a particular program.
I/O port --- communication pathway from the CPU to a peripheral device.
Level one (L1) cache --- within-CPU cache when all three cache types are used.
Level two (L2) cache --- primary storage cache implemented within the same chip as a processor.
Level three (L3) cache --- when three levels of primary storage cache are used, the cache
implemented outside the microprocessor.
Linear addressing space --- logical view of a secondary storage or I/O device as a sequentially
numbered set of storage locations.
Logical access --- access to a storage location within a linear address space.
Loop unrolling – a technique to get more performance from loops that access arrays, in which
multiple copies of the loop body are made and instructions from different iterations
are scheduled together.
Microarchitecture – the overall design details of a processor that are not visible to the software.
Multi-chip module (MCM) --- an integrated circuit comprising several chips all packaged
within the same package.
Peripheral device --- device on a system bus other than the CPU and primary storage.
289
Pipeline - the series of steps required to fetch, decode, execute, and store the results of a
processor instruction.
Platform --- a standard type of hardware that makes up a particular range of computers.
Platform independence --- the fact that software or a network can work with or connect to different
types of otherwise incompatible hardware.
Portability ---- the ease with which application software can be transferred from one environment
//teisaldatavus// to another while maintaining its capabilities.
//mobiilsus//
Processor core --- the part of microprocessor that reads instructions from memory
and executes them, including the instruction fetch unit, arithmetic
and logic unit and the register bank. It excludes optional coprocessors,
caches and memory management unit.
Register alias table – a table listing which physical registers are currently holding the values
RAT of the architectural registers.
Reorder buffer - the functional unit in an out-of-order processor responsible for committing
ROB results in the original program order
Scalability ------- the ease with which an existing system's performance can be increased
//mastaabitavus// or decreased as changes in the application demand.
//skaleeritavus//
Spatial locality - the tendency of memory accesses to a particular address to be more likely
if nearby addresses have recently been accessed.
Split cache – a scheme in which a level of the memory hierarchy is composed of two independent
caches that operate in parallel with each other with one handling instructions and
with one handling data.
290
Superscalar - processors capable of completing more than one instruction per cycle.
System architecture --- MPS components and the interaction among them.
System design --- process of determining the exact configuration of all hardware and software
components in an information system.
System on silicon (SoC) - all electronics for a complete, working product contained on a single chip.
While a computer on a chip includes all the hardware components
required to process, a SoC includes the computer and ancillary
electronics.
Thread --- subcomponent of a process that can be independently scheduled and executed.
Trace cache - a cache memory containing already partially decoded instructions, which may
be stored in program execution order rather than the order in which they were
originally stored in main memory.
Wafer --- a flat round piece of semiconductor used as the starting material for making integrated
circuits.
Vectorizable --- the property of a computer program, or program segment, that allows for the
simultaneous execution of operations on different data values (to accomplish the
work in parallel).
Virtual (adj) --- pertaining to something that appears to have some characteristics of something
else for which it serves as a surrogate.
Virtual machine --- a portion of a computer system or of a computer’s time that is controlled by
an operating system and functions as though it were a complete system,
although in reality the computer is shared with other independent operating
systems.
Virtual resource --- resource visible to a user or program, but not necessarily available at
all times or physically extant.
WISC (Writable Instruction Set Computer) – a CPU design that allows a programmer to add extra
machine code instructions using microcode, to
customize the instruction set.
291
Working set --- the set of memory locations referenced over a fixed period.
Workload ----------- a. a suite of programs that can be run and whose execution time can be
measured.
//kasulik koormus// b. the mixture of user programs and operating system commands.
//töömaht// c. the amount of work which a computer has to do.
Workspace --- a space in memory which is available for use or is being used currently by an operator.
z-buffer --- an area of memory used to store the z-axis information for a graphics object displayed
on screen.
292
Acronyms
293
DSP [D-S-P] – Digital Signal Processor
DSU [D-S-U] – Data Service Unit
DVD [D-V-D] – Digital Versatile (Video) Disk
DOS [doss] – Disk Operating System
DVI [D-V-I] – Digital Video Interface
ECC [E-C-C] – Error Correction Code
EDORAM [E-D-O-ram] – Extended Data-Out Random Access Memory
EDRAM [E-D-ram] – Enhanced Dynamic Random Access Memory
EEPROM [E-E-prom] – Electrically Erasable Programmable Read-Only Memory
EGA [E-G-A] – Enhanced Graphics Adapter
EIDE [E-I-D-E] – Enhanced Intelligent Drive Electronics
EISA [ee-suh] – Extended Industry Standard Architecture
EMM [E-M-M] – Extended Memory Manager
EPP [E-P-P] – Enhanced Parallel Port
EPROM [ee-prom] – Erasable Programmable Read-Only Memory
ESDI [E-S-D-I] – Enhanced System Device Interface
kb [kilobit] – kilobit
KB [K-B] – Kilobyte
Kbps [K-B-P-S] – Kilobits per second
294
LAN [lan] – Local Area Network
LCD [L-C-D] – Liquid Crystal Display
LED [L-E-D] – Light Emitting Diode
LIFO [life-oh] – Last In, First Out
LPT [L-P-T] – Local Printer Terminal
Mb [megabit] – Megabit
MB [M-B] – Megabyte
MCA [M-C-A] – Micro Channel Architecture
MCI [M-C-I] – Media Control Interface
MFLOPS [mega-flops] – Millions of FLoating-point OPerations per Second
MIDI [middy] – Musical Instrument Digital Interface
MIPS [mips] – Millions of Instructions per Second
MMU [M-M-U] – Memory Management Unit
MOS [moss] – Metal Oxide Semiconductor
MPEG [em-peg] – Motion Picture Experts Group
MTBF [M-T-B-F] – Mean Time Between Failures
295
SIMD [simmed] – Single Instruction Multiple Data
SIMM [simm] – Single In-line Memory Module
SLDRAM [S-L-D-ram] – Sync Link DRAM
SOC [S-O-C] – System On-a-Chip
SPARC [spark] – Scalable Processor Architecture
SRAM [es-ram] – Static Random Access Memory
SVGA [S-V-G-A] – Super Video Graphics Array
296