MODULE II
2.1 Advanced Processor Technology: Qn: Explain the design space of processors?
• Processor families can be mapped onto a coordinate space of clock rate versus cycles per instruction
(CPI), as illustrated in Fig. 4.1.
• As implementation technology evolves rapidly, the clock rates of various processors have moved from
low to high speeds, toward the right of the design space (i.e., increasing clock rate), and processor
manufacturers have been trying to lower the CPI (the number of cycles taken to execute an instruction)
using innovative hardware approaches.
• Under both CISC and RISC categories, products designed for multi-core chips, embedded applications,
or for low cost and/or low power consumption, tend to have lower clock speeds. High performance
processors must necessarily be designed to operate at high clock speeds. The category of vector
1
TIST CS 405 - CSA Module 2
processors has been marked VP; vector processing features may be associated with CISC or RISC main
processors.
Qn: Compare CISC, RISC, superscalar and VLIW processors on the basis of design space?
Design space of CISC, RISC, superscalar and VLIW processors
• The CPI of different CISC instructions varies from 1 to 20. Therefore, CISC processors are at the upper
part of the design space. With advanced implementation techniques, the clock rate of today’s CISC
processors ranges up to a few GHz.
• With efficient use of pipelines, the average CPI of RISC instructions has been reduced to between one and
two cycles.
• An important subclass of RISC processors are the superscalar processors, which allow multiple instructions
to be issued simultaneously during each cycle. Thus the effective CPI of a superscalar processor should
be lower than that of a scalar RISC processor. The clock rate of superscalar processors matches that of
scalar RISC processors.
• The very long instruction word (VLIW) architecture can in theory use even more functional units than a
superscalar processor. Thus the CPI of a VLIW processor can be further lowered. Intel's i860 RISC
processor had VLIW-like features (its dual-instruction mode).
The effective CPI of a processor used in a supercomputer should be very low, positioned at the lower
right corner of the design space. However, the cost and power consumption increase appreciably if
processor design is restricted to the lower right corner.
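The trade-off between clock rate and CPI can be made concrete with the basic performance equation, execution time = instruction count × CPI / clock rate. The sketch below compares two hypothetical design points with illustrative (not measured) numbers:

```python
# Sketch: comparing processor design points by execution time,
# using T = instruction_count * CPI / clock_rate.
# All numbers here are illustrative assumptions, not measured values.

def exec_time(instruction_count, cpi, clock_hz):
    """Total execution time in seconds."""
    return instruction_count * cpi / clock_hz

# A CISC-style design point: fewer instructions, higher average CPI.
t_cisc = exec_time(1_000_000, 8.0, 500e6)
# A RISC-style design point: more instructions, CPI near 1.
t_risc = exec_time(1_300_000, 1.3, 500e6)

print(t_cisc, t_risc)
```

Even with 30% more instructions, the low-CPI design point wins here, which is the pressure pushing designs toward the lower right of the design space.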
Instruction Pipelines
The execution of an instruction involves four phases:
o fetch
o decode
o execute
o write-back
• These four phases are frequently performed in a pipeline, or “assembly line” manner, as illustrated in the
figure below.
Qn: Define the following terms related to modern processor technology: a) instruction issue
latency, b) simple operation latency, c) instruction issue rate?
Pipeline Definitions
• Instruction pipeline cycle – the time required for each phase to complete its operation (assuming equal
delay in all phases)
• Instruction issue latency – the time (in cycles) required between the issuing of two adjacent instructions
• Instruction issue rate – the number of instructions issued per cycle (the degree of a superscalar)
• Simple operation latency – the delay (after the previous instruction) associated with the completion of a
simple operation (e.g. integer add) as compared with that of a complex operation (e.g. divide).
• Resource conflicts – when two or more instructions demand use of the same functional unit(s) at the same
time.
Pipelined Processors
• Case 1 : Execution in a base scalar processor, as shown in Fig. 4.2a and below. A base scalar processor
issues one instruction per cycle, with a one-cycle latency for a simple operation and a one-cycle latency
between instruction issues.
• CASE 2 : If the instruction issue latency is two cycles per instruction, the pipeline is underutilized, as
demonstrated in Fig. 4.2b and below:
• Pipeline underutilization – e.g., an issue latency of 2 between two instructions gives an effective CPI of 2.
• CASE 3 : Poor pipeline utilization – Fig. 4.2c and below: here the pipeline cycle time is doubled
by combining pipeline stages. The fetch and decode phases are combined into one pipeline stage,
and execute and write-back are combined into another stage. This also results in poor pipeline
utilization.
o Combining two pipeline stages into one keeps the CPI at 1, but each cycle is now twice as long, so
the throughput is halved.
• The effective CPI rating is 1 for the ideal pipeline in Fig. 4.2a, and 2 for the case in Fig. 4.2b. In Fig.
4.2c, the clock rate of the pipeline has been lowered by one-half.
• Underpipelined systems will have higher CPI ratings, lower clock rates, or both.
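The three cases above can be summarized numerically. Throughput is clock rate divided by CPI; the clock rate below is an illustrative assumption:

```python
# Sketch of the three cases of Fig. 4.2 (illustrative 100 MHz base clock).
# Throughput = clock_rate / CPI, in instructions per second.

def throughput(clock_hz, cpi):
    return clock_hz / cpi

base = throughput(100e6, 1)       # Fig. 4.2a: ideal base scalar, CPI = 1
underused = throughput(100e6, 2)  # Fig. 4.2b: issue latency 2 -> CPI = 2
combined = throughput(50e6, 1)    # Fig. 4.2c: stages combined, clock halved

# Both underpipelined variants deliver half the base throughput.
assert underused == combined == base / 2
```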
Qn:Draw and explain datapath architecture and control unit of a scalar processor?
Data path architecture and control unit of a scalar processor
• The data path architecture and control unit of a typical, simple scalar processor which does not employ an
instruction pipeline is shown above.
• Main memory, I/O controllers, etc. are connected to the external bus.
• The control unit generates the control signals required for the fetch, decode, ALU operation, memory
access, and write result phases of instruction execution.
• The control unit itself may employ hardwired logic, or—as was more common in older CISC style
processors—microcoded logic.
• Modern RISC processors employ hardwired logic, and even modern CISC processors make use of many of
the techniques originally developed for high-performance RISC processors.
Qn:Compare ISA in RISC and CISC processors in terms of instruction formats, addressing
modes and cycles per instruction?
Qn:List out the advantages and disadvantages of RISC and CISC architectures?
• The instruction set of a computer specifies the primitive commands or machine instructions that a
programmer can use in programming the machine.
• The complexity of an instruction set is attributed to the instruction formats, data formats, addressing modes,
general-purpose registers, opcode specifications, and flow control mechanisms used.
• ISAs are broadly classified into two categories:
CISC
RISC
• A computer with a large number of instructions is called a complex instruction set computer (CISC).
• A computer that uses a few instructions with simple constructs is called a reduced instruction set
computer (RISC). These instructions can be executed at a faster rate.
NOTE: The MC68040 and i586 are examples of CISC processors that use split caches and hardwired
control to reduce the CPI; i.e., some CISC processors also adopt these typically RISC techniques.
CISC Advantages
Smaller program size (fewer instructions)
Simpler control unit design
Simpler compiler design
RISC Advantages
Has potential to be faster
Many more registers
RISC Problems
More complicated register decoding system
Hardwired control is less flexible than microcode
• The VAX 8600 processor used a typical CISC architecture with microprogrammed control.
• The instruction set contained about 300 instructions with 20 different addressing modes.
• The CPU in the VAX 8600 consisted of two functional units for concurrent execution of integer and
floating-point instructions.
• The unified cache was used for holding both instructions and data.
• There were 16 GPRs in the instruction unit.
• Instruction pipelining was built with six stages in the VAX 8600.
• The instruction unit prefetched and decoded instructions, handled branching operations, and
supplied operands to the two functional units in a pipelined fashion.
• A translation lookaside buffer (TLB) was used in the memory control unit for fast generation of a
physical address from a virtual address.
• Both integer and floating-point units were pipelined.
• The CPI of a VAX 8600 instruction varied from 2 to 20 cycles, because instructions such as multiply and
divide occupy the execution units for a large number of cycles.
• Separate instruction and data memory unit, with a 4-Kbyte data cache, and a 4-Kbyte instruction
cache, with separate memory management units (MMUs) supported by an address translation cache
(ATC), equivalent to the TLB used in other systems.
• The 18 addressing modes include register direct and indirect, indexing, memory indirect, program
counter indirect, absolute, and immediate modes.
• The instruction set includes data movement, integer, BCD, and floating point arithmetic, logical,
shifting, bit-field manipulation, cache maintenance, and multiprocessor communications, in addition to
program and system control and memory management instructions
• All instructions are decoded by the integer unit. Floating-point instructions are forwarded to the floating-
point unit for execution.
• Separate instruction and data buses are used to and from the instruction and data memory units,
respectively. Dual MMUs allow interleaved fetch of instructions and data from the main memory.
• Three simultaneous memory requests can be generated by the dual MMUs, including data operand read
and write and instruction pipeline refill.
• Snooping logic is built into the memory units for monitoring bus events for cache invalidation.
• Complete memory management is provided, with support for a virtual, demand-paged operating
system.
• Each of the two ATCs has 64 entries providing fast translation from virtual address to physical address.
Qn:Explain the relationship between the integer unit and floating point unit in most RISC
processor with scalar organization?
• Generic RISC processors are called scalar RISC because they are designed to issue one instruction per
cycle, similar to the base scalar processor
• Simpler: the RISC design gains power by pushing some less frequently used operations into software.
• Requires a good optimizing compiler, more so than a CISC processor does.
• Instruction-level parallelism is exploited by pipelining.
Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX =
Execute, MEM = Memory access, WB = Register write back). The vertical axis is successive instructions;
the horizontal axis is time. So in the green column, the earliest instruction is in WB stage, and the latest
instruction is undergoing instruction fetch.
• It was a 64-bit RISC processor fabricated on a single chip containing more than 1 million transistors.
• The peak performance of the i860 was designed to reach 80 Mflops single-precision or 60 Mflops
double-precision, or 40 MIPS in 32-bit integer operations at a 40-MHz clock rate.
• In the block diagram there were nine functional units (shown in 9 boxes) interconnected by multiple
data paths with widths ranging from 32 to 128 bits.
• All external or internal address buses were 32-bit wide, and the external data path or internal data
bus was 64 bits wide. However, the internal RISC integer ALU was only 32 bits wide.
• The instruction cache had 4 Kbytes organized as a two-way set-associative memory with 32 bytes per
cache block. It transferred 64 bits per clock cycle, equivalent to 320 Mbytes/s at 40 MHz.
• The data cache was a two-way set-associative memory of 8 Kbytes. It transferred 128 bits per clock
cycle (640 Mbytes/s) at 40 MHz.
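The quoted cache bandwidths follow directly from bits per cycle and clock rate, as this small check shows:

```python
# Checking the quoted i860 cache bandwidth figures.

def bandwidth_mb_per_s(bits_per_cycle, clock_mhz):
    # bytes per cycle * million cycles per second = Mbytes/s
    return (bits_per_cycle / 8) * clock_mhz

print(bandwidth_mb_per_s(64, 40))   # instruction cache: 320 Mbytes/s
print(bandwidth_mb_per_s(128, 40))  # data cache: 640 Mbytes/s
```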
• The bus control unit coordinated the 64-bit data transfer between the chip and the outside world.
• The MMU implemented protected 4 Kbyte paged virtual memory of 2^32 bytes via a TLB .
• The RISC integer unit executed load, store, integer, bit, and control instructions, and fetched
instructions for the floating-point control unit as well.
• There were two floating-point units, namely the multiplier unit and the adder unit, which could be
used separately or simultaneously under the coordination of the floating-point control unit. Special
dual-operation floating-point instructions such as add-and-multiply and subtract-and-multiply used both
the multiplier and adder units in parallel.
• The graphics unit supported three-dimensional drawing in a graphics frame buffer, with color
intensity, shading, and hidden surface elimination.
• The merge register was used only by vector integer instructions. This register accumulated the results
of multiple addition operations .
In a superscalar processor, multiple instructions are issued per cycle and multiple results are generated per
cycle.
A vector processor executes vector instructions on arrays of data; each vector instruction involves a string of
repeated operations, which are ideal for pipelining with one result per cycle.
• The figure shows a three-issue (m = 3) superscalar pipeline, in which m instructions execute in parallel.
• This situation may not hold in every clock cycle; in some cycles, certain pipelines may stall in a
wait state.
• In a superscalar processor, the simple operation latency should require only one cycle, as in the base
scalar processor.
• Due to the desire for a higher degree of instruction-level parallelism in programs, the superscalar
processor depends more on an optimizing compiler to exploit parallelism.
Table below lists some landmark examples of superscalar processors from the early 1990s.
The instruction cache supplies multiple instructions per fetch. However, the actual number of
instructions issued to various functional units may vary in each cycle.
The number is constrained by data dependences and resource conflicts among instructions that are
simultaneously decoded.
Multiple functional units are built into the integer unit and into the floating-point unit. Multiple data
buses exist among the functional units. In theory, all functional units can be used simultaneously if no
conflicts or dependences exist among them during a given cycle.
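How data dependences cap the number of instructions actually issued per cycle can be sketched as below. The `(dest, src1, src2)` tuple format is a hypothetical simplification, not any real ISA's encoding:

```python
# Sketch: data dependences limiting issue in an m-issue superscalar cycle.
# Each instruction is a hypothetical (dest, src1, src2) register tuple.

def issue_group(window, m):
    """Issue up to m instructions in order, stopping at the first RAW dependence."""
    issued, dests = [], set()
    for dest, src1, src2 in window[:m]:
        if src1 in dests or src2 in dests:
            break  # reads a register written by an earlier instruction in this group
        issued.append((dest, src1, src2))
        dests.add(dest)
    return issued

# The second instruction reads r1, produced by the first -> only 1 issues.
window = [("r1", "r2", "r0"), ("r3", "r1", "r4"), ("r5", "r6", "r7")]
print(len(issue_group(window, 3)))  # 1
```

Real processors also check resource conflicts (functional-unit availability), which this sketch omits.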
The maximum number of instructions issued per cycle ranges from two to five in these superscalar
processors.
Typically, the register files in the IU and FPU each have 32 registers. Most superscalar processors
implement both the IU and the FPU on the same chip.
The superscalar degree is low due to limited instruction parallelism that can be exploited in ordinary
programs.
Qn: Why are reservation stations and reorder buffers needed in a superscalar processor?
Besides the register files, reservation stations and reorder buffers can be used to establish instruction
windows. Their purpose is to support instruction lookahead and internal data forwarding, which are
needed to schedule multiple instructions simultaneously; i.e., the reorder buffer and reservation stations
are needed to control out-of-order execution during the execution phase.
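The key idea of a reorder buffer is that instructions may finish out of order but must commit in program order. A minimal sketch, with a hypothetical entry format:

```python
# Minimal sketch of a reorder buffer (ROB): instructions may *finish*
# out of order, but results are *retired* strictly in program order.

def retire(rob):
    """rob: list of entries in program order, each a dict with a 'done' flag.
    Returns the ops that can commit this cycle."""
    committed = []
    for entry in rob:
        if not entry["done"]:
            break  # head not finished: everything behind it must wait
        committed.append(entry["op"])
    return committed

rob = [{"op": "add", "done": True},
       {"op": "div", "done": False},  # long-latency op still executing
       {"op": "sub", "done": True}]   # finished early, cannot commit yet
print(retire(rob))  # ['add']
```

The `sub` result waits in the buffer even though it finished first, which is exactly the "unordered execution, ordered completion" behavior the text describes.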
• As shown above, multiple functional units are used concurrently in a VLIW processor.
• All functional units share a common large register file.
• The operations to be executed simultaneously by the functional units are synchronized in a VLIW
instruction, of say 256 or 1024 bits per instruction word.
• Different fields of the long instruction word carry the opcodes to be dispatched to different functional
units.
• Programs written in short instruction words (32 bits) must be compacted together to form the VLIW
instructions – the code compaction must be done by compiler.
• The execution of instructions by an ideal VLIW processor is shown below: each instruction specifies
multiple operations. The effective CPI becomes 0.33 in this example.
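The 0.33 figure follows from one long instruction word issuing per cycle, each carrying several operations:

```python
# Sketch: effective CPI of an ideal VLIW processor that executes one
# long instruction word per cycle, each word carrying several operations.

def effective_cpi(ops_per_word):
    return 1 / ops_per_word

print(round(effective_cpi(3), 2))  # 0.33, as in the example above
```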
• VLIW machines behave much like superscalar machines, with three differences:
1. The decoding of VLIW instructions is easier than that of superscalar instructions.
2. The code density of the superscalar machine is better when the available instruction-level
parallelism is less than that exploitable by the VLIW machine. This is because the fixed VLIW
format includes bits for non-executable operations, while the superscalar processor issues only
executable instructions.
3. A superscalar machine can be object-code-compatible with a large family of non-parallel machines,
whereas a VLIW machine exploiting different amounts of parallelism would require different
instruction sets.
• lnstruction parallelism and data movement in a VLIW architecture are completely specified at compile
time. Run-time resource scheduling and synchronization are in theory completely eliminated.
• One can view a VLIW processor as an extreme example of a superscalar processor in which all
independent or unrelated operations are already synchronously compacted together in advance.
• The CPI of a VLIW processor can be even lower than that of a superscalar processor. For example, the
Multiflow TRACE computer allows up to seven operations to be executed concurrently, with 256 bits per
VLIW instruction.
• VLIW reduces the effort required to detect parallelism using hardware or software techniques.
• The main advantage of VLIW architecture is its simplicity in hardware structure and instruction set.
• Unfortunately, VLIW does require careful analysis of code in order to “compact” the most
appropriate ”short” instructions into a VLIW word.
Superscalar vs. VLIW
1. A superscalar machine's code size is smaller; a VLIW machine's code size is larger.
2. A superscalar machine needs complex hardware for decoding and issuing instructions; a VLIW
machine needs only simple hardware for decoding and issuing.
3. A superscalar machine is compatible across generations; a VLIW machine is not compatible across
generations.
4. A superscalar machine requires no change in hardware; a VLIW machine requires more registers but
simplified hardware.
5. Superscalar instructions are scheduled dynamically by the processor; VLIW instructions are scheduled
statically by the compiler.
Application – VLIW processors are useful for special-purpose applications such as digital signal
processing (DSP) and scientific applications that require high performance at low cost, but they have been
less successful as general-purpose computers. Due to its lack of compatibility with conventional hardware
and software, the VLIW architecture has not entered the mainstream of computers.
Vector Instructions
• We denote a vector register of length n as Vi, a scalar register as si, and a memory array of length n as
M(1 : n). A binary vector operator is denoted by a small circle ‘o’.
• The vector lengths of the two operands of a binary vector instruction must be equal.
• The reduction is an operation on one or two vector operands, and the result is a scalar—such as the
dot product between two vectors and the maximum of all components in a vector.
• In all cases, these vector operations are performed by dedicated pipeline units, including functional
pipelines and memory-access pipelines.
• Long vectors exceeding the register length n must be segmented to fit the vector registers n elements at
a time.
Here M1(1 : n) and M2(1 : n) are two vectors of length n, and M(k) denotes a scalar quantity stored
in memory location k. Note that the vector length is not restricted by the register length: long vectors are
handled in a streaming fashion, using superwords cascaded from many shorter memory words.
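The segmentation of long vectors into register-length pieces (often called strip-mining) can be sketched as follows; the register length of 64 is an illustrative assumption:

```python
# Sketch of strip-mining: a long vector is processed in segments of the
# vector register length n, as described above.

def vector_add(a, b, n=64):
    """Add two long vectors one register-length segment at a time."""
    assert len(a) == len(b)  # binary vector ops need equal vector lengths
    out = []
    for start in range(0, len(a), n):            # one segment per iteration
        seg_a = a[start:start + n]
        seg_b = b[start:start + n]
        out.extend(x + y for x, y in zip(seg_a, seg_b))
    return out

# A 100-element vector is processed as one 64-element and one 36-element segment.
print(vector_add(list(range(100)), list(range(100)))[:3])  # [0, 2, 4]
```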
Vector Pipelines
SYMBOLIC PROCESSORS
• In effect, the hardware provides a facility for the manipulation of the relevant data objects with
“tailored” instructions.
• These processors (and programs of these types) may invalidate assumptions made about more
traditional scientific and business computations.
• Applied in areas like theorem proving, pattern recognition, expert systems, and machine intelligence,
because in these applications the data and knowledge representations, operations, memory, I/O, and
communication features differ from those of numerical computing.
• Also called Prolog processors, Lisp processors, or symbolic manipulators.
For example, a Lisp program can be viewed as a set of functions in which data are passed from function to
function. The concurrent execution of these functions forms the basis for parallelism. The applicative and
recursive nature of Lisp requires an environment that efficiently supports stack computations and function
calling. The use of linked lists as the basic data structure makes it possible to implement an automatic garbage
collection mechanism.
Instead of dealing with numerical data, symbolic processing deals with logic programs, symbolic lists,
objects, scripts, blackboards, production systems, semantic networks, frames, and artificial neural
networks. Primitive operations for artificial intelligence include search, compare, logic inference, pattern
matching, unification, filtering, context, retrieval, set operations, transitive closure, and reasoning operations.
These operations demand a special instruction set containing compare, matching, logic, and symbolic
manipulation operations. Floating point operations are not often used in these machines.
The Symbolics 3600 executed most Lisp instructions in one machine cycle. Integer instructions fetched
operands from the stack buffer and the duplicated top of the stack in the scratch-pad memory. Floating-point
addition, garbage collection, data type checking by the tag processor, and fixed-point addition could be
carried out in parallel.
The memory technology and storage organization at each level is characterized by five parameters:
1. Access time (ti) – the round-trip time from the CPU to the ith-level memory.
2. Memory size (si) – the number of bytes or words in level i.
3. Cost per byte (ci) – the cost of the ith-level memory, estimated by the product ci·si.
4. Transfer bandwidth (bi) – the rate at which information is transferred between adjacent levels.
5. Unit of transfer (xi) – the grain size of data transfer between levels i and i+1.
As we move from the outermost level toward the CPU, each level is:
Faster to access
Smaller in size
More expensive per byte
Higher in bandwidth
Uses a smaller unit of transfer
Registers
• Registers are part of processor – Thus, at times not considered a part of memory
• Register assignment made by compiler
• Register operations directly controlled by compiler – thus register transfers take place at processor speed
Cache
• Controlled by MMU
• Can be implemented at one or multiple levels
• Information transfer between the CPU and cache is in terms of words (4 or 8 bytes, depending on the
word length of the machine).
• The cache is divided into cache blocks (typically 32 bytes).
• Blocks are the unit of data transfer between the cache and main memory, or between the L1 and L2 caches.
A typical workstation computer has the cache and main memory on a processor board and hard disks in an
attached disk drive. Table below presents representative values of memory parameters for a typical 32-bit
mainframe computer built in 1993.
The information stored in a memory hierarchy satisfies three important properties:
• inclusion,
• coherence, and
• locality.
Inclusion Property
• Information transfer between the CPU and cache is in terms of words (4 or 8 bytes each, depending on
the word length of the machine). The cache (M1) is divided into cache blocks, also called cache lines by
some authors.
• Information transfer between the cache and main memory, or between the L1 and L2 caches, is in terms
of blocks (such as “a” and “b”). Each block is typically 32 bytes (8 words). The main memory (M2) is
divided into pages, say, 4 Kbytes each.
• Information transfer between the disk and main memory is in terms of pages. Scattered pages are
organized as a segment in the disk memory; for example, segment F contains page A, page B, and other
pages.
• Data transfer between the disk and backup storage is at the file level, such as segments F and G.
Locality of References
1. Temporal locality – recently referenced items are likely to be referenced again in the near future
(e.g., the body of a loop is executed repeatedly until the loop terminates).
• leads to the use of least-recently-used (LRU) replacement algorithms
• helps to determine the size of memory at successive levels
2. Spatial locality – the tendency for a process to access items whose addresses are near one another
(e.g., array elements, macros, routines).
• assists in determining the size of the unit of data transfer between adjacent memory levels
3. Sequential locality – the execution of instructions follows a sequential order unless a branch
instruction occurs.
• affects the determination of grain size for optimal scheduling
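Temporal locality is what makes LRU replacement effective: recently used entries are kept, and the least recently used entry is evicted. A tiny illustrative sketch (not any real cache's organization):

```python
# Sketch: temporal locality motivates LRU replacement. An OrderedDict
# keeps entries in recency order; the front is the least recently used.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def access(self, key):
        """Returns True on a hit. On a miss, inserts key, evicting the LRU entry."""
        if key in self.data:
            self.data.move_to_end(key)     # mark as most recently used
            return True
        if len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry
        self.data[key] = True
        return False

c = LRUCache(2)
hits = [c.access(k) for k in ["a", "b", "a", "c", "b"]]
print(hits)  # [False, False, True, False, False]
```

The re-reference of "a" hits because it was used recently; "b" misses at the end because accessing "c" evicted it.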
The principle of locality guides the design of cache memory, main memory, and even virtual memory organization.
Working Sets
The figure below shows the memory reference patterns of three running programs or three software
processes. As a function of time, the virtual address space (identified by page numbers) is clustered
into regions due to the locality of references.
The subset of addresses (or pages) referenced within a given time window is called the
working set (Denning, 1968).
During the execution of a program, the working set changes slowly and maintains a certain degree of
continuity, as demonstrated in the figure below. This implies that the working set is often accumulated at the
innermost (lowest) level, such as the cache, in the memory hierarchy. This reduces the effective
memory-access time through a higher hit ratio at the lowest memory level. The time window is a critical
parameter, set by the OS kernel, which affects the size of the working set and thus the desired cache size.
Terms
Hit ratio – when an information item is found in Mi, we call it a hit; otherwise, a miss.
Considering adjacent memory levels Mi-1 and Mi in a hierarchy, i = 1, 2, …, n, the hit ratio hi at Mi is the
probability that a required item will be found in Mi. The miss ratio at Mi is 1 - hi.
The hit ratios at successive levels are a function of memory capacities, management policies and program
behavior.
Every time a miss occurs, a penalty must be paid to access the next higher level of memory. A cache miss is
2 to 4 times costlier than a cache hit; a page fault is 1000 to 10,000 times costlier than a page hit.
We assume h0 =0 and hn = 1, which means CPU always accesses M1 first, and the access to the outermost
memory Mn is always a hit.
The access frequency to Mi is defined as fi = (1-h1)(1-h2)…(1-hi-1)hi. Due to the locality property, the
access frequencies decrease very rapidly from low to high levels; that is, f1 >> f2 >> f3 >> … >> fn. This
implies that the inner levels of memory are accessed more often than the outer levels.
In practice, we wish to achieve as high a hit ratio as possible at M1. Every time a miss occurs, a penalty must be
paid to access the next higher level of memory. The misses have been called block misses in the cache and
page faults in the main memory because blocks and pages are the units of transfer between these levels.
Using the access frequencies fi, i = 1, 2, …, n, we can formally define the effective access time of a memory
hierarchy as follows:

Teff = Σ (i = 1 to n) fi·ti = h1t1 + (1-h1)h2t2 + (1-h1)(1-h2)h3t3 + … + (1-h1)(1-h2)…(1-h(n-1))tn
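The effective access time formula can be checked numerically. The hit ratios and access times below are illustrative values for a hypothetical three-level hierarchy, with hn = 1 at the outermost level:

```python
# Numerical sketch of Teff = sum_i fi*ti, where
# fi = (1-h1)(1-h2)...(1-h_{i-1}) * hi, using illustrative parameters.

def effective_access_time(hit_ratios, times):
    t_eff, miss_prob = 0.0, 1.0
    for h, t in zip(hit_ratios, times):
        t_eff += miss_prob * h * t   # this term is fi * ti
        miss_prob *= (1 - h)         # probability of missing all levels so far
    return t_eff

# Cache hit 95% at 1 cycle, main memory hit 99% at 20 cycles,
# disk always hits (hn = 1) at 10,000 cycles.
print(effective_access_time([0.95, 0.99, 1.0], [1, 20, 10_000]))  # ≈ 6.94 cycles
```

Note how the rare disk accesses (probability 0.0005) still contribute 5 of the roughly 6.94 cycles, which is why high hit ratios at the inner levels matter so much.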
Hierarchy Optimization
The optimal design of a memory hierarchy should achieve a total effective access time close to that of the
cache (the fastest level) at a total cost close to that of the cheapest (outermost) memory level; in practice,
both goals cannot be fully achieved at once.