Coa-Unit - 5 Notes
UNIT -V
PARALLELISM
Parallel processing challenges – Flynn's classification – SISD, MIMD, SIMD, SPMD, and Vector
Architectures - Hardware multithreading – Multi-core processors and other Shared Memory
Multiprocessors - Introduction to Graphics Processing Units, Clusters, Warehouse Scale Computers and
other Message-Passing Multiprocessors.
Strong scaling: Speedup achieved on a multiprocessor without increasing the size of the
problem.
Weak scaling: Speedup achieved on a multiprocessor while increasing the size of the
problem proportionally to the increase in the number of processors.
For example, running the same data set 10 times faster on 10 processors is perfect strong scaling, while growing the data set 10-fold and finishing in the original time on 10 processors is perfect weak scaling.
PROCESSOR ORGANIZATION [FLYNN'S CLASSIFICATION]
SISD
• Single Instruction stream, Single Data stream.
• Example of SISD is uniprocessor.
• It has a single control unit producing a single stream of instructions.
• It has one processing unit; the processing unit may contain more than one functional unit, all
under the supervision of the one control unit.
• It has one memory unit.
SIMD
• It has one instruction stream and multiple data streams.
• It has a single control unit producing a single stream of instructions that operates on
multiple streams of data.
• It has more than one processing unit, and each processing unit has its own associated
data memory unit.
• In this organization, multiple processing elements work under the control of a single
control unit.
• A single machine instruction controls the simultaneous execution of a number
of processing elements.
• Each instruction is executed on different sets of data by different processors.
• The same instruction is applied to many data streams, as in a vector processor.
• All the processing elements of this organization receive the same instruction
broadcast from the CU.
• Main memory can also be divided into modules for generating multiple data streams
acting as a distributed memory as shown in figure.
• Therefore, all the processing elements simultaneously execute the same instruction and are said
to be 'lock-stepped' together.
• Each processor takes the data from its own memory and hence it operates on distinct data
streams.
• Every processor must be allowed to complete its instruction before the next
instruction is taken for execution. Thus, the execution of instructions is synchronous.
• Example of SIMD is Vector Processor and Array Processor.
Advantage of SIMD:
• The original motivation behind SIMD was to amortize the cost of the control unit over
dozens of execution units.
• Another advantage is the reduced instruction bandwidth and space.
• SIMD needs only one copy of the code that is being simultaneously executed while
message-passing MIMDs may need a copy in every processor, and shared memory
MIMD will need multiple instruction caches.
• SIMD works best when dealing with arrays in for loops, because the parallelism is achieved
by performing the same operation on independent data (see the sketch after this list).
• SIMD is at its weakest in case or switch statements, where each execution unit must
perform a different operation on its data, depending on what data it has. Execution units
with the wrong data must be disabled so that units with proper data may continue.
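To make the contrast concrete, here is a minimal C sketch (the function names, arrays, and sizes are illustrative, not taken from these notes). The first loop applies the same operation to independent data, which is exactly what SIMD hardware or a vectorizing compiler exploits; the second loop branches per element, forcing SIMD execution units that hold the "wrong" case to be disabled.

    #include <stddef.h>

    /* Data-parallel loop: the same multiply-add is applied to every element,
     * so many iterations can be executed at once on SIMD hardware. */
    void scale_add(double *y, const double *x, double a, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];      /* identical operation, independent data */
    }

    /* Branch-heavy loop: each element may take a different path, so SIMD
     * units holding the other case must be masked off, wasting their slots. */
    void clamp_or_square(double *y, const double *x, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (x[i] > 1.0)
                y[i] = 1.0;              /* one case ...              */
            else
                y[i] = x[i] * x[i];      /* ... a different operation */
        }
    }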
MISD
• Multiple Instruction and Single Data stream (MISD)
• In this organization, multiple processing elements are organized under the control of
multiple control units.
• Each control unit handles one instruction stream, which is processed through its
corresponding processing element.
• But each processing element processes only a single data stream at a time.
• Therefore, for handling multiple instruction streams and single data stream, multiple
control units and multiple processing elements are organized in this classification.
• All processing elements are interacting with the common shared memory for the
organization of single data stream as shown in figure.
MIMD
• Multiple Instruction streams and Multiple Data streams (MIMD). In this organization,
multiple processing elements and multiple control units are organized.
• Compared to MISD the difference is that now in this organization multiple instruction
streams operate on multiple data streams.
• Therefore, for handling multiple instruction streams, multiple control units and multiple
processing elements are organized such that multiple processing elements are handling
multiple data streams from the main memory as shown in figure.
• The processors work on their own data with their own instructions. Tasks executed by
different processors can start or finish at different times.
• They are not lock-stepped, as in SIMD computers, but run asynchronously.
• This classification is what is usually meant by a parallel computer: in the real sense, the
MIMD organization is said to be a parallel computer.
SIMD-VECTOR ARCHITECTURE [SPMD]
• One interpretation of SIMD is called the vector architecture.
• It is also a great match to problems with lots of data-level parallelism, i.e.,
parallelism achieved by performing the same operation on independent data.
• Rather than having 64 ALUs perform 64 additions simultaneously, like the old array
processors, the vector architectures pipeline the ALU to get good performance at lower cost.
• The basic idea of vector architecture is to collect data elements from memory, put them
in order into a large set of registers, operate on them sequentially in registers using
pipelined execution units, and then write the results back to memory.
• A key feature of vector architectures is a set of vector registers. For example, a vector
architecture might have 32 vector registers, each holding 64 elements of 64 bits.
Comparing Vector to Conventional Code
• Suppose we extend the MIPS instruction set architecture with vector instructions and
vector registers. Vector operations use the same names as MIPS operations, but with
the letter V appended.
• For example, addv.d adds two double-precision vectors. The vector instructions take as
their input either a pair of vector registers (addv.d) or a vector register and a scalar
register (addvs.d).
• The value in the scalar register is used as the input for all operations.
• The operation addvs.d will add the contents of a scalar register to each element in a
vector register.
• The names lv and sv denote vector load and vector store, and they load or store an
entire vector of double-precision data.
• One operand is the vector register to be loaded or stored; the other operand, which is a
MIPS general-purpose register, is the starting address of the vector in memory.
• Consider the conventional MIPS code versus the vector MIPS code for
Y = a × X + Y
• where X and Y are vectors of 64 double-precision floating-point numbers, initially
resident in memory, and a is a scalar double-precision variable.
• This example is the so-called DAXPY loop, which forms the inner loop of the Linpack
benchmark. (DAXPY stands for double precision a × X plus Y.)
• Assume that the starting addresses of X and Y are in $s0 and $s1, respectively.
Here is the conventional MIPS code for DAXPY:
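(The original listing is not reproduced in these notes. The sketch below follows the usual textbook form of the scalar DAXPY loop; the temporary registers used for the loop bound and the 512-byte bound, which assumes 64 elements of 8 bytes each, are assumptions of this reconstruction.)

            l.d    $f0,a($sp)        ;load scalar a
            addiu  $t0,$s0,512       ;upper bound of what to load
    loop:   l.d    $f2,0($s0)        ;load x(i)
            mul.d  $f2,$f2,$f0       ;a * x(i)
            l.d    $f4,0($s1)        ;load y(i)
            add.d  $f4,$f4,$f2       ;a * x(i) + y(i)
            s.d    $f4,0($s1)        ;store into y(i)
            addiu  $s0,$s0,8         ;increment index to x
            addiu  $s1,$s1,8         ;increment index to y
            subu   $t1,$t0,$s0       ;compute bound
            bne    $t1,$zero,loop    ;check if done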
Here is the vector MIPS code for DAXPY:
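(Again, the original listing is not reproduced in these notes; the six-instruction sketch below follows the usual textbook form, assuming 64-element vector registers named $v1–$v4.)

            l.d     $f0,a($sp)       ;load scalar a
            lv      $v1,0($s0)       ;load vector x
            mulvs.d $v2,$v1,$f0      ;vector-scalar multiply
            lv      $v3,0($s1)       ;load vector y
            addv.d  $v4,$v2,$v3      ;add y to product
            sv      $v4,0($s1)       ;store the result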
• The most dramatic difference is that the vector processor greatly reduces the dynamic instruction
bandwidth, executing only 6 instructions versus almost 600 for the traditional MIPS
architecture.
• This reduction occurs both because the vector operations work on 64 elements at a time
and because the overhead instructions that constitute nearly half the loop on MIPS are
not present in the vector code.
Vector architecture versus conventional (scalar) architecture:
• Vector: The compiler or programmer indicates that the computation of each result in the vector is independent of the computation of the other results in the same vector. Scalar: The computation of each result is not indicated to be independent of the computation of other results.
• Vector: Hardware does not have to check for data hazards within a vector instruction. Scalar: Hardware has to check for data hazards between instructions.
• Vector: Vector architectures and compilers have a reputation of making it much easier than MIMD multiprocessors to write efficient applications that contain data-level parallelism. Scalar: It is not as easy to write such applications efficiently.
• Vector: Hardware need only check for data hazards between two vector instructions once per vector operand, not once for every element within the vectors. Scalar: Hardware must check for data hazards for every element within the array.
• Vector: Saves energy because of the reduced checking. Scalar: Does not save energy, because many more checks are needed.
• Vector: The cost of the latency to main memory is seen only once for the entire vector, rather than once for each word of the vector. Scalar: The latency to main memory is paid for each word loaded or stored.
• Vector: Efficient use of memory bandwidth and instruction bandwidth. Scalar: Less efficient use of memory and instruction bandwidth.
• Vector: The entire loop behavior is predetermined. Scalar: The loop behavior is not predetermined.
Vector Versus Multimedia Extensions
• Like multimedia extensions found in the x86 AVX instructions, a vector instruction
specifies multiple operations.
• However, multimedia extensions typically specify a few operations while vector
specifies dozens of operations.
• The number of elements in a vector operation is not in the opcode but in a separate
register.
• This distinction means different versions of the vector architecture can be implemented
with a different number of elements just by changing the contents of that register and
hence retain binary compatibility.
• In contrast, a new large set of opcodes is added each time the vector length changes in
the multimedia extension architecture of the x86: MMX, SSE, SSE2, AVX, AVX2 …
• Also unlike multimedia extensions, the data transfers need not be contiguous.
• Hardware finds the addresses of the items to be loaded in a vector register.
• Indexed accesses are also called gather-scatter, in that indexed loads gather elements
from main memory into contiguous vector elements, and indexed stores scatter vector
elements across main memory (a small C sketch of the idea appears at the end of this list).
• The following figure illustrates how to improve vector performance by using parallel
pipelines to execute a vector add instruction.
• The figure uses multiple functional units to improve the performance of a single vector
add instruction, C = A + B.
• The vector processor (a) on the left has a single add pipeline and can complete one
addition per cycle.
• The vector processor (b) on the right has four add pipelines or lanes and can complete
four additions per cycle.
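As a rough illustration of the gather-scatter idea above, the following C sketch shows what indexed loads and stores compute. The function and variable names are made up for this example and do not correspond to any particular vector instruction set; real vector hardware performs these element accesses in parallel rather than in a sequential loop.

    #include <stddef.h>

    /* Gather: indexed loads collect scattered elements of src into the
     * contiguous vector dst, using idx as the index vector. */
    void gather(double *dst, const double *src, const size_t *idx, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[idx[i]];     /* dst is contiguous, src accesses are scattered */
    }

    /* Scatter: indexed stores spread the contiguous vector src back out
     * across memory at the positions given by idx. */
    void scatter(double *dst, const double *src, const size_t *idx, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[idx[i]] = src[i];
    }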
Vector Lane
• A vector lane consists of one or more vector functional units and a portion of the vector register
file. Inspired by lanes on highways that increase traffic speed, multiple lanes execute vector
operations simultaneously.
• Figure shows the structure of a four-lane vector unit. Thus, going to four lanes from one
lane reduces the number of clocks per vector instruction by roughly a factor of four.
• The figure shows three vector functional units: an FP add, an FP multiply, and a load-
store unit.
• For multiple lanes to be advantageous, both the applications and the architecture must
support long vectors.
• The elements within a single vector add instruction are interleaved across the four
lanes.
• The vector-register storage is divided across the lanes, with each lane holding every
fourth element of each vector register.
• Each of the vector arithmetic units contains four execution pipelines, one per lane,
which act in concert to complete a single vector instruction.
HARDWARE MULTITHREADING
Multithreading
• A mechanism by which an instruction stream is divided into several smaller streams
(threads) that can be executed in parallel is called multithreading.
Hardware Multithreading
• Increasing utilization of a processor by switching to another thread when one thread is
stalled is known as hardware multithreading.
Thread
• A thread includes the program counter, the register state, and the stack. It is a
lightweight process; whereas threads commonly share a single address space, processes
don’t.
Thread Switch
• The act of switching processor control from one thread to another within the same
process. It is much less costly than a processor switch.
Process
• A process includes one or more threads, the address space, and the operating system
state. Hence, a process switch usually invokes the operating system, but not a thread
switch.
What are the approaches to hardware multithreading?
There are two main approaches to hardware multithreading.
1. Fine-grained Multithreading
2. Coarse-grained Multithreading
Fine-grained Multithreading
• A version of hardware multithreading that implies switching between threads after
every instruction resulting in interleaved execution of multiple threads. It switches from
one thread to another at each clock cycle.
• This interleaving is often done in a round-robin fashion, skipping any threads that are
stalled at that clock cycle (a simplified selection loop is sketched below).
• To make fine-grained multithreading practical, the processor must be able to switch threads on
every clock cycle.
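The following C sketch is a simplified software model of the round-robin selection just described; the structure and function names are hypothetical and are not taken from any real processor's issue logic. Each cycle the issue stage advances to the next thread, skipping any thread that is stalled.

    #include <stdbool.h>

    #define NUM_THREADS 4

    /* Per-thread state visible to the (hypothetical) issue logic. */
    struct hw_thread {
        bool stalled;      /* e.g. waiting on a cache miss  */
        unsigned pc;       /* program counter of the thread */
    };

    /* Round-robin selection: starting after the thread used last cycle,
     * return the first thread that is not stalled, or -1 if every thread
     * is stalled (an idle cycle). */
    int select_next_thread(struct hw_thread threads[NUM_THREADS], int last)
    {
        for (int step = 1; step <= NUM_THREADS; step++) {
            int candidate = (last + step) % NUM_THREADS;
            if (!threads[candidate].stalled)
                return candidate;
        }
        return -1;         /* no instruction issued this cycle */
    }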
Advantage
• Vertical waste is eliminated.
• Pipeline hazards cannot arise.
• Zero switching overhead
• Ability to hide latency within a thread i.e., it can hide the throughput losses that arise
from both short and long stalls.
• Instructions from other threads can be executed when one thread stalls.
• High execution efficiency
• Potentially less complex than alternative high performance processors.
Disadvantage
• Clock cycles are wasted if a thread has few operations to execute.
• Needs a large number of threads to be effective.
• It is more expensive than coarse-grained multithreading.
• It slows down the execution of the individual threads, since a thread that is ready to
execute without stalls will be delayed by instructions from other threads.
Coarse-grained Multithreading
• Coarse-grained multithreading was invented as an alternative to fine-grained
multithreading.
• A version of hardware multithreading that implies switching between threads only after
significant events, such as a last-level cache miss.
• This change relieves the need to have thread switching be extremely fast and is much
less likely to slow down the execution of an individual thread, since instructions from
other threads will only be issued when a thread encounters a costly stall.
Advantage
• Thread switching does not need to be extremely fast.
• It is much less likely to slow down the execution of an individual thread.
Disadvantage
• It is hard to overcome throughput losses from shorter stalls, due to pipeline start-up
costs.
• Since the CPU issues instructions from only one thread, when a stall occurs the pipeline must be
emptied.
• New thread must fill pipeline before instructions can complete.
• Due to this start-up overhead, coarse-grained multithreading is much more useful for
reducing the penalty of high-cost stalls, where pipeline refill is negligible compared to
the stall time.
Simultaneous multithreading (SMT)
• It is a variation on hardware multithreading that uses the resources of a multiple-issue,
dynamically scheduled pipelined processor to exploit thread-level parallelism at the
same time it exploits instruction level parallelism.
• The key insight that motivates SMT is that multiple-issue processors often have more
functional unit parallelism available than most single threads can effectively use.
• Since SMT relies on the existing dynamic mechanisms, it does not switch resources
every cycle.
• Instead, SMT is always executing instructions from multiple threads, leaving it to the
hardware to associate instruction slots and renamed registers with their proper threads.
• In the figure at the end of this section, the four threads at the top show how each would
execute running alone on a standard superscalar processor without multithreading support.
• The three examples at the bottom show how they would execute running together in
three multithreading options.
• The horizontal dimension represents the instruction issue capability in each clock cycle.
• The vertical dimension represents a sequence of clock cycles.
• An empty (white) box indicates that the corresponding issue slot is unused in that clock
cycle.
• The shades of gray and color correspond to four different threads in the multithreading
processors.
• The additional pipeline start-up effects for coarse multithreading, which are not
illustrated in this figure, would lead to further loss in throughput for coarse
multithreading.
Advantage
• It is ability to boost utilization by dynamically scheduling functional units among
multiple threads.
• It increases hardware design facility.
• It produces better performance and add resources to a fine grained manner.
Disadvantage
• It cannot improve performance if any of the shared resources are the limiting
bottleneck for performance.
Figure: How four threads use the issue slots of a superscalar processor in
different approaches.
MULTICORE AND OTHER SHARED MEMORY MULTIPROCESSORS
• Multiprocessor: A computer system with at least two processors
• Multicore: More than one processor available within a single chip.
Two common styles of implementing Shared Memory Multiprocessors (SMP) are:
Uniform memory access (UMA) multiprocessors
• In this model, main memory is uniformly shared by all processors in multiprocessor
systems and each processor has equal access time to shared memory.
• This model is used for time-sharing applications in a multi user environment
• Tightly-coupled systems (high degree of resource sharing) suitable for general
purpose and time-sharing applications by multiple users
• Physical memory uniformly shared by all processors, with equal access time to all
words.
• Processors may have local cache memories. Peripherals also shared in some fashion.
• UMA architecture models are of two types:
Symmetric:
• All processors have equal access to all peripheral devices. All
processors are identical.
Asymmetric:
• One processor (the master) executes the operating system; the other processors
may be of different types and may be dedicated to special tasks.
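As a small illustration of communication through shared variables in a shared-memory (UMA-style) machine, the following C sketch uses POSIX threads to stand in for processors updating one shared variable; the variable and function names are illustrative only, and the mutex orders the conflicting updates.

    #include <pthread.h>
    #include <stdio.h>

    static long shared_sum = 0;                     /* visible to all "processors" */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        shared_sum += 1;                            /* communicate via shared memory */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("shared_sum = %ld\n", shared_sum);   /* prints 4 */
        return 0;
    }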
Non Uniform Memory Access (NUMA) multiprocessors
• In shared memory multiprocessor systems, local memories can be connected with every
processor. The collections of all local memories form the global memory being shared.
• In this way, global memory is distributed to all the processors. In this case, the access
to a local memory is uniform for its corresponding processor as it is attached to the local
memory.
• But if one reference is to the local memory of some other remote processor, then the
access is not uniform.
• It depends on the location of the memory. Thus, all memory words are not accessed
uniformly. All local memories form a global address space accessible by all
processors
• Programming NUMAs is harder, but NUMAs can scale to larger sizes and have
lower latency to local memory.
• Memory is common to all the processors. Processors easily communicate by means of
shared variables.
• These systems differ in how the memory and peripheral resources are shared or
distributed
• The access time varies with the location of the memory word.
• The shared memory is distributed among the processors as local memories, but each
of these is still accessible by all processors (with varying access times).
• Memory access is fastest from the locally connected processor, with the
interconnection network adding delays for accesses by other processors.
• Additionally, there may be global memory in a multiprocessor system, with two
separate interconnection networks, one for clusters of processors and their cluster
memories, and another for the global shared memories.
• Local memories are private, each with its own program and data. There is no memory
contention, so the number of processors can be very large.
• The processors are connected by communication lines, and the precise way in which
the lines are connected is called the topology of the multicomputer.
Distributed Memory [Loosely Coupled Systems]
• These systems do not share the global memory because shared memory concept gives
rise to the problem of memory conflicts, which in turn slows down the execution of
instructions.
• Therefore, to alleviate this problem, each processor in loosely coupled systems has
a large local memory (LM), which is not shared by any other processor.
• Thus, such systems have multiple processors with their own local memory and a set
of I/O devices.
• Each such set of processor, memory, and I/O devices makes up a complete computer system.
• Therefore, these systems are also called multi-computer systems. These computer
systems are connected together via message passing interconnection network through
which processes communicate by passing messages to one another.
• Since every computer system or node in multicomputer systems has a separate memory,
they are called distributed multicomputer systems. These are also called loosely
coupled systems.
• Message passing: Communicating between multiple processors by explicitly sending
and receiving information.
• Clusters: Collections of computers connected via I/O over standard network switches
to form a message-passing multiprocessor.
• Send message routine: A routine used by a processor in machines with private
memories to pass a message to another processor.
• Receive message routine: A routine used by a processor in machines with private
memories to accept a message from another processor.
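A minimal sketch of message passing, using the standard MPI library as a stand-in for the send and receive message routines defined above; the ranks, tag, and value are illustrative, not taken from these notes. Each process has its own private memory, and data moves only through explicit messages.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                                   /* data in node 0's local memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* send message    */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                  /* receive message */
            printf("node 1 received %d from node 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }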
Introduction to Graphics Processing Units
Here are some of the key characteristics as to how GPUs vary from CPUs:
➢ GPUs are accelerators that supplement a CPU, so they do not need to be able to perform all the tasks of a
CPU. This role allows them to dedicate all their resources to graphics.
➢ The GPU problem sizes are typically hundreds of megabytes to gigabytes, but not hundreds of
gigabytes to terabytes.
➢ Perhaps the biggest difference is that GPUs do not rely on multilevel caches to overcome the long
latency to memory, as do CPUs. Instead, GPUs rely on hardware multithreading to hide the latency to
memory.
➢ The GPU memory is thus oriented toward bandwidth rather than latency. There are even special
graphics DRAM chips for GPUs that are wider and have higher bandwidth than DRAM chips for CPUs.
➢ Given the reliance on many threads to deliver good memory bandwidth, GPUs can accommodate many
parallel processors (MIMD) as well as many threads. Hence, each GPU processor is more highly
multithreaded than a typical CPU, plus they have more processors.
We use NVIDIA systems as our example as they are representative of GPU architectures. Specifically,
we follow the terminology of the CUDA parallel programming language and use the Fermi architecture
as the example. Like vector architectures, GPUs work well only with data-level parallel problems. Both
styles have gather-scatter data transfers, and GPU processors have even more registers than do vector
processors. Unlike most vector architectures, GPUs also rely on hardware multithreading within a
single multi-threaded SIMD processor to hide memory latency.
GPU hardware has two levels of hardware schedulers:
1. The Thread Block Scheduler that assigns blocks of threads to multithreaded SIMD processors, and
2. the SIMD Thread Scheduler within a SIMD processor, which schedules when SIMD threads should
run.
Clusters, Warehouse Scale Computers, and Other Message-Passing
Multiprocessors
Clusters: A set of computers connected over a local area network that function as a single large
multiprocessor. Collections of computers connected via I/O over standard network switches to form a
message-passing multiprocessor.
Energy efficiency: If servers are not energy-efficient, they will increase the cost of electricity, the cost of
the infrastructure to provide electricity, and the cost of the infrastructure to cool the servers.
Dependability via redundancy: The hardware and software in a WSC must collectively provide at least
99.99% availability, while individual servers are much less reliable.
Network I/O: Networking is needed to interface to the public as well as to keep data consistent
between multiple WSCs.
Interactive and batch-processing workloads: Search and social networks are interactive and require
fast response times. At the same time, batch workloads such as indexing and big data analytics can run in the background.
Ample parallelism: Server designers need not worry about whether applications have enough parallelism
to justify the amount of parallel hardware, because WSC workloads have ample parallelism.
Request-level parallelism (RLP) is a way of representing tasks as sets of requests that can run in
parallel. In interactive internet service applications, the workload consists of independent
requests from millions of users.
Operational costs count: Server architects normally design systems for peak performance within a
cost budget. In a WSC, however, operational costs such as energy, power distribution, and cooling make up a large share of the total cost over the system's lifetime.
Scale and its opportunities and problems: The WSCs are massive internally, so they get volume
discounts and economies of scale, even if there are not too many WSCs. On the other hand, customized
hardware for WSCs can be very expensive, particularly if only small numbers are manufactured. The
economies of scale lead to cloud computing, since the lower per-unit costs of WSCs lead to lower
rental rates.