Talking about…
• Multi-core processors/chips
• CMPs (chip multiprocessors)
• MPSoCs (multiprocessor systems-on-chip)
Why parallel architectures
• Concurrency in applications
• Limits to frequency:
– The memory wall
– The ILP wall
– The power wall
• Multicore can reduce power
Example
• 1× Core A: execution time = 10 s, P = 10 W, E = 100 J
• 2× Core A (full frequency): execution time = 5 s, P = 2 × 10 W = 20 W, E = 100 J
• 2× Core A (half frequency each): execution time = 10 s, P = 2 × 5 W = 10 W, E = 100 J (no saving)
Example
• 1× Core B: execution time = 15 s, P = 5 W, E = 75 J
• 2× Core B: execution time = 7.5 s, P = 2 × 5 W = 10 W, E = 75 J
• Compared with 1× Core A: same 10 W power budget, but less time (7.5 s vs 10 s) and less energy (75 J vs 100 J)
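The energy figures are just E = P · t applied to each configuration; a quick worked check of the slide values (reading the 2 × 5 W Core A case as two cores at half frequency is my interpretation):

```latex
\begin{aligned}
1\times\text{Core A}:&\quad E = 10\,\mathrm{W}\cdot 10\,\mathrm{s} = 100\,\mathrm{J}\\
2\times\text{Core A (full }F\text{)}:&\quad E = 20\,\mathrm{W}\cdot 5\,\mathrm{s} = 100\,\mathrm{J}\\
2\times\text{Core A (}F/2\text{)}:&\quad E = 10\,\mathrm{W}\cdot 10\,\mathrm{s} = 100\,\mathrm{J}\\
1\times\text{Core B}:&\quad E = 5\,\mathrm{W}\cdot 15\,\mathrm{s} = 75\,\mathrm{J}\\
2\times\text{Core B}:&\quad E = 10\,\mathrm{W}\cdot 7.5\,\mathrm{s} = 75\,\mathrm{J}
\end{aligned}
```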
Why (multiple) slower cores are power-efficient
• P_switch = Activity × Frequency × V² (rule of thumb: V² ∝ F², since supply voltage scales roughly with F; see the sketch below)
• Fewer pipeline stages are needed (fewer flip-flops)
• Synthesizing with tighter frequency constraints leads to more power-hungry netlists
• In a multi-core design, cores can be specialized
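A short sketch of the arithmetic behind the rule of thumb, assuming dynamic (switching) power only and that supply voltage scales roughly linearly with frequency, V ≈ kF:

```latex
P_{\mathrm{switch}} = A \cdot F \cdot V^{2}, \qquad V \approx kF
\;\Longrightarrow\; P_{\mathrm{switch}} \propto F^{3}.
\qquad
\text{Two cores at } F/2 \text{ (and } V/2\text{):}\quad
2 \cdot A \cdot \frac{F}{2} \cdot \left(\frac{V}{2}\right)^{2} = \frac{A F V^{2}}{4}
```

Under these assumptions, two half-speed cores deliver the same aggregate throughput on parallel work for roughly a quarter of the switching power, which is the effect the Core B example illustrates.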
Amdahl's law
• Amdahl's Law states that potential program speedup is defined by the fraction of code (P)
that can be parallelized:
speedup = 1 / (1 - P)
• Introducing the number of processors performing the parallel fraction of work, the
relationship can be modeled by:
speedup = 1 / (S + P/N)
where
P = parallel fraction,
N = number of processors and
S = serial fraction.
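As a small illustration (a stand-alone sketch, not part of the original slides), the program below evaluates Amdahl's formula for a program that is 95% parallel and shows how the serial fraction caps the achievable speedup:

```c
#include <stdio.h>

/* Amdahl's law: speedup = 1 / (S + P/N), with S = 1 - P */
static double amdahl_speedup(double parallel_fraction, int n_processors)
{
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / n_processors);
}

int main(void)
{
    const double p = 0.95;   /* fraction of the work that can be parallelized */

    for (int n = 2; n <= 1024; n *= 2)
        printf("N = %4d   speedup = %6.2f\n", n, amdahl_speedup(p, n));

    /* no matter how many processors, the speedup never exceeds 1/S */
    printf("limit (N -> infinity) = %.2f\n", 1.0 / (1.0 - p));
    return 0;
}
```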
Limitation and costs
• Complexity
• Portability
• Resource requirement
• Scalability
MPSoCs, a stack view
[Layered stack diagram: Application at the top, accessed through APIs; the OS and communication services below it (also exposed through APIs); then the HAL; the cores (Core A, Core B, Core C, Core D); and, at the bottom, the interconnect, memory subsystem and I/O.]
Serial computation
Traditionally, software has been written for serial computation:
- run on a single computer
- instructions are executed one after another
- only one instruction is executed at any moment in time
Parallel Computing
Simultaneous use of multiple compute resources to solve a single
problem
Concepts and Terminology
Flynn’s Taxonomy (1966)
SISD: Single Instruction, Single Data
SIMD: Single Instruction, Multiple Data
MISD: Multiple Instruction, Single Data
MIMD: Multiple Instruction, Multiple Data
SISD
Serial computer
Deterministic execution
Examples: older generation mainframes, workstations, PCs
SIMD
A type of parallel computer
All processing units execute the same instruction at any given clock cycle
Each processing unit can operate on a different data element
Two varieties: Processor Arrays and Vector Pipelines
Most modern computers, particularly those with graphics processor units
(GPUs) employ SIMD instructions and execution units.
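As a concrete illustration (my own minimal sketch, assuming an x86 machine with SSE), the loop below adds two float arrays four elements at a time: one instruction operates on multiple data elements, which is exactly the SIMD idea:

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* c[i] = a[i] + b[i], processing 4 floats per instruction (n must be a multiple of 4) */
static void add_simd(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* one SIMD add does 4 additions */
    }
}

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    add_simd(a, b, c, 8);
    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);                      /* prints: 9 9 9 9 9 9 9 9 */
    printf("\n");
    return 0;
}
```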
MISD
A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via independent
instruction streams.
Few actual examples: Carnegie-Mellon C.mmp computer (1971).
MIMD
Currently, most common type of parallel computer
Every processor may be executing a different instruction stream
Every processor may be working with a different data stream
Execution can be synchronous or asynchronous, deterministic or non-deterministic
Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.
Note: many MIMD architectures also include SIMD execution sub-components
Parallel Computer Architectures
Shared memory:
all processors can access the
same memory
Uniform memory access (UMA):
- identical processors
- equal access and access
times to memory
Non-uniform memory access (NUMA)
Not all processors have equal access to all memories
Memory access across link is slower
Advantages:
- user-friendly programming perspective to memory
- fast and uniform data sharing due to the proximity of memory to CPUs
Disadvantages:
- lack of scalability between memory and CPUs.
- Programmer responsible for ensuring "correct" access of global memory
- Expense
Distributed memory
Distributed memory systems require a communication network to
connect inter-processor memory.
Advantages:
- Memory is scalable with number of processors.
- No memory interference and no overhead from trying to maintain cache coherency.
- Cost effective
Disadvantages:
- programmer responsible for data communication between processors.
- difficult to map existing data structures to this memory organization.
Hybrid distributed-shared memory
Generally used for today's largest and fastest computers
Has a mixture of previously mentioned advantages and
disadvantages
Parallel programming models
Shared memory
Threads
Message Passing
Data Parallel
Hybrid
All of these can be implemented on any architecture.
Shared memory
Tasks share a common address space, which they read and write asynchronously.
Various mechanisms such as locks / semaphores may be used to
control access to the shared memory.
Advantage: no need to explicitly communicate data between
tasks -> simplified programming
Disadvantages:
• Need to take care when managing memory to avoid synchronization conflicts
• Harder to control data locality
Threads
A thread can be considered as a subroutine within the main program.
Threads communicate with each other through global memory.
This model is commonly associated with shared memory architectures and operating systems.
POSIX Threads (pthreads)
OpenMP
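A minimal pthreads sketch (my own example, assuming a POSIX system; compile with -pthread): each thread runs a function much like a subroutine of the main program and communicates its result through a global array:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static double results[NTHREADS];   /* global memory shared by all threads */

static void *worker(void *arg)
{
    long id = (long)arg;
    results[id] = id * id;         /* each thread writes only its own slot */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);        /* wait for all threads to finish */

    for (int i = 0; i < NTHREADS; i++)
        printf("results[%d] = %.0f\n", i, results[i]);
    return 0;
}
```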
Message Passing
A set of tasks that use their
own local memory during
computation.
Data exchange through
sending and receiving
messages.
Data transfer usually requires
cooperative operations to be
performed by each process.
For example, a send operation
must have a matching receive operation.
MPI (released in 1994)
MPI-2 (released in 1996)
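A minimal MPI sketch (my own example): rank 0 sends an integer to rank 1 with a matching send/receive pair, as described above. It would be launched with two processes, e.g. something like mpirun -np 2 (exact commands depend on the MPI installation):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* point-to-point send: must be matched by a receive on rank 1 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```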
Data Parallel
The data parallel model demonstrates the following characteristics:
• Most of the parallel work
performs operations on a data
set, organized into a common
structure, such as an array
• A set of tasks works collectively
on the same data structure,
with each task working on a
different partition
• Tasks perform the same operation
on their partition
On shared memory architectures, all tasks may have access to the
data structure through global memory. On distributed memory
architectures the data structure is split up and resides as "chunks"
in the local memory of each task.
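A small data-parallel sketch (my own example, using OpenMP on a shared memory machine; compile with -fopenmp): every thread applies the same operation to its own partition of a shared array, and a reduction combines the partial results:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N];   /* shared data structure, partitioned across threads */

int main(void)
{
    /* each thread gets a chunk of the index range and performs
       the same operation on its partition */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    /* a reduction combines per-thread partial sums into one result */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```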
Other Models
Hybrid
- combines various models, e.g. MPI/OpenMP
Single Program Multiple Data (SPMD)
- A single program is executed by all tasks simultaneously
Multiple Program Multiple Data (MPMD)
- An MPMD application has multiple executables. Each task can
execute the same or different program as other tasks.
Designing Parallel Programs
Examine problem:
- Can the problem be parallelized?
- Are there data dependencies?
- Where is most of the work done?
- Identify bottlenecks (e.g. I/O)
Partitioning
- How should the data be decomposed? (various partitionings are possible)
- How should the algorithm be decomposed?
Communications
Types of communication:
- point-to-point
- collective
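To contrast the two types (another sketch of mine): the earlier MPI example is point-to-point (one sender, one receiver), while a collective such as MPI_Reduce involves every rank in the communicator at once:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* collective communication: every rank contributes its value,
       and rank 0 receives the global sum */
    int local = rank, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, total);

    MPI_Finalize();
    return 0;
}
```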
Synchronization types
Barrier
• Each task performs its work until it reaches the barrier. It then
stops, or "blocks".
• When the last task reaches the barrier, all tasks are
synchronized.
Lock / semaphore
• The first task to acquire the lock "sets" it. This task can then
safely (serially) access the protected data or code.
• Other tasks can attempt to acquire the lock but must wait until
the task that owns the lock releases it.
• Can be blocking or non-blocking
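A sketch of both mechanisms with pthreads (my own example, assuming POSIX barrier support; compile with -pthread): a mutex serializes updates to a shared counter, and a barrier blocks every thread until the last one arrives:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_mutex_t   lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;
static int counter = 0;              /* shared data protected by the lock */

static void *worker(void *arg)
{
    long id = (long)arg;

    pthread_mutex_lock(&lock);       /* only one thread at a time in here */
    counter++;
    pthread_mutex_unlock(&lock);

    pthread_barrier_wait(&barrier);  /* block until all threads have arrived */

    if (id == 0)                     /* after the barrier the counter is final */
        printf("counter = %d\n", counter);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```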
Load balancing
Keep all tasks busy all of the time. Minimize idle time.
The slowest task will determine the overall performance.
Granularity
In parallel computing, granularity is a qualitative
measure of the ratio of computation to
communication.
Fine-grain Parallelism:
- Low computation to communication ratio
- Facilitates load balancing
- Implies high communication overhead and less
opportunity for performance enhancement
Coarse-grain Parallelism:
- High computation to communication ratio
- Implies more opportunity for performance
increase
- Harder to load balance efficiently