Parallel computing
• Multi-core processors/chips
• CMPs (chip multiprocessors)
• MPSoCs (multiprocessor systems-on-chip)
Why parallel architectures
• Concurrency in applications
• Limits to frequency:
– The memory wall
– The ILP wall
– The power wall
Example
• 2 × Core A: execution time = 5 s, P = 2 × 10 W = 20 W, E = 100 J
• 2 × Core A, each at half speed: execution time = 10 s, P = 2 × 5 W = 10 W, E = 100 J (no energy saving)
• 1 × Core B (slower core): execution time = 15 s, P = 5 W, E = 75 J
• 2 × Core B: execution time = 7.5 s, P = 2 × 5 W = 10 W, E = 75 J
Why (multiple) slower cores are power-efficient
• P_switch = activity × frequency × V²
• Rule of thumb: V² ∝ F², so P_switch grows roughly as F³; lowering the frequency therefore cuts power far more than proportionally, which is what makes several slower cores more power-efficient than one fast core.
• Portability
• Resource requirement
• Scalability
MPSoCs, a stack view
Stack diagram (top to bottom): Application, APIs, OS / communication layer, HAL
Flynn's taxonomy: SISD, SIMD, MISD, MIMD
SISD
A serial (non-parallel) computer
Deterministic execution
Examples: older-generation mainframes, workstations, PCs
SIMD
A type of parallel computer
All processing units execute the same instruction at any given clock cycle
Each processing unit can operate on a different data element
Two varieties: Processor Arrays and Vector Pipelines
Most modern computers, particularly those with graphics processing units
(GPUs), employ SIMD instructions and execution units.
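To make the "same instruction, many data elements" idea concrete, here is a small sketch using NumPy (assumed to be installed); its element-wise array operations are used only as a stand-in for vector/SIMD-style execution, since one operation is applied to every element at once.

```python
import numpy as np

# One "instruction" (an element-wise multiply-add) applied to many data
# elements at once, in the spirit of SIMD / vector execution.
a = np.arange(8, dtype=np.float32)      # [0, 1, 2, ..., 7]
b = np.full(8, 2.0, dtype=np.float32)   # [2, 2, ..., 2]

c = a * b + 1.0   # the same operation is performed on every element pair

print(c)          # [ 1.  3.  5.  7.  9. 11. 13. 15.]
```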
MISD
A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via independent
instruction streams.
Few actual examples: the Carnegie-Mellon C.mmp computer (1971).
MIMD
Currently the most common type of parallel computer.
Every processor may be executing a different instruction stream.
Every processor may be working with a different data stream.
Execution can be synchronous or asynchronous, deterministic or non-deterministic.
Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.
Note: many MIMD architectures also include SIMD execution sub-components.
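A minimal sketch of the MIMD idea using Python's multiprocessing module: two processes run different instruction streams (different functions) on different data, asynchronously. The function names and data are invented for illustration.

```python
import multiprocessing as mp

# Each process executes its own instruction stream on its own data (MIMD style).
def sum_squares(data, out):
    out.put(("sum_squares", sum(x * x for x in data)))

def count_evens(data, out):
    out.put(("count_evens", sum(1 for x in data if x % 2 == 0)))

if __name__ == "__main__":
    results = mp.Queue()
    p1 = mp.Process(target=sum_squares, args=([1, 2, 3, 4], results))
    p2 = mp.Process(target=count_evens, args=(range(100), results))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    for _ in range(2):
        print(results.get())   # arrival order is non-deterministic (asynchronous execution)
```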
Parallel Computer Architectures
Shared memory:
all processors can access the same memory
Advantages:
- user-friendly programming perspective to memory
- fast and uniform data sharing due to the proximity of memory to CPUs
Disadvantages:
- lack of scalability between memory and CPUs
- programmer responsible for ensuring "correct" access of global memory
- expense
Distributed memory
Parallel programming models:
- Shared memory
- Threads
- Message Passing
- Data Parallel
- Hybrid
Shared memory model
Tasks share a common address space, which they read and write asynchronously.
Various mechanisms such as locks / semaphores may be used to control access to the shared memory.
Advantage: no need to explicitly communicate data between tasks -> simplified programming
Disadvantages:
• Need to take care when managing memory and avoid synchronization conflicts
• Harder to control data locality
Threads
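The threads model named above is the usual way the shared-memory model is programmed. A minimal sketch with Python's threading module: all threads see the same counter (a common address space) and a lock controls access to it; without the lock, the read-modify-write would be exactly the kind of synchronization conflict warned about above. The counter and thread counts are invented for illustration.

```python
import threading

counter = 0                    # shared data: visible to every thread (common address space)
lock = threading.Lock()        # mechanism to control access to the shared data

def worker(n_increments):
    global counter
    for _ in range(n_increments):
        with lock:             # only one thread at a time updates the shared counter
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 -- deterministic because the lock serializes the updates
```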
Examine problem:
- Can the problem be parallelized?
- Are there data dependencies? (see the sketch below)
- Where is most of the work done?
- Identify bottlenecks (e.g. I/O)
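A small illustration of the data-dependency question, with two hypothetical loops: the first has no dependencies between iterations and could be split across tasks; the second has a loop-carried dependency (each element needs the previous result), so its iterations cannot simply be run in parallel.

```python
# Independent iterations: result[i] depends only on data[i] -> parallelizable.
data = [1, 2, 3, 4, 5, 6, 7, 8]
independent = [x * x for x in data]          # iterations could be distributed across tasks

# Loop-carried dependency: prefix sums, each value needs the previous one.
prefix = [0] * len(data)
prefix[0] = data[0]
for i in range(1, len(data)):
    prefix[i] = prefix[i - 1] + data[i]      # depends on iteration i-1 -> not trivially parallel

print(independent)   # [1, 4, 9, 16, 25, 36, 49, 64]
print(prefix)        # [1, 3, 6, 10, 15, 21, 28, 36]
```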
Partitioning
- How should the data be decomposed? Various partitionings are possible (see the sketch below).
- How should the algorithm be decomposed?
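As an illustration of data decomposition, here is a sketch of two common array partitionings, block and cyclic; the array contents and the number of tasks are made up.

```python
# Two ways to decompose a 1-D array across tasks (illustrative only).
data = list(range(12))
n_tasks = 3

# Block partitioning: each task gets one contiguous chunk
# (this simple version assumes len(data) is divisible by n_tasks).
chunk = len(data) // n_tasks
block = [data[i * chunk:(i + 1) * chunk] for i in range(n_tasks)]
print(block)    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

# Cyclic partitioning: elements are dealt out round-robin.
cyclic = [data[i::n_tasks] for i in range(n_tasks)]
print(cyclic)   # [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
```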
Communications
Types of communication:
- point-to-point: one task sends, one task receives (see the sketch below)
- collective: data sharing among a group of tasks (e.g. broadcast, reduction)
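A minimal sketch of point-to-point communication between two processes using Python's multiprocessing.Pipe; in a message-passing library this would correspond to a send/receive pair, while collective operations involve the whole group. The message contents are invented.

```python
import multiprocessing as mp

# Point-to-point: one task sends a message, one task receives it.
def producer(conn):
    conn.send({"task": "producer", "payload": [1, 2, 3]})
    conn.close()

if __name__ == "__main__":
    recv_end, send_end = mp.Pipe(duplex=False)   # one-way channel
    p = mp.Process(target=producer, args=(send_end,))
    p.start()
    print(recv_end.recv())   # the receiving task blocks until the message arrives
    p.join()
```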
Synchronization types
Barrier
• Each task performs its work until it reaches the barrier. It then
stops, or "blocks".
• When the last task reaches the barrier, all tasks are
synchronized.
Lock / semaphore
• The first task to acquire the lock "sets" it. This task can then
safely (serially) access the protected data or code.
• Other tasks can attempt to acquire the lock but must wait until
the task that owns the lock releases it.
• Can be blocking or non-blocking
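A minimal sketch of both synchronization types using Python's threading module: a Barrier that every thread blocks on, and a Lock that serializes access to a shared list. The phase names are invented for illustration.

```python
import threading

barrier = threading.Barrier(3)   # all 3 threads must arrive before any proceeds
lock = threading.Lock()
log = []                         # shared data protected by the lock

def worker(name):
    with lock:                   # the task holding the lock accesses the shared data serially
        log.append(f"{name}: phase 1 done")
    barrier.wait()               # block here until the last thread reaches the barrier
    with lock:
        log.append(f"{name}: phase 2 started")

threads = [threading.Thread(target=worker, args=(f"T{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every "phase 1 done" line appears before any "phase 2 started" line,
# because the barrier synchronizes all tasks between the two phases.
print("\n".join(log))
```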
Load balancing
Keep all tasks busy all of the time. Minimize idle time.
The slowest task will determine the overall performance.
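A sketch of the load-balancing point, assuming Python's multiprocessing.Pool: when work items have uneven costs, handing out large static chunks can leave one worker busy with the slow item while the other sits idle, whereas handing out items one at a time keeps all workers busy. The work function and item costs are made up.

```python
import multiprocessing as mp
import time

def work(seconds):          # a task whose cost varies from item to item
    time.sleep(seconds)
    return seconds

if __name__ == "__main__":
    items = [0.8] + [0.1] * 8          # one slow item, many fast ones (made-up costs)

    with mp.Pool(processes=2) as pool:
        # Static split: large chunks can bundle the slow item with several fast
        # ones on one worker, and that worker then limits the overall time.
        t0 = time.perf_counter()
        pool.map(work, items, chunksize=len(items) // 2)
        print(f"static chunks        : {time.perf_counter() - t0:.2f} s")

        # Dynamic assignment: items are handed out one at a time, so a worker
        # that finishes early immediately picks up more work.
        t0 = time.perf_counter()
        pool.map(work, items, chunksize=1)
        print(f"dynamic (chunksize=1): {time.perf_counter() - t0:.2f} s")
```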
Granularity
In parallel computing, granularity is a qualitative
measure of the ratio of computation to
communication.
Fine-grain Parallelism:
- Low computation to communication ratio
- Facilitates load balancing
- Implies high communication overhead and less
opportunity for performance enhancement
Coarse-grain Parallelism:
- High computation to communication ratio
- Implies more opportunity for performance
increase
- Harder to load balance efficiently
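A sketch of the computation-to-communication trade-off, assuming Python's multiprocessing.Pool: the same total work is distributed at two granularities. Very small chunks mean many hand-offs between the main process and the workers (more communication per unit of computation), while larger chunks amortize that overhead. The workload is made up.

```python
import multiprocessing as mp
import time

def compute(x):                 # a tiny unit of computation
    return x * x

if __name__ == "__main__":
    data = list(range(50_000))

    with mp.Pool(processes=2) as pool:
        for chunksize in (1, 5_000):       # fine-grained vs coarse-grained distribution
            t0 = time.perf_counter()
            pool.map(compute, data, chunksize=chunksize)
            elapsed = time.perf_counter() - t0
            print(f"chunksize={chunksize:>5}: {elapsed:.2f} s "
                  "(smaller chunks -> more communication per unit of computation)")
```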