Parallel Computing Systems - Unit I & II Notes
UNIT I: Introduction to Parallel Computing Systems
Need of High-Speed Computing
The demand for high-speed computing arises due to the growth of data-intensive and
compute-intensive applications such as weather forecasting, molecular modeling, AI, and
large-scale simulations. Sequential (single-processor) systems cannot cope with these demands,
which has driven the evolution of parallel computing systems.
History of Parallel Computers
Early computers in the 1940s were sequential. In the 1970s, vector processors such as the CDC
STAR-100 and the Cray-1 introduced limited parallelism. The 1980s–1990s saw multiprocessor systems (SMP, MPP).
Recent systems include clusters, cloud-based HPC, and GPUs for massive parallelism.
Temporal vs Data Parallelism
- Temporal Parallelism: overlaps execution in time by pipelining operations. Example: the instruction pipeline in a CPU.
- Data Parallelism: distributes a large data set across processors that perform the same operation simultaneously.
Comparison: temporal parallelism increases throughput without shortening the time of any single task, while data parallelism reduces execution time by dividing the workload.
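A minimal sketch of data parallelism in C, assuming an OpenMP-capable compiler (e.g., gcc with the -fopenmp flag); the array size and the scaling operation are purely illustrative:

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];

    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* Data parallelism: every thread applies the same operation
       to its own contiguous slice of the array. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = a[i] * 2.0;

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}

Temporal parallelism, by contrast, is supplied by the hardware pipeline inside each core and needs no such annotation in the program.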
Specialized Processors & Inter-task Dependency
Specialized processors such as GPUs, DSPs, and TPUs accelerate domain-specific tasks using
data parallelism. Inter-task dependency refers to the scenario where the output of one task is
required as input for another, which limits the degree of parallelism. Managing dependencies is
essential for efficient parallel execution.
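A hedged sketch of an inter-task dependency, expressed here with OpenMP task dependencies (one possible mechanism, assumed for illustration; requires OpenMP 4.0 or later). Task B cannot start until task A has produced x, so the two tasks run one after the other even on a parallel machine:

#include <stdio.h>

int main(void) {
    int x = 0, y = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)                 /* task A: produces x  */
        x = 42;

        #pragma omp task depend(in: x) depend(out: y)   /* task B: consumes x  */
        y = x + 1;

        #pragma omp taskwait
    }

    printf("y = %d\n", y);   /* always 43: the dependency forces A before B */
    return 0;
}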
Trends in Microprocessor Architectures
Modern processors have evolved from single-core to multi-core and many-core architectures.
Techniques such as superscalar execution, speculative execution, and simultaneous multithreading
have improved performance. However, the power wall and the memory wall limit further gains from
raising clock speed, so performance growth now comes from parallelism rather than higher clock rates.
Limitations of Memory Systems
The speed gap between CPU and memory (memory wall) causes bottlenecks in parallel systems.
Caching, prefetching, and multi-level memory hierarchies partially solve this. Memory bandwidth
and latency become critical in high-performance parallel systems.
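A small C sketch (an illustration, not part of the syllabus text) of the memory wall in practice: both loops perform the same arithmetic, but the second walks the matrix with a large stride, defeats the cache, and typically runs several times slower:

#include <stdio.h>
#include <time.h>

#define N 4096

static double m[N][N];

int main(void) {
    double sum = 0.0;
    clock_t t0;

    t0 = clock();
    for (int i = 0; i < N; i++)        /* unit-stride access: cache friendly */
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    printf("row-major sweep:    %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (int j = 0; j < N; j++)        /* stride of N doubles: cache hostile */
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    printf("column-major sweep: %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    printf("checksum = %f\n", sum);    /* keeps the loops from being optimized away */
    return 0;
}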
Parallel Computing Platforms
Parallel platforms include shared-memory multiprocessors, distributed-memory multicomputers
(clusters), hybrid systems, and cloud-based HPC environments. GPUs and accelerators form
another class of specialized platforms.
Communication Costs & Routing Mechanisms
In parallel systems, communication cost includes startup latency and per-word transfer time.
Efficient interconnection networks (mesh, hypercube, fat-tree, torus) and routing algorithms
(store-and-forward, cut-through, wormhole routing) are essential to reduce communication delays.
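As a sketch, the commonly used linear cost model can be written directly in C. Here ts is the startup latency, tw the per-word transfer time, th the per-hop delay, m the message length in words, and l the number of hops; the numeric values below are illustrative assumptions, not measurements:

#include <stdio.h>

/* Store-and-forward: the whole message is received and retransmitted
   at every intermediate node.                                         */
double store_and_forward(double ts, double tw, double th, int m, int l) {
    return ts + l * (m * tw + th);
}

/* Cut-through / wormhole: only the header pays the per-hop cost;
   the message body is pipelined behind it.                            */
double cut_through(double ts, double tw, double th, int m, int l) {
    return ts + l * th + m * tw;
}

int main(void) {
    double ts = 50.0, tw = 0.5, th = 1.0;   /* microseconds, illustrative */
    int m = 1024, l = 8;

    printf("store-and-forward: %.1f us\n", store_and_forward(ts, tw, th, m, l));
    printf("cut-through:       %.1f us\n", cut_through(ts, tw, th, m, l));
    return 0;
}

The comparison makes the design point clear: under cut-through routing only the header pays the per-hop cost, so long messages crossing many hops are far cheaper than under store-and-forward.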
UNIT II: Parallel Computation and Communication Methods
Principles of Parallel Algorithm Design
Designing efficient parallel algorithms involves breaking problems into smaller sub-problems and
assigning them to processors. Key steps:
1. Decomposition Techniques – dividing the problem into tasks (domain, functional, or recursive decomposition).
2. Task Characteristics – granularity, dependencies, and communication needs.
3. Mapping Techniques – assigning tasks to processors to balance the load.
4. Interaction Overheads – minimizing synchronization, communication, and contention.
5. Parallel Algorithm Models – PRAM, message-passing, and data-parallel models.
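The steps above can be seen together in a minimal MPI sketch, assuming an MPI installation and that N is divisible by the number of processes: the domain is decomposed into contiguous chunks, chunks are mapped to ranks by MPI_Scatter, and the only interaction overhead is a single reduction.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1048576   /* assumed divisible by the number of processes */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;
    double *data = NULL;
    double *local = malloc(chunk * sizeof(double));

    if (rank == 0) {                      /* root owns the full domain */
        data = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++)
            data[i] = 1.0;
    }

    /* Decomposition + mapping: contiguous chunk i goes to rank i. */
    MPI_Scatter(data, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < chunk; i++)       /* local computation, no communication */
        local_sum += local[i];

    /* Interaction overhead: one all-to-one reduction. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    free(local);
    free(data);
    MPI_Finalize();
    return 0;
}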
Basic Communication Operations
Parallel programs often require collective communication operations:
- One-to-all broadcast: one process sends data to all others.
- All-to-one reduction: combines data from many processes into one (sum, max, etc.).
- All-reduce: combines values from all processes and shares the result with all.
- Prefix-sum (scan): computes partial sums across processors.
- Scatter and Gather: distributes distinct data chunks or collects data from multiple processors.
- All-to-all personalized communication: each processor sends unique data to every other processor.
- Circular shift: data elements are rotated among processors.
Optimizing these operations is key to scalable performance.
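A compact sketch of several of these operations using MPI collectives (assuming an MPI environment, launched with a command such as mpirun -np 4 ./a.out):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One-to-all broadcast: the root's value reaches every process. */
    int value = (rank == 0) ? 99 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* All-reduce: every process contributes its rank; all see the sum. */
    int sum = 0;
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Prefix-sum (scan): process i receives rank_0 + ... + rank_i. */
    int prefix = 0;
    MPI_Scan(&rank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: bcast=%d allreduce=%d prefix=%d\n",
           rank, value, sum, prefix);

    MPI_Finalize();
    return 0;
}

Scatter, gather, and all-to-all personalized communication map onto MPI_Scatter, MPI_Gather, and MPI_Alltoall in the same style.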