Parallel Processing Challenges
§6.1 Introduction
If an enhancement E speeds up a fraction FractionE of the execution time by a factor SpeedupE, then

ExTimeE = ExTime × ((1 − FractionE) + FractionE / SpeedupE)

Speedup = ExTime / ExTimeE
        = 1 / ((1 − FractionE) + FractionE / SpeedupE)
        = 1 / ((1 − F) + F/S)

Complex instructions can be executed in parallel: SpeedupE is raised either by making the enhanced fraction run some number of times faster or by increasing the number of processors.
Amdahl’s Law
Floating-point instructions are improved to run 2× faster (a 100% improvement), but only 10% of the executed instructions are FP:
Speedup = 1 / ((1 − F) + F/S)
        = 1 / ((1 − 0.1) + 0.1/2)
        = 1 / 0.95
        = 1.053

Only a 5.3% overall improvement.
Similarly, for an 80× speedup from 100 processors, the parallelizable fraction must be x = 99.74%, so the sequential code can be at most 100% − 99.74% ≈ 0.25% of the original execution time.
Amdahl’s Law
Sequential part can limit speedup
Example: 100 processors, 90× speedup?
Tnew = Tparallelizable/100 + Tsequential

Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90

Solving: Fparallelizable = 0.999
Need the sequential part to be only 0.1% of the original time
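A quick way to sanity-check these numbers is to evaluate the law directly. A minimal sketch (the helper name is ours, not from the slides; plain C that also compiles under nvcc):

#include <stdio.h>

/* Amdahl's Law: f = fraction of time that is enhanced/parallelizable,
   s = speedup of that fraction (e.g. the number of processors) */
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    printf("%.3f\n", amdahl_speedup(0.1, 2.0));     /* FP example: 1.053 */
    printf("%.1f\n", amdahl_speedup(0.999, 100.0)); /* ~91 with the rounded F = 0.999 */
    return 0;
}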
Example: Sum Reduction

half = 100; /* 100 processors, each holding one partial sum in sum[Pn] */
repeat
synch();
if (half%2 != 0 && Pn == 0)
sum[0] = sum[0] + sum[half-1];
/* Conditional sum needed when half is odd;
Processor0 gets missing element */
half = half/2; /* dividing line on who sums */
if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
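For comparison, a minimal CUDA sketch of the same tree reduction inside one thread block: threadIdx.x plays the role of Pn and __syncthreads() plays the role of synch(). The kernel name is ours, and we assume a power-of-two block size (256 here) to sidestep the odd-half case handled above.

__global__ void block_sum(const double *in, double *out) {
    __shared__ double sum[256];        // assumes blockDim.x == 256
    int pn = threadIdx.x;              // this thread's "processor number"
    sum[pn] = in[pn];
    for (int half = blockDim.x / 2; half >= 1; half /= 2) {
        __syncthreads();               // wait for the previous round
        if (pn < half)
            sum[pn] += sum[pn + half]; // fold the upper half onto the lower half
    }
    if (pn == 0) *out = sum[0];        // final sum ends up in sum[0]
}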
Basic idea:
Heterogeneous execution model
CPU is the host, GPU is the device
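A minimal sketch of this split (kernel name ours): device code is marked __global__, while main() runs on the host and launches it on the device.

__global__ void mykernel(void) {
    // device code: executes on the GPU
}

int main(void) {
    mykernel<<<1, 1>>>();      // host code: launch the kernel on the device
    cudaDeviceSynchronize();   // wait for the device to finish
    return 0;
}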
8 × streaming processors (SPs) make up one SM in the G80/GT200 generation:
The TPC of a G80 GPU has 2 SMs, while the TPC of a GT200 has 3 SMs.
An SP includes several ALUs and FPUs; an ALU is an Arithmetic and Logic Unit and an FPU is a Floating-Point Unit. The SP is the real processing element that acts on vertex or pixel data.
Several TPCs can be grouped into a higher-level entity called a Streaming Processor Array.
In NVIDIA's newer GPU, the GF100/Fermi, the TPC is no longer used: only the SMs remain. We can also say that on the Fermi architecture, a TPC = an SM.
In the Fermi architecture, an SM is made up of two 16-way SIMD units. Each 16-way SIMD unit has 16 SPs, so a Fermi SM has 32 SPs, or 32 CUDA cores.
Graphics Processing Units
• A texture mapping unit (TMU) is a component in modern graphics
processing units (GPUs). Historically it was a separate physical processor.
• A TMU is able to rotate, resize, and distort a bitmap image (performing
texture sampling), to be placed onto an arbitrary plane of a given 3D
model as a texture.
• This process is called texture mapping.
1. Offloading computation
2. Single Instruction Multiple Threads (SIMT):
   - streaming and massively parallel multithreading
3. Work well with memory: both host memory and GPU memory, and the transfers between them
4. Make use of CUDA libraries (see the cuBLAS sketch below)
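For point 4, a minimal sketch using the cuBLAS library's daxpy routine instead of a hand-written kernel (the wrapper name is ours; d_x and d_y are assumed to be device pointers already filled, as in the DAXPY example later):

#include <cublas_v2.h>

// y = a*x + y, computed on the GPU by cuBLAS
void daxpy_with_cublas(int n, double a, double *d_x, double *d_y) {
    cublasHandle_t handle;
    cublasCreate(&handle);                      // set up the library context
    cublasDaxpy(handle, n, &a, d_x, 1, d_y, 1); // strides of 1 through x and y
    cublasDestroy(handle);
}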
Terminology
- Host: The CPU and its memory (host memory)
- Device: The GPU and its memory (device memory)
Streaming Multiprocessors
Simple Processing Flow
[Figure sequence: three steps over the PCI bus: (1) copy input data from CPU memory to GPU memory, (2) load the GPU program and execute it, (3) copy results from GPU memory back to CPU memory]
A CUDA kernel is executed by an array of threads. All threads run the same code. Each thread has an ID that it uses to compute memory
addresses and make control decisions.
Offloading Computation
// DAXPY in CUDA
__global__                                       // CUDA kernel: runs on the GPU
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;   // global thread ID
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void) {
  int n = 1024;
  double a = 2.0;
  double *x, *y;     /* host copies of x and y */
  double *d_x, *d_y; /* device copies of x and y */
  int size = n * sizeof(double);
  // Alloc space for host copies and set up values (serial code)
  x = (double *)malloc(size); fill_doubles(x, n);
  y = (double *)malloc(size); fill_doubles(y, n);
  // Alloc space for device copies
  cudaMalloc((void **)&d_x, size);
  cudaMalloc((void **)&d_y, size);
  // Copy to device
  cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, size, cudaMemcpyHostToDevice);
  // Invoke DAXPY with 256 threads per block (parallel execution on GPU)
  int nblocks = (n + 255) / 256;
  daxpy<<<nblocks, 256>>>(n, a, d_x, d_y);
  // Copy result back to host and clean up
  cudaMemcpy(y, d_y, size, cudaMemcpyDeviceToHost);
  cudaFree(d_x); cudaFree(d_y);
  free(x); free(y);
  return 0;
}
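One caveat worth noting: the <<<nblocks, 256>>> launch is asynchronous, so the host should synchronize (the device-to-host cudaMemcpy above does this implicitly) and can check for launch errors. A short sketch (ours):

cudaError_t err = cudaGetLastError();  // reports kernel launch failures
if (err != cudaSuccess)
    fprintf(stderr, "daxpy launch failed: %s\n", cudaGetErrorString(err));
cudaDeviceSynchronize();               // block until the kernel has finished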
Streaming Processing
To be efficient, GPUs must have high throughput, i.e. process millions of pixels in a single frame, but individual operations may have high latency.
Parallelism in CPUs vs. GPUs

CPUs use task parallelism:
- Multiple tasks map to multiple threads
- Tasks run different instructions
- 10s of relatively heavyweight threads run on 10s of cores
- Each thread is managed and scheduled explicitly
- Each thread has to be individually programmed (MPMD)

GPUs use data parallelism:
- SIMD model (Single Instruction Multiple Data), also called SIMT
- Same instruction on different data
- 10,000s of lightweight threads on 100s of cores
- Threads are managed and scheduled by hardware
- Programming is done for batches of threads (e.g. one pixel shader per group of pixels, or per draw call)
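The "batches of threads" point has a performance corollary: within a warp, both sides of a divergent branch are executed with some lanes masked off. A small illustrative kernel (ours):

__global__ void divergent(int *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        a[i] = a[i] * 2;  // warp runs this with odd lanes masked off...
    else
        a[i] = a[i] + 1;  // ...then this with even lanes masked off
}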
Streaming Processing to Enable Massive Parallelism
Example: NVIDIA Tesla
- Streaming Processors (SPs)
  - single-precision FP and integer units
  - each SP is fine-grained multithreaded
- Warp: group of 32 threads
  - executed in parallel, SIMD style: 8 SPs × 4 clock cycles
  - hardware contexts for 24 warps: registers, PCs, …
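Because a warp's 32 threads execute in lockstep, warp-wide operations can skip explicit synchronization. A minimal sketch (ours; __shfl_down_sync needs CUDA 9+, and we assume a single 32-thread warp):

__global__ void warp_sum(const float *in, float *out) {
    float val = in[threadIdx.x];       // one warp: threadIdx.x in [0, 32)
    // halve the number of contributing lanes each step, SIMD style
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    if (threadIdx.x == 0) *out = val;  // lane 0 holds the warp's sum
}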
[Figure: interconnection network topologies: bus, ring, 2-D mesh, N-cube (N = 3), fully connected]
Roofline model:
Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
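In code form, the roofline is just the minimum of two roofs. A tiny sketch with made-up machine numbers (the 500 GFLOP/s and 100 GB/s figures are illustrative only, not from the text):

#include <stdio.h>

// attainable GFLOP/s = min(memory roof, compute roof)
double attainable_gflops(double peak_gflops, double peak_bw_gbs, double ai) {
    double memory_roof = peak_bw_gbs * ai;  // (GB/s) x (FLOPs/byte) = GFLOP/s
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

int main(void) {
    printf("%.0f\n", attainable_gflops(500, 100, 0.5)); // 50: memory-bound
    printf("%.0f\n", attainable_gflops(500, 100, 8.0)); // 500: compute-bound
    return 0;
}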