Universidad Nacional Mayor de San Marcos: Computer Architecture, Mg. Juan Carlos Gonzales Suárez, 2019-I


UNIVERSIDAD NACIONAL MAYOR DE SAN MARCOS
Decana de América
FACULTAD DE INGENIERÍA DE SISTEMAS E INFORMÁTICA

COMPUTER ARCHITECTURE
Mg. JUAN CARLOS GONZALES SUÁREZ

Graphics Processor (GPU)


System Architecture Snapshot With a GPU (2019)

• GPU memory (GDDR5, HBM2, …)
– GDDR5: 100s of GB/s, 10s of GB
– HBM2: ~1 TB/s, 10s of GB
• Host memory (DDR4, …)
– DDR4 2666 MHz: 128 GB/s, 100s of GB
• CPU to I/O Hub (IOH) over QPI/UPI
– 12.8 GB/s (QPI)
– 20.8 GB/s (UPI)
• I/O Hub to GPU, NVMe, and network interface over PCIe
– 16-lane PCIe Gen3: 16 GB/s
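The bandwidth figures on this slide can be turned into rough transfer times. The following is a minimal sketch in Python, not part of the original slides: the dictionary `LINK_GBPS` and the function `transfer_ms` are hypothetical names, the values are the peak numbers quoted on the slide ("~1 TB/s" is taken as 1000 GB/s), and real transfers will be slower than these ideal figures.

```python
# Peak bandwidths in GB/s, copied from the slide (assumed ideal, no overheads).
LINK_GBPS = {
    "PCIe Gen3 x16": 16.0,
    "QPI": 12.8,
    "UPI": 20.8,
    "DDR4 host memory": 128.0,
    "HBM2 GPU memory": 1000.0,  # "~1 TB/s" on the slide, rounded to 1000 GB/s
}


def transfer_ms(size_gb, link):
    """Ideal time in milliseconds to move size_gb gigabytes over a link."""
    return size_gb / LINK_GBPS[link] * 1000.0


if __name__ == "__main__":
    # Print the ideal time to move 1 GB over each link.
    for link in LINK_GBPS:
        print(f"{link:>18}: {transfer_ms(1.0, link):8.2f} ms per GB")
```

Under these assumptions, moving 1 GB over PCIe takes tens of milliseconds while the GPU can stream the same data from HBM2 in about a millisecond, which is why copies over PCIe are usually kept to a minimum.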
High-Performance Graphics Memory
Modern GPUs even employ 3D-stacked memory connected through a silicon interposer
o Very wide bus
o Very high bandwidth
o e.g., HBM2 in Volta

Source: Graphics Card Hub, “GDDR5 vs GDDR5X vs HBM vs HBM2 vs GDDR6 Memory Comparison,” 2019
Massively Parallel Architecture For Massively Parallel Workloads!
• NVIDIA CUDA (Compute Unified Device Architecture) – 2007
– A way to run custom programs on the massively parallel architecture!
• OpenCL specification released – 2008
• Both platforms expose synchronous execution of a massive number of threads
[Figure: a CPU thread and many GPU threads; data is copied over PCIe between the CPU and the GPU in each direction.]
ATI Radeon 5000 Series Architecture

Radeon SIMD Engine
• 16 Stream Cores (SC)
• Local Data Share

VLIW Stream Core (SC)

Local Data Share (LDS)

NVIDIA GPU

GPU Architecture
NVIDIA Fermi, 512 Processing Elements (PEs)

GPU: A Multithreaded Coprocessor
• SP: scalar processor (‘CUDA core’)
– Executes one thread
• SM: streaming multiprocessor
– 32 SPs (or 16, 48, or more)
– Fast local ‘shared memory’ (shared between SPs): 16 KiB (or 64 KiB)
• Global memory (on device)
[Diagram: one SM containing a grid of SPs and its shared memory, sitting above the on-device global memory.]
• GPU: multiple SMs
o 30 SMs in GT200
o 14 SMs in some Fermi parts (the full Fermi design has 16 SMs, 512 cores)
• For example, GTX 470: 14 SMs × 32 cores = 448 cores on one GPU (the GTX 480 has 15 SMs, 480 cores)
• GDDR memory: global memory (on device)
o 512 MiB - 6 GiB


How to Program GPUs
• Parallelization
– Decomposition into threads
• Memory
– Shared memory, global memory

Important Notes to Keep in Mind
• Avoid divergent branches
– Threads that run together on a single SM (on NVIDIA GPUs, the 32 threads of a warp) should execute the same code
– Code with extensive, unpredictable branching will run slowly
• Threads should be as independent as possible
– Synchronization and communication are efficient only within a single multiprocessor
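Why divergent branches are slow can be illustrated with a simple cost model. The sketch below is a Python emulation, not GPU code and not part of the original slides: `warp_steps` is a hypothetical helper, and the model assumes that a group of threads runs in lockstep, so when the threads disagree on an if/else, the hardware runs both paths one after the other while the threads not on the current path sit idle.

```python
def warp_steps(conditions, then_cost, else_cost):
    """Lockstep steps a group of threads needs for one if/else.

    conditions: one boolean per thread (True means the thread takes 'then').
    A path costs its full length as soon as at least one thread takes it,
    because the remaining threads are masked off and wait.
    """
    steps = 0
    if any(conditions):        # at least one thread takes the 'then' path
        steps += then_cost
    if not all(conditions):    # at least one thread takes the 'else' path
        steps += else_cost
    return steps


if __name__ == "__main__":
    uniform = [True] * 32                          # all threads agree
    divergent = [i % 2 == 0 for i in range(32)]    # threads alternate
    print("uniform:  ", warp_steps(uniform, 10, 10))
    print("divergent:", warp_steps(divergent, 10, 10))
```

In this model, a branch on which all 32 threads agree costs only one path, while a branch that splits the threads costs the sum of both paths, regardless of how many threads go each way.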

How to Program GPUs
• Parallelization
– Decomposition into threads
• Memory
– Shared memory, global memory
• Enormous processing power
– Avoid divergence
• Communication between threads
– Synchronization, no interdependencies

CUDA Execution Abstraction

• Block: Multi-dimensional array of threads
– 1D, 2D, or 3D
– Threads in a block can synchronize among themselves
– Threads in a block can access shared memory
– CUDA (Thread, Block) ~= OpenCL (Work item, Work group)
• Grid: Multi-dimensional array of blocks
– 1D or 2D (3D grids are also available on devices of compute capability 2.0 and later)
– Blocks in a grid can run in parallel, or sequentially
• Kernel execution issued in grid units
• Limited recursion (depth limit of 24 as of now)
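The block and grid abstraction can be sketched on the CPU to show how each thread finds its own piece of the data. This is a Python emulation under stated assumptions, not CUDA code: `blocks_needed`, `launch_1d`, and `scale_kernel` are hypothetical names, the grid is 1D, and the blocks are simply run one after another, which the abstraction permits because blocks in a grid may run in parallel or sequentially.

```python
def blocks_needed(n, threads_per_block):
    """Smallest number of blocks whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def launch_1d(kernel, num_blocks, threads_per_block, *args):
    """Run kernel once per (block, thread) pair, sequentially on the CPU."""
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, *args)


def scale_kernel(block_idx, block_dim, thread_idx, data, factor):
    """Per-thread body: multiply one element of data by factor."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(data):                       # the last block may have spare threads
        data[i] *= factor


if __name__ == "__main__":
    data = list(range(10))
    launch_1d(scale_kernel, blocks_needed(len(data), 4), 4, data, 2)
    print(data)
```

The bounds check matters whenever the number of elements is not a multiple of the block size: 10 elements with 4 threads per block need 3 blocks, and the last 2 threads of the third block must do nothing.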
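Simple CUDA Example
• The kernel launch is an asynchronous call from the CPU side to the GPU side
[Figure: C/C++ + CUDA code is fed to the NVCC compiler, which uses the host compiler and the device compiler to produce the CPU+GPU software.]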
Simple CUDA Example (annotations on the VecAdd code)
• Launch configuration: 1 block, N threads per block
– N instances of VecAdd spawned in GPU
– Host should wait for kernel to finish
• Function qualifiers
– __global__: in GPU, called from host/GPU; only void allowed
– __device__: in GPU, called from GPU
– __host__: in host, called from host
– One function can be both __host__ and __device__
• Thread index: “Which of N threads am I?” See also: blockIdx
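The code of this example did not survive in these notes, so the following is a stand-in rather than the original listing. It is a Python emulation of the idea (1 block, N threads, each thread adds one pair of elements), and `vec_add`, `launch_one_block`, and `add_vectors` are hypothetical names, not CUDA API calls. As with a `__global__` function, the emulated kernel returns nothing and writes its result through an output argument.

```python
def vec_add(thread_idx, a, b, c):
    """Per-thread body: add one pair of elements and store the result in c."""
    i = thread_idx       # answers "Which of N threads am I?"
    c[i] = a[i] + b[i]   # no return value; the output goes through c


def launch_one_block(kernel, n_threads, *args):
    """Emulate a launch of 1 block with n_threads threads, run sequentially."""
    for thread_idx in range(n_threads):
        kernel(thread_idx, *args)


def add_vectors(a, b):
    """Host-side wrapper: allocate the output, launch, and return the result."""
    c = [0.0] * len(a)
    launch_one_block(vec_add, len(a), a, b, c)
    return c


if __name__ == "__main__":
    print(add_vectors([1, 2, 3], [10, 20, 30]))
```

On a real GPU the N instances run concurrently and the host must wait for the kernel to finish before reading the output; the sequential loop here hides that step.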
More Complex Example: Picture Blurring
• Slides from NVIDIA/UIUC Accelerated Computing Teaching Kit
• Another end-to-end example: https://devblogs.nvidia.com/even-easier-introduction-cuda/
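The teaching-kit slides are not reproduced here, so the following is only a sketch of what a picture-blurring kernel typically computes, assuming a simple box blur with one thread per output pixel. It is a Python emulation on nested lists; `blur_pixel` and `blur` are hypothetical names, and the list comprehension in `blur` stands in for the 2D grid of threads.

```python
def blur_pixel(image, row, col, radius):
    """Average of the window around (row, col), clipped at the image edges."""
    height, width = len(image), len(image[0])
    total = 0
    count = 0
    for r in range(row - radius, row + radius + 1):
        for c in range(col - radius, col + radius + 1):
            if 0 <= r < height and 0 <= c < width:  # skip pixels outside the image
                total += image[r][c]
                count += 1
    return total / count


def blur(image, radius=1):
    """Apply blur_pixel to every pixel; on a GPU, each call would be one thread."""
    height, width = len(image), len(image[0])
    return [[blur_pixel(image, r, c, radius) for c in range(width)]
            for r in range(height)]


if __name__ == "__main__":
    image = [[0, 0, 0],
             [0, 9, 0],
             [0, 0, 0]]
    for row in blur(image):
        print(row)
```

Each output pixel depends only on the input image, never on other output pixels, so the threads are fully independent, which is the property the earlier programming notes ask for.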

• Great! Now we know how to use GPUs – Bye?


Application Examples

Thank you

Juan Carlos Gonzales Suarez


jgonzaless@unmsm.edu.pe
