Khoruzhenko Olha
A GPU, as its name suggests, is a specialized chip for graphics that handles the processing of
everything that appears on the screen. Because it has to operate on a large amount of data in
parallel (each operation is applied to every pixel of the image), a GPU has many more cores than
a CPU (Fig. 1 (a)). The goal of a CPU is to finish each task quickly while keeping the ability to
switch between different operations, whereas a GPU should push through the maximum number of
tasks at once. In short, a CPU is latency-optimized and a GPU is throughput-optimized [1].
Fig. 1. (a) Comparison of CPU and GPU architectures [2]; (b) GPU architecture [1]
Let’s look at the GPU’s architecture. It contains multiple processor clusters, each consisting of
streaming multiprocessors, and each streaming multiprocessor has a cache layer with associated
cores (Fig. 1 (b)) [1]. Compared to CPUs, GPUs have fewer and smaller cache layers, because they
devote more of their transistors to computation. The time it takes to retrieve data from memory
therefore matters less: this latency is masked as long as the GPU is kept busy with enough
computations. The number of cores gives an approximate idea of the level of parallelism possible.
For example, the NVIDIA Tesla V100 consists of 80 streaming multiprocessors with 64 cores each,
which makes 5120 cores in total.
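The core count quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name total_cores is hypothetical, and the two constants are the Tesla V100 figures given in the text.

```python
# Figures quoted in the text for the NVIDIA Tesla V100.
STREAMING_MULTIPROCESSORS = 80
CORES_PER_MULTIPROCESSOR = 64


def total_cores(multiprocessors: int, cores_per_multiprocessor: int) -> int:
    """Total number of cores: one group of cores per streaming multiprocessor."""
    return multiprocessors * cores_per_multiprocessor


if __name__ == "__main__":
    print(total_cores(STREAMING_MULTIPROCESSORS, CORES_PER_MULTIPROCESSOR))  # 5120
```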
To see which category of computer architecture is the ideal match for a GPU, let’s summarize the
goal: for graphics, one simply runs the same mathematical function again and again on all the
pixels of an image. A single instruction is thus applied to multiple data elements, which makes
the GPU a SIMD (single instruction, multiple data) machine [3]. This programming model allows
significant acceleration of a large number of applications. For example, if we want to scale an
image, each core takes care of one pixel, and all pixels are scaled in parallel. A sequential
machine would take n clock cycles to perform the task on n pixels, whereas a SIMD machine takes
only 1 clock cycle (assuming that there are enough cores to cover the whole computational load).
In fact, the task does not even have to be that perfectly parallel for a GPU: it only has to
match the SIMD scheme of computation, that is, it must be decomposable into the same operation
repeated on different data at each moment of time.
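The step counts in this comparison can be illustrated with a small Python model. It is a sketch under stated assumptions, not GPU code: "scaling" is taken to mean multiplying each pixel value by a factor, every operation is assumed to cost one step, and the names scale_pixel, scale_sequential and scale_simd are hypothetical. Python still executes each batch one element at a time; the batch only stands in for what the cores would do simultaneously.

```python
def scale_pixel(value: int, factor: float) -> int:
    """The single instruction applied to every pixel: scale, then clamp to 0..255."""
    return min(255, max(0, round(value * factor)))


def scale_sequential(pixels, factor):
    """One core visits the pixels one after another: n steps for n pixels."""
    out, steps = [], 0
    for p in pixels:
        out.append(scale_pixel(p, factor))
        steps += 1
    return out, steps


def scale_simd(pixels, factor, n_cores):
    """Idealized SIMD model: in each step, n_cores pixels receive the same instruction."""
    out, steps = [], 0
    for start in range(0, len(pixels), n_cores):
        batch = pixels[start:start + n_cores]
        out.extend(scale_pixel(p, factor) for p in batch)  # conceptually simultaneous
        steps += 1
    return out, steps


if __name__ == "__main__":
    image = [10, 20, 30, 40, 200]
    print(scale_sequential(image, 1.5))       # 5 steps
    print(scale_simd(image, 1.5, n_cores=8))  # 1 step: enough cores for all pixels
    print(scale_simd(image, 1.5, n_cores=2))  # 3 steps: too few cores
```

Both functions return the same pixels; only the number of steps differs, and it drops to 1 once the number of cores covers the whole image.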
Let’s take a look at the memory hierarchy, which in CUDA follows from the architectural hierarchy
(Fig. 1 (b)) [1]. The lowest level is the registers, which are allocated to individual cores.
Since this is individual on-chip memory, register data can be processed faster than any other.
Read-only memory is the on-chip memory of the streaming multiprocessors and is used for
particular purposes, such as texture memory, which is accessed via CUDA’s texture functions [3].
Within thread blocks, the on-chip memory consists of the layer 1 (L1) cache and shared memory:
the former is controlled by hardware and the latter by software. The next level is the layer 2
(L2) cache, which can be accessed by all threads in all thread blocks and caches both global and
local memory. At the top is global memory, which resides in the device’s DRAM and is
comparable to a CPU’s RAM. Naturally, retrieving data gets slower with each successive level of
the hierarchy.
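The ordering described above can be written down as a small lookup table. This is merely a sketch that restates the text: the MemoryLevel type and the helper names rank and is_slower are hypothetical, the scope strings paraphrase the description, and no real latency figures are implied.

```python
from typing import NamedTuple


class MemoryLevel(NamedTuple):
    name: str
    scope: str  # who can access this level, as described in the text


# Ordered from fastest (index 0) to slowest, following the text.
HIERARCHY = [
    MemoryLevel("registers", "individual core"),
    MemoryLevel("read-only memory", "streaming multiprocessor"),
    MemoryLevel("L1 cache / shared memory", "thread block"),
    MemoryLevel("L2 cache", "all thread blocks"),
    MemoryLevel("global memory", "whole device (DRAM)"),
]


def rank(name: str) -> int:
    """Position of a level in the hierarchy; a higher rank means slower access."""
    for position, level in enumerate(HIERARCHY):
        if level.name == name:
            return position
    raise ValueError(f"unknown memory level: {name}")


def is_slower(a: str, b: str) -> bool:
    """True if level a sits further from the cores than level b."""
    return rank(a) > rank(b)


if __name__ == "__main__":
    print(is_slower("global memory", "registers"))  # True
```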
The memory bandwidth of a GPU is the maximum amount of data that the memory bus can
transfer per unit of time, so it characterizes how quickly the GPU’s framebuffer can be read and
written. Modern GPUs are capable of transfers on the order of 100 GB per second [5]. Memory
bandwidth can become the system’s bottleneck if it is too low, because the numerous cores of the
GPU then sit idle while waiting for memory to respond. For example, if the GPU can process
each data block n times, then the external peripheral component interconnect (PCI) needs to
provide only 1/n of the internal bandwidth of the GPU [6].
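The 1/n rule of thumb can be turned into a short calculation. The sketch below assumes an internal bandwidth of 100 GB/s, taken only as the order of magnitude mentioned in the text, and a reuse count of n = 4 chosen purely for illustration; both function names are hypothetical.

```python
def required_external_bandwidth(internal_bandwidth: float, reuse_count: int) -> float:
    """External link bandwidth needed when each fetched block is processed reuse_count times."""
    if reuse_count < 1:
        raise ValueError("reuse_count must be at least 1")
    return internal_bandwidth / reuse_count


def transfer_time_seconds(n_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move n_bytes over a link with the given bandwidth."""
    return n_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    internal_gb_per_s = 100.0  # assumed order of magnitude, in GB/s
    print(required_external_bandwidth(internal_gb_per_s, 4))  # 25.0 GB/s
    print(transfer_time_seconds(1e9, 100e9))                  # seconds to move 1 GB
```

The more often each block is reused on the GPU, the less the external link limits the overall throughput.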
Scatter/gather engines (a.k.a. the memory management unit) play a crucial role in efficient
memory access and data movement on GPUs [7]. They translate virtual memory addresses to physical
ones, thus ensuring correct memory access. They combine the memory accesses of neighboring
threads into single transactions, thus reducing latency and maximizing throughput. They
facilitate moving data to and from scattered memory locations, because GPUs quite frequently
process data in irregular memory regions. Additionally, they enable caching mechanisms, which
store frequently accessed memory regions, and prefetching techniques, which anticipate future
accesses and fetch data in advance.
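The effect of combining neighboring threads’ accesses into single transactions can be sketched by counting how many aligned memory segments a group of threads touches. The group size of 32 threads, the 128-byte segment size and the 4-byte element size are assumptions made for this illustration, not figures from the text, and count_transactions is a hypothetical helper rather than a real API.

```python
WARP_SIZE = 32       # assumed number of threads issuing accesses together
SEGMENT_BYTES = 128  # assumed size of one memory transaction
FLOAT_BYTES = 4      # assumed size of each element read by a thread


def count_transactions(byte_addresses, segment_bytes=SEGMENT_BYTES):
    """Number of distinct aligned segments touched by one group of threads."""
    return len({address // segment_bytes for address in byte_addresses})


# Neighboring threads read neighboring elements: all addresses share one segment.
contiguous = [tid * FLOAT_BYTES for tid in range(WARP_SIZE)]

# Each thread reads an element 32 floats away from its neighbor's: one segment per thread.
strided = [tid * 32 * FLOAT_BYTES for tid in range(WARP_SIZE)]

if __name__ == "__main__":
    print(count_transactions(contiguous))  # 1
    print(count_transactions(strided))     # 32
```

Under these assumptions the contiguous pattern is served by a single transaction, while the scattered pattern needs one transaction per thread, which is why irregular accesses are so much more expensive.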