The Case For GPGPU Spatial Multitasking
1. Introduction
Set-top and portable devices are becoming increasingly
popular and powerful. Due to the cost, power, and thermal
constraints placed on these devices, often they are designed
with a low-power general-purpose CPU and several heterogeneous processors, each specialized for a subset of the
device's tasks. These heterogeneous systems increasingly
include programmable Graphics Processing Units (GPUs).
2. GPU Multitasking
Initially, multiple graphics applications could only share
a GPU via cooperative multitasking, requiring applications executing on the GPU to yield GPU control voluntarily. If a malicious or malfunctioning application never
yielded, other applications were unable to use the GPU.
Windows Vista, together with DirectX 10, introduced GPU
preemptive multitasking for graphics applications, but not
for GPGPU applications [17,24]. GPGPU applications continue to use cooperative multitasking.
In cooperative multitasking, a computation block offloaded to the GPU runs to completion before yielding the GPU. NVIDIA refers to this computation block as a kernel. A GPGPU application may contain one or more kernels. To avoid problems with malfunctioning and malicious applications that never yield, systems typically rely on a watchdog timer that resets the GPU if a kernel runs for too long.
Figure 1. Illustration of multitasking methods. A and B represent applications using the GPU. The Y-axis represents the GPU
resources and the X-axis represents the duration those resources are used. In spatial multitasking, the applications execute
simultaneously with resources split among them.
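As a minimal sketch of the tradeoff Figure 1 illustrates, the toy model below (in Python; the run times, per-application SM demands, and the linear slowdown model are assumptions for illustration, not measurements from this study) compares the makespan of cooperative and spatial multitasking for two applications:

TOTAL_SMS = 30

# (name, run time on the full GPU, SMs the application can keep busy)
APPS = [("A", 10.0, 20), ("B", 10.0, 10)]

def cooperative_makespan(apps):
    # Kernels run back-to-back, each owning the whole GPU in turn.
    return sum(time for _, time, _ in apps)

def spatial_makespan(apps):
    # SMs are split evenly; an application slows down only when its
    # share is smaller than the SMs it can use (crude linear model).
    share = TOTAL_SMS // len(apps)
    return max(time * max(1.0, usable / share) for _, time, usable in apps)

print(cooperative_makespan(APPS))  # 20.0
print(spatial_makespan(APPS))      # ~13.3

Under these assumptions, even a naive even split finishes sooner than serializing the kernels, because application B cannot occupy the whole GPU in any case.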
3. Evaluation
To determine how efficiently applications use GPU resources and to evaluate potential benefits of spatial multitasking, we profile the effects of varying available GPU resources on the applications described in Section 3.1. We
also simulate these applications sharing the GPU using both
spatial and cooperative multitasking. Our methodology is
described in Section 3.2 and the profiling results and their
relevance to spatial multitasking are discussed in Section
3.3. Our evaluation of spatial multitasking is presented in
Section 3.4 and Section 3.5 compares several heuristics for
partitioning GPU SMs among applications sharing the GPU
via spatial multitasking.
3.1. Applications
We characterize workloads targeting the portable and set-top markets. These workloads are selected from source code
available online [5]. Although the workloads and input data
sets target set-top and portable devices, the results should
also apply to desktop devices. We have placed all workloads and input data online for public use [7]. Some workloads include CPU computation as detailed in the following
workload descriptions; however, we profile only the GPU
portions of computation because this is the portion directly
impacted by GPU spatial multitasking.
Ray Tracing [14] implements a rendering engine that
supports shadows and reflections up to five levels deep. The
CPU performs object creation and movement. A single
GPU kernel performs rendering and is executed each time
a new frame is created.
AES [13] supports 128- and 256-bit Advanced Encryption Standard (AES) encryption and decryption. A single
GPU kernel performs the entire encryption/decryption algorithm. The CPU performs initialization, cleanup, and I/O
before and after the encryption/decryption. For this study, a
128-bit key encrypts 4MB of randomly generated data.
RSA [28] performs RSA asymmetric encryption. As with AES, a single GPU kernel performs the entire encryption.
Table 1. Kernel and thread configuration of each application.

Application          Kernels   Average Thread Blocks   Average Threads per Block
AES Decrypt          1         4,097                   256
AES Encrypt          1         4,097                   256
DVC                  14        112                     216
Fractal Generation   1         512                     64
Image Denoising      1         110,592                 64
JPEG Decode          1         13,824                  64
JPEG Encode          2         73,694                  64
RSA                  1         4                       32
Radix Sort           4         357                     250
Ray Tracing          1         2,048                   128
SAD                  3         4,354                   66
SHA1                 1         65                      64
3.2. Methodology
\[
\sum_{i=1}^{n} w_i x_i \qquad (1)
\]
Table 2. Baseline GPU configuration (NVIDIA Quadro FX 5800).

SMs                  30
SM Freq.             650MHz
Memory Controllers   8
Memory Freq.         800MHz
Interconnect Model   Crossbar
Interconnect Freq.   650MHz
Registers            64KB/SM
Warp Size            32 Threads
Texture Cache        8KB/SM
Constant Cache       8KB/SM
Shared Memory        16KB/SM

3.3. Profiling Results

Figure 2. Speedup of each application as the number of SMs in the GPU is varied.
Figure 2 plots each application's speedup as the number of SMs is varied. A few applications continued to scale past our baseline configuration (30 SMs), but performance leveled off shortly beyond that. Most of the other applications scaled well initially, but leveled off as the number of SMs increased. Most importantly, Figure 2 shows that many applications scale sub-linearly up to 30 SMs (the number in our baseline configuration), and scale even worse at expected future resource levels. As expected from the thread configuration shown in Table 1, RSA showed almost no performance change as SMs were increased.
Figure 3 examines memory frequency to study the effect of memory bandwidth limits on GPGPU applications;
we use memory frequency as an approximation for bandwidth since we cannot easily model fixed-frequency bandwidth caps. Speedup is normalized to a 200MHz GPU
memory clock. All the applications studied showed sublinear speedup before reaching even the baseline memory
frequency (800MHz), indicating these GPGPU applications
under-utilized memory bandwidth. JPEG Decode, JPEG
Encode, SAD, and SHA1 showed the highest demand for
memory bandwidth, while AES Decrypt, AES Encrypt,
Fractals, Image Denoising, and RSA showed little demand.
In Figure 4, the GPU interconnect frequency is varied.
Speedup is normalized to a 150MHz GPU interconnect
clock. JPEG Encode, SAD, and SHA1 showed near-linear
speedup at higher interconnect frequencies than the other
applications, but as we observed with memory frequency,
all of the applications showed sub-linear speedup before the
baseline interconnect frequency (650MHz). In this experiment, AES Decrypt, AES Encrypt, Fractals, Image Denoising, and RSA showed little change in performance as interconnect bandwidth increased, demonstrating that their interconnect bandwidth needs are modest.
Figure 3. Speedup of each application as the GPU memory frequency is varied, normalized to a 200MHz memory clock.

Figure 4. Speedup of each application as the GPU interconnect frequency is varied, normalized to a 150MHz interconnect clock.

Figures 2 to 4 confirm that many GPGPU applications exhibit unbalanced GPU resource utilization. It is thus
likely that spatial multitasking will improve system performance by allowing multiple GPGPU applications to share
the GPU without significantly impacting individual performance. RSA, for example, benefits little from increases in
memory bandwidth, interconnect bandwidth, or SMs for the
size of the input data simulated in this study. RSA would
likely perform just as well using a single SM as it would if
it were assigned the entire GPU. With spatial multitasking,
assigning RSA only one SM would still leave the remaining
SMs free for other applications.
We have classified each application based on the information provided in Table 1 and Figures 2 to 4. We verify
our assertions regarding the impact of spatial multitasking
on these classes of applications using the methodology presented in Section 3.4.
Compute-bound applications scale well as the number of SMs is increased but exhibit sub-linear speedup as memory and interconnect frequencies are increased. Assuming
spatial multitasking is implemented such that SMs are partitioned among applications sharing the GPU, these applications are unlikely to contribute significantly to increases in
total system performance using spatial multitasking. These
applications are likely to take roughly N times longer to
execute if they are given N times fewer SMs. AES Decrypt, AES Encrypt, Fractals, Image Denoising, JPEG Decode, and Ray Tracing are compute-bound.
Interconnect/Memory-bound applications do not scale well as the number of SMs is increased even though they
are sufficiently parallel to potentially make full use of the
available SMs. These applications are likely to contribute to
an increase in total system performance using spatial multitasking.
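To sketch why this classification matters (with assumed speedup curves rather than the measured curves of Figure 2), consider pairing one compute-bound and one memory-bound application under a partitioned 30-SM GPU:

def s_compute(n_sms):
    # Compute-bound: speedup grows ~linearly with SMs.
    return float(n_sms)

def s_membound(n_sms):
    # Memory-bound: saturates once bandwidth is the bottleneck.
    return min(float(n_sms), 8.0)

def total_speedup(n_a, n_b):
    # Aggregate speedup when the compute-bound app gets n_a SMs and
    # the memory-bound app gets n_b SMs (n_a + n_b = 30).
    return s_compute(n_a) + s_membound(n_b)

print(total_speedup(15, 15))  # even split: 15 + 8 = 23
print(total_speedup(22, 8))   # skewed split: 22 + 8 = 30

Shifting SMs away from the saturated application raises aggregate throughput without slowing the memory-bound application at all.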
Figures 5, 6, and 7. Distribution of speedup of spatial multitasking relative to cooperative multitasking for all combinations of two, three, and four applications, respectively, sharing the GPU with an even SM split.

3.4. Spatial Multitasking Performance

In this section, we evaluate the performance of applications sharing the GPU via
spatial and cooperative multitasking. We simulate all combinations, with repetition, of the applications presented in
Section 3.1. For example, if there were three applications
A, B, and C, then the two-application combinations are AA,
AB, AC, BB, BC, and CC. For this evaluation, kernel initialization and cleanup are not modeled for either multitasking method. This favors neither cooperative nor spatial multitasking because both incur this overhead. In this initial evaluation of spatial multitasking, SMs are naively partitioned evenly among the applications. When the number
of SMs is not divisible by the number of applications, the
assignment is made arbitrarily with a near-even split. We
have determined through experimentation that this arbitrary
assignment does not significantly affect results. More intelligent methods for SM partitioning are examined in Section
3.5. GPU global memory is not virtualized for this study.
Applications allocate memory from the same physical pool
but memory isolation is not enforced. In a real implementation of spatial multitasking, memory isolation should be
provided to ensure faulty or malicious applications do not
have access to another application's memory space.
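A minimal sketch of this workload enumeration and the near-even SM split (in Python; APPS abbreviates the application names of Table 1):

from itertools import combinations_with_replacement

APPS = ["AES Decrypt", "AES Encrypt", "DVC", "Fractals", "Image Denoising",
        "JPEG Decode", "JPEG Encode", "RSA", "Radix Sort", "Ray Tracing",
        "SAD", "SHA1"]

def workloads(k):
    # All k-application combinations with repetition (AA, AB, AC, ...).
    return list(combinations_with_replacement(APPS, k))

def even_split(total_sms, k):
    # Near-even split; which applications receive an extra SM is
    # arbitrary, matching the methodology described above.
    base, extra = divmod(total_sms, k)
    return [base + 1 if i < extra else base for i in range(k)]

print(len(workloads(2)))  # 78 two-application combinations of 12 apps
print(even_split(30, 4))  # -> [8, 8, 7, 7]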
For many applications, it is not feasible to simulate the
entire execution in a reasonable amount of time since cycle-accurate GPU simulation is considerably more complex and
time consuming than cycle-accurate CPU simulation due to
the large number of threads that execute simultaneously in
GPUs. Instead, we simulate spatial multitasking for a fixed
number of cycles and measure the work completed (instructions executed). When we began experimentation, we tried a
number of simulation lengths and compared how the results
changed as the length of simulation changed. We found that
results from shorter simulations (1M cycles or less) varied
considerably, but longer simulations (greater than 1M cycles) converged on similar results for the applications evaluated in this research. This is probably due to startup variations that are not representative of the steady-state behavior
of applications. We simulate for 5M GPU cycles because
it is sufficiently greater than 1M cycles so that startup variations have minimal impact on performance results and is
sufficiently small that simulation time is still reasonable.
We simulate each combination of applications in spatial
multitasking for 5M GPU cycles; if an application completes in less than 5M cycles, it is restarted and simulation
continues to ensure we observe 5M GPU cycles. We simulate each combination of applications in cooperative multitasking for a fixed number of instructions. The number of
instructions for which each application is run in cooperative
multitasking corresponds to the number of instructions that
are completed by the application in the corresponding spatial multitasking simulation. This ensures we are comparing
the same amount of work for each application between spatial and cooperative multitasking.
Table 3. Speedup of spatial multitasking compared to cooperative multitasking for all combinations of two, three, and four applications sharing the GPU (even SM split).

                  2 Apps   3 Apps   4 Apps
Geometric Mean    1.14     1.22     1.30
Maximum           2.00     2.91     3.83
Minimum           0.99     0.99     0.99
Table 4. Speedup of spatial multitasking compared to cooperative multitasking for several benchmark combinations.
Applications                    Oracle Best        Even
                                SMs      Speedup   SMs      Speedup
DVC / SHA1                      8 / 22   1.23      15 / 15  1.14
AES Encrypt / Image Denoising   25 / 5   1.01      15 / 15  1.01
JPEG Decode / Radix Sort        20 / 10  1.08      15 / 15  1.06
JPEG Encode / SAD               25 / 5   1.08      15 / 15  1.05
Figures 5 to 7 plot the distribution of speedup of spatial multitasking relative to cooperative multitasking for all
combinations of two, three, and four applications sharing
the GPU when SMs are split evenly among the applications.
For the applications we have evaluated, in 75% of simulations spatial multitasking shows speedups relative to cooperative multitasking greater than 1.03, 1.07, and 1.12 for
two, three, and four applications, respectively; greater than
1.09, 1.14, and 1.19 in 50% of simulations; and greater than
1.20, 1.26, and 1.55 in 25% of simulations.
Table 3 presents the speedup of spatial multitasking compared to cooperative multitasking for all combinations of
two to four applications sharing the GPU. For these results,
only an even-split of SMs among applications in spatial
multitasking is presented. When four applications share a
GPU with 30 SMs, two applications are arbitrarily assigned
eight SMs while the other two applications are assigned
seven.
As shown in Table 3, spatial multitasking performs quite
well in general and in the worst case only performs slightly
worse than cooperative multitasking. As the number of
applications sharing the GPU grows, the speedup of spatial multitasking compared to cooperative multitasking also
grows due to more efficient utilization of GPU resources.
Table 4 details the performance of several specific combinations of applications. We examine both Oracle Best
and Even SM partitioning. Oracle Best partitioning is an
exhaustive search of all possible partitions of SMs among
applications that selects the partitioning leading to the
greatest speedup over cooperative multitasking. Oracle Best is impractical to implement in real systems, but it provides an upper bound on the benefit of SM partitioning.
Figure 8. Distribution of DRAM bandwidth utilization across all combinations of one to four applications sharing the GPU.

Table 5. DRAM bandwidth utilization across all combinations of applications sharing the GPU.

                  1 App    2 Apps   3 Apps   4 Apps
Geometric Mean    37.0%    38.6%    40.5%    42.2%
Maximum           92.6%    94.5%    93.3%    91.7%
Minimum           2.2%     2.3%     2.4%     2.6%

Figure 9. Distribution of average interconnect latency across all combinations of one to four applications sharing the GPU.

Table 6. Average interconnect latency across all combinations of applications sharing the GPU.

                  1 App    2 Apps   3 Apps   4 Apps
Geometric Mean    205.5    164.5    138.8    119.0
Maximum           545.5    463.0    608.3    689.8
Minimum           2.6      2.5      2.5      2.5
3.5. SM Partitioning
In the previous section we analyzed the performance advantages of spatial multitasking over cooperative multitasking using a simple even-split of GPU SMs among applications. In this section we compare the following alternative
SM partitioning heuristics:
Oracle Best performs an exhaustive search of SM partitionings and chooses the one that maximizes speedup over
cooperative multitasking. This heuristic is impractical to
implement in real systems, but provides an upper bound on
performance. Due to constraints on simulation time, we were unable to simulate Oracle scheduling for more than two applications sharing the GPU.
Oracle Worst performs an exhaustive search of SM partitionings and chooses the one that minimizes speedup over
cooperative multitasking. This heuristic is infeasible to implement outside of simulation but provides a lower bound
on performance.
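Both oracle heuristics amount to an exhaustive search over SM assignments scored by simulation. In the sketch below, speedup_of is a hypothetical stand-in for a full cycle-accurate simulation of one split; it is not part of our simulator:

def partitions(total_sms, n_apps):
    # Every assignment of SMs that gives each application at least one SM.
    if n_apps == 1:
        yield (total_sms,)
        return
    for first in range(1, total_sms - n_apps + 2):
        for rest in partitions(total_sms - first, n_apps - 1):
            yield (first,) + rest

def oracle(speedup_of, total_sms, n_apps, best=True):
    # Oracle Best maximizes (best=True); Oracle Worst minimizes.
    choose = max if best else min
    return choose(partitions(total_sms, n_apps), key=speedup_of)

# With 30 SMs there are 29 two-application splits, 406 three-application
# splits, and 3,654 four-application splits; each split costs one full
# simulation, which is why Oracle was limited to two applications here.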
Even distributes SMs as evenly as possible among applications. When the number of SMs is not divisible by the
number of applications, the assignment is made arbitrarily
with a near-even split. For example, in the case of four applications sharing a 30-SM GPU, two applications receive
seven SMs and two receive eight.
Smart Even is the same as Even except no application
is given more SMs than it can use based on the number of
thread blocks in the program.
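A sketch of Smart Even, assuming an application with B thread blocks can never occupy more than B SMs; the redistribution loop below is one plausible reading of the heuristic, not a specification:

def smart_even(total_sms, max_usable):
    # max_usable[i]: upper bound on the SMs application i can use,
    # e.g. its thread block count from Table 1.
    n = len(max_usable)
    alloc = [0] * n
    remaining = total_sms
    active = set(range(n))
    while remaining and active:
        share = max(1, remaining // len(active))
        progressed = False
        for i in sorted(active):
            give = min(share, max_usable[i] - alloc[i], remaining)
            alloc[i] += give
            remaining -= give
            if give:
                progressed = True
            if alloc[i] == max_usable[i]:
                active.discard(i)
        if not progressed:
            break  # all remaining apps are capped; leftover SMs stay idle
    return alloc

# An RSA-like application with only 4 thread blocks frees SMs for its
# partner (the cap of 1000 is an arbitrary stand-in for "unbounded"):
print(smart_even(30, [4, 1000]))  # -> [4, 26]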
Rounds attempts to assign SMs such that there are fewer
idle SMs near the completion of a kernel. In our simple
implementation of spatial multitasking, SMs cannot be reassigned to another application after the initial assignment.
Thus, when a GPGPU kernel has nearly completed execution, SMs can be idle because there are no new thread
blocks to assign to them. We have observed that most thread
blocks from a single kernel take the same amount of time
to execute, since thread blocks tend to execute in rounds
where groups of thread blocks start and complete execution
at nearly the same time. By knowing the number of thread
blocks the GPU would assign to each SM (it can be different
for each kernel) we can predict how many rounds it would
take to execute a kernel. The Rounds heuristic starts with an
even-split of SMs among applications. With the even-split
we calculate the number of rounds each application would
take to execute. After this, we find the minimum number
of SMs an application can be assigned without increasing
the number of rounds it would take to execute over an even-split of SMs. We then find which applications would benefit
from adding SMs assuming other applications receive the
minimum number of SMs just calculated. Finally, we choose
the partitioning that results in the minimum total number of
rounds summed over all of the applications. In the case of
ties, we choose the partitioning that is closest to an even-split. To summarize, this heuristic attempts to minimize the
total number of rounds executed among all the applications
while ensuring that no application takes more rounds to execute than it would require with a simple even-split.
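A brute-force sketch of Rounds. The number of thread blocks an SM runs concurrently (blocks_per_sm) is kernel-dependent; the value of 4 used in the example is assumed, and the exhaustive search over splits stands in for the incremental procedure described above:

import math
from itertools import product

def rounds(blocks, blocks_per_sm, sms):
    # Thread blocks execute in waves of (blocks_per_sm * sms) at a time.
    return math.ceil(blocks / (blocks_per_sm * sms))

def rounds_partition(apps, total_sms):
    # apps: list of (total_thread_blocks, concurrent_blocks_per_sm) pairs.
    n = len(apps)
    even = total_sms // n
    target = [rounds(b, c, even) for b, c in apps]
    # Fewest SMs each app can take without exceeding its even-split rounds.
    floors = [min(s for s in range(1, total_sms + 1)
                  if rounds(b, c, s) <= target[i])
              for i, (b, c) in enumerate(apps)]
    best = None
    for split in product(range(1, total_sms + 1), repeat=n):
        if sum(split) != total_sms or any(s < f for s, f in zip(split, floors)):
            continue
        total_rounds = sum(rounds(b, c, s) for (b, c), s in zip(apps, split))
        closeness = sum(abs(s - even) for s in split)  # tie-break: near even
        key = (total_rounds, closeness)
        if best is None or key < best[0]:
            best = (key, split)
    return best[1]

# AES Decrypt (4,097 blocks) paired with SHA1 (65 blocks), with an
# assumed concurrency of 4 blocks per SM:
print(rounds_partition([(4097, 4), (65, 4)], 30))  # -> (21, 9)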
Profile uses the information from Figure 2 to choose the
best SM partitioning. An implementation of this heuristic
requires that applications are profiled in isolation before the
heuristic can partition SMs. This would typically be done
at install or compile time. Profile chooses the SM partitioning that maximizes
\[
\frac{1}{N} \sum_{i=1}^{N} S(i, j_i) \qquad (2)
\]
where S(i, j) is the speedup of application i depicted in Figure 2 when j SMs are assigned to it, j_i is the number of SMs a candidate partitioning assigns to application i, and N is the number of applications sharing the GPU.
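A sketch of Profile as a direct search over equation (2), reusing partitions() from the oracle sketch above; the layout of speedup_tables (indexable by SM count) is an assumed data structure for the Figure 2 profiles:

def profile_partition(speedup_tables, total_sms):
    # speedup_tables[i][j] = S(i, j): profiled speedup of application i
    # when it runs alone on j SMs (e.g. the curves of Figure 2).
    n = len(speedup_tables)

    def mean_speedup(split):
        # Equation (2): the average of per-application speedups.
        return sum(speedup_tables[i][j] for i, j in enumerate(split)) / n

    return max(partitions(total_sms, n), key=mean_speedup)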
All of these experiments use a static partitioning of GPU
SMs among applications. The SM partitioning is decided at
the beginning of each simulation and never changed. In a
real implementation, SMs should be reassigned among applications dynamically to allow SMs to be assigned to new
GPGPU kernels and reclaimed from completed kernels. Dynamic partitioning is likely to result in even greater performance improvements for spatial multitasking due to more
efficient use of GPU SMs.
Figure 10. Average speedup of spatial multitasking compared to cooperative multitasking for several SM partitioning heuristics as the number of applications sharing the GPU is varied.

Figure 10 compares the performance of our SM partitioning heuristics using the baseline NVIDIA Quadro FX 5800 configuration. Geometric mean speedup of spatial multitasking across all combinations of applications is shown, with the results normalized to cooperative multitasking. Smart Even improves performance over Even because SMs are not wasted on applications that will not use them. However, the difference in performance between Smart Even, Rounds, and Profile is insignificant for this GPU configuration. When two applications share the GPU, these simple partitioning heuristics approach the optimal Oracle Best performance, while the average performance of Oracle Worst provides a corresponding lower bound.

Figure 11. Average speedup of spatial multitasking compared to cooperative multitasking for several SM partitioning heuristics as the total number of SMs in the GPU is varied. Two applications share the GPU.

Figure 12. Average speedup of spatial multitasking compared to cooperative multitasking for several SM partitioning heuristics as the total number of SMs in the GPU is varied. Three applications share the GPU.
4. Implementation

Partitioning GPU resources among applications is a more coarse-grained scheduling decision than scheduling individual threads to SMs. For this reason, in spatial multitasking,
the OS is best-suited for resource partitioning. The GPU
would still be responsible for scheduling individual threads
to SMs within the partitioning constraints the OS has set.
5. Related Work
Multitasking has been previously researched on several
platforms that share properties with GPUs. Intel's Larrabee,
a many-core architecture intended for graphics processing [25], would have fully supported preemptive and spatial
multitasking for the general-purpose cores, like most other
x86-based multicore CPUs. However, this work differs from
Larrabee by investigating multitasking for standard GPUs,
which have a much larger number of very simple cores than
Larrabee. Additionally, the number of hardware threads
supported on each GPU core is much higher than the four
threads per core supported on Larrabee.
The Cell multiprocessor is made up of a general-purpose
super-scalar processor and several SIMD processors [10].
The SIMD processors, known as Synergistic Processing Elements (SPEs), support both cooperative and preemptive
multitasking, but not spatial multitasking. The context of an
SPE is roughly 256KB [9], but spatial multitasking is still
likely to improve performance in the Cell multiprocessor by
using resources more efficiently. Like Larrabee, the scale of
parallelism on the Cell multiprocessor is much smaller than
what this work targets, which presents different challenges
for spatial multitasking.
NVIDIA's Fermi architecture supports concurrent execution of GPGPU kernels from a single application [19],
but the opportunity to execute multiple kernels from the
same application in parallel is likely to be rare compared
to the frequency of multitasking separate GPGPU applications. This is partly caused by limitations imposed by data
dependencies across kernels, a problem not present in spatial multitasking.
The issue of time-sharing versus space-sharing has been
investigated on shared-memory multiprocessors as well [15,
21]. McCann et al. showed that space-sharing scheduling policies outperform time-sharing [15]. They concluded
this is because parallel applications tend to have convex
speedup versus resource curves, caused by insufficient parallelism and overheads that grow with the number of processors used. Although GPUs are architected differently from
shared-memory multiprocessors, GPGPU applications still
tend to have convex speedup versus resource curves, as we
have shown in this paper. This feature is one of the motivating factors for GPU spatial multitasking. On the other hand,
Ousterhout noted that applications exhibiting fine-grained
inter-process communication perform poorly when only a
subset of the communicating processes are executing con-
currently [21]. To handle this situation, he proposed a technique that schedules processes from the same job to multiple processors simultaneously (i.e., co-scheduling), effectively time-sharing multiprocessors. In GPGPU architectures, fine-grained communication can only occur among
small groups of threads, known as thread blocks in NVIDIA
architectures. Thread blocks are executed in parallel on
a single SM, eliminating the need for co-scheduling in
GPGPU architectures.
6. Conclusion
We proposed spatial multitasking for GPGPU applications. Spatial multitasking allows multiple applications to
simultaneously share the GPU by partitioning its resources
among applications rather than, or in addition to, time multiplexing applications, as is done by cooperative and preemptive multitasking. We presented application characterizations that indicate many GPGPU applications fail to utilize GPU resources fully; GPGPU applications are even
more likely to exhibit unbalanced resource utilization in future GPUs. Our simulation results also indicated that spatial multitasking has the potential to offer significant performance benefits compared to cooperative multitasking by
executing applications in parallel. Smart Even resource partitioning showed average speedups of 1.16, 1.24, and 1.32
over cooperative multitasking for two, three, and four applications sharing the GPU, respectively. By allowing applications to share the GPU, spatial multitasking outperforms
cooperative multitasking by using available GPU resources
more efficiently. Finally, spatial multitasking provides opportunities for new, flexible scheduling and QoS techniques
that can further improve the user experience.
Acknowledgement
We would like to thank the anonymous reviewers for their valuable comments. This work is supported in part by an SRC grant
(Task ID: 2080.001), an NSF CAREER grant (CCF-095360), and
a generous gift grant from AMD.
References

[1] S. Al-Kiswany et al. StoreGPU: exploiting graphics processing units to accelerate distributed storage systems. In HPDC '08, pages 165-174, 2008.
[2] AMD. AMD64 Architecture Programmer's Manual Volume 2: System Programming, rev. 3.17 edition, June 2010. http://support.amd.com/us/Processor_TechDocs/24593.pdf.
[3] A. Bakhoda et al. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS '09, April 2009.
[4] S. Brodhead. GPU flame fractal renderer. http://sourceforge.net/projects/flam4/.
[5] CUDA zone: the resource for CUDA developers. http://www.nvidia.com/object/cuda_home.html.
[6] CUDPP: CUDA data parallel primitives library. http://gpgpu.org/developer/cudpp.
[7] GPGPU benchmarks. http://ercbench.ece.wisc.edu.