Improving GPU Multitasking Efficiency Using Dynamic Resource Sharing
Abstract—As GPUs have become essential components of embedded computing systems, a GPU shared by multiple CPU cores needs to efficiently support concurrent execution of multiple different applications. Spatial multitasking, which assigns different numbers of streaming multiprocessors (SMs) to different applications, is one of the most common solutions, but it is not a panacea for maximizing total resource utilization. This is because an SM consists of many different sub-resources, such as caches, execution units, and scheduling units, and each kernel's demand for these sub-resources rarely matches their fixed sizes inside an SM. To solve this resource-requirement mismatch, this paper proposes GPU Weaver, a dynamic sub-resource management system for multitasking GPUs. GPU Weaver maximizes sub-resource utilization through a shared resource controller (SRC) placed between neighboring SMs. The SRC dynamically identifies idle sub-resources of an SM and allows the neighboring SM to use them when possible. Experiments show that the combination of multiple sub-resource borrowing techniques enhances total throughput by up to 26%, and by 9.5% on average, over the baseline spatial multitasking GPU.
Index Terms—computer architecture, GPUs, multi-programmed, resource sharing, spatial multitasking
[Fig. 1. Execution unit status categorization: the components of each bar indicate the ratio of execution time spent in Empty, Proper, and Starve status.]

[Figure residue: resource utilization (%) of the BASE and 2X configurations for (a) MQ, (b) HS, and (c) BP; caption not recovered.]

[Fig. 3. GPU-Weaver Overview: SMs paired through a shared resource controller; sub-resources shown include SP/SFU execution units, L1-D cache, register file, shared memory, I-buffer, scoreboard, and stack/PC, connected to the interconnection network and memory partitions.]
[Fig. 4. (a) Execution sharing flow (highlighted; only the unidirectional case is shown). (b) The Cooperative Streaming Multiprocessor: 1) shared resources and additional features are highlighted, and 2) the execution flow of scheduling unit sharing (Fetch, Decode, Issue, Operand Collect, Execute across the collector units of SM0 and SM1) is also shown.]
interconnects, and the instructions are marked as “Shared” (Fig 4-(a) ②). When the execution of the migrated warp instructions is done, the results are returned to the write-back stage of the source SM using the saved “Shared” bit information (Fig 4-(a) ③).
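To make this flow concrete, the following is a minimal, simulator-style sketch of the idea, not the paper's implementation: the WarpInst/SMCore types and the tryIssueToNeighbor/retireOne helpers are illustrative assumptions. A migrated warp instruction carries a Shared bit plus the identity of its source SM, and when it completes on the borrowed execution unit its result is steered back to the source SM's write-back stage.

// Minimal sketch (illustrative names only; not the authors' implementation).
#include <cstdint>
#include <queue>
#include <vector>

struct WarpInst {
    uint32_t warp_id;
    uint32_t src_sm;   // SM that owns this warp's registers and write-back stage
    bool     shared;   // set when the instruction executes on the neighbor SM
};

struct SMCore {
    uint32_t id;
    std::queue<WarpInst> exec_pipe;   // instructions in flight on this SM's units
    std::vector<WarpInst> writeback;  // write-back stage input queue

    bool hasIdleExecUnit() const { return exec_pipe.size() < 4; }  // toy capacity

    // Finish the oldest in-flight instruction; route the result to the SM that
    // owns the warp (this SM, or the paired SM if the instruction was shared).
    void retireOne(SMCore& paired_sm) {
        if (exec_pipe.empty()) return;
        WarpInst done = exec_pipe.front();
        exec_pipe.pop();
        SMCore& owner = (done.shared && done.src_sm != id) ? paired_sm : *this;
        owner.writeback.push_back(done);
    }
};

// Issue to the neighbor only when one of its execution units is idle.
bool tryIssueToNeighbor(WarpInst inst, const SMCore& src, SMCore& neighbor) {
    if (!neighbor.hasIdleExecUnit()) return false;
    inst.shared = true;
    inst.src_sm = src.id;
    neighbor.exec_pipe.push(inst);
    return true;
}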
3.2 Scheduling Unit Sharing

In a GPU Weaver system, scheduling unit sharing resolves the TLP limitation by leveraging idle scheduling units of the other SM, without context saving and without additional storage overhead. Algorithm 1 shows how thread blocks are issued under the scheduling unit sharing technique. When an SM has a kernel to execute, thread blocks are first allocated to the original SM as long as enough sub-resources are available in that SM (lines 2-3). If the kernel still has remaining thread blocks and all the sub-resources other than scheduling units can be secured, the thread block scheduler issues additional thread blocks to the other SM whenever possible (lines 4-5).
Algorithm 1 Threadblock Scheduling
Input: KernelList (with initial context information)
Output: KernelList (with updated context information)
1: Kernel = SM.getKernel();
2: if (Kernel.hasTBtoLaunch()) and (!SM.fullOrigin()) then
3:     IssueBlock_OriginSM();
4: else if (Kernel.hasTBtoLaunch()) and (!NeighborSM.fullShared()) then
5:     IssueBlock_NeighborSM();
6: end if
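For illustration, the same policy can be sketched in C++ as below. This is only a sketch of Algorithm 1 under assumed resource checks; the SMState fields and helper names (fullOrigin, fullShared, scheduleThreadBlock) are illustrative, not GPGPU-Sim or hardware interfaces. A thread block stays on its home SM while that SM can fully host it, and spills to the neighbor only when the neighbor can lend scheduling structures and the remaining sub-resources can still be secured.

// Sketch of Algorithm 1 with illustrative names (not the authors' code).
struct SMState {
    int free_sched_slots;  // free scheduling entries (I-buffer/scoreboard/TB slots)
    int free_regs;         // free register-file space
    int free_shmem;        // free shared memory
};

enum class IssueTarget { None, OriginSM, NeighborSM };

// Mirrors SM.fullOrigin(): true when the origin SM cannot host one more TB.
static bool fullOrigin(const SMState& sm, int tb_regs, int tb_shmem) {
    return sm.free_sched_slots < 1 || sm.free_regs < tb_regs || sm.free_shmem < tb_shmem;
}

// Mirrors NeighborSM.fullShared(): true when the neighbor has no scheduling
// structures left to lend.
static bool fullShared(const SMState& neighbor) {
    return neighbor.free_sched_slots < 1;
}

IssueTarget scheduleThreadBlock(bool has_tb_to_launch,
                                const SMState& origin, const SMState& neighbor,
                                int tb_regs, int tb_shmem) {
    if (!has_tb_to_launch) return IssueTarget::None;
    if (!fullOrigin(origin, tb_regs, tb_shmem))
        return IssueTarget::OriginSM;                        // Algorithm 1, lines 2-3
    if (!fullShared(neighbor) &&
        origin.free_regs >= tb_regs && origin.free_shmem >= tb_shmem)
        return IssueTarget::NeighborSM;                      // Algorithm 1, lines 4-5
    return IssueTarget::None;
}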
A detailed execution flow of scheduling unit-shared warp instructions of kernel A on the neighboring SM is shown in Fig 4-(b). In this scenario, we assume that kernel A is scheduling-limited and kernel B is not limited by the scheduling structures. ① Some TBs of kernel A (Shared TBs) are first allocated to SM 1, with Shared bits as the sharing indicator, by the thread block scheduler. ② The warps from the Shared TBs fetch their instructions from the I-cache of SM 1, and the instructions are assigned to their own I-buffers based on the original WarpIds given by SM 1. ③ The warp scheduler in SM 1 issues warp instructions for both kernel A and kernel B using its own stack, barrier control logic, and scoreboard structures, with shared-bit indication. ④ Kernel B's warp instructions are issued to its own collector units to access the SM 1 register file. ⑤ Warp instructions of Shared TBs, however, must be issued to SM 0 for their register file accesses. When warp instructions of Shared TBs access the SM 0 register file, their WarpIDs must be transformed, similar to VWI (Virtual Warp Id) in [21], to avoid conflicts between the WarpIDs of the two SMs. To prevent two warps from accessing the same register locations, we transform a WarpID by adding MaxWarpId minus the neighbor kernel's maximum WarpID (MaxWarpId is 48 in our configuration). For example, a shared warp assigned Id 24 in the neighbor SM (bottom SM in Fig 4-(b)), whose resident kernel uses at most 16 warps, changes its warp id to 24 + (48 - 16) = 56 when issued to the original SM. ⑥ Using the transformed WarpIDs, Shared TB instructions can access their own registers in SM 0. After a shared warp instruction has collected its operands from the original SM's register file and executed, its result is written back to the original SM's register file. ⑦ The scoreboard entries, however, must be released in the neighbor SM where the warp was initially assigned, so that its next instruction can make progress.
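As a quick check of the WarpID transformation described above, the following sketch (the function name is illustrative; MaxWarpId = 48 follows the configuration stated above) reproduces the worked example of warp 24 mapping to warp 56.

// Quick check of the WarpID remapping used for Shared TB register accesses.
#include <cassert>
#include <cstdint>

constexpr uint32_t kMaxWarpId = 48;

// remapped id = neighbor-SM warp id + (MaxWarpId - neighbor kernel's max warps)
constexpr uint32_t remapSharedWarpId(uint32_t neighbor_warp_id,
                                     uint32_t neighbor_kernel_max_warps) {
    return neighbor_warp_id + (kMaxWarpId - neighbor_kernel_max_warps);
}

int main() {
    // Example from the text: warp 24 in the neighbor SM, whose resident kernel
    // uses at most 16 warps, becomes warp 56 when issued to the original SM.
    static_assert(remapSharedWarpId(24, 16) == 56, "24 + (48 - 16) == 56");
    assert(remapSharedWarpId(24, 16) == 56);
    return 0;
}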
3.3 Runtime Overhead

A GPU Weaver has two kinds of runtime performance overhead. First, a kernel newly launched onto a target SM may need to wait for some period of time when it cannot secure enough scheduling structures, because some of those structures are being used for the other kernel's execution on the other SM through scheduling sharing. The second overhead comes from interference between the techniques. Scheduling sharing enables more threads to run on the execution units within an SM, thereby decreasing the chance of the neighbor kernel's execution sharing. Cache sharing also reduces the idle time of execution units by improving kernel performance through a higher cache hit ratio. However, these overheads are small, as shown in Section 5.

Table 1. Specification list of benchmarks

Benchmark                          Kernel Type   Limiting Sub-Unit   Cache Type
Hotspot (HS) [4]                   Com           SP                  Insensitive
Backprop (BP) [4]                  Com           SP/SFU              Sen/Insensitive
Dxtc (DX) [15]                     Sched         Sched               Insensitive
StringMatch (SM) [5]               Mem           LD/ST               Sensitive
Reduction (RD) [15]                Sched         Sched               Insensitive
Mri-q (MQ) [19]                    Com           SFU                 Insensitive
ThreadFence Reduction (TF) [15]    Sched         Sched               Insensitive
3DConvolution (3D) [18]            Mem           LD/ST               Sensitive
Fdtd-2d (FD) [18]                  Mem           LD/ST               Sensitive

4 EXPERIMENTAL RESULTS

We used a GTX480-like modified GPGPU-Sim v3.2.2 [2] that supports concurrent execution of multi-programmed kernels [16]. The Smart Even partitioning of spatial multitasking [1], with Drain preemption [16], is used as the baseline structure, and SMK [17] is also used in our evaluation. On top of the baseline, we modified the SM structures to implement Co-SMs, each consisting of two SMs and an SRC. For scheduling unit sharing, the sizes of the I-cache and the decoder are doubled to minimize interference between the kernels; the additional hardware cost is small. We also added two pipeline stages to avoid increased critical-path delay caused by the long wires required for execution sharing. We employ the widely used re-run based simulation [16] for fair performance comparison. We considered various kernels from Rodinia [4], Parboil [19], Polybench [18], Mars [5], and the CUDA SDK [15]. Table 1 lists the benchmarks along with several important parameters. We evaluated all 36 pairs of the nine workloads, which are classified by kernel characteristics (compute-intensive, cache-sensitive, and scheduling-unit limited). We used average normalized turnaround time (ANTT) and system throughput (STP) as performance metrics [16]. In general, ANTT reflects the user-perceived response time and STP represents the overall progress of the system.
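For reference, ANTT and STP can be computed from per-kernel standalone and co-run execution times as sketched below, following the commonly used definitions adopted by the cited evaluation methodology; the variable names are illustrative.

// Hedged sketch of the two metrics. alone[i] is kernel i's execution time when
// run alone on the GPU; shared[i] is its execution time when co-running.
#include <cassert>
#include <cstddef>
#include <vector>

// ANTT: average per-kernel slowdown under co-running (lower is better).
double antt(const std::vector<double>& alone, const std::vector<double>& shared) {
    assert(!alone.empty() && alone.size() == shared.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < alone.size(); ++i)
        sum += shared[i] / alone[i];
    return sum / static_cast<double>(alone.size());
}

// STP: accumulated per-kernel progress under co-running (higher is better).
double stp(const std::vector<double>& alone, const std::vector<double>& shared) {
    assert(alone.size() == shared.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < alone.size(); ++i)
        sum += alone[i] / shared[i];
    return sum;
}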
[Fig. 5. (a) Total ANTT reduction and (b) total STP improvement of Spatial, SMK, and GPU-Weaver over all workload pairs.]
[Fig. 6. 1) Delay cycle ratio of (a) kernel launch; 2) execution unit sharing access ratio of the compute-intensive kernels (MQ, HS, BP) in (b) compute + scheduling workload pairs and (c) compute + cache workload pairs, comparing execution unit sharing only (Exonly) against all techniques combined (GW).]
Fig 5-(a) and 5-(b) show the ANTT and STP results of GPU Weaver, respectively, compared to the baseline SM-level spatial multitasking and SMK. Fig 5-(a) shows that GPU Weaver successfully reduces the ANTT of most workload combinations, by up to 21.5%. Fig 5-(b) shows that GPU Weaver improves STP by up to 26% over simple spatial multitasking. On average, GPU Weaver achieves a larger gain, 7.4% ANTT reduction and 9.5% STP improvement, than SMK (-0.02% ANTT and 7.9% STP improvement). Individually, the techniques contribute 3.8% (Ex-sharing), 2% (Cache-sharing), and 2.8% (Sched-sharing) to ANTT reduction, and 3.9% (Ex-sharing), 4% (Cache-sharing), and 2% (Sched-sharing) to STP improvement. The results show that GPU Weaver shares sub-resources about as efficiently as SMK while also successfully minimizing the sub-resource interference problem. For example, SMK shows low ANTT reduction for workload pairs containing MQ, because MQ requires high SFU utilization while the other kernels cannot fully utilize the SFUs. GPU Weaver, however, guarantees fair performance for the other kernels (BP/FD/RD/DX/TF) because execution sharing for MQ is allowed only when those kernels are not using their own SFUs.
5 RUNTIME AND HARDWARE OVERHEAD ESTIMATION

As mentioned in Section 3.3, GPU Weaver has two runtime overheads: kernel launch delay and interference between the sharing techniques. In this evaluation, only the interference between execution unit sharing and the other sharing techniques is considered, because the performance of execution unit sharing is expected to depend mainly on the behavior of the co-running kernel. Fig 6-(a) shows the ratio of kernel launch delay caused by scheduling sharing to the total execution cycles. In the (BP+DX) case, the BP kernel has to wait for 4.4% of the total execution cycles because of DX's long TB execution time under scheduling sharing. The total overhead of kernel launch delay is nevertheless small: it is below 0.3% for all the other cases and 0.048% on average. Fig 6-(b) and (c) show the ratio of shared execution unit accesses to all execution unit accesses when 1) only execution unit sharing is applied (Exonly) and 2) all the sharing techniques are applied (GW), for workload pairs containing compute-intensive kernels. For all cases in Fig 6-(b), GPU Weaver shows only slightly fewer execution-sharing accesses than the Exonly case. Similarly, Fig 6-(c) shows that cache sharing has a minimal impact on execution sharing, as the decrease in shared execution unit accesses is negligible.

The additional hardware cost per SM is estimated as follows: a) execution unit sharing: six 2-input muxes and pipeline registers for the pre-sharing stage and three 2-input muxes and pipeline registers for the post-sharing stage (buffers are added to meet timing); b) cache sharing: 64 2-input muxes (all cache lines for the two sharing ways) and a 1-bit register (sensitivity); c) scheduling unit sharing: a 77-bit I-fetch buffer, an additional instruction decoder, 96 1-bit registers (shared bits), and three 6-bit registers (WarpId transformation). In addition to the above costs, additional control logic is also required. To estimate the total hardware overhead of GPU Weaver, we synthesized the hardware logic using a 40 nm low-power profile standard cell library at a 1 GHz target frequency. Based on the synthesis and P&R results, the additional hardware logic occupies 5.27 mm², only 0.997% of the total area of the baseline GPU (529 mm²) [7].

6 CONCLUSION

In this paper, we proposed GPU Weaver, a dynamic sub-resource management system for spatial multitasking GPUs. It enhances total performance by orchestrating three sharing techniques, covering execution units, scheduling units, and L1 caches, on top of baseline GPU spatial multitasking. Execution unit sharing executes ready warp instructions on a neighbor SM's execution units when possible, and scheduling unit sharing maximizes thread-level parallelism by allocating additional thread blocks using a neighbor SM's scheduling structures whenever possible. Evaluations show that GPU Weaver improves ANTT and STP by up to 21.5% and 26%, respectively.

7 ACKNOWLEDGMENTS

This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2015R1C1A1A01053844, No. NRF-2016R1C1B2016072), the ICT R&D program of MSIP/IITP (No. 2017-0-00142), and the R&D program of MOTIE/KEIT (No. 10077609).

REFERENCES
[1] J. T. Adriaens et al. The case for GPGPU spatial multitasking. In IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1–12, Feb. 2012.
[2] A. Bakhoda et al. Analyzing CUDA workloads using a detailed GPU simulator. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 163–174, 2009.
[3] M. Butler, L. Barnes, D. D. Sarma, and B. Gelinas. Bulldozer: An approach to multithreaded compute performance. IEEE Micro, 31(2):6–15, March 2011.
[4] S. Che et al. Rodinia: A benchmark suite for heterogeneous computing. In Proc. of the IEEE International Symposium on Workload Characterization (IISWC), pages 44–54, 2009.
[5] B. He et al. Mars: A MapReduce framework on graphics processors. In Proc. of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 260–269, 2008.
[6] KHRONOS Group. OpenCL – the open standard for parallel programming of heterogeneous systems, 2010.
[7] J. Kim et al. Efficient GPU multitasking with latency minimization and cache boosting. IEICE Electronics Express, 2017.
[8] R. Kumar, N. P. Jouppi, and D. M. Tullsen. Conjoined-core chip multiprocessing. In Proc. of the 37th International Symposium on Microarchitecture (MICRO), pages 195–206, Dec. 2004.
[9] E. Lindholm et al. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39–55, Mar. 2008.
[10] S. Liu et al. Operand collector architecture. US Patent 7,834,881, Nov. 16, 2010.
[11] M. Mantor and B. Sander. AMD's Radeon next generation GPU. In Hot Chips 29, Aug. 2017.
[12] NVIDIA. NVIDIA's Kepler GK110, 2012. http://www.nvidia.com/content/PDF/NVIDIA-kepler-GK110-architecture-whitepaper.pdf.
[13] NVIDIA. Sharing a GPU between MPI processes: Multi-process service (MPS) overview, 2014.
[14] NVIDIA. NVIDIA Tesla V100: The world's most advanced data center GPU, 2017. http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[15] NVIDIA. GPU Computing SDK, 2013. https://developer.nvidia.com/gpu-computing-sdk.
[16] J. J. K. Park et al. Chimera: Collaborative preemption for multitasking on a shared GPU. In Proc. of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 593–606, 2015.
[17] J. J. K. Park et al. Dynamic resource management for efficient utilization of multitasking GPUs. In Proc. of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 527–540, 2017.
[18] L.-N. Pouchet. PolyBench: The polyhedral benchmark suite, 2012. http://www.cs.ucla.edu/pouchet/software/polybench.
[19] J. A. Stratton et al. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing, 2012.
[20] Z. Wang et al. Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing. In IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 358–369, 2016.
[21] M. K. Yoon et al. Virtual thread: Maximizing thread-level parallelism beyond GPU scheduling limit. In Proc. of the 43rd Annual International Symposium on Computer Architecture (ISCA), June 2016.