Improving GPU Multitasking Efficiency using Dynamic Resource Sharing
Jiho Kim∗, Jehee Cha∗, Jason Jong Kyu Park†, Dongsuk Jeon‡ and Yongjun Park§
∗Hongik University, †University of Michigan, ‡Seoul National University, §Hanyang University
{jihokimhi, carjehee}@gmail.com, jasonjk@umich.edu, djeon1@snu.ac.kr, yongjunpark@hanyang.ac.kr

Abstract—As GPUs have become essential components of embedded computing systems, a GPU shared by multiple CPU cores must efficiently support the concurrent execution of multiple different applications. Spatial multitasking, which assigns a different number of streaming multiprocessors (SMs) to each application, is one of the most common solutions. However, it is not a panacea for maximizing total resource utilization: an SM consists of many different sub-resources such as caches, execution units, and scheduling units, and the per-kernel requirements for these sub-resources rarely match their fixed sizes inside an SM. To solve this resource requirement mismatch problem, this paper proposes GPU Weaver, a dynamic sub-resource management system for multitasking GPUs. GPU Weaver maximizes sub-resource utilization through a shared resource controller (SRC) added between neighboring SMs. The SRC dynamically identifies idle sub-resources of one SM and allows the neighboring SM to use them whenever possible. Experiments show that the combination of multiple sub-resource borrowing techniques enhances total throughput by up to 26%, and by 9.5% on average, over the baseline spatial multitasking GPU.
Index Terms—computer architecture, GPUs, multi-programmed, resource sharing, spatial multitasking

1 INTRODUCTION

Today's mobile and automotive systems require both high performance and energy efficiency to support various applications such as wireless signal processing, multimedia decoding, and augmented reality. Multicore CPUs are the most popular way to power these systems. However, heterogeneous systems, which add multiple accelerators to the traditional design, have become more promising solutions for further improvement. Graphics processing units (GPUs) are the most successful accelerators, as they deliver high throughput within a sustainable power budget and are well supported by single-instruction multiple-thread (SIMT) programming models such as CUDA [9] and OpenCL [6]. Recent mobile application processors (APs) contain multiple CPU cores and a shared GPU, so efficient GPU resource sharing is essential to enhance the performance of concurrent multikernel executions requested from different CPUs. Furthermore, efficient multitasking is expected to become even more critical, since the trend in recent mobile AP scaling is to increase the "size" of a shared GPU together with a greater "number" of CPU cores.

To meet this multitasking requirement, many approaches have been introduced by both industry and academia. Nvidia's Hyper-Q [12], Multi-Process Service [13], and AMD's Multi-Process Sharing [11] enable multiple CPU threads/processes to schedule their tasks on a single GPU, but they support multitasking in software only, or use a simple FIFO-style resource allocation policy. Several recent academic proposals have improved the efficiency of preemptive multitasking by introducing GPU-specific preemption techniques [16]. Spatial multitasking [1], which assigns a different number of streaming multiprocessors (SMs) to different applications, is one of the most common architectural solutions. Although spatial multitasking has proven quite effective, SM-level partitioning often incurs low resource utilization due to the mismatch between the requirement and the actual amount of sub-resources within an SM. For example, compute-intensive kernels assigned to several SMs may require more execution units but few memory accesses. The simultaneous multikernel (SMK) [20] technique can be a promising solution to this low sub-resource utilization problem, as it supports multikernel execution within an SM. However, efficient orchestration of multikernel execution within an SM is quite complex [17] due to inter-kernel interference.

To overcome these inefficiencies, this study proposes the GPU Weaver, a light-weight approach that maximizes resource utilization over spatial multitasking while avoiding the interference and complexity of multikernel execution. In addition to cache sharing [7], we propose a complete multitasking framework that contains three sub-resource sharing techniques covering the L1D caches, the execution units, and the scheduling structures. The purpose of this work is to solve the low sub-resource utilization problem when mapping multiple applications onto a shared GPU using spatial multitasking. We first design a configurable SM architecture that can fine-tune the amount of sub-resources to meet the requirements of the running applications, referred to as the Cooperative Streaming Multiprocessor (Co-SM). A Co-SM basically consists of a typical SM pair with a shared resource controller that performs sub-resource allocation and assignment for each SM. When two SMs inside a Co-SM execute different applications, the shared resource controller dynamically assigns an appropriate amount of each sub-resource per SM, and then recycles the under-utilized sub-resources by assigning them to the other SM whenever possible.

Manuscript submitted: 09-Jul-2018. Manuscript accepted: 06-Nov-2018. Final manuscript received: 16-Nov-2018.

2 BACKGROUND AND MOTIVATION

Basic GPU Multitasking: GPUs consist of multiple SMs (cores), memories (caches), and interconnection networks. The most common method for multitasking is spatial multitasking [1], which divides the GPU at SM granularity when mapping multiple kernels. This is the simplest multitasking method because multiple SMs on a GPU do not share kernel state, and therefore there is no need to consider interference between kernels within an SM. Though it shows considerable performance gain, sub-resource under-utilization within each SM remains, because the number of concurrent threads on an SM is limited by one of several factors such as storage (register file, shared memory) and scheduling units (PC, stack). Non-critical resources are likely to be wasted when only a single kernel runs on a single SM.

Execution Unit Resource Variance: Though SMs have fixed numbers of streaming processors (SPs) and special function units (SFUs) [9], execution unit utilization can vary widely according to the kernel characteristics. For example, execution units are idle most of the time in memory-intensive kernels, but can be the performance bottleneck in compute-intensive kernels. Fig. 1 shows the execution cycles spent in each input buffer status relative to the total execution cycles for three execution units: SP, SFU, and MEM. Empty means the target execution unit has no input values (it is idle), and proper means the unit has only one cycle's worth of input values, so the inputs can be resolved in that cycle. While these two conditions do not incur pipeline stalls, starve may incur pipeline stalls, as it means many input values are waiting to be processed at the target execution unit and only a subset can be handled in that cycle. Therefore, the portion of execution cycles spent in the starve status should be minimized.


Fig. 1. Execution unit status categorization: the components of each bar indicate the ratio of execution time spent in the Empty, Proper, and Starve status, for the BASE and 2X configurations of (a) MQ, (b) HS, and (c) BP.

Fig. 2. Utilization of in-SM memory (RF and shared memory) and scheduling structures (thread and TB) when one resource is fully occupied (Task: too small a number of total threads to saturate any limit).

Fig. 3. GPU-Weaver overview: the thread block scheduler, Cooperative Streaming Multiprocessors each pairing two SMs through a Shared Resource Controller, and the interconnection network to memory.

As shown in Fig. 1, there are a substantial number of cycles in the starve status with the baseline (base) execution unit configuration (32 SPs, 4 SFUs). Among the three compute-intensive benchmarks, MQ and BP performance is limited by the SFUs, and HS is limited by the SPs. An interesting question here is whether the ratio of the starve status can be greatly reduced if more execution resources are assigned. In order to estimate the effectiveness, we also performed the same experiment with twice the execution unit resources of the baseline configuration (2x). As shown, assigning more resources can be a successful performance enhancing approach, as the ratio of cycles in the starve status is greatly reduced. Therefore, it will be highly effective if more execution unit resources can be assigned when the input buffer is full.
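As a rough illustration of how this categorization can be derived from a per-cycle trace, the sketch below classifies one execution unit's input-buffer state from the number of ready inputs and the unit's per-cycle issue width; the trace values and function names are our own illustrative assumptions, not the simulator's actual bookkeeping.

from collections import Counter

def classify_unit_status(ready_inputs, issue_width):
    # Classify one execution unit's input-buffer status for a single cycle.
    if ready_inputs == 0:
        return "empty"    # unit is idle
    if ready_inputs <= issue_width:
        return "proper"   # all waiting inputs can be resolved this cycle
    return "starve"       # more inputs waiting than the unit can accept

def status_breakdown(per_cycle_ready, issue_width):
    # per_cycle_ready: ready-input counts for one unit (e.g., SP or SFU), one entry per cycle.
    counts = Counter(classify_unit_status(r, issue_width) for r in per_cycle_ready)
    total = sum(counts.values())
    return {s: counts[s] / total for s in ("empty", "proper", "starve")}

# Hypothetical SFU trace: doubling the issue width (4 -> 8) removes the starve cycles in this toy trace.
trace = [0, 2, 6, 7, 4, 0, 5, 3]
print(status_breakdown(trace, issue_width=4))
print(status_breakdown(trace, issue_width=8))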
Scheduling Unit Resource Variance: In order to fully exploit thread-level parallelism (TLP), the maximum allowable number of thread blocks should be assigned to each SM. The number of thread blocks assigned to each SM is determined by multiple factors [21], such as the number of issuable thread blocks and threads and the size of the in-SM memory structures (register file and shared memory). The thread block and thread number limits come from control logic complexity, while the in-SM memory limit comes from the total register and shared memory requirement of all assigned threads. Therefore, threads with a small in-SM memory footprint could be assigned to SMs up to the maximum if the issuable thread block and thread limits were increased. However, simply enlarging the thread control logic is not the best solution due to its high logic complexity. Fig. 2 depicts the relative utilization of the four possible limiting factors: register file size, shared memory size, and the numbers of threads and thread blocks. In this figure, we also categorize the applications according to their actual limiting factors. Applications in the task category do not contain enough threads to fully utilize even one limiting factor. Applications limited by threads and thread blocks, on the left side of Fig. 2, do not fully utilize the in-SM memory, and the applications on the right side do not fully utilize the scheduling structures. This means that more threads could be assigned to the SMs running the left-side applications if they could use more scheduling units, and these can be borrowed from the SMs executing the right-side applications without implementing more complex scheduling structures.
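To make the interplay of the four limiting factors concrete, the short sketch below computes the per-SM thread block limit in the usual occupancy style; the per-SM capacities and the example kernel are illustrative assumptions, not the exact configuration evaluated in this paper.

def max_thread_blocks_per_sm(tb_threads, tb_regs, tb_smem,
                             sm_max_blocks=8, sm_max_threads=1536,
                             sm_regs=32768, sm_smem=48 * 1024):
    # Return the per-SM thread block limit and the factor that binds it.
    limits = {
        "thread_block": sm_max_blocks,
        "thread": sm_max_threads // tb_threads,
        "register": sm_regs // tb_regs if tb_regs else sm_max_blocks,
        "shared_mem": sm_smem // tb_smem if tb_smem else sm_max_blocks,
    }
    factor = min(limits, key=limits.get)
    return limits[factor], factor

# A kernel with small blocks and a tiny register/shared-memory footprint is capped by the
# thread block slots (scheduling structures), leaving the in-SM memory under-utilized.
print(max_thread_blocks_per_sm(tb_threads=64, tb_regs=2048, tb_smem=0))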
Previous related works and insight: To exploit the sub-resources within an SM more efficiently, SMK [20], which schedules multiple kernels on the same SM, has been proposed. However, sub-resource interference between multiple kernels is hard to eliminate when kernels of the same type run concurrently or when unpredictable resource usage patterns appear [17]. To avoid this interference, we decided to allocate multiple kernels to different SM subsets, and to borrow under-utilized sub-resources from the other SM subset only when possible. To achieve this, GPU Weaver performs conditional resource sharing according to the kernel, similar to the Conjoined-core [8] approach on the CPU side. Therefore, higher energy efficiency can be achieved by minimizing the performance loss from interference with minimal hardware overhead, on top of spatial partitioning.

3 GPU WEAVER ARCHITECTURE

The GPU Weaver is composed of multiple Co-SMs. Fig. 3 shows a high-level overview of the GPU Weaver with multiple Co-SMs. A pair of neighboring SMs forms a Co-SM. Each warp within an SM can use sub-resources of the other SM within the same Co-SM, including the scheduling structures (PC, I-buffer, stack, and scoreboard) through Scheduling-Sharing, the execution units (SP, SFU) through Ex-Sharing, and the L1 D-cache through Cache-Sharing [7], but only when the neighboring sub-resources are temporally or spatially available. A Co-SM is not a simple double-sized SM, because it supports efficient sub-resource sharing mechanisms when executing multiple kernels. Each Co-SM has a Shared Resource Controller (SRC) that allows warp instructions of one SM to use sub-resources of the other SM efficiently. In order to achieve high sub-resource utilization on Co-SMs, allocating different kernels to each SM within a Co-SM is essential. Therefore, the kernel scheduler of the GPU Weaver tries to launch different kernels on the two SMs of every Co-SM, as shown in Fig. 3.
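A minimal sketch of this launch policy under an even SM partition, assuming a flat list of resident kernels, is given below; the function and field names are ours, since the paper does not spell the scheduler out in code.

def assign_kernels_to_cosms(num_sms, kernels):
    # Pair neighboring SMs into Co-SMs and try to give the two SMs of each
    # Co-SM different kernels so that idle sub-resources can be borrowed.
    assignment = {}
    for cosm_id in range(num_sms // 2):
        sm_a, sm_b = 2 * cosm_id, 2 * cosm_id + 1
        assignment[sm_a] = kernels[cosm_id % len(kernels)]
        assignment[sm_b] = kernels[(cosm_id + 1) % len(kernels)] if len(kernels) > 1 \
            else kernels[0]
    return assignment

# Two co-running kernels on an 8-SM GPU: every Co-SM ends up hosting both kernels.
print(assign_kernels_to_cosms(8, ["MQ", "DX"]))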


Fig. 4. (a) Execution sharing flow is highlighted, and only the unidirectional case is represented. (b) The Cooperative Streaming Multiprocessor: 1) shared resources and additional features are highlighted, and 2) the execution flow of scheduling unit sharing is also provided. (Warp status legend: R: ready, S: starve, W: wait.)
3.1 Execution Unit Sharing

The purpose of execution unit sharing is to alleviate the SP and SFU shortage problem when many warp instructions are ready to execute. Fig. 4-(a) depicts the micro-architectural details of a Co-SM including an SRC for execution sharing. In Fig. 4-(a), two additional pipeline stages are shown: 1) from the dispatch units of an SM to the execution units of the other SM (pre-sharing stage) and 2) from the execution units of an SM to the writeback stage of the other SM (post-sharing stage). Because additional interconnects across SMs greatly increase wire delay, these extra stages are required to avoid critical path problems. The performance side effects of a deeper pipeline are negligible thanks to the fast context switching mechanisms of GPUs [14], as shown in Fig. 4-(a). Further, the complexity of data forwarding does not increase, because GPUs use simple in-order execution with a scoreboarding system [2]. A previous design proved the hardware-level feasibility of execution unit sharing [3], so the area overhead is acceptable and will gradually shrink as technology improves.

Issued warp instructions first stay in the Collector Units of the Operand Collector [10] until all operand values are loaded from the register file. Then, the dispatch units send ready warp instructions to the appropriate execution units for calculation. At this point, if there are more ready warp instructions than available execution units, the excess instructions are executed later in conventional GPUs. To resolve this problem, the SRC first checks the Collector Unit status reported by the dispatch units and the number of issuable execution units of both SMs within a Co-SM (Fig. 4-(a) 1).

If one SM has many ready warp instructions and the other SM has idle SP or SFU units, the SRC determines the number of migratable warp instructions that avoids write-back stalls by checking the write-back bus of the SM. It then allows the dispatch units of that SM to migrate those instructions to the other SM's SP or SFU units through the additional interconnects, and the instructions are marked as "Shared" (Fig. 4-(a) 2). When the execution of the migrated warp instructions is done, the results are returned to the write-back stage of the source SM using the saved "Shared" bit information (Fig. 4-(a) 3).
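The SRC's per-cycle decision can be summarized by the following sketch. It is our interpretation of the flow above with illustrative state names and a simplified write-back check, not the exact hardware logic.

def migratable_count(surplus_ready, idle_remote_units, free_wb_slots):
    # Instructions one SM may ship to its neighbor this cycle: bounded by the
    # neighbor's idle units of that type and by the source SM's free write-back slots.
    return max(0, min(surplus_ready, idle_remote_units, free_wb_slots))

def src_share(sm0, sm1):
    # sm0/sm1: per-SM state with per-type 'ready' and 'idle' counts plus free 'wb' slots.
    decisions = []
    for src, dst in ((sm0, sm1), (sm1, sm0)):
        for unit in ("SP", "SFU"):
            surplus = src["ready"][unit] - src["idle"][unit]  # what the source cannot issue itself
            n = migratable_count(surplus, dst["idle"][unit], src["wb"])
            if n > 0:
                decisions.append((src["id"], dst["id"], unit, n))  # mark these as "Shared"
    return decisions

sm0 = {"id": 0, "ready": {"SP": 6, "SFU": 0}, "idle": {"SP": 2, "SFU": 4}, "wb": 2}
sm1 = {"id": 1, "ready": {"SP": 0, "SFU": 3}, "idle": {"SP": 4, "SFU": 4}, "wb": 2}
print(src_share(sm0, sm1))  # SM 0 borrows two SP slots from SM 1; nothing flows the other way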
3.2 Scheduling Unit Sharing

In a GPU Weaver system, scheduling unit sharing resolves the TLP limitation by leveraging idle scheduling units of the other SM, without context saving or additional storage overhead. Algorithm 1 shows how thread blocks are issued under the scheduling unit sharing technique. When an SM has a kernel to execute, thread blocks are first allocated to the original SM as long as enough sub-resources are available there (lines 2-3). If the kernel still has remaining thread blocks and all the sub-resources other than the scheduling units can be secured, the thread block scheduler issues additional thread blocks to the other SM whenever possible (lines 4-5).

Algorithm 1 Thread Block Scheduling
Input: KernelList (with initial context information)
Output: KernelList (with updated context information)
1: Kernel = SM.getKernel();
2: if (Kernel.hasTBtoLaunch()) and (!SM.fullOrigin()) then
3:   IssueBlock_OriginSM();
4: else if (Kernel.hasTBtoLaunch()) and (!NeighborSM.fullShared()) then
5:   IssueBlock_NeighborSM();
6: end if
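A runnable interpretation of Algorithm 1 is given below; the resource-check helpers stand in for the scheduler state that the algorithm assumes, and the slot counts are illustrative.

class ToySM:
    # Toy SM model tracking how many of its scheduling slots are taken by its own
    # (origin) thread blocks and by shared blocks spilled over from the neighbor.
    def __init__(self, sched_slots):
        self.sched_slots = sched_slots
        self.origin_blocks = 0
        self.shared_blocks = 0

    def full(self):
        return self.origin_blocks + self.shared_blocks >= self.sched_slots

def issue_thread_block(kernel_has_tb, sm, neighbor):
    # One step of Algorithm 1: prefer the origin SM, spill to the neighbor when possible.
    if kernel_has_tb and not sm.full():
        sm.origin_blocks += 1
        return "origin"
    if kernel_has_tb and not neighbor.full():
        neighbor.shared_blocks += 1
        return "neighbor"
    return "stalled"

sm0, sm1 = ToySM(sched_slots=2), ToySM(sched_slots=2)
print([issue_thread_block(True, sm0, sm1) for _ in range(5)])
# ['origin', 'origin', 'neighbor', 'neighbor', 'stalled']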
A detailed execution flow of scheduling-unit-shared warp instructions of kernel A on the neighboring SM is shown in Fig. 4-(b). In this scenario, we assume that kernel A is scheduling-limited and kernel B is not limited by the scheduling structures. 1) Some thread blocks of kernel A (Shared TBs) are first allocated to SM 1, with Shared bits set as the sharing indicator by the thread block scheduler. 2) The warps from Shared TBs fetch their instructions from the I-cache of SM 1, and the instructions are placed in their own I-buffer entries based on the original WarpIDs given by SM 1. 3) The warp scheduler in SM 1 issues warp instructions for both kernel A and kernel B using its own stack, barrier control logic, and scoreboard structure with shared-bit indication. 4) Kernel B's warp instructions are issued to its own Collector Units to access the SM 1 register file. 5) However, warp instructions of Shared TBs must be issued to SM 0 for their register file access. When warp instructions of Shared TBs access the SM 0 register file, their WarpIDs must be transformed, similar to VWI (Virtual Warp Id) in [21], to avoid conflicts between the WarpIDs of the two SMs. To prevent two warps from accessing the same register locations, we transform a WarpID by adding MaxWarpID minus the neighbor kernel's maximum warp count (MaxWarpID is 48 in our configuration). Thus, a shared warp assigned ID 24 in the neighbor SM (the bottom SM in Fig. 4-(b)), whose neighbor kernel has at most 16 warps, changes its WarpID to 24 + 48 - 16 = 56 when issued to the original SM. 6) Using the transformed WarpIDs, Shared TB instructions can access their own registers in SM 0. After fetching their operands from the original SM's register file and executing, the results of shared warp instructions must be written back to the original register file. 7) However, the release of the scoreboard entry must be done in the neighbor SM where it was initially assigned, so that the next instruction can make progress.
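The WarpID transformation reduces to a single additive offset; the helper below reproduces the 24 -> 56 example (the constant and parameter names are ours).

MAX_WARP_ID = 48  # per-SM warp slots in the evaluated configuration

def transform_warp_id(shared_warp_id, neighbor_kernel_max_warps):
    # Map a shared warp's ID in the neighbor SM to a conflict-free ID in the
    # original SM's register file, in the spirit of Virtual Warp IDs [21].
    return shared_warp_id + MAX_WARP_ID - neighbor_kernel_max_warps

assert transform_warp_id(24, 16) == 56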
3.3 Runtime Overhead

A GPU Weaver has two kinds of runtime performance overhead. First, a kernel newly launched onto a target SM may need to wait for some period of time when it cannot secure enough scheduling structures, because some of them are being used by the other kernel running on the other SM via Scheduling-Sharing. The second overhead comes from interference between the techniques themselves. Scheduling sharing enables more threads to run on the execution units within an SM, thereby decreasing the chance of the neighbor kernel's execution sharing. Cache sharing likewise reduces the idle time of execution units by improving kernel performance through a higher cache hit ratio. However, these overheads are small, as shown in Section 5.

Table 1. Specification list of benchmarks
Benchmark | Kernel Type | Limiting Sub-Unit | Cache Type
Hotspot (HS) [4] | Com | SP | Insensitive
Backprop (BP) [4] | Com | SP/SFU | Sen/Insensitive
Dxtc (DX) [15] | Sched | Sched | Insensitive
StringMatch (SM) [5] | Mem | LD/ST | Sensitive
Reduction (RD) [15] | Sched | Sched | Insensitive
Mri-q (MQ) [19] | Com | SFU | Insensitive
ThreadFence Reduction (TF) [15] | Sched | Sched | Insensitive
3DConvolution (3D) [18] | Mem | LD/ST | Sensitive
Fdtd-2d (FD) [18] | Mem | LD/ST | Sensitive
4 EXPERIMENTAL RESULTS

We used a GTX480-like modified GPGPU-Sim v3.2.2 [2] that supports concurrent execution of multi-programmed kernels [16]. Smart Even spatial multitasking [1] with Drain preemption [16] is used as the baseline, and SMK [17] is also used in our evaluation. On top of the baseline, we modified the SM structures to implement Co-SMs, each consisting of two SMs and an SRC. For scheduling unit sharing, the sizes of the I-cache and the decoder are doubled to minimize interference between the kernels; the additional hardware cost is small. We also added the two extra pipeline stages to avoid the increased critical path delay caused by the expensive wiring for execution sharing. We employ the widely used re-run based simulation methodology [16] for fair performance comparison. We considered various kernels from Rodinia [4], Parboil [19], Polybench [18], Mars [5], and the CUDA SDK [15]. Table 1 shows the benchmark list along with several important parameters. We evaluated all 36 pairs of the nine workloads, which are classified by various kernel characteristics (compute-intensive, cache-sensitive, and scheduling unit limited). We used average normalized turnaround time (ANTT) and system throughput (STP) as performance metrics [16]. In general, ANTT reflects the user-perceived response time, and STP represents the overall progress of the system.
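For reference, ANTT and STP are conventionally computed from each kernel's isolated and co-scheduled execution times; a minimal sketch, assuming the standard definitions behind the metrics in [16], follows.

def antt_stp(isolated_cycles, shared_cycles):
    # isolated_cycles[i] / shared_cycles[i]: cycles of kernel i when it runs alone
    # vs. when it co-runs with the other kernel of the pair.
    slowdowns = [s / i for i, s in zip(isolated_cycles, shared_cycles)]
    antt = sum(slowdowns) / len(slowdowns)                            # lower is better
    stp = sum(i / s for i, s in zip(isolated_cycles, shared_cycles))  # higher is better
    return antt, stp

# Example pair: kernel 0 is slowed down by 25% and kernel 1 by about 11% when co-scheduled.
print(antt_stp([1000, 900], [1250, 1000]))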


Fig. 5. (a) Total ANTT reduction and (b) total STP improvement for all workload pairs (Spatial, SMK, GPU-Weaver; the rightmost bars are the geometric mean).
Fig. 5-(a) and 5-(b) show the ANTT and STP results of the GPU Weaver, respectively, compared against baseline SM-level spatial multitasking and SMK. Fig. 5-(a) shows that GPU Weaver successfully reduces the ANTT of most workload combinations, by up to 21.5%. Fig. 5-(b) shows that GPU Weaver achieves up to a 26% STP improvement over simple spatial multitasking. On average, GPU Weaver shows a better performance gain, a 7.4% ANTT reduction and a 9.5% STP improvement, than SMK (-0.02% ANTT reduction and 7.9% STP improvement). Individually, the techniques contribute 3.8% (Ex-sharing), 2% (Cache-sharing), and 2.8% (Sched-sharing) to the ANTT reduction, and 3.9% (Ex-sharing), 4% (Cache-sharing), and 2% (Sched-sharing) to the STP improvement. The results show that GPU Weaver can share sub-resources about as efficiently as SMK while successfully minimizing the sub-resource interference problem. For example, SMK shows a low ANTT reduction for workload pairs containing MQ, because MQ requires high SFU utilization and the other kernels cannot fully utilize the SFU units. In contrast, GPU Weaver preserves fair performance for the other kernels (BP/FD/RD/DX/TF) because execution sharing for MQ is allowed only when they are not using their own SFUs.

Fig. 6. 1) Delay cycle ratio of (a) kernel launch; 2) execution unit sharing access ratio of the compute-intensive kernels (MQ, HS, BP) in (b) compute + scheduling workload cases and (c) compute + cache workload cases.
5 RUNTIME AND HARDWARE OVERHEAD ESTIMATION

As mentioned in Section 3.3, GPU Weaver has two runtime overheads: kernel launch delay and interference between the sharing techniques. In this evaluation, only the interference between execution sharing and the other sharing techniques is considered, because execution unit sharing performance is expected to depend mainly on the behavior of the other kernel. Fig. 6-(a) shows the ratio of kernel launch delay caused by scheduling sharing to the total execution cycles. In the (BP+DX) case, the BP kernel needs to wait for 4.4% of the total execution cycles because of the long thread-block execution time of DX under scheduling sharing. However, the overall kernel launch delay overhead is small, as it is below 0.3% for all other cases, with an average of 0.048%. Fig. 6-(b) and (c) show the ratio of shared execution unit accesses to all execution unit accesses when 1) only execution unit sharing is applied (Exonly) and 2) all sharing techniques are applied (GW), for workload pairs containing compute-intensive workloads. For all cases in Fig. 6-(b), GPU Weaver shows slightly fewer execution-sharing accesses than the Exonly case. Similarly, Fig. 6-(c) shows that the impact of cache sharing on execution sharing is minimal, as the decrease in shared execution unit accesses is negligible.

The additional hardware cost per SM is estimated as follows: a) execution unit sharing: six 2-input muxes and pipeline registers for the pre-sharing stage and three 2-input muxes and pipeline registers for the post-sharing stage (buffers are added to meet timing); b) cache sharing: 64 2-input muxes (all cache lines for two sharing ways) and a 1-bit register (sensitivity); c) scheduling unit sharing: a 77-bit I-fetch buffer, an additional instruction decoder, 96 1-bit registers (shared bits), and 3 6-bit registers (WarpID transformation). In addition to the above costs, additional control logic is also required. In order to estimate the total hardware overhead of GPU Weaver, we synthesized the hardware logic using a 40nm low-power standard cell library at a 1GHz target frequency. Based on the synthesis and P&R results, the additional hardware logic occupies 5.27mm2, only 0.997% of the total area of the baseline GPU (529mm2) [7].

6 CONCLUSION

In this paper, we proposed the GPU Weaver, a dynamic sub-resource management system for spatial multitasking GPUs. It enhances total performance by orchestrating three sharing techniques, covering execution units, scheduling units, and L1 caches, on top of baseline GPU spatial multitasking. Execution unit sharing executes ready warp instructions using the execution units of a neighbor SM when possible. Scheduling unit sharing maximizes thread-level parallelism by allocating additional thread blocks using the scheduling structures of a neighbor SM whenever possible. Evaluations show that GPU Weaver can improve ANTT and STP by up to 21.5% and 26%, respectively.

7 ACKNOWLEDGMENTS

This work was supported in part by the National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIP) (No. NRF-2015R1C1A1A01053844, No. NRF-2016R1C1B2016072), the ICT R&D program of MSIP/IITP (No. 2017-0-00142), and the R&D program of MOTIE/KEIT (No. 10077609).

REFERENCES

[1] J. T. Adriaens et al. The case for GPGPU spatial multitasking. In IEEE International Symposium on High-Performance Computer Architecture, pages 1-12, Feb 2012.
[2] A. Bakhoda et al. Analyzing CUDA workloads using a detailed GPU simulator. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 163-174, 2009.
[3] M. Butler, L. Barnes, D. D. Sarma, and B. Gelinas. Bulldozer: An approach to multithreaded compute performance. IEEE Micro, 31(2):6-15, March 2011.
[4] S. Che et al. Rodinia: A benchmark suite for heterogeneous computing. In Proc. of the IEEE Symposium on Workload Characterization, pages 44-54, 2009.
[5] B. He et al. Mars: A MapReduce framework on graphics processors. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT '08, pages 260-269, New York, NY, USA, 2008. ACM.
[6] KHRONOS Group. OpenCL - the open standard for parallel programming of heterogeneous systems, 2010.
[7] J. Kim et al. Efficient GPU multitasking with latency minimization and cache boosting. IEICE Electronics Express, advance publication, 2017.
[8] R. Kumar, N. P. Jouppi, and D. M. Tullsen. Conjoined-core chip multiprocessing. In 37th International Symposium on Microarchitecture (MICRO-37), pages 195-206, Dec 2004.
[9] E. Lindholm et al. Nvidia Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39-55, Mar. 2008.
[10] S. Liu et al. Operand collector architecture, Nov. 16 2010. US Patent 7,834,881.
[11] M. Mantor and B. Sander. AMD's Radeon next generation GPU. In Hot Chips 29, Aug. 2017.
[12] NVIDIA. NVIDIA's Kepler GK110 architecture whitepaper, 2012. http://www.nvidia.com/content/PDF/NVIDIA-kepler-GK110-architecture-whitepaper.pdf.
[13] NVIDIA. Sharing a GPU between MPI processes: Multi-Process Service (MPS) overview, 2014.
[14] NVIDIA. NVIDIA Tesla V100: the world's most advanced data center GPU, 2017. http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[15] NVIDIA. GPU Computing SDK. Available at: https://developer.nvidia.com/gpu-computing-sdk, 2013.
[16] J. J. K. Park et al. Chimera: Collaborative preemption for multitasking on a shared GPU. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '15, pages 593-606, New York, NY, USA, 2015. ACM.
[17] J. J. K. Park et al. Dynamic resource management for efficient utilization of multitasking GPUs. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '17, pages 527-540, New York, NY, USA, 2017. ACM.
[18] L.-N. Pouchet. PolyBench: The polyhedral benchmark suite. http://www.cs.ucla.edu/pouchet/software/polybench, 2012.
[19] J. A. Stratton et al. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing, 127, 2012.
[20] Z. Wang et al. Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 358-369. IEEE, 2016.
[21] M. K. Yoon et al. Virtual thread: Maximizing thread-level parallelism beyond GPU scheduling limit. In Proc. of the 43rd Annual International Symposium on Computer Architecture, June 2016.
