Predictive Warp Scheduling for Efficient Execution in GPGPU
Abhinish Anand
MTP Phase-1
Guide: Prof. Virendra Singh
Outline
Introduction
GPU Architecture
Bottlenecks in GPGPU
Literature review
Observation & Motivation
Proposed approach
Experimental results
Future work
Figure: GPGPU
[2] Onur Kayıran, Adwait Jog, Mahmut T. Kandemir, and Chita R. Das. "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs". Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
Bottlenecks in GPGPU
[1] Yi Yang, Ping Xiang, Mike Mantor, and Huiyang Zhou. "CPU-Assisted GPGPU on Fused CPU-GPU Architectures". IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2012.
Literature review
Two-Level Warp Scheduling
Warps on an SM are partitioned into fetch groups, and only the active fetch group is scheduled (round-robin within it). When all warps of the active group stall on long-latency operations, the next group becomes active, hiding memory latency while preserving locality. A minimal sketch follows the citation below.
[5] Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling". 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011.
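To make this concrete, below is a minimal C++ sketch of the two-level pick loop, assuming a fixed fetch-group size and plain round-robin within the active group; the class and field names are illustrative, not taken from the implementation in [5].

#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative two-level warp scheduler in the spirit of [5]: warps are
// partitioned into fetch groups, and only the active group competes for
// issue (round-robin within it). When every warp of the active group is
// stalled on a long-latency operation, the next group becomes active, so
// the memory latency of one group overlaps with the compute of another.
struct WarpState {
    bool ready   = false;  // has an issuable instruction this cycle
    bool stalled = false;  // waiting on a long-latency (memory) operation
};

class TwoLevelScheduler {
public:
    TwoLevelScheduler(std::size_t n_warps, std::size_t group_size)
        : warps(n_warps), group_size_(group_size) {}

    // Returns the warp index to issue this cycle, or -1 if none can issue.
    int pick() {
        const std::size_t n_groups =
            (warps.size() + group_size_ - 1) / group_size_;
        // Rotate past fully stalled groups (at most one full rotation).
        for (std::size_t t = 0; t < n_groups && group_all_stalled(active_); ++t)
            active_ = (active_ + 1) % n_groups;

        const std::size_t base = active_ * group_size_;
        const std::size_t len  = std::min(group_size_, warps.size() - base);
        for (std::size_t i = 0; i < len; ++i) {  // round-robin in the group
            const std::size_t w = base + (rr_ + i) % len;
            if (warps[w].ready && !warps[w].stalled) {
                rr_ = (w - base + 1) % len;
                return static_cast<int>(w);
            }
        }
        return -1;
    }

    std::vector<WarpState> warps;

private:
    bool group_all_stalled(std::size_t g) const {
        const std::size_t base = g * group_size_;
        const std::size_t len  = std::min(group_size_, warps.size() - base);
        for (std::size_t i = 0; i < len; ++i)
            if (!warps[base + i].stalled) return false;
        return true;
    }

    std::size_t group_size_;
    std::size_t active_ = 0;
    std::size_t rr_     = 0;
};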
Literature review
Equalizer
Problem: While threads wait to access the bottleneck resource, the other resources sit under-utilized, leading to inefficient execution.
Solution: Save energy by lowering the frequency of the under-utilized resource (memory system or SM) with minimal performance loss; gain performance by raising the frequency of the highly-utilized resource and by modulating the number of active threads for efficient execution. A sketch of this decision step follows the citation below.
[8] Ankit Sethia and Scott Mahlke. "Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution". 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.
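Purely as an illustration, the per-epoch decision can be pictured as below; the utilization counters, thresholds, and action names are assumptions of this sketch, not the actual hardware monitors or actuation mechanism of [8].

// Illustrative Equalizer-style [8] decision, evaluated once per epoch:
// shift frequency (and the active thread count) toward whichever
// resource is the bottleneck. All names and thresholds are assumed.
enum class Action { BoostSM, BoostMem, SaveEnergy, NoChange };

struct EpochStats {
    double sm_util;   // fraction of cycles the SM issued an instruction
    double mem_util;  // fraction of cycles the memory system was busy
};

Action decide(const EpochStats& s, double high = 0.9, double low = 0.5) {
    if (s.sm_util > high && s.mem_util < low)
        return Action::BoostSM;     // compute-bound: raise SM frequency
    if (s.mem_util > high && s.sm_util < low)
        return Action::BoostMem;    // memory-bound: raise memory frequency,
                                    // lower SM frequency, trim active threads
    if (s.sm_util < low && s.mem_util < low)
        return Action::SaveEnergy;  // both under-utilized: lower both frequencies
    return Action::NoChange;
}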
Literature review
Neither More Nor Less
For memory-intensive applications, the core spends most of its cycles fetching data from memory [2].
The large number of memory requests creates contention in the caches, interconnect, and memory, leading to long stalls at the cores.
The best choice is to execute the optimal number of CTAs for each application.
The optimal number of CTAs per SM could be found by evaluating every possible CTA count that can be assigned to an SM for each application, but such exhaustive per-application analysis is impractical.
Idea: dynamically modulate the number of CTAs on each core using the CTA scheduler; a simplified sketch follows the citation below.
[2] Onur Kayıran, Adwait Jog, Mahmut T. Kandemir, and Chita R. Das. "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs". Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
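A simplified C++ sketch of this modulation loop, reusing the epoch, t_idle, t_mem_l, and t_mem_h parameter names that appear with the results figure later in this report; the decision rules in [2] are more detailed than shown here.

// Simplified DYNCTA-style [2] controller, invoked at every epoch boundary.
// c_idle: cycles in which the core had no ready warp (but was not
//         memory-blocked); c_mem: cycles in which all warps waited on memory.
struct CtaController {
    unsigned n_ctas;    // CTAs currently allowed to run on this SM
    unsigned max_ctas;  // per-SM hardware limit

    void on_epoch_end(unsigned c_idle, unsigned c_mem,
                      unsigned t_idle, unsigned t_mem_l, unsigned t_mem_h) {
        if (c_mem > t_mem_h) {
            if (n_ctas > 1) --n_ctas;          // heavy contention: shrink TLP
        } else if (c_mem < t_mem_l && c_idle > t_idle) {
            if (n_ctas < max_ctas) ++n_ctas;   // starving for work: grow TLP
        }
        // Otherwise the current CTA count is kept.
    }
};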
Observation
Pausing only the most recently assigned CTA when memory stalls rise sharply cannot always reduce memory congestion.
The paused, most recently assigned CTA may already have issued its requests, while the other, unpaused CTAs keep generating further load requests.
Pausing all the warps of a CTA is also inefficient: some of its warps may be ready to execute, and pausing them reduces TLP.
Moreover, once DRAM serves the requests of the most recently assigned CTA, that CTA still cannot make progress because it remains paused.
Figure: Count of load operations issued by each CTA of each SM in the vectorAdd application.
Figure: Number of cycles in which the most recently assigned CTA has already issued its load requests while other, unpaused CTAs will issue further load requests in the near future.
Motivation-1
Figure: Fraction of total cycles in which all warps are waiting for their data to return.
Track congestion via per-SM stall cycles, i.e., cycles in which the core is stalled because all of its warps are waiting for their data to return.
When memory congestion exceeds the threshold, the SM enters the paused state:
Pause the warps that are about to issue memory requests.
Pause only those warps whose requests are predicted to miss in the L1 cache.
Predict a warp's hit or miss from the hit-miss status of the previous warps at the same PC.
While the SM is paused, track the number of pending DRAM requests; when it drops below a threshold, switch the SM back to the unpaused state.
In the unpaused state, all warps of the SM are scheduled without any blocking. A minimal sketch of this state machine follows.
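The sketch below assumes the counters are sampled once per epoch; the threshold names stall_high and pending_low are illustrative.

#include <cstdint>

enum class SmState { Unpaused, Paused };

struct PauseController {
    SmState state = SmState::Unpaused;

    // Sampled once per epoch with the measured per-SM counters.
    void update(uint32_t stall_cycles,   // cycles all warps waited on memory
                uint32_t pending_dram,   // outstanding DRAM requests
                uint32_t stall_high,     // congestion threshold -> pause
                uint32_t pending_low) {  // drain threshold -> unpause
        if (state == SmState::Unpaused && stall_cycles > stall_high)
            state = SmState::Paused;
        else if (state == SmState::Paused && pending_dram < pending_low)
            state = SmState::Unpaused;
    }

    // Issue gate consulted by the warp scheduler every cycle: while paused,
    // block only warps whose next instruction is a load that the predictor
    // expects to miss in the L1; every other warp issues normally, so TLP
    // is preserved for ready warps.
    bool may_issue(bool is_load, bool predicted_l1_miss) const {
        if (state == SmState::Unpaused) return true;
        return !(is_load && predicted_l1_miss);
    }
};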
Predictor Table
PC (8 bit) | CTA id (3 bit) | Miss counter (6 bit) | Access bit (1 bit)
The access bit of a row is set whenever any warp updates the table for the row's PC and CTA id.
The access bit is reset at the end of every epoch. A row whose access bit is still clear at the epoch boundary was not used by any warp during the last epoch, meaning all warps have already executed past that PC.
Such rows can therefore be cleared to make room for newer entries.
When a CTA exits after completing its execution, all rows belonging to that CTA are cleared.
The predictor table is not updated while the SM is in the paused state. A sketch of this table follows.
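In the C++ sketch below, the field widths follow the table above (8-bit PC tag, 3-bit CTA id, 6-bit miss counter, 1-bit access bit); the number of rows, the fully-associative lookup, and the miss-prediction threshold are assumptions of this sketch.

#include <cstddef>
#include <cstdint>
#include <vector>

struct PredictorRow {
    uint8_t pc_tag   = 0;      // low 8 bits of the load PC
    uint8_t cta_id   = 0;      // 3-bit CTA id
    uint8_t miss_ctr = 0;      // 6-bit saturating miss counter (0..63)
    bool    accessed = false;  // set on update, cleared every epoch
    bool    valid    = false;
};

class PredictorTable {
public:
    explicit PredictorTable(std::size_t rows) : table_(rows) {}

    // Record the L1 hit/miss outcome of a warp's load. The table is not
    // updated while the SM is in the paused state.
    void update(uint8_t pc_tag, uint8_t cta_id, bool miss, bool sm_paused) {
        if (sm_paused) return;
        PredictorRow* r = find_or_alloc(pc_tag, cta_id);
        if (!r) return;  // table full: drop the update
        if (miss && r->miss_ctr < 63) ++r->miss_ctr;
        else if (!miss && r->miss_ctr > 0) --r->miss_ctr;
        r->accessed = true;
    }

    // Predict the next warp at this (PC, CTA id) from earlier warps'
    // hit-miss history; an unknown PC is optimistically assumed to hit.
    bool predict_miss(uint8_t pc_tag, uint8_t cta_id) const {
        for (const PredictorRow& r : table_)
            if (r.valid && r.pc_tag == pc_tag && r.cta_id == cta_id)
                return r.miss_ctr >= 32;  // illustrative threshold (MSB set)
        return false;
    }

    // Epoch boundary: evict rows that no warp touched during the epoch
    // (all warps have executed past that PC); re-arm the rest.
    void on_epoch_end() {
        for (PredictorRow& r : table_) {
            if (!r.accessed) r.valid = false;
            r.accessed = false;
        }
    }

    // A CTA completed execution: clear every row belonging to it.
    void on_cta_exit(uint8_t cta_id) {
        for (PredictorRow& r : table_)
            if (r.valid && r.cta_id == cta_id) r.valid = false;
    }

private:
    PredictorRow* find_or_alloc(uint8_t pc_tag, uint8_t cta_id) {
        for (PredictorRow& r : table_)
            if (r.valid && r.pc_tag == pc_tag && r.cta_id == cta_id)
                return &r;
        for (PredictorRow& r : table_)
            if (!r.valid) {
                r.pc_tag = pc_tag; r.cta_id = cta_id;
                r.miss_ctr = 0; r.accessed = false; r.valid = true;
                return &r;
            }
        return nullptr;
    }

    std::vector<PredictorRow> table_;
};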
SM configuration
No. of SMs: 15 clusters, 1 SM per cluster
SM resources: 1.4 GHz, 32 SIMT width, 48 KB shared memory, max. 1536 threads (48 warps/SM, 32 threads/warp), 32768 registers/SM
Scheduler: 2 warp schedulers per SM, LRR policy
L1 data cache: 32 sets, 128 B block size, 4-way set associative
LLC: 64 sets, 128 B block size, 6-way set associative
DRAM configuration
DRAM scheduler: FR-FCFS
DRAM capacity: 6 memory channels/memory controllers (MC), 16 banks/MC, 4 KB row size/bank, 32 columns/row
Figure: Normalized IPC of dynamic CTA scheduling w.r.t. two-level scheduling
(dyncta400: epoch = 400, t_idle = 50, t_mem_l = 2800, t_mem_h = 3200)
(dyncta1000: epoch = 1000, t_idle = 50, t_mem_l = 4000, t_mem_h = 5000)
References
[1] Yi Yang, Ping Xiang, Mike Mantor, and Huiyang Zhou. "CPU-Assisted GPGPU on Fused CPU-GPU Architectures". IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2012.
[2] Onur Kayıran, Adwait Jog, Mahmut T. Kandemir, and Chita R. Das. "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs". Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
[3] Gunjae Koo, Hyeran Jeon, Zhenhong Liu, Nam Sung Kim, and Murali Annavaram. "CTA-Aware Prefetching and Scheduling for GPU". IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018.
[4] Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow". 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007.
[5] Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling". 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011.
[6] Farzad Khorasani, Hodjat Asghari Esfeden, Amin Farmahini-Farahani, Nuwan Jayasena, and Vivek Sarkar. "RegMutex: Inter-Warp GPU Register Time-Sharing". ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018.
[7] Gunjae Koo, Hyeran Jeon, Zhenhong Liu, Nam Sung Kim, and Murali Annavaram. "CTA-Aware Prefetching and Scheduling for GPU". IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018.
[8] Ankit Sethia and Scott Mahlke. "Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution". 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.
[9] Jacob T. Adriaens, Katherine Compton, Nam Sung Kim, and Michael J. Schulte. "The Case for GPGPU Spatial Multitasking". IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2012.
[10] Adwait Jog, Onur Kayiran, Tuba Kesten, Ashutosh Pattnaik, Evgeny Bolotin, Niladrish Chatterjee, Stephen W. Keckler, Mahmut T. Kandemir, and Chita R. Das. "Anatomy of GPU Memory System for Multi-Application Execution". Proceedings of the 2015 International Symposium on Memory Systems (MEMSYS), 2015.
[11] Adwait Jog, Evgeny Bolotin, Zvika Guz, Stephen W. Keckler, Mahmut T. Kandemir, Mike Parker, and Chita R. Das. "Application-Aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications". Proceedings of the Workshop on General Purpose Processing Using GPUs (GPGPU), 2014.
[12] Zhen Lin, Hongwen Dai, Michael Mantor, and Huiyang Zhou. "Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution". ACM Transactions on Architecture and Code Optimization (TACO), 2019.
[13] Lingyuan Wang, Miaoqing Huang, and Tarek El-Ghazawi. "Exploiting Concurrent Kernel Execution on Graphic Processing Units". International Conference on High Performance Computing & Simulation (HPCS), 2011.
Thank You