Predictive Warp Scheduling for Efficient Execution in GPGPU
Abhinish Anand
MTP Phase-1
Guide: Prof. Virendra Singh
Outline
Introduction
GPU Architecture
Bottlenecks in GPGPU
Literature review
Observation & Motivation
Proposed approach
Experimental results
Future work
Figure: GPGPU
[2] Onur Kayıran, Adwait Jog, Mahmut T. Kandemir, and Chita R. Das. "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs". Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
Bottlenecks in GPGPU
[1] Yi Yang, Ping Xiang, Mike Mantor, and Huiyang Zhou. "CPU-Assisted GPGPU on Fused CPU-GPU Architectures". IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2012.
Literature review
Two-Level Warp Scheduling
Warps on an SM are partitioned into fetch groups, and only the active fetch group is scheduled (round-robin within it). When all warps of the active group stall on long-latency operations, the next group becomes active, hiding memory latency while preserving locality. A minimal sketch follows the citation below.
[5] Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling". 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011.
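To make this concrete, below is a minimal C++ sketch of the two-level pick loop, assuming a fixed fetch-group size and plain round-robin within the active group; the class and field names are illustrative, not taken from the implementation in [5].

#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative two-level warp scheduler in the spirit of [5]: warps are
// partitioned into fetch groups, and only the active group competes for
// issue (round-robin within it). When every warp of the active group is
// stalled on a long-latency operation, the next group becomes active, so
// the memory latency of one group overlaps with the compute of another.
struct WarpState {
    bool ready   = false;  // has an issuable instruction this cycle
    bool stalled = false;  // waiting on a long-latency (memory) operation
};

class TwoLevelScheduler {
public:
    TwoLevelScheduler(std::size_t n_warps, std::size_t group_size)
        : warps(n_warps), group_size_(group_size) {}

    // Returns the warp index to issue this cycle, or -1 if none can issue.
    int pick() {
        const std::size_t n_groups =
            (warps.size() + group_size_ - 1) / group_size_;
        // Rotate past fully stalled groups (at most one full rotation).
        for (std::size_t t = 0; t < n_groups && group_all_stalled(active_); ++t)
            active_ = (active_ + 1) % n_groups;

        const std::size_t base = active_ * group_size_;
        const std::size_t len  = std::min(group_size_, warps.size() - base);
        for (std::size_t i = 0; i < len; ++i) {  // round-robin in the group
            const std::size_t w = base + (rr_ + i) % len;
            if (warps[w].ready && !warps[w].stalled) {
                rr_ = (w - base + 1) % len;
                return static_cast<int>(w);
            }
        }
        return -1;
    }

    std::vector<WarpState> warps;

private:
    bool group_all_stalled(std::size_t g) const {
        const std::size_t base = g * group_size_;
        const std::size_t len  = std::min(group_size_, warps.size() - base);
        for (std::size_t i = 0; i < len; ++i)
            if (!warps[base + i].stalled) return false;
        return true;
    }

    std::size_t group_size_;
    std::size_t active_ = 0;
    std::size_t rr_     = 0;
};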
Literature review
Equalizer
Problem: While threads wait to access the bottleneck resource, the other resources sit under-utilized, leading to inefficient execution.
Solution: Save energy by lowering the frequency of the under-utilized resource (memory system or SM) with minimal performance loss; gain performance by raising the frequency of the highly-utilized resource and by modulating the number of active threads for efficient execution. A sketch of this decision step follows the citation below.
[8] Ankit Sethia and Scott Mahlke. "Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution". 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.
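Purely as an illustration, the per-epoch decision can be pictured as below; the utilization counters, thresholds, and action names are assumptions of this sketch, not the actual hardware monitors or actuation mechanism of [8].

// Illustrative Equalizer-style [8] decision, evaluated once per epoch:
// shift frequency (and the active thread count) toward whichever
// resource is the bottleneck. All names and thresholds are assumed.
enum class Action { BoostSM, BoostMem, SaveEnergy, NoChange };

struct EpochStats {
    double sm_util;   // fraction of cycles the SM issued an instruction
    double mem_util;  // fraction of cycles the memory system was busy
};

Action decide(const EpochStats& s, double high = 0.9, double low = 0.5) {
    if (s.sm_util > high && s.mem_util < low)
        return Action::BoostSM;     // compute-bound: raise SM frequency
    if (s.mem_util > high && s.sm_util < low)
        return Action::BoostMem;    // memory-bound: raise memory frequency,
                                    // lower SM frequency, trim active threads
    if (s.sm_util < low && s.mem_util < low)
        return Action::SaveEnergy;  // both under-utilized: lower both frequencies
    return Action::NoChange;
}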
Literature review
Neither More Nor Less
For memory-intensive applications, the core spends most of its cycles fetching data from memory [2].
The large number of memory requests creates contention in the caches, interconnect, and memory, leading to long stalls at the cores.
The best choice is to execute the optimal number of CTAs for each application.
The optimal number of CTAs per SM could be found by evaluating every possible CTA count that can be assigned to an SM for each application, but such exhaustive per-application analysis is impractical.
Idea: dynamically modulate the number of CTAs on each core using the CTA scheduler; a simplified sketch follows the citation below.
[2] Onur Kayıran, Adwait Jog, Mahmut T. Kandemir, and Chita R. Das. "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs". Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
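A simplified C++ sketch of this modulation loop, reusing the epoch, t_idle, t_mem_l, and t_mem_h parameter names that appear with the results figure later in this report; the decision rules in [2] are more detailed than shown here.

// Simplified DYNCTA-style [2] controller, invoked at every epoch boundary.
// c_idle: cycles in which the core had no ready warp (but was not
//         memory-blocked); c_mem: cycles in which all warps waited on memory.
struct CtaController {
    unsigned n_ctas;    // CTAs currently allowed to run on this SM
    unsigned max_ctas;  // per-SM hardware limit

    void on_epoch_end(unsigned c_idle, unsigned c_mem,
                      unsigned t_idle, unsigned t_mem_l, unsigned t_mem_h) {
        if (c_mem > t_mem_h) {
            if (n_ctas > 1) --n_ctas;          // heavy contention: shrink TLP
        } else if (c_mem < t_mem_l && c_idle > t_idle) {
            if (n_ctas < max_ctas) ++n_ctas;   // starving for work: grow TLP
        }
        // Otherwise the current CTA count is kept.
    }
};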
Observation
Pausing only the most recently assigned CTA when memory stalls rise sharply cannot always reduce memory congestion.
The paused, most recently assigned CTA may already have issued its requests, while the other, unpaused CTAs keep generating further load requests.
Pausing all the warps of a CTA is also inefficient: some of its warps may be ready to execute, and pausing them reduces TLP.
Moreover, once DRAM serves the requests of the most recently assigned CTA, that CTA still cannot make progress because it remains paused.
Figure: Count of load operations issued by each CTA of each SM in the vectorAdd application.
Figure: Number of cycles in which the most recently assigned CTA has already issued its load requests while other, unpaused CTAs will issue further load requests in the near future.
Motivation-1
Figure: Fraction of total cycles in which all warps are waiting for their data to return.
Track congestion via per-SM stall cycles, i.e., cycles in which the core is stalled because all of its warps are waiting for their data to return.
When memory congestion exceeds the threshold, the SM enters the paused state:
Pause the warps that are about to issue memory requests.
Pause only those warps whose requests are predicted to miss in the L1 cache.
Predict a warp's hit or miss from the hit-miss status of the previous warps at the same PC.
While the SM is paused, track the number of pending DRAM requests; when it drops below a threshold, switch the SM back to the unpaused state.
In the unpaused state, all warps of the SM are scheduled without any blocking. A minimal sketch of this state machine follows.
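The sketch below assumes the counters are sampled once per epoch; the threshold names stall_high and pending_low are illustrative.

#include <cstdint>

enum class SmState { Unpaused, Paused };

struct PauseController {
    SmState state = SmState::Unpaused;

    // Sampled once per epoch with the measured per-SM counters.
    void update(uint32_t stall_cycles,   // cycles all warps waited on memory
                uint32_t pending_dram,   // outstanding DRAM requests
                uint32_t stall_high,     // congestion threshold -> pause
                uint32_t pending_low) {  // drain threshold -> unpause
        if (state == SmState::Unpaused && stall_cycles > stall_high)
            state = SmState::Paused;
        else if (state == SmState::Paused && pending_dram < pending_low)
            state = SmState::Unpaused;
    }

    // Issue gate consulted by the warp scheduler every cycle: while paused,
    // block only warps whose next instruction is a load that the predictor
    // expects to miss in the L1; every other warp issues normally, so TLP
    // is preserved for ready warps.
    bool may_issue(bool is_load, bool predicted_l1_miss) const {
        if (state == SmState::Unpaused) return true;
        return !(is_load && predicted_l1_miss);
    }
};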
Predictor Table
PC (8 bit) | CTA id (3 bit) | Miss counter (6 bit) | Access bit (1 bit)
The access bit of a row is set whenever any warp updates the table for the row's PC and CTA id.
The access bit is reset at the end of every epoch. A row whose access bit is still clear at the epoch boundary was not used by any warp during the last epoch, meaning all warps have already executed past that PC.
Such rows can therefore be cleared to make room for newer entries.
When a CTA exits after completing its execution, all rows belonging to that CTA are cleared.
The predictor table is not updated while the SM is in the paused state. A sketch of this table follows.
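In the C++ sketch below, the field widths follow the table above (8-bit PC tag, 3-bit CTA id, 6-bit miss counter, 1-bit access bit); the number of rows, the fully-associative lookup, and the miss-prediction threshold are assumptions of this sketch.

#include <cstddef>
#include <cstdint>
#include <vector>

struct PredictorRow {
    uint8_t pc_tag   = 0;      // low 8 bits of the load PC
    uint8_t cta_id   = 0;      // 3-bit CTA id
    uint8_t miss_ctr = 0;      // 6-bit saturating miss counter (0..63)
    bool    accessed = false;  // set on update, cleared every epoch
    bool    valid    = false;
};

class PredictorTable {
public:
    explicit PredictorTable(std::size_t rows) : table_(rows) {}

    // Record the L1 hit/miss outcome of a warp's load. The table is not
    // updated while the SM is in the paused state.
    void update(uint8_t pc_tag, uint8_t cta_id, bool miss, bool sm_paused) {
        if (sm_paused) return;
        PredictorRow* r = find_or_alloc(pc_tag, cta_id);
        if (!r) return;  // table full: drop the update
        if (miss && r->miss_ctr < 63) ++r->miss_ctr;
        else if (!miss && r->miss_ctr > 0) --r->miss_ctr;
        r->accessed = true;
    }

    // Predict the next warp at this (PC, CTA id) from earlier warps'
    // hit-miss history; an unknown PC is optimistically assumed to hit.
    bool predict_miss(uint8_t pc_tag, uint8_t cta_id) const {
        for (const PredictorRow& r : table_)
            if (r.valid && r.pc_tag == pc_tag && r.cta_id == cta_id)
                return r.miss_ctr >= 32;  // illustrative threshold (MSB set)
        return false;
    }

    // Epoch boundary: evict rows that no warp touched during the epoch
    // (all warps have executed past that PC); re-arm the rest.
    void on_epoch_end() {
        for (PredictorRow& r : table_) {
            if (!r.accessed) r.valid = false;
            r.accessed = false;
        }
    }

    // A CTA completed execution: clear every row belonging to it.
    void on_cta_exit(uint8_t cta_id) {
        for (PredictorRow& r : table_)
            if (r.valid && r.cta_id == cta_id) r.valid = false;
    }

private:
    PredictorRow* find_or_alloc(uint8_t pc_tag, uint8_t cta_id) {
        for (PredictorRow& r : table_)
            if (r.valid && r.pc_tag == pc_tag && r.cta_id == cta_id)
                return &r;
        for (PredictorRow& r : table_)
            if (!r.valid) {
                r.pc_tag = pc_tag; r.cta_id = cta_id;
                r.miss_ctr = 0; r.accessed = false; r.valid = true;
                return &r;
            }
        return nullptr;
    }

    std::vector<PredictorRow> table_;
};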
SM configuration
No. of SMs: 15 clusters, 1 SM per cluster
SM resources: 1.4 GHz, 32 SIMT width, 48 KB shared memory, max. 1536 threads (48 warps/SM, 32 threads/warp), 32768 registers/SM
Scheduler: 2 warp schedulers per SM, LRR policy
L1 data cache: 32 sets, 128 B block size, 4-way set associative
LLC: 64 sets, 128 B block size, 6-way set associative
DRAM configuration
DRAM scheduler: FR-FCFS
DRAM capacity: 6 memory channels/memory controllers (MC), 16 banks/MC, 4 KB row size/bank, 32 columns/row
Figure: Normalized IPC of dynamic CTA scheduling w.r.t. two-level scheduling
(dyncta400: epoch = 400, t_idle = 50, t_mem_l = 2800, t_mem_h = 3200)
(dyncta1000: epoch = 1000, t_idle = 50, t_mem_l = 4000, t_mem_h = 5000)
References
[1] Yi Yang, Ping Xiang, Mike Mantor, and Huiyang Zhou. "CPU-Assisted GPGPU on Fused CPU-GPU Architectures". IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2012.
[2] Onur Kayıran, Adwait Jog, Mahmut T. Kandemir, and Chita R. Das. "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs". Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013.
[3] Gunjae Koo, Hyeran Jeon, Zhenhong Liu, Nam Sung Kim, and Murali Annavaram. "CTA-Aware Prefetching and Scheduling for GPU". IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018.
[4] Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow". 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007.
[5] Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling". 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011.
[6] Farzad Khorasani, Hodjat Asghari Esfeden, Amin Farmahini-Farahani, Nuwan Jayasena, and Vivek Sarkar. "RegMutex: Inter-Warp GPU Register Time-Sharing". ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018.
[7] Gunjae Koo, Hyeran Jeon, Zhenhong Liu, Nam Sung Kim, and Murali Annavaram. "CTA-Aware Prefetching and Scheduling for GPU". IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018.
[8] Ankit Sethia and Scott Mahlke. "Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution". 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.
[9] Jacob T. Adriaens, Katherine Compton, Nam Sung Kim, and Michael J. Schulte. "The Case for GPGPU Spatial Multitasking". IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2012.
[10] Adwait Jog, Onur Kayiran, Tuba Kesten, Ashutosh Pattnaik, Evgeny Bolotin, Niladrish Chatterjee, Stephen W. Keckler, Mahmut T. Kandemir, and Chita R. Das. "Anatomy of GPU Memory System for Multi-Application Execution". Proceedings of the 2015 International Symposium on Memory Systems (MEMSYS), 2015.
[11] Adwait Jog, Evgeny Bolotin, Zvika Guz, Stephen W. Keckler, Mahmut T. Kandemir, Mike Parker, and Chita R. Das. "Application-Aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications". Proceedings of the Workshop on General Purpose Processing Using GPUs (GPGPU), 2014.
[12] Zhen Lin, Hongwen Dai, Michael Mantor, and Huiyang Zhou. "Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution". ACM Transactions on Architecture and Code Optimization (TACO), 2019.
[13] Lingyuan Wang, Miaoqing Huang, and Tarek El-Ghazawi. "Exploiting Concurrent Kernel Execution on Graphic Processing Units". International Conference on High Performance Computing & Simulation (HPCS), 2011.
Thank You