Modeling and Characterizing Shared and Local Memories of the Ampere GPUs
Hamdy Abdelkhalik1, Yehia Arafa1,3, Nandakishore Santhi2, Nirmal Prajapati2, and Abdel-Hameed A. Badawy1,2
1 Klipsch School of ECE, New Mexico State University, Las Cruces, NM 80003, USA
{enghamdy, yarafa, badawy}@nmsu.edu
2 Los Alamos National Laboratory, Los Alamos, NM 87545, USA
{nsanthi, prajapati}@lanl.gov
3 Qualcomm Inc, USA
ABSTRACT
The rapid evolution of GPU architectures necessitates advanced modeling techniques for optimizing and understanding their intricate functionalities. The Performance Prediction Toolkit for GPUs (PPT-GPU) is an innovative modeling tool developed to deliver detailed modeling of GPU architectures. It allows researchers and developers to understand, analyze, and predict the behavior of different GPU components under various computational loads. PPT-GPU is invaluable in enabling detailed insights into GPU architectures, such as memory performance metrics, total active cycles, and utilization of GPU resources.
This paper extends PPT-GPU to model the NVIDIA Ampere architecture. Specifically, we focus on modeling the shared memory and the register spilling represented by local memory operations. We have used several performance metrics to validate our work against Nvidia Nsight. Additionally, our refined version of PPT-GPU offers detailed metrics, capturing active cycles of functional units, an essential aspect in analyzing application performance constraints. The enhanced PPT-GPU has an average prediction error of 14.6% for total active cycles and 13% and 13.3% for the L1 and L2 hit rates, respectively.

CCS CONCEPTS
• Hardware → Hardware accelerators; • General and reference → Performance;

KEYWORDS
GPU Modeling, Ampere Architecture, PPT-GPU, GPU shared memory, GPU local memory

ACM Reference Format:
Hamdy Abdelkhalik, Yehia Arafa, Nandakishore Santhi, Nirmal Prajapati, and Abdel-Hameed A. Badawy. 2023. Modeling and Characterizing Shared and Local Memories of the Ampere GPUs. In The International Symposium on Memory Systems (MEMSYS '23), October 2–5, 2023, Alexandria, VA, USA. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
The advent of advanced Graphics Processing Units (GPUs) like the Nvidia Ampere [14] architecture has brought forth a new era of computational power and efficiency. To fully leverage these capabilities and optimize GPU-based applications, there is a need for precise modeling and understanding of such complex architectures. In this study, we have significantly enhanced a tool known as PPT-GPU [4, 6] to facilitate comprehensive modeling of the Ampere architecture, specifically focusing on shared and local memories.
The A100 GPU [13], as a representative of the cutting-edge Ampere architecture, embodies significant innovations in its on-chip memory systems and computational units. The new design of the shared memory (Fig. 1) introduced in the Ampere architecture achieves high efficiency via asynchronous copy operations, which eliminate idleness by facilitating simultaneous data transmission and computation and can bypass the L1 cache to save bandwidth and time. The A100 GPU also provides local memory as an additional resource for storing thread-specific data when register capacities are exhausted, thereby balancing data processing speed and storage requirements.
Before our enhancements, PPT-GPU accurately modeled the Volta and Turing architectures [4]. However, accurate modeling of the Ampere architecture, with its distinctive shared memory structure, demanded extensions to PPT-GPU. Initially, the instruction latencies for Ampere were estimated [2]. This was a fundamental step that set the stage for subsequent modeling efforts. With these latencies and other architecture-specific configurations as input, we model the new design of the shared and local memories using our extended PPT-GPU implementation.
We compared our results against Nvidia Nsight [15] to uphold the integrity and corroborate the efficacy of PPT-GPU. Additionally, to establish the dependability of our tool, we ran a diverse set of benchmarks, ranging from linear algebra (e.g., 2MM, BICG) to machine learning applications (e.g., Deepbench).
2 RELATED WORK
GPGPU-Sim [7], an early tool, provided cycle-level simulations for GPUs from Fermi to Pascal, but its slow speed limited its use in large-scale projects. Barra [8], meanwhile, offered functional CUDA program simulations but lacked depth in capturing architectural details. Ubal et al. presented the Multi2Sim tool [18], focusing on AMD's architecture; this tool offered a combined CPU and GPU simulation. Kerr et al. developed a model primarily for predicting the behavior of GPU-CPU workloads and systems [9].
3 PPT-GPU
The Performance Prediction Toolkit for GPUs (PPT-GPU) is an integral part of the Performance Prediction Toolkit (PPT), an open-source project [5] developed at Los Alamos National Laboratory. PPT-GPU is a specialized tool designed to facilitate in-depth modeling of complex GPU architectures and to predict their performance. It relies on SASS [12] instruction traces collected using NVBit [19] to capture the dynamic behavior of the application accurately.
We undertake an expansive enhancement of PPT-GPU in response to the need for a more intricate understanding of the Ampere architecture. In the extended PPT-GPU, we make significant adjustments to cater to the new shared memory design (see Figure 1). The updated model now recognizes that asynchronous copy operations into shared memory follow two distinct hardware paths, resulting in different SASS instructions. One path allows the data to flow from the global memory through the L1 and L2 caches before reaching the shared memory (LDGSTS.E). The other path, identified by the "Bypass" operator, facilitates direct data movement from the global memory to the shared memory (LDGSTS.BYPASS.E).
To model these operations, we collect two different memory traces for the L1 and L2 caches (in the same run) and recognize that certain operations use the L2 cache while bypassing the L1. These traces are then integrated with the traces of other standard memory operations. Subsequently, we utilize these traces to calculate the hit rates for both the L1 and L2 caches by leveraging the existing reuse distance profiling methods in PPT-GPU. We also estimate the latency of the direct path from global to shared memory via L2 to be about 170 cycles, compared to 300 cycles for the full path. However, this latency is largely hidden because memory operations overlap with computational operations.
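As an illustration of this flow, the Python sketch below splits a SASS memory trace into L1 and L2 address streams based on the asynchronous-copy variant and estimates hit rates from LRU reuse (stack) distances. The function names, the trace-record layout, and the cache parameters are illustrative assumptions rather than the actual PPT-GPU interfaces; the 300-cycle and 170-cycle figures are the estimates given above.

    # Sketch only: route traced accesses to the cache levels they touch and
    # estimate hit rates with a simple LRU reuse-distance model.

    # Modeled latencies of the two asynchronous-copy paths (estimates from the text).
    ASYNC_COPY_LATENCY = {
        "LDGSTS.E": 300,          # global -> L1 -> L2 -> shared memory (full path)
        "LDGSTS.BYPASS.E": 170,   # global -> L2 -> shared memory (skips L1)
    }

    def split_trace(sass_trace):
        """sass_trace: iterable of (opcode, address) records (assumed layout)."""
        l1_stream, l2_stream = [], []
        for opcode, address in sass_trace:
            if "BYPASS" in opcode:        # bypassing copies never touch the L1
                l2_stream.append(address)
            else:                         # regular accesses and LDGSTS.E use both levels
                l1_stream.append(address)
                l2_stream.append(address)
        return l1_stream, l2_stream

    def lru_hit_rate(addresses, num_lines, line_bytes=128):
        """Fully associative LRU approximation: an access hits if its reuse
        (stack) distance is smaller than the modeled number of cache lines."""
        stack, hits = [], 0
        for addr in addresses:
            line = addr // line_bytes
            if line in stack:
                if stack.index(line) < num_lines:
                    hits += 1
                stack.remove(line)
            stack.insert(0, line)         # most recently used line on top
        return hits / len(addresses) if addresses else 0.0

    # Example with one bypassing copy: it is counted against L2 only.
    trace = [("LDG.E", 0x7F088EA22000), ("LDGSTS.BYPASS.E", 0x7F088EA22080),
             ("LDG.E", 0x7F088EA22000)]
    l1, l2 = split_trace(trace)
    print(lru_hit_rate(l1, num_lines=1024), lru_hit_rate(l2, num_lines=8192))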
Local memory plays a significant role in influencing application performance, making its accurate modeling critical for performance prediction and optimization efforts [11, 16, 20]. Local memory uses the same data path as normal memory operations. It occupies a portion of the global memory and can be cached in the L1 and L2 caches. Local memory has a thread-private scope similar to the register file [11]. We gather address traces at a per-block granularity for the normal global memory operations, whereas for local memory, we collect the traces per warp. Moreover, our analysis found that the local memory addresses differ from those of other memory operations. For instance, the standard global addresses consistently possess 12 hexadecimal digits, while the local memory addresses are represented with only six digits (e.g., 0X7F088EA22000 vs. 0XFFFC44). We couple this analysis with reuse distance profiling of the local memory load and store traces to estimate the hit rates for both operations [3]. Furthermore, we accurately compute the number of requests and transactions throughout the kernel execution.
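The address-width distinction described above can be illustrated with a short Python sketch; the record layout (warp_id, opcode, address) and the helper names are assumptions for illustration, not the actual PPT-GPU internals.

    from collections import defaultdict

    def is_local_address(addr):
        """Local-memory addresses show up with about six hexadecimal digits
        (e.g., 0xFFFC44), while global addresses carry about twelve
        (e.g., 0x7F088EA22000)."""
        return len(f"{addr:X}") <= 6

    def group_accesses(trace):
        """trace: iterable of (warp_id, opcode, address) records (assumed layout)."""
        local_per_warp = defaultdict(list)    # local-memory traces are kept per warp
        global_trace = []                     # global traces stay at per-block granularity
        for warp_id, opcode, addr in trace:
            if is_local_address(addr):
                local_per_warp[warp_id].append((opcode, addr))
            else:
                global_trace.append((opcode, addr))
        return local_per_warp, global_trace

    # Example: a spilled store/load pair (local) and one global load.
    sample = [(0, "STL", 0xFFFC44), (0, "LDL", 0xFFFC44), (0, "LDG.E", 0x7F088EA22000)]
    local_by_warp, global_trace = group_accesses(sample)
    print(len(local_by_warp[0]), len(global_trace))   # -> 2 1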
The enhanced PPT-GPU now integrates a wider range of additional metrics. Among these are local memory hit rates, transactions, requests, and the active cycles of Tensor Cores and various other compute units. Such details aid in providing an in-depth understanding of the inner workings of the GPU. This ultimately assists in pinpointing performance bottlenecks and paves the way for more efficient optimization of code specific to the Ampere architecture.

4 EXPERIMENTAL EVALUATION
We ran experiments using CUDA 11.7.0 on an A100 GPU. We experimented with a wide range of benchmarks, from simple kernels like matrix multiplication to large kernels from Deepbench [1], a machine learning benchmark suite that leverages tensor cores and shared memory. We also cross-verified our results against Nvidia Nsight.

Figure 2: Prediction results.

Figure 2 highlights three pivotal performance metrics: total active cycles and the hit rates for the L1 and L2 caches. The first six applications incorporate the updated operations and structure of the shared memory. For instance, FP64_Gemm instructions bypass the L1 cache, while instructions in TF32_Gemm access it. The rest use a combination of these operations.
Finally, we have three applications that use register spilling, which results in local memory operations (FP64_Gemm, TF32_Symm, and Gemm_Softmax). These observations highlight that these applications effectively exercise the two principal components we integrated into PPT-GPU.
Figure 2 shows the percentage error in prediction for active cycles, the L1 hit rate, and the L2 hit rate. The average error in the prediction of hit rates for the L1 and L2 caches is 13% and 13.3%, respectively. Furthermore, in our evaluation of PPT-GPU, the error in total active cycles across all applications ranges from a minimum of 3% to a maximum of 42%, with an average of 14.6%. Table 1 summarizes the local memory results. The accuracy of all metrics for FP64_Gemm is 100%. For the other two applications, the prediction error in the hit rates of LD and ST operations ranges from 4% to 15%.
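The errors quoted above follow the usual relative-error form, sketched below with dummy inputs (not values from our results):

    def percent_error(predicted, measured):
        """Relative prediction error against the Nsight-measured value."""
        return abs(predicted - measured) / measured * 100.0

    print(percent_error(predicted=103.0, measured=100.0))   # -> 3.0 (dummy values)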
5 CONCLUSIONS
In this paper, we modeled the Nvidia Ampere shared and local memories in depth and substantially enhanced PPT-GPU to capture Ampere's local memory and the novel design of its shared memory. We used a broad set of benchmarks to validate PPT-GPU and compared the results against Nvidia Nsight. The active cycles, a primary performance metric, have an average prediction error of 14.6%, while the errors of the L1 and L2 cache hit rates stand at 13% and 13.3%, respectively.
REFERENCES
[1] DeepBench. https://svail.github.io/DeepBench/.
[2] Hamdy Abdelkhalik, Yehia Arafa, Nandakishore Santhi, and Abdel-Hameed A. Badawy. 2022. Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis. In 2022 IEEE High Performance Extreme Computing Conference (HPEC). 1–8. https://doi.org/10.1109/HPEC55821.2022.9926299
[3] Yehia Arafa, Abdel-Hameed Badawy, Gopinath Chennupati, Atanu Barai, Nandakishore Santhi, and Stephan Eidenbenz. 2020. Fast, Accurate, and Scalable Memory Modeling of GPGPUs Using Reuse Profiles. In Proceedings of the 34th ACM International Conference on Supercomputing (ICS '20). Association for Computing Machinery, New York, NY, USA, Article 31, 12 pages. https://doi.org/10.1145/3392717.3392761
[4] Yehia Arafa, Abdel-Hameed Badawy, Ammar ElWazir, Atanu Barai, Ali Eker, Gopinath Chennupati, Nandakishore Santhi, and Stephan Eidenbenz. 2021. Hybrid, Scalable, Trace-driven Performance Modeling of GPGPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15.
[5] Yehia Arafa, Abdel-Hameed A. Badawy, Gopinath Chennupati, Nandakishore Santhi, and Stephan Eidenbenz. PPT-GPU Tool. https://github.com/lanl/PPT
[6] Yehia Arafa, Abdel-Hameed A. Badawy, Gopinath Chennupati, Nandakishore Santhi, and Stephan Eidenbenz. 2019. PPT-GPU: Scalable GPU Performance Modeling. IEEE Computer Architecture Letters 18, 1 (2019), 55–58. https://doi.org/10.1109/LCA.2019.2904497
[7] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. 2009. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software. 163–174. https://doi.org/10.1109/ISPASS.2009.4919648
[8] Caroline Collange, Marc Daumas, David Defour, and David Parello. 2010. Barra: A Parallel Functional Simulator for GPGPU. In 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems. IEEE, 351–360.
[9] Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. 2010. Modeling GPU-CPU Workloads and Systems. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. 31–42.
[10] Mahmoud Khairy, Zhesheng Shen, Tor M. Aamodt, and Timothy G. Rogers. 2020. Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 473–486.
[11] Ang Li, Shuaiwen Leon Song, Akash Kumar, Eddy Z. Zhang, Daniel Chavarría-Miranda, and Henk Corporaal. 2016. Critical Points Based Register-Concurrency Autotuning for GPUs. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE). 1273–1278.
[12] NVIDIA. CUDA Binary Utilities. https://docs.nvidia.com/cuda/cuda-binary-utilities/
[13] NVIDIA. NVIDIA A100 Whitepaper. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
[14] NVIDIA. NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog. https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/
[15] NVIDIA. NVIDIA Nsight Systems. https://developer.nvidia.com/nsight-systems
[16] Putt Sakdhnagool, Amit Sabne, and Rudolf Eigenmann. 2019. arXiv preprint arXiv:1907.02894 (2019).
[17] Vijay Thakkar, Pradeep Ramani, Cris Cecka, Aniket Shivam, Honghao Lu, Ethan Yan, Jack Kosaian, Mark Hoemmen, Haicheng Wu, Andrew Kerr, Matt Nicely, Duane Merrill, Dustyn Blasig, Fengqi Qiao, Piotr Majcher, Paul Springer, Markus Hohnerbach, Jin Wang, and Manish Gupta. 2023. CUTLASS. https://github.com/NVIDIA/cutlass
[18] Rafael Ubal, Byunghyun Jang, Perhaad Mistry, Dana Schaa, and David Kaeli. 2012. Multi2Sim: A Simulation Framework for CPU-GPU Computing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. 335–344.
[19] Oreste Villa, Mark Stephenson, David Nellans, and Stephen W. Keckler. 2019. NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 372–383.
[20] Xiaolong Xie, Yun Liang, Xiuhong Li, Yudong Wu, Guangyu Sun, Tao Wang, and Dongrui Fan. 2015. Enabling Coordinated Register Allocation and Thread-Level Parallelism Optimization for GPUs. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). Association for Computing Machinery, New York, NY, USA, 395–406. https://doi.org/10.1145/2830772.2830813