Abstract
Graphics Processing Units (GPUs) have become an important platform for general-purpose computing, thanks to their high performance and low cost compared to CPUs. Modern GPU architectures are constantly evolving and offer growing resources. To take advantage of all available resources and increase GPU efficiency, new-generation GPUs support concurrent kernel execution: different kernels can execute at the same time and share the GPU resources. Benchmark suites developed to evaluate GPU performance and scalability should therefore take this aspect into account, which can make them quite different from traditional CPU benchmarks. Currently, SHOC, Parboil, and Rodinia are the main benchmark suites for evaluating GPUs. This work analyzes these benchmark suites in a novel way. We propose to categorize the kernels of each application in these suites by multiple criteria, based on their behavior in terms of computation type (integer or floating point), usage of the memory hierarchy, efficiency, and hardware occupancy. Based on the characterization results, we analyze kernel concurrency opportunities. The focus is on disclosing the resource requirements of the kernels of these benchmarks and explaining their behavior when executed concurrently.
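To make the idea of resource-based concurrency opportunities concrete, the following is a minimal sketch and not the characterization method used in this work. It assumes that each kernel is summarized by a few per-block resource demands and checks whether one block of each kernel could reside on the same streaming multiprocessor (SM) at once. The names `KernelProfile`, `SMLimits`, and `blocks_fit_together`, the example kernels, and all numeric limits are hypothetical; real limits depend on the GPU architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelProfile:
    """Hypothetical per-kernel summary; field names are illustrative only."""
    name: str
    threads_per_block: int
    regs_per_thread: int
    shared_mem_per_block: int  # bytes
    dominant_op: str           # "integer" or "float"


@dataclass(frozen=True)
class SMLimits:
    """Assumed per-SM limits; actual values vary by GPU generation."""
    max_threads: int = 2048
    max_registers: int = 65536
    max_shared_mem: int = 49152  # bytes


def blocks_fit_together(kernels, limits):
    """Return True if one block of every kernel fits on a single SM at once."""
    threads = sum(k.threads_per_block for k in kernels)
    regs = sum(k.threads_per_block * k.regs_per_thread for k in kernels)
    smem = sum(k.shared_mem_per_block for k in kernels)
    return (threads <= limits.max_threads
            and regs <= limits.max_registers
            and smem <= limits.max_shared_mem)


# Illustrative kernels (made-up numbers, not measurements from any benchmark).
compute_heavy = KernelProfile("compute_heavy", 256, 64, 0, "float")
mixed = KernelProfile("mixed", 512, 32, 32768, "integer")
smem_heavy = KernelProfile("smem_heavy", 256, 32, 40960, "float")

limits = SMLimits()
fits_a = blocks_fit_together([compute_heavy, mixed], limits)
fits_b = blocks_fit_together([mixed, smem_heavy], limits)
```

Under these assumed limits, the first pair fits because no single resource is exhausted, while the second pair does not because their combined shared memory demand exceeds the per-SM capacity. Kernels whose demands fall on different resources are the more likely candidates for concurrent execution.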



Carvalho, P., Cruz, R., Drummond, L.M.A. et al. Kernel concurrency opportunities based on GPU benchmarks characterization. Cluster Comput 23, 177–188 (2020). https://doi.org/10.1007/s10586-018-02901-1
DOI: https://doi.org/10.1007/s10586-018-02901-1