Accelerating LBM and LQCD Application Kernels by In-Memory Processing

Baumeister, Paul F.; Boettiger, Hans; Brunheroto, José R.; Hater, Thorsten; Maurer, Thilo; Nobile, Andrea; Pleiter, Dirk

doi:10.1007/978-3-319-20119-1_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9137)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Processing-in-memory architectures promise increased computing performance at decreased costs in energy, as the physical proximity of the compute pipelines to the data store eliminates overheads for data transport. We assess the overall performance impact using a recently introduced architecture of that type, called the Active Memory Cube, for two representative scientific applications. Precise performance results for performance-critical kernels are obtained using cycle-accurate simulations. We provide an overall performance estimate using performance models.

References

[1] Ang, J.A., Barrett, R.F., Benner, R.E., Burke, D., Chan, C., Cook, J., Donofrio, D., Hammond, S.D., Hemmert, K.S., Kelly, S.M., Le, H., Leung, V.J., Resnick, D.R., Rodrigues, A.F., Shalf, J., Stark, D., Unat, D., Wright, N.J.: Abstract machine models and proxy architectures for exascale computing. In: Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing (Co-HPC 2014), pp. 25–32. IEEE Press, Piscataway (2014). http://dx.doi.org/10.1109/Co-HPC.2014.4
[2] Balasubramonian, R., Chang, J., Manning, T., Moreno, J.H., Murphy, R., Nair, R., Swanson, S.: Near-data processing: insights from a MICRO-46 workshop. IEEE Micro 34(4), 36–42 (2014)
[3] Biferale, L., Mantovani, F., Pivanti, M., Sbragaglia, A., Schifano, S., Toschi, F., Tripiccione, R.: Lattice Boltzmann fluid-dynamics on the QPACE supercomputer. Procedia Comput. Sci. 1(1), 1075–1082 (2010). http://www.sciencedirect.com/science/article/pii/S1877050910001201, ICCS 2010
[4] Biferale, L., Mantovani, F., Pivanti, M., Pozzati, F., Sbragaglia, M., Scagliarini, A., Schifano, S.F., Toschi, F., Tripiccione, R.: Optimization of multi-phase compressible lattice Boltzmann codes on massively parallel multi-core systems. Procedia Comput. Sci. 4, 994–1003 (2011). http://www.sciencedirect.com/science/article/pii/S1877050911001633, Proceedings of the International Conference on Computational Science, ICCS 2011
[5] Boyle, P.A., Christ, N.H., Kim, C.: Co-design of the IBM Blue Gene/Q level 1 prefetch engine with QCD. IBM J. Res. Dev. 57(1/2), 13:1–13:10 (2013)
[6] Calore, E., Schifano, S.F., Tripiccione, R.: A portable OpenCL lattice Boltzmann code for multi- and many-core processor architectures. Procedia Comput. Sci. 29, 40–49 (2014). http://www.sciencedirect.com/science/article/pii/S1877050914001811, 2014 International Conference on Computational Science
[7] Elliott, D., Snelgrove, W., Stumm, M.: Computational RAM: a memory-SIMD hybrid and its application to DSP. In: Proceedings of the IEEE 1992 Custom Integrated Circuits Conference, pp. 30.6.1–30.6.4, May 1992
[8] Frommer, A., Kahl, K., Krieg, S., Leder, B., Rottmann, M.: Adaptive aggregation based domain decomposition multigrid for the lattice Wilson Dirac operator. SIAM J. Sci. Comput. 36, A1581–A1608 (2014)
[9] Hall, M., Kogge, P., Koller, J., Diniz, P., Chame, J., Draper, J., LaCoss, J., Granacki, J., Brockman, J., Srivastava, A., Athas, W., Freeh, V., Shin, J., Park, J.: Mapping irregular applications to DIVA, a PIM-based data-intensive architecture. In: ACM/IEEE 1999 Conference on Supercomputing, pp. 57–57, November 1999
[10] Heybrock, S., Joó, B., Kalamkar, D.D., Smelyanskiy, M., Vaidyanathan, K., Wettig, T., Dubey, P.: Lattice QCD with domain decomposition on Intel Xeon Phi co-processors. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), pp. 69–80. IEEE Press, Piscataway (2014). http://dx.doi.org/10.1109/SC.2014.11
[11] Hybrid Memory Cube Consortium: Hybrid Memory Cube Specification (2013)
[12] Kang, Y., Huang, W., Yoo, S.M., Keen, D., Ge, Z., Lam, V., Pattnaik, P., Torrellas, J.: FlexRAM: toward an advanced intelligent memory system. In: International Conference on Computer Design (ICCD 1999), pp. 192–201 (1999)
[13] Koutsou, G., Krieg, S., Pleiter, D., Simma, H.: EIC co-design questionnaire: lattice QCD (unpublished, 2013)
[14] Nair, R., Antao, S.F., Bertolli, C., Bose, P., Brunheroto, J.R., Chen, T., Cher, C.-Y., Costa, C.H.A., Evangelinos, C., Fleischer, B.M., Fox, T.W., Gallo, D.S., Grinberg, L., Gunnels, J.A., Jacob, A.C., Jacob, P., Jacobson, H.M., Karkhanis, T., Kim, C., Moreno, J.H., O’Brien, J.K., Ohmacht, M., Park, Y., Prener, D.A., Rosenburg, B.S., Ryu, K.D., Sallenave, O., Serrano, M.J., Siegl, P.D.M., Sugavanam, K., Sura, Z.: Active memory cube: a processing-in-memory architecture for exascale systems. IBM J. Res. Dev. 59(2/3), 17:1–17:14 (2015)
[15] Nguyen, A., Satish, N., Chhugani, J., Kim, C., Dubey, P.: 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010), pp. 1–13, November 2010
[16] Patterson, D., Anderson, T., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C., Thomas, R., Yelick, K.: A case for intelligent RAM. IEEE Micro 17(2), 34–44 (1997)
[17] Scagliarini, A., Biferale, L., Sbragaglia, M., Sugiyama, K., Toschi, F.: Lattice Boltzmann methods for thermal flows: continuum limit and applications to compressible Rayleigh-Taylor systems. Phys. Fluids 22(5), 055101 (2010)
[18] Schifano, S.F., Tripiccione, R.: EIC co-design questionnaire: LBM (unpublished, 2013)
[19] Torrellas, J.: FlexRAM: toward an advanced intelligent memory system: a retrospective paper. In: IEEE 30th International Conference on Computer Design (ICCD 2012), pp. 3–4, September 2012
[20] Williams, S., Oliker, L., Carter, J., Shalf, J.: Extracting ultra-scale lattice Boltzmann performance via hierarchical and distributed auto-tuning. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2011), pp. 55:1–55:12. ACM, New York (2011). http://doi.acm.org/10.1145/2063384.2063458
[21] Winter, F., Clark, M., Edwards, R., Joo, B.: A framework for lattice QCD calculations on GPUs. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 1073–1082, May 2014

Acknowledgements

We thank the AMC team at IBM Research, in particular J. Moreno, for sharing their knowledge of the AMC and for their continued help on this project, including many fruitful discussions. Furthermore, we gratefully acknowledge S.F. Schifano and R. Tripiccione (INFN/University of Ferrara) for making a mini-application version of their D2Q37 code available and for discussing their future roadmaps [18]. We also thank G. Koutsou, S. Krieg, and H. Simma from the Simulation Lab LQCD at Cyprus Institute/DESY/JSC for discussing the future requirements of LQCD [13]. Finally, we thank A. Frommer and S. Krieg for making their implementation of their AMG solver [8] available.

Author information

Authors and Affiliations

Jülich Supercomputing Centre, Forschungszentrum Jülich, 52425, Jülich, Germany
Paul F. Baumeister, Thorsten Hater & Dirk Pleiter
IBM Deutschland Research and Development GmbH, 71032, Böblingen, Germany
Hans Boettiger & Thilo Maurer
IBM T.J. Watson Research Center, Yorktown Heights, NY, 10598, USA
José R. Brunheroto
Institute for Advanced Simulation, Forschungszentrum Jülich, 52425, Jülich, Germany
Andrea Nobile

Corresponding author

Correspondence to Thorsten Hater.

Editor information

Editors and Affiliations

Deutsches Klimarechenzentrum (DKRZ), Hamburg, Germany
Julian M. Kunkel
Deutsches Klimarechenzentrum (DKRZ), Hamburg, Germany
Thomas Ludwig

About this paper

Cite this paper

Baumeister, P.F. et al. (2015). Accelerating LBM and LQCD Application Kernels by In-Memory Processing. In: Kunkel, J., Ludwig, T. (eds) High Performance Computing. ISC High Performance 2015. Lecture Notes in Computer Science, vol 9137. Springer, Cham. https://doi.org/10.1007/978-3-319-20119-1_8

DOI: https://doi.org/10.1007/978-3-319-20119-1_8
Published: 20 June 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20118-4
Online ISBN: 978-3-319-20119-1
eBook Packages: Computer Science, Computer Science (R0)
