AFHRE: An Accurate and Fast Hardware Resources Estimation Method for Convolutional Accelerator with Systolic Array Structure on FPGA
Abstract
1. Introduction
[Clock-by-clock listing for t0 through t11; the per-clock entries are not recoverable.]
1. The resource composition of systolic arrays is analyzed.
2. A resource calculation method, AFHRE, is established to compute the FPGA resources occupied by any given set of systolic array parameters.
3. The resource calculation method is used to derive candidate systolic array parameters, helping design tools find suitably sized CNN accelerators more quickly under hardware resource constraints.
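The third contribution, searching for an array scale that fits a resource budget, can be sketched as follows. This is a minimal illustration and not the authors' implementation: `Resources`, `toy_estimate`, and `largest_fitting_array` are hypothetical names, and the per-PE costs inside `toy_estimate` are made-up placeholders standing in for the AFHRE formulas.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Resources:
    """FPGA resource counts for the four types the paper estimates."""
    bram: int
    dsp: int
    ff: int
    lut: int

    def fits(self, budget: "Resources") -> bool:
        # A design fits only if every resource type is within budget.
        return (self.bram <= budget.bram and self.dsp <= budget.dsp
                and self.ff <= budget.ff and self.lut <= budget.lut)


def toy_estimate(rows: int, cols: int) -> Resources:
    """Placeholder cost model (ASSUMPTION, not the AFHRE formulas)."""
    pes = rows * cols
    return Resources(bram=2 * (rows + cols), dsp=pes,
                     ff=300 * pes, lut=200 * pes)


def largest_fitting_array(budget, max_dim, estimate=toy_estimate):
    """Return the (rows, cols) with the most PEs that fits, or None."""
    best = None
    for rows, cols in product(range(1, max_dim + 1), repeat=2):
        if estimate(rows, cols).fits(budget):
            if best is None or rows * cols > best[0] * best[1]:
                best = (rows, cols)
    return best
```

Because each candidate is scored by a closed-form estimate rather than a synthesis run, an exhaustive sweep like this stays cheap; a real design tool would swap `toy_estimate` for the actual estimator.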
2. Related Work
3. Hardware Resource Usage Analysis
3.1. Minimalist Systolic Arrays
1. Expression: An expression is a segment of code that computes a value in programming and hardware description languages (HDLs). It may consist of operators, variables, constants, and function calls.
2. FIFO: FIFO (First In, First Out) is a data structure that allows data to be read in the order in which they were stored.
3. Instance: An instance in hardware design refers to the specific implementation of a module or component.
4. Multiplexer: A multiplexer is a selector that chooses one output from multiple inputs, forwarding the selected input signal to the output based on control signals.
5. Register: A register is a storage unit capable of temporarily holding data in hardware. Registers are used to store data, states, or control information, and they read or write data under the control of a clock signal. Registers are often used in pipelined designs to maintain data values across different computation cycles.
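One way to picture how these five component categories combine into a design-level total is sketched below. It only illustrates the bookkeeping: `Usage` and `total_usage` are hypothetical names, and both the numbers and the choice of which component consumes which resource type are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Resource usage of one component category (all fields default to 0)."""
    bram: int = 0
    dsp: int = 0
    ff: int = 0
    lut: int = 0

    def __add__(self, other: "Usage") -> "Usage":
        # Resource counts are additive across components.
        return Usage(self.bram + other.bram, self.dsp + other.dsp,
                     self.ff + other.ff, self.lut + other.lut)


def total_usage(per_component: dict) -> Usage:
    """Sum the usage of every component category into one total."""
    total = Usage()
    for usage in per_component.values():
        total = total + usage
    return total


# Illustrative numbers only (ASSUMPTION); not measurements from the paper.
example = {
    "expression": Usage(lut=120),
    "fifo": Usage(ff=64, lut=40),
    "instance": Usage(dsp=4, ff=300, lut=250),
    "multiplexer": Usage(lut=90),
    "register": Usage(ff=128),
}
```

Keeping the per-category breakdown separate from the total makes it easy to see which category dominates a given resource type before the totals are compared against a device budget.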
3.1.1. Estimation of Expression
3.1.2. Estimation of FIFO
3.1.3. Estimation of Instance
3.1.4. Estimation of Multiplexer
3.1.5. Estimation of Register
3.2. Systolic Arrays Generated by Polyhedral Compilers
3.2.1. Estimation of BRAM
3.2.2. Estimation of DSP
3.2.3. Estimation of FF
3.2.4. Estimation of LUT
4. Accurate and Fast Hardware Resources Estimation Method (AFHRE) of Hardware Resources for CNN Model
Algorithm 1: AFHRE
Input:
Result:
5. Evaluation
5.1. Theory and Experimental Analysis of Minimalist Systolic Arrays Hardware Resources
5.2. Theory and Experimental Analysis of Systolic Arrays Generated by Polyhedral Compilers
5.3. Estimation Time
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 3813–3824.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Hao, C.; Dotzel, J.; Xiong, J.; Benini, L.; Zhang, Z.; Chen, D. Enabling Design Methodologies and Future Trends for Edge AI: Specialization and Codesign. IEEE Des. Test 2021, 38, 7–26.
- Xu, P.; Zhang, X.; Hao, C.; Zhao, Y.; Zhang, Y.; Wang, Y.; Li, C.; Guan, Z.; Chen, D.; Lin, Y. AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 23–25 February 2020; pp. 40–50.
- Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 2017, 105, 2295–2329.
- Zhou, X.; Du, Z.; Zhang, S.; Zhang, L.; Lan, H.; Liu, S.; Li, L.; Guo, Q.; Chen, T.; Chen, Y. Addressing Sparsity in Deep Neural Networks. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 38, 1858–1871.
- Coates, A.; Huval, B.; Wang, T.; Wu, D.J.; Ng, A.Y.; Catanzaro, B. Deep Learning with COTS HPC Systems. In Proceedings of the 30th International Conference on Machine Learning, Volume 28, Atlanta, GA, USA, 17–19 June 2013; pp. III-1337–III-1345.
- Jadhav, S.S.; Gloster, C.; Naher, J.; Doss, C.; Kim, Y. A Multi-Memory Field-Programmable Custom Computing Machine for Accelerating Compute-Intensive Applications. In Proceedings of the 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 1–4 December 2021; pp. 0619–0628.
- Chen, Z.; Ma, Y.; Wang, Z. Hybrid Stochastic-Binary Computing for Low-Latency and High-Precision Inference of CNNs. IEEE Trans. Circuits Syst. Regul. Pap. 2022, 69, 2707–2720.
- Nurvitadhi, E.; Sheffield, D.; Sim, J.; Mishra, A.; Venkatesh, G.; Marr, D. Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC. In Proceedings of the 2016 International Conference on Field-Programmable Technology (FPT), Xi’an, China, 7–9 December 2016; pp. 77–84.
- Nurvitadhi, E.; Venkatesh, G.; Sim, J.; Marr, D.; Huang, R.; Ong Gee Hock, J.; Liew, Y.T.; Srivatsan, K.; Moss, D.; Subhaschandra, S.; et al. Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks? In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 5–14.
- Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170.
- Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J.S. Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2018, 26, 1354–1367.
- Wei, X.; Yu, C.H.; Zhang, P.; Chen, Y.; Wang, Y.; Hu, H.; Liang, Y.; Cong, J. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, USA, 18–22 June 2017; pp. 1–6.
- Basalama, S.; Sohrabizadeh, A.; Wang, J.; Guo, L.; Cong, J. FlexCNN: An End-to-end Framework for Composing CNN Accelerators on FPGA. ACM Trans. Reconfigurable Technol. Syst. 2023, 16, 3570928.
- Samajdar, A.; Joseph, J.M.; Zhu, Y.; Whatmough, P.; Mattina, M.; Krishna, T. A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim. In Proceedings of the 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Virtual Meeting, 23–26 August 2020; pp. 58–68.
- Nazemi, M.; Nazarian, S.; Pedram, M. High-performance FPGA implementation of equivariant adaptive separation via independence algorithm for Independent Component Analysis. In Proceedings of the 2017 IEEE 28th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Seattle, WA, USA, 10–12 July 2017; pp. 25–28.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
- Lin, M.; Chen, Q.; Yan, S. Network In Network. arXiv 2014, arXiv:1312.4400.
- Chen, Y.H.; Krishna, T.; Emer, J.; Sze, V. 14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In Proceedings of the 2016 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 31 January–4 February 2016; pp. 262–263.
- Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, Canada, 24–28 June 2017; pp. 1–12.
- Li, J.; Shen, G.; Zhao, D.; Zhang, Q.; Zeng, Y. FireFly: A High-Throughput Hardware Accelerator for Spiking Neural Networks with Efficient DSP and Memory Optimization. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2023, 31, 1178–1191.
- Basalama, S.; Wang, J.; Cong, J. A Comprehensive Automated Exploration Framework for Systolic Array Designs. In Proceedings of the 2023 60th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 9–13 July 2023; pp. 1–6.
- Dave, S.; Kim, Y.; Avancha, S.; Lee, K.; Shrivastava, A. dMazeRunner: Executing Perfectly Nested Loops on Dataflow Accelerators. ACM Trans. Embed. Comput. Syst. 2019, 18, 1–27.
- Huang, Q.; Kang, M.; Dinh, G.; Norell, T.; Kalaiah, A.; Demmel, J.; Wawrzynek, J.; Shao, Y.S. CoSA: Scheduling by Constrained Optimization for Spatial Accelerators. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Online, 14–19 June 2021; pp. 554–566.
- Parashar, A.; Raina, P.; Shao, Y.S.; Chen, Y.H.; Ying, V.A.; Mukkara, A.; Venkatesan, R.; Khailany, B.; Keckler, S.W.; Emer, J. Timeloop: A Systematic Approach to DNN Accelerator Evaluation. In Proceedings of the 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Madison, WI, USA, 24–26 March 2019; pp. 304–315.
- Lu, L.; Guan, N.; Wang, Y.; Jia, L.; Luo, Z.; Yin, J.; Cong, J.; Liang, Y. TENET: A Framework for Modeling Tensor Dataflow Based on Relation-centric Notation. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Online, 14–19 June 2021; pp. 720–733.
- Kao, S.C.; Jeong, G.; Krishna, T. ConfuciuX: Autonomous Hardware Resource Assignment for DNN Accelerators using Reinforcement Learning. In Proceedings of the 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Athens, Greece, 17–21 October 2020; pp. 622–636.
- Kao, S.C.; Krishna, T. GAMMA: Automating the HW mapping of DNN models on accelerators via genetic algorithm. In Proceedings of the 39th International Conference on Computer-Aided Design, Virtual Conference, 2–5 November 2020.
- Wang, J.; Guo, L.; Cong, J. AutoSA: A Polyhedral Compiler for High-Performance Systolic Arrays on FPGA. In Proceedings of the 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 28 February–2 March 2021; pp. 93–104.
- Williams, S.; Waterman, A.; Patterson, D. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 2009, 52, 65–76.
Framework | Search Algorithm | Speed | Accuracy | On-Board Validation |
---|---|---|---|---|
Odyssey [24] | Mathematical Programming Search | Slow | Accurate (using Vivado HLS) | FPGA |
dMazeRunner [25] | Pruning Random Search | Medium | Not given | No (CPU) |
CoSA [26] | Mathematical Programming | Fast | Not given | GPU |
Timeloop [27] | Pruning Random Search | Medium | Not given | DianNao, GPU, Eyeriss |
TENET [28] | Pruning Random Search | Medium | Estimation accuracy 89.6% | No |
ConfuciuX [29] | RL Evolutionary Search | Medium | Not given | No |
GAMMA [30] | Genetic Algorithm-Based Method | Medium | Not given | TPU, GPU, Eyeriss |
Component | BRAM | DSP | FF | LUT |
---|---|---|---|---|
Expression | - | - | - | |
FIFO | - | - | | |
Instance | - | | | |
Multiplexer | - | - | - | |
Register | - | - | - | |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wang, Y.; Zhao, H.; Zhao, J. AFHRE: An Accurate and Fast Hardware Resources Estimation Method for Convolutional Accelerator with Systolic Array Structure on FPGA. Electronics 2025, 14, 168. https://doi.org/10.3390/electronics14010168