DOI: 10.1145/2847263.2847276

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Published: 21 February 2016

Abstract

Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications such as image classification, face detection, and video analysis, because of their ability to train and classify with high accuracy. Because CNNs stack multiple convolution and fully-connected layers that are compute- and memory-intensive, it is difficult to perform real-time classification with low power consumption on today's computing systems. FPGAs have been widely explored as hardware accelerators for CNNs because of their reconfigurability and energy efficiency, as well as fast turnaround time, especially with high-level synthesis methodologies. Previous FPGA-based CNN accelerators, however, typically implemented generic accelerators agnostic to the CNN configuration, so the reconfigurable capabilities of FPGAs were not fully leveraged to maximize overall system throughput. In this work, we present a systematic design space exploration methodology that maximizes the throughput of an OpenCL-based FPGA accelerator for a given CNN model, subject to FPGA resource constraints such as on-chip memory, registers, computational resources, and external memory bandwidth. The proposed methodology is demonstrated by optimizing two representative large-scale CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, the DE5-Net and P395-D8 boards, which have different hardware resources. We achieve a peak performance of 136.5 GOPS for the convolution operations, and 117.8 GOPS for the entire VGG network performing ImageNet classification on the P395-D8 board.
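The abstract describes the methodology only at a high level: choose the accelerator's parallelism parameters so that estimated throughput is maximized without exceeding the FPGA's resource budgets. As a rough illustration of that kind of constrained search (not code from the paper), the Python sketch below enumerates two hypothetical unroll knobs, a compute-unit count and a SIMD width, against a made-up analytical cost model; the resource budgets, clock frequency, and cost formulas are all assumptions chosen only for illustration.

```python
# Illustrative sketch only: brute-force design space exploration that picks
# convolution unroll factors to maximize estimated throughput under
# hypothetical FPGA resource budgets. All constants and the cost model
# below are assumptions, not values from the paper.

from itertools import product

# Hypothetical resource budgets (roughly Stratix-V class, for illustration)
DSP_BUDGET = 1963          # DSP blocks available
BRAM_BITS_BUDGET = 50e6    # on-chip memory bits
EXT_BW_GBPS = 25.0         # external memory bandwidth, GB/s
FMAX_MHZ = 200.0           # assumed kernel clock

def resources(cu, simd):
    """Very rough cost model: one DSP and a fixed local buffer per MAC lane."""
    dsp = cu * simd
    bram_bits = cu * simd * 32 * 1024   # assumed buffer bits per lane
    return dsp, bram_bits

def throughput_gops(cu, simd):
    """2 ops (multiply + add) per MAC lane per cycle, capped by memory bandwidth."""
    compute = 2.0 * cu * simd * FMAX_MHZ * 1e6 / 1e9
    bw_cap = EXT_BW_GBPS * 4.0          # assumed ops sustained per byte fetched
    return min(compute, bw_cap)

best = None
for cu, simd in product([1, 2, 4, 8, 16], [1, 2, 4, 8, 16]):
    dsp, bram = resources(cu, simd)
    if dsp > DSP_BUDGET or bram > BRAM_BITS_BUDGET:
        continue                        # violates resource constraints
    gops = throughput_gops(cu, simd)
    if best is None or gops > best[0]:
        best = (gops, cu, simd)

print(f"best config: {best[1]} compute units x SIMD {best[2]}, ~{best[0]:.1f} GOPS")
```

In the paper itself the design variables, cost model, and constraints are derived from the OpenCL kernel structure and the target board's resources; the sketch only conveys the shape of the exploration loop.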


    Published In

    FPGA '16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
    February 2016
    298 pages
    ISBN:9781450338561
    DOI:10.1145/2847263

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 February 2016


    Author Tags

    1. convolutional neural networks
    2. fpga
    3. opencl
    4. optimization

    Qualifiers

    • Research-article

    Conference

FPGA '16

    Acceptance Rates

FPGA '16 paper acceptance rate: 20 of 111 submissions (18%)
Overall acceptance rate: 125 of 627 submissions (20%)

