
Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?

Published: 22 February 2017

Abstract

Current-generation Deep Neural Networks (DNNs), such as AlexNet and VGG, rely heavily on dense floating-point matrix multiplication (GEMM), which maps well to GPUs (regular parallelism, high TFLOP/s). Because of this, GPUs are widely used for accelerating DNNs. Current FPGAs offer superior energy efficiency (ops/watt), but they do not match the performance of today's GPUs on DNNs. In this paper, we examine upcoming FPGA technology advances and the rapid pace of innovation in DNN algorithms, and consider whether future high-performance FPGAs will outperform GPUs for next-generation DNNs. The upcoming Intel® 14-nm Stratix® 10 FPGAs will have thousands of hard floating-point units (DSPs) and on-chip RAMs (M20K memory blocks). They will also have high-bandwidth memories (HBMs) and improved frequency (HyperFlex™ core architecture). This combination of features brings FPGA raw floating-point performance within striking distance of GPUs. Meanwhile, DNNs are evolving quickly. For example, recent innovations that exploit sparsity (e.g., pruning) and compact data types (e.g., 1-2 bits) yield major leaps in algorithmic efficiency. However, these innovations introduce irregular parallelism on custom data types, which is difficult for GPUs to handle but a great fit for the FPGA's extreme customizability.
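To see why compact data types map so naturally onto customizable hardware, consider the standard XNOR-popcount trick for binarized networks: when weights and activations are constrained to {-1, +1} and packed as bits, a dot product collapses into an XNOR followed by a population count. The sketch below (plain Python, with a hypothetical 8-element vector pair) is illustrative only and is not the paper's accelerator template.

```python
# Binarized dot product via XNOR + popcount.
# Encode +1 -> bit 1 and -1 -> bit 0; positions where the two bit
# vectors agree contribute +1 to the dot product, disagreements -1:
#   dot = 2 * popcount(~(a ^ b) & mask) - n
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-element {-1, +1} vectors packed as bit masks."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")  # agreeing positions
    return 2 * matches - n  # matches - mismatches

# Reference check against the unpacked +/-1 vectors.
a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, 1]
pack = lambda v: sum(1 << i for i, x in enumerate(v) if x == 1)
assert binary_dot(pack(a), pack(b), len(a)) == sum(x * y for x, y in zip(a, b))
```

On an FPGA, the XNOR and popcount become a handful of LUTs per multiply-accumulate, which is why binarized GEMM sees such a large efficiency jump over floating-point hardware.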
This paper evaluates a selection of emerging DNN algorithms on two generations of Intel FPGAs (Arria 10 and Stratix 10) against the latest high-performance Titan X Pascal GPU. We created a customizable DNN accelerator template for FPGAs and used it in our evaluations. First, we study various GEMM operations for next-generation DNNs. Our results show that the Stratix 10 FPGA is 10%, 50%, and 5.4x better in performance (TOP/s) than the Titan X Pascal GPU on GEMM operations for pruned, Int6, and binarized DNNs, respectively. Then, we present a detailed case study on accelerating Ternary ResNet, which relies on sparse GEMM on 2-bit weights (i.e., weights constrained to 0, +1, and -1) and full-precision neurons. The Ternary ResNet accuracy is within ~1% of the full-precision ResNet that won the 2015 ImageNet competition. On Ternary ResNet, the Stratix 10 FPGA delivers 60% better performance than the Titan X Pascal GPU, while being 2.3x better in performance/watt. Our results indicate that FPGAs may become the platform of choice for accelerating next-generation DNNs.
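The sparse GEMM at the heart of the Ternary ResNet case study can be sketched in a few lines: with weights constrained to {0, +1, -1}, each output neuron is just a sum and difference of full-precision activations, and zero weights are skipped entirely. The Python fragment below (with a made-up 2x4 weight matrix) illustrates the idea; it is not the accelerator template evaluated in the paper.

```python
# Sparse ternary matrix-vector product: each weight row is stored as two
# index lists (where the weight is +1 and where it is -1). Zeros are never
# stored or visited, and no multiplications are needed -- only add/subtract
# of full-precision activations.
def ternary_matvec(weight_rows, x):
    """weight_rows: list of (plus_idx, minus_idx) per output neuron."""
    out = []
    for plus_idx, minus_idx in weight_rows:
        acc = sum(x[j] for j in plus_idx) - sum(x[j] for j in minus_idx)
        out.append(acc)
    return out

# Dense equivalent of the hypothetical 2x4 ternary weight matrix
# [[ 1,  0, -1,  1],
#  [ 0, -1,  0,  1]]
rows = [([0, 3], [2]), ([3], [1])]
x = [0.5, -2.0, 1.0, 3.0]
assert ternary_matvec(rows, x) == [0.5 + 3.0 - 1.0, 3.0 - (-2.0)]
```

The irregularity is visible even in this toy version: each row touches a different, data-dependent set of activations, which is awkward for a GPU's wide SIMD lanes but straightforward to wire up in custom FPGA logic.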




Published In
FPGA '17: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2017
312 pages
ISBN:9781450343541
DOI:10.1145/3020078


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. FPGA
  2. GPU
  3. accelerator
  4. deep learning
  5. Intel Stratix 10

Qualifiers

  • Research-article

Conference

FPGA '17

Acceptance Rates

FPGA '17 Paper Acceptance Rate: 25 of 101 submissions (25%)
Overall Acceptance Rate: 125 of 627 submissions (20%)



Cited By

  • Programming FPGAs for economics: An introduction to electrical engineering economics. Quantitative Economics 16:1 (49-87), 2025. DOI: 10.3982/QE2344
  • AFHRE: An Accurate and Fast Hardware Resources Estimation Method for Convolutional Accelerator with Systolic Array Structure on FPGA. Electronics 14:1 (168), 3 Jan 2025. DOI: 10.3390/electronics14010168
  • Using FPGA devices to accelerate the evaluation phase of tree-based genetic programming: an extended analysis. Genetic Programming and Evolvable Machines 26:1, 7 Jan 2025. DOI: 10.1007/s10710-024-09505-2
  • A2Q+. Proceedings of the 41st International Conference on Machine Learning (9275-9291), 21 Jul 2024. DOI: 10.5555/3692070.3692439
  • Synthesis of neurocomputer systems with coordinated-parallel processing of intensive real-time data streams. Scientific Bulletin of UNFU 34:6 (76-86), 5 Sep 2024. DOI: 10.36930/40340611
  • FPGA Implementation of Complex-Valued Neural Network for Polar-Represented Image Classification. Sensors 24:3 (897), 30 Jan 2024. DOI: 10.3390/s24030897
  • FOLD: Low-Level Image Enhancement for Low-Light Object Detection Based on FPGA MPSoC. Electronics 13:1 (230), 4 Jan 2024. DOI: 10.3390/electronics13010230
  • EXPRESS: A Framework for Execution Time Prediction of Concurrent CNNs on Xilinx DPU Accelerator. ACM Transactions on Embedded Computing Systems 24:1 (1-31), 3 Oct 2024. DOI: 10.1145/3697835
  • TabConv: Low-Computation CNN Inference via Table Lookups. Proceedings of the 21st ACM International Conference on Computing Frontiers (180-188), 7 May 2024. DOI: 10.1145/3649153.3649212
  • A real-time demonstrator for image classification using FPGA-based logic neural networks. Real-time Processing of Image, Depth, and Video Information 2024 (4), 20 Jun 2024. DOI: 10.1117/12.3017459
