Abstract— Convolutional neural network (CNN) is the state-of-the-art deep learning approach employed in various applications. Real-time CNN implementations in resource-limited embedded systems have recently become highly desired. To ensure programmable flexibility and to shorten the development period, the field programmable gate array is an appropriate platform for implementing CNN models. However, the limited bandwidth and on-chip memory storage are the bottlenecks of CNN acceleration. In this paper, we propose efficient hardware architectures to accelerate deep CNN models. The theoretical derivation of the parallel fast finite impulse response algorithm (FFA) is introduced. Based on FFAs, the corresponding fast convolution units (FCUs) are developed for the computation of convolutions in CNN models. Novel data storage and reuse schemes are proposed, where all intermediate pixels are stored on-chip and the bandwidth requirement is reduced. We choose one of the largest and most accurate networks, VGG16, and implement it on the Xilinx Zynq ZC706 and Virtex VC707 boards, respectively. We achieve a top-5 accuracy of 86.25% using an equal distance non-uniform quantization method. It is estimated that the average performances are 316.23 GOP/s under a 172-MHz working frequency on the Xilinx ZC706 and 1250.21 GOP/s under a 170-MHz working frequency on the VC707, respectively. In brief, the proposed design outperforms existing works significantly, in particular surpassing related designs by more than two times in terms of resource efficiency.

Index Terms— Convolutional neural network (CNN), field programmable gate array (FPGA) platform, fast FIR algorithm, on-chip data storage scheme.

Manuscript received January 26, 2017; revised June 11, 2017; accepted October 13, 2017. This work was supported in part by the National Natural Science Foundation of China under Grant 61774082 and Grant 61604068, and in part by the Fundamental Research Funds for the Central Universities under Grant 021014380065. This paper was recommended by Associate Editor Y. Ha. (Corresponding authors: Jun Lin; Zhongfeng Wang.) The authors are with the School of Electronic Science and Engineering, Nanjing University, Nanjing 210008, China (e-mail: jcwang@smail.nju.edu.cn; jlin@nju.edu.cn; zfwang@nju.edu.cn). Digital Object Identifier 10.1109/TCSI.2017.2767204

I. INTRODUCTION

Convolutional Neural Network (CNN) has been proven to be a powerful tool for various tasks, including object recognition [1], [2], detection [3], [4], and speech recognition [5], [6]. Real-time image recognition could be applied everywhere using CNN-like deep learning approaches in the near future. However, most CNN models are trained and implemented on software platforms due to their versatility. For mobility and privacy reasons, real-time applications requiring high-accuracy, low-power image and voice recognition based on CNNs should be able to run on local embedded processors.

A CNN achieves satisfactory recognition accuracy as long as the network is deep enough (i.e., has a large number of layers), which makes it a deep Convolutional Neural Network (DCNN). Recent DCNNs, which consist of tens to hundreds of convolutional layers, have shown great promise in visual understanding [7].

Since existing systems with general purpose processors have not been optimized for DCNNs, the acceleration of DCNNs should be carefully investigated for real-time embedded systems [8]. Most software-based CNNs are implemented on GPUs [9], which have numerous execution units and large memory bandwidth to obtain high computational throughput. As a result, both the intensive computations of training and classification are shifted to GPUs. However, for local accelerators, GPU-based implementations are infeasible due to limited hardware resources and tightly bounded energy consumption.

One solution that provides the required DCNN performance within a restricted energy budget is to use hardware implementations such as ASICs and field programmable gate arrays (FPGAs) [10]. Although the state-of-the-art CNN algorithms and designs are rapidly evolving, the low-level operations, such as convolution and pooling, stay the same. Convolutions in CNNs generally dominate the overall computational complexity and consume the majority of computation time and power in real implementations. Therefore, it is promising to develop efficient architectures for DCNNs in order to achieve high computation and energy efficiency.

There are three challenges for hardware-based real-time high-performance CNN implementations: 1) the multiple and large-size feature maps in convolutional layers and the huge amount of parameters, both of which require large storage space; 2) the high complexity of convolutional computations, which greatly slows down the training and inference process; and 3) the limited memory bandwidth, which incurs long data exchange delays. These factors hinder the real-time performance and the widespread deployment of DCNNs, particularly on resource-constrained embedded systems.

Various hardware implementations of different CNN models have been proposed in recent years. Sankaradas et al. [11] presented a massively parallel coprocessor whose functional units consist of parallel two-dimensional (2D) convolution primitives and programmable units performing the pooling and non-linear functions in CNNs. Zhang et al. [12] proposed a hardware/software co-designed library to efficiently accelerate an entire CNN on FPGAs, which employed a uniform convolutional matrix multiplication representation for both convolutional layers and fully connected layers.
Chen et al. [13], Luo et al. [14], and Liu et al. [15] designed general accelerators for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on an accelerator. The three main strategies employed to optimize the memory systems are: 1) tiling and data reuse, for reduction of memory traffic; 2) storage buffers, for data reuse; and 3) on-chip memory, for storage of all parameters [16].

Most prior works focus on high-level parallel architectures and efficient memory designs. However, these architectures seldom reduce the number of computations significantly, especially in convolutions. In fact, 2D convolutions occupy more than 90% of the overall computation time [8]. The latency and energy consumed by data access between external and internal memories lead to low computation throughput and large power consumption. Binary weights were employed in [17] and [18] to reduce both the computational complexity and the storage requirement at the cost of certain accuracy loss.

In this paper, efficient hardware architectures are proposed to process and store DCNN models. The main contributions of this work are:

• The theoretical derivation of the 3-parallel fast finite impulse response (FIR) algorithm (FFA) is presented. Based on the 3-parallel FFA, an efficient 3-parallel Fast Convolution Unit (FCU) for convolutions with 3 × 3 kernels is developed. For convolutions with larger kernels such as 5 × 5 and 7 × 7, we also derive 5-parallel and 7-parallel FFAs and propose their corresponding FCUs.
• With the proposed FCU, multiple outputs can be computed in parallel, which improves the computation throughput. It is estimated that 33% of the multiplications are removed, with slightly increased add operations, by the proposed 3-parallel FCU, and the computational energy is reduced as well. Moreover, multiple FCUs are intelligently organized into an FCU array, which enables efficient data reuse.
• A novel data reuse and storage scheme which avoids the transfer of intermediate data between internal memory and external memory is proposed. DRAM is utilized in a one-way transmission fashion, where only the raw input pictures and weights need to be read out and no data is required to be written back to DRAM. A further on-chip memory reuse scheme is proposed to reduce the memory requirement.
• The equal distance non-uniform quantization (ENQ) method is introduced into our quantization flow, which contributes to the reduction of data width.
• An overall hardware architecture which contains scalable computational logic and storage resources is proposed. Three methods of parallel processing are implemented by the proposed architecture, which considerably exploit the parallel computations of DCNNs.

The rest of this paper is organized as follows. In Section II, we introduce the background of CNNs and the implementation model. In Section III, the fast convolution algorithm is described in detail and the corresponding fast convolution unit is proposed. The novel storage scheme and computation flow are proposed and described in detail in Section IV, and the proposed ENQ scheme is introduced in Section V. Three methods of parallel processing, along with the computation and storage architectures, are presented in Section VI. We analyse the architecture efficiency in terms of computing, storage, and energy in Section VII and compare the implementation results with previous works in Section VIII.
II. BACKGROUND

A CNN can be viewed as the combination of a feature extractor and a category classifier. Three architectural ideas are introduced to ensure some degree of invariance to shift, scale, and distortion: 1) local receptive fields; 2) shared kernel weights; and 3) spatial or temporal sub-sampling (pooling) [5].

CNNs are usually composed of three main types of layers: convolutional (CONV) layers, pooling layers, and fully connected (FC) layers. Some newer types of layers have been developed, such as normalization and dropout layers. A CNN model usually has a feed-forward design, where each layer takes the output of a preceding layer as its input and produces the results for the next layer. Normally, pooling layers follow convolutional layers, and fully connected layers are the last few layers. Fig. 1 shows an example of a CNN design.

Fig. 1. Design of a CNN model.

A. Convolutional Layers

Convolutional layers are the major parts of a CNN. Each output pixel is connected to only a local region of the input layer, and the extent of this connectivity is called the receptive field. The connections are local in space (along length and width), but always extend along the entire channel of the input layer. The receptive fields overlap along both the height and width by a certain stride (usually 1) in an input feature map, and the kernel weights are shared among them. The convolution in a CNN is a 2D operation, where shared kernel weights are multiplied with the corresponding receptive field in an element-by-element way. These products are summed together with an additional bias.

Usually, an input layer contains multiple feature maps. The convolution results of all input feature maps are added together to get an output feature map. Thus the kernels are extended to 3 dimensions, where each 2D kernel corresponds to an input feature map. The pixel y at location (x, y) in output feature map n is given by:

sum = \sum_{i=0}^{N_i-1} \sum_{j=0}^{K_y-1} \sum_{k=0}^{K_x-1} w^{(n)}[i][j][k] \times in[i][x+j][y+k],
y^{(n)}[x][y] = f(sum + b[n]),
where w^{(n)} and N_i represent the kernel weights and the number of input feature maps, respectively. K_y and K_x denote the height and width of the convolutional kernel, and b[n] is the bias added to output feature map n. f represents the activation function, which could be sigmoid, tanh, or ReLU.
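For readers who prefer code, the following sketch evaluates the expression above for a single output pixel. It is a minimal NumPy illustration: the tensor layout, the toy sizes, and the choice of ReLU for f are assumptions, not part of the proposed hardware.

```python
import numpy as np

def conv_pixel(inp, w_n, b_n, x, y):
    """Output pixel (x, y) of output feature map n, following the formula above.

    inp : (N_i, H, W) input feature maps
    w_n : (N_i, K_y, K_x) kernel weights of output feature map n
    b_n : scalar bias b[n]
    """
    n_i, k_y, k_x = w_n.shape
    acc = 0.0
    for i in range(n_i):              # sum over input feature maps
        for j in range(k_y):          # kernel height
            for k in range(k_x):      # kernel width
                acc += w_n[i, j, k] * inp[i, x + j, y + k]
    return max(acc + b_n, 0.0)        # f taken to be ReLU here

# toy check against a vectorized formulation of the same receptive field
inp = np.random.rand(3, 8, 8)
w_n = np.random.rand(3, 3, 3)
ref = max((w_n * inp[:, 2:5, 4:7]).sum() + 0.1, 0.0)
assert np.isclose(conv_pixel(inp, w_n, 0.1, 2, 4), ref)
```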
B. Pooling Layers

Pooling layers aim at simplifying and smoothing the preceding feature maps. Usually, a pooling layer follows a convolutional layer. Two types of pooling operations are commonly employed: average and max pooling. Average pooling computes the mean of a small local field in each input feature map, while max pooling picks the maximum value of the local field to output a pixel. For a pooling layer, the number of output feature maps equals that of the input feature maps.
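A compact sketch of the two pooling variants described above follows; the 2 × 2 window with stride 2 and the NumPy formulation are assumed only for illustration.

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """2x2 pooling with stride 2 on one (H, W) feature map."""
    h, w = fmap.shape
    blocks = fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    # max pooling keeps the largest value of each local field,
    # average pooling keeps the mean of each local field
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(fmap, "max"))
print(pool2x2(fmap, "avg"))
```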
C. Fully Connected Layers

Fully connected layers are placed at the end of a CNN design and perform as classifiers. They are usually flattened to one dimension and act in the form of regular neural networks, where every neuron has full connections to all neurons in the previous layer.
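Functionally, such a layer is a matrix–vector product over the flattened input plus a bias; the sketch below (the sizes and the ReLU activation are illustrative assumptions) makes the full connectivity explicit.

```python
import numpy as np

def fc_layer(x_flat, weights, bias):
    """Fully connected layer: every output neuron sees every input neuron."""
    return np.maximum(weights @ x_flat + bias, 0.0)   # ReLU assumed as activation

x_flat = np.random.rand(4096)            # flattened feature maps from the previous layer
weights = np.random.rand(1000, 4096)     # one weight row per output neuron
scores = fc_layer(x_flat, weights, np.zeros(1000))
```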
D. Implementation Model

The VGG model [19], proposed by Simonyan and Zisserman, achieved second place in the image classification task of ILSVRC 2014. VGG models are a series of networks with deep layers that use very small (3 × 3) convolutional kernels. It was shown that a significant improvement in classification can be achieved by pushing the depth to 16–19 layers. In the general case, VGG models consist of multiple convolutional layers, 4 max-pooling layers, and 3 fully connected layers.

Only 3 × 3 kernels are employed in the VGG model, so fewer weights are needed and higher accuracy is obtained. The VGG model branches out to VGG11, VGG16, and VGG19, depending on the total number of layers. In this paper, the VGG16 model is considered due to its relative balance between complexity and accuracy.
III. FAST CONVOLUTION ALGORITHM

A. Parallel Fast FIR Algorithm

An FIR filter with N taps can be expressed as:

y(n) = h(n) * x(n) = \sum_{i=0}^{N-1} h(i) x(n-i),  n = 0, 1, 2, \ldots, \infty,  (1)

where x(n) is an infinite input sequence and h(n) contains the coefficients of an N-length FIR filter. The convolution shown in Eq. (1) can be expressed in the z domain:

Y(z) = H(z) X(z) = \left( \sum_{n=0}^{N-1} h(n) z^{-n} \right) \left( \sum_{n=0}^{\infty} x(n) z^{-n} \right).  (2)

If the input sequence x(n) is decomposed into even and odd indexed samples, we have X(z) = X_0 + z^{-1} X_1. Similarly, H(z) = H_0 + z^{-1} H_1, and the output is:

Y(z) = Y_0 + z^{-1} Y_1 = (X_0 + z^{-1} X_1)(H_0 + z^{-1} H_1),  (3)

where

Y_0 = H_0 X_0 + z^{-2} H_1 X_1,
Y_1 = H_1 X_0 + H_0 X_1.  (4)

Based on the fast FIR algorithm (FFA) [20], Y_0 and Y_1 can be expressed in the form of the 2-parallel FFA:

Y_0 = H_0 X_0 + z^{-2} H_1 X_1,
Y_1 = (H_0 + H_1)(X_0 + X_1) - H_0 X_0 - H_1 X_1.  (5)

Similarly, the input sequence x(n) and tap coefficients h(n) can be decomposed into three parts as follows:

Y = Y_0 + z^{-1} Y_1 + z^{-2} Y_2
  = (X_0 + z^{-1} X_1 + z^{-2} X_2)(H_0 + z^{-1} H_1 + z^{-2} H_2)
  = (X_0 + z^{-1} V)(H_0 + z^{-1} W),  (6)

where V = X_1 + z^{-1} X_2 and W = H_1 + z^{-1} H_2. The last equation of Eq. (6) and its intermediate term VW are both in the 2-parallel FIR filter form and can be computed by employing Eq. (5) recursively [20]. Thus, the 3-parallel FFA is computed as follows:

Y_0 = H_0 X_0 - z^{-3} H_2 X_2 + z^{-3} [(H_1 + H_2)(X_1 + X_2) - H_1 X_1],
Y_1 = [(H_0 + H_1)(X_0 + X_1) - H_1 X_1] - [H_0 X_0 - z^{-3} H_2 X_2],
Y_2 = [(H_0 + H_1 + H_2)(X_0 + X_1 + X_2)] - [(H_0 + H_1)(X_0 + X_1) - H_1 X_1] - [(H_1 + H_2)(X_1 + X_2) - H_1 X_1].  (7)

For parallel FFAs of larger prime sizes (e.g., 5-parallel and 7-parallel FFAs), we have already derived their expressions, but we do not present them in this paper because the derivation process is relatively complex compared to the 3-parallel FFA. It should be noted that large kernels (e.g., 5 × 5 and 7 × 7) are indeed employed in some CNN models, such as [7] and [21]. Parallel FFAs whose sizes are composite numbers can be obtained by cascading short-term FFAs [20], [22].

Parallel FFAs are essentially parallel computing algorithms. A large-size filter is decomposed into several small sub-filters, each of which performs short convolutions. For a filter of size P, where P is a prime number such as 3 or 5, the filter can be decomposed into P sub-filters. For the presented parallel FFAs, the convolution of each sub-filter becomes a multiplication. These partial convolution results are added in specific combinations to compute several output results in parallel.

Reference [23] introduced Winograd's minimal filtering algorithms to minimize multiplications in small-size FIR filters. However, more additions and complex pre-computations are required, which is unfriendly to hardware implementation. Moreover, it is difficult to use Winograd's minimal filtering algorithms to derive the expressions of large-tap filters with few additions.
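As a numerical sanity check of Eq. (7), the sketch below reconstructs a full convolution from the six sub-filter products of the 3-parallel FFA and compares it against a direct convolution. This is a standalone NumPy verification of the algebra under assumed random data, not a model of the FCU hardware; for single-tap sub-filters (the 3 × 3 kernel case), the six products correspond to 6 multiplications per 3 outputs instead of 9, matching the 33% reduction claimed earlier.

```python
import numpy as np

def ffa3(h, x):
    """Compute h * x through the 3-parallel FFA of Eq. (7).

    h and x are 1-D arrays whose lengths are multiples of 3 (zero-pad otherwise).
    """
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]     # polyphase components H_0, H_1, H_2
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]     # polyphase components X_0, X_1, X_2

    c = np.convolve                             # a sub-filter product such as H_0 X_0
    d = lambda p: np.concatenate(([0.0], p))    # multiplication by z^{-3}

    def acc(*terms):                            # align lengths, then sum the terms
        n = max(len(t) for t in terms)
        return sum(np.pad(t, (0, n - len(t))) for t in terms)

    m1, m2, m3 = c(h0, x0), c(h2, x2), c(h1, x1)
    m4 = c(h1 + h2, x1 + x2)
    m5 = c(h0 + h1, x0 + x1)
    m6 = c(h0 + h1 + h2, x0 + x1 + x2)          # six products in total

    y0 = acc(m1, -d(m2), d(m4 - m3))            # Eq. (7), first line
    y1 = acc(m5 - m3, -m1, d(m2))               # Eq. (7), second line
    y2 = acc(m6, -(m5 - m3), -(m4 - m3))        # Eq. (7), third line

    y = np.zeros(3 * max(len(y0), len(y1), len(y2)))
    y[0::3][:len(y0)] = y0                      # interleave Y_0, Y_1, Y_2 back into y(n)
    y[1::3][:len(y1)] = y1
    y[2::3][:len(y2)] = y2
    return y

h, x = np.random.rand(9), np.random.rand(12)
y_ref = np.convolve(h, x)
assert np.allclose(ffa3(h, x)[:len(y_ref)], y_ref)
```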
TABLE I
COMPARISONS OF ARITHMETIC OPERATIONS

Fig. 7. Further reuse of on-chip memories for the architecture of the VGG16 data flow.

segment of RAM j instead of RAM i for layer i. Therefore, only if the block RAMs have the same storage design can they be reused. Once two segments of 1 or 2 block RAMs are labelled as "1" and both segments store the row data of the same layer, the computation is triggered and the output T_r/2 rows of pixel data are written to a segment which is labelled as "0". When the computation is completed, one of the segments whose stored pixel data has been computed twice is labelled as "0" and can be overwritten in further computations.
For the VGG16 model, the RAMs for the corresponding layers that can be reused under the two different memory reuse schemes are marked in Fig. 7. The fully reused block RAMs are linked by solid arrows. These convolutional layers share the same block RAM, whose size is determined by the largest one. By coincidence, in VGG16 these block RAMs are of the same size except for the last one. The partial reuse among block RAMs is marked with dashed arrows. On the basis of the inter-layer partial storage and intra-layer ping-pong reuse scheme for the VGG16 implementation, the further on-chip memory reuse method can reduce about 20% of the block memories without any overhead.

For the VGG16 implementation, in order to avoid the whole intermediate pixel data transfer with regular layer-wise

V. QUANTIZATION METHOD

In a CNN implementation, quantization is necessary to reduce the computation and storage bit widths while preserving the recognition accuracy. Previous works focus on the quantization of kernel weights, and the quantization of pixel data is rarely discussed. The equal distance non-uniform quantization (ENQ) method is proposed in this section.

It was shown in [26] that the distributions of the numerical values of both pixels and weights in most layers are roughly Gaussian. For the VGG16 model, the distribution of pixel values in each convolutional layer is Gaussian-like as well [27].

A. Quantization Flow

The quantization method maps pixel values to a set of quantization points (QPs), where each QP corresponds to a fixed-point value. Let {P_0, P_1, ..., P_{N-1}} denote a set of QPs with N elements, and let V_i denote the corresponding fixed-point value associated with P_i (i = 0, 1, ..., N-1). A pixel value s is quantized to a QP:

s → P_k,  k = 0, 1, ..., N-1.  (9)

It requires K = log_2 N bits to store a quantized pixel value.
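A minimal sketch of the mapping in Eq. (9) is given below. The nearest-value selection rule, the number of QPs, and the uniform point placement used in the toy example are assumptions for illustration; the actual ENQ point placement for VGG16 is summarized in Table II.

```python
import numpy as np

def quantize_pixel(s, v):
    """Map a pixel value s to the index k of its nearest quantization point P_k.

    v holds the fixed-point values V_i associated with the QPs; only the
    K = log2(len(v))-bit index k has to be stored for each pixel.
    """
    return int(np.argmin(np.abs(v - s)))

# toy example with N = 8 QPs (K = 3 bits per stored pixel); here the points
# are placed uniformly (V_i = i * 2^-F with an assumed F = 4), which is the
# uniform baseline rather than the non-uniform ENQ layout
F = 4
v = np.arange(8) * 2.0 ** (-F)
k = quantize_pixel(0.20, v)
print(k, v[k])       # stored index and the fixed-point value used in computation
```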
From [27], it can be concluded that most pixel values in a layer are relatively small. It is therefore reasonable to assign more QPs to represent the smaller pixels without incurring an obvious degradation of accuracy. The proposed quantization scheme is described as follows:

• For uniform quantization, each pixel is quantized to q bits (2^q QPs in total) and each weight is quantized to q_w bits. All pixel values are non-negative because we store pixels after the ReLU operation. Let F represent the number of fractional bits in the q-bit uniform quantization. Therefore, V_i = i 2^{-F}, i = 0, 1, ..., 2^q - 1. A and A' represent the accuracy with floating-point and q-bit uniform quantization, respectively. The minimal value of q is calculated so that A - A' ≤ δ,
TABLE II
ENQ RESULT OF VGG16 MODEL
layer, the computation time of the fully connected layer can be saved.

We combine the above three parallel processes together to obtain the maximum degree of parallelism and improve the throughput of our design considerably.

The critical path is restricted by the convolution operations in our architecture. Due to the feed-forward hardware design of the convolutional layer, multiple pipeline stages can be inserted to reduce the critical path. In our case, a three-stage pipeline is employed in the design.

VII. EFFICIENCY ANALYSIS

A. Computing Efficiency Analysis

The massive computations are concentrated in the convolutional layers, and the performance of the proposed system is evaluated in this section.
is the number of feature maps of the input layer. We divide
For CONV layer, the required number of computation j
L in by k because every k pixels in a row are packaged and
cycles of layer i with the proposed architecture can be cal-
stored, so k pixels are read out every cycle.
culated by the following formula:
Moreover, the computation cycles of a whole layer are:
C O N Vi L iin n iin ni j
Ncycles = × × out , i = 1, 2, . . . , pooing j pooing j 2Win
k P P Ncycles_t ot al = Ncycles × , j = 3, 6, . . . , (14)
Tr
(10)
j
where Tr denotes the row-tiling factor and Win represents the
where P denotes the degree of parallelism, k is the kernel
width of input feature map, respectively.
size, L iin represents the length of input feature map, and n iin
Comparing Eq. (10) with Eq. (13), their first two terms look
and n iout are the numbers of feature maps (channels) of input
similar. However, in VGG16 model, the product of the first two
and output layers repectively. n iout
The above equation handles the computations of Tr rows in terms remains equal and the last term of Eq. (10) P >> 1,
pooing
input layer to T2r rows in output layer. The total computation C O N Vi
so Ncycles_t ot al >> Ncycles
j
and the pooling time can be
cycles of a whole input layer to an output layer is shown in saved if inter-layer parallel processing is applied.
the following equation: For fully connected layer, the number of computation
cycles is estimated in the following equations:
2Wini
C O N Vi C O N Vi
Ncycles_t ot al = Ncycles × , (11) L l nlin nlout
Tr FC
Ncycles = , (15)
Pl
i i represent
where Tr denotes the row-tiling factor, Win and Wout where L l denotes the length of input feature map, nlin and nlout
the width of input feature map and output feature map, represent the channels of input layer and output layer, and P l
respectively. is the degree of parallelism of fully connected layer.
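As a worked instance of Eqs. (10) and (11), the short sketch below evaluates the cycle count for one hypothetical convolutional layer; the layer dimensions and the values of k, P, and T_r are assumptions chosen only to exercise the formulas, not the actual design parameters.

```python
def conv_cycles_per_tile(L_in, n_in, n_out, k, P):
    """Eq. (10): cycles to turn T_r input rows into T_r/2 output rows."""
    return (L_in / k) * (n_in / P) * (n_out / P)

def conv_cycles_total(L_in, n_in, n_out, W_in, k, P, Tr):
    """Eq. (11): cycles for a whole convolutional layer."""
    return conv_cycles_per_tile(L_in, n_in, n_out, k, P) * (2 * W_in / Tr)

# hypothetical 224x224 layer with 64 input and 64 output channels,
# k = 3 (VGG-style kernels) and assumed P = 7, T_r = 14
print(conv_cycles_total(L_in=224, n_in=64, n_out=64, W_in=224, k=3, P=7, Tr=14))
```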
Since pixels are stored in PRAMs, no pixel needs to be sent to off-chip DRAM, so there is no loading and unloading time for feature maps.

Each feature map in an input convolutional layer is reused P times. It is emphasized that the same reused input feature map corresponds to different kernels for different output feature maps. In our architecture, each PU is fed with k × k weights and we employ P CPs, each of which contains P PUs, so the total number of weights fed to the overall architecture is k^2 P^2 in every computing phase (a computing phase is defined as the computing process of P feature maps). The kernel weights do not need to be updated until a computing phase is completed, so the minimum number of clock cycles during which the same batch of kernel weights can stay unchanged is \min\{L_{in}^i / k\}. The clock frequency is set to f and each weight is quantized to Q bits. The minimum refresh frequency of the kernel weights is f / \min\{L_{in}^i / k\}, and the minimum bandwidth to transmit kernel weights from external DRAM is calculated in the following equation:

BW = \frac{k^2 P^2 Q f}{\min\{L_{in}^i / k\}}.  (12)

For a pooling layer, the number of computation clocks is much less than that of the convolutional layers. In the 2 × 2 max pooling pattern, each input feature map is cut down to a quarter of its original size. The computation clocks of T_r/2 rows in pooling layer j are calculated as follows:

N_{cycles}^{pooling_j} = \frac{L_{in}^j}{k} \times \frac{n_{in}^j}{P},  (13)

where P denotes the degree of parallelism, k is the kernel size, L_{in}^j represents the length of the input feature map, and n_{in}^j is the number of feature maps of the input layer. We divide L_{in}^j by k because every k pixels in a row are packaged and stored together, so k pixels are read out every cycle.

Moreover, the computation cycles of a whole layer are:

N_{cycles\_total}^{pooling_j} = N_{cycles}^{pooling_j} \times \frac{2 W_{in}^j}{T_r},  j = 3, 6, \ldots,  (14)

where T_r denotes the row-tiling factor and W_{in}^j represents the width of the input feature map.

Comparing Eq. (10) with Eq. (13), their first two terms look similar. However, in the VGG16 model, the product of the first two terms remains equal, while the last term of Eq. (10), n_{out}^i / P \gg 1, so N_{cycles\_total}^{CONV_i} \gg N_{cycles}^{pooling_j}, and the pooling time can be saved if inter-layer parallel processing is applied.

For a fully connected layer, the number of computation cycles is estimated by the following equations:

N_{cycles}^{FC} = \frac{L^l n_{in}^l n_{out}^l}{P^l},  (15)

where L^l denotes the length of the input feature map, n_{in}^l and n_{out}^l represent the channels of the input and output layers, and P^l is the degree of parallelism of the fully connected layer.

Similarly, the total computation cycles of a whole FC layer are shown in Eq. (16):

N_{cycles\_total}^{FC} = \frac{W L^l n_{in}^l n_{out}^l}{P^l},  (16)

where W denotes the width of the input layer.

The number of interval clocks between two partial outputs of the FC layers is \sum_i N_{cycles}^{CONV_i} (note that N_{cycles}^{CONV_i} can be repeated several times), and in general cases N_{cycles}^{FC} is smaller than \sum_i N_{cycles}^{CONV_i}. Therefore, once inter-layer parallel processing is utilized, the computation time of the fully connected layer can be saved as well.
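The remaining estimates of Eqs. (12)–(16) can be evaluated the same way. In the sketch below, the weight width Q, the parallelism factors, and the layer sizes are illustrative assumptions (only the 170-MHz clock is taken from the reported VC707 operating frequency).

```python
def weight_bandwidth(k, P, Q, f_hz, L_in_min):
    """Eq. (12): minimum DRAM bandwidth for kernel weights, in bits per second."""
    return (k ** 2 * P ** 2 * Q * f_hz) / (L_in_min / k)

def pooling_cycles_total(L_in, n_in, W_in, k, P, Tr):
    """Eqs. (13) and (14): cycles for a whole pooling layer."""
    return (L_in / k) * (n_in / P) * (2 * W_in / Tr)

def fc_cycles_total(L, n_in, n_out, W, P_l):
    """Eq. (16): cycles for a whole FC layer (drop the factor W for Eq. (15))."""
    return W * L * n_in * n_out / P_l

# hypothetical numbers: 3x3 kernels, assumed P = 7, 16-bit weights, 170-MHz clock,
# and a smallest feature-map row length of 14 pixels
print(weight_bandwidth(k=3, P=7, Q=16, f_hz=170e6, L_in_min=14))
print(pooling_cycles_total(L_in=112, n_in=128, W_in=112, k=3, P=7, Tr=14))
print(fc_cycles_total(L=7, n_in=512, n_out=4096, W=7, P_l=64))
```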
B. Storage and Energy Analysis

For embedded FPGA platforms, due to limited internal memory resources, it is almost impossible to place a whole large DCNN model on chip. If intermediate pixel data transfer
TABLE IV
COMPARISON WITH OTHER WORKS

Compared to regular CNN architectures, the advantages of the proposed architectures are as follows:

1) The proposed FCUs are computationally energy efficient because the number of multiplications is reduced. Based on the synthesis results from Design Compiler, the power of a 16-bit multiplier is approximately 20 times that of a 16-bit adder, and the power of an FCU is dominated by its multipliers. It is estimated that 33% of the computational energy in regular convolutions with a 3 × 3 kernel can be saved by applying 3-parallel FCUs. Moreover, due to the reduction of multiplications, the degree of computational parallelism can be improved on embedded systems.
2) A novel data storage and reuse scheme is proposed. Data transfer is significantly reduced and restricted to data reads of input pictures and weights. Since write transfers are avoided, we achieve a high degree of parallelism and an outstanding performance in our design. A further on-chip memory reuse method is also developed, which further reduces the on-chip memory requirement without any overhead.
3) The ENQ method is employed in the quantization process, and it considerably reduces the number of stored bits of pixel data, which further saves on-chip storage and energy.

IX. CONCLUSION

In this paper, we focus on efficient hardware designs for CNN implementations. Convolutions dominate the computational complexity, while the latency of data transfer between external and internal memories is the main bottleneck of performance improvement. The fast FIR algorithm (FFA) is introduced, and Fast Convolution Units (FCUs) are developed based on FFAs. By employing the parallel FCUs in the convolutions of CNN models, the computational complexity and energy are cut down significantly. A data storage and reuse scheme is specifically proposed for data arrangement, which avoids the intermediate pixel data transfer between on-chip and off-chip memories; as a result, the throughput is largely improved. The ENQ method is also introduced to further reduce the power and memory requirements. The VGG16 model is implemented on both the Xilinx Zynq ZC706 and Virtex VC707 platforms, and we achieve frame rates of 8.76 FPS and 33.80 FPS with average performances of 316.23 GOP/s and 1250.21 GOP/s, respectively.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun. (2015). "Deep residual learning for image recognition." [Online]. Available: https://arxiv.org/abs/1512.03385
[2] A. Mollahosseini, D. Chan, and M. H. Mahoor, "Going deeper in facial expression recognition using deep neural networks," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2016, pp. 1–10.
[3] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. (2013). "OverFeat: Integrated recognition, localization and detection using convolutional networks." [Online]. Available: https://arxiv.org/abs/1312.6229
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[6] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 22, no. 10, pp. 1533–1545, Oct. 2014. [Online]. Available: http://dx.doi.org/10.1109/TASLP.2014.2339736
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[8] H. Nakahara and T. Sasao, "A deep convolutional neural network based on nested residue number system," in Proc. 25th Int. Conf. Field Program. Logic Appl. (FPL), Sep. 2015, pp. 1–6.
[9] D. Strigl, K. Kofler, and S. Podlipnig, "Performance and scalability of GPU-based convolutional neural networks," in Proc. 18th Euromicro Conf. Parallel, Distrib. Netw.-Based Process., Feb. 2010, pp. 317–324.
[10] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, "Design space exploration of FPGA-based deep convolutional neural networks," in Proc. 21st Asia South Pacific Design Autom. Conf. (ASP-DAC), Jan. 2016, pp. 575–580.
[11] M. Sankaradas et al., "A massively parallel coprocessor for convolutional neural networks," in Proc. 20th IEEE Int. Conf. Appl.-Specific Syst., Archit. Process. (ASAP), Jul. 2009, pp. 53–60.
[12] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, "Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2016, pp. 1–8.
[13] T. Chen et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," ACM SIGPLAN Notices, vol. 49, no. 1, pp. 269–284, Mar. 2014. [Online]. Available: http://doi.acm.org/10.1145/2644865.2541967
[14] T. Luo et al., "DaDianNao: A machine-learning supercomputer," in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Dec. 2014, pp. 609–622.
[15] D. Liu et al., "PuDianNao: A polyvalent machine learning accelerator," ACM SIGARCH Comput. Archit. News, vol. 43, no. 1, pp. 369–381, Mar. 2015.
[16] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Programm. Gate Arrays (FPGA), New York, NY, USA, 2016, pp. 26–35.
[17] M. Courbariaux, Y. Bengio, and J.-P. David. (2015). "BinaryConnect: Training deep neural networks with binary weights during propagations." [Online]. Available: https://arxiv.org/abs/1511.00363
[18] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), Jul. 2016, pp. 236–241.
[19] K. Simonyan and A. Zisserman. (2014). "Very deep convolutional networks for large-scale image recognition." [Online]. Available: https://arxiv.org/abs/1409.1556
[20] D. A. Parker and K. K. Parhi, "Area-efficient parallel FIR digital filter implementations," in Proc. Int. Conf. Appl. Specific Syst., Archit. Process. (ASAP), Aug. 1996, pp. 93–111.
[21] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[22] C. Cheng and K. K. Parhi, "Hardware efficient fast parallel FIR filter structures based on iterated short convolution," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 8, pp. 1492–1500, Aug. 2004.
[23] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Aug. 2016, pp. 4013–4021.
[24] Y. H. Chen, T. Krishna, J. Emer, and V. Sze, "14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Jan. 2016, pp. 262–263.
[25] M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer CNN accelerators," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2016, pp. 1–12.
[26] D. D. Lin, S. S. Talathi, and V. S. Annapureddy. (2015). "Fixed point quantization of deep convolutional networks." [Online]. Available: https://arxiv.org/abs/1511.06393
[27] F. Sun, J. Lin, and Z. Wang, "Intra-layer nonuniform quantization of convolutional neural network," in Proc. 8th Int. Conf. Wireless Commun. Signal Process. (WCSP), Oct. 2016, pp. 1–5.
[28] J. Tonfat and R. Reis, "Low power 3–2 and 4–2 adder compressors implemented using ASTRAN," in Proc. IEEE 3rd Latin Amer. Symp. Circuits Syst. (LASCAS), Feb./Mar. 2012, pp. 1–4.
[29] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s mobile coprocessor for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2014, pp. 682–687.
[30] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Programm. Gate Arrays, 2015, pp. 161–170.
[31] S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 243–254.

Jun Lin received the B.S. degree in physics and the M.S. degree in microelectronics from Nanjing University, Nanjing, China, in 2007 and 2010, respectively, and the Ph.D. degree in electrical engineering from Lehigh University, Bethlehem, in 2015. From 2010 to 2011, he was an ASIC Design Engineer with AMD. During summer 2013, he was an Intern with Qualcomm Research, Bridgewater, NJ, USA. In 2015, he joined the School of Electronic Science and Engineering, Nanjing University, where he is currently an Associate Research Professor. His current research interests include low-power high-speed VLSI design, specifically VLSI design for digital signal processing, cryptography, and deep learning.

Dr. Lin is a member of the Design and Implementation of Signal Processing Systems Technical Committee of the IEEE Signal Processing Society. He was a co-recipient of the Merit Student Paper Award at the IEEE Asia Pacific Conference on Circuits and Systems in 2008. He was a recipient of the 2014 IEEE Circuits and Systems Society Student Travel Award.