
Efficient Hardware Architectures for Deep Convolutional Neural Network

Jichen Wang, Jun Lin, Member, IEEE, and Zhongfeng Wang, Fellow, IEEE

Abstract—Convolutional neural network (CNN) is the state-of-the-art deep learning approach employed in various applications. Real-time CNN implementations in resource-limited embedded systems have recently become highly desirable. To ensure programmable flexibility and shorten the development period, the field programmable gate array is an appropriate platform to implement CNN models. However, the limited bandwidth and on-chip memory storage are the bottlenecks of CNN acceleration. In this paper, we propose efficient hardware architectures to accelerate deep CNN models. The theoretical derivation of the parallel fast finite impulse response algorithm (FFA) is introduced. Based on FFAs, the corresponding fast convolution units (FCUs) are developed for the computation of convolutions in CNN models. Novel data storage and reuse schemes are proposed, where all intermediate pixels are stored on-chip and the bandwidth requirement is reduced. We choose one of the largest and most accurate networks, VGG16, and implement it on the Xilinx Zynq ZC706 and Virtex VC707 boards, respectively. We achieve a top-5 accuracy of 86.25% using an equal distance non-uniform quantization method. It is estimated that the average performances are 316.23 GOP/s under a 172-MHz working frequency on the Xilinx ZC706 and 1250.21 GOP/s under a 170-MHz working frequency on the VC707, respectively. In brief, the proposed design outperforms existing works significantly, in particular surpassing related designs by more than two times in terms of resource efficiency.

Index Terms—Convolutional neural network (CNN), field programmable gate array (FPGA) platform, fast FIR algorithm, on-chip data storage scheme.

Manuscript received January 26, 2017; revised June 11, 2017; accepted October 13, 2017. This work was supported in part by the National Natural Science Foundation of China under Grant 61774082 and Grant 61604068 and in part by the Fundamental Research Funds for the Central Universities under Grant 021014380065. This paper was recommended by Associate Editor Y. Ha. (Corresponding authors: Jun Lin; Zhongfeng Wang.) The authors are with the School of Electronic Science and Engineering, Nanjing University, Nanjing 210008, China (e-mail: jcwang@smail.nju.edu.cn; jlin@nju.edu.cn; zfwang@nju.edu.cn). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSI.2017.2767204

I. INTRODUCTION

CONVOLUTIONAL Neural Network (CNN) has been proven to be a powerful tool in domains of various tasks including object recognition [1], [2], detection [3], [4], and speech recognition [5], [6]. Real-time image recognition could be applied everywhere using CNN-like deep learning approaches in the near future. However, most CNN models are trained and implemented on software platforms due to their versatility. For mobility and privacy reasons, real-time, high-accuracy, and low-power image and voice recognition applications based on CNNs should be able to run on local embedded processors. A CNN achieves satisfactory recognition accuracy as long as the network is deep enough (a large number of layers), which makes it a deep Convolutional Neural Network (DCNN). Recent DCNNs, which consist of tens to hundreds of convolutional layers, have shown great promise in visual understanding [7].

Since existing systems with general purpose processors have not been optimized for DCNNs, the acceleration of DCNNs should be carefully investigated for real-time embedded systems [8]. Most software based CNNs are implemented on GPUs [9], which have numerous execution units and large memory bandwidth to obtain high computational throughput. As a result, both the intensive computations of training and classification are shifted to GPUs. However, for local accelerators, GPU based implementations are infeasible due to limited hardware resources and tightly bounded energy consumption.

One of the solutions to provide the required performance of DCNNs within the restricted energy budget is to use hardware implementations such as ASICs and field programmable gate arrays (FPGAs) [10]. Although the state-of-the-art CNN algorithms and designs are rapidly evolving, the low level operations stay the same, such as convolution and pooling. Convolutions in CNNs generally dominate the overall computational complexity and consume the major computation time and power in real implementations. Therefore, it is promising to develop efficient architectures for DCNNs in order to achieve high computation and energy efficiency.

There are three challenges of hardware based real-time high performance CNN implementations: 1) multiple and large-size feature maps in convolutional layers and the huge amount of parameters, both of which require large storage space; 2) the high complexity of convolutional computations, which greatly slows down the training and inference process; 3) the limited memory bandwidth, which incurs long data exchange delays. These factors hinder the real-time performance and the widespread deployment of DCNNs, particularly on resource constrained embedded systems.

Various hardware implementations of different CNN models have been proposed in recent years. Sankaradas et al. [11] presented a massively parallel coprocessor, whose functional units consist of parallel two-dimensional (2D) convolution primitives and programmable units performing the pooling and non-linear functions in CNNs. Zhang et al. [12] proposed a hardware/software co-designed library to efficiently accelerate an entire CNN on FPGAs, which employed a uniformed convolutional matrix multiplication representation for both convolutional layers and fully connected layers.
Chen et al. [13], Luo et al. [14], and Liu et al. [15] designed general accelerators for large scale CNNs and DNNs, with a special emphasis on the impact of memory on an accelerator. The three main strategies employed to optimize their memory systems are: 1) tiling and data reuse, for reduction of memory traffic; 2) storage buffers, for data reuse; 3) on-chip memory, for storage of all parameters [16].

Most prior works focus on high level parallel architectures and efficient memory designs. However, these architectures seldom reduce the number of computations significantly, especially in convolutions. In fact, 2D convolutions occupy more than 90% of the overall computation time [8]. The latency and energy consumed by data accesses between external and internal memories lead to low computation throughput and large power consumption. Binary weights were employed in [17] and [18] to reduce both the computational complexity and the storage requirement at the cost of a certain accuracy loss.

In this paper, efficient hardware architectures are proposed to process and store DCNN models. The main contributions of this work are:
• The theoretical derivation of the 3-parallel fast finite impulse response (FIR) algorithm (FFA) is presented. Based on the 3-parallel FFA, an efficient 3-parallel Fast Convolution Unit (FCU) for convolutions with 3 × 3 kernels is developed. For convolutions with larger kernels such as 5 × 5 and 7 × 7, we also derive 5-parallel and 7-parallel FFAs and propose their corresponding FCUs.
• With the proposed FCU, multiple outputs can be computed in parallel, which improves the computation throughput. It is estimated that 33% of the multiplications are removed at the cost of a slightly increased number of additions by the proposed 3-parallel FCU, and the computational energy is reduced as well. Moreover, multiple FCUs are intelligently organized into an FCU array, which enables efficient data reuse.
• A novel data reuse and storage scheme is proposed which avoids the transfer of intermediate data between internal and external memory. DRAM is utilized in a one-way transmission fashion, where only the raw input pictures and weights need to be read out and no data is required to be written back to DRAM. A further on-chip memory reuse scheme is proposed to reduce the memory requirement.
• The equal distance non-uniform quantization (ENQ) method is introduced into our quantization flow, which contributes to the reduction of data width.
• An overall hardware architecture containing scalable computational logic and storage resources is proposed. Three methods of parallel processing are implemented by the proposed architecture, which considerably exploit the parallel computations of DCNNs.

The rest of this paper is organized as follows. In Section II, we introduce the background of CNNs and the implementation model. In Section III, the fast convolution algorithm is described in detail and the corresponding fast convolution unit is proposed. The proposed ENQ scheme is introduced in Section V. The novel storage scheme and computation flow are proposed and described in detail in Section IV. Three methods of parallel processing, along with the computation and storage architectures, are presented in Section VI. We analyse the architecture efficiency in the aspects of computing, storage, and energy in Section VII and compare the implementation results with previous works in Section VIII.

II. BACKGROUND

A CNN can be viewed as the combination of a feature extractor and a category classifier. Three architectural ideas are introduced to ensure invariance to some degree of shift, scale, and distortion: 1) local receptive fields; 2) shared kernel weights; 3) spatial or temporal sub-sampling (pooling) [5].

CNNs are usually composed of three main types of layers: convolutional (CONV) layers, pooling layers, and fully connected (FC) layers. Some new types of layers have been developed, such as normalization and dropout layers. A CNN model usually has a feed-forward design, where each layer takes the output of a preceding layer as its input and produces the results for the next layer. Normally, pooling layers follow convolutional layers and fully connected layers are the last few layers. Fig. 1 shows an example of a CNN design.

Fig. 1. Design of a CNN model.

A. Convolutional Layers

Convolutional layers are the major parts of a CNN. Each output pixel is connected to only a local region of the input layer, and the extent of this connectivity is called the receptive field. The connections are local in space (along length and width), but always extend along the entire channel of the input layer. The receptive fields overlap along both the height and the width by a certain stride (usually 1) in an input feature map, and the kernel weights are shared among them. The convolution in a CNN is a 2D operation, where shared kernel weights are multiplied with the corresponding receptive field in an element by element way. These products are summed together with an additional bias.

Usually, an input layer contains multiple feature maps. The convolution results of all input feature maps are added together to get an output feature map. Thus the kernels are extended to 3 dimensions, where each 2D kernel corresponds to an input feature map. The pixel at location (x, y) in output feature map n is given by:

sum = \sum_{i=0}^{N_i-1} \sum_{j=0}^{K_y-1} \sum_{k=0}^{K_x-1} w^{(n)}[i][j][k] \times in[i][x+j][y+k],
y^{(n)}[x][y] = f(sum + b[n]),
where w^{(n)} and N_i represent the kernel weights and the number of input feature maps, respectively. K_y and K_x denote the height and width of the convolutional kernel, b[n] is the bias added to output feature map n, and f represents the activation function, which could be sigmoid, tanh, or ReLU.
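For reference, the layer computation just defined can be written out directly. The NumPy sketch below is only a functional model under simple assumptions (unit stride, no padding, ReLU as the activation f); the array names and shapes are illustrative and are not the data layout used by the hardware described later.

```python
import numpy as np

def conv_layer(inp, weights, bias, f=lambda s: np.maximum(s, 0.0)):
    """Direct evaluation of the convolutional-layer equation above.

    inp:     (N_i, H, W)             input feature maps
    weights: (N_out, N_i, K_y, K_x)  one 3-D kernel per output map
    bias:    (N_out,)                one bias per output feature map
    f:       activation function (ReLU assumed here)
    """
    n_out, n_i, k_y, k_x = weights.shape
    _, h, w = inp.shape
    out = np.zeros((n_out, h - k_y + 1, w - k_x + 1))
    for n in range(n_out):
        for x in range(h - k_y + 1):
            for y in range(w - k_x + 1):
                s = np.sum(weights[n] * inp[:, x:x + k_y, y:y + k_x])
                out[n, x, y] = f(s + bias[n])
    return out
```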
B. Pooling Layers

Pooling layers aim at simplifying and smoothing the preceding feature maps. Usually, a pooling layer follows a convolutional layer. Two types of pooling operations are commonly employed: average pooling and max pooling. Average pooling computes the mean of a small local field in each input feature map, while max pooling picks the maximum value of the local field to output a pixel. For a pooling layer, the number of output feature maps equals that of the input feature maps.

C. Fully Connected Layers

Fully connected layers are placed at the end of a CNN design and perform as classifiers. They are usually flattened to one dimension and act in the form of regular neural networks, where every neuron has full connections to all neurons in the previous layer.

D. Implementation Model

The VGG model [19], proposed by Simonyan and Zisserman, won second place in the image classification task of ILSVRC 2014. VGG models are a series of networks with deep layers that use very small (3 × 3) convolutional kernels. It was shown that a significant improvement on classification can be achieved by pushing the depth to 16-19 layers. In the general case, VGG models consist of multiple convolutional layers, 4 max-pooling layers, and 3 fully connected layers.

Only 3 × 3 kernels are employed in the VGG model, so that fewer weights are needed and higher accuracy is obtained. The VGG model branches out into VGG11, VGG16, and VGG19 depending on the total number of layers. In this paper, the VGG16 model is considered due to its relative balance between complexity and accuracy.

III. FAST CONVOLUTION ALGORITHM

A. Parallel Fast FIR Algorithm

An FIR filter with N taps can be expressed as:

y(n) = h(n) * x(n) = \sum_{i=0}^{N-1} h(i) x(n-i),  n = 0, 1, 2, \cdots, \infty,  (1)

where x(n) is an infinite input sequence and h(n) contains the coefficients of an N-length FIR filter. The convolution shown in Eq. (1) can be expressed in the z domain:

Y(z) = H(z) X(z) = \left( \sum_{n=0}^{N-1} h(n) z^{-n} \right) \left( \sum_{n=0}^{\infty} x(n) z^{-n} \right).  (2)

If the input sequence x(n) is decomposed into even and odd indexed samples, we have X(z) = X_0 + z^{-1} X_1. Similarly, H(z) = H_0 + z^{-1} H_1 and the output:

Y(z) = Y_0 + z^{-1} Y_1 = (X_0 + z^{-1} X_1)(H_0 + z^{-1} H_1),  (3)

where

Y_0 = H_0 X_0 + z^{-2} H_1 X_1,
Y_1 = H_1 X_0 + H_0 X_1.  (4)

Based on the fast FIR algorithm (FFA) [20], Y_0 and Y_1 can be expressed in the form of the 2-parallel FFA,

Y_0 = H_0 X_0 + z^{-2} H_1 X_1,
Y_1 = (H_0 + H_1)(X_0 + X_1) - H_0 X_0 - H_1 X_1.  (5)

Similarly, the input sequence x(n) and tap coefficients h(n) can be decomposed into three parts as follows:

Y = Y_0 + z^{-1} Y_1 + z^{-2} Y_2
  = (X_0 + z^{-1} X_1 + z^{-2} X_2)(H_0 + z^{-1} H_1 + z^{-2} H_2)
  = (X_0 + z^{-1} V)(H_0 + z^{-1} W),  (6)

where V = X_1 + z^{-1} X_2 and W = H_1 + z^{-1} H_2. The last equation of Eq. (6) and its intermediate term VW are both in the 2-parallel FIR filter form and can be computed by employing Eq. (5) recursively [20]. Thus, the 3-parallel FFA is computed as follows:

Y_0 = H_0 X_0 - z^{-3} H_2 X_2 + z^{-3} [(H_1 + H_2)(X_1 + X_2) - H_1 X_1],
Y_1 = [(H_0 + H_1)(X_0 + X_1) - H_1 X_1] - [H_0 X_0 - z^{-3} H_2 X_2],
Y_2 = [(H_0 + H_1 + H_2)(X_0 + X_1 + X_2)] - [(H_0 + H_1)(X_0 + X_1) - H_1 X_1] - [(H_1 + H_2)(X_1 + X_2) - H_1 X_1].  (7)

For parallel FFAs of larger prime sizes (e.g., 5- and 7-parallel FFAs), we have already derived their expressions, but do not include them in this paper because the derivation process is relatively complex compared with the 3-parallel FFA. It should be noted that large kernels (e.g., 5 × 5 and 7 × 7) are indeed employed in some CNN models such as [7] and [21]. Parallel FFAs whose sizes are composite numbers can be obtained by cascading short-term FFAs [20], [22].

Parallel FFAs are essentially parallel computing algorithms. A large-size filter is decomposed into several small sub-filters, each of which performs short convolutions. For a filter of size P, where P is a prime number such as 3 or 5, the filter can be decomposed into P sub-filters. For the presented parallel FFAs, the convolution of each sub-filter becomes a multiplication. These partial convolutional results are added in specific combinations to compute several output results in parallel.

Reference [23] introduced Winograd's minimal filtering algorithms to minimize multiplications in small-size FIR filters. However, more additions and complex pre-computations are required, which is unfriendly to hardware implementation. Moreover, it is difficult to use Winograd's minimal filtering algorithms to derive the expressions of large-tap filters with fewer additions.
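Equation (7) can be checked numerically. The sketch below implements a 3-tap FIR filter block by block with exactly the six products per block implied by the 3-parallel FFA (H0X0, H1X1, H2X2 and the three pre-added products), assuming zero initial state and an input length that is a multiple of three, and compares the result against a direct convolution.

```python
import numpy as np

def fir3_parallel_ffa(x, h):
    """3-tap FIR filter computed blockwise with the 3-parallel FFA of Eq. (7).

    Each block of 3 outputs costs 6 multiplications (p1..p6) instead of 9.
    """
    h0, h1, h2 = h
    x = np.asarray(x, dtype=float)
    assert len(x) % 3 == 0, "pad the input to a multiple of 3 for this sketch"
    y = np.zeros_like(x)
    # products carried over from the previous block (zero initial state)
    p2_prev = p3_prev = p5_prev = 0.0
    for m in range(len(x) // 3):
        x0, x1, x2 = x[3 * m], x[3 * m + 1], x[3 * m + 2]
        p1 = h0 * x0                              # H0 X0
        p2 = h1 * x1                              # H1 X1
        p3 = h2 * x2                              # H2 X2
        p4 = (h0 + h1) * (x0 + x1)                # (H0+H1)(X0+X1)
        p5 = (h1 + h2) * (x1 + x2)                # (H1+H2)(X1+X2)
        p6 = (h0 + h1 + h2) * (x0 + x1 + x2)      # (H0+H1+H2)(X0+X1+X2)
        y[3 * m]     = p1 - p3_prev + (p5_prev - p2_prev)
        y[3 * m + 1] = (p4 - p2) - p1 + p3_prev
        y[3 * m + 2] = p6 - (p4 - p2) - (p5 - p2)
        p2_prev, p3_prev, p5_prev = p2, p3, p5
    return y

x = np.random.randn(12)
h = np.random.randn(3)
assert np.allclose(fir3_parallel_ffa(x, h), np.convolve(x, h)[:len(x)])
```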
TABLE I
COMPARISONS OF ARITHMETIC OPERATIONS

Fig. 2. Architecture of the 3-parallel FCU.

B. Fast Convolution Unit

For convolution with a k × k kernel in a CNN, the 2D convolution is decomposed into k 1D convolutions, and each 1D convolution can be implemented by a k-tap FIR filter whose tap coefficients are the corresponding kernel weights. Based on the parallel FFA, a Fast Convolution Unit (FCU) is proposed to perform the convolution in a CNN.

Take the 3 × 3 kernel as an example. The 3-parallel FCU architecture is shown in Fig. 2, where the coefficients h_0, h_1, h_2 denote the kernel weights, and x_0, x_1, x_2 and y_0, y_1, y_2 are inputs and outputs, respectively. They all belong to one row or column in a 2D field.

It is noted that the kernel weights must be in reverse order in each FCU, since in convolution the tap coefficients are reversed before the element-wise multiplications. For example, if the kernel weights in one row of a 3 × 3 kernel are in the order k_0, k_1, k_2, then in the 3-parallel FCU, h_0 = k_2, h_1 = k_1, h_2 = k_0.

In order to finish a 2D convolution, k identical FCUs are employed and the outputs of all FCUs are added together. For example, when k = 3, three 3-parallel FCUs are needed to perform the 2D convolution. For j = 0, 1, 2, let y_{j,0}, y_{j,1}, y_{j,2} denote the outputs of the j-th FCU. The m-th output of the 2D convolution is y_{0,m} + y_{1,m} + y_{2,m} for m = 0, 1, 2. The 2D convolution with three 3-parallel FCUs is illustrated in Fig. 3.

Fig. 3. Illustration of convolution with three 3-parallel FCUs.

From the 3-parallel FFA and 3-parallel FCU, it is estimated that 33% of the multiplications can be saved with some extra additions.

For large convolutions such as 5 × 5 and 7 × 7, based on the 5- and 7-parallel FFAs, the corresponding 5- and 7-parallel FCUs are designed and employed to perform the corresponding convolutions. The comparisons of arithmetic operations are listed in Table I. Because convolutions with FCUs are feedforward designs, the pre-additions and post-additions can be pipelined to increase the clock frequency.
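The row-wise decomposition of Section III-B can be mirrored in software. The sketch below computes a "valid" 2-D convolution of a single feature map by filtering k rows with k 1-D FIRs and summing them, in the spirit of Fig. 3; it is a functional illustration with assumed array shapes, not the FCU datapath.

```python
import numpy as np

def conv2d_via_row_firs(img, kernel):
    """'Valid' 2-D CNN convolution of one feature map, decomposed into k row FIRs.

    Row j of the k x k kernel acts as the taps of the j-th 1-D FIR (FCU); the k
    filtered rows are then summed, mirroring the FCU-array structure of Fig. 3.
    np.convolve implements true convolution, so each kernel row is reversed here
    to realise the element-wise (correlation-style) CNN convolution, which is the
    tap-reversal point (h0 = k2, ...) noted in Section III-B.
    """
    k = kernel.shape[0]
    h_out = img.shape[0] - k + 1
    w_out = img.shape[1] - k + 1
    out = np.zeros((h_out, w_out))
    for r in range(h_out):
        for j in range(k):                       # one FCU per kernel row
            taps = kernel[j, ::-1]               # reversed tap order
            out[r] += np.convolve(img[r + j], taps, mode='valid')
    return out

img = np.random.randn(8, 8)
ker = np.random.randn(3, 3)
# cross-check against a direct sliding-window evaluation
ref = np.array([[np.sum(img[r:r+3, c:c+3] * ker) for c in range(6)] for r in range(6)])
assert np.allclose(conv2d_via_row_firs(img, ker), ref)
```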
IV. MEMORY EFFICIENT COMPUTATION FLOW

A. Inter-Layer Partial Storage and Intra-Layer Ping-Pong Reuse Scheme

Due to limited on-chip memory resources, tiling is commonly used in most hardware implementations of CNNs. Row (or column) tiling [24] and channel tiling [13] are often applied. Tiling is essentially a split of block data from large chunks into small ones that can be stored on chip. However, data tiling does not avoid intermediate data exchanges between internal and external memories in previous architectures. The data transmission latency is large and the throughput is constrained by the interface bandwidth. In this paper, we propose a novel inter-layer partial storage and intra-layer ping-pong reuse scheme. This scheme makes the best use of the on-chip memory resources and does not need to send intermediate data to external memories such as DRAM.

Each input feature map of a convolutional layer is tiled by a factor Tr along the rows, and these Tr rows of pixel data are stored in a dedicated on-chip block RAM. This includes the raw input picture layer, which contains three feature maps of RGB channels whose tiled Tr rows of data are stored in the raw picture RAM. The convolutional kernel size is k × k and Tr ≥ k. From the calculation, Tr − k + 1 rows of pixel data will be computed, assuming the stride is 1. Each of these block RAMs has two segments, both of which store Tr/2 rows of pixel data. One segment buffers the freshly calculated Tr − k + 1 rows of data, and the other stores the previous Tr − k + 1 rows. These two segments are utilized in a ping-pong manner, and Tr is calculated as follows:

2(Tr − k + 1) = Tr,
Tr = 2k − 2.  (8)

Reference [25] proposed an efficient dataflow across convolutional layers, which employed a pyramid-shaped multi-layer sliding window to enable effective on-chip caching during CNN evaluation. However, it was only tested on fusing the first five convolutional layers of the VGGNet-E network, and the on-chip memory requirement grows exponentially with their method as the network goes deeper.

To further reuse the on-chip memories, the block RAM sizes of some convolutional layers should be the same and the number of rows stored in such layers identical.
Fig. 4. An illustration of data inter-layer partial storage and intra-layer ping-pong reuse.

Not all block RAMs need to store Tr rows of the corresponding layers. Once a convolutional layer arranged immediately before a pooling layer is computed, there is no need to store the previous rows of that layer. Such a convolutional layer is only allocated half the block RAM size compared to other convolutional layers, and its block RAM stores just Tr/2 rows of pixel data.

Before describing the computation flow, we define three status signals to represent the storage conditions of the block RAMs that store Tr rows of data. The three status signals are "empty", "half", and "full". Each of these block RAMs has two segments, both of which have a label signal that can be "1" or "0". The meanings of these signals are as follows:
• "1" indicates that a segment has stored computed pixel data which will participate in further convolutions.
• "0" indicates that a segment has not received any computed pixel data yet, or that the segment stores data which has already been used for computations twice and can be overwritten.
• "empty" indicates that the two segments of the block RAM are labelled "00" and pixel data can be written to either of the segments.
• "half" means the two segments are labelled "10" or "01". The corresponding block RAM has stored Tr/2 rows of pixel data and only the segment labelled "0" can store newly computed pixel results.
• "full" indicates that the two segments of the block RAM are labelled "11", i.e., the RAM has already stored Tr rows of data of a convolutional layer and the pixel data can be sent to the convolution operations immediately.

The computation flow from layer to layer is described as follows, and Fig. 4 shows an example of the data inter-layer partial storage and intra-layer ping-pong reuse scheme.

Step 1: Load raw input picture data to fill up the raw picture RAM and set its status signal to "full". Set the signals of all other block RAMs to "empty".

Step 2: Once the status signal of an input block RAM C_i is "full", the stored pixel data is sent to the computation logic. The output Tr/2 rows of data are then written to a segment labelled "0" of the output block RAM C_{i+1}. While the computations are being processed, the raw picture RAM loads the next Tr/2 rows of the raw input pictures if its status signal is not "full". Meanwhile, the Weight Buffer (WB) receives the weights for the next convolutions from external memory.

Step 3: When the convolutions are completed, the segment label of the output block RAM C_{i+1} turns from "0" to "1". The signal of block RAM C_{i+1} is set to "full" or "half" depending on whether its other segment is labelled "1" or "0". Meanwhile, in the input block RAM C_i, the label of the segment whose row data has now been computed twice is set to "0". Then return to Step 2.

Step 4: Before resuming the convolutions, if the status signals of the raw picture RAM and an input block RAM are both "full", the input block RAM is given priority for the convolution operations.
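The sizing rule of Eq. (8) and the segment labelling described above can be pictured with a small toy model. The class below is a deliberately simplified, assumption-laden sketch of one per-layer block RAM: it only tracks segment labels and the empty/half/full status, not addresses or real row data.

```python
def tiling_factor(k):
    """Row-tiling factor from Eq. (8): 2(Tr - k + 1) = Tr  =>  Tr = 2k - 2."""
    return 2 * k - 2

class PingPongRAM:
    """Toy model of one per-layer block RAM with two Tr/2-row segments.

    The status is 'empty' (labels 00), 'half' (10/01) or 'full' (11);
    a segment keeps its rows until they have been used by two overlapping
    computations, after which it may be overwritten.
    """
    def __init__(self, k):
        self.tr = tiling_factor(k)
        self.segments = []                 # FIFO of segments labelled '1'

    def status(self):
        return {0: "empty", 1: "half", 2: "full"}[len(self.segments)]

    def write_half(self, rows):            # Tr/2 newly computed rows arrive
        assert len(self.segments) < 2, "no segment labelled '0' is free"
        self.segments.append(rows)

    def consume(self):                     # Tr rows are fed to the FCU arrays;
        assert self.status() == "full"     # the older half has now been used
        return self.segments.pop(0)        # twice and is released ('1' -> '0')

ram = PingPongRAM(k=3)                     # VGG16: k = 3, so Tr = 4
ram.write_half(rows=(0, 1)); print(ram.status())   # half
ram.write_half(rows=(2, 3)); print(ram.status())   # full -> ready to convolve
ram.consume();               print(ram.status())   # rows 0-1 freed: half again
```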
Fig. 5. Data flow of VGG16 with the partial inter-layer storage architecture.

For the VGG16 model, the convolutional kernel size is 3 × 3 and Tr = 4. Fig. 5 shows the data flow of the VGG16 model with the proposed inter-layer partial storage architecture. It is shown that all layers are allocated a corresponding block of attached memories.

Fully Connected (FC) layers are usually arranged after convolutional layers and mainly consist of matrix multiplications. With the proposed inter-layer partial storage and intra-layer ping-pong reuse scheme, the matrix multiplications of the FC layers are split into sub-matrix multiplications. In Fig. 5, for the VGG16 example, the Pool5 layer is connected to an FC layer, which is split from 7 × 7 × 512 × 4096 into 7 × 512 × 4096. While the sub-matrix multiplications are being processed, the previous operations, e.g., convolution and pooling, can be performed simultaneously.

B. Further On-Chip Memory Reuse

The memory reuse can be further exploited even though we have already taken advantage of the on-chip memories. The memory reuse scheme can be arranged in two aspects: 1) the full reuse of RAMs belonging to convolutional layers which are arranged adjacently before pooling layers; 2) the partial reuse of RAMs belonging to layers with the same storage design. These two further memory reuse schemes require the involved pixels to be quantized to the same number of bits. Therefore, in the ENQ method, the pixel data of these layers should be consistently extended to the maximum bit width among them.

The data flow of the RAM partial reuse scheme is roughly presented in Fig. 6. r_s^{(i)} denotes a segment of the i-th RAM storing Tr/2 rows of pixel data. The subscript s stands for the segment label, which is either "0" or "1". The heuristic idea is that the computed Tr/2 rows of data of layer i can be stored in a segment of RAM j instead of RAM i. Therefore, block RAMs can be reused only if they have the same storage design. Once two segments of one or two block RAMs are labelled "1" and both segments store the row data of the same layer, the computation is triggered and the output Tr/2 rows of pixel data are written to a segment labelled "0". When the computation is completed, one of the segments whose stored pixel data has been computed twice is labelled "0" and can be overwritten in further computations.

Fig. 6. The data flow of further on-chip memory reuse.

For the VGG16 model, the RAMs of the corresponding layers that can be reused in the two different memory reuse schemes are marked in Fig. 7. The fully reused block RAMs are linked by solid arrows. These convolutional layers share the same block RAM, the size of which is determined by the largest one. By coincidence, in VGG16 these block RAMs are of the same size except for the last one. The partial reuse among block RAMs is linked by dashed arrows. On the basis of the inter-layer partial storage and intra-layer ping-pong reuse scheme for the VGG16 implementation, the further on-chip memory reuse method can reduce about 20% of the block memories without any overhead.

Fig. 7. Further reuse of on-chip memories for the architecture of the VGG16 data flow.

For the VGG16 implementation, in order to avoid the transfer of all intermediate pixel data, a regular layer-wise implementation would have to store 6.4M feature map pixels on chip. With the memory-efficient computation flow, the required number of feature map pixels stored on chip is reduced to 463.5k. Our design reduces the memory requirement by 14 times, and the intermediate pixels are no longer required to be sent to off-chip DRAM.

Our proposed inter-layer partial storage and intra-layer ping-pong reuse scheme is especially hardware friendly to other CNNs with residual blocks [1]. The residual block function can be roughly expressed as H(x) = F(x) + x, where F(x) is the residual block and consists of several convolutional layers. The inputs and outputs of residual blocks are added together, which requires that the input feature maps be preserved in DRAM and read out later in the regular layer-wise process. However, with the proposed method, we store partial row data of all layers in on-chip RAMs and the row data can be directly fetched and added to the outputs of the residual blocks.

V. QUANTIZATION METHOD

In CNN implementations, quantization is necessary to reduce the computation and storage bit widths while preserving the recognition accuracy. Previous works focus on the quantization of kernel weights, and the quantization of pixel data is rarely discussed. The equal distance non-uniform quantization (ENQ) method is proposed in this section.

It was shown in [26] that the distributions of the numerical values of both pixels and weights in most layers are roughly Gaussian. For the VGG16 model, the distribution of pixel values in each convolutional layer is Gaussian-like as well [27].

A. Quantization Flow

The quantization method maps pixel values to a set of quantization points (QPs), and each QP corresponds to a fixed-point value. Let {P_0, P_1, ..., P_{N-1}} denote a set of QPs with N elements and V_i denote the corresponding fixed-point value associated with P_i (i = 0, 1, ..., N − 1). A pixel value s is quantized to a QP:

s → P_k,  k = 0, 1, ..., N − 1.  (9)

It requires K = log_2 N bits to store a quantized pixel value.

From [27], it can be concluded that most pixel values in a layer are relatively small. It is therefore reasonable to assign more QPs to represent the smaller pixels without incurring an obvious degradation of accuracy. The proposed quantization scheme is described as follows:
• First, uniform quantization is performed, i.e., each pixel is quantized to q bits (2^q QPs in total) and each weight is quantized to q_w bits. All pixel values are non-negative because we store pixels after the ReLU operation. Let F represent the number of fractional bits in the q-bit uniform quantization. Therefore, V_i = i 2^{-F}, i = 0, 1, ..., 2^q − 1. Let A and A' represent the accuracy with floating-point and q-bit uniform quantization, respectively. The minimal value of q (denoted as q') is calculated so that A − A' ≤ δ,
where δ is a small positive number determined by the corresponding data set and application.
• For the pixels within layer i, the proposed ENQ employs an E_i-bit non-uniform quantization scheme, where E_i ≤ q_m and q_m is the number of uniform quantization bits obtained in the previous step. For the proposed E_i-bit ENQ scheme, there are 2^{E_i} QPs, P_{i,0}, P_{i,1}, ..., P_{i,2^{E_i}-1}, where P_{i,k} corresponds to the fixed-point value V_{i,k} = k 2^{q_m - E_i} 2^{-F}. Here, k = 0, 1, ..., 2^{E_i} − 1. Supposing the magnitude of a pixel is x, the ENQ scheme quantizes it to x / 2^{q_m - E_i} if x ≤ 2^{q_m} − 1; otherwise, x is quantized to 2^{E_i} − 1. Note that each pixel value within layer i is stored using E_i bits. When a pixel is required to participate in the convolutions of the next layer, it is converted back to q_m bits based on the relationship between the QPs and the fixed-point values. In this work, an exhaustive search is employed to find the minimal E_i for each layer i such that the resulting accuracy is close to A'.
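A minimal sketch of the ENQ mapping just described, assuming the pixels have already been uniformly quantized to q_m bits and that the division in the quantization rule is realized as a floor (a power-of-two right shift); the parameter values are examples, not the E_i chosen for any particular VGG16 layer.

```python
import numpy as np

def enq_quantize(idx, q_m=12, e_i=6):
    """ENQ sketch: map q_m-bit uniform indices to E_i-bit codes (one layer).

    `idx` holds the non-negative integer indices i of the uniform QPs
    (V_i = i * 2**-F); q_m and e_i are example values, and using floor
    division for the mapping is an assumption, since the text only states
    that x maps to x / 2**(q_m - E_i) when x <= 2**q_m - 1.
    """
    idx = np.asarray(idx)
    step = 2 ** (q_m - e_i)
    return np.where(idx <= 2 ** q_m - 1, idx // step, 2 ** e_i - 1).astype(int)

def enq_to_fixed_point(code, q_m=12, e_i=6, frac_bits=8):
    """Expand a stored E_i-bit code back to its q_m-bit value V_{i,k}."""
    return code * 2 ** (q_m - e_i) * 2.0 ** (-frac_bits)

codes = enq_quantize([0, 7, 200, 4095, 6000])
print(codes)                        # [ 0  0  3 63 63]
print(enq_to_fixed_point(codes))    # corresponding fixed-point magnitudes
```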
TABLE II
ENQ RESULT OF VGG16 MODEL

TABLE III
UNIFORM QUANTIZATION OF VGG16

B. Numerical Results

We first perform the uniform quantization of the VGG16 model, with all weight parameters quantized to 8 bits in the uniform quantization step. As shown in Table III, the accuracy of the full-precision model is 88.5%. The same bit width is used for all layers without fine-tuning, and 12 bits are enough for the uniform quantization of all layers.

The numbers of quantization bits generated by the ENQ scheme are shown in Table II, which includes several combinations of E_i and the corresponding accuracy. We adopt the last indexed quantization scheme in Table II, with a top-5 accuracy of 86.25% and q_m of 12 bits.

In our experimental case of the VGG16 model, on-chip memories are further saved by 54.3% when the inter-layer partial storage and intra-layer ping-pong reuse scheme is employed. If the further on-chip memory reuse method is utilized, on-chip memories can be further saved by 47.0%. In both cases all other conditions are unchanged except that ENQ is employed instead of the uniform quantization method.

VI. EFFICIENT HARDWARE ARCHITECTURE FOR DCNN

For a DCNN, each convolutional layer contains multiple feature maps, and convolutions occupy more than 90% of the overall computation time. In order to improve the performance, the degree of computing parallelism should be increased as much as possible. However, it takes careful consideration to decide how to process in parallel. When it comes to parallel processing, data reuse should be employed to reduce the number of redundant memory reads and writes. Both feature maps and parameters can be reused, but in different ways depending on how the computations are performed in parallel. In previous hardware architectures for CNNs, two common methods are applied to achieve parallel data processing. One method is data processing along one dimension, i.e., along the length or width of a feature map, in which case the parameters are reused. The other is data processing along the channels of a layer, in which case the feature maps are reused.

In this paper, both of these methods are intelligently combined to accelerate the DCNN computations and improve the throughput. Moreover, due to the intrinsic pipeline property of the memory-efficient computation flow introduced in Section IV, we introduce a third parallelism method for the first time: inter-layer parallel processing.

A. Parallel Processing Along Row & Column

With the proposed FCU architecture, the convolution patterns are quite different from the regular ones. Pixel data in each row or column is reorganized according to connection and interdependence, which are explained as follows:
• Connection means that k pixels are packaged and sent to a k-parallel FCU at a time, and k output pixels are calculated. In the regular k-tap convolution form, k pixels are multiplied with the corresponding coefficients and added together to obtain one output pixel. Here, the size of the convolutional kernel is k × k.
• Interdependence means that once the former package of k pixels in a row of a feature map has been consumed, the adjacent k pixels must follow into the same FCU immediately, and the computing process repeats until all pixels of the row are processed.

By organizing k × (k − 1) FCUs into a computing array, we construct a processing unit (PU) as a basic structure to perform the convolutions. This architecture is illustrated in Fig. 8, where the data reuse is achieved in the following two ways:
• The kernel weight W_i is shared among the k − 1 FCUs in the i-th row of the computing array of the PU, for i = 0, 1, ..., k − 1.
• The 2k − 2 rows of pixel data shown in Fig. 8 are partially shared among certain FCUs. The results of all k FCUs in each array column are summed together to obtain the intermediate data of one row of an output feature map.

Fig. 8. Processing Unit (PU) composed of FCU arrays.

In the VGG16 model, a PU processes pixel data along the rows of a feature map and contains 3 × 2 = 6 3-parallel FCUs. Each input and output of an FCU is a package of 3 pixels. Thus the 3-parallel PU can process 12 pixels and output 6 pixels in parallel.

The computing process of a 3-parallel PU is shown in Fig. 9. With each proposed PU, 12 pixels (4 × 3) from four adjacent rows of an input feature map are used in parallel. Therefore, four rows of an input feature map can be processed and two output rows are obtained in parallel at the same time.

Fig. 9. The computing process of 3 × 3 convolutions realized by a 3-parallel PU.

B. Parallel Processing Along Channel

The parallel processing along the channel dimension is a multiple-to-multiple pattern over feature maps. To improve the throughput of the design, the degree of computing parallelism should be exploited as much as possible. The multiple-to-multiple pattern employed in our processor is described in Fig. 10. The diagram shows that multiple input feature maps are processed in parallel and multiple output feature maps are obtained.

Fig. 10. The multiple-to-multiple pattern of parallel processing along channel.

To achieve the goal of multiple-to-multiple pattern processing, an m-to-m processor which contains all the computation logic and storage resources is proposed. The m-to-m processor contains Complex Processors (CPs), Pixel RAMs (PRAMs), Attached RAMs (ARAMs), Weight Buffers (WBs), and DRAM.

CPs are the cores of the m-to-m processor. All computation logic is integrated in the CPs, and data communication is most frequent between the memories and the CPs. A CP is composed of a Bit-width Converter (BC), Processing Units (PUs), an Adder Compressor, a ReLU module, and a Max Pooling unit. The detailed architecture of a CP is shown in Fig. 11 and described as follows:

Fig. 11. The architecture of a Complex Processor (CP).

• Bit-width Converters (BCs) convert the bit width of pixel data between computation and storage, as explained in the ENQ method.
• Processing Units (PUs) are FCU arrays, and each PU performs the 2D convolution of one input feature map. T PUs are employed in a CP, and partial rows of T input feature maps are processed in parallel. In an m-to-m processor, P × T PUs are arranged in total, and for the efficient usage of hardware resources T equals P.
• The Adder Compressor is employed to implement arithmetic and digital signal processing (DSP) circuits for low power and high performance applications [28]. In order to reduce the critical path, we employ 4-2 compressors instead of adder trees. The 4-2 compressor has 5 inputs, A, B, C, D, and Cin, and generates 3 outputs, Sum, Carry, and Cout, as shown in Fig. 12(a). The input Cin is the output from a previous lower compressor and the
Cout output is for the compressor in the next stage. The regular approach to implementing a 4-2 compressor is with two full adders connected serially; the enhanced version in [28] is employed in our design. This enhanced compressor utilizes the XOR-XNOR module and the transmission-gate version of the MUX module, as shown in Fig. 12(b). If the sums of the Adder Compressor are the final results of the feature maps, they are sent to the ReLU modules. Otherwise, they are sent to the ARAMs as intermediate results.

Fig. 12. (a) 4-2 adder compressor. (b) Enhanced architecture with the XOR-XNOR and MUX modules of a 4-2 compressor.

• ReLU modules receive the biases from the WBs and add them to the output sums of the Adder Compressors. They then act as max selectors between zero and the corresponding sums and send the final results to the PRAMs.
• Max Pooling units perform the 2 × 2 pooling function on pixel data from the ARAMs and send the results to the corresponding PRAMs.

The overall architecture of the m-to-m processor is shown in Fig. 13. Each component is described as follows:

Fig. 13. Overall architecture of the m-to-m processor.

• Complex Processors (CPs) are the main computation units employed to realize the multiple-to-multiple pattern of parallel processing. Suppose P CPs are employed; each CP outputs the partial rows of one output feature map. Apart from convolutions, bit-width conversion, ReLU, and max pooling are also performed by a CP. CPs exchange pixel and weight data with the Pixel RAMs, Attached RAMs, and Weight Buffers.
• Pixel RAMs (PRAMs) are the combinations of block RAMs which store the partial rows of pixel data of all layers. PRAMs exchange pixel data with the CPs as described above. Only PRAM_in needs to receive the raw input picture data from off-chip DRAM; the other PRAMs have no data access to external DRAM, so they are highly efficient due to their low latency.
• Attached RAMs (ARAMs) perform a similar function to PRAMs, but they buffer intermediate results and send pixel data to the max pooling module. They do not need data access to DRAM either.
• Weight Buffers (WBs) are the channels which send kernel weights and biases from off-chip DRAM to the CPs. WBs are used because there are too many weights for the on-chip memory to store them all at once. While the current step of convolution is being performed, the kernel weights required for the following convolutions are sent to the weight buffers in the meantime.
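To make the multiple-to-multiple pattern concrete, the functional model below shows how one computing phase could be organized: each of the P CPs accumulates one output feature map while its T = P PUs convolve partial rows of P input feature maps (reusing the conv2d_via_row_firs helper from the Section III sketch). The value of P, the array shapes, and the function names are illustrative assumptions, not the RTL.

```python
import numpy as np

def compute_phase(in_rows, kernels, P):
    """One computing phase of the m-to-m processor (functional model only).

    in_rows: (P, Tr, W)      partial rows of P input feature maps
    kernels: (P, P, k, k)    kernels[o][i] maps input map i to output map o
    Returns  (P, Tr-k+1, W-k+1) partial rows of P output feature maps.
    """
    k = kernels.shape[-1]
    tr, w = in_rows.shape[1], in_rows.shape[2]
    out = np.zeros((P, tr - k + 1, w - k + 1))
    for o in range(P):                       # one CP per output feature map
        for i in range(P):                   # one PU per input feature map
            # each PU performs the 2-D convolution of one input map; the
            # adder compressor then accumulates the P partial results
            out[o] += conv2d_via_row_firs(in_rows[i], kernels[o, i])
    return out

partial = compute_phase(np.random.randn(4, 4, 16), np.random.randn(4, 4, 3, 3), P=4)
print(partial.shape)                         # (4, 2, 14): two output rows per map
```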

C. Inter-Layer Parallel Processing

Since the computation dataflow of our architecture is quite different from the traditional layer-wise one, inter-layer parallel processing can be conducted. In the traditional feedforward layer-wise dataflow, the computation of a convolutional layer cannot start until its previous convolutional layer is completely processed; the previous layer can then be dropped. The computation is therefore restricted to two adjacent convolutional layers, and the processing of a convolutional layer must wait until all the previous layers are processed.

With our architecture, the convolutions are not constrained to two layers. They are split into segments that extend from the first layer to the last layer, so computations of different layers can be undertaken at the same time.

For instance, we realize the inter-layer parallel processing of the data loading of the input layer, the computations of the convolutional and fully connected layers, and the max pooling of the pooling layers. The detailed parallel process is explained in the following steps:
• Once the Tr rows of the input raw picture layer have been computed, Tr/2 new rows of pixel data begin to be loaded from the off-chip DRAM and overwrite the segment labelled "0". Therefore the loading time of the input raw picture layer is saved.
• When the calculation of a pooling layer begins, the convolution operations of its former layers, including the input raw picture layer, can be triggered simultaneously. Under usual circumstances the convolution time is much longer than that of pooling, so the pooling time can be saved.
• When the pooling operation of the last pooling layer is completed, only one row is output; in the VGG16 model, for instance, it is 1 × 7 × 512. These partial final results are fed to the fully connected layer. The operation of a fully connected layer is essentially matrix multiplication. The whole matrix multiplication is divided into several partial multiplications and additions (MACs). Each part can be calculated simultaneously with other layer operations such as convolutions and data loading. Due to the long interval time between two partial rows of the last pooling
layer, the computation time of the fully connected layer can be saved.

We combine the above three parallel processes together to obtain the maximum degree of parallelism and improve the throughput of our design considerably.

The critical path is restricted by the convolution operations in our architecture. Due to the feedforward hardware design of the convolutional layer, multiple pipeline stages can be inserted to reduce the critical path. In our case, a three-stage pipeline is employed in the design.

VII. EFFICIENCY ANALYSIS

A. Computing Efficiency Analysis

The massive computations are concentrated in the convolutional layers, and the performance of the proposed system is evaluated in this section.

For a CONV layer, the required number of computation cycles of layer i with the proposed architecture can be calculated by the following formula:

N_{cycles}^{CONV_i} = \lceil L_{in}^i / k \rceil \times \lceil n_{in}^i / P \rceil \times \lceil n_{out}^i / P \rceil,  i = 1, 2, \ldots,  (10)

where P denotes the degree of parallelism, k is the kernel size, L_{in}^i represents the length of the input feature map, and n_{in}^i and n_{out}^i are the numbers of feature maps (channels) of the input and output layers, respectively.

The above equation covers the computation from Tr rows of the input layer to Tr/2 rows of the output layer. The total number of computation cycles from a whole input layer to an output layer is shown in the following equation:

N_{cycles\_total}^{CONV_i} = N_{cycles}^{CONV_i} \times \lceil 2 W_{in}^i / Tr \rceil,  (11)

where Tr denotes the row-tiling factor, and W_{in}^i and W_{out}^i represent the widths of the input feature map and output feature map, respectively.

Since pixels are stored in the PRAMs, no pixel needs to be sent to off-chip DRAM, so there is no loading and unloading time for feature maps.

Each feature map of an input convolutional layer is reused P times. It is emphasized that the same reused input feature map corresponds to different kernels for different output feature maps. In our architecture, each PU is fed with k × k weights and we employ P CPs, each of which contains P PUs, so the total number of weights fed to the overall architecture is k^2 P^2 per computing phase (a computing phase is defined as the computing process of P feature maps). The kernel weights do not need to be updated until a computing phase is completed, so the minimum number of clock cycles during which the same batch of kernel weights stays unchanged is min{\lceil L_{in}^i / k \rceil}. The clock frequency is set to f and each weight is quantized to Q bits. The minimum refresh frequency of the kernel weights is f / min{\lceil L_{in}^i / k \rceil}, and the minimum bandwidth to transmit kernel weights from external DRAM is calculated in the following equation:

BW = \frac{k^2 P^2 Q f}{min\{\lceil L_{in}^i / k \rceil\}}.  (12)

For a pooling layer, the number of computation clock cycles is much less than that of the convolutional layers. In the 2 × 2 max pooling pattern, each input feature map is cut down to a quarter of its original size. The computation clock cycles for Tr/2 rows of pooling layer j are calculated as follows:

N_{cycles}^{pooling_j} = \lceil L_{in}^j / k \rceil \times \lceil n_{in}^j / P \rceil,  (13)

where P denotes the degree of parallelism, k is the kernel size, L_{in}^j represents the length of the input feature map, and n_{in}^j is the number of feature maps of the input layer. We divide L_{in}^j by k because every k pixels in a row are packaged and stored, so k pixels are read out every cycle.

Moreover, the computation cycles of a whole layer are:

N_{cycles\_total}^{pooling_j} = N_{cycles}^{pooling_j} \times \lceil 2 W_{in}^j / Tr \rceil,  j = 3, 6, \ldots,  (14)

where Tr denotes the row-tiling factor and W_{in}^j represents the width of the input feature map.

Comparing Eq. (10) with Eq. (13), their first two terms look similar. However, in the VGG16 model the product of the first two terms remains equal, and the last term of Eq. (10), \lceil n_{out}^i / P \rceil, is much greater than 1, so N_{cycles\_total}^{CONV_i} >> N_{cycles}^{pooling_j} and the pooling time can be saved if inter-layer parallel processing is applied.

For a fully connected layer, the number of computation cycles is estimated by the following equation:

N_{cycles}^{FC} = \frac{L^l n_{in}^l n_{out}^l}{P^l},  (15)

where L^l denotes the length of the input feature map, n_{in}^l and n_{out}^l represent the channels of the input layer and output layer, and P^l is the degree of parallelism of the fully connected layer.

Similarly, the total number of computation cycles of a whole FC layer is shown in Eq. (16):

N_{cycles\_total}^{FC} = \frac{W L^l n_{in}^l n_{out}^l}{P^l},  (16)

where W denotes the width of the input layer.

The number of interval clock cycles between two partial outputs of the FC layers is \sum_i N_{cycles}^{CONV_i} (note that N_{cycles}^{CONV_i} can be repeated several times), and in general N_{cycles}^{FC} is smaller than \sum_i N_{cycles}^{CONV_i}. Therefore, once inter-layer parallel processing is utilized, the computation time of the fully connected layer can be saved as well.
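As a rough illustration of Eqs. (10)-(12), the snippet below plugs a VGG16-like layer into the cycle and bandwidth formulas. The parallelism degree P, the weight width Q, the clock frequency f, and the chosen layer are assumed example values, not the configuration reported in Table V.

```python
from math import ceil

def conv_cycles(L_in, n_in, n_out, W_in, k=3, P=16, Tr=4):
    """Eqs. (10)-(11): total cycles for one CONV layer."""
    per_band = ceil(L_in / k) * ceil(n_in / P) * ceil(n_out / P)   # Eq. (10)
    return per_band * ceil(2 * W_in / Tr)                          # Eq. (11)

def weight_bandwidth(L_in_min, k=3, P=16, Q=8, f=170e6):
    """Eq. (12): minimum DRAM bandwidth (bits/s) needed to stream weights."""
    return k * k * P * P * Q * f / ceil(L_in_min / k)

# Example: a 224 x 224 layer with 64 input and 64 output maps (VGG16-like).
print(conv_cycles(L_in=224, n_in=64, n_out=64, W_in=224))
# The smallest VGG16 CONV feature maps are 14 x 14, so min{ceil(L_in/k)} = 5 here.
print(weight_bandwidth(L_in_min=14) / 8 / 2**30, "GiB/s")
```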

TABLE IV
COMPARISON WITH OTHER WORKS

TABLE V
IMPLEMENTATION RESULTS OF VGG16 ON TWO FPGA PLATFORMS

B. Storage and Energy Analysis

For embedded FPGA platforms, due to limited internal memory resources, it is almost impossible to place a whole large DCNN model on chip. If the intermediate pixel data transfer is to be totally avoided with a regular layer-wise implementation, the required memory storage is determined by the largest storage requirement of two adjacent layers. In the VGG16 model, the largest storage requirement comes from the first convolutional layer and its following convolutional layer. Both of them contain 3.2M (224 × 224 × 64) feature map pixels, so a total of 6.4M pixels would need to be stored on chip.

With the proposed memory storage design, the data storage scheme is arranged in a novel way which divides a whole layer computation into a series of partial row computations. In the VGG16 model, the total number of feature map pixels stored in on-chip memory is 463.5k. Our design reduces the memory requirement by 14 times, and pixel data is no longer required to be sent to off-chip DRAM. Because the write operation delay of DRAM is avoided, the efficiency of the memory bandwidth is highly improved.

The energy intensive consumptions lie in two aspects: 1) the multiplications in arithmetic computations; 2) the data exchanges between off-chip DRAM and internal memories.

In the arithmetic aspect, the energy cost of a multiplication operation is 31 times that of an addition operation in a 45nm CMOS process [31]. In the VGG16 model, the convolutional kernel size is universally 3 × 3. The regular convolution between a 3 × 3 receptive field and a 3 × 3 kernel requires 9 multiplications and 8 additions to get an output pixel. With the proposed 3-parallel FCUs, it needs 18 (6 × 3) multiplications and 36 (10 × 3 + 3 × 2) additions to calculate 3 output pixels. In other words, to obtain one output, it only requires 6 multiplications and 12 additions. It is estimated that approximately 33% of the arithmetic computing energy can be saved.
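A quick check of the per-output operation counts above, using the 31:1 multiplication-to-addition energy ratio quoted from [31]; the relative energy unit and the resulting ~31% figure are only a back-of-the-envelope sketch of the roughly 33% estimate.

```python
def arithmetic_energy(mults, adds, e_add=1.0, e_mult=31.0):
    """Relative arithmetic energy per output pixel (45 nm ratio from [31])."""
    return mults * e_mult + adds * e_add

regular = arithmetic_energy(9, 8)       # direct 3x3 convolution per output
fcu     = arithmetic_energy(6, 12)      # 3-parallel FCU: 18 mult / 36 add per 3 outputs
print(regular, fcu, 1 - fcu / regular)  # 287.0 198.0 ~0.31 saving
```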
In the memory aspect, DRAM access consumes two orders of magnitude more energy than SRAM access and has much higher latency. The traditional memory architecture needs frequent access to external memory and cannot meet the low power requirement of embedded systems. The achievable throughput is restricted by the high latency as well.

In our architecture, data needs to be transferred from DRAM to on-chip RAMs on two occasions. One is during the loading time, when the pixel data of the input pictures needs to be transferred. The other is the refresh of the WBs during computing phases. The massive raw input pictures and the huge amount of weight parameters are stored in the DRAM. Since feature map pixels are not sent to DRAM, the latency of data transmission between external and internal memories is avoided. The energy is reduced because intermediate feature map data no longer needs to be stored in the DRAM.

VIII. IMPLEMENTATION RESULTS AND COMPARISON

A. Implementation Analysis

The very deep CNN model VGG16 is implemented as an experimental case in our design. We only report average performances because a layer-wise implementation is not applied. The performance of the fully connected layers is not constrained by the memory bandwidth, or at least the bandwidth is not the main factor; the degree of computing parallelism, however, is still limited by the bandwidth.

The implementation results are shown in Table V. The same VGG16 model is implemented with different degrees of computation on the Xilinx Zynq ZC706 and Virtex VC707 platforms. We achieve average performances of 316.23 GOP/s and 1250.21 GOP/s, and the frame rates are 8.78 and 33.80 FPS, respectively.

B. Comparison Analysis

In Table IV, we compare our design to other significant works on FPGA platforms in recent years. As shown in Table IV, we put two hardware implementations of the same CNN model but with different computing degrees on the FPGA platforms Zynq XC7Z045 and Virtex7 VX485t, respectively. We achieve the highest performance (GOP/s) and resource utilization efficiency (GOP/s/slice).
Compared to regular CNN architectures, the advantages of [4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
the proposed architectures are shown as follows: hierarchies for accurate object detection and semantic segmentation,”
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014,
1) The proposed FCUs are computationally energy efficient pp. 580–587.
because the multiplications are reduced. Based on the [5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11,
synthesis results, it is shown that the power of a 16-bit pp. 2278–2324, Nov. 1998.
multiplier in the Design Compiler is approximately [6] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn,
20 times that of a 16-bit adder. The power of an FCU is and D. Yu, “Convolutional neural networks for speech recogni-
tion,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 22,
dominated by the multipliers. It is estimated that 33% no. 10, pp. 1533–1545, Oct. 2014. [Online]. Available: http://dx.doi.org/
of the computational energy in regular convolutions with 10.1109/TASLP.2014.2339736
3 × 3 kernel can be saved by applying 3-parallel FCUs. [7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks,” in Proc. Int. Conf. Neural Inf.
Moreover, due to the reduction of multiplications, the Process. Syst., 2012, pp. 1097–1105.
degree of parallelism in computation can be improved [8] H. Nakahara and T. Sasao, “A deep convolutional neural network based
on embedded systems. on nested residue number system,” in Proc. 25th Int. Conf. Field
2) A novel data storage and reuse scheme are proposed. Program. Logic Appl. (FPL), Sep. 2015, pp. 1–6.
[9] D. Strigl, K. Kofler, and S. Podlipnig, “Performance and scalability of
Data transfer is significantly reduced and restricted to GPU-based convolutional neural networks,” in Proc. 18th Euromicro
data reads of input pictures and weights. Since the Conf. Parallel, Distrib. Netw.-Based Process., Feb. 2010, pp. 317–324.
write transfer is avoided, we achieve a high degree [10] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, “Design space explo-
ration of FPGA-based deep convolutional neural networks,” in Proc.
of parallelism and an outstanding performance in our 21st Asia South Pacific Design Autom. Conf. (ASP-DAC), Jan. 2016,
design. A further on-chip memory reuse method is pp. 575–580.
also developed, which further reduces on-chip memory [11] M. Sankaradas et al., “A massively parallel coprocessor for convolutional
neural networks,” in Proc. 20th IEEE Int. Conf. Appl.-Specific Syst.,
requirement without any overhead. Archit. Process. (ASAP), Jul. 2009, pp. 53–60.
3) The ENQ method is employed in quantization process [12] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine:
and it makes considerable reduction of the stored bits Towards uniformed representation and acceleration for deep convolu-
tional neural networks,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided
of pixel data, which further saves on-chip storage and Design (ICCAD), Nov. 2016, pp. 1–8.
energy. [13] T. Chen et al., “DianNao: A small-footprint high-throughput acceler-
ator for ubiquitous machine-learning,” ACM SIGPLAN Notice Archit.
News, vol. 49, no. 1, pp. 269–284, Mar. 2014. [Online]. Available:
IX. CONCLUSION

In this paper, we focus on efficient hardware designs for CNN implementations. Convolutions dominate the computational complexity, while the latency of data transfer between external and internal memories is the main bottleneck of performance improvement. The fast FIR algorithm (FFA) is introduced, and fast convolution units (FCUs) are developed based on FFAs. By employing the parallel FCUs in the convolutions of CNN models, the computational complexity and energy are cut down significantly. A data storage and reuse scheme is proposed for data arrangement, which avoids the transfer of intermediate pixel data between on-chip and off-chip memories and thereby largely improves the throughput. The ENQ method is also introduced to further reduce the power and memory requirements. The VGG16 model is implemented on both Xilinx Zynq ZC706 and Virtex VC707 platforms, achieving frame rates of 8.76 FPS and 33.80 FPS with average performances of 316.23 GOP/s and 1250.21 GOP/s, respectively.
Jichen Wang received the B.S. degree in microelectronics from Nanjing University, Nanjing, China, in 2016, where he is currently pursuing the M.S. degree in integrated circuit engineering. His research interests include VLSI design and efficient hardware architectures for machine learning, especially deep learning related applications.

Jun Lin received the B.S. degree in physics and the M.S. degree in microelectronics from Nanjing University, Nanjing, China, in 2007 and 2010, respectively, and the Ph.D. degree in electrical engineering from Lehigh University, Bethlehem, in 2015. From 2010 to 2011, he was an ASIC Design Engineer with AMD. During summer 2013, he was an Intern with Qualcomm Research, Bridgewater, NJ, USA. In 2015, he joined the School of Electronic Science and Engineering, Nanjing University, where he is currently an Associate Research Professor. His current research interests include low-power high-speed VLSI design, specifically VLSI design for digital signal processing, cryptography, and deep learning. Dr. Lin is a member of the Design and Implementation of Signal Processing Systems Technical Committee of the IEEE Signal Processing Society. He was a co-recipient of the Merit Student Paper Award at the IEEE Asia Pacific Conference on Circuits and Systems in 2008 and a recipient of the 2014 IEEE Circuits and Systems Society Student Travel Award.

Zhongfeng Wang (F'16) received the B.E. and M.S. degrees from Tsinghua University, Beijing, China, and the Ph.D. degree from the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA, in 2000. He was with National Semiconductor Corporation, Santa Clara, CA, USA. He was an Assistant Professor with the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA. He served as a leading VLSI Architect with Broadcom Corporation for nearly nine years. In 2016, he joined Nanjing University as a Distinguished Professor through the State's 1000-Talent Plan. He is a world-recognized expert on VLSI for signal processing systems. Since 2007, in the current record, he has had five papers ranked among the top 20 most downloaded manuscripts in the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS. During his tenure at Broadcom, he contributed significantly to 10 Gb/s and beyond high-speed networking products. Additionally, he has made critical contributions in designing FEC coding schemes for 100 and 400 Gb/s Ethernet standards. So far, his technical proposals have been adopted by many international networking standards. He has authored over 150 technical papers, edited one book (VLSI), and filed tens of U.S. patent applications and disclosures. He was a recipient of the IEEE Circuits and Systems Society VLSI Transactions Best Paper Award in 2007. His current research interests are in the area of digital communications, machine learning, and efficient VLSI implementation. He has served as a Technical Program Committee Member (or Co-Chair), Session (or Track) Chair, and Review Committee Member for tens of international conferences. In 2013, he served on the Best Paper Award Selection Committee for the IEEE Circuits and Systems Society. Since 2004, he has been serving as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I (TCAS-I), TCAS-II, and IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS. He is currently a Guest Editor for a special issue of the IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS.
