RegNet: Self-Regulated Network for Image Classification

Jing Xu, Yu Pan, Xinglin Pan, Steven Hoi, Fellow, IEEE, Zhang Yi, Fellow, IEEE, and Zenglin Xu∗

arXiv:2101.00590v1 [eess.IV] 3 Jan 2021

Jing Xu and Zenglin Xu are with the School of Science and Technology, Harbin Institute of Technology, Shenzhen, Shenzhen 510085, Guangdong, China. Yu Pan and Xinglin Pan are with the SMILE Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610031, China. Steven Hoi is with the School of Information Systems (SIS), Singapore Management University, Singapore. Zhang Yi is with the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu 610065, China. Zenglin Xu is the corresponding author (e-mail: zenglin@gmail.com).

Abstract—The ResNet and its variants have achieved remarkable successes in various computer vision tasks. Despite its success in making gradients flow through building blocks, the simple shortcut connection mechanism limits the ability of re-exploring new, potentially complementary features, due to the additive function. To address this issue, in this paper we propose to introduce a regulator module as a memory mechanism that extracts complementary features, which are further fed to the ResNet. In particular, the regulator module is composed of convolutional RNNs (e.g., convolutional LSTMs or convolutional GRUs), which are shown to be good at extracting spatio-temporal information. We name the new regulated networks RegNet. The regulator module can be easily implemented and appended to any ResNet architecture. We also apply the regulator module to improve the Squeeze-and-Excitation ResNet, to show the generalization ability of our method. Experimental results on three image classification datasets have demonstrated the promising performance of the proposed architecture compared with the standard ResNet, SE-ResNet, and other state-of-the-art architectures.

Index Terms—Residual networks, convolutional recurrent neural networks, convolutional neural networks

I. INTRODUCTION

Convolutional neural networks (CNNs) have achieved abundant breakthroughs in a number of computer vision tasks [1]. Since the championship achieved by AlexNet [2] at the ImageNet competition in 2012, various new architectures have been proposed, including VGGNet [3], GoogLeNet [4], ResNet [5], DenseNet [6], and the recent NASNet [7].

Among these deep architectures, ResNet and its variants [8]–[11] have obtained significant attention with outstanding performances in both low-level and high-level vision tasks. The remarkable success of ResNets is mainly due to the shortcut connection mechanism, which makes the training of deeper networks possible: gradients can directly flow through building blocks, and the gradient vanishing problem can be avoided in some sense. However, the shortcut connection mechanism makes each block focus on learning its respective residual output, where inner-block information communication is somehow ignored and some reusable information learned by previous blocks tends to be forgotten in later blocks. To illustrate this point, we visualize the output (residual) feature maps learned by consecutive blocks of ResNet in Fig. 1(a). It can be seen that, due to the summation operation among blocks, the adjacent outputs O_t, O_{t+1}, and O_{t+2} look very similar to each other, which indicates that little new information has been learned through consecutive blocks.

A potential solution to the above problem is to capture the spatio-temporal dependency between building blocks while constraining the growth in the number of parameters. To this end, we introduce a new regulator mechanism, in parallel to the shortcuts in ResNets, for controlling the memory information passed to the next building block. In detail, we adopt convolutional RNNs ("ConvRNNs") [12] as the regulator to encode the spatio-temporal memory. We name the new architecture RNN-Regulated Residual Networks, or "RegNet" for short. As shown in Fig. 1(a), at the i-th building block, a recurrent unit in the convolutional RNN takes the feature from the current building block as its input (denoted by I_i) and encodes both the input and the serial information to generate the hidden state (denoted by H_i); the hidden state is concatenated with the input for reuse in the next convolution operation (leading to the output feature O_i), and is also transported to the next recurrent unit. To better understand the role of the regulator, we visualize the feature maps, as shown in Fig. 1(a). We can see that the H_i generated by the ConvRNN complements the input features I_i. After convolving the concatenation of H_i and I_i, the proposed model obtains more meaningful features, with richer edge information in O_i, than ResNet does. To quantitatively evaluate the information contained in the feature maps, we test their classification ability on test data (by adding an average pooling layer and the final fully connected layer to the O_i of the last three blocks). As shown in Fig. 1(b), the new architecture obtains higher prediction accuracy, which indicates the effectiveness of the regulator built from ConvRNNs.

Thanks to the parallel structure of the regulator module, the RNN-based regulator is easy to implement and applicable to other ResNet-based structures, such as SE-ResNet [11], Wide ResNet [8], Inception-ResNet [9], ResNeXt [10], the Dual Path Network (DPN) [13], and so on. Without loss of generality, as another instance to demonstrate the effectiveness of the proposed regulator, we also apply the ConvRNN module to improve the Squeeze-and-Excitation ResNet (shorted as "SE-RegNet").

For evaluation, we apply our model to the task of image classification on three highly competitive benchmark datasets, including CIFAR-10, CIFAR-100, and ImageNet.
Fig. 1. (a): Visualization of feature maps in ResNet [5] and RegNet. We visualize the output feature maps O_i of the i-th building blocks, i ∈ {t, t+1, t+2}. In RegNets, I_i denotes the input feature maps and H_i denotes the hidden state generated by the ConvRNN at step i. By applying convolution operations over the concatenation of I_i with H_i, we obtain the regulated outputs (denoted by O_i) of the i-th building block. (b): Prediction on test data based on the output feature maps of consecutive building blocks. At test time, we add an average pooling layer and the final fully connected layer to the outputs of the last three building blocks (i ∈ {7, 8, 9}) in ResNet-20 and RegNet-20 to obtain the classification results. It can be seen that the output of each block, aided with the memory information, results in higher classification accuracy.
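The probing protocol behind Fig. 1(b) can be written down in a few lines. The sketch below only illustrates that protocol and is not the authors' code; the class name BlockOutputProbe, the channel count, and the tensor shapes are made up for the example.

```python
import torch
import torch.nn as nn

class BlockOutputProbe(nn.Module):
    """Linear probe: global average pooling plus a fully connected layer on a block's output O_i."""

    def __init__(self, channels: int, num_classes: int = 10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.pool(feat).flatten(1)  # (B, C, H, W) -> (B, C)
        return self.fc(x)

# Hypothetical usage: probe the 64-channel output of one of the last blocks.
probe = BlockOutputProbe(channels=64, num_classes=10)
o_i = torch.randn(8, 64, 8, 8)  # stand-in for a block's output feature map
logits = probe(o_i)             # (8, 10) class scores
```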

In comparison with ResNet and SE-ResNet, our experimental results demonstrate that the proposed architecture can significantly improve the classification accuracy on all the datasets. We further show that the regulator can reduce the required depth of ResNets while reaching the same level of accuracy.

II. RELATED WORK

Deep neural networks have achieved empirical breakthroughs in machine learning. However, training networks of sufficient depth is a very tricky problem. The shortcut connection has been proposed to address this difficulty in optimization to some extent [5], [14]. Via the shortcut, information can flow across layers without attenuation. A pioneering work is the Highway Network [14], which implements shortcut connections by using a gating mechanism. In addition, ResNet [5] explicitly requires building blocks to fit a residual mapping, which is assumed to be easier to optimize.

Due to the powerful capability of ResNets in dealing with vision tasks, a number of variants have been proposed, including WRN [8], Inception-ResNet [9], ResNeXt [10], WResNet [15], and so on. ResNet and ResNet-based models have achieved impressive, record-breaking performance in many challenging tasks. In object detection, 50- and 101-layer ResNets are usually used as the basic feature extractors in many models, such as Faster R-CNN [16], RetinaNet [17], and Mask R-CNN [18]. The most recent models for image super-resolution, such as SRResNet [19], EDSR, and MDSR [20], are all based on ResNets with little modification. Meanwhile, in [21], ResNet is introduced to remove rain streaks and obtains state-of-the-art performance.

Despite the success in many applications, ResNets still suffer from the depth issue [22]. DenseNet, proposed by [6], concatenates the input features with the output features through a densely connected path in order to encourage the network to reuse all of the feature maps of previous layers. Obviously, not all feature maps need to be reused in future layers, so the densely connected network also leads to some redundancy with extra computational cost. Recently, the Dual Path Network [13] and the Mixed Link Network [23] have been proposed as trade-offs between ResNets and DenseNets. In addition, some module-based architectures have been proposed to improve the performance of the original ResNet. SENet [11] proposes a lightweight module to compute channel-wise attention over intermediate feature maps. CBAM [24] and BAM [25] design modules to infer attention maps along both the channel and spatial dimensions. Despite their success, those modules regulate the intermediate feature maps based on attention information learned from the intermediate features themselves, so the full utilization of the historical spatio-temporal information of previous features still remains an open problem.

On the other hand, convolutional RNNs (shorted as ConvRNN), such as ConvLSTM [12] and ConvGRU [26], have been used to capture spatio-temporal information in a number of applications, such as rain removal [27], video super-resolution [28], video compression [29], and video object detection and segmentation [30], [31]. Most of those works embed ConvRNNs into models to capture the dependency information in a sequence of images. In order to regulate the information flow of ResNet, we propose to leverage ConvRNNs as a separate module that extracts spatio-temporal information complementary to the original feature maps of ResNets.

III. OUR MODEL

In this section, we first revisit the background of ResNets and two advanced ConvRNNs: ConvLSTM and ConvGRU. Then we present the proposed RegNet architectures.

A. ResNet

The degradation problem, which makes traditional networks hard to converge, is exposed when the architecture goes deeper.
The problem can be mitigated by ResNet [5] to some extent. Building blocks are the basic architecture of ResNet: instead of directly fitting an original underlying mapping, as shown in Fig. 2(a), each block fits a residual mapping, as shown in Fig. 2(b). The deep residual network obtained by stacking building blocks has achieved excellent performance in image classification, which proves the competence of the residual mapping.

Fig. 2. 2(a) shows the original underlying mapping, while 2(b) shows the residual mapping in ResNet [5].
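For concreteness, the residual mapping of Fig. 2(b) can be sketched in PyTorch as follows. This is a generic identity-shortcut block (equal input and output channels), given only to fix notation; it is not claimed to be the authors' exact implementation.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two-layer ResNet building block: out = ReLU(x + F(x)), identity shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.relu(self.bn1(self.conv1(x)))  # first 3x3 conv of F(x)
        f = self.bn2(self.conv2(f))             # second 3x3 conv of F(x)
        return self.relu(x + f)                 # shortcut addition

block = BasicResidualBlock(16)
y = block(torch.randn(2, 16, 32, 32))           # same shape as the input
```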
B. ConvRNN and its Variants

RNNs and their classical variants LSTM and GRU have achieved great success in the field of sequence processing. To tackle spatio-temporal problems, we adopt the basic ConvRNN and its variants ConvLSTM and ConvGRU, which are obtained from the vanilla RNNs by replacing their fully-connected operators with convolutional operators. Furthermore, to reduce the computational overhead, we delicately design the convolutional operation in ConvRNNs. In our implementation, the ConvRNN can be formulated as

H^t = tanh( {}_{2N}W_h^{N} * [X^t, H^{t-1}] + b_h ),    (1)

where X^t is the input 3D feature map, H^{t-1} is the hidden state obtained from the earlier output of the ConvRNN, and H^t is the output 3D feature map at this step. Both the input X^t and the output H^t of the ConvRNN have N channels. Additionally, {}_{2N}W^{N} * X denotes a convolution operation between weights W and input X with 2N input channels and N output channels. To make the ConvRNN more efficient, inspired by [30], [32], given an input X with 2N channels, we conduct the convolution operation in two steps:
(1) Divide the input X with 2N channels into N groups, and use grouped convolutions [33] with 1 × 1 kernels to process each group separately, fusing the input channels.
(2) Divide the feature map obtained in step (1) into N groups, and use grouped convolutions with 3 × 3 kernels to process each group separately, capturing the spatial information per input channel.

Directly applying the original convolution with 3 × 3 kernels suffers from high computational complexity. As detailed in Table I, the new modification reduces the required computation by 18N/11 times with a comparable result (per spatial location, a dense 3 × 3 convolution from 2N to N channels costs 3 · 3 · 2N · N = 18N² multiply-accumulates, whereas the two grouped steps cost 2N + 9N = 11N). Similarly, all the convolutions in ConvGRU and ConvLSTM are replaced with this light-weight modification.

TABLE I
PERFORMANCE OF REGNET-20 WITH CONVGRU AS REGULATORS ON CIFAR-10. WE COMPARE THE TEST ERROR RATES BETWEEN TRADITIONAL 3×3 KERNELS AND OUR NEW MODIFICATION.

kernel type | err. | Params | FLOPs
3×3         | 7.35 | +330K  | +346M
Ours        | 7.42 | +44K   | +15M
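A minimal PyTorch sketch of the lightweight ConvRNN cell of Eq. (1) with the two-step grouped convolution might look as follows. The exact group sizes and bias placement are assumptions made for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class LightweightConvRNNCell(nn.Module):
    """Vanilla ConvRNN cell in the spirit of Eq. (1): H^t = tanh(W * [X^t, H^{t-1}] + b),
    with the dense convolution replaced by the two grouped convolutions described above."""

    def __init__(self, channels: int):
        super().__init__()
        n = channels
        # Step (1): 2N -> N channels, N groups (two input channels each), 1x1 kernels.
        self.fuse = nn.Conv2d(2 * n, n, kernel_size=1, groups=n, bias=False)
        # Step (2): N -> N channels, one channel per group, 3x3 kernels (depthwise).
        self.spatial = nn.Conv2d(n, n, kernel_size=3, padding=1, groups=n, bias=True)

    def forward(self, x: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        z = torch.cat([x, h_prev], dim=1)  # [X^t, H^{t-1}] has 2N channels
        return torch.tanh(self.spatial(self.fuse(z)))

# One recurrent step over a 16-channel feature map.
cell = LightweightConvRNNCell(channels=16)
h_t = cell(torch.randn(4, 16, 32, 32), torch.zeros(4, 16, 32, 32))  # (4, 16, 32, 32)
```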
C. RNN-Regulated ResNet

To deal with the CIFAR-10/100 datasets and the ImageNet dataset, [5] proposed two kinds of ResNet building blocks: the non-bottleneck building block and the bottleneck building block. Based on those, by applying ConvRNNs as regulators, we obtain the RNN-regulated ResNet building module and the bottleneck RNN-regulated ResNet building module, correspondingly.

1) RNN-Regulated ResNet Module (RegNet module): The illustration of the RegNet module is shown in Fig. 3(a). Here, we choose ConvLSTM for exposition. H^{t-1} denotes the earlier output from the ConvLSTM, and H^t is the output of the ConvLSTM at the t-th module. X_i^t denotes the i-th feature map at the t-th module.

Fig. 3. The RegNet module is shown in 3(a). The bottleneck RegNet block is shown in 3(b). T denotes the number of building blocks as well as the total number of time steps of the ConvRNN.

The t-th RegNet (ConvLSTM) module can be expressed as

X_2^t = ReLU(BN(W_{12}^t * X_1^t + b_{12}^t)),
[H^t, C^t] = ReLU(BN(ConvLSTM(X_2^t, [H^{t-1}, C^{t-1}]))),
X_3^t = ReLU(BN(W_{23}^t * Concat[X_2^t, H^t])),
X_4^t = BN(W_{34}^t * X_3^t + b_{34}^t),
X_1^{t+1} = ReLU(X_1^t + X_4^t),    (2)

where W_{ij}^t denotes the convolutional kernel mapping feature map X_i^t to X_j^t, and b_{ij}^t denotes the corresponding bias. Both W_{12}^t and W_{34}^t are 3 × 3 convolutional kernels, while W_{23}^t is a 1 × 1 kernel. BN(·) indicates batch normalization, and Concat[·] refers to the concatenation operation.

Notice that in Eq. (2) the input feature X_2^t and the previous output of the ConvLSTM, H^{t-1}, are the inputs of the ConvLSTM in the t-th module. According to these inputs, the ConvLSTM automatically decides whether the information in its memory cell will be propagated to the output hidden feature map H^t.
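Under the assumption that the regulator is a plain ConvRNN rather than the ConvLSTM used in Eq. (2) (so that the sketch stays short and self-contained), one regulated building module could be written as below; the class and variable names are hypothetical, and this is a sketch of the structure of Eq. (2), not the authors' implementation.

```python
import torch
import torch.nn as nn

class RegNetBasicModule(nn.Module):
    """One regulated (non-bottleneck) building module, following the structure of Eq. (2).
    The regulator here is a vanilla ConvRNN instead of a ConvLSTM, to keep the sketch short."""

    def __init__(self, channels: int):
        super().__init__()
        c = channels
        self.conv12 = nn.Conv2d(c, c, 3, padding=1, bias=False)   # W_12: 3x3
        self.bn12 = nn.BatchNorm2d(c)
        self.rnn = nn.Conv2d(2 * c, c, 3, padding=1)               # ConvRNN over [X_2^t, H^{t-1}]
        self.bn_h = nn.BatchNorm2d(c)
        self.conv23 = nn.Conv2d(2 * c, c, 1, bias=False)           # W_23: 1x1 over Concat[X_2^t, H^t]
        self.bn23 = nn.BatchNorm2d(c)
        self.conv34 = nn.Conv2d(c, c, 3, padding=1, bias=False)    # W_34: 3x3
        self.bn34 = nn.BatchNorm2d(c)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x1: torch.Tensor, h_prev: torch.Tensor):
        x2 = self.relu(self.bn12(self.conv12(x1)))
        h = self.relu(self.bn_h(torch.tanh(self.rnn(torch.cat([x2, h_prev], dim=1)))))
        x3 = self.relu(self.bn23(self.conv23(torch.cat([x2, h], dim=1))))
        x4 = self.bn34(self.conv34(x3))
        return self.relu(x1 + x4), h  # regulated output X_1^{t+1} and hidden state for the next module

# One step with a zero-initialized hidden state.
module = RegNetBasicModule(16)
x_next, h = module(torch.randn(2, 16, 32, 32), torch.zeros(2, 16, 32, 32))
```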
2) Bottleneck RNN-Regulated ResNet Module (bottleneck RegNet module): The bottleneck RegNet module, based on the bottleneck ResNet building block, is shown in Fig. 3(b). The bottleneck building block was introduced in [5] for dealing with pictures of large size. Based on it, the t-th bottleneck RegNet module can be expressed as

X_2^t = ReLU(BN(W_{12}^t * X_1^t + b_{12}^t)),
[H^t, C^t] = ReLU(BN(ConvLSTM(X_2^t, [H^{t-1}, C^{t-1}]))),
X_3^t = ReLU(BN(W_{23}^t * X_2^t + b_{23}^t)),
X_4^t = ReLU(BN(W_{34}^t * Concat[X_3^t, H^t])),
X_5^t = BN(W_{45}^t * X_4^t + b_{45}^t),
X_1^{t+1} = ReLU(X_1^t + X_5^t),    (3)

where W_{12}^t and W_{45}^t are the two 1 × 1 kernels, and W_{23}^t is the 3 × 3 bottleneck kernel. W_{34}^t is a 1 × 1 kernel used for fusing features in our model.

IV. EXPERIMENTS

In this section, we evaluate the effectiveness of the proposed ConvRNN regulator on three benchmark datasets: CIFAR-10, CIFAR-100, and ImageNet. We run the algorithms on PyTorch. The small-scale models for CIFAR are trained on a single NVIDIA 1080 Ti GPU, and the large-scale models for ImageNet are trained on 4 NVIDIA 1080 Ti GPUs.

A. Experiments on CIFAR

The CIFAR datasets [34] consist of RGB images with 32 × 32 pixels. Each dataset contains 50k training images and 10k testing images. The images in CIFAR-10 and CIFAR-100 are drawn from 10 and 100 classes, respectively. We train on the training set and evaluate on the test set.

By applying ConvRNNs to ResNet and SE-ResNet, we obtain the RegNet and SE-RegNet models, respectively. Here, we use the 20-layer RegNet and SE-RegNet to prove the wide applicability of our method. The SE-RegNet building module based on Fig. 3(a) is used to analyze the CIFAR datasets, and the structural details of SE-RegNet are shown in Table II. The input of the network is a 32 × 32 image. In each conv_i layer, i ∈ {1, 2, 3}, there are n RegNet building modules stacked sequentially and connected together by a ConvRNN. In summary, there are 3 ConvRNNs in our architecture, and each ConvRNN acts on the n RegNet building modules of its stage. The reduction ratio r in the SE block is 8.

TABLE II
ARCHITECTURES FOR THE CIFAR-10/100 DATASETS. BY SETTING n ∈ {3, 5, 7}, WE OBTAIN THE {20, 32, 56}-LAYER REGNET.

name   | output size | (6n+2)-layer RegNet
conv_0 | 32 × 32     | 3 × 3, 16
conv_1 | 32 × 32     | ConvRNN1 + [3 × 3, 16; 3 × 3, 16] × n
conv_2 | 16 × 16     | ConvRNN2 + [3 × 3, 32; 3 × 3, 32] × n
conv_3 | 8 × 8       | ConvRNN3 + [3 × 3, 64; 3 × 3, 64] × n
       | 1 × 1       | average pooling, FC, softmax
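The stage structure of Table II (n regulated modules per conv_i layer, sharing one ConvRNN whose hidden state is threaded through the stage) can be sketched as follows. RegNetBasicModule refers to the hypothetical block sketched in Section III-C above, and the zero initialization of the hidden state is an assumption.

```python
import torch
import torch.nn as nn

class RegNetStage(nn.Module):
    """One conv_i stage from Table II: n regulated modules stacked sequentially and tied
    together by a single ConvRNN whose hidden state is passed from block to block."""

    def __init__(self, channels: int, n: int, block_cls):
        super().__init__()
        self.blocks = nn.ModuleList(block_cls(channels) for _ in range(n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.zeros_like(x)          # initial hidden state of the stage's regulator (assumed)
        for block in self.blocks:        # T = n recurrent steps within the stage
            x, h = block(x, h)
        return x

# Hypothetical usage, reusing the RegNetBasicModule sketch from Section III-C:
# stage = RegNetStage(channels=16, n=3, block_cls=RegNetBasicModule)
```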
In this experiment, we use SGD with a momentum of 0.9 and a weight decay of 1e-4. We train with a batch size of 64 for 150 epochs. The initial learning rate is 0.1 and is divided by 10 at epoch 80. The data augmentation of [35] is used in training. The results of SE-ResNet on CIFAR are based on our implementation, since they were not reported in [11].
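This reported CIFAR optimization setup maps directly onto standard PyTorch components. The snippet below is a hedged sketch: the model and data here are stand-ins (a toy module and random tensors); only the optimizer, schedule, and batch size follow the text.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins: in the paper the model is RegNet-20 and the data is CIFAR-10.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(16, 10))
train_loader = DataLoader(TensorDataset(torch.randn(256, 3, 32, 32),
                                        torch.randint(0, 10, (256,))),
                          batch_size=64, shuffle=True)

# Reported setup: SGD, momentum 0.9, weight decay 1e-4, batch size 64,
# 150 epochs, initial lr 0.1 divided by 10 at epoch 80.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(150):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # learning rate drops by 10x after epoch 80
```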
TABLE III
CLASSIFICATION ERROR RATES ON CIFAR-10/100. BEST RESULTS ARE MARKED IN BOLD.

model                   | C10  | C100
ResNet-20 [5]           | 8.38 | 31.72
RegNet-20 (ConvRNN)     | 7.60 | 30.03
RegNet-20 (ConvGRU)     | 7.42 | 29.69
RegNet-20 (ConvLSTM)    | 7.28 | 29.81
SE-ResNet-20            | 8.02 | 31.14
SE-RegNet-20 (ConvRNN)  | 7.55 | 29.63
SE-RegNet-20 (ConvGRU)  | 7.25 | 29.08
SE-RegNet-20 (ConvLSTM) | 6.98 | 29.02

1) Results on CIFAR: The classification errors on the CIFAR-10/100 test sets are shown in Table III. We can see from the results that, at the same depth, both RegNet and SE-RegNet outperform the original models by a significant margin. Compared with ResNet-20, our RegNet-20 with ConvLSTM decreases the error rate by 1.51% on CIFAR-10 and 2.04% on CIFAR-100. At the same time, compared with SE-ResNet-20, our SE-RegNet-20 with ConvLSTM decreases the error rate by 1.04% on CIFAR-10 and 2.12% on CIFAR-100. Using ConvGRU as the regulator reaches the same level of accuracy as ConvLSTM. Because the vanilla ConvRNN lacks a gating mechanism, it performs slightly worse, but it still makes great progress compared with the baseline model.

2) Parameter Analysis: For a fair comparison, we evaluate our models using the number of model parameters as the reference. As shown in Table IV, we list the test error rates of the 20-, 32-, and 56-layer ResNets and their respective RegNet counterparts on CIFAR-10/100. After adding minimal additional parameters, our RegNets with both ConvGRU and ConvLSTM surpass the ResNets by a large margin. Our 20-layer RegNet, with an extra 0.04M parameters, even outperforms the 32-layer ResNet on both CIFAR-10/100: our 20-layer RegNet (ConvLSTM) with 0.32M parameters reaches a 7.28% error rate on CIFAR-10, surpassing the 32-layer ResNet, which has 0.47M parameters and a 7.54% error rate. Fig. 4 demonstrates the parameter-efficiency comparison between RegNet and ResNet, showing that our RegNets are more parameter-efficient than simply stacking layers in the vanilla ResNet. On both CIFAR-10/100, our RegNets (GRU) obtain performance comparable to ResNet-56 with nearly 1/2 of the parameters.
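The parameter counts used for these comparisons are presumably obtained in the usual way, by summing the sizes of all trainable tensors; a small helper of the following kind (ours, not the authors') reproduces numbers in the same "0.27M" style.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Trainable parameter count in millions, the unit used in Table IV."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Stand-in example; for the actual comparison one would pass ResNet-20/32/56
# and their RegNet counterparts.
print(f"{count_parameters(nn.Conv2d(16, 16, 3)):.3f}M")
```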
3) Positions of Feature Reuse: In this subsection, we perform an ablation experiment to further analyze the effect of the position of feature reuse; that is, we investigate in which layer a ConvRNN brings the largest improvement to the final outcome. Previous studies [36] show that features in earlier layers are more general, while features in later layers are more specific. As shown in Table II, the conv_1, conv_2, and conv_3 layers are separated by down-sampling operations, which makes the features in conv_1 more low-level and those in conv_3 more specific to classification. The classification results are shown in Table V.
TABLE IV
TEST ERROR RATES ON CIFAR-10/100. WE USE CONVGRU AND CONVLSTM AS REGULATORS OF RESNET. THE INCREASE IN PARAMETERS OF EACH ARCHITECTURE IS LISTED NEXT TO ITS ERROR RATE.

      | C-10                                      | C-100
layer | ResNet | +ConvGRU      | +ConvLSTM      | ResNet | +ConvGRU       | +ConvLSTM
20    | 8.38   | 7.42 (+0.04M) | 7.28 (+0.04M)  | 31.72  | 29.69 (+0.04M) | 29.81 (+0.04M)
32    | 7.54   | 6.60 (+0.06M) | 6.88 (+0.07M)  | 29.86  | 27.42 (+0.07M) | 28.11 (+0.07M)
56    | 6.78   | 6.39 (+0.11M) | 6.45 (+0.12M)  | 28.14  | 27.02 (+0.11M) | 27.26 (+0.12M)

Fig. 4. Comparison of parameter efficiency on CIFAR-10 between RegNet and ResNet [5]. In both 4(a) and 4(b), the curves of our RegNets are always below those of ResNet [5], which shows that, with the same number of parameters, our models have stronger expressive ability.

TABLE V
TEST ERROR RATES ON CIFAR-10/100. WE USE CONVGRU AND CONVLSTM AS REGULATORS OF RESNET AND LIST THE INCREASE IN PARAMETERS OF EACH ARCHITECTURE. IN EACH OF OUR REGNET(i) MODELS, THERE IS ONLY ONE CONVRNN, APPLIED IN LAYER CONV_i, i ∈ {1, 2, 3}.

                 | C-10           | C-100
model            | err.  | Params | err.  | Params
ResNet [5]       | 8.38  | 0.273M | 31.72 | 0.278M
RegNet(1) (GRU)  | 7.52  | 0.279M | 30.40 | 0.285M
RegNet(2) (GRU)  | 7.48  | 0.285M | 30.34 | 0.291M
RegNet(3) (GRU)  | 7.49  | 0.306M | 30.30 | 0.312M
RegNet(1) (LSTM) | 7.56  | 0.281M | 30.23 | 0.286M
RegNet(2) (LSTM) | 7.49  | 0.290M | 30.28 | 0.296M
RegNet(3) (LSTM) | 7.52  | 0.325M | 29.92 | 0.331M

In each model, only one ConvRNN is applied. We name these models RegNet(i), i ∈ {1, 2, 3}, which denotes applying a ConvRNN only in layer conv_i while maintaining the original ResNet structure in the other layers. For a fair comparison, we again evaluate the models using the number of parameters as the reference. We can see from the results that using a ConvRNN in a lower layer (conv_1) is more parameter-efficient than in a higher layer (conv_3): with a smaller increase in parameters, lower layers bring nearly the same improvement in accuracy as higher layers. Compared with ResNet, our RegNet(1) (GRU) decreases the test error from 8.38% to 7.52% (−0.86%) on CIFAR-10 with 0.006M additional parameters, and from 31.72% to 30.40% (−1.32%) on CIFAR-100 with 0.007M additional parameters. This significant improvement with minimal additional parameters further proves the effectiveness of the proposed method. The concatenation operation in our model can fuse features together to explore new features [13], which matters more for the general features in lower layers.

B. Experiments on ImageNet

We evaluate our model on the ImageNet 2012 dataset [3], which consists of 1.28 million training images and 50k validation images from 1000 classes. Following previous papers, we report top-1 and top-5 classification errors on the validation set. Due to the limited resources of our GPUs, and without loss of generality, we run the experiments on ResNets and RegNets only.

The bottleneck RegNet building modules are applied to ImageNet. We use 4 ConvRNNs in RegNet-50; ConvRNN_i, i ∈ {1, 2, 3, 4}, controls {3, 4, 6, 3} bottleneck RegNet modules, respectively. In this experiment, we use SGD with a momentum of 0.9 and a weight decay of 1e-4. We train with a batch size of 128 for 90 epochs. The initial learning rate is 0.06 and is divided by 10 at epochs 50 and 70. The input to the network is a 224 × 224 image, randomly cropped from the resized original images or their horizontal flips. The data augmentation of [27] is used in training. We evaluate our model by applying a 224 × 224 center crop.
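The ImageNet recipe described above can be expressed with standard torchvision transforms and a MultiStepLR schedule. The following sketch follows the stated hyperparameters; the exact crop strategy (RandomResizedCrop) and the 256-pixel resize before the center crop are assumptions, and the model is a stand-in rather than RegNet-50 itself.

```python
import torch
from torchvision import transforms

# Training-time augmentation and single-crop evaluation described in the text.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),   # assumed crop strategy for "randomly cropped from the resized images"
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
eval_tf = transforms.Compose([
    transforms.Resize(256),              # assumed resize before the 224x224 center crop
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Reported optimization setup: SGD, momentum 0.9, weight decay 1e-4,
# batch size 128, 90 epochs, initial lr 0.06 divided by 10 at epochs 50 and 70.
model = torch.nn.Conv2d(3, 64, 7)        # stand-in for RegNet-50
optimizer = torch.optim.SGD(model.parameters(), lr=0.06, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 70], gamma=0.1)
```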
We compare the efficiency of the baseline ResNet-50 and its RegNet counterpart in terms of computational overhead. As shown in Table VI, with 4.7M additional parameters, RegNet improves on the baseline model by 1.38% in top-1 error and 0.85% in top-5 error.

TABLE VI
SINGLE-CROP VALIDATION ERROR RATES ON IMAGENET AND COMPLEXITY COMPARISONS. BOTH RESNET AND REGNET ARE 50-LAYER. RESNET* MEANS WE REPRODUCED THE RESULT OURSELVES.

model      | top-1 err.    | top-5 err.   | Params | FLOPs
ResNet [5] | 24.7          | 7.8          | 26.6M  | 4.14G
ResNet*    | 24.81         | 7.78         | 26.6M  | 4.14G
RegNet     | 23.43 (−1.38) | 6.93 (−0.85) | 31.3M  | 5.12G
Table VII shows the error rates of some state-of-the-art models on the ImageNet validation set. Compared with the baseline ResNet, our RegNet-50 with 31.3M parameters and 5.12G FLOPs not only surpasses ResNet-50 but also outperforms ResNet-101 with 44.6M parameters and 7.9G FLOPs. Since the proposed regulator module is essentially a beneficial complement to the shortcut mechanism in ResNets, one can easily apply it to other ResNet-based models, such as SE-ResNet, WRN-18 [8], ResNeXt [10], the Dual Path Network (DPN) [13], etc. Due to computation resource limitations, we leave the implementation of the regulator module in these ResNet extensions as future work.

TABLE VII
SINGLE-CROP ERROR RATES ON THE IMAGENET VALIDATION SET FOR STATE-OF-THE-ART MODELS. RESNET-50* DENOTES THE RE-IMPLEMENTATION RESULT FROM OUR EXPERIMENTS.

model                  | top-1 | top-5 | Params (M) | FLOPs (G)
WRN-18 (widen=2.0) [8] | 25.58 | 8.06  | 45.6       | 6.70
DenseNet-169 [6]       | 23.80 | 6.85  | 28.9       | 7.7
SE-ResNet-50 [11]      | 23.29 | 6.62  | 26.7       | 4.14
ResNet-50 [5]          | 24.7  | 7.8   | -          | -
ResNet-50*             | 24.81 | 7.78  | 26.6       | 4.14
ResNet-101 [5]         | 23.6  | 7.1   | 44.5       | 7.51
RegNet-50              | 23.43 | 6.93  | 31.3       | 5.12

V. CONCLUSIONS

In this paper, we proposed to employ a regulator module with convolutional RNNs to extract complementary features and thereby improve the representation power of ResNets. Experimental results on three image classification datasets have demonstrated the promising performance of the proposed architecture in comparison with standard ResNets and Squeeze-and-Excitation ResNets, as well as other state-of-the-art architectures.

In the future, we intend to further improve the efficiency of the proposed architecture and to apply the regulator module to other ResNet-based architectures [8]–[10] to increase their capacity. Besides, we will further explore RegNets for other challenging tasks, such as object detection [16], [17], image super-resolution [19], [20], and so on.

ACKNOWLEDGMENT

This work was partially supported by the National Key Research and Development Program of China (No. 2018AAA0100204).

REFERENCES

[1] Y. LeCun, Y. Bengio et al., "Convolutional networks for images, speech, and time series," The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, p. 1995, 1995.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," CoRR, vol. abs/1409.4842, 2014.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[6] G. Huang, Z. Liu, and K. Q. Weinberger, "Densely connected convolutional networks," CoRR, vol. abs/1608.06993, 2016.
[7] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," CoRR, vol. abs/1707.07012, 2017.
[8] S. Zagoruyko and N. Komodakis, "Wide residual networks," CoRR, vol. abs/1605.07146, 2016.
[9] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," CoRR, vol. abs/1602.07261, 2016.
[10] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," CoRR, vol. abs/1611.05431, 2016.
[11] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," CoRR, vol. abs/1709.01507, 2017.
[12] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," CoRR, vol. abs/1506.04214, 2015.
[13] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, "Dual path networks," CoRR, vol. abs/1707.01629, 2017. [Online]. Available: http://arxiv.org/abs/1707.01629
[14] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Highway networks," CoRR, vol. abs/1505.00387, 2015.
[15] F. Shen, R. Gan, and G. Zeng, "Weighted residuals for very deep networks," International Conference on Systems, pp. 936–941, 2016.
[16] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," CoRR, vol. abs/1506.01497, 2015.
[17] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," CoRR, vol. abs/1708.02002, 2017.
[18] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, "Mask R-CNN," CoRR, vol. abs/1703.06870, 2017.
[19] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," CoRR, vol. abs/1609.04802, 2016.
[20] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, "Enhanced deep residual networks for single image super-resolution," in CVPR Workshops, July 2017.
[21] X. Fu, J. Huang, D. Zeng, Y. Huang, X. Ding, and J. Paisley, "Removing rain from single images via a deep detail network," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 2017, pp. 1715–1723.
[22] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, "Deep networks with stochastic depth," CoRR, vol. abs/1603.09382, 2016.
[23] W. Wang, X. Li, J. Yang, and T. Lu, "Mixed link networks," CoRR, vol. abs/1802.01808, 2018.
[24] S. Woo, J. Park, J. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," CoRR, vol. abs/1807.06521, 2018.
[25] J. Park, S. Woo, J. Lee, and I. S. Kweon, "BAM: Bottleneck attention module," CoRR, vol. abs/1807.06514, 2018.
[26] N. Ballas, L. Yao, C. Pal, and A. C. Courville, "Delving deeper into convolutional networks for learning video representations," CoRR, vol. abs/1511.06432, 2015.
[27] X. Li, J. Wu, Z. Lin, H. Liu, and H. Zha, "Recurrent squeeze-and-excitation context aggregation net for single image deraining," CoRR, vol. abs/1807.05698, 2018.
[28] Z. Wang, P. Yi, K. Jiang, J. Jiang, Z. Han, T. Lu, and J. Ma, "Multi-memory convolutional neural network for video super-resolution," IEEE TIP, vol. 28, no. 5, pp. 2530–2544, May 2019.
[29] Y. Xu, L. Gao, K. Tian, S. Zhou, and H. Sun, "Non-local ConvLSTM for video compression artifact reduction," CoRR, vol. abs/1910.12286, 2019.
[30] M. Liu, M. Zhu, M. White, Y. Li, and D. Kalenichenko, "Looking fast and slow: Memory-guided mobile video object detection," CoRR, vol. abs/1903.10172, 2019.
[31] M. Siam, S. Valipour, M. Jägersand, and N. Ray, "Convolutional gated recurrent networks for video segmentation," CoRR, vol. abs/1611.05435, 2016.
[32] 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, 2018.
[33] T. Zhang, G. Qi, B. Xiao, and J. Wang, "Interleaved group convolutions for deep neural networks," CoRR, vol. abs/1707.02725, 2017.
[34] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Master's thesis, Department of Computer Science, University of Toronto, 2009.
[35] G. Lebanon and S. V. N. Vishwanathan, Eds., Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2015), San Diego, California, USA, May 9-12, 2015, ser. JMLR Workshop and Conference Proceedings, vol. 38. JMLR.org, 2015. [Online]. Available: http://jmlr.org/proceedings/papers/v38/
[36] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" CoRR, vol. abs/1411.1792, 2014.
