BViT
Abstract—Recent works have demonstrated that transformers can achieve promising performance in computer vision by exploiting the relationship among image patches with self-attention. They only consider the attention in a single feature layer, but ignore the complementarity of attention in different layers. In this article, we propose broad attention to improve the performance by incorporating the attention relationship of different layers for the vision transformer (ViT), which is called BViT. The broad attention is implemented by broad connection and parameter-free attention. Broad connection of each transformer layer promotes the transmission and integration of information for BViT. Without introducing additional trainable parameters, parameter-free attention jointly focuses on the already available attention information in different layers, extracting useful information and building their relationship. Experiments on image classification tasks demonstrate that BViT delivers superior top-1 accuracy of 75.0%/81.6% on ImageNet with 5M/22M parameters. Moreover, we transfer BViT to downstream object recognition benchmarks and achieve 98.9% and 89.9% on CIFAR10 and CIFAR100, respectively, which exceed ViT with fewer parameters. For the generalization test, the broad attention in Swin Transformer, T2T-ViT, and LVT also brings an improvement of more than 1%. To sum up, broad attention is promising to promote the performance of attention-based models. Code and pretrained models are available at https://github.com/DRL/BViT.

Index Terms—Broad attention, broad connection, image classification, parameter-free attention, vision transformer (ViT).

I. INTRODUCTION

… vision [4], [7], [8], [9], [10], [11]. ViT usually divides the whole image into many fixed-size patches containing local components and then exploits the relationship between them by self-attention, which can focus modeling capabilities on the information relevant to vision tasks, such as image classification [1], [6], object detection [2], [12], and semantic segmentation [13].

Many works have demonstrated that the self-attention mechanism plays a critical role in the success of ViT. Self-attention can exploit the relationship between local components in an image even if they are far apart. Furthermore, multihead attention can focus on useful local information and relationships in images from different views. Many efforts have been devoted to designing self-attention to improve the capability of focusing on useful information and suppressing redundant information. Swin Transformer [2] hierarchically performs self-attention on shifted windows of patches, which increases the connection between windows and has the flexibility of modeling on various scales. Tokens-to-Token ViT [14] recursively assembles neighboring tokens into a token to extract the structure information of the image by self-attention incrementally. CvT [15] introduces convolution to ViT to achieve additional attention modeling of the local spatial context. However, these efforts only consider the attention performed on feature maps from one layer but ignore that the combination of attention in different layers is helpful.
… 1) partitioning the image into fixed-size patches; 2) applying a linear embedding to the fixed-size patches; and 3) adding the position embedding. With an extra learnable "classification token," the sequence can be handled by a standard transformer encoder.

Based on the innovative ViT [6], several studies have been dedicated to improving the architecture design and enhancing the performance of ViT. DeiT [1] performed data-efficient training and introduced a distillation token for more powerful knowledge distillation. Swin Transformer [2] computed hierarchical feature representations via shifted windows and attained linear computational complexity relative to the input resolution. Tokens-to-Token ViT [14] proposed a tokens-to-token process to aggregate surrounding tokens and enhance local information gradually. AutoFormer [26] searched for high-performing transformer-based models through one-shot architecture search; its search space consists of embedding dimension, MLP ratio, depth, and so on. In addition, some works combined convolution and transformer architectures to exploit the advantages of both of them [15], [27], [28]. The above-mentioned works have all achieved remarkable performance, but most of them neglected the exploitation of attention in different layers.

B. Attention Mechanism

The attention mechanism was first demonstrated to be helpful in computer vision. Google [29] performed the attention mechanism on a recurrent neural network for image classification. Subsequently, the attention mechanism was introduced into natural language processing [5], [30]. Transformer [5] exhibited the power of the attention mechanism on natural language processing tasks and proposed the strong multihead self-attention (MHSA), which is widely used [1], [6], [31]. Recently, the attention mechanism has made a brilliant comeback in image processing tasks, such as image classification [1], [2], [6], [32], object detection [33], [34], and semantic segmentation [35], [36]. Due to the robustness of the attention mechanism, many studies have been carried out on it.

The innovation of the attention mechanism is ongoing. Some works presented multilevel attention [37], [38], [39]. MLAN [37] jointly leveraged visual attention and semantic attention to process visual question answering. MAM-RNN [39] includes a frame-level attention layer and a region-level attention layer, which can jointly focus on the notable regions in each frame and the frames correlated with the target caption. Besides, HAN [40] built two levels of attention mechanisms, i.e., word and sentence, respectively, for differentially attending to content with different importance during the construction of the document representation. Unlike the above-mentioned works, we are interested in the attention information that already exists in different layers of the model. A broader view of attention contributes to the acquisition of diverse features.

C. Pathway Connection

As the core component of deep learning, pathway connection facilitates the transmission of gradients and the availability of comprehensive features. AlexNet [41] is regarded as the first truly deep CNN structure, which made a breakthrough on large-scale image datasets. ResNet [4] added residual connections between convolutional layers, which effectively facilitates information transmission in very deep networks and alleviates the gradient dispersion problem of optimization. DenseNet [18] further increased the pathways between layers by means of dense connections, which significantly strengthens the propagation and fusion of features and improves the modeling ability of deep networks.

As one of the connection types for pathway connection, broad connection can compensate for the lack of training efficiency and feature diversity in deep networks. Introducing broad connection, Ding et al. proposed BNAS [19] and its extended works [42], [43]. BNAS [19] first presented and searched the broad CNN (BCNN), and its search efficiency leads reinforcement learning-based architecture search methods [44], [45] and evolutionary algorithm-based methods [46], [47]. Stacked BNAS [43] developed the connection paradigm of BCNN, achieving a performance improvement. The efficiency of BNAS confirms the advantages of the broad connection paradigm: efficient training and comprehensive features.

Broad connection helps extract rich information in different layers, while deep representations are more effective than shallow ones. Therefore, our method introduces broad connection without discarding deep features, which can obtain a wider variety of useful information with fewer sacrifices in model efficiency.

III. METHODOLOGY

As shown in Fig. 2, the proposed model BViT mainly consists of two components: the BViT backbone, including multiple transformer layers (the gray boxes in Fig. 2), for the deep feature, and broad attention (the blue box in Fig. 2) for the broad feature. The BViT backbone obtains the deep feature Out_deep by calculating attention among patch features layer by layer. Then the attention in all transformer layers is collected by the proposed broad attention component with broad connection. Via parameter-free attention, we can not only take full advantage of the features in different layers jointly but also pay global attention to the attention from each layer. Then the broad feature Out_broad is obtained.

In Sections III-A and III-B, we introduce the BViT backbone and the broad attention in detail.
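To make the composition above concrete, the following is a minimal PyTorch sketch of one plausible realization of broad attention over the per-layer attention information. The exact concatenation, normalization, and fusion used in the paper's equations are not reproduced in this extract, so the head-axis concatenation, the layer-averaging step, and the additive fusion with the coefficient factor γ are assumptions of this sketch rather than the released implementation.

```python
import torch

def broad_attention(layer_scores, layer_values, out_deep, gamma=1.0):
    """Parameter-free broad attention: a minimal sketch, not the paper's exact equations.

    layer_scores: list of L tensors, each [B, H, N, N] (scaled Q K^T per transformer layer)
    layer_values: list of L tensors, each [B, H, N, d] (values V per transformer layer)
    out_deep:     [B, N, H*d] output of the last transformer layer (the deep feature)
    """
    scores = torch.cat(layer_scores, dim=1)        # broad connection of QK^T -> [B, L*H, N, N]
    values = torch.cat(layer_values, dim=1)        # broad connection of V    -> [B, L*H, N, d]
    attn = scores.softmax(dim=-1)                  # parameter-free: no new trainable weights
    broad = attn @ values                          # [B, L*H, N, d]

    B, LH, N, d = broad.shape
    H = layer_scores[0].shape[1]
    # Fold the layers back by averaging so the broad feature matches the deep feature's width
    # (this averaging step is an assumption of the sketch).
    broad = broad.view(B, LH // H, H, N, d).mean(dim=1)        # [B, H, N, d]
    out_broad = broad.transpose(1, 2).reshape(B, N, H * d)     # [B, N, H*d]

    # Fuse broad and deep features; an additive form weighted by gamma is assumed here,
    # mirroring the coefficient factor of Eq. (8) discussed in the ablation study.
    return out_deep + gamma * out_broad
```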
A. BViT Backbone

Given an RGB image input I ∈ R^{H×W×C}, we split it into non-overlapping, fixed-size patches I_p ∈ R^{N×(P^2·C)}. P is the size of an image patch, and N = HW/P^2 is the number of image patches. Our patch size P is set to 16. Then the image patches can be handled as a sequence, which can be processed directly by the standard transformer. To obtain the input x_1 ∈ R^{N×D} of the first transformer layer, a linear projection of the flattened patches is employed to satisfy the consistent dimension D across all layers of the transformer. Similar to ViT [6], we also introduce trainable position embedding for …
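The tokenization described above can be sketched as follows. This is a minimal sketch following standard ViT conventions; the module and argument names are illustrative rather than taken from the released code, and the class-token handling mirrors the "classification token" mentioned in Section II.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Minimal sketch of BViT's tokenization: split, flatten, project, add position embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=192):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2           # N = HW / P^2
        # Linear projection of flattened P x P x C patches, implemented as a strided conv.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))      # learnable classification token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # trainable position embedding

    def forward(self, x):                                          # x: [B, C, H, W]
        x = self.proj(x).flatten(2).transpose(1, 2)                # [B, N, D]
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)                             # prepend the classification token
        return x + self.pos_embed                                  # x_1 for the first transformer layer
```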
Fig. 2. Overall architecture of BViT, which consists of the BViT backbone (i.e., gray boxes) and broad attention (i.e., blue boxes). A patch feature is input to the backbone of BViT, which consists of several transformer layers, obtaining the deep feature Out_deep. The proposed broad attention mechanism extracts information from each transformer layer by broad connection, obtaining the diverse feature Out_broad.
A. Setup

To validate the performance of our proposed model, we use the ImageNet dataset [22], which contains about 1.3M training images and 50k validation images covering 1000 object classes. Furthermore, we transfer BViT pretrained on ImageNet to downstream datasets, such as CIFAR10/100 [23], which are small-scale image classification datasets with 50k training images and 10k test images.

1) Model Variants: To demonstrate the performance of BViT on the image classification task, we construct two models of different sizes, BViT-5M and BViT-22M. The architecture specifications of BViT are as follows. The number of heads and the dimension of the transformer layer are 3 and 192 for BViT-5M, and 6 and 384 for BViT-22M. Besides, several architectural parameters remain consistent across the two models; for instance, the depth and MLP ratio are set to 12 and 4, respectively. BViT does not increase the trainable parameters compared to a vanilla ViT without the broad attention block. The increase in FLOPs due to the broad attention block is tiny (i.e., 10^-5 G) and is therefore negligible. The coefficient factor γ is simply set to 1 in the following experiments.
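For concreteness, the two variants can be summarized as below. The dataclass and field names are illustrative and not taken from the released code.

```python
from dataclasses import dataclass

@dataclass
class BViTConfig:
    embed_dim: int        # token dimension D of every transformer layer
    num_heads: int        # attention heads per layer
    depth: int = 12       # number of transformer layers
    mlp_ratio: int = 4
    patch_size: int = 16
    gamma: float = 1.0    # coefficient factor weighting the broad feature

# The two model variants described above (about 5M and 22M parameters).
BVIT_5M = BViTConfig(embed_dim=192, num_heads=3)
BVIT_22M = BViTConfig(embed_dim=384, num_heads=6)
```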
2) Training Setting: Our training setting mostly follows DeiT [1]. For all model variants, the input image resolution is 224 × 224, and we train the models for 300 epochs, employing the AdamW [50] optimizer and a cosine-decay learning rate scheduler. Due to the limitation of computing resources, the batch sizes of BViT-5M and BViT-22M are 1280 and 512, respectively. The learning rate varies with the batch size, the same as in DeiT [1]. A weight decay of 0.05 is applied, and the number of warmup steps is set to 5000. Further, we employ most of the augmentation and regularization strategies of DeiT [1] in training, such as RandAugment [51], Mixup [52], CutMix [53], Random Erasing [54], Stochastic Depth [18], and Exponential Moving Average [55], except for Repeated Augmentation [56], which cannot deliver significant performance boosts.

3) Fine-Tuning Setting: Our fine-tuning setting mostly follows ViT [6]. Using the SGD optimizer with a momentum of 0.9, we pretrain our models at resolution 224 × 224 and then fine-tune each model with a batch size of 512. The number of training steps for CIFAR10/100 is 17 000.
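The pre-training recipe above can be wired up roughly as follows. This is a hedged sketch: the base learning rate of 5e-4 before batch-size scaling is an assumption borrowed from the DeiT recipe, not stated in this extract, and the warmup implementation is one of several equivalent choices. The fine-tuning stage described above would instead use SGD with momentum 0.9.

```python
import torch

def build_pretrain_optimizer(model, steps_per_epoch, epochs=300, batch_size=1280,
                             base_lr=5e-4, warmup_steps=5000, weight_decay=0.05):
    # DeiT-style linear scaling of the learning rate with batch size (base value is an assumption).
    lr = base_lr * batch_size / 512
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

    total_steps = epochs * steps_per_epoch
    # Linear warmup for the first 5000 steps, then cosine decay for the remainder.
    warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_steps)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)
    scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine],
                                                      milestones=[warmup_steps])
    return optimizer, scheduler
```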
B. Ablation Study

1) Coefficient Factor: The coefficient factor adjusts the weight of the broad and deep features, as shown in (8). To discuss the effectiveness of different coefficient factors, we conduct an ablation study with different coefficient factors on ImageNet. Concretely, we choose 0.2, 0.4, 0.6, 0.8, and 1 as coefficient factor candidates.

TABLE I: Ablations for different coefficient factors on ImageNet.

As shown in Table I, all coefficient factor candidates result in an improvement of over 2%, which confirms the powerful effectiveness of broad attention. The greatest increase (i.e., 3.1%) is achieved with a coefficient factor of 0.6. However, considering the slight variation in the results due to the randomness of training, we deem it acceptable to choose any of the coefficient factor candidates. Thus we simply set the coefficient factor to 1 for the rest of the experiments.

2) With V Versus Without V: As mentioned above, the power of BViT stems from its broad attention block, which jointly pays attention to the information in different layers. The helpful attention information in different layers consists of two components, i.e., the concatenated values V and the aggregated attention weights QK^T. In order to discuss their respective effectiveness, we conduct an ablation study with four different architectures, as shown in Fig. 4. To discuss the different components of the information, the four architectures differ mainly in the integration of information in the broad connection. The four architectures are: 1) vanilla ViT, as shown in Fig. 4(a): without broad attention, it only utilizes the attention in the last transformer layer, i.e., DeiT [1]; 2) BViT w.V, as shown in Fig. 4(b): with the concatenated values V, it performs broad attention using the concatenated values V of all transformer layers and the attention weight q_l k_l^T of the last transformer layer; 3) BViT w/o.V, as shown in Fig. 4(c): without the concatenated values V, it delivers broad attention via v_l of the last transformer layer and the broadly connected attention weights QK^T; and 4) BViT, as shown in Fig. 4(d): it focuses on both the concatenated values V and the aggregated attention weights QK^T.

Fig. 4. Architectural details of the four models in the ablation study. (a) Vanilla ViT, (b) BViT with concatenated values V, (c) BViT without concatenated values V, and (d) BViT.

Ablation experiments on ImageNet [22] are reported in Table II. As a comparison baseline without broad attention …
TABLE III: Performance comparison with state-of-the-art models on ImageNet, including transformer-based, CNN-based, MLP-based, and hybrid models. We group models based on their parameter size and method type. The proposed BViT-5M outperforms all the manually designed ViT methods without broad attention by at least 2.6% with about 5M parameters. Among the models with about 22M parameters, the proposed BViT-22M also outperforms the classical ViT method ViT-S [6] by about 3% and even exceeds the prevalent ViT method Swin-S [2]. BLVT and BSwin even deliver state-of-the-art performance among transformer-based models.
Fig. 5. Representation similarity comparison via CKA. (a) DeiT [1] (i.e., architecture without broad attention) and (b) BViT. We present CKA similarities between all pairs of transformer layers for our BViT and DeiT. Both horizontal and vertical coordinates indicate the index of the architectural layer, and the color indicates the similarity score: the lighter the color, the higher the score. We randomly sample 1000 images from the ImageNet [22] dataset to compute the CKA similarity scores. The heatmaps illustrate that BViT has smaller similarity scores between shallower and deeper layers than DeiT without broad attention.

Fig. 6. Mean attention head distance for six attention heads. (a) DeiT [1] (i.e., architecture without broad attention) and (b) BViT. The horizontal coordinate indicates the attention heads and the vertical coordinate indicates the mean attention distance. Different lines indicate attention blocks of different layers. Following ViT [6], we randomly sample 128 images from the ImageNet [22] dataset and calculate the mean distance between pixels weighted by the attention weights.
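The layer-pair similarity scores behind Fig. 5 can be reproduced with a CKA routine such as the one below. The caption does not specify which CKA variant is used, so the linear CKA formulation here is an assumption; the function would be evaluated over all pairs of layer activations gathered from the 1000 sampled images.

```python
import torch

def linear_cka(x, y):
    """Linear CKA between two sets of layer activations (a common formulation, assumed here).

    x: [n, d1], y: [n, d2] activations of the same n inputs at two different layers.
    """
    x = x - x.mean(dim=0, keepdim=True)   # center the features
    y = y - y.mean(dim=0, keepdim=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = (y.t() @ x).norm(p="fro") ** 2
    return (cross / ((x.t() @ x).norm(p="fro") * (y.t() @ y).norm(p="fro"))).item()
```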
TABLE IV: Evaluation of transfer learning on downstream datasets. We transfer BViT-22M pretrained on ImageNet to CIFAR10/100. BViT-22M takes 224 × 224 images during training and fine-tuning, and the accuracy of BViT exceeds that of ViT with higher resolution.
Fig. 7. Comparison of attention maps between DeiT (i.e., architecture without broad attention) and our BViT. The attention maps of BViT focus more on the object to be classified than those of DeiT, which is positive for image recognition.
… distance of the third, fourth, ninth, and tenth blocks by sorted heads. The results show that our BViT [see Fig. 6(b)] attends to more local information at the shallow layers. For example, in the third block's first head, the mean attention distance of DeiT is about 40 while that of BViT is about 20 (blue line). As stated in previous research [16], more attention to local features may facilitate the learning of the model on limited datasets. Thus our model achieves better classification accuracy on ImageNet [22].
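The mean attention distance referenced here and in Fig. 6 can be computed as sketched below: for each query patch, the attention probabilities weight the pixel distances to all key patches, and the result is averaged per head. The exact averaging over tokens and images is an assumption of this sketch.

```python
import torch

def mean_attention_distance(attn, grid_size, patch_size=16):
    """Attention-weighted mean spatial distance per head (ViT-style metric, details assumed).

    attn: [B, H, N, N] softmaxed attention over N = grid_size**2 patch tokens
          (any class token should be dropped beforehand).
    """
    coords = torch.stack(torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size),
                                        indexing="ij"), dim=-1).reshape(-1, 2).float()
    dist = torch.cdist(coords, coords) * patch_size     # [N, N] pixel distances between patch centers
    # Weight each query-key distance by its attention probability, then average over queries and images.
    return (attn * dist).sum(dim=-1).mean(dim=(0, 2))   # [H]: one mean distance per head
```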
3) Attention Maps: To illustrate the significance of the broad attention block, we use Attention Rollout [64] to compute the attention maps of the transformer layers. Attention Rollout averages the attention weights of the model across all heads and then recursively multiplies the weight matrices of the different transformer layers.
As shown in Fig. 7, we visualize the attention maps of BViT and DeiT [1] (i.e., the architecture without broad attention). The visualization results show that utilizing the attention in different layers facilitates the spotting of the critical object. The attention maps of BViT pay more attention to the object to be recognized than those of DeiT. This phenomenon provides an intuitive argument for the design of our broad attention mechanism, which reasonably improves the understanding of images by enhancing the exploitation of features.

In summary, benefiting from the design of broad attention, our BViT can: 1) achieve effective features by preventing model redundancy; 2) deliver excellent performance on limited datasets (e.g., ImageNet) thanks to more local attention; and 3) achieve better image understanding.
V. CONCLUSION

This article proposes the broad attention-based ViT, called BViT. As the key element of BViT, broad attention consists of broad connection and parameter-free attention. Broad connection integrates the attention information in different layers. Then parameter-free attention extracts effective features from the integrated information and constructs their relationships. Furthermore, because the novel broad attention block is directed at the existing attention, the proposed broad attention is generic and can improve the performance of attention-based models. Consequently, BViT achieves leading performance on vision tasks, benefiting from rich and valuable information. On ImageNet, BViT arrives at state-of-the-art performance among transformer-based models, with about a 3% boost over the groundbreaking ViT. We then transfer BViT-22M to downstream tasks (CIFAR10/100), which proves the robust transferability of the model. Moreover, the implementation of broad attention on T2T-ViT, LVT, and Swin Transformer also improves accuracy by more than 1%, confirming the flexibility and effectiveness of our method.

As a key component of BViT, the broad attention block significantly improves the performance of ViT on image classification. We expect to investigate its employment in natural language processing tasks. Furthermore, we will explore the impact of different connection combinations of transformer layer outputs on performance via neural architecture search algorithms.

REFERENCES

[1] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in Proc. Int. Conf. Mach. Learn., 2021, pp. 10347–10357.
[2] Z. Liu et al., "Swin Transformer: Hierarchical vision transformer using shifted windows," 2021, arXiv:2103.14030.
[3] H. Liu, Z. Dai, D. R. So, and Q. V. Le, "Pay attention to MLPs," 2021, arXiv:2105.08050.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[5] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[6] A. Dosovitskiy et al., "An image is worth 16 × 16 words: Transformers for image recognition at scale," in Proc. 9th Int. Conf. Learn. Represent. (ICLR), May 2021, pp. 1–22.
[7] Y. Lu, Y. Chen, D. Zhao, B. Liu, Z. Lai, and J. Chen, "CNN-G: Convolutional neural network combined with graph for image segmentation with theoretical analysis," IEEE Trans. Cognit. Develop. Syst., vol. 13, no. 3, pp. 631–644, Sep. 2021.
[8] D. Mellouli, T. M. Hamdani, J. J. Sanchez-Medina, M. B. Ayed, and A. M. Alimi, "Morphological convolutional neural network architecture for digit recognition," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 9, pp. 2876–2885, Sep. 2019.
[9] X. Liu, B. Hu, Q. Chen, X. Wu, and J. You, "Stroke sequence-dependent deep convolutional neural network for online handwritten Chinese character recognition," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 11, pp. 4637–4648, Nov. 2020.
[10] J. Li et al., "ABCP: Automatic block-wise and channel-wise network pruning via joint search," IEEE Trans. Cognit. Develop. Syst., early access, Dec. 20, 2022, doi: 10.1109/TCDS.2022.3230858.
[11] N. Li, Y. Pan, Y. Chen, Z. Ding, D. Zhao, and Z. Xu, "Heuristic rank selection with progressively searching tensor ring network," Complex Intell. Syst., vol. 8, no. 2, pp. 771–785, Apr. 2022.
[12] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 213–229.
[13] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, "Segmenter: Transformer for semantic segmentation," 2021, arXiv:2105.05633.
[14] L. Yuan et al., "Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet," 2021, arXiv:2101.11986.
[15] H. Wu et al., "CvT: Introducing convolutions to vision transformers," 2021, arXiv:2103.15808.
[16] M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovitskiy, "Do vision transformers see like convolutional neural networks?" in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 12116–12128.
[17] T. Nguyen, M. Raghu, and S. Kornblith, "Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth," 2020, arXiv:2010.15327.
[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, "Deep networks with stochastic depth," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 646–661.
[19] Z. Ding, Y. Chen, N. Li, D. Zhao, Z. Sun, and C. L. Philip Chen, "BNAS: Efficient neural architecture search using broad scalable architecture," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 9, pp. 5004–5018, Sep. 2021.
[20] I. Tolstikhin et al., "MLP-Mixer: An all-MLP architecture for vision," 2021, arXiv:2105.01601.
[21] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollar, "Designing network design spaces," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10428–10436.
[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[23] A. Krizhevsky et al., "Learning multiple layers of features from tiny images," Univ. Toronto, Toronto, ON, Canada, Tech. Rep. TR-2009, 2009.
[24] C. Yang et al., "Lite vision transformer with enhanced self-attention," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11998–12008.
[25] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, "Revisiting unreasonable effectiveness of data in deep learning era," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 843–852.
[26] M. Chen, H. Peng, J. Fu, and H. Ling, "AutoFormer: Searching transformers for visual recognition," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 12270–12280.
[27] C. Li et al., "BossNAS: Exploring hybrid CNN-transformers with block-wisely self-supervised neural architecture search," 2021, arXiv:2103.12424.
[28] J. Liu, H. Li, G. Song, X. Huang, and Y. Liu, "UniNet: Unified architecture search with convolution, transformer, and MLP," 2021, arXiv:2110.04035.
[29] V. Mnih et al., "Recurrent models of visual attention," in Proc. Adv. Neural Inf. Process. Syst., vol. 27, 2014, pp. 1–9.
[30] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–15.
[31] X. Zhao, Y. Chen, J. Guo, and D. Zhao, "A spatial–temporal attention model for human trajectory prediction," IEEE/CAA J. Autom. Sinica, vol. 7, no. 4, pp. 965–974, Jul. 2020.
[32] D. Zhao, Y. Chen, and L. Lv, "Deep reinforcement learning with visual attention for vehicle classification," IEEE Trans. Cogn. Develop. Syst., vol. 9, no. 4, pp. 356–367, Dec. 2017.
[33] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7794–7803.
[34] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, "GCNet: Non-local networks meet squeeze-excitation networks and beyond," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1–10.
[35] J. Fu et al., "Dual attention network for scene segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3146–3154.
[36] D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, and Q. Sun, "Feature pyramid transformer," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 323–339.
[37] D. Yu, J. Fu, T. Mei, and Y. Rui, "Multi-level attention networks for visual question answering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4709–4717.
[38] H. Ma, W. Li, X. Zhang, S. Gao, and S. Lu, "AttnSense: Multi-level attention mechanism for multimodal human activity recognition," in Proc. 28th Int. Joint Conf. Artif. Intell., Aug. 2019, pp. 3109–3115.
[39] X. Li, B. Zhao, and X. Lu, "MAM-RNN: Multi-level attention model based RNN for video captioning," in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 2208–2214.
[40] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Human Lang. Technol., 2016, pp. 1480–1489.
[41] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 25, Dec. 2012, pp. 1097–1105.
[42] Z. Ding, Y. Chen, N. Li, and D. Zhao, "BNAS-v2: Memory-efficient and performance-collapse-prevented broad neural architecture search," IEEE Trans. Syst., Man, Cybern. Syst., vol. 52, no. 10, pp. 6259–6272, Oct. 2022.
[43] Z. Ding, Y. Chen, N. Li, D. Zhao, and C. L. Philip Chen, "Stacked BNAS: Rethinking broad convolutional neural network for neural architecture search," 2021, arXiv:2111.07722.
[44] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean, "Efficient neural architecture search via parameters sharing," in Proc. Int. Conf. Mach. Learn., 2018, pp. 4095–4104.
[45] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8697–8710.
[46] Y. Chen, R. Gao, F. Liu, and D. Zhao, "ModuleNet: Knowledge-inherited neural architecture search," IEEE Trans. Cybern., vol. 52, no. 11, pp. 11661–11671, Nov. 2021.
[47] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, "Completely automated CNN architecture design based on blocks," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 4, pp. 1242–1254, Apr. 2020.
[48] J. Lei Ba, J. Ryan Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450.
[49] D. Hendrycks and K. Gimpel, "Gaussian error linear units (GELUs)," 2016, arXiv:1606.08415.
[50] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[51] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, "RandAugment: Practical automated data augmentation with a reduced search space," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 702–703.
[52] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "Mixup: Beyond empirical risk minimization," 2017, arXiv:1710.09412.
[53] S. Yun, D. Han, S. Chun, S. J. Oh, Y. Yoo, and J. Choe, "CutMix: Regularization strategy to train strong classifiers with localizable features," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 6023–6032.
[54] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, "Random erasing data augmentation," in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 7, 2020, pp. 13001–13008.
[55] B. T. Polyak and A. B. Juditsky, "Acceleration of stochastic approximation by averaging," SIAM J. Control Optim., vol. 30, no. 4, pp. 838–855, Jul. 1992.
[56] E. Hoffer, T. Ben-Nun, I. Hubara, N. Giladi, T. Hoefler, and D. Soudry, "Augment your batch: Improving generalization through instance repetition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8129–8138.
[57] A. Howard et al., "Searching for MobileNetV3," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1314–1324.
[58] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. Int. Conf. Mach. Learn., 2019, pp. 6105–6114.
[59] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[60] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, "Bottleneck transformers for visual recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 16519–16529.
[61] H. Touvron et al., "ResMLP: Feedforward networks for image classification with data-efficient training," 2021, arXiv:2105.03404.
[62] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, "Transformer in transformer," in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 15908–15919.
[63] A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola, "A kernel statistical test of independence," in Proc. Adv. Neural Inf. Process. Syst., vol. 20, 2007, pp. 1–8.
[64] S. Abnar and W. Zuidema, "Quantifying attention flow in transformers," 2020, arXiv:2005.00928.

Nannan Li (Graduate Student Member, IEEE) received the B.S. degree from the School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China, in 2017. She is currently pursuing the Ph.D. degree in control theory and control engineering with the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
Her research interests include computer vision and neural architecture search.

Yaran Chen (Member, IEEE) received the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2018.
She is currently an Associate Professor with the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, and the College of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing. Her research interests include deep learning, neural architecture search, deep reinforcement learning, and autonomous driving.
Weifan Li (Graduate Student Member, IEEE) received the B.S. degree in materials science and engineering from Chongqing University, Chongqing, China, in 2015, and the M.S. degree in automation from Fuzhou University, Fuzhou, Fujian, China, in 2018. He is currently pursuing the Ph.D. degree in control theory and control engineering with the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
His current research interests include reinforcement learning, deep learning, and game AI.

Zixiang Ding (Member, IEEE) received the M.E. degree from the School of Information and Electrical Engineering, Shandong Jianzhu University, Jinan, Shandong, China, in 2018. He is currently pursuing the Ph.D. degree in computer applications with the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
His research interests include computer vision, neural architecture search, and deep reinforcement learning.

Dongbin Zhao (Fellow, IEEE) received the Ph.D. degree from the Harbin Institute of Technology, Harbin, China, in 2000.
He is currently a Professor with the Institute of Automation, Chinese Academy of Sciences, Beijing, China, and the University of Chinese Academy of Sciences, Beijing. He has published six books and more than 100 international journal articles. His current research interests are in the area of deep reinforcement learning, computational intelligence, autonomous driving, game artificial intelligence, and robotics.
Dr. Zhao was the Chair of the Adaptive Dynamic Programming and Reinforcement Learning Technical Committee from 2015 to 2016, the Beijing Chapter from 2017 to 2018, and the Technical Activities Strategic Planning Sub-Committee in 2019 of the IEEE Computational Intelligence Society (CIS). He is the Chair of the Distinguished Lecture Program. He serves as a guest editor for several renowned international journals and is involved in organizing many international conferences. He serves as an Associate Editor for the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, IEEE TRANSACTIONS ON CYBERNETICS, and IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE.

Shuai Nie received the Ph.D. degree in pattern recognition and intelligent systems from the National Key Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2018.
He is currently an Associate Professor with the National Key Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His research interests include speech recognition, speech separation/enhancement, deep learning, and large language models.