Bvit

This article has been accepted for inclusion in a future issue of this journal.
Content is final as presented, with the exception of pagination.
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1
BViT: Broad Attention-Based Vision Transformer

Nannan Li, Graduate Student Member, IEEE, Yaran Chen , Member, IEEE,
Weifan Li, Graduate Student Member, IEEE, Zixiang Ding , Member, IEEE,
Dongbin Zhao , Fellow, IEEE, and Shuai Nie
Abstract— Recent works have demonstrated that trans- vision [4], [7], [8], [9], [10], [11]. ViT usually divides the
former can achieve promising performance in computer vision, whole image into many fixed-size patches containing local
by exploiting the relationship among image patches with self- components and then exploits the relationship between them
attention. They only consider the attention in a single feature
layer, but ignore the complementarity of attention in different by self-attention, which can focus modeling capabilities on
layers. In this article, we propose broad attention to improve the information relevant to vision tasks, such as image clas-
the performance by incorporating the attention relationship of sification [1], [6], object detection [2], [12], and semantic
different layers for vision transformer (ViT), which is called BViT. segmentation [13].
The broad attention is implemented by broad connection and Many works have demonstrated that the self-attention mech-
parameter-free attention. Broad connection of each transformer
layer promotes the transmission and integration of information anism plays a critical role in the success of ViT. Even if
for BViT. Without introducing additional trainable parameters, they are far away, the relationship between local components
parameter-free attention jointly focuses on the already available in an image can be exploited by self-attention. Furthermore,
attention information in different layers for extracting useful multihead attention can focus on useful local information
information and building their relationship. Experiments on and relationship in images from different views. Many efforts
image classification tasks demonstrate that BViT delivers supe-
rior accuracy of 75.0%/81.6% top-1 accuracy on ImageNet with have been devoted to designing self-attention to improve the
5M/22M parameters. Moreover, we transfer BViT to downstream capabilities of focusing on useful information and suppressing
object recognition benchmarks to achieve 98.9% and 89.9% on redundant information. Swin Transformer [2] hierarchically
CIFAR10 and CIFAR100, respectively, that exceed ViT with performs self-attention on a shifted window of patches, which
fewer parameters. For the generalization test, the broad attention increases the connection between cross windows and has the
in Swin Transformer, T2T-ViT and LVT also brings an improve-
ment of more than 1%. To sum up, broad attention is promising flexibility of modeling on various scales. Tokens-to-Token
to promote the performance of attention-based models. Code and ViT [14] recursively assembles neighboring tokens into a token
pretrained models are available at https://github.com/DRL/BViT. to extract the structure information of the image by self-
Index Terms— Broad attention, broad connection, image clas- attention incrementally. CvT [15] introduces convolution to
sification, parameter-free attention, vision transformer (ViT). ViT to achieve additional attention modeling of local spatial
context. However, these efforts only consider the attention
performed on feature maps from one layer but ignore that
I. I NTRODUCTION
the combination of attention in different layers is helpful.
T RANSFORMER [5] has demonstrated impressive model-

ing capabilities and achieved state-of-the-art performance
in natural language processing tasks. Recently, vision trans-
Specifically, shallower layers focus on both local and global
information, while deeper layers tend to focus on global
information [6]. And the study on the difference between
former (ViT) [6] has been proposed and made a breakthrough, ViT and CNN [16] suggests that ViT performs poorly on
which is widely believed to be expected to break the domi- limited datasets (e.g., ImageNet) due to inadequate attention to
nance of convolutional neural network (CNN) in computer local features. Further, the centered kernel alignment (CKA)
similarity scores between shallower and deeper layers in ViT
Manuscript received 13 February 2022; revised 6 November 2022 and
13 January 2023; accepted 25 March 2023. This work was supported in are higher than that in CNN [16], which may exhibit that
part by the National Natural Science Foundation of China (NSFC) under the architecture is redundant. In a word, similar layers can
Grant 62173324 and Grant 62006223 and in part by the International be pruned with minimal impact on performance [17]. The
Partnership Program of the Chinese Academy of Sciences under Grant
104GJHZ2022013GC. (Corresponding author: Yaran Chen.) combination of attention from different layers is promising to
Nannan Li, Yaran Chen, Weifan Li, Zixiang Ding, and Dongbin Zhao alleviate the above problems, not only allowing the deep layers
are with the State Key Laboratory of Multimodal Artificial Intelligence to acquire local information but also making fuller exploitation
Systems, Institute of Automation, Chinese Academy of Sciences, Beijing
100190, China, and also with the School of Artificial Intelligence, Uni- of features.
versity of Chinese Academy of Sciences, Beijing 100049, China (e-mail: It is widely accepted that nonlinear processing and pathway
linannan2017@ia.ac.cn; chenyaran2013@ia.ac.cn; liweifan2018@ia.ac.cn; connections between different layers are the key to the success
dingzixiang2018@ia.ac.cn; dongbin.zhao@ia.ac.cn).
Shuai Nie is with the National Laboratory of Pattern Recognition, Institute of deep neural networks. The multilayer nonlinear operations
of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: provide the hierarchical feature extraction capability for the
shuai.nie@nlpr.ia.ac.cn). model. At the same time, the pathway connections between
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/TNNLS.2023.3264730. the different layers facilitate the transmission and integration
Digital Object Identifier 10.1109/TNNLS.2023.3264730 of information, which can be used to combine the attention
2162-237X © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Concordia University Library. Downloaded on October 10,2023 at 03:47:53 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
2 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
in different layers. Following we go back to deep CNN for

consolidation of the above discussion. ResNet [4] stacks more
convolution layers to achieve a very deep network. Further-
more, it designs residual connections that effectively improve
performance and develop the training process. DenseNet [18]
further increases the pathway connection, i.e., dense connec-
tion, which brings more improvement in performance. Broad
neural architecture search (BNAS) [19], a neural architecture
search method, also designs a broad connection, adding path-
way connections among the shallower layers and the last layer
for collecting features with different layers. The revolutionary
success of pathway connections on CNN proves its benefit
to learning effective features. Therefore, it is reasonable to
expect the combination and full exploitation of attention in
different layers to be implemented by increasing the path con-
nections. Concretely, combining attention in different layers
facilitates: 1) the transmission and fusion of attention, which
promotes attention to local information and 2) the extraction
Fig. 1. Parameters and accuracy on Imagenet of BViTs (i.e., BViT,
of practical attention information, which mitigates model BSwin, BLVT, and BT2T-ViT) compared to transformer-based, MLP-based
redundancy. and convolution-based models, such as DeiT [1], Swin [2], gMLP [3], and
In this article, we propose a broad attention mechanism ResNet [4].
that can efficiently extract and utilize the knowledge in each
transformer layer. In particular, we first integrate attention
information in different transformer layers via a broad con- In summary, our contributions are outlined below.
nection. Then we perform parameter-free attention on the 1) We propose a novel broad attention-based ViT. Broad
integrated features mentioned above for extracting helpful attention can extract effective features without any extra
contents and their structural relationship hierarchically. It is to trainable parameters. The proposed BViT attains supe-
be noted that no additional trainable parameters are required rior classification accuracy on ImageNet with about 2%
since that parameter-free attention is executed on information improvement compared to ViT and ResNet. Moreover,
already processed by self-attention in transformer layers. More the pretrained model can achieve comparable perfor-
significantly, broad attention is generic, providing attention- mance on downstream tasks, including CIFAR10 and
based models with the flexibility to introduce broad attention CIFAR100.
blocks for improved performance. 2) The visualization results demonstrate that broad atten-
Based on broad attention, we present Broad attention-based tion makes the model attend more locally, which facil-
ViT, called BViT. BViT consists of two components: 1) BViT itates the performance on the limited datasets. And
backbone, including multiple transformer layers, is to yield broad attention brings smaller feature similarity, which
deep features and 2) broad attention block derives broad prevents model redundancy.
features by increasing the path connection of attention in 3) Broad attention mechanism performs well in gen-
different layers and extracting helpful information hierarchi- eralizability, which can elevate the performance of
cally without extra learnable parameters. The integration of attention-based models flexibly. The results of extensive
the BViT backbone and the broad attention block results in a experiments show that implementations of excellent and
model with superior image understanding capabilities. prevail attention-based models present favorable perfor-
The proposed BViT presents powerful performance on mance improvement.
image classification benchmark tasks. The model result-
ing from our method exceeds transformer-based mod- II. R ELATED W ORK
els ViT/DeiT [1], [6], multilayer perceptron (MLP)-based
models Mixer/gMLP [3], [20] and CNN-based models A. Vision Transformer
ResNet/RegNetY [4], [21] with comparable parameters on Transformer [5] was initially devised for natural language
ImageNet [22] dataset. We also prove the transferability on processing and has achieved leading performance. The key
different transfer benchmark datasets (CIFAR10/100 [23]) point of the transformer is to fetch the global dependencies of
with pretraining on ImageNet. Further, to support that broad the input via a self-attention mechanism. Furthermore, in an
attention can be employed to attention-based models as a endeavor to explore more potential of transformers, recent
generic mechanism, we conduct experiments that apply broad research has been directed to the application of transformers
attention to Swin Transformer [2], T2T-ViT [14] and LVT [24]. in the field of computer vision, such as ViT [6], DeiT [1].
In particular, for both models, there is a positive 1% rise in ViT [6] first showed the strength of pure transformer archi-
the result on ImageNet. The comparison of parameters and tecture on image classification with large-scale image datasets
accuracy between our proposed methods and other typical (i.e., ImageNet-21K [22], JFT-300M [25]). ViT processes the
models is shown in Fig. 1. image input into a sequence paradigm by the following steps:
LI et al.: BViT: BROAD ATTENTION-BASED VISION TRANSFORMER 3
1) partitioning the image into fixed-size patches; 2) applying of comprehensive features. AlexNet [41] is regarded as the first
linear embedding to fixed-size patches; and 3) adding the truly deep CNN structure, which has made a breakthrough
position embedding. With an extra learnable “classification in large-scale image datasets. ResNet [4] added residual
token,” the sequence can be handled by a standard transformer connections between convolutional layers, which effectively
encoder. facilitates information transmission in very deep networks and
Based on the innovative ViT [6], several studies are ded- alleviates the gradient dispersion problem of optimization.
icated to improving the architecture design and enhancing DenseNet [18] further increased the pathway between layers
the performance of ViT. DeiT [1] performed data-efficient by means of dense connections, which significantly strength-
training and introduced a distillation token for more pow- ens the propagation and fusion of features and improves the
erful knowledge distillation. Swin Transformer [2] com- modeling ability of deep networks.
puted hierarchical feature representation via shifted window As one of the connection types for pathway connection,
and attained linear computational complexity relative to the broad connection can compensate for the lack of training
input resolution size. Tokens-to-Token ViT [14] proposed a efficiency and feature diversity in the deep network. Intro-
tokens-to-token process to aggregate surrounding tokens and ducing broad connection, Ding et al. proposed BNAS [19]
enhance local information gradually. Autoformer [26] searched and its extended works [42], [43]. BNAS [19] first presented
high-performed transformer-based models through one-shot and searched broad CNN (BCNN), and its search efficiency
architecture search. The search space consists of embedding is leading in reinforcement learning-based architecture search
dimension, MLP ratio, depth, and so on. In addition, some methods [44], [45] and evolutionary algorithm-based meth-
works combined convolution and transformer architecture to ods [46], [47]. Stacked BNAS [43] developed the connection
exploit the advantages of both them [15], [27], [28]. The paradigm of BCNN, achieving performance improvement. The
above-mentioned works have all achieved remarkable perfor- efficiency of BNAS confirms the advantages of the broad
mance, but most of them neglected the exploitation of attention connection paradigm, efficient training, and comprehensive
in different layers. features.
Broad connection helps extract rich information in different
B. Attention Mechanism layers, while deep representations are more effective than shal-
low ones. Therefore, our method introduces broad connection
The attention mechanism is first demonstrated to be helpful
without discarding deep features, which can obtain a wider
in computer vision. Google [29] performed attention mech-
variety of useful information with fewer sacrifices in model
anism on recurrent neural network for image classification.
efficiency.
Subsequently, the attention mechanism is introduced into nat-
ural language processing [5], [30]. Transformer [5] exhibited
the power of the attention mechanism on natural language pro- III. M ETHODOLOGY
cessing tasks and proposed the strong multihead self-attention As shown in Fig. 2, the proposed model BViT mainly con-
(MHSA), which is widely used [1], [6], [31]. Recently, atten- sists of two components: BViT backbone including multiple
tion mechanism has made a brilliant comeback in the image transformer layers (the gray boxes in Fig. 2) for deep feature
processing tasks, such as image classification [1], [2], [6], [32], and broad attention (the blue box in Fig. 2) for broad feature.
object detection [33], [34], semantic segmentation [35], [36]. BViT backbone obtains deep feature Outdeep by calculating
Due to the robustness of the attention mechanism, there are attention among patch features layer by layer. Then the atten-
many studies carried out on the attention mechanism. tion in all transformer layers is collected by the proposed broad
The innovation of the attention mechanism is ongoing. attention component with broad connection. Via parameter-
Some works presented multilevel attention [37], [38], [39]. free attention, we cannot only take full advantage of features
MLAN [37] jointly leveraged visual attention and in different layers jointly but also pay global attention on
semantic attention to process visual question answering. attention from each layer. Then the broad feature Outbroad is
MAM-RNN [39] includes frame-level attention layer and obtained.
region-level attention layer, which can jointly focus on the In Sections III-A and III-B, we introduce the BViT back-
notable regions in each frame and the frames correlated bone and the broad attention in detail.
with the target caption. Besides, HAN [40] made two
levels of attention mechanisms, i.e., word and sentence, A. BViT Backbone
respectively, for differentially attending to content with
Given a RGB image input I ∈ R H ×W ×C , we split it into non-
different importance during the construction of the document 2
overlapping and fixed-length patches Ip ∈ R N ×(P ×C) . P is
representation. Unlike the above-mentioned works, we are
the size of image patch, and N = H W/P 2 is the number
interested in the information of attention that already exists
of image patches. Our patch size P is set to 16. Then image
in different layers of the model. A broader view of attention
patches can be handled as a sequence, which can be processed
contributes to the acquisition of diverse features.
directly by the standard transformer. To obtain the input
x1 ∈ R N ×D of the first transformer layer, a linear projection
C. Pathway Connection of flattened patches is employed for satisfying the consistent
As the core component of deep learning, pathway connec- dimension D across all layers of the transformer. Similar to
tion facilitates the transmission of gradients and the availability ViT [6], we also introduce trainable position embedding for
Fig. 2. Overall architecture of BViT, which consists of BViT backbone (i.e., gray boxes) and broad attention (i.e., blue boxes). A patch feature input the
backbone of BViT which consists of several transformer layers, obtaining the feature Outdeep . The proposed broad attention mechanism extracts information
from each transformer layer by broad connection, obtaining the diverse feature Outbroad .
retention of location information and classification token for

image representation. With the classification token, the input
is represented as x1 ∈ R(N +1)×D .
After the necessary process of the image input, the data
flow of the BViT architecture is divided into two orientations,
i.e., BViT backbone and broad attention. As shown in Fig. 2,
BViT backbone (gray boxes) outputs the deep feature Outdeep
straightforwardly via several transformer layers. A transformer
layer includes two blocks: MHSA and MLP. Besides, both
MHSA and MLP blocks employ residual connections and
apply LayerNorm (LN) [48] before each block. Next, we detail
the computation processes of MHSA and MLP.
1) Multihead Self-Attention: MHSA attends to information
at different positions using multiheads. Given the input tensor
xi ∈ R(N +1)×D for ith layer, the output of MHSA block is
shown as
zî = xi + MHSA(xi ). (1)
2) Multilayer Perceptron: There are two fully connected
layers in an MLP block. And an activation function
(e,g. GELU [49]) is between them. The output z i in ith layer
is formulated as
Fig. 3. Illustration of broad attention block, including broad connection and
z i = zî + MLP(zî ). (2) parameter-free attention.
The output z i in ith layer is the input xi+1 for {i + 1}th

layer. And the output of deep feature is the output of the last 1) Broad Connection: The broad connection promotes the
transformer layer as transmission and integration of information flow by enhancing
Outdeep = zl . (3) the path connection of attention, detailed as follows. For better
understanding, we first give the equation of MHSA operation
B. Broad Attention with xi ∈ R(N +1)×D as input
As the critical point of BViT architecture, broad attention MHSA(xi ) = Atten(To_qkv(xi ))wo
block promotes the transmission and integration of information = Atten(qi , ki , vi )w o
from different layers using the broad connection, and focuses !
on helpful information hierarchically via parameter-free self- qi kiT
= softmax p vi wo
attention. Fig. 3 introduces its detailed mechanism. dq
qi , qi , . . . , qih , qi ∈ R(N +1)×dq

1 2 j
and qi = but also
P integrates attention weights of the different layers
ki = ki1 , ki2 , . . . , kih , ki ∈ R(N +1)×dk
j (i.e., li=1 qi kiT ). The concatenation of attention information
in different layers leads to the inconsistent dimension between
vi = vi1 , vi2 , . . . , vih , vi ∈ R(N +1)×dv (4)
j
Outdeep and output of Attenpf (Q, K , V ). We introduce a pool-
where To_qkv contains linear projection, chunk and rearrange, ing operator BPool for broad feature Outbroad to deal with
j j j
qi , ki and vi are thepquery, key and value of jth head in ith the above problem. BPool performs a 1-D adaptive average
layer, j ∈ [1, h]. (1/ dq ) is the scaling factor, and w o is the pooling on the inconsistent dimension of Attenpf (Q, K , V ).
weight matrix of the second linear projection in MHSA block. The output feature of broad attention can be expressed as
Then we, respectively, concatenate queries, keys, and values Outbroad = BPool(Attenpf (Q, K , V ), {d p }) (7)
of all MHSA blocks in different transformer layers as below
where d p is the dimension of deep feature Outdeep . Focusing
Q = [q1 , q2 , . . . , ql ], Q ∈ R(N +1)×h×(l×dq )
on attention information in different layers across the model,
K = [k1 , k2 , . . . , kl ], K ∈ R(N +1)×h×(l×dk ) parameter-free attention has a tremendous potential to extract
V = [v1 , v2 , . . . , vl ], V ∈ R(N +1)×h×(l×dv ) (5) the critical information and their structural relationship, which
contributes to the performance of the model on vision tasks
where qi , ki , and vi are query, key, and value of ith layer without additional model training burden.
in (4). Q, K , and V are concatenated queries, keys, and values With combination of deep feature output Outdeep in (3) and
accordingly. l is the number of transformer layers. broad feature output Outbroad in (7), we can derive final output
As can be seen in Fig. 3, the broad connection adequately Out is computed by
integrates the features of each transformer layer. Intuitively,
the connected Q, K , and V express the attention information Out = Outdeep + γ ∗ Outbroad (8)
in different layers across the model, while qi , ki , and vi of
each layer are only concerned with the attention information where γ is the coefficient factor, which can be used to adjust
of a single layer. Since the increase of the path connection the weight of two different types of feature.
of attention in different layers, Q, K , and V contain rich In a nutshell, BViT backbone uses multiple transformer
information and are more conducive to extracting helpful layers to process the image input into the more understand-
information. able feature for the model. Based on this, broad attention
In addition to the acquisition of information in different block enables the model to jointly attend to information from
layers, another advantage of broad connection is the enhanced different representation spaces at different transformer layers.
flow of information and gradients throughout the architecture, By extracting attention information in different layers rather
which facilitates its training. Each layer can access the gradient than a single layer, broad attention can obtain wealthier infor-
of the loss function directly, which helps to train a deeper mation, which allows us to pay attention to vital information
network architecture. and ignore redundant information. In the broad attention block,
2) Parameter-Free Attention: The parameter-free attention the broad connection is responsible for enhancing the path
handles the integrated information via self-attention to jointly connection of attention in different layers to facilitate the
focus on helpful information and extract their relationship. transmission and integration of information. Parameter-free
Without introducing linear projection, we directly pay atten- attention is responsible for attending to vital information from
tion to available Q, K , and V . Thus broad attention does different layers and extracting their relationship. Consequently,
not bring extra learnable parameters, only slightly increasing benefiting from the introduction of broad attention block, BViT
the computational complexity. The implementation details of can concentrate better on the significant information which
parameter-free attention Attenpf are as follows. improves the classification accuracy of the model. Moreover,
We perform self-attention on the concatenated queries Q, the inclusion of broad attention does not bring additional
keys K , and values V , as we do in (4) except for linear learnable parameters, which enables the convenient application
projection. The specific formula is given below on attention-based models.
QK T

Attenpf (Q, K , V ) = softmax √ V IV. E XPERIMENTS
d !
Pl
i=1 q i k T We conduct the following experiments with BViT on
= softmax √ i V ImageNet [22] image classification and downstream tasks
d
Pl ! (i.e., CIFAR10/100 [23]). We first give the datasets introduc-
T
i=1 qi ki tion and experimental setup. Next, we perform ablation studies
= softmax √ [v1 , v2 , . . . , vl ]
d to validate the importance of elements of broad attention block,
(6) including coefficient factor and concatenated value V . Then,
√ we compare the proposed BViT architecture with state-of-
where (1/ d) is the scaling factor, and d is the hidden the-art works and apply broad attention to several remarkable
dimension of transform layer. From (6), we can conclude ViT models, such as Swin Transformer [2], T2T-ViT [14] and
that parameter-free attention not only leverages the concate- LVT [24], to verify the generality of broad attention. Finally,
nated values (i.e., [v1 , v2 , . . . , vl ]) of the different layers, we analyze the visualization results of BViT in detail.
A. Setup TABLE I
A BLATIONS FOR D IFFERENT C OEFFICIENT FACTORS ON I MAGE N ET
To validate the performance of our proposed model, we use
the ImageNet dataset [22], which contains about 1.3M training
data and 50k validation data with various 1000 object classes.
Furthermore, we transfer pretrained BViT on ImageNet to
downstream datasets, such as CIFAR10/100 [23], which are
small-scale image classification datasets with 50k training data
and 10k test data.
1) Model Variants: To demonstrate the performance of
BViT on the image classification task, we construct two mod-
els with different sizes, including BViT-5M and BViT-22M.
The architecture specifications of BViT are shown as follows.
The number of heads and dimensions of the transformer layer
are 3 and 192 for BViT-5M, while 6 and 384 for BViT-22M.
Besides, several architectural parameters remain consistent
across the two models. For instance, depth and MLP ratio
are set to 12 and 4, respectively. BViT does not increase
the trainable parameters compared to vanilla ViT without
broad attention block. The increase in FLOPs due to broad
attention block is tiny (i.e., 10−5 G) and is therefore negligible.
The coefficient factor γ is simply set to 1 in the following
experiments.
2) Training Setting: Our training setting mostly follows
DeiT [1]. For all model variants, the input image resolution is
224 × 224. And we train models for 300 epochs, employing
Adamw [50] optimizer and using cosine decay learning rate
scheduler. Due to the limitation of computing resources, the
batch size of BViT-5M and BViT-22M are 1280 and 512,
respectively. Learning rate varies with batch size, same as
DeiT [1]. The weight decay of 0.05 is applied. The step of
warmup is set to 5000. Further, We employ a majority of
the augmentation and regularization strategies of DeiT [1] in Fig. 4. Architectural details of the four models in ablation study. (a) Vanilla
training, such as RandAugment [51], Mixup [52], Cutmix [53], ViT, (b) BViT with concatenated values V, (c) BViT without concatenated
values V, and (d) BViT.
Random Erasing [54], Stochastic Depth [18] and Exponential
Moving Average [55], except for Repeated Augmentation [56]
which cannot deliver significant performance boosts. 2) With V Versus Without V : As mentioned above, the
3) Fine-Tuning Setting: Our fine-tuning setting mostly fol- power of BViT stems from its broad attention block which
lows ViT [6]. Using SGD optimizer with a momentum of 0.9, jointly pays attention to the information in different layers.
we pretrain our models at resolution 224 × 224. Then we The helpful attention information in different layers consists of
fine-tune each model with a batch size of 512. The training two components, i.e., concatenated values V and aggregated
steps for CIFAR10/100 are 17 000. attention weights Q K T . In order to discuss their respective
effectiveness, we conduct an ablation study with four different
architectures as shown in Fig. 4. To discuss the different
B. Ablation Study components of the information, the four architectures differ
1) Coefficient Factor: The coefficient factor adjusts the mainly in the integration of information in broad connec-
weight of the broad and deep features as shown in (8). tion. The four architectures are: 1) Vanilla ViT as shown in
To discuss the effectiveness of different coefficient factors, Fig. 4(a), without broad attention, it only utilizes the attention
we conduct the ablation study with different coefficient factors in the last transformer layer, i.e., DeiT [1]; 2) BViTw.V as
on ImageNet. Concretely, we choose 0.2, 0.4, 0.6, 0.8, and 1 as shown in Fig. 4(b), with concatenated values V , it performs
coefficient factor candidates. broad attention using concatenated values V in all transformer
As shown in Table I, all coefficient factor candidates result layers and attention weight ql klT of last transformer layer;
in an improvement of over 2%, which confirms the pow- 3) BViTw/o.V as shown in Fig. 4(c), without concatenated
erful effectiveness of broad attention. The greatest increase values V , it delivers broad attention via vl in last transformer
(i.e., 3.1%) is achieved with the coefficient factor of 0.6. layer and broadly connected attention weights Q K T ; and
However, considering the slight variation in the results due 4) BViT as shown in Fig. 4(d), it focuses on both concatenated
to the randomness of the training, we deem it acceptable to values V and aggregated attention weights Q K T .
choose any coefficient factor candidates. Thus we simply set Ablation experiments on ImageNet [22] are reported in
the coefficient factor to 1 for the rest of the experiments. Table II. As a comparison baseline without broad attention
TABLE II BViT achieves innovation in the ViT. The innovation can

A BLATIONS ON C OMPONENTS FOR B ROAD ATTENTION ON I MAGE N ET. contribute to other developed ViT architectures.
A LL T HREE A RCHITECTURES W ITH B ROAD C ONNECTION P ROVIDE
S IGNIFICANT P ERFORMANCE I MPROVEMENT
D. Downstream Tasks on CIFAR10/100
With the purpose of investigating the transferability of
BViT, We evaluate the BViT-22M on downstream datasets
via transfer learning. With BViT-22M being pretrained on
ImageNet [22], Table IV exhibits the comparison results
on CIFAR10/100 that includes our BViT and other brilliant
networks such as ViT [6] and ResMLP [61]. ResMLP is
selected as the comparison model because it is fine-tuned
block, DeiT-Ti [1] has the same architectural settings for the
at the resolution of 224 × 224, while most models choose
transformer layers with our BViT-5M. As shown in Table II,
384 × 384.
it is apparent that all three architectures with broad connec-
From the results in Table IV, it is clear that compared
tion yield significant improvement. The heaviest improvement
to both transformer-based and MLP-based models, BViT
is in BViT which implements both concatenated values V
achieves better classification accuracy on CIFAR10/100 with
and aggregated attention weights Q K T . Besides, aggregated
fewer parameters. In general, our novel BViT maintains sound
attention weights Q K T deliver a slightly greater performance
performance on downstream tasks, thus confirming its strength
improvement than concatenated values V . Nevertheless, all
in the field of computer vision.
three methods with broad connection can deliver an accuracy
improvement of more than 2%. Thus the choice of practical
application can be carried out according to the architectural E. Generalization Study
requirement.
The novel broad attention is generic due to its structural
design, which can be implemented to improve the performance
of attention-based models. In order to verify the generalization
C. Image Classification on ImageNet
of broad attention, we introduce broad attention to three
Table III presents the performance comparison to var- excellent models, such as T2T-ViT [14], LVT [24] and Swin
ious types of architectures on ImageNet [22], including Transformer [2], and derive the BT2T-ViT, BLVT, and BSwin.
transformer-based models, CNN-based models, MLP-based The specific experimental settings are consistent among all
models, and hybrid models. The double line divides the models. We apply broad attention to attention-based models
experimental results into two blocks according to the number and fine-tune the model with pretrained weights. In fine-
of parameters. The single line groups the models based on tuning, we train the model for 30 epochs with a batch size
whether the method type contains CNN. In addition, the of 1024, a constant learning rate of 10−5 and a weight decay
performance of other prevailing models with broad attention of 10−8 . In the following, we will give experimental results.
(i.e., BT2T-ViT, BLVT, and BSwin) is included in Table III. 1) BT2T-ViT: The overview of T2T-ViT consists of token-
In particular, BT2T-ViT, BLVT, and BSwin are T2T-ViT, LVT, to-token module and T2T-ViT backbone. We apply broad
and Swin with broad attention, respectively. attention to T2T-ViT backbone. Since the dimension of each
Among manually designed transformer-based models, our T2T Transformer layer in T2T-ViT backbone is consistent,
BViT-5M achieves outstanding performance, outperforming the specific implementation of broad attention in BT2T-ViT is
VIT [6], DeiT [1] with approximately 2% improvement, and exactly identical to BViT.
even Swin Transformer [2] which leads in various vision tasks. Table V presents the performance comparison between T2T-
We also exceed gMLP [3] by about 2%, which delivers top ViT-7 [14] and BT2T-ViT-7 on ImageNet [22]. Benefiting from
results in the MLP-based models. Further, BLVT and BSwin the exploitation of attention in different layers brought by
deliver state-of-the-art performance among transformer-based broad attention, BT2T-ViT-7 exceeds original T2T-ViT-7 1.7%
models. without extra parameters.
ViT is the newly promoted visual architecture, whose 2) BLVT: LVT [24] consists of two novel self-attention
development is much less time-honored than CNN. The per- blocks: convolutional self-attention (CSA) and recursive atrous
formance of transformer-based models in the visual field is self-attention (RASA). We apply broad attention to RASA.
still slightly inferior to that of CNN. Nevertheless, our BViT Besides, considering the inconsistent dimension of the trans-
outperforms the partial CNN-based model, such as the classic former block: 1) we only connect the last block of each
and industry-known ResNet [4], RegNetY [21]. Specifically, stage instead of all LVT Transformer blocks and 2) we apply
our BViT-22M surpasses ResNet-50 with a 2.5% raise using maximum pooling and reshaping to obtain the consistent
fewer parameters, which considerably boosts the ImageNet dimension of RASA in different stages.
classification task. Table V exhibits the comparison results on ImageNet [22]
To sum up, the robust performance of BViT proves that that includes LVT [62] and BLVT. Experimental results show
broad attention block is effective for capturing key features. that broad attention brings a 1.2% performance improvement
By focusing on the attention information in different layers, compared to the original LVT.
TABLE III
P ERFORMANCE C OMPARISON W ITH S TATE - OF - THE -A RT M ODELS ON I MAGE N ET, I NCLUDING T RANSFORMER -BASED M ODELS , CNN-BASED M ODELS ,
MLP-BASED M ODELS , AND H YBRID M ODELS . W E G ROUP M ODELS BASED ON THE S IZE OF T HEIR PARAMETERS AND M ETHOD T YPE .
T HE P ROPOSED BV I T-5M O UTPERFORMS A LL THE M ANUAL -D ESIGNED V I T M ETHODS W ITHOUT B ROAD ATTENTION AT L ESS 2.6%
W ITH A BOUT 5M PARAMETERS . I N T HESE M ODELS W ITH A BOUT 22M PARAMETERS , THE P ROPOSED BV I T-22M H AS A LSO
O UTPERFORMED THE C LASSICAL V I T M ETHOD V I T-S [6] A BOUT 3%, E VEN E XCEEDS P REVALENT V I T M ETHOD
S WIN -S [2]. BLVT AND BS WIN E VEN D ELIVER S TATE - OF - THE -A RT P ERFORMANCE
A MONG T RANSFORMER -BASED M ODELS
Fig. 5. Representation similarity comparison via CKA. (a) DeiT [1] Fig. 6. Mean attention head distance by six attention heads. (a) DeiT [1]
(i.e., architecture without broad attention) and (b) BViT. We present CKA (i.e., architecture without broad attention) and (b) BViT. The horizontal
similarities between all pairs of transformer layers for our BViT and DeiT. coordinate indicates the heads of attention and the vertical coordinate indicates
Both horizontal and vertical coordinates indicate the number of architectural the mean attention distance. Different lines indicate attention blocks of
layers, and the color indicates the scores of similarity. The lighter the color, different layers. Following ViT [6], we randomly sample 128 images from
the higher the scores of similarity. We randomly sample 1000 images from ImageNet [22] dataset and calculate the mean distance between pixels with
ImageNet [22] dataset to compute CKA similarity scores. The heatmaps attention weights.
illustrate that BViT has smaller similarity scores between shallower and deeper
layers than DeiT without broad attention.
Table V exhibits the comparison results on ImageNet [22]

that includes Swin-T/S [2] and BSwin-T/S. It is clear that
3) BSwin: Swin Transformer [2] is a hierarchical Trans- compared to the original Swin-T/S, BSwin-T/S achieves
former, which models features with different scales in dif- better classification accuracy on ImageNet without extra
ferent stages. We broadly connect the outputs of shifted parameters.
window-based self-attention in the last block of each stage 4) Summary: To sum up, the above experimental results
to extract attention information in different layers. Similar to fully demonstrate that our broad attention is generic and can
BLVT, there are also two differences from BViT, which are flexibly improve the performance of attention-based models.
induced by the different scales of features in the hierarchical In a word, broad attention can be introduced to attention-based
Swin Transformer. models as a generic mechanism. Furthermore, the outstanding
TABLE IV
E VALUATION OF T RANSFER L EARNING ON D OWNSTREAM DATASETS . W E T RANSFER P RETRAINED BV I T-22M ON I MAGE N ET TO CIFAR10/100.
BV I T-22M TAKES 224 × 224 I MAGES D URING T RAINING AND F INE -T UNING , AND THE
ACCURACY OF BV I T E XCEEDS V I T W ITH H IGHER R ESOLUTION
Fig. 7. Comparison of attention maps between DeiT (i.e., architecture without broad attention) and our BViT. The attention maps of BViT focus more on
the object to be classified than DeiT and that is positive for image recognition.
TABLE V 1) CKA Similarity: CKA [16] enables a quantitative mea-

P ERFORMANCE C OMPARISON B ETWEEN E XCELLENT ATTENTION -BASED sure of representation similarity within models. Researchers
M ODELS AND M ODELS W ITH B ROAD ATTENTION ON I MAGE N ET.
B ENEFIT F ROM B ROAD ATTENTION , A LL M ODELS S HOW
often employ it to explore the differences in representation
S IGNIFICANT P ERFORMANCE G AINS W ITHOUT learned by visual models, especially when studying the differ-
E XTRA PARAMETERS ence between ViT and CNN [16]. Moreover, Nguyen et al. [17]
state that similar features may mean redundancy in the model.
Concretely, given the representation X and Y of two layers
as inputs, we can derive Gram matrices K = X X T and
L = Y Y T . Then CKA can be computed as
HSIC(K , L)
CKA(K , L) = √ (9)
HSIC(K , K )HSIC(L , L)
where HSIC is the Hilbert-Schmidt Independence Crite-
rion [63]. The representation similarity comparison between
BViT and DeiT [1] (i.e., architecture without broad attention)
performance of BT2T-ViT, BLVT, and BSwin also proves that
is shown in Fig. 5. It can be seen that the CKA similar-
broad attention is effective for leveraging significant features.
ity scores are smaller between shallower and deeper layers
Paying attention to information from multiple layers, the
in BViT [see Fig. 5(b)] than in DeiT, which means less
broad attention helps to understand image representation for
model redundancy. Therefore, the design of broad attention
classification.
helps extract and utilize features effectively, leading to better
performance.
F. Visualization 2) Mean Attention Distance: Mean Attention Distance is
To further investigate the impact of broad attention on first proposed in ViT [6]. The computation of mean attention
representation, we conduct three visualization experiments, distance requires averaging the distance between the query
including CKA similarity, mean attention distance, and atten- pixel and other pixels, that is weighted by the attention weight.
tion map. CKA similarity is to discuss the influence of broad The research about the difference between ViT and CNN [16]
attention on the similarity of features in different layers. Mean demonstrates experimentally that the model with more local
attention distance can analyze the size of the attention area attention performs better on limited datasets. As shown in
for BViT. The attention map demonstrates the representation Fig. 6, in order to figure out what influences broad attention
of BViT from the output token to the input image. brings to attention distance, we plot the mean attention head
distance of the third, fourth, ninth, and tenth blocks by sorted R EFERENCES
heads. The results show that our BViT [see Fig. 6(b)] attends
[1] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and
more local information at the shallow layers. For example, H. Jégou, “Training data-efficient image transformers & distilla-
in the third block’s first head, the mean attention distance of tion through attention,” in Proc. Int. Conf. Mach. Learn., 2021,
DeiT is about 40 while BViT is about 20 (blue line). As stated pp. 10347–10357.
[2] Z. Liu et al., “Swin Transformer: Hierarchical vision transformer using
in previous research [16], more attention to local features may shifted windows,” 2021, arXiv:2103.14030.
facilitate the learning of the model on the limited datasets. [3] H. Liu, Z. Dai, D. R. So, and Q. V. Le, “Pay attention to MLPs,” 2021,
Thus our model achieves better classification accuracy on arXiv:2105.08050.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
ImageNet [22]. image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
3) Attention Maps: To illustrate the significance of the (CVPR), Jun. 2016, pp. 770–778.
broad attention block, we use Attention Rollout [64] to com- [5] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf.
Process. Syst., 2017, pp. 5998–6008.
pute attention maps of transformer layers. Attention Rollout [6] A. Dosovitskiy et al., “An image is worth 16 × 16 words: Transformers
averages the attention weights of the model across all heads for image recognition at scale,” in Proc. 9th Int. Conf. Learn. Represent.
and then recursively multiplies the weight matrices of different (ICLR), May 2021, pp. 1–22.
[7] Y. Lu, Y. Chen, D. Zhao, B. Liu, Z. Lai, and J. Chen, “CNN-G: Con-
transformer layers. volutional neural network combined with graph for image segmentation
As shown in Fig. 7, we visualize the attention maps of with theoretical analysis,” IEEE Trans. Cognit. Develop. Syst., vol. 13,
BViT and DeiT [1] (i.e., architecture without broad attention). no. 3, pp. 631–644, Sep. 2021.
[8] D. Mellouli, T. M. Hamdani, J. J. Sanchez-Medina, M. B. Ayed, and
Visualization results show that utilization of the attention in A. M. Alimi, “Morphological convolutional neural network architecture
different layers facilitates the spotting of the critical object. for digit recognition,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30,
The attention maps of BViT pay more attention to the object no. 9, pp. 2876–2885, Sep. 2019.
to be recognized than DeiT. The phenomenon provides an [9] X. Liu, B. Hu, Q. Chen, X. Wu, and J. You, “Stroke sequence-
dependent deep convolutional neural network for online handwritten
intuitive argument for the design of our broad attention mecha- Chinese character recognition,” IEEE Trans. Neural Netw. Learn. Syst.,
nism, which reasonably improves the understanding of images vol. 31, no. 11, pp. 4637–4648, Nov. 2020.
by enhancing the exploitation of features. [10] J. Li et al., “ABCP: Automatic block-wise and channel-wise network
pruning via joint search,” IEEE Trans. Cognit. Develop. Syst., early
In summary, benefiting from the design of broad attention, access, Dec. 20, 2022, doi: 10.1109/TCDS.2022.3230858.
our BViT can: 1) achieve effective features by preventing [11] N. Li, Y. Pan, Y. Chen, Z. Ding, D. Zhao, and Z. Xu, “Heuristic rank
model redundancy; 2) deliver excellent performance on lim- selection with progressively searching tensor ring network,” Complex
Intell. Syst., vol. 8, no. 2, pp. 771–785, Apr. 2022.
ited datasets (e.g., ImageNet) for more local attention; and [12] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
3) achieve better image understanding. S. Zagoruyko, “End-to-end object detection with transformers,” in
Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020,
pp. 213–229.
V. C ONCLUSION [13] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Trans-
former for semantic segmentation,” 2021, arXiv:2105.05633.
This article proposes the Broad attention-based ViT, called [14] L. Yuan et al., “Tokens-to-Token ViT: Training vision transformers from
BViT. As the key element of BViT, broad attention con- scratch on ImageNet,” 2021, arXiv:2101.11986.
[15] H. Wu et al., “CvT: Introducing convolutions to vision transformers,”
sists of broad connection and parameter-free attention. Broad 2021, arXiv:2103.15808.
connection integrates attention information in different lay- [16] M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovitskiy,
ers. Then parameter-free attention extracts effective features “Do vision transformers see like convolutional neural networks?” in
Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 12116–12128.
from the above-integrated information and constructs their [17] T. Nguyen, M. Raghu, and S. Kornblith, “Do wide and deep networks
relationships. Furthermore, due to the novel broad attention learn the same things? Uncovering how neural network representations
block being directed at the existing attention, the proposed vary with width and depth,” 2020, arXiv:2010.15327.
[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep
broad attention is generic to improve the performance of networks with stochastic depth,” in Proc. Eur. Conf. Comput. Vis. Cham,
attention-based models. Consequently, BViT achieves leading Switzerland: Springer, 2016, pp. 646–661.
performance on vision tasks benefiting from rich and valuable [19] Z. Ding, Y. Chen, N. Li, D. Zhao, Z. Sun, and C. L. Philip Chen, “BNAS:
Efficient neural architecture search using broad scalable architecture,”
information. On ImageNet, BViT arrives at state-of-the-art IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 9, pp. 5004–5018,
performance among transformer-based models with about 3% Sep. 2021.
boost to groundbreaking ViT. Then we transfer BViT-22M [20] I. Tolstikhin et al., “MLP-Mixer: An all-MLP architecture for vision,”
2021, arXiv:2105.01601.
to downstream tasks (CIFAR10/100) that prove the robust [21] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollar,
transferability of the model. Moreover, the implementation of “Designing network design spaces,” in Proc. IEEE/CVF Conf. Comput.
broad attention on T2T-ViT, LVT, and Swin Transformer also Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10428–10436.
improves accuracy by more than 1%, confirming the flexibility [22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:
A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput.
and effectiveness of our method. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
As a key component of BViT, the broad attention block [23] A. Krizhevsky et al., “Learning multiple layers of features from tiny
significantly improves the performance of the ViT on image images,” Univ. Toronto, Toronto, ON, Canada, Tech. Rep. TR-2009,
2009.
classification. We expect to inspect its employment in natural [24] C. Yang et al., “Lite vision transformer with enhanced self-attention,”
language processing tasks. Furthermore, we will explore the in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR),
impact of different connection combinations of transformer Jun. 2022, pp. 11998–12008.
[25] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, “Revisiting unreasonable
layer outputs on performance via neural architecture search effectiveness of data in deep learning era,” in Proc. IEEE Int. Conf.
algorithm. Comput. Vis. (ICCV), Oct. 2017, pp. 843–852.
[26] M. Chen, H. Peng, J. Fu, and H. Ling, “AutoFormer: Searching trans- [51] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “Randaugment: Practical
formers for visual recognition,” in Proc. IEEE/CVF Int. Conf. Comput. automated data augmentation with a reduced search space,” in Proc.
Vis. (ICCV), Oct. 2021, pp. 12270–12280. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW),
[27] C. Li et al., “BossNAS: Exploring hybrid CNN-transformers Jun. 2020, pp. 702–703.
with block-wisely self-supervised neural architecture search,” 2021, [52] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “Mixup: Beyond
arXiv:2103.12424. empirical risk minimization,” 2017, arXiv:1710.09412.
[28] J. Liu, H. Li, G. Song, X. Huang, and Y. Liu, “UniNet: Unified [53] S. Yun, D. Han, S. Chun, S. J. Oh, Y. Yoo, and J. Choe, “CutMix:
architecture search with convolution, transformer, and MLP,” 2021, Regularization strategy to train strong classifiers with localizable fea-
arXiv:2110.04035. tures,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019,
[29] V. Mnih et al., “Recurrent models of visual attention,” in Proc. Adv. pp. 6023–6032.
Neural Inf. Process. Syst., vol. 27, pp. 1–9, 2014. [54] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing
[30] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by data augmentation,” in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 7,
jointly learning to align and translate,” in Proc. 3rd Int. Conf. Learn. 2020, pp. 13001–13008.
Represent. (ICLR), 2015, pp. 1–15. [55] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approxima-
[31] X. Zhao, Y. Chen, J. Guo, and D. Zhao, “A spatial–temporal attention tion by averaging,” SIAM J. Control Optim., vol. 30, no. 4, pp. 838–855,
model for human trajectory prediction,” IEEE/CAA J. Autom. Sinica, Jul. 1992.
vol. 7, no. 4, pp. 965–974, Jul. 2020. [56] E. Hoffer, T. Ben-Nun, I. Hubara, N. Giladi, T. Hoefler, and D. Soudry,
[32] D. Zhao, Y. Chen, and L. Lv, “Deep reinforcement learning with visual “Augment your batch: Improving generalization through instance repeti-
attention for vehicle classification,” IEEE Trans. Cogn. Develop. Syst., tion,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR),
vol. 9, no. 4, pp. 356–367, Dec. 2017. Jun. 2020, pp. 8129–8138.
[33] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural [57] A. Howard et al., “Searching for MobileNetV3,” in Proc. IEEE/CVF Int.
networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1314–1324.
Jun. 2018, pp. 7794–7803. [58] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for con-
[34] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, “GCNet: Non-local networks volutional neural networks,” in Proc. Int. Conf. Mach. Learn., 2019,
meet squeeze-excitation networks and beyond,” in Proc. IEEE/CVF Int. pp. 6105–6114.
Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1–10. [59] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking
[35] J. Fu et al., “Dual attention network for scene segmentation,” in Proc. the inception architecture for computer vision,” in Proc. IEEE Conf.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
pp. 3146–3154. [60] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and
[36] D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, and Q. Sun, “Fea- A. Vaswani, “Bottleneck transformers for visual recognition,” in Proc.
ture pyramid transformer,” in Proc. Eur. Conf. Comput. Vis. Cham, IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021,
Switzerland: Springer, 2020, pp. 323–339. pp. 16519–16529.
[37] D. Yu, J. Fu, T. Mei, and Y. Rui, “Multi-level attention networks for [61] H. Touvron et al., “ResMLP: Feedforward networks for image classifi-
visual question answering,” in Proc. IEEE Conf. Comput. Vis. Pattern cation with data-efficient training,” 2021, arXiv:2105.03404.
Recognit. (CVPR), Jul. 2017, pp. 4709–4717. [62] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, “Transformer
[38] H. Ma, W. Li, X. Zhang, S. Gao, and S. Lu, “AttnSense: Multi- in transformer,” in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021,
level attention mechanism for multimodal human activity recog- pp. 15908–15919.
nition,” in Proc. 28th Int. Joint Conf. Artif. Intell., Aug. 2019, [63] A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola,
pp. 3109–3115. “A kernel statistical test of independence,” in Proc. Adv. Neural Inf.
[39] X. Li, B. Zhao, and X. Lu, “MAM-RNN: Multi-level attention model Process. Syst., vol. 20, 2007, pp. 1–8.
based RNN for video captioning,” in Proc. 26th Int. Joint Conf. Artif. [64] S. Abnar and W. Zuidema, “Quantifying attention flow in transformers,”
Intell., Aug. 2017, pp. 2208–2214. 2020, arXiv:2005.00928.
[40] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical
attention networks for document classification,” in Proc. Conf. North
Amer. Chapter Assoc. Comput. Linguistics, Human Lang. Technol., 2016,
pp. 1480–1489.
[41] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Nannan Li (Graduate Student Member, IEEE)
Process. Syst. (NIPS), vol. 25, Dec. 2012, pp. 1097–1105. received the B.S. degree from the School of Automa-
[42] Z. Ding, Y. Chen, N. Li, and D. Zhao, “BNAS-v2: Memory-efficient tion Engineering, University of Electronic Science
and performance-collapse-prevented broad neural architecture search,” and Technology of China, Chengdu, Sichuan, China,
IEEE Trans. Syst., Man, Cybern. Syst., vol. 52, no. 10, pp. 6259–6272, in 2017. She is currently pursuing the Ph.D. degree
Oct. 2022. in control theory and control engineering with the
[43] Z. Ding, Y. Chen, N. Li, D. Zhao, and C. L. Philip Chen, “Stacked State Key Laboratory of Multimodal Artificial Intel-
BNAS: Rethinking broad convolutional neural network for neural archi- ligence Systems, Institute of Automation, Chinese
tecture search,” 2021, arXiv:2111.07722. Academy of Sciences, Beijing, China.
[44] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean, “Efficient neural Her research interests include computer vision and
architecture search via parameters sharing,” in Proc. Int. Conf. Mach. neural architecture search.
Learn., 2018, pp. 4095–4104.
[45] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable
architectures for scalable image recognition,” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8697–8710.
[46] Y. Chen, R. Gao, F. Liu, and D. Zhao, “ModuleNet: Knowledge-
inherited neural architecture search,” IEEE Trans. Cybern., vol. 52, Yaran Chen (Member, IEEE) received the Ph.D.
no. 11, pp. 11661–11671, Nov. 2021. degree from the Institute of Automation, Chinese
[47] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Completely automated CNN Academy of Sciences, Beijing, China, in 2018.
architecture design based on blocks,” IEEE Trans. Neural Netw. Learn. She is currently an Associate Professor with the
Syst., vol. 31, no. 4, pp. 1242–1254, Apr. 2020. State Key Laboratory of Multimodal Artificial Intel-
[48] J. Lei Ba, J. Ryan Kiros, and G. E. Hinton, “Layer normalization,” 2016, ligence Systems, Institute of Automation, Chinese
arXiv:1607.06450. Academy of Sciences, and the College of Artificial
[49] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” Intelligence, University of Chinese Academy of Sci-
2016, arXiv:1606.08415. ences, Beijing. Her research interests include deep
[50] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” learning, neural architecture search, deep reinforce-
2014, arXiv:1412.6980. ment learning, and autonomous driving.
Weifan Li (Graduate Student Member, IEEE) Dongbin Zhao (Fellow, IEEE) received the Ph.D.
received the B.S. degree in materials science and degree from the Harbin Institute of Technology,
engineering from Chongqing University, Chongqing, Harbin, China, in 2000.
China, in 2015, and the M.S. degree in automation He is currently a Professor with the Institute of
from Fuzhou University, Fuzhou, Fujian, China, Automation, Chinese Academy of Sciences, Beijing,
in 2018. He is currently pursuing the Ph.D. degree in China, and the University of Chinese Academy of
control theory and control engineering with the State Sciences, Beijing. He has published six books and
Key Laboratory of Multimodal Artificial Intelligence more than 100 international journal articles. His
Systems, Institute of Automation, Chinese Academy current research interests are in the area of deep
of Sciences, Beijing, China. reinforcement learning, computational intelligence,
His current research interests include reinforce- autonomous driving, game artificial intelligence, and
ment learning, deep learning, and game AI. robotics.
Dr. Zhao was the Chair of Adaptive Dynamic Programming and
Reinforcement Learning Technical Committee from 2015 to 2016, Beijing
Chapter from 2017 to 2018, and Technical Activities Strategic Planning
Sub-Committee in 2019 of IEEE Computational Intelligence Society (CIS).
He is the Chair of Distinguished Lecture Program. He works as several
guest editors of renowned international journals. He is involved in organizing
many international conferences. He serves as an Associate Editor for IEEE
T RANSACTIONS ON N EURAL N ETWORKS AND L EARNING S YSTEMS, IEEE
T RANSACTIONS ON C YBERNETICS, and IEEE T RANSACTIONS ON A RTIFI -
CIAL I NTELLIGENCE .
Zixiang Ding (Member, IEEE) received the M.E. Shuai Nie received the Ph.D. degree in pat-
degree from the School of Information and Electrical tern recognition and intelligent systems from the
Engineering, Shandong Jianzhu University, Jinan, National Key Laboratory of Pattern Recognition,
Shandong, China, in 2018. He is currently pursuing Institute of Automation, Chinese Academy of Sci-
the Ph.D. degree in computer applications with the ences, Beijing, China, in 2018.
State Key Laboratory of Multimodal Artificial Intel- He is currently an Associate Professor with the
ligence Systems, Institute of Automation, Chinese National Key Laboratory of Pattern Recognition,
Academy of Sciences, Beijing, China. Institute of Automation, Chinese Academy of Sci-
His research interests include computer vision, ences. His research interests include speech recogni-
neural architecture search, and deep reinforcement tion, speech separation/enhancement, deep learning,
learning. and large language models.

Bvit

Uploaded by

Copyright:

Available Formats

Bvit

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Bvit

Uploaded by

Copyright:

Available Formats

This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1

BViT: Broad Attention-Based Vision Transformer

T RANSFORMER [5] has demonstrated impressive model-

2 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

in different layers. Following we go back to deep CNN for

LI et al.: BViT: BROAD ATTENTION-BASED VISION TRANSFORMER 3

4 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

retention of location information and classification token for

The output z i in ith layer is the input xi+1 for {i + 1}th

LI et al.: BViT: BROAD ATTENTION-BASED VISION TRANSFORMER 5

qi , qi , . . . , qih , qi ∈ R(N +1)×dq

6 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

LI et al.: BViT: BROAD ATTENTION-BASED VISION TRANSFORMER 7

TABLE II BViT achieves innovation in the ViT. The innovation can

8 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Table V exhibits the comparison results on ImageNet [22]

LI et al.: BViT: BROAD ATTENTION-BASED VISION TRANSFORMER 9

TABLE V 1) CKA Similarity: CKA [16] enables a quantitative mea-

10 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

LI et al.: BViT: BROAD ATTENTION-BASED VISION TRANSFORMER 11

12 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

You might also like