Article
MVT: Multi-Vision Transformer for Event-Based Small
Target Detection
Shilong Jing 1,2 , Hengyi Lv 1, *, Yuchen Zhao 1 , Hailong Liu 1 and Ming Sun 1
1 Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences,
Changchun 130033, China; jingshilong22@mails.ucas.ac.cn (S.J.); zhaoyuchen@ciomp.ac.cn (Y.Z.);
liuhailong@ciomp.ac.cn (H.L.); sunming@ciomp.ac.cn (M.S.);
2 University of Chinese Academy of Sciences, Beijing 100049, China
* Correspondence: lvhengyi@ciomp.ac.cn
Abstract: Object detection in remote sensing plays a crucial role in various ground identification
tasks. However, small targets contain limited feature information and are easily buried by complex
backgrounds, especially in extreme environments (e.g., low-light and motion-blur scenes). Meanwhile,
event cameras offer a unique paradigm for object detection, with high temporal resolution and a wide
dynamic range. These advantages allow event cameras, which are not limited by light intensity, to
perform better than traditional cameras in challenging conditions. In this work, we introduce the
Multi-Vision Transformer (MVT),
which comprises three efficiently designed components: the downsampling module, the Channel
Spatial Attention (CSA) module, and the Global Spatial Attention (GSA) module. This architecture
simultaneously considers short-term and long-term dependencies in semantic information, resulting
in improved performance for small object detection. Additionally, we propose Cross Deformable
Attention (CDA), which progressively fuses high-level and low-level features instead of considering
all scales at each layer, thereby reducing the computational complexity of multi-scale features.
Furthermore, to address the scarcity of event camera remote sensing datasets, we provide the Event
Object Detection (EOD) dataset, which is the first dataset that includes various extreme scenarios
specifically introduced for remote sensing using event cameras. Moreover, we conducted experiments
on the EOD dataset and two typical unmanned aerial vehicle remote sensing datasets (VisDrone2019
and UAVDT Dataset). The comprehensive results demonstrate that the proposed MVT-Net achieves
a promising and competitive performance.

Keywords: event cameras; multi-scale fusion; remote sensing; small target detection
Figure 1. The process by which the DVS generates events. Each pixel serves as an independent detection
unit for changes in brightness. An event is generated when the logarithmic intensity change at a pixel
exceeds a specified threshold V_th. The continuous generation of events forms an event stream, which
consists of two types of polarity: when the light intensity changes from strong to weak and reaches
the threshold, the DVS outputs a negative event (red arrow); when the light intensity changes from weak
to strong and reaches the threshold, the DVS outputs a positive event (blue arrow).
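To make the thresholding mechanism described in Figure 1 concrete, the following sketch simulates per-pixel event generation from sampled log-intensity frames. It is only an illustrative, frame-based approximation (a real DVS works asynchronously and per pixel in continuous time), and the function and parameter names (generate_events, v_th) are our own, not from any DVS driver.

# Illustrative sketch: per-pixel ON/OFF event generation from log-intensity samples,
# following the thresholding rule of Figure 1 (frame-based approximation, assumed names).
import numpy as np

def generate_events(log_frames, timestamps, v_th=0.2):
    """Emit (x, y, t, polarity) events whenever the log-intensity change at a pixel
    since its last event exceeds the threshold v_th."""
    ref = log_frames[0].copy()            # per-pixel reference log intensity
    events = []
    for frame, t in zip(log_frames[1:], timestamps[1:]):
        diff = frame - ref
        on = diff >= v_th                 # weak -> strong: positive (ON) event
        off = diff <= -v_th               # strong -> weak: negative (OFF) event
        for polarity, mask in ((1, on), (-1, off)):
            ys, xs = np.nonzero(mask)
            events.extend((int(x), int(y), t, polarity) for x, y in zip(xs, ys))
        ref[on | off] = frame[on | off]   # reset the reference where events fired
    return events

# Example: a single pixel brightens by 0.5 between two 4 x 4 log-intensity frames
frames = [np.zeros((4, 4)), np.zeros((4, 4))]
frames[1][1, 2] = 0.5
print(generate_events(frames, timestamps=[0.0, 1.0]))  # [(2, 1, 1.0, 1)]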
Utilizing drones equipped with event cameras for object detection or tracking is an
innovative approach that holds great potential for a wide range of applications including
satellite imaging, transportation, and early warning systems. However, due to the scarcity
of remote sensing datasets based on event cameras, we present the first event-based
remote sensing dataset, named the Event Object Detection Dataset (EOD Dataset), which
utilizes a DAVIS346 event camera mounted on an unmanned aerial vehicle (UAV) to capture
various scenes. Furthermore, in practical processing, a high flying altitude results in ground
targets occupying only a small portion of the image output, which poses challenges for
object detection. Recently, advanced approaches for enhancing the detection performance
of small targets often apply Feature Pyramid Networks (FPN) to concatenate multi-scale
features. However, these methods have a significant limitation: they are unable to differentiate
between distinct feature layers. Deformable DETR [3] addresses this problem by introducing
Scale-Level Embedding to differentiate the positional encoding of different features at the
same location. Therefore, we draw inspi-
ration from this embedding operation to concatenate multi-scale features, with the aim
of enhancing the detection performance of small targets. Moreover, solely considering
multi-scale features undoubtedly incurs significant computational and memory overhead,
making convergence more challenging. For instance, in the Transformer Encoder of Deformable
DETR, the model must extract features at all scales; even though deformable attention
reduces the computational complexity, this remains redundant.
In this work, we propose Cross-Deformable-Attention (CDA) to further enhance
the performance of the model while significantly reducing its computational complexity.
Specifically, by applying CDA between low-level and high-level features, we continuously
propagate the fused information from lower layers to higher layers. In addition to reduc-
ing computational complexity, CDA can also reduce model training time and improve
inference speed. What is more, we propose an efficient feature extraction model called
Multi-Vision Transformer (MVT), which consists of three modules: Downsampling Module,
Channel Spatial Attention Module (CSA), and Global Spatial Attention Module (GSA).
Firstly, the downsampling module employs a simple overlapped convolution for scale
reduction, resulting in better performance compared to non-overlapped convolution and
patch merging operations. Then, we apply CSA for attention querying between spatial
and channel dimensions. Compared to the original SE Block, CSA applies adaptive max
pooling operations to preserve more high-frequency information. Finally, we employ GSA
including Window-Attention and Grid-Attention for local and global search. Compared to
Swin-Attention, which requires more computational resources and complex offset vectors,
Grid-Attention is structurally similar to Window-Attention but only needs local attention over a
sparse grid to extend coverage to the entire domain, achieving higher performance with fewer parameters.
Additionally, we also provide three model variants (MVT-B, MVT-S, MVT-T) by setting dif-
ferent embedding dimensions and output scales. Employing MVT-B trained for 36 epochs,
we achieve 28.7% mAP@0.5:0.95, outperforming all current state-of-the-art methods on the
EOD dataset. With the application of multiple efficient attention modules that consider
multi-scale features, the detection performance is improved especially for small objects,
achieving 16.6% AP_S. In addition, given the scarcity of remote sensing datasets based on event
cameras, we select the VisDrone2019 dataset [4] and the UAVDT dataset [5], which are similar
to our own dataset and consist of images captured by camera-equipped drones. In this
case, MVT-B trained for 36 epochs achieves 31.7% mAP@0.5:0.95
and 24.3% AP_S on the VisDrone2019 Dataset, as well as 28.2% mAP@0.5:0.95 and 23.7%
AP_S on the UAVDT Dataset.
Our contributions can be summarized as follows:
1. We propose the first remote sensing dataset based on event cameras, the Event Object
Detection Dataset (EOD Dataset), which consists of over 5000 event streams and covers
six object categories: car, bus, pedestrian, two-wheel, boat, and ship.
2. We propose a novel multi-scale extraction network named the Multi-Vision Transformer
(MVT), which consists of three efficient modules designed by us: the downsampling
module, the Channel Spatial Attention (CSA) module, and the Global Spatial Attention
(GSA) module. Overall, the MVT incorporates these efficient modules to achieve a
substantial reduction in computational complexity with high performance.
3. Considering that extracting information at all scales consumes massive computing
resources, we propose a novel cross-scale attention mechanism that progressively
fuses high-level features with low-level features, enabling the incorporation of low-
level information. The Cross-Deformable-Attention (CDA) reduces the computational
complexity of the Transformer Encoder and entire network by approximately 82%
and 45% while preserving the original performance.
4. As a multi-scale object detection network, MVT achieves state-of-the-art performance
when trained from scratch for 36 epochs without fine-tuning, reaching 28.7% mAP@0.5:0.95
and 16.6% AP_S on the EOD Dataset, 31.7% mAP@0.5:0.95 and 24.3% AP_S on the
VisDrone2019 Dataset, and 28.2% mAP@0.5:0.95 and 23.7% AP_S on the UAVDT Dataset.
2. Related Work
2.1. Multi-Scale Feature Learning
Convolutional neural networks extract features of objects through hierarchical ab-
stractions, and an important concept in this process is the receptive field. Higher-level
feature maps have larger receptive fields, which make them strong in representing se-
mantic information, while they have lower spatial resolution and lack detailed spatial
geometric features. On the other hand, lower-level feature maps have smaller receptive
fields, which makes them strong in representing geometric details with higher resolution,
but they exhibit weaker semantic information representation. For remote sensing object
detection, the accuracy of small target recognition greatly affects the performance of the
network. Therefore, multi-scale feature representation is a commonly used approach in
small target detection [6,7].
The concept of the Feature Pyramid Networks (FPN) [8] is initially introduced for
multi-scale object detection. However, the computation-intensive nature of the FPN sig-
nificantly influences the detection speed. For this reason, various improvement methods
have been developed. Centralized Feature Pyramid (CFP) [9] focuses on optimizing the
representation of features within the same level, particularly in the corners of the im-
age. Path Aggregation Network (PANet) [10] extends the FPN with a bottom-up path to
capture deeper-level features using shallow-level features. Additionally, the U-Net, origi-
nally designed for segmentation tasks, has also demonstrated outstanding performance in
object detection [11–13].
In addition, there are methods that specifically utilize low-scale features for small
target detection. Unlike approaches that recover high-resolution representation from low-
resolution ones, the High-Resolution Network (HRNet) [6] maintains high-resolution rep-
resentation during forward propagation. Lite-High-Resolution Network (Lite-HRNet) [14]
can rapidly estimate feature points, thereby reducing the computational complexity of
the model. Feature-Selection High-Resolution network (FSHRNet) [15] adopts HRNet
as the backbone and introduces a Feature Selection Convolution (FSConv) layer to fuse
multi-resolution features, enabling adaptive feature selection based on object characteristics.
The Improved U-Net (IU-Net) [16] enhances the HRNetv2 [17] by incorporating the csAG
module, composed of spatial attention and channel attention, to improve model perfor-
mance. However, solely relying on low-scale features often leads to inferior performance,
and the FPN operation fails to distinguish between different feature levels.
Scale-Level Embedding [3] was proposed for multi-scale fusion. Its significant advantage is
that it encodes different feature levels, enabling the model to differentiate the same positional
information across levels, and it is widely applied in various types of models.
The Swin Transformer [25] introduces a window shift strategy to overcome the limitation of input resolution and
utilizes a window sliding mechanism with convolutional operations to enable interaction
between different windows, thus achieving global attention. Despite achieving remarkable
results in various tasks, the Swin Transformer still faces the redundancy of using offset
vectors. Furthermore, Multi-Axis Vision Transformer (MAXVIT) [26] proposes Multi-axis
Self-Attention (MaxSA), which decomposes the conventional self-attention mechanism
into two sparse forms: Window-Attention and Grid-Attention. This approach reduces the
quadratic complexity of traditional computation methods to linear complexity. Importantly,
it discards redundant window offset operations and instead employs a simpler form of
window attention and grid attention to consider both local and global information. Addi-
tionally, Deformable DETR [3] introduces Deformable-Attention, which can be summarized
as each feature pixel does not need to interact with all other feature pixels for computation.
Instead, it only needs to interact with a subset of other pixels obtained through sampling.
This mechanism significantly accelerates model convergence while reducing computational
complexity. The aforementioned studies discuss the capability of Transformer Attention
to model global information for accurate target localization. While these methods have
made improvements in terms of computational resources, they still encounter challenges
regarding the excessive computational complexity caused by remote sensing images. There-
fore, we propose a novel Cross-Deformable-Attention (CDA) structure to achieve a balance
between performance and computational cost.
In summary, this work combines multi-scale features and attention mechanisms to address
small target detection in the complex backgrounds of remote sensing.
3. Method
3.1. Overall Architecture
The proposed MVT Network is illustrated in Figure 2, which is composed of four
main components, namely Data Processing, MVT Backbone, Feature Fusion Module, and
Prediction Head.
Figure 2. Overview of the MVT framework, which contains five main components, including: (1) the
data preprocessing method of converting event streams into dense tensors; (2) the proposed MVT
Backbone used to extract multi-scale features; (3) the designed feature fusion module for encoding
and aggregating features at different scales; (4) the detection head that applies bipartite matching
strategy; (5) Each MVT Block, composed of three designed components.
The original chaotic event sequence cannot be directly used as an input tensor for
deep neural networks. Therefore, we encode the event stream in the form of voxel grid
representation [38], which has a channel along the temporal dimension generated by
a time partitioning function, as described in detail in Section 3.2. In this work, we do
not consider the correlation of the temporal order; thus, the processed event tensor has
a shape of X ∈ R^{H×W×1}. Different scale features of the event tensor are extracted by the
backbone, which utilizes CSA to attend to short-range dependencies and GSA to attend to
long-range dependencies, as specifically described in Section 3.3. Subsequently, the
multi-scale features with rich semantic information are fed into the Transformer Encoder,
where CDA is applied to fuse tokens at different levels, which is described in detail in
Section 3.4. Finally, regression calculations are performed on the 900 vectors generated by
the Feature Fusion Module to obtain the detection results.
where p ∈ {0, 1} is the event polarity and V_th is the threshold. The event camera will generate
an ordered set of events ε = {e_k}, with each event e_k = (E_x, E_y, E_p, E_t) ∈ R^4, according to
Equation (1). Afterwards, the polarity of each pixel within the same time window is aggregated
by performing bilinear voting, which requires the standardization of event timestamps as

E_{t\_norm} = T \frac{E_t - E_t(0)}{E_t(N) - E_t(0)}        (2)
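A minimal sketch of this event-to-tensor conversion is given below. It assumes a voxel-grid layout in the spirit of [38] together with the timestamp normalization of Equation (2); the function name events_to_voxel_grid and the num_bins parameter are illustrative and not taken from the authors' code. With num_bins = 1, the grid collapses to the single-channel H × W × 1 tensor used as the network input.

# Assumed implementation sketch: normalize timestamps per Equation (2) and accumulate
# polarities into temporal bins with bilinear voting (voxel-grid representation [38]).
import numpy as np

def events_to_voxel_grid(events, height, width, num_bins=1):
    """events: array of shape (N, 4) with columns (x, y, polarity, t), t ascending."""
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = np.where(events[:, 2] > 0, 1.0, -1.0)              # map polarity to +/- 1
    t = events[:, 3]
    # Equation (2): normalize timestamps to [0, T] with T = num_bins
    t_norm = num_bins * (t - t[0]) / max(t[-1] - t[0], 1e-9)
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t0 = np.clip(np.floor(t_norm).astype(int), 0, num_bins - 1)
    frac = t_norm - t0
    np.add.at(grid, (t0, y, x), p * (1.0 - frac))           # bilinear vote, left bin
    t1 = np.clip(t0 + 1, 0, num_bins - 1)
    np.add.at(grid, (t1, y, x), p * frac)                   # bilinear vote, right bin
    return grid                                             # shape (num_bins, H, W)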
In the MVT backbone, the first downsampling stage applies a 7 × 7 convolution kernel
with a stride of 4 to achieve fourfold downsampling, while the remaining layers apply a
3 × 3 convolution kernel with a stride of 2 for two-fold downsampling. Furthermore, we
demonstrate that the overlapping convolution outperforms non-overlapping convolutions
and patch merging operations in Section 4.3.
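A minimal sketch of this overlapped-convolution downsampling is shown below, assuming the kernel/stride settings of Table 1 (a 7 × 7 kernel with stride 4 for S1, a 3 × 3 kernel with stride 2 afterwards). The PyTorch layout, the BatchNorm choice, and the class name OverlapDownsample are our assumptions, not the authors' implementation.

# Sketch of overlapped-convolution downsampling (assumed layout): padding keeps
# neighbouring patches overlapping, unlike non-overlapping patchify or patch merging.
import torch
import torch.nn as nn

class OverlapDownsample(nn.Module):
    def __init__(self, in_ch, out_ch, kernel, stride):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel, stride, padding=kernel // 2)
        self.norm = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return self.norm(self.proj(x))

stem = OverlapDownsample(1, 96, kernel=7, stride=4)       # S1: 1/4 resolution
stage2 = OverlapDownsample(96, 192, kernel=3, stride=2)   # S2: 1/8 resolution
x = torch.randn(1, 1, 256, 256)                           # single-channel event tensor
print(stage2(stem(x)).shape)                              # torch.Size([1, 192, 32, 32])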
Figure 3. Architecture of CSA module, which consists of channel attention and spatial attention
module to extract short-term dependent attention.
The input feature map f^{l−1} is fed into CSA for feature extraction. Firstly, f^{l−1} undergoes
a 2D convolution (Conv2D) with a 1 × 1 kernel, resulting in f̂^{l−1}, which has the same
dimensions as f^{l−1}. Then, f̂^{l−1} is fed into the channel attention (L_CAttn) and spatial
attention (L_SAttn) modules, producing the intermediate feature map f̂^{l}, which is added to
f̂^{l−1} to obtain the output feature map f^{l+1}. The entire CSA computation process can be
represented by Equation (5).
\hat{f}^{l-1} = \mathrm{Conv2D}(f^{l-1})
\hat{f}^{l} = \mathcal{L}_{SAttn}(\mathcal{L}_{CAttn}(\hat{f}^{l-1}))        (5)
f^{l+1} = \hat{f}^{l-1} \oplus \hat{f}^{l}
The main components of CSA can be divided into Channel Attention and Spatial Attention.
Within the Channel Attention module, there are three branches: in the first branch, the input (F)
is fed into Channel Max Pooling (P^C_Max) and a 1D convolution (Conv1D), followed by a sigmoid
function (σ) whose output is multiplied element-wise with F; the second branch performs the same
operations with Channel Average Pooling (P^C_Avg); the third branch passes F through unchanged.
A concatenation function (Concat) is then employed to transform the three feature maps with
C × H × W dimensions into a single feature map with 3C × H × W dimensions. Finally, a 2D
convolution maps the channels back to C × H × W, yielding the feature map (F^C_Attn).
The entire Channel Attention computation process can be represented by Equation (6).
F_{max} = \mathrm{Conv1D}(P^{C}_{Max}(F))
\hat{F}_{max} = \sigma(F_{max}) \otimes F
F_{avg} = \mathrm{Conv1D}(P^{C}_{Avg}(F))        (6)
\hat{F}_{avg} = \sigma(F_{avg}) \otimes F
F^{C}_{Attn} = \mathrm{Conv2D}(\mathrm{Concat}[F, \hat{F}_{max}, \hat{F}_{avg}])
Within the Spatial Attention module, there are two branches: in the first branch, the input (F)
is fed into both Spatial Max Pooling (P^S_Max) and Spatial Average Pooling (P^S_Avg) to obtain
features F̂_max and F̂_avg, which are concatenated to form a tensor F̂ with 2 × H × W dimensions.
Next, the feature map F̂ undergoes a 2D convolution (Conv2D) followed by a sigmoid function (σ),
resulting in spatial attention weights F̂^S_Attn, which are multiplied element-wise with the
original input (F) from the second branch to produce the final feature map F^S_Attn. The entire
Spatial Attention computation process can be represented by Equation (7).
\hat{F} = \mathrm{Concat}[P^{S}_{Max}(F), P^{S}_{Avg}(F)]
\hat{F}^{S}_{Attn} = \sigma(\mathrm{Conv2D}(\hat{F}))        (7)
F^{S}_{Attn} = \hat{F}^{S}_{Attn} \otimes F
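The following sketch expresses Equations (5)–(7) in PyTorch-like code. Only the overall dataflow follows the equations; the 1D/2D kernel sizes, the use of adaptive pooling, and the class names are our assumptions rather than the paper's exact settings.

# Assumed sketch of the CSA module (Equations (5)-(7)); kernel sizes are illustrative.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):                        # Equation (6)
    def __init__(self, channels, k=3):
        super().__init__()
        self.conv1d_max = nn.Conv1d(1, 1, k, padding=k // 2)
        self.conv1d_avg = nn.Conv1d(1, 1, k, padding=k // 2)
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def _gate(self, pooled, conv1d):
        w = conv1d(pooled.flatten(2).transpose(1, 2))          # (B, 1, C): 1D conv over channels
        return torch.sigmoid(w.transpose(1, 2).unsqueeze(-1))  # (B, C, 1, 1)

    def forward(self, f):
        f_max = self._gate(nn.functional.adaptive_max_pool2d(f, 1), self.conv1d_max) * f
        f_avg = self._gate(nn.functional.adaptive_avg_pool2d(f, 1), self.conv1d_avg) * f
        return self.fuse(torch.cat([f, f_max, f_avg], dim=1))

class SpatialAttention(nn.Module):                         # Equation (7)
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, f):
        pooled = torch.cat([f.max(dim=1, keepdim=True).values,
                            f.mean(dim=1, keepdim=True)], dim=1)   # (B, 2, H, W)
        return torch.sigmoid(self.conv(pooled)) * f

class CSA(nn.Module):                                      # Equation (5)
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, 1)       # 1 x 1 Conv2D
        self.c_attn = ChannelAttention(channels)
        self.s_attn = SpatialAttention()

    def forward(self, f):
        f_hat = self.proj(f)
        return f_hat + self.s_attn(self.c_attn(f_hat))     # residual addition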
The CSA module improves the feature extraction performance for short-range regions
by incorporating attention mechanisms for both channels and spatial dimensions. However,
convolutional attention modules suffer from a loss of features for small objects due to their
limitations in long-range regions. Therefore, we propose GSA, considering global attention
to enhance the detection performance of small targets.
Figure 4. Architecture of GSA module, which consists of window attention and grid attention to
extract long-term-dependent attention.
The input feature map (f^{l−1}) is first processed by Layer Normalization (LN) and Window
Multi-head Self-Attention (W-MSA) to obtain the window attention feature map, which is then
added to the original input (f^{l−1}) through a residual path, resulting in the hidden feature
map (f̂^l). Subsequently, (f̂^l) is processed through Layer Normalization (LN) + Multilayer
Perceptron (MLP) together with a shortcut path to obtain the feature map (f^l). In addition,
(f^l) undergoes LN and Grid Multi-head Self-Attention (G-MSA) to obtain the global attention
feature map (f̂^{l+1}), which is further processed through LN and MLP to obtain the global
spatial feature map (f^{l+1}). The entire Global Spatial Attention computation process can be
represented by Equation (8).
\hat{f}^{l} = \mathrm{W\text{-}MSA}(\mathrm{LN}(f^{l-1})) + f^{l-1}
f^{l} = \mathrm{MLP}(\mathrm{LN}(\hat{f}^{l})) + \hat{f}^{l}
\hat{f}^{l+1} = \mathrm{G\text{-}MSA}(\mathrm{LN}(f^{l})) + f^{l}        (8)
f^{l+1} = \mathrm{MLP}(\mathrm{LN}(\hat{f}^{l+1})) + \hat{f}^{l+1}
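A minimal sketch of the GSA block of Equation (8) is given below, assuming MaxViT-style [26] window and grid partitions with a window size P that divides the feature map height and width. The use of plain multi-head attention (without relative position bias), the head count, the MLP ratio, and the class names are illustrative assumptions.

# Assumed sketch of the GSA block (Equation (8)): W-MSA over local p x p windows,
# G-MSA over a sparse p x p grid strided across the whole feature map.
import torch
import torch.nn as nn

def window_partition(x, p):            # (B, H, W, C) -> (B*nw, p*p, C), local windows
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, p * p, C)

def grid_partition(x, p):              # (B, H, W, C) -> (B*nw, p*p, C), strided grid
    B, H, W, C = x.shape
    x = x.view(B, p, H // p, p, W // p, C).permute(0, 2, 4, 1, 3, 5)
    return x.reshape(-1, p * p, C)

class GSABlock(nn.Module):
    def __init__(self, dim, heads=4, p=8):
        super().__init__()
        self.p = p
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))
        self.w_msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.g_msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(2))

    def _attend(self, x, norm, msa, partition):
        B, H, W, C = x.shape
        tokens = partition(norm(x), self.p)
        out, _ = msa(tokens, tokens, tokens)
        out = out.view(B, H // self.p, W // self.p, self.p, self.p, C)
        if partition is window_partition:                  # undo window partition
            out = out.permute(0, 1, 3, 2, 4, 5)
        else:                                              # undo grid partition
            out = out.permute(0, 3, 1, 4, 2, 5)
        return out.reshape(B, H, W, C)

    def forward(self, x):               # x: (B, H, W, C), H and W divisible by p
        x = x + self._attend(x, self.norms[0], self.w_msa, window_partition)   # W-MSA
        x = x + self.mlps[0](self.norms[1](x))
        x = x + self._attend(x, self.norms[2], self.g_msa, grid_partition)     # G-MSA
        x = x + self.mlps[1](self.norms[3](x))
        return x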
Figure 5. Overview of the Cross-scale Deformable Encoder layer. The three high-level features
are used as the basic tokens to fuse low-level features layer by layer using Cross-scale Deformable
Attention, finally building the architecture of the transformer encoder.
The encoder layer contains deformable self-attention and cross-scale attention. Considering
that the feature map size at the high levels is much smaller than at the low levels, only the
middle and final encoder layers need to apply cross-scale attention between the low and high
scales, instead of extracting all tokens, as shown in Figure 5. In this module, high-level
features F_H ∈ R^{N_H×d_model} serve as queries to extract features from the low-level features
F_L ∈ R^{N_L×d_model}. Each query feature is split into M heads, and each head samples K points
from each of the L feature scales as the query Q. Therefore, the total number of points sampled
for a query feature is N_p = 2 × M × L × K, Δp denotes the sampling offsets, and the
corresponding attention weights are directly predicted from the query features using two linear
projections denoted as W_p ∈ R^{d_model×N_p} and W_A ∈ R^{d_model×d_model}. Formally, we have
Q = \sum_{m=1}^{M} W_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} W'_m\, S\big(x^{l}, \phi(p^{l}) + \Delta p_{mlk}\big) \Big]        (9)

K = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} W_A \cdot W'_m\, S(x, p + \Delta p_{mk}) \Big]        (10)
where m indexes the attention head, p denotes the reference points of the query features, x
indexes the different scale features, and W_m ∈ R^{d_model×N_m} and W'_m ∈ R^{N_m×d_model} are
learnable weights (N_m = d_model/M by default). With the sampled offsets (Δp = F W_p), bilinear
interpolation is applied to compute the features S(x, p + Δp) at the sampled locations (p + Δp)
of the corresponding feature map x. As all the high-level features sample locations to query the
key consisting of low-level features, the model can quickly learn which sampled locations are
important for the given queries. Finally, we obtain the value (V = K W_V) with a parameter matrix
W_V ∈ R^{d_model×d_model}, and the cross-scale deformable attention can be formulated as
\mathrm{CDA}(Q, K, V) = \mathrm{Cat}\Big(F_L,\ \mathrm{Softmax}\Big(\frac{QK^{T}}{\sqrt{d_K}}\Big)V\Big)        (11)
Here, the Cat function concatenates the low-level features with the other multi-scale features,
and d_K is the key dimension of a head. Equation (11) indicates that stacking CDA yields more
reliable attention weights as features from different scales are updated layer by layer.
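The sketch below gives one possible reading of Equations (9)–(11): high-level tokens predict sampling offsets and attention weights, gather values from the low-level feature map by bilinear interpolation (grid_sample), and the aggregated result is concatenated with the low-level tokens as in Equation (11). It compresses the explicit construction of Q and K in Equations (9) and (10) into a single deformable-attention aggregation step, and it assumes reference points normalized to [0, 1] in (x, y) order with offsets in the same normalized units; it is an approximation for illustration, not the authors' implementation.

# Simplified, assumed sketch of cross-scale deformable attention (Equations (9)-(11)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossDeformableAttention(nn.Module):
    def __init__(self, d_model=256, heads=8, points=4):
        super().__init__()
        self.h, self.k, self.dh = heads, points, d_model // heads
        self.offsets = nn.Linear(d_model, heads * points * 2)    # sampling offsets Δp
        self.weights = nn.Linear(d_model, heads * points)        # attention weights
        self.value = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, q_high, ref_points, f_low, low_hw):
        """q_high: (B, Nh, d) high-level queries; ref_points: (B, Nh, 2), (x, y) in [0, 1];
        f_low: (B, Nl, d) low-level tokens on an Hl x Wl = low_hw grid."""
        B, Nh, d = q_high.shape
        Hl, Wl = low_hw
        v = self.value(f_low).view(B, Hl, Wl, self.h, self.dh)
        v = v.permute(0, 3, 4, 1, 2).reshape(B * self.h, self.dh, Hl, Wl)
        # sampling locations = reference point + predicted offset, mapped to [-1, 1]
        off = self.offsets(q_high).view(B, Nh, self.h, self.k, 2)
        loc = 2 * (ref_points[:, :, None, None, :] + off) - 1
        loc = loc.permute(0, 2, 1, 3, 4).reshape(B * self.h, Nh, self.k, 2)
        sampled = F.grid_sample(v, loc, align_corners=False)     # (B*h, dh, Nh, K)
        attn = self.weights(q_high).view(B, Nh, self.h, self.k).softmax(-1)
        attn = attn.permute(0, 2, 1, 3).reshape(B * self.h, 1, Nh, self.k)
        fused = (sampled * attn).sum(-1)                          # (B*h, dh, Nh)
        fused = fused.view(B, self.h, self.dh, Nh).permute(0, 3, 1, 2).reshape(B, Nh, d)
        # Equation (11): concatenate low-level tokens with the fused high-level tokens
        return torch.cat([f_low, self.out(fused)], dim=1)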
4. Experiments
In this section, we test the proposed method on the EOD, VisDrone [4], and UAVDT [5]
datasets, and the mean average precision (mAP) [39] is the main metric that we consider.
In addition, we perform ablation experiments to verify the effectiveness of each module.
Finally, the experimental results demonstrate the superiority of the proposed method.
4.1. Datasets
4.1.1. EOD Dataset
The EOD dataset consists of 5317 event streams captured in various scenes, where
each event stream is a collection of events within 33 ms. The dataset includes 3722 event
streams for training, 530 event streams for validation, and 1065 event streams for testing,
and contains six categories: car, bus, pedestrian, two-wheel, boat, and ship.
The mAP is computed from the area under the precision–recall (P-R) curve, plotted with the recall
(R) on the horizontal axis and precision (P) on the vertical axis. mAP@0.5 refers to the mAP at an
IOU (Intersection over Union) threshold of 0.5, and mAP@0.5:0.95 refers to the average over IOU
thresholds from 0.5 to 0.95 with
an interval of 0.05. The P and R are defined as
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}        (12)
where TP (True Positive) indicates the number of positive samples correctly classified
as positive by the model, FP (False Positive) represents the number of negative samples
incorrectly classified as positive by the model, and FN (False Negative) represents the
number of positive samples incorrectly classified as negative by the model. By calculating
the area under the P-R curve, the mAP is defined as
\mathrm{mAP} = \int_{0}^{1} P(R)\, dR        (13)
where P(R) denotes the precision as a function of the recall. In addition, we also evaluate the
model size and computational complexity through Params and GFLOPs (giga floating-point operations).
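As a small numerical illustration of Equations (12) and (13), the sketch below computes precision, recall, and the area under a P-R curve by trapezoidal integration; the reported results use the standard COCO-style evaluator [39], so this is only an illustration of the definitions, not the evaluation code.

# Illustrative sketch of Equations (12) and (13); not the COCO evaluator used for the results.
import numpy as np

def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)                  # Equation (12)

def average_precision(recalls, precisions):
    """Area under P(R) over R in [0, 1], given points sorted by increasing recall."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([precisions[0]], precisions, [0.0]))
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))   # Equation (13)

# Example: three operating points of a detector
print(precision_recall(tp=80, fp=20, fn=40))               # (0.8, 0.666...)
print(average_precision(np.array([0.2, 0.5, 0.8]), np.array([0.9, 0.8, 0.6])))  # 0.705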
Table 1. MVT parameters and variations. Except for the channel numbers at each stage, all model
variants share the same parameter set.
Stage | Size | Kernel | Stride | Channels (MVT-B) | Channels (MVT-S) | Channels (MVT-T)
S1    | 1/4  | 7      | 4      | 96 ✓             | 64               | 32
S2    | 1/8  | 3      | 2      | 192 ✓            | 128 ✓            | 64
S3    | 1/16 | 3      | 2      | 384 ✓            | 256 ✓            | 128 ✓
S4    | 1/32 | 3      | 2      | 768 ✓            | 512 ✓            | 256 ✓
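For reference, the variants in Table 1 can be written out as a small configuration; reading the checkmarks as the stages whose outputs are passed on as multi-scale features is our interpretation of the table, not an explicit statement in the text.

# Illustrative summary of Table 1 (the "out_stages" reading of the checkmarks is assumed).
MVT_VARIANTS = {
    "MVT-B": {"channels": (96, 192, 384, 768), "out_stages": ("S1", "S2", "S3", "S4")},
    "MVT-S": {"channels": (64, 128, 256, 512), "out_stages": ("S2", "S3", "S4")},
    "MVT-T": {"channels": (32, 64, 128, 256),  "out_stages": ("S3", "S4")},
}
# All variants share the kernel/stride settings: S1 uses 7/4, S2-S4 use 3/2.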
We utilize three variants, MVT-B/S/T, for detection on the EOD dataset. Figure 6 presents the
detection results in different scenarios. Since the event camera outputs asynchronous data, it
generates corresponding events even in low-light and overexposed conditions, without being
limited by the intensity of light.
Figure 6. Prediction examples on the EOD dataset. The MVT-B/S/T variants are applied to detect in
normal, motion blur, and low-light scenarios, respectively.
MVT-B outperforms the other variants due to its higher-resolution feature scale in-
formation, resulting in superior performance in detecting small objects. Specifically, in
scenarios with motion blur and low light, MVT-S and MVT-T occasionally fail to detect
small targets located in the top-left corner. However, despite the inherent advantages of
event cameras over traditional cameras in terms of efficiency, they suffer from the loss
of high-frequency information in the images, leading to the degradation of image details.
Consequently, under low-light conditions, MVT-T misclassifies a car as a boat.
Table 2. Ablation experiment on the EOD dataset. “✓” indicates that the module is used in the MVT
network, while “-” indicates that it is not used, best results in bold, underlined denotes the second
best performance, and the same colors indicate the same benchmarks except for CDA.
As shown in Table 2, applying CSA to extract channel and spatial information improves
mAP@0.5:0.95 by 2.4%, incorporating GSA to extract global spatial information enhances
mAP@0.5:0.95 by 5.1%, and introducing CDA reduces the model's computational complexity by
approximately 58% in terms of GFLOPs while maintaining the original performance. Combining CSA
and GSA results in a 7.4% increase in mAP@0.5:0.95. Finally, by considering CSA, GSA, and CDA
together, we achieve 28.7% mAP@0.5:0.95, reducing the entire-network GFLOPs and Encoder GFLOPs
by approximately 45% and 82% compared to the model without CDA. Figure 7 shows the attention visualization
both without and with CDA.
Figure 7. Visualization of attention maps. (a) Visualization of feature maps generated by the model
without CDA. (b) Visualization of feature maps generated by the model with CDA. It can be observed
that the attention applied by CDA is more focused on small targets. (c) Detection results applied CDA.
Table 3. Ablation of the downsampling module. Best results in bold. The usage of Conv. overlapping
outperforms other downsample approaches.
Table 4. Ablation of the global spatial attention module. Best results in bold. The usage of Grid-
Attention outperforms Swin-Attention.
Figure 8. Comparison of the detection results before and after using CSA alone, GSA alone, and
both CSA and GSA in the MVT network. (a) Baseline. (b) Baseline + CSA. (c) Baseline + GSA.
(d) Baseline + CSA + GSA.
Figure 9. Prediction examples on the EOD dataset using different approaches involving Faster
R-CNN, YOLOv7, Deformable DETR, and proposed method.
Table 5. Comparison of detection performance on the EOD dataset. The best result is highlighted
with bold.
Figure 10 presents the results of our method for detecting objects in various scenes within
the VisDrone2019 dataset.
Figure 10. Prediction examples on the VisDrone2019 dataset using different approaches involving
YOLOv5, DMNet, and proposed method.
Table 6. Comparison of detection performance on the VisDrone2019 dataset. The best result is
highlighted with bold.
Figure 11. Prediction examples on the UAVDT dataset using different approaches involving Faster
R-CNN, DMNet, and proposed method.
Table 7. Comparison of detection performance on the UAVDT dataset. The best result is highlighted
with bold.
5. Discussion
The UAVDT dataset only annotates three categories of objects and has simpler scenes
compared to the VisDrone2019 dataset. However, the UAVDT dataset exhibits lower
detection performance due to its challenging scenes (e.g., low lighting, motion blur), as
exemplified in Figure 12. Therefore, applying event cameras to improve imaging in extreme
environments can greatly improve the accuracy of object detection. Although event cameras are
capable of capturing moving objects in various challenging scenarios, they only retain intensity
features while losing color information, resulting in the loss of object details. Traditional
cameras, by contrast, are limited by a fixed frame rate but preserve more high-frequency
information. Therefore, it is a meaningful step to simultaneously consider
the event and traditional cameras for detection, aiming to achieve improved performance
in any challenging scenario.
Figure 12. Extreme scenarios in UAVDT dataset. These scenes captured by traditional cameras pose
challenges for object detection.
6. Conclusions
In this paper, we aim to capture details in challenging remote sensing images (e.g., low
light, motion blur scenarios) to improve the detection performance of small targets. We
propose a method called Multi-Vision Transformer (MVT), which employs Channel Spa-
tial Attention (CSA) to enhance short-range dependencies and extract high-frequency
information features, utilizing Global Spatial Attention (GSA) to strengthen long-range
dependencies and retain more low-frequency information. Specifically, the proposed MVT
backbone generates more accurate object locations with enhanced features by maintaining
multi-scale high-resolution features with rich semantic information. Subsequently, we use
Scale-Level Embedding to encode multi-scale features and apply Cross Deformable
Attention (CDA) to progressively fuse information from different scales, significantly re-
ducing the computational complexity of the network. Furthermore, we introduce a dataset
called EOD, captured by a drone equipped with an event camera. Finally, all experiments
are conducted on the EOD dataset and two widely used UAV remote sensing datasets.
The results demonstrate that our method outperforms widely used methods in terms of
detection performance on the EOD dataset, VisDrone2019 dataset, and UAVDT dataset.
References
1. Brandli, C.; Berner, R.; Yang, M.; Liu, S.C.; Delbruck, T. A 240 × 180 130 dB 3 µs latency global shutter spatiotemporal vision
sensor. IEEE J. Solid-State Circuits 2014, 49, 2333–2341. [CrossRef]
2. Delbruck, T. Frame-free dynamic digital vision. In Proceedings of the International Symposium on Secure-Life Electronics,
Advanced Electronics for Quality Life and Society, Tokyo, Japan, 6–7 March 2008 ; Volume 1, pp. 21–26.
3. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv
2020, arXiv:2010.04159.
4. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision
meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer
Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
5. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark:
Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14
September 2018; pp. 370–386.
6. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703.
7. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the Computer
Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII 14;
Springer: Berlin/Heidelberg, Germany, 2016; pp. 483–499.
8. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
9. Quan, Y.; Zhang, D.; Zhang, L.; Tang, J. Centralized feature pyramid for object detection. IEEE Trans. Image Process. 2023, 32,
4341–4354. [CrossRef] [PubMed]
10. Wang, W.; Xie, E.; Song, X.; Zang, Y.; Wang, W.; Lu, T.; Yu, G.; Shen, C. Efficient and accurate arbitrary-shaped text detection with
pixel aggregation network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of
Korea, 27 October–2 November 2019; pp. 8440–8449.
11. Mboga, N.; Grippa, T.; Georganos, S.; Vanhuysse, S.; Smets, B.; Dewitte, O.; Wolff, E.; Lennert, M. Fully convolutional networks for land
cover classification from historical panchromatic aerial photographs. ISPRS J. Photogramm. Remote Sens. 2020, 167, 385–395. [CrossRef]
12. Abriha, D.; Szabó, S. Strategies in training deep learning models to extract building from multisource images with small training
sample sizes. Int. J. Digit. Earth 2023, 16, 1707–1724. [CrossRef]
13. Solórzano, J.V.; Mas, J.F.; Gallardo-Cruz, J.A.; Gao, Y.; de Oca, A.F.M. Deforestation detection using a spatio-temporal deep
learning approach with synthetic aperture radar and multispectral images. ISPRS J. Photogramm. Remote Sens. 2023, 199, 87–101.
[CrossRef]
14. Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-hrnet: A lightweight high-resolution network. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021;
pp. 10440–10450.
15. Xu, H.; Tang, X.; Ai, B.; Yang, F.; Wen, Z.; Yang, X. Feature-selection high-resolution network with hypersphere embedding for
semantic segmentation of VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4411915. [CrossRef]
16. Hao, X.; Yin, L.; Li, X.; Zhang, L.; Yang, R. A Multi-Objective Semantic Segmentation Algorithm Based on Improved U-Net
Networks. Remote Sens. 2023, 15, 1838. [CrossRef]
17. Sun, K.; Zhao, Y.; Jiang, B.; Cheng, T.; Xiao, B.; Liu, D.; Mu, Y.; Wang, X.; Liu, W.; Wang, J. High-resolution representations for
labeling pixels and regions. arXiv 2019, arXiv:1904.04514.
18. Li, R.; Shen, Y. YOLOSR-IST: A deep learning method for small target detection in infrared remote sensing images based on
super-resolution and YOLO. Signal Process. 2023, 208, 108962. [CrossRef]
19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need.
Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
20. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020;
pp. 11534–11542.
21. Zhang, M.; Zhang, R.; Zhang, J.; Guo, J.; Li, Y.; Gao, X. Dim2Clear network for infrared small target detection. IEEE Trans. Geosci.
Remote Sens. 2023, 61, 5001714. [CrossRef]
22. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
23. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference
on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted
windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October
2021; pp. 10012–10022.
26. Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. Maxvit: Multi-axis vision transformer. In European Conference
on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022, pp. 459–479.
27. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R.; et al.
ultralytics/yolov5: v3.0. Zenodo 2020. [CrossRef]
28. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162.
29. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra r-cnn: Towards balanced learning for object detection. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019;
pp. 821–830.
30. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In
European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
31. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for
end-to-end object detection. arXiv 2022, arXiv:2203.03605.
32. Gehrig, M.; Scaramuzza, D. Recurrent vision transformers for object detection with event cameras. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 13884–13893.
33. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
34. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December
2015; pp. 1440–1448.
35. Iacono, M.; Weber, S.; Glover, A.; Bartolozzi, C. Towards event-driven object detection with off-the-shelf deep learning. In
Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, Madrid, Spain, 1–5
October 2018; pp. 1–9.
36. Jiang, Z.; Xia, P.; Huang, K.; Stechele, W.; Chen, G.; Bing, Z.; Knoll, A. Mixed frame-/event-driven fast pedestrian detection. In
Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), IEEE, Montreal, QC, Canada, 20–24 May
2019; pp. 8332–8338.
37. Su, Q.; Chou, Y.; Hu, Y.; Li, J.; Mei, S.; Zhang, Z.; Li, G. Deep directly-trained spiking neural networks for object detection. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6555–6565.
38. Zhu, A.Z.; Yuan, L.; Chaney, K.; Daniilidis, K. Unsupervised event-based learning of optical flow, depth, and egomotion. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019;
pp. 989–997.
39. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in
context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September
2014; Proceedings, Part V 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
40. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Detecting tiny objects in aerial images: A normalized Wasserstein distance and
a new benchmark. ISPRS J. Photogramm. Remote Sens. 2022, 190, 79–93. [CrossRef]
41. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada,
18–22 June 2023; pp. 7464–7475.
42. Li, C.; Yang, T.; Zhu, S.; Chen, C.; Guan, S. Density map guided object detection in aerial images. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 190–191.
43. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse r-cnn: End-to-end
object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 14454–14463.
44. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8311–8320.
45. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International
Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
46. Lin, H.; Zhou, J.; Gan, Y.; Vong, C.M.; Liu, Q. Novel up-scale feature aggregation for object detection in aerial images.
Neurocomputing 2020, 411, 364–374. [CrossRef]
47. Zhang, H.; Wang, Y.; Dayoub, F.; Sunderhauf, N. Varifocalnet: An iou-aware dense object detector. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 8514–8523.
48. Ma, Y.; Chai, L.; Jin, L. Scale decoupled pyramid for object detection in aerial images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4704314.
[CrossRef]
49. Deng, S.; Li, S.; Xie, K.; Song, W.; Liao, X.; Hao, A.; Qin, H. A global-local self-adaptive network for drone-view object detection.
IEEE Trans. Image Process. 2020, 30, 1556–1569. [CrossRef]
50. Xu, J.; Li, Y.; Wang, S. Adazoom: Adaptive zoom network for multi-scale object detection in large scenes. arXiv 2021, arXiv:2106.10409.
51. Ge, Z.; Qi, L.; Wang, Y.; Sun, Y. Zoom-and-reasoning: Joint foreground zoom and visual-semantic reasoning detection network
for aerial images. IEEE Signal Process. Lett. 2022, 29, 2572–2576. [CrossRef]
52. Zhang, J.; Yang, X.; He, W.; Ren, J.; Zhang, Q.; Zhao, T.; Bai, R.; He, X.; Liu, J. Scale Optimization Using Evolutionary Reinforcement
Learning for Object Detection on Drone Imagery. arXiv 2023, arXiv:2312.15219.