Transformers in Small Object Detection - SOTA

1
Transformers in Small Object Detection: A

Benchmark and Survey of State-of-the-Art
Aref Miri Rekavandi, Member, IEEE, Shima Rashidi, Farid Boussaid, Stephen Hoefs, Emre Akbas, and
Mohammed Bennamoun, Senior Member, IEEE
Abstract—Transformers have rapidly gained popularity in computer vision, especially in the field of object recognition and detection.
Upon examining the outcomes of state-of-the-art object detection methods, we noticed that transformers consistently outperformed
well-established CNN-based detectors in almost every video or image dataset. While transformer-based approaches remain at the
forefront of small object detection (SOD) techniques, this paper aims to explore the performance benefits offered by such extensive
networks and identify potential reasons for their SOD superiority. Small objects have been identified as one of the most challenging
arXiv:2309.04902v1 [cs.CV] 10 Sep 2023
object types in detection frameworks due to their low visibility. We aim to investigate potential strategies that could enhance
transformers’ performance in SOD. This survey presents a taxonomy of over 60 research studies on developed transformers for the
task of SOD, spanning the years 2020 to 2023. These studies encompass a variety of detection applications, including small object
detection in generic images, aerial images, medical images, active millimeter images, underwater images, and videos. We also compile
and present a list of 12 large-scale datasets suitable for SOD that were overlooked in previous studies and compare the performance
of the reviewed studies using popular metrics such as mean Average Precision (mAP), Frames Per Second (FPS), number of
parameters, and more. Researchers can keep track of newer studies on our web page, which is available at:
https://github.com/arekavandi/Transformer-SOD.
Index Terms—Object recognition, small object detection, vision transformers, object localization, deep learning, attention, MS COCO
dataset.
✦
1 I NTRODUCTION
MALL Object Detection (SOD) has been recognized as a
S significant challenge for State-Of-The-Art (SOTA) object de-
tection methods [1]. The term “small object” refers to objects
that occupy a small fraction of the input image. For example, in
the widely used MS COCO dataset [2], it defines objects whose
bounding box is 32 × 32 pixels or less, in a typical 480 × 640
image (Figure 1). Other datasets have their own definitions, e.g.
objects that occupy 10% of the image. Small objects are often
missed or detected with incorrectly localized bounding boxes,
and sometimes with incorrect labels. The main reason for the
deficient localization in SOD stems from the limited information
provided in the input image or video frame, compounded by the
subsequent spatial degradation experienced as they pass through
multiple layers in deep networks. Since small objects frequently
appear in various application domains, such as pedestrian detec-
tion [3], medical image analysis [4], face recognition [5], traffic
sign detection [6], traffic light detection [7], ship detection [8],
Synthetic Aperture Radar (SAR)-based object detection [9], it is
Fig. 1: Examples of small size objects from MS COCO dataset [2].
worth examining the performance of modern deep learning SOD
The objects are highlighted with color segments.
techniques. In this paper, we compare transformer-based detectors
with Convolutional Neural Networks (CNNs) based detectors in
Aref Miri Rekavandi and Mohammed Bennamoun are with the De- terms of their small object detection performance. In the case
partment of Computer Science and Software Engineering, The Univer- of outperforming CNNs with a clear margin, we then attempt to
sity of Western Australia (Emails: aref.mirirekavandi@uwa.edu.au, mo- uncover the reasons behind the transformer’s strong performance.
hammed.bennamoun@uwa.edu.au). Shima Rashidi is an independent re-
searcher (Email: shima.rashidi7@gmail.com). Farid Boussaid is with the De- One immediate explanation could be that transformers model the
partment of Electrical, Electronics and Computer Engineering, The University interactions between pairwise locations in the input image. This
of Western Australia (Email: farid.boussaid@uwa.edu.au). Stephen Hoefs is is effectively a way of encoding the context. And, it is well
a discipline leader at Defence Science and Technology Group, Australia established that context is a major source of information to detect
(Email: stephen.hoefs@defence.gov.au). Emre Akbas is with the Department
of Computer Engineering, Middle East Technical University, Turkey. (Email: and recognize small objects both in humans and computational
emre@ceng.metu.edu.tr). models [10]. However, this might not be the only factor to
2
explain transformers’ success. Specifically, we aim to analyze this

success along several dimensions including object representation,
fast attention for high-resolution or multiscale feature maps, fully
transformer-based detection, architecture and block modification,
auxiliary techniques, improved feature representation, and spatio-
temporal information. Furthermore, we point out approaches that
could potentially enhance the performance of transformers for
SOD.
In our previous work, we surveyed numerous strategies em-
ployed in deep learning to enhance the performance of small
object detection in optical images and videos up to the year
2022 [11]. We showed that beyond the adaptation of newer deep
learning structures such as transformers, prevalent approaches
include data augmentation, super-resolution, multi-scale feature
learning, context learning, attention-based learning, region pro-
posal, loss function regularization, leveraging auxiliary tasks, and
spatiotemporal feature aggregation. Additionally, we observed
that transformers are among the leading methods in localizing
small objects across most datasets. However, given that [11]
predominantly evaluated over 160 papers focusing on CNN-based
networks, an in-depth exploration of transformer-centric methods
was not undertaken. Recognizing the growth and exploration pace
in the field, there is a timely window now to delve into the current
transformer models geared towards small object detection.
In this paper, our goal is to comprehensively understand the
factors contributing to the impressive performance of transformers
when applied to small object detection and their distinction with
strategies used for generic object detection. To lay the ground- Fig. 2: Transformer architecture containing encoder (left module)
and decoder (right module) used in sequence to sequence translation
work, we first highlight renowned transformer-based object detec- (figure from [30]).
tors for SOD, juxtaposing their advancements against established
CNN-based methodologies.
Since 2017, the field has seen the publication of numerous detectors: DETR and ViT-FRCNN. In Section 3, we present a
review articles. An extensive discussion and listing of these classification for transformer-based SOD techniques and delve
reviews are presented in our previous survey [11]. Another recent into each category comprehensively. Section 4 showcases the
survey article [12] mostly focuses on the CNN-based techniques, different datasets used for SOD and evaluates them across a
too. The narrative of this current survey stands distinct from its range of applications. In Section 5, we analyze and contrast these
predecessors. Our focus in this paper narrows down specifically to outcomes with earlier results derived from CNN networks. The
transformers — an aspect not explored previously — positioning paper wraps up with conclusions in Section 6.
them as the dominant network architecture for image and video
SOD. This entails a unique taxonomy tailored to this innovative
architecture, consciously sidelining CNN-based methods. Given 2 BACKGROUND
the novelty and intricacy of this topic, our review prioritizes works Object detection and in particular SOD, has long relied on CNN-
primarily brought forth post-2022. Additionally, we shed light based deep learning models. Several single-stage and two-stage
on newer datasets employed for the localization and detection of detectors have emerged over time, such as You Only Look Once
small objects across a broader spectrum of applications. (YOLO) variants [13], [14], [15], [16], [17], [18], [19], Single Shot
The studies examined in this survey primarily presented meth- multi-box Detector (SSD) [20], RetinaNet [21], Spatial Pyramid
ods tailored for small object localization and classification or Pooling Network (SPP-Net) [22], Fast R-CNN [23], Faster R-
indirectly tackled SOD challenges. What drove our analysis were CNN [24], Region-Based Fully Convolutional Networks (R-FCN)
the detection outcomes specified for small objects in these papers. [25], Mask R-CNN [26], Feature Pyramid Networks (FPN) [27],
However, earlier research that noted SOD outcomes but either cascade R-CNN [28], and Libra R-CNN [29]. Various strategies
demonstrated subpar performance or overlooked SOD-specific have been used in conjunction with these techniques to improve
parameters in their development approach were not considered their detection performance for SOD, with multi-scale feature
for inclusion in this review. In this survey, we assume the reader learning being the most commonly used approach.
is already familiar with generic object detection techniques, their
architectures, and relevant performance measures. If the reader The transformer model was first introduced in [30] as a novel
requires foundational insight into these areas, we refer the reader technique for machine translation. This model aimed to advance
to our previous work [11]. beyond traditional recurrent networks and CNNs by introducing a
The structure of this paper is as follows: Section 2 offers an new network architecture solely based on attention mechanisms,
overview of CNN-based object detectors, transformers, and their thereby eliminating the need for recurrence and convolutions.
components, including the encoder and decoder. This section also The Transformer model consists of two main modules: the en-
touches upon two initial iterations of transformer-based object coder and the decoder. Figure 2 provides a visual representation
3
Fig. 3: Top: DETR (figure from [31]). Bottom: ViT-FRCNN (figure from [32]).
TABLE 1: A list of terminologies used in this paper with their meanings.
Full Term Description

Encoder Encoder in transformers consists of multiple layers of self-attention modules and feed-forward neural networks to extract local
and global semantic information from the input data.
Decoder Decoder module is responsible to generate the output (either sequence or independent) based on the concept of self and cross
attention applied to the object queries and encoder’s output.
Token Token refers to the most basic unit of data input into the transformers. It can be image pixels, patches, or video clips.
Multi-Head Attention Multi-Head Attention is a mechanism in transformers that enhances the learning capacity and representational power of
self-attention. It divides the input into multiple subspaces and performs attention computations independently on each subspace,
known as attention heads.
Spatial Attention Spatial attention in transformers refers to a type of attention mechanism that attends to the spatial positions of tokens within a
sequence. It allows the model to focus on the relative positions of tokens and capture spatial relationships.
Channel Attention Channel attention in transformers refers to an attention mechanism that operates across different channels or feature dimensions of
the input. It allows the model to dynamically adjust the importance of different channels, enhancing the representation and modeling
of channel-specific information in tasks
Object Query It refers to a learned vector representation that is used to query and attend to specific objects or entities within a scene.
Positional Embedding It refers to a learned representation that encodes the positional information of tokens in an input sequence, enabling the model to
capture sequential dependencies.
of the processing blocks within each module. The description by x. The output of the Multi-Head Attention block is given by
of terminologies commonly used in Transformers for computer
vision is provided in Table 1 for readers who are not familiar MH Attention(Q, K, V) = Concat(head1 , · · · , headh )WO . (2)
with the topic. Within the context of SOD, the encoder module where WO ∈ Rhdv ×d , dk = dq , and
ingests input tokens, which can refer to image patches or video
clips, and employs various feature embedding approaches, such K⊤
h Qh
headh = Attention(Qh , Kh , Vh ) = Softmax( √ )V⊤
h . (3)
as utilizing pre-trained CNNs to extract suitable representations. dk
The positional encoding block embeds positional information into
Finally, the results obtained from the previous steps are combined
the feature representations of each token. Positional encoding has
with a skip connection and a normalization block. These vectors
demonstrated significant performance improvements in various
are then individually passed through a fully connected layer,
applications. The encoded representations are then passed through
applying an activation function to introduce non-linearity into
a Multi-Head Attention block, which is parameterized with three
the network. The parameters of this block are shared across
main matrices, namely Wq ∈ Rdq ×d , Wk ∈ Rdk ×d , and
all vectors. This process is repeated for a total of N times,
Wv ∈ Rdv ×d to obtain query, key and value vectors, shown by q,
corresponding to the number of layers in the deep network. In
k, v, respectively. In other words,
the decoder module, a similar process is applied using the vectors
qi = Wq xi , ki = Wk xi , vi = Wv xi , i = 1, · · · , T, (1) generated in the encoder, while also consuming the previously
generated predictions/outputs as additional input. Ultimately, the
where T is the total number of tokens and each token is denoted output probabilities for the possible output classes are computed.
4
Fig. 4: Taxonomy of small object detection using transformers and popular object detection methods in each category.
N
Attention is achieved through the dot product operation between X
ŝ = arg min Lmatch (yi , ŷs(i) ), (5)
the key and query matrices, Eq. (3), which computes weights for s∈S
i
the linear combination of the matrix V.
An alternative representation for the transformer is also pro-
where Lmatch (yi , ŷs(i) ) measures the pair-wise matching cost
vided in
between ground truth box yi with size N and the prediction with
T
X X index s(i) where s is a specific order of predicted bounding boxes.
MH Attentioni = WO

h Ahik Wv xk , i = 1, · · · , T, (4)
In this formulation, N is the largest possible number of objects
h k=1
within an image. In the case of fewer objects in predictions and
where WO
h is a submatrix of WO that corresponds to h−th ground-truth, y and ŷ will be padded with ∅ (indicating no object).
head, and Ahik is the attention weight in h−th head which is Consequently, this loss function considers all possible matching
the element in i−th row (corresponds to i−th query) and k−th policies between predictions and ground truth, selecting the one
K⊤ Q
column (corresponds to k−th key) of the matrix: Softmax( √hd h ). that yields the minimum loss value. The optimal pairing can be
k
Dosovitskiy et al. were the first to utilize the architecture of efficiently computed using the Hungarian algorithm, as demon-
transformers in computer vision tasks, including image recogni- strated in [34]. DETR used a CNN backbone to extract compact
tion [33]. The remarkable performance exhibited by transformers feature representations and an encoder-decoder transformer with a
in various vision tasks has paved the way for their application in feed-forward network to produce the final predictions (see Figure
the domain of object detection research. Two pioneering works in 3, Top). In contrast, ViT-FRCNN uses the Vision Transformer
this area are the DEtection TRansformer (DETR) [31] (Figure 3, (ViT) [33] for object detection and demonstrates that pre-trained
Top) and ViT-FRCNN [32] (Figure 3, Bottom). ViT on large-scale datasets enhances the detection performance
DETR aimed to reduce the reliance on CNN-based techniques through rapid fine-tuning. While ViT-FRCNN, like DETR, in-
during post-processing by employing a set-based global loss. This corporates CNN-based networks in its pipeline, specifically in
particular loss function aids in the collapse of near-duplicate the detection head, it diverges from DETR by using the Trans-
predictions through bipartite matching, ensuring each prediction is former (encoder only) to encode visual attributes. Additionally,
uniquely paired with its matching ground truth bounding boxes. As a conventional Region Proposal Network (RPN) [24] is used
an end-to-end model, DETR benefits from global computation and for generating detections (illustrated in Figure 3, Bottom). Both
perfect memory, making it suitable for handling long sequences DETR and ViT-FRCNN have shown subpar results in the detection
generated from videos/images. The bipartite matching loss utilized and classification of small objects. ViT-FRCNN even exhibited
in DETR is defined as follows: worse results when increasing the token size of the input image.
5
Fig. 5: BVR uses different representations. i.e., corner and center points to enhance features for anchor-based detection (left figure). Object
representations are shown for another image (cat) where red dashes show the ground truth (figure from [35]).
The best outcomes were achieved when the token size was set to
16×16, and all intermediate transformer states were concatenated
with the final transformed layer. Additionally, both detectors rely
on CNNs at different stages, in DETR as the backbone for feature
extraction and in ViT-FRCNN for the detection head. To improve
the results of small object detection, it is crucial to retain the image
patches as small as possible to preserve spatial resolution, which
consequently increases the computational costs. To address these
limitations and challenges, further research has been conducted,
which will be discussed in detail in the following sections.
3 T RANSFOMERS F OR S MALL O BJECT D ETEC - Fig. 6: The block diagram for the Deformable attention module. zq
is the content feature of the query, x is the feature map and pq is
TION the refrence point in 2-D grid. In short, the deformable attention
In this section, we discuss transformer-based networks for SOD. module only attends to a small set of key sampling points around the
A taxonomy of small object detectors is shown in Figure 4. We reference point (different in each head). This significantly reduces the
complexity and further improves the convergence (figure from [41]).
show that existing detectors based on novel transformers can be
analyzed through one or a few of the following perspectives:
object representation, fast attention for high-resolution or multi-
corner points (keys) into the anchor-based (query) object detection
scale feature maps, fully transformer-based detection, architecture
methodology. Different object representations are also shown in
and block modification, auxiliary techniques, improved feature
Figure 5. CenterNet++ [40] was proposed as a novel bottom-up
representation, and spatio-temporal information. In the following
approach. Instead of estimating all the object’s parameters at once,
subsections, each of these categories is discussed in detail sepa-
CenterNet++ strategically identifies individual components of the
rately.
object separately, i.e., top-left, bottom-left, and center keypoints.
Then, post-processing methodologies are adopted to cluster points
3.1 Object Representation associated with the same objects. This technique has demonstrated
Various object representation techniques have been adopted in a superior recall rate in SOD compared to top-down approaches
object detection techniques. The object of interest can be rep- that estimate entire objects as a whole.
resented by rectangular boxes [23], points such as center points
[36] and point sets [37], probabilistic objects [38], and keypoints
3.2 Fast Attention for High-Resolution or Multi-Scale
[39]. Each object representation technique has its own strengths
Feature Maps
and weaknesses, with respect to the need for annotation formats
and small object representation. The pursuit of finding the optimal Previous research has shown that maintaining a high resolution
representation technique, while keeping all the strengths of the of feature maps is a necessary step for maintaining high perfor-
existing representations, began with RelationNet++ [35]. This mance in SOD. Transformers, inherently exhibit a notably higher
approach bridges various heterogeneous visual representations complexity compared to CNNs due to their quadratic increase
and combines their strengths via a module called Bridging Vi- in complexity with respect to the number of tokens (e.g., pixel
sual Representations (BVR). BVR operates efficiently without numbers). This complexity emerges from requirement of pairwise
disrupting the overall inference process employed by the main correlation computation across all tokens. Consequently, both
representations, leveraging novel techniques of key sampling and training and inference times exceed expectations, rendering the
shared location embedding. More importantly, BVR relies on an detector inapplicable for small object detection in high-resolution
attention module that designates one representation form as the images and videos. In their work on Deformable DETR, Zhu et
“master representation” (or query), while the other representations al. [41] addressed this issue that had been observed in DETR
are designated as “auxiliary” representations (or keys). The BVR for the first time. They proposed attending to only a small set of
block is shown in Figure 5, where it enhances the feature repre- key sampling points around a reference, significantly reducing the
sentation of the anchor box by seamlessly integrating center and complexity. By adopting this strategy, they effectively preserved
6
Fig. 7: ViDT (c) mixes DETR (with ViT backbone or other fully transformer-based backbones) (a) with YOLOS architecture (b) in a multi-scale
feature learning pipeline to achieve SOTA results (figure from [42]).
spatial resolution through the use of multi-scale deformable atten- 3.3 Fully Transformer-Based Detectors
tion modules. Remarkably, this method eliminated the necessity The advent of transformers and their outstanding performance in
for feature pyramid networks, thereby greatly enhancing the de- many complex tasks in computer vision has gradually motivated
tection and recognition of small objects. The i−th output of a researchers to shift from CNN-based or mixed systems to fully
multi-head attention module in Deformable attention is given by: transformer-based vision systems. This line of work started with
X K the application of a transformer-only architecture to the image
X
MH Attentioni = WO

h Ahik Wv xk (pi + ∆phik ) , (6) recognition task, known as ViT, proposed in [33]. In [42], ViDT
h k=1 extended the YOLOS model [49] (the first fully transformer-based
detector) to develop the first efficient detector suitable for SOD.
where i = 1, · · · , T and pi is the reference point of the query
In ViDT, the ResNet used in DETR for feature extraction is
and ∆phik is the sampling offset (in 2D) in h−th head with
replaced with various ViT variants, such as Swin Transformer
K samplings (K<<T=HW). Figure 6 illustrates the computation
[50], ViTDet [51], and DeiT [52], along with the Reconfigured
process within its multi-head attention module. Deformable DETR
Attention Module (RAM). The RAM is capable of handling
benefits from both its encoder and decoder modules, with the
[PATCH] × [PATCH], [DET] × [PATCH], and [DET] × [DET]
complexity order within the encoder being O(HW C 2 ) where H
attentions. These cross and self-attention modules are necessary
and W are the height and width of input feature map and C is the
because, similar to YOLOS, ViDT appends [DET] and [PATCH]
number of channels. In contrast for the DETR encoder, the order
tokens in the input. ViDT only utilizes a transformer decoder as its
of complexity is O(H 2 W 2 C), displaying a quadratic increase
neck to exploit multi-scale features generated at each stage of its
as H and W increase in size. Deformable attention has played
body step. Figure 7 illustrates the general structure of ViDT and
a prominent role in various other detectors, e.g., in T-TRD [43].
highlights its differences from DETR and YOLOS.
Subsequently, Dynamic DETR was proposed in [44], featuring
Recognizing that the decoder module is the main source of in-
a dynamic encoder and a dynamic decoder that harness feature
efficiency in transformer-based object detection, the Decoder-Free
pyramids from low to high-resolution representations, resulting
Fully Transformer (DFFT) [53] leverages two encoders: Scale-
in efficient coarse-to-fine object detection and faster convergence.
Aggregated Encoder (SAE) and Task-Aligned Encoder (TAE), to
The dynamic encoder can be viewed as a sequentially decom-
maintain high accuracy. SAE aggregates the multi-scale features
posed approximation of full self-attention, dynamically adjusting
(four scales) into a single feature map, while TAE aligns the
attention mechanisms based on scale, spatial importance, and rep-
single feature map for object type and position classification and
resentation. Both Deformable DETR and Dynamic DETR make
regression. Multi-scale feature extraction with strong semantics is
use of deformable convolution for feature extraction. In a distinct
performed using a Detection-Oriented Transformer (DOT) back-
approach, O2 DETR [45] demonstrated that the global reasoning
bone.
offered by a self-attention module is actually not essential for
In Sparse RoI-based deformable DETR (SRDD) [54], the
aerial images, where objects are usually densely packed in the
authors proposed a lightweight transformer with a scoring system
same image area. Hence, replacing attention modules with local
to ultimately remove redundant tokens in the encoder. This is
convolutions coupled with the integration of multi-scale feature
achieved using RoI-based detection in an end-to-end learning
maps, was proven to improve the detection performance in the
scheme.
context of oriented object detection. The authors in [46] pro-
posed the concept of Row-Column Decoupled Attention (RCDA),
decomposing the 2D attention of key features into two simpler 3.4 Architecture and Block Modifications
forms: 1D row-wise and column-wise attentions. In the case of DETR, the first end-to-end object detection method, strugles with
CF-DETR [47], an alternative approach to FPN was proposed extended converge times during training and performs poorly on
whereby C5 features were replaced with encoder features at level small objects. Several research works have addressed these issues
5 (E5), resulting in improved object presentation. This innova- to improve SOD performance. One notable contribution comes
tion was named Transformer Enhanced FPN (TEF) module. In from Sun et al. [55], who, drawing inspiration from FCOS [56]
another study, Xu et al. [48] developed a weighted Bidirectional (a fully convolutional single-stage detector) and Faster RCNN,
Feature Pyramid Network (BiFPN) through the integration of skip proposed two encoder-only DETR variants with feature pyra-
connection operations with the Swin transformer. This approach mids called TSP-FCOS and TSP-RCNN. This was accomplished
effectively preserved information pertinent to small objects. by eliminating cross-attention modules from the decoder. Their
7
Fig. 8: Conformer architecture which leverages both local features provided by CNNs and global features provided by transformers in Feature
Coupling Unit (FCU) (figure from [58]).
findings demonstrated that cross-attention in the decoder and the potentially enhanced the detection accuracy in aerial photographs.
instability of the Hungarian loss were the main reasons for the Questioning the object queries used in previous works, Wang
late convergence in DETR. This insight led them to discard the et al. [46] proposed Anchor DETR, which used anchor points for
decoder and introduce a new bipartite matching technique in these object queries. These anchor points enhance the interpretability of
new variants, i.e., TSP-FCOS and TSP-RCNN. the target query locations. The use of multiple patterns for each
In a combined approach using CNNs and transformers, Peng anchor point, improves the detection of multiple objects in one
et al. [57], [58] proposed a hybrid network structure called region. In contrast, Conditional DETR [65] emphasizes on the
“Conformer”. This structure fuses the local feature representa- conditional spatial queries derived from the decoder content lead-
tion provided by CNNs with the global feature representation ing to spatial attention predictions. A subsequent version, Condi-
provided by transformers at varying resolutions (see Figure 8). tional DETR v2 [66], enhanced the architecture by reformulating
This was achieved through Feature Coupling Units (FCUs), with the object query into the form of a box query. This modification
experimental results demonstrating its effectiveness compared to involves embedding a reference point and transforming boxes
ResNet50, ResNet101, DeiT, and other models. A similar hybrid with respect to the reference point. In subsequent works, DAB-
technique combining CNNs and transformers was proposed in DETR [67] further improved on the idea of query design by using
[59]. Recognizing the importance of local perception and long- dynamically adjustable anchor boxes. These anchor boxes serve
range correlations, Xu et al. [60] added a Local Perception Block as both reference query points and anchor dimensions (see Figure
(LPB) to the Swin Transformer block in the Swin Transformer. 9).
This new backbone, called the Local Perception Swin Transformer In another work [47], the authors observed that while the mean
(LPSW), improved the detection of small-size objects in aerial average precision (mAP) of small objects in DETR is not com-
images significantly. DIAG-TR [61] introduced a Global-Local petitive with state-of-the-art (SOTA) techniques, its performance
Feature Interweaving (GLFI) module in the encoder to adaptively for small intersection-over-union (IoU) thresholds is surprisingly
and hierarchically embed local features into global representa- better than its competitors. This indicates that while DETR pro-
tions. This technique counterbalances for the scale discrepancies vides strong perception abilities, it requires fine-tuning to achieve
of small objects. Furthermore, learnable anchor box coordinates better localization accuracy. As a solution, the Coarse-to-Fine
were added to the content queries in the transformer decoder, Detection Transformer (CF-DETR) has been proposed to perform
providing an inductive bias. In a recent study, Chen et al. [62] pro- this refinement through Adaptive Scale Fusion (ASF) and Local
posed the Hybrid network Transformer (Hyneter), which extends Cross-Attention (LCA) modules in the decoder layer. In [68] the
the range of local information by embedding convolutions into the authors contend that the suboptimal performance of transformer-
transformer blocks. This improvement led to enhanced detection based detectors can be attributed to factors such as using a singular
results on the MS COCO dataset. Similar hybrid approaches have cross-attention module for both categorization and regression,
been adopted in [63]. In another study [64], the authors proposed inadequate initialization for content queries, and the absence
a new backbone called NeXtFormer, which combines CNN and of leveraging prior knowledge in the self-attention module. To
transformer to boost the local details and features of small objects, address these concerns, they proposed Detection Split Transformer
while also providing a global receptive field. (DESTR). This model splits cross-attention into two branches, one
Among various methods, O2 DETR [45] substituted the atten- for classification and one for regression. Moreover, DESTR uses
tion mechanism in transformers with depthwise separable con- a mini-detector to ensure proper content query initialization in the
volution. This change not only decreased memory usage and decoder and enhances the self-attention module. Another research
computational costs associated with multi-scale features but also [48], introduced FEA-Swin, which leverages advanced foreground
8
Fig. 9: DAB-DETR improves Conditional DETR and utilizes dynamic anchor boxes to sequentially provide better reference query points and
anchor sizes (figure from [67]).
enhancement attention in the Swin Transformer framework to positive object queries for each ground-truth object to enhance
integrate context information into the original backbone. This was performance.
motivated by the fact that Swin Transformer does not adequately A Dual-Key Transformer Network (DKTNet) is proposed
handle dense object detection due to missing connections between in [76], where two keys are used—one key along with the Q
adjacent objects. Therefore, foreground enhancement highlights stream and another key along with the V stream. This enhances
the objects for further correlation analysis. TOLO [69] is one the coherence between Q and V, leading to improved learning.
of the recent works aiming to bring inductive bias (using CNN) Additionally, channel attention is computed instead of spatial
to the transformer architecture through a simple neck module. attention, and 1D convolution is used to accelerate the process.
This module combines features from different layers to incorpo-
rate high-resolution and high-semantic properties. Multiple light
transformer heads were designed to detect objects at different 3.5 Auxiliary Techniques
scales. In a different approach, instead of modifying the modules Experimental results have demonstrated that auxiliary techniques
in each architecture, CBNet, proposed by Liang et al. [70], groups or tasks, when combined with the main task, can enhance
multiple identical backbones that are connected through composite performance. In the context of transformers, several techniques
connections. have been adopted, including: (i) Auxiliary Decoding/Encoding
In the Multi-Source Aggregation Transformer (MATR) [71], Loss: This refers to the approach where feed-forward networks
the cross-attention module of the transformer is used to leverage designed for bounding box regression and object classification are
other support images of the same object from different views. connected to separate decoding layers. Hence individual losses
A similar approach is adopted in [72], where the Multi-View at different scales are combined to train the models leading to
Vision Transformer (MVViT) framework combines information better detection results. This technique or its variants have been
from multiple views, including the target view, to improve the used in ViDT [42], MDef-DETR [77], CBNet [70], SRDD [54].
detection performance when objects are not visible in a single (ii) Iterative Box Refinement: In this method, the bounding boxes
view. within each decoding layer are refined based on the predictions
Other works prefer to adhere to the YOLO family architecture. from the previous layers. This feedback mechanism progressively
For instance, SPH-Yolov5 [73] adds a new branch in the shallower improves detection accuracy. This technique has been used in
layers of the Yolov5 network to fuse features for improved small ViDT [42]. (iii) Top-Down Supervision: This approach leverages
object localization. It also incorporates for the first time the Swin human understandable semantics to aid in the intricate task of
Transformer prediction head in the Yolov5 pipeline. detecting small or class-agnostic objects, e.g., aligned image-
In [74], the authors argue that the Hungarian loss’s direct text pairs in MDef-DETR [77], or text-guided object detector in
one-to-one bounding box matching approach might not always TGOD [78]. (iv) Pre-training: This involves training on large-
be advantageous. They demonstrate that employing a one-to- scale datasets followed by specific fine-tuning for the detection
many assignment strategy and utilizing the NMS (Non-Maximum task. This technique has been used in CBNet V2-TTA [79], FP-
Suppression) module leads to better detection results. Echoing DETR [80], T-TRD [43], SPH-Yolov5 [73], MATR [71], and
this perspective, Group DETR [75] implements K groups of extensively in Group DETR v2 [81]. (v) Data Augmentation:
object queries with one-to-one label assignment, leading to K This technique enriches the detection dataset by applying various
9
augmentation techniques, such as rotation, flipping, zooming in various backbones. The Small Object Favoring DETR (SOF-
and out, cropping, translation, adding noise, etc. Data augmenta- DETR) [95], specifically favors the detection of small objects by
tion is a commonly used approach to address various imbalance merging convolutional features from layers 3 and 4 in a normalized
problems [82], e.g., imbalance in object size, within deep learning inductive bias module prior to input into the DETR-Transformer.
datasets. Data augmentation can be seen as an indirect approach NLFFTNet [84] addresses the limitation of only considering local
to minimize the gap between train and test sets [83]. Several interactions in current fusion techniques by introducing a nonlocal
methods used augmentation in their detection task including T- feature-fused transformer convolutional network, capturing long-
TRD [43], SPH-Yolov5 [73], MATR [71], NLFFTNet [84], DeoT distance semantic relationships between different feature layers.
[85], HTDet [86], and Sw-YoloX [63]. (vi) One-to-Many Label DeoT [85] merges an encoder-only transformer with a novel
Assignment: The one-to-one matching in DETR can result in poor feature pyramid fusion module. This fusion is enhanced by the use
discriminative features within the encoder. Hence, one-to-many of channel and spatial attention in the Channel Refinement Module
assignments in other methods, e.g., Faster-RCNN, RetinaNet, and (CRM) and Spatial Refinement Module (SRM), enabling the
FCOS have been used as auxiliary heads in some studies such extraction of richer features. The authors in HTDet [86] proposed
as CO-DETR [87]. (vii) Denoising Training: This technique aims a fine-grained FPN to cumulatively fuse low-level and high-level
to boost the convergence speed of the decoder in DETR, which features for better object detection. Meanwhile, in MDCT [96]
often faces an unstable convergence due to bipartite matching. the author proposed a Multi-kernel Dilated Convolution (MDC)
In denoising training, the decoder is fed with noisy ground-truth module to improve the performance of small object-related feature
labels and boxes into the decoder. The model is then trained extraction using both the ontology and adjacent spatial features
to reconstruct the original ground truth (guided by an auxiliary of small objects. The proposed module leverages depth-wise
loss). Implementations like DINO [88] and DN-DETR [89] have separable convolution to reduce the computational cost. Lastly, in
demonstrated the effectiveness of this technique in enhancing the [97], a feature fusion module paired with a lightweight backbone
decoder’s stability. is engineered to enhance the visual features of small objects
by broadening the receptive field. The hybrid attention module
in RTD-Net [97] empowers the system to detect objects that
3.6 Improved Feature Representation are partially occluded, by incorporating contextual information
Although current object detectors excel in a wide range of surrounding small objects.
applications for regular-size or large objects, certain use-cases
necessitate specialized feature representations for improved SOD. 3.7 Spatio-Temporal Information
For instance, when it comes to detecting oriented objects in In this section, our focus is exclusively on video-based object
aerial imagery, any object rotation can drastically alter the feature detectors that aim to identify small objects. While many of these
representation due to increased background noise or clutter in the studies have been tested on the ImageNet VID dataset 1 [98],
scene (region proposal). To address this, Dai et al. [90] ] proposed this dataset was not originally intended for small object detection.
AO2-DETR, a method designed to be robust to arbitrary object Nonetheless, a few of the works also reported their results for
rotations. This is achieved through three key components: (i) the small objects of ImageNet VID dataset. The topic of tracking and
generation of oriented proposals, (ii) a refinement module of the detecting small objects in videos has also been explored using
oriented proposal which extracts rotational-invariant features, and transformer architectures. Although techniques for image-based
(iii) a rotation-aware set matching loss. These modules help to SOD can be applied to video, they generally do not utilize the valu-
negate the effects of any rotations of the objects. In a related able temporal information, which can be particularly beneficial
approach, DETR++[91], uses multiple Bi-Directional Feature for identifying small objects in cluttered or occluded frames. The
Pyramid layers (BiFPN) that are applied in a bottom-up fashion application of transformers to generic object detection/tracking
to feature maps from C3, C4, and C5. Then, only one scale which started with TrackFormer [99] and TransT [100]. These models
is representative of features at all scales is selected to be fed into used frame-to-frame (setting the previous frame as the reference)
DETR framework for detection. For some specific applications, set prediction and template-to-frame (setting a template frame
such as plant safety monitoring, where objects of interest are as the reference) detection. Liu et al. in [101] were among the
usually related to human workers, leveraging this contextual in- first to use transformers specifically for video-based small object
formation can greatly improve feature representation. PointDet++ detection and tracking. Their core concept is to update template
[92] capitalizes on this by incorporating human pose estimation frames to capture any small changes induced by the presence of
techniques, integrating local and global features to enhance SOD small objects and to provide a global attention-driven relationship
performance. Another crucial element that impacts feature quality between the template frame and the search frame.
is the backbone network and its ability to extract both semantic Transformer-based object detection gained formal recognition
and high-resolution features. GhostNet introduced in [93], offers a with the introduction of TransVOD, an end-to-end object detector,
streamlined and more efficient network that delivers high-quality, as presented in [102] and [103]. This model applies both spatial
multi-scale features to the transformer. Their Ghost module in and temporal transformers to a series of video frames, thereby
this network partially generates the output feature map, with the identifying and linking objects across these frames. TransVOD
remainder being recovered using simple linear operations. This is has spawned several variants, each with unique features, including
a key step to alleviate the complexity of the backbone networks. In capabilities for real-time detection. PTSEFormer [104] adopts a
the context of medical image analysis, MS Transformer [94] used progressive strategy, focusing on both temporal information and
a self-supervised learning approach to perform a random mask on the objects’ spatial transitions between frames. It employs multi-
the input image, which aids in reconstructing richer features, that scale feature extraction to achieve this. Unlike other models, PT-
are less sensitive to the noise. In conjunction with a hierarchical
transformer, this approach outperforms DETR frameworks with 1. https://paperswithcode.com/sota/video-object-detection-on-imagenet-vid
10
Fig. 10: Chronology of SOD datasets with number of citations (based on Google Scholar).
TABLE 2: Commonly used datasets for SOD. NF: Not fixed.

Dataset Application Video Image Shooting Angle (Type) Resolution (pixels) #Object Classes #Instances #Images/Videos Public?
UAV123 [109] UAV Tracking ✓ Aerial Perspective(RGB) – – – 123 (>110K frames) Yes: Click Here
MRS-1800 [60] Remote Sensing ✓ Satellite based(RGB) NF 3 16,318 1800 –
SKU-110K[110] Commodity Detection ✓ Normal NF 110,712 147.4 per image 11,762 Yes: Click Here
3.4M Training
BigDetection[79] Generic ✓ Normal NF 600 36M Yes: Click Here
141K Test
Tang et al. [92] Cemical Plant Monitoring ✓ Normal – 19 – 2400 –
Xu et al. [48] UAV-based Detection ✓ Aerial (RGB) 1920×1080 2 12.5K 2K Yes: Click Here
DeepLesion [111] Lesion detection ✓ (CT) – 8 32.7K 32.1K Yes: Click Here
Udacity Self Driving Car [112] Self-Driving ✓ Normal 1920×1200 3 65K 9,423 Yes: Click Here
AMMW Dataset [113] Security Inspection ✓ Normal (AMMW) 160×400 >30 – >58K –
2 2,901 Training
URPC 2018 Dataset Underwater Detection ✓ Normal – 4 – –
800 Test
UAV dataset [97] UAV-based detection ✓ Aerial Perspective (RGB) – 7 320,624 9,630 –
Drone-vs-bird [114] Drone Detection ✓ Normal NF 2 – 77 Training 1,384 Frames Yes: Click Here
SEFormer directly regresses object queries from adjacent frames UAV123 [109]: This dataset contains 123 videos acquired with
rather than the entire dataset, offering a more localized approach. UAVs and it is one of the largest object-tracking datasets with
Sparse VOD [105] proposed an end-to-end trainable video object more than 110K frames.
detector that incorporates temporal information to propose region MRS-1800 [60]: ]: This dataset consists of a combination of
proposals. In contrast, DAFA [106] highlights the significance images from three other remote sensing datasets: DIOR [115],
of global features within a video as opposed to local temporal NWPU VHR-10 [116], and HRRSD [117]. MRD-1800 was cre-
features. DEFA showed the inefficiency of the First In First ated for the dual purpose of detection and instance segmentation,
Out (FIFO) memory structure and proposed a diversity-aware with 1800 manually annotated images which include 3 types of
memory, which uses object-level memory instead of frame-level objects: airplanes, ships, and storage tanks.
memory for the attention module. VSTAM [107] improves feature SKU-110K [110]: This dataset serves as a rigorous testbed for
quality on an element-by-element basis and then performs sparse commodity detection, featuring images captured from various
aggregation before these enhanced features are used for object supermarkets around the world. The dataset includes a range of
candidate region detection. The model also incorporates external scales, camera angles, lighting conditions, etc.
memory to take advantage of long-term contextual information. BigDetection[79]: This is a large-scale dataset that is crafted by
In the FAQ work [108], a novel video object detector is proposed integrating existing datasets and meticulously eliminating dupli-
that uses query feature aggregation in the decoder module. This is cate boxes while labeling overlooked objects. It has a balanced
different than the methods that focus on either feature aggregation number of objects across all sizes making it a pivotal resource for
in the encoder or the methods that perform post-processing for var- advancing the field object detection. Using this dataset for pre-
ious frames. The research indicates that this technique improves training and subsequently fine-tuning on MS COCO significantly
the detection performance outperforming SOTA methods. enhances performance outcomes.
Tang et al. [92]: Originating from video footage of field activities
4 R ESULTS AND B ENCHMARKS within a chemical plant, this dataset covers various types of work
such as hot work, aerial work, confined space operations, etc. It
In this section, we quantitatively and qualitatively evaluate previ- includes category labels like people, helmets, fire extinguishers,
ous works of small object detection, identifying the most effective gloves, work clothes and other relevant objects.
technique for a specific application. Prior to this comparison, we Xu et al. [48]: This publicly available dataset focuses on UAV
introduce a range of new datasets dedicated to small object detec- (Unmanned Aerial Vehicle)-captured images and contains 2K im-
tion, including both videos and images for diverse applications. ages aimed at detecting both pedestrians and vehicles. The images
were collected using a DJI drone and feature diverse conditions
such as varying light levels and densely parked vehicles.
4.1 Datasets DeepLesion [111]: Comprising CT scans from 4,427 patients, this
In this subsection, in addition to the widely used MS COCO dataset ranks among the largest of its kind. It includes a variety
dataset, we compile and present 12 new SOD datasets. These new of lesion types, such as pulmonary nodules, bone abnormalities,
datasets are primarily tailored for specific applications excluding kidney lesions, and enlarged lymph nodes. The objects of interest
the generic and maritime environments (which have been covered in these images are typically small and accompanied by noise,
in our previous survey [11]). Figure 10 displays the chronological making their identification challenging.
order of these datasets along with their citation count as of June Udacity Self Driving Car [112]: Designed solely for educational
15 2023, according to Google Scholar. use, this dataset features driving scenarios in Mountain View and
11
TABLE 3: Detection performance (%) for small-scale objects on MS COCO image dataset [2]. The top section shows results for CNN-based
techniques, the middle section shows results for mixed architectures, and the bottom section presents from transformer-only networks. DC5:
Dilated C5 stage, MS: Multi-scale network, IBR: Iterative bounding box refinement, TS: Two-stage detection, DCN: Deformable convnets,
TTA: Test time augmentation, BD: Pre-trained on BigDetection dataset, IN: Pre-trained on ImageNet, OB: Pre-trained on Object-365 [118].
While ∗ shows the results for COCO test-dev, the other values are reported for COCO val set.
Model Backbone GFLOPS↓/FPS ↑ #params↓ mAP@[0.5,0.95] ↑ Epochs↓ URL

Faster RCNN-DC5 (NeurIPS2015)[24] ResNet50 320/16 166M 21.4 37 Link
Faster RCNN-FPN (NeurIPS2015)[24] ResNet50 180/26 42M 24.2 37 Link
Faster RCNN-FPN (NeurIPS2015)[24] ResNet101 246/20 60M 25.2 – Link
RepPoints v2-DCN-MS (NeurIPS2020)[119] ResNeXt101 –/– – 34.5∗ 24 Link
FCOS (ICCV2019)[56] ResNet50 177/17 – 26.2 36 Link
CBNet V2-DCN(ATSS[120]) (TIP2022)[70] Res2Net101 –/– 107M 35.7∗ 20 Link
CBNet V2-DCN(Cascade RCNN) (TIP2022)[70] Res2Net101 –/– 146M 37.4∗ 32 Link
DETR (ECCV2020)[31] ResNet50 86/28 41M 20.5 500 Link
DETR-DC5 (ECCV2020)[31] ResNet50 187/12 41M 22.5 500 Link
DETR (ECCV2020)[31] ResNet101 52/20 60M 21.9 – Link
DETR-DC5 (ECCV2020)[31] ResNet101 253/10 60M 23.7 – Link
ViT-FRCNN (arXiv2020)[32] – –/– – 17.8 – –
RelationNet++ (NeurIPS2020)[35] ResNeXt101 –/– – 32.8∗ – Link
RelationNet++-MS (NeurIPS2020)[35] ResNeXt101 –/– – 35.8∗ – Link
Deformable DETR (ICLR2021)[41] ResNet50 173/19 40M 26.4 50 Link
Deformable DETR-IBR (ICLR2021)[41] ResNet50 173/19 40M 26.8 50 Link
Deformable DETR-TS (ICLR2021)[41] ResNet50 173/19 40M 28.8 50 Link
Deformable DETR-TS-IBR-DCN (ICLR2021)[41] ResNeXt101 –/– – 34.4∗ – Link
Dynamic DETR (ICCV2021)[44] ResNet50 –/– – 28.6∗ – –
Dynamic DETR-DCN (ICCV2021)[44] ResNeXt101 –/– – 30.3∗ – –
TSP-FCOS (ICCV2021)[55] ResNet101 255/12 – 27.7 36 Link
TSP-RCNN (ICCV2021)[55] ResNet101 254/9 – 29.9 96 Link
Mask R-CNN (ICCV2021)[57] Conformer-S/16 457.7/– 56.9M 28.7 12 Link
Conditional DETR-DC5 (ICCV2021)[65] ResNet101 262/– 63M 27.2 108 Link
SOF-DETR (2022JVCIR) [95] ResNet50 –/– – 21.7 – Link
DETR++ (arXiv2022)[91] ResNet50 –/– – 22.1 – –
TOLO-MS (NCA2022) [69] – –/57 – 24.1 – –
Anchor DETR-DC5 (AAAI2022) [46] ResNet101 –/– – 25.8 50 Link
DESTR-DC5 (CVPR2022)[68] ResNet101 299/– 88M 28.2 50 –
Conditional DETR v2-DC5 (arXiv2022)[66] ResNet101 228/– 65M 26.3 50 –
Conditional DETR v2 (arXiv2022)[66] Hourglass48 521/– 90M 32.1 50 –
FP-DETR-IN (ICLR2022) [80] – –/– 36M 26.5 50 Link
DAB-DETR-DC5 (arXiv2022)[67] ResNet101 296/– 63M 28.1 50 Link
Ghostformer-MS (Sensors2022)[93] GhostNet –/– – 29.2 100 –
CF-DETR-DCN-TTA (AAAI2022)[47] ResNeXt101 –/– – 35.1∗ – –
CBNet V2-TTA (CVPR2022)[79] Swin Transformer-base –/– – 41.7 – Link
CBNet V2-TTA-BD (CVPR2022)[79] Swin Transformer-base –/– – 42.2 – Link
DETA (arXiv2022)[74] ResNet50 –/13 48M 34.3 24 Link
DINO (arXiv2022)[88] ResNet50 860/10 47M 32.3 12 Link
CO-DINO Deformable DETR-MS-IN (arXiv2022)[87] Swin Transformer-large –/– – 43.7 36 Link
HYNETER (ICASSP2023)[62] Hyneter-Max –/– 247M 29.8∗ – –
DeoT (JRTIP2023) [85] ResNet101 217/14 58M 31.4 34 –
ConformerDet-MS (TPAMI2023) [58] Conformer-B –/– 147M 35.3 36 Link
YOLOS (NeurIPS2021)[49] DeiT-base –/3.9 100M 19.5 150 Link
DETR(ViT) (arXiv2021)[42] Swin Transformer-base –/9.7 100M 18.3 50 Link
Deformable DETR(ViT) (arXiv2021)[42] Swin Transformer-base –/4.8 100M 34.5 50 Link
ViDT (arXiv2022)[42] Swin Transformer-base –/9 100M 30.6 50 Link
DFFT (ECCV2022) [53] DOT-medium 67/– – 25.5 36 Link
CenterNet++-MS (arXiv2022) [40] Swin Transformer-large –/– – 38.7∗ – Link
DETA-OB (arXiv2022)[74] Swin Transformer-large –/4.2 – 46.1∗ 24 Link
Group DETR v2-MS-IN-OB (arXiv2022) [81] ViT-Huge –/– 629M 48.4∗ – –
Best Results NA DETR FP-DETR Group DETR v2 DINO NA
nearby cities captured at a 2Hz image acquisition rate. The cate- dataset are sedans, people, motors, bicycles, trucks, buses, and
gory labels within this dataset include cars, trucks, and pedestrians. tricycles.
AMMW Dataset [113]: Created for security applications, this Drone-vs-bird [114]: This video dataset aims to address the
active millimetre-wave image dataset includes more than 30 dif- security concerns of drones flying over sensitive areas. It offers
ferent types of objects. These include two kinds of lighters (made labeled video sequences to differentiate between birds and drones
of plastic and metal), a simulated firearm, a knife, a blade, a bullet under various illumination, lighting, weather, and background
shell, a phone, a soup, a key, a magnet, a liquid bottle, an absorbent conditions.
material, a match, and so on. A summary of these datasets, including their applications, type,
URPC 2018 Dataset: This underwater image dataset includes four resolutions, number of classes/instances/images/frame, and a link
types of objects: holothurian, echinus, scallop and starfish [121]. to their webpage, is provided in Table 2.
UAV dataset [97]: This image dataset includes more than 9K
images captured via UAVs in different weather and lighting
conditions and various complex backgrounds. The objects in this
12
Input
CBNet-
V2
DETA-
OB
DINO
ViDT
Fig. 11: Examples of detection results on COCO dataset [2] for transformer-based SOTA small object detectors compared with Convolutional
networks.
4.2 Benchmarks in Vision Applications algorithms using the COCO 2017 training and validation sets, they
are not restricted to these subsets.
In this subsection, we introduce various vision-based applications
where the detection performance of small objects is vital. For each In Table 3, we examine and evaluate the performance of all
application, we select one of the most popular datasets and report the techniques under review that have reported their results on
its performance metrics, along with details of the experimental MS COCO (compiled from their papers). The table provides
setup. information on the backbone architecture, GFLOPS/FPS (indi-
cating the computational overhead and execution speed), number
of parameters (indicating the scale of the model), mAP (mean
4.2.1 Generic Applications average precision: a measure of object detection performance),
For generic applications, we evaluate the performance of all small and epochs (indicating the inference time and convergence prop-
object detectors on the challenging MS COCO benchmark [2]. erties). Additionally, a link to each method’s webpage is provided
The choice of this dataset is based on its wide acceptance in the for further information. The methods are categorized into three
object detection field and the accessibility of performance results. groups: CNN-based, mixed, and transformer only methods. The
The MS COCO dataset consists of approximately 160K images top-performing methods for each metric are shown in the table’s
across 80 categories. While the authors are advised to train their last row. It should be noted that this comparison was only feasible
13
DETR
Faster
RCNN
SSD
Fig. 11: Examples of detection results on COCO dataset [2] for transformer-based SOTA small object detectors compared with Convolutional
networks.
for methods that have reported values for each specific metric. In RCNN performs better, it still produces low-confidence bounding
instances where there is a tie, the method with the highest mean boxes and occasionally assigns incorrect labels.
average precision was deemed the best. The default mAP values In contrast, DETR has the tendency to over-estimate the num-
are for the ”COCO 2017 val” set, while those for the ”COCO ber of objects, leading to multiple bounding boxes for individual
test-dev” set are marked with an asterisk. Please be aware that the objects. It is commonly noted t that DETR is prone to generating
reported mAP is only for objects with area< 322 . false positives. Finally, among the methods evaluated, CBNet V2
Upon examining Table 3, it is obvious that most techniques stands out for its superior performance. As observed, it produces
benefit from using a mix of CNN and transformer architectures, high confidence scores for the objects it detects, even though it
essentially adopting hybrid strategies. Notably, Group DETR v2 may occasionally misidentify some objects.
which relies solely on a transformer-based architecture, attains a
mAP of 48.4%. However, achieving such a performance requires 4.2.2 Small Object Detection in Aerial Images
the adoption of additional techniques such as pre-training on two Another interesting use of detecting small objects is in the area
large-scale datasets and multi-scale learning. In terms of conver- of remote sensing. This field is particularly appealing because
gence, DINO outperforms by reaching stable results after just many organizations and research bodies aim to routinely monitor
12 epochs, while also securing a commendable mAP of 32.3%. the Earth’s surface through aerial images to collect both national
Conversely, the original DETR model has the fastest inference and international data for statistics. While these images can be
time and the lowest GFLOPS. FP-DETR stands out for having the acquired using various modalities, this survey focuses only on
lightest network with only 36M parameters. non-SAR images. This is because SAR images have extensively
Drawing from these findings, we conclude that pre-training been researched and deserve their own separate study. Nonethe-
and multi-scale learning emerge as the most effective strategies less, the learning techniques discussed in this survey could also be
for excelling in small object detection. This may be attributed to applicable to SAR images.
the imbalance in downstream tasks and the lack of informative In aerial images, objects often appear small due to their
features in small objects. significant distance from the camera. The bird’s-eye view also
Figure 11 which spans two pages, along with its more detailed adds complexity to the task of object detection, as objects can be
counterpart in Figure 12, illustrates the detection results of various situated anywhere within the image. To assess the performance of
transformers and CNN-based methods. These are compared to transformer-based detectors designed for such applications, we se-
each other using selected images from the COCO dataset and lected the DOTA image dataset [122], which has become a widely
implemented by us using their public models available on their used benchmark in the field of object detection. Figure 13 displays
GitHub pages. The analysis reveals that Faster RCNN and SSD some sample images from the DOTA dataset featuring small
fall short in accurately detecting small objects. Specifically, SSD objects. The dataset includes a predefined Training set, Validation
either misses most objects or generates numerous bounding boxes set, and Testing set. In comparison to generic applications, this
with false labels and poorly located bounding boxes. While Faster particular application has received relatively less attention from
14
Fig. 12: Detection results on a sample image when zoomed in. First row from the left: Input image, SSD, Faster RCNN, DETR. Second row
from the left: ViDT, DETA-OB, DINO, CBNet v2.
TABLE 4: Detection performance (%) for small-scale objects on DOTA image dataset [122]. The top section shows results for CNN-based
techniques, the middle section shows results for mixed architectures. MS: Multi-scale network, FT: Fine-tuned, FPN: Feature pyramid network,
IN: Pre-trained on ImageNet.
Model Backbone FPS ↑ #params↓ mAP ↑ Epochs↓ URL

Rotated Faster RCNN-MS (NeurIPS2015)[24] ResNet101 – 64M 67.71 50 Link
SSD (ECCV2016) [20] – – – 56.1 – Link
RetinaNet-MS (ICCV2017)[21] ResNet101 – 59M 66.53 50 Link
ROI-Transformer-MS-IN (CVPR2019) [123], [124] ResNet50 – – 80.06 12 Link
Yolov5 (2020)[17] – 95 – 64.5 – Link
ReDet-MS-FPN (CVPR2021)[125] ResNet50 – – 80.1 – Link
O2 DETR-MS (arXiv2021)[45] ResNet101 – 63M 70.02 50 –
O2 DETR-MS-FT (arXiv2021)[45] ResNet101 – – 76.23 62 –
O2 DETR-MS-FPN-FT (arXiv2021)[45] ResNet50 – – 79.66 – –
SPH-Yolov5 (RS2022) [73] Swin Transformer-base 51 – 71.6 150 –
AO2-DETR-MS (TCSVT2022)[90] ResNet50 – – 79.22 – Link
MDCT (RS2023)[96] – – – 75.7 – –
ReDet-MS-IN (arXiv2023)[124] ViTDet, ViT-B – – 80.89 12 Link
Best Results NA Yolov5 RetinaNet ReDet-MS-IN ReDet-MS-IN NA
transformer experts. However, as indicated in Table 4 (results are

compiled from papers), ReDet distinguishes itself throuh its multi-
scale learning strategy and pre-training on the ImageNet dataset,
achieving the highest precision value (80.89%) and requiring only
12 training epochs. This mirrors the insights gained from the
COCO dataset analysis, suggesting that optimal performance can
be attained by addressing imbalances in downstream tasks and
including informative features from small objects.
4.2.3 Small Object Detection in Medical Images

In the field of medical imaging, specialists are often tasked with
early detection and identification of anomalies. Missing even
barely visible or small abnormal cells can lead to serious repercus-
Fig. 13: Example of small objects in DOTA image dataset.
sions for patients, including cancer and life-threatening conditions.
These small-sized objects can be found as abnormalities in the
15
Fig. 14: Example of small abnormalities in DeepLesion image dataset [111].
Fig. 15: Examples of low quality images in URPC2018 image dataset.
TABLE 5: Detection performance (%) for DeepLesion CT image TABLE 6: Detection performance (%) for URPC2018 dataset [121].
dataset [111]. The top section shows results for CNN-based tech- The top section shows results for CNN-based techniques, the middle
niques, the middle section shows results for mixed architectures. section shows results for mixed architectures.
Model Accuracy ↑ mAP0.5 ↑ Model #Params↓ mAP@[0.5,0.95] ↑ mAP0.5 ↑

Faster RCNN (NeurIPS2015)[24] 83.3 83.3 Faster RCNN (NeurIPS2015)[24] 33.6M 16.4 –
Cascade RCNN (CVPR2018)[28] 68.9M 16 –
Yolov5 [17] 85.2 88.2 Dynamic RCNN (ECCV2020) [127] 41.5M 13.3 –
DETR (ECCV2020)[31] 86.7 87.8 Yolov3 [17] 61.5M 19.4 –
Swin Transformer 82.9 81.2 RoIMix (ICASSP2020) [121] – – 74.92
MS Transformer (CIN2022)[94] 90.3 89.6 HTDet (RS2023) [86] 7.7M 22.8 –
Best Results MS Transformer MS Transformer Best Results HTDet HTDet RoIMix
retina of diabetic patients, early tumors, vascular plaques, etc. ecological surveillance, equipment maintenance, and monitoring
Despite the critical nature and potential life-threatening impact of wreck fishing. Factors like scattering and light absorption of
of this research area, only a handful of studies have tackled the the water, make the SOD task even more challenging. Example
challenges associated with detecting small objects in this crucial images of such challenging environments are displayed in Figure
application. For those interested in this topic, the DeepLesion CT 15. Transformer-based detection methods should not only be adept
image dataset [111] has been selected as the benchmark due to the at identifying small objects but also need to be robust against the
availability of the results for this particular dataset [126]. Sample poor image quality found in deep waters, as well as variations in
images from this dataset are shown in Figure 14. This dataset color channels due to differing rates of light attenuation for each
is divided into three sets: training (70%), validation (15%), and channel.
test (15%) sets [94]. Table 5 compares the accuracy and mAP Table 6 shows the performance metrics reported in existing
of three transformer-based studies against both two-stage and studies for this dataset (results are compiled from their papers).
one-stage detectors (results are compiled from their papers). The HTDet is the sole transformer-based technique identified for
MS Transformer emerges as the best technique with this dataset, this specific application. It significantly outperforms the SOTA
albeit with limited competition. Its primary innovation lies in self- CNN-based method by a huge margin (3.4% in mAP). However,
supervised learning and the incorporation of a masking mechanism the relatively low mAP scores confirm that object detection in
within a hierarchical transformer model. Overall, with an accuracy underwater images remains a difficult task. It is worth noting that
of 90.3% and an mAP of 89.6%, this dataset appears to be less the training set of the URPC 2018 contains 2901 labeled images,
challenging compared to other medical imaging tasks, especially and the testing set contains 800 unlabeled images [86].
considering that some tumor detection tasks are virtually invisible
to the human eyes. 4.2.5 Small Object Detection in Active Milli-Meter Wave
Images
4.2.4 Small Object Detection in Underwater Images Small objects can easily be concealed or hidden from normal RGB
With the growth of underwater activities, the demand to monitor cameras, for example, within a person’s clothing at an airport.
hazy and low-light environments has increased for purposes like Therefore, active imaging techniques are essential for security
16
Fig. 16: Examples of detection results on AMMW image dataset [113] for SOTA small object detectors (figure from [71]).
purposes. In these scenarios, multiple images are often captured with other SOTA CNN-based techniques. Combining images from
from different angles to enhance the likelihood of detecting different angles largely helps to identify even small objects within
even minuscule objects. Interestingly, much like in the field of this imaging approach. For training and testing, 35426 and 4019
medical imaging, transformers are rarely used for this particular images were used, respectively [71].
application.
In our study, we focused on the detection performance of 4.2.6 Small Object Detection in Videos
existing techniques using the AMMW Dataset [113] as shown The field of object detection in videos gained considerable atten-
in Table 7 (results are compiled from their papers). We have tion recently, as the temporal information in videos can improve
identified that MATR emerged as the sole technique that combines the detection performance. To benchmark the SOTA techniques,
transformer and CNNs for this dataset. Despite being the only the ImageNet VID dataset has been used with results specifically
transformer-based technique, it could significantly improve the focused on the dataset’s small objects. This dataset includes 3862
SOD performance (5.49% ↑ in mAP0.5 with respect to Yolov5 and training videos and 555 validation videos with 30 classes of
4.22 % ↑ in mAP@[0.5,0.95] with respect to TridentNet) with the objects. Table 8 reports the mAP of several recently developed
same backbone (ResNet50). Figure 16 visually compares MATR transformer-based techniques (results are compiled from their
17
TABLE 7: Detection performance (%) for AMWW image dataset
important to acknowledge the trade-offs involved. These include
[113]. The top section shows results for CNN-based techniques, the
middle section shows results for mixed architectures. a large number of parameters (in the order of billions), several
days of training (a few hundred epochs), and pretraining on
Model Backbone mAP0.5 ↑ mAP@[0.5,0.95] ↑ extremely large datasets (which is not feasible without powerful
Faster RCNN (NeurIPS2015)[24] ResNet50 70.7 26.83 computational resources). All of these aspects pose limitations
Cascade RCNN (CVPR2018)[28] ResNet50 74.7 27.8
TridentNet (ICCV2019) [128] ResNet50 77.3 29.2 on the pool of users who can train and test these techniques for
Dynamic RCNN (ECCV2020) [127] ResNet50 76.3 27.6 their downstream tasks. It is now more important than ever to
Yolov5 [17] ResNet50 76.67 28.48
MATR (TCSVT2022) [71] ResNet50 82.16 33.42
recognize the need for lightweight networks with efficient learning
Best Results NA MATR MATR paradigms and architectures. Despite the number of parameters
is now on par with the human brain, the performance in small
TABLE 8: Detection performance (%) for ImageNet VID dataset object detection still lags considerably behind human capabilities,
[98] for small objects. The top section shows results for CNN-based underscoring a significant gap in current research.
techniques, the middle section shows results for mixed architectures.
Furthermore, based on the findings presented in Figures 11
PT: Pre-trained on MS COCO.
and 12, we have identified two primary challenges in small
Model Backbone mAP@[0.5,0.95] ↑ object detection: missing objects or false negatives, and redundant
Faster RCNN (NeurIPS2015)[24]+SELSA[129] ResNet50 8.5 detected boxes. The issue of missing objects is likely attributable
Deformable-DETR-PT [41] ResNet50 10.5
Deformable-DETR[41]+TransVOD-PT[103] ResNet50 11 to the limited information embedded in the tokens. This can be
DAB-DETR[67]+FAQ-PT[108] ResNet50 12 resolved by using high-resolution images or by enhancing feature
Deformable-DETR[41]+FAQ-PT[108] ResNet50 13.2
Best Results NA Deformable-DETR+FAQ pyramids although this comes with the drawback of increased
latency—which could potentially be offset by using more efficient,
lightweight networks. The problem of repeated detections has tra-
papers). While transformers are increasingly being used in video ditionally been managed through post-processing techniques such
object detection, their performance in SOD remains less explored. as Non-Maximum Suppression (NMS). However, in the context
Among the methods that have reported SOD performance on of transformers, this issue should be approached by minimizing
the ImageNet VID dataset, Deformable DETR with FAQ stands object query similarity in the decoder, possibly through the use of
out for achieving the highest performance- although it is notably auxiliary loss functions.
low at 13.2 % for mAP@[0.5,0.95] ). This highlights a significant We also examined studies that employ transformer-based
research gap in the area of video-based SOD. methods specifically dedicated to Small Object Detection (SOD)
across a range of vision-based tasks. These include generic detec-
tion, detection in aerial images, abnormality detection in medical
5 D ISCUSSION images, small hidden object detection in active millimeter-wave
In this survey article, we explored how transformer-based ap- images for security purposes, underwater object detection, and
proaches can address the challenges of SOD. Our taxonomy small object detection in videos. Apart from generic and aerial
divides transformer-based small object detectors into seven main image applications, transformers are underdeveloped in other
categories: object representation, fast attention (useful for high- applications, echoing observations made in Rekavandi et al. [11]
resolution and multi-scale feature maps), architecture and block regarding maritime detection. This is particularly surprising given
modification, spatio-temporal information, improved feature rep- the potentially significant impact transformers could have in life-
resentation, auxiliary techniques, and fully transformer-based de- critical fields like medical imaging.
tectors.
When juxtaposing this taxonomy with the one for CNN-
based techniques [11], we observe that some of these categories 6 C ONCLUSION
overlap, while others are unique to transformer-based techniques. This survey paper reviewed over 60 research papers that focus on
Certain strategies are implicitly embedded into transformers, such the development of transformers for the task of small object de-
as attention and context learning, which are performed via the tection, including both purely transformer-based and hybrid tech-
self and cross-attention modules in the encoder and decoder. On niques that integrate CNNs. These techniques have been examined
the other hand, multi-scale learning, auxiliary tasks, architecture from seven different perspectives: object representation, fast atten-
modification, and data augmentation are commonly used in both tion mechanisms for high-resolution or multi-scale feature maps,
paradigms. However, it is important to note that while CNNs architecture and block modifications, spatio-temporal information,
handle spatio-temporal analysis through 3D-CNN, RNN, or fea- improved feature representation, auxiliary techniques, and fully
ture aggregation over time, transformers achieve this by using transformer-based detection. Each of these categories includes
successive spatial and temporal transformers or by updating object several state-of-the-art (SOTA) techniques, each with its own
queries for successive frames in the decoder. set of advantages. We also compared these transformer-based
We have observed that pre-training and multi-scale learning approaches to CNN-based frameworks, discussing the similarities
stand out as the most commonly adopted strategies, contributing to and differences between the two. Furthermore, for a range of
state-of-the-art performance across various datasets performance vision applications, we introduced well-established datasets that
on different datasets. Data fusion is another approach widely used serve as benchmarks for future research. Additionally, 12 datasets
for SOD. In the context of video-based detection systems, the that have been used in SOD applications are discussed in de-
focus is on effective methods for collecting temporal data and tail, providing convenience for future research efforts. In future
integrating it into the frame-specific detection module. research, the unique challenges associated with the detection of
While transformers have brought about substantial advance- small objects in each application could be explored and addressed.
ments in the localization and classification of small objects, it is Fields like medical imaging and underwater image analysis stand
18
to gain significantly from the use of transformer models. Addition- [20] W. Liu et al., “Ssd: Single shot multibox detector,” in ECCV. Springer,
ally, rather than increasing the complexity of transformers using 2016, pp. 21–37.
[21] T.-Y. Lin et al., “Focal loss for dense object detection,” in ICCV, 2017,
larger models, alternative strategies could be explored to boost pp. 2980–2988.
performance. [22] K. He et al., “Spatial pyramid pooling in deep convolutional networks
for visual recognition,” TPAMI, vol. 37, no. 9, pp. 1904–1916, 2015.
[23] R. Girshick, “Fast r-cnn,” in ICCV, 2015, pp. 1440–1448.
7 ACKNOWLEDGMENT [24] S. Ren et al., “Faster r-cnn: Towards real-time object detection with
region proposal networks,” NeurIPS, vol. 28, 2015.
We thank Likun Cai for providing the detection results for CBNet [25] J. Dai et al., “R-FCN: Object detection via region-based fully convolu-
v2 in test images given in Figure 11 and 12. This research was tional networks,” NeurIPS, vol. 29, 2016.
partially supported by the Australian Research Council (ARC [26] K. He et al., “Mask r-cnn,” in ICCV, 2017, pp. 2961–2969.
DP210101682, DP210102674) and Defence Science and Technol- [27] T.-Y. Lin et al., “Feature pyramid networks for object detection,” in
CVPR, 2017, pp. 2117–2125.
ogy Group (DSTG) for the project “Low Observer Detection of [28] Z. Cai and N. Vasconcelos, “Cascade r-cnn: high quality object detec-
Small Objects in Maritime Scenes”. tion and instance segmentation,” TPAMI, vol. 43, no. 5, pp. 1483–1498,
2021.
[29] J. Pang et al., “Libra R-CNN: Towards balanced learning for object
R EFERENCES detection,” in CVPR, 2019, pp. 821–830.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
[1] Y. Liu, P. Sun, N. Wergeles, and Y. Shang, “A survey and performance Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in
evaluation of deep learning methods for small object detection,” Expert neural information processing systems, vol. 30, 2017.
Systems with Applications, vol. 172, p. 114602, 2021. [31] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
[2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, S. Zagoruyko, “End-to-end object detection with transformers,” in
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in Computer Vision–ECCV 2020: 16th European Conference, Glasgow,
context,” in European Conference on Computer Vision. Springer, 2014, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 2020, pp.
pp. 740–755. 213–229.
[3] J. Wu, C. Zhou, Q. Zhang, M. Yang, and J. Yuan, “Self-mimic learning [32] J. Beal, E. Kim, E. Tzeng, D. H. Park, A. Zhai, and D. Kislyuk, “Toward
for small-scale pedestrian detection,” in Proceedings of the 28th ACM transformer-based object detection,” arXiv preprint arXiv:2012.09958,
International Conference on Multimedia, 2020, pp. 2012–2020. 2020.
[4] S. Rashidi, K. Ehinger, A. Turpin, and L. Kulik, “Optimal visual search [33] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
based on a model of target detectability in natural images,” Advances in T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al.,
Neural Information Processing Systems, vol. 33, pp. 9288–9299, 2020. “An image is worth 16x16 words: Transformers for image recognition
[5] S. W. Cho, N. R. Baek, M. C. Kim, J. H. Koo, J. H. Kim, and at scale,” arXiv preprint arXiv:2010.11929, 2020.
K. R. Park, “Face detection in nighttime images using visible-light [34] R. Stewart, M. Andriluka, and A. Y. Ng, “End-to-end people detection
camera sensors with two-step faster region-based convolutional neural in crowded scenes,” in Proceedings of the IEEE conference on computer
network,” Sensors, vol. 18, no. 9, p. 2995, 2018. vision and pattern recognition, 2016, pp. 2325–2333.
[6] Z. Liu, J. Du, F. Tian, and J. Wen, “Mr-cnn: A multi-scale region-based [35] C. Chi, F. Wei, and H. Hu, “Relationnet++: Bridging visual representa-
convolutional neural network for small traffic sign recognition,” IEEE tions for object detection via transformer decoder,” Advances in Neural
Access, vol. 7, pp. 57 120–57 128, 2019. Information Processing Systems, vol. 33, pp. 13 564–13 574, 2020.
[7] D. Yudin and D. Slavioglo, “Usage of fully convolutional network [36] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv
with clustering for traffic light detection,” in 2018 7th Mediterranean preprint arXiv:1904.07850, 2019.
Conference on Embedded Computing (MECO). IEEE, 2018, pp. 1–6. [37] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, “Reppoints: Point set
[8] L. A. Varga and A. Zell, “Tackling the background bias in sparse representation for object detection,” in Proceedings of the IEEE/CVF
object detection via cropped windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9657–9666.
International Conference on Computer Vision, 2021, pp. 2768–2777.
[38] J. Wang, C. Xu, W. Yang, and L. Yu, “A normalized gaus-
[9] A. M. Rekavandi, A.-K. Seghouane, and R. J. Evans, “Robust subspace
sian wasserstein distance for tiny object detection,” arXiv preprint
detectors based on α-divergence with application to detection in imag-
arXiv:2110.13389, 2021.
ing,” IEEE Transactions on Image Processing, vol. 30, pp. 5017–5031,
[39] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,”
2021.
in Proceedings of the European conference on computer vision (ECCV),
[10] A. Torralba, “Contextual priming for object detection,” International
2018, pp. 734–750.
Journal of Computer Vision, vol. 53, pp. 169–191, 2003.
[11] A. M. Rekavandi, L. Xu, F. Boussaid, A.-K. Seghouane, S. Hoefs, [40] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, “Centernet++
and M. Bennamoun, “A guide to image and video based small object for object detection,” arXiv preprint arXiv:2204.08394, 2022.
detection using deep learning: Case study of maritime surveillance,” [41] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr:
arXiv preprint arXiv:2207.12926, 2022. Deformable transformers for end-to-end object detection,” ICLR, 2021.
[12] G. Cheng, X. Yuan, X. Yao, K. Yan, Q. Zeng, X. Xie, and J. Han, [42] H. Song, D. Sun, S. Chun, V. Jampani, D. Han, B. Heo, W. Kim, and
“Towards large-scale small object detection: Survey and benchmarks,” M.-H. Yang, “Vidt: An efficient and effective fully transformer-based
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. object detector,” arXiv preprint arXiv:2110.03921, 2022.
[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look [43] Q. Li, Y. Chen, and Y. Zeng, “Transformer with transfer cnn for remote-
once: Unified, real-time object detection,” in Proceedings of the IEEE sensing-image object detection,” Remote Sensing, vol. 14, no. 4, p. 984,
Conference on Computer Vision and Pattern Recognition, 2016, pp. 2022.
779–788. [44] X. Dai, Y. Chen, J. Yang, P. Zhang, L. Yuan, and L. Zhang, “Dynamic
[14] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in detr: End-to-end object detection with dynamic attention,” in Proceed-
CVPR, 2017, pp. 7263–7271. ings of the IEEE/CVF International Conference on Computer Vision,
[15] ——, “Yolov3: An incremental improvement,” arXiv preprint 2021, pp. 2988–2997.
arXiv:1804.02767, 2018. [45] T. Ma, M. Mao, H. Zheng, P. Gao, X. Wang, S. Han, E. Ding, B. Zhang,
[16] A. Bochkovskiy et al., “Yolov4: Optimal speed and accuracy of object and D. Doermann, “Oriented object detection with transformer,” arXiv
detection,” arXiv preprint arXiv:2004.10934, 2020. preprint arXiv:2106.03146, 2021.
[17] G. Jocher et al., “yolov5,” Code repository https://github. [46] Y. Wang, X. Zhang, T. Yang, and J. Sun, “Anchor detr: Query design
com/ultralytics/yolov5, 2020. for transformer-based object detection,” AAAI, 2022.
[18] C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, [47] X. Cao, P. Yuan, B. Feng, and K. Niu, “Cf-detr: Coarse-to-fine trans-
W. Nie et al., “Yolov6: A single-stage object detection framework for formers for end-to-end object detection,” in Proceedings of the AAAI
industrial applications,” arXiv preprint arXiv:2209.02976, 2022. Conference on Artificial Intelligence, vol. 36, no. 1, 2022, pp. 185–193.
[19] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “Yolov7: Trainable [48] W. Xu, C. Zhang, Q. Wang, and P. Dai, “Fea-swin: Foreground en-
bag-of-freebies sets new state-of-the-art for real-time object detectors,” hancement attention swin transformer network for accurate uav-based
arXiv preprint arXiv:2207.02696, 2022. dense object detection,” Sensors, vol. 22, no. 18, p. 6993, 2022.
19
[49] Y. Fang, B. Liao, X. Wang, J. Fang, J. Qi, R. Wu, J. Niu, and W. Liu, [69] R. Xia, G. Li, Z. Huang, Y. Pang, and M. Qi, “Transformers only look
“You only look at one sequence: Rethinking transformer in vision once with nonlinear combination for real-time object detection,” Neural
through object detection,” Advances in Neural Information Processing Computing and Applications, vol. 34, no. 15, pp. 12 571–12 585, 2022.
Systems, vol. 34, pp. 26 183–26 197, 2021. [70] T. Liang, X. Chu, Y. Liu, Y. Wang, Z. Tang, W. Chu, J. Chen, and
[50] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, H. Ling, “Cbnet: A composite backbone network architecture for object
“Swin transformer: Hierarchical vision transformer using shifted win- detection,” IEEE Transactions on Image Processing, vol. 31, pp. 6893–
dows,” in Proceedings of the IEEE/CVF international conference on 6906, 2022.
computer vision, 2021, pp. 10 012–10 022. [71] P. Sun, T. Liu, X. Chen, S. Zhang, Y. Zhao, and S. Wei, “Multi-source
[51] Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision aggregation transformer for concealed object detection in millimeter-
transformer backbones for object detection,” in Computer Vision–ECCV wave images,” IEEE Transactions on Circuits and Systems for Video
2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Technology, vol. 32, no. 9, pp. 6148–6159, 2022.
Proceedings, Part IX. Springer, 2022, pp. 280–296. [72] B. K. Isaac-Medina, C. G. Willcocks, and T. P. Breckon, “Multi-view
[52] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and vision transformers for object detection,” in 2022 26th International
H. Jégou, “Training data-efficient image transformers & distillation Conference on Pattern Recognition (ICPR). IEEE, 2022, pp. 4678–
through attention,” in International conference on machine learning. 4684.
PMLR, 2021, pp. 10 347–10 357. [73] H. Gong, T. Mu, Q. Li, H. Dai, C. Li, Z. He, W. Wang, F. Han,
[53] P. Chen, M. Zhang, Y. Shen, K. Sheng, Y. Gao, X. Sun, K. Li, and A. Tuniyazi, H. Li et al., “Swin-transformer-enabled yolov5 with
C. Shen, “Efficient decoder-free object detection with transformers,” attention mechanism for small object detection on satellite images,”
in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Remote Sensing, vol. 14, no. 12, p. 2861, 2022.
Israel, October 23–27, 2022, Proceedings, Part X. Springer, 2022, pp. [74] J. Ouyang-Zhang, J. H. Cho, X. Zhou, and P. Krähenbühl, “Nms strikes
70–86. back,” arXiv preprint arXiv:2212.06137, 2022.
[54] Y. Zhu, Q. Xia, and W. Jin, “Srdd: a lightweight end-to-end object [75] Q. Chen, X. Chen, J. Wang, H. Feng, J. Han, E. Ding, G. Zeng, and
detection with transformer,” Connection Science, vol. 34, no. 1, pp. J. Wang, “Group detr: Fast detr training with group-wise one-to-many
2448–2465, 2022. assignment,” arXiv preprint arXiv:2207.13085, vol. 1, no. 2, 2022.
[55] Z. Sun, S. Cao, Y. Yang, and K. M. Kitani, “Rethinking transformer- [76] S. Xu, J. Gu, Y. Hua, and Y. Liu, “Dktnet: Dual-key transformer network
based set prediction for object detection,” in Proceedings of the for small object detection,” Neurocomputing, 2023.
IEEE/CVF international conference on computer vision, 2021, pp.
[77] M. Maaz, H. Rasheed, S. Khan, F. S. Khan, R. M. Anwer, and M.-H.
3611–3620.
Yang, “Class-agnostic object detection with multi-modal transformer,”
[56] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one- in 17th European Conference on Computer Vision (ECCV). Springer,
stage object detection,” in Proceedings of the IEEE/CVF international 2022.
conference on computer vision, 2019, pp. 9627–9636.
[78] R. Shen, N. Inoue, and K. Shinoda, “Text-guided object detector
[57] Z. Peng, W. Huang, S. Gu, L. Xie, Y. Wang, J. Jiao, and Q. Ye, for multi-modal video question answering,” in Proceedings of the
“Conformer: Local features coupling global representations for visual IEEE/CVF Winter Conference on Applications of Computer Vision,
recognition,” in Proceedings of the IEEE/CVF international conference 2023, pp. 1032–1042.
on computer vision, 2021, pp. 367–376.
[79] L. Cai, Z. Zhang, Y. Zhu, L. Zhang, M. Li, and X. Xue, “Bigdetection:
[58] Z. Peng, Z. Guo, W. Huang, Y. Wang, L. Xie, J. Jiao, Q. Tian, and A large-scale benchmark for improved object detector pre-training,”
Q. Ye, “Conformer: Local features coupling global representations for in Proceedings of the IEEE/CVF Conference on Computer Vision and
recognition and detection,” IEEE Transactions on Pattern Analysis and Pattern Recognition, 2022, pp. 4777–4787.
Machine Intelligence, 2023.
[80] W. Wang, Y. Cao, J. Zhang, and D. Tao, “Fp-detr: Detection transformer
[59] W. Lu, C. Lan, C. Niu, W. Liu, L. Lyu, Q. Shi, and S. Wang, “A cnn-
advanced by fully pre-training,” in International Conference on Learn-
transformer hybrid model based on cswin transformer for uav image
ing Representations, 2022.
object detection,” IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, 2023. [81] Q. Chen, J. Wang, C. Han, S. Zhang, Z. Li, X. Chen, J. Chen, X. Wang,
S. Han, G. Zhang et al., “Group detr v2: Strong object detector with
[60] X. Xu, Z. Feng, C. Cao, M. Li, J. Wu, Z. Wu, Y. Shang, and S. Ye,
encoder-decoder pretraining,” arXiv preprint arXiv:2211.03594, 2022.
“An improved swin transformer-based model for remote sensing object
detection and instance segmentation,” Remote Sensing, vol. 13, no. 23, [82] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, “Imbalance problems
p. 4779, 2021. in object detection: A review,” IEEE transactions on pattern analysis
and machine intelligence, vol. 43, no. 10, pp. 3388–3415, 2020.
[61] J. Xue, D. He, M. Liu, and Q. Shi, “Dual network structure with
interweaved global-local feature hierarchy for transformer-based object [83] S. Rashidi, R. Tennakoon, A. M. Rekavandi, P. Jessadatavornwong,
detection in remote sensing image,” IEEE Journal of Selected Topics A. Freis, G. Huff, M. Easton, A. Mouritz, R. Hoseinnezhad, and A. Bab-
in Applied Earth Observations and Remote Sensing, vol. 15, pp. 6856– Hadiashar, “It-ruda: Information theory assisted robust unsupervised
6866, 2022. domain adaptation,” arXiv preprint arXiv:2210.12947, 2022.
[62] D. Chen, D. Miao, and X. Zhao, “Hyneter: Hybrid network transformer [84] K. Zeng, Q. Ma, J. Wu, S. Xiang, T. Shen, and L. Zhang, “Nlfftnet:
for object detection,” in ICASSP 2023-2023 IEEE International Con- A non-local feature fusion transformer network for multi-scale object
ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, detection,” Neurocomputing, vol. 493, pp. 15–27, 2022.
2023, pp. 1–5. [85] T. Ding, K. Feng, Y. Wei, Y. Han, and T. Li, “Deot: an end-to-end
[63] J. Ding, W. Li, L. Pei, M. Yang, C. Ye, and B. Yuan, “Sw-yolox: An encoder-only transformer object detector,” Journal of Real-Time Image
anchor-free detector based transformer for sea surface object detection,” Processing, vol. 20, no. 1, p. 1, 2023.
Expert Systems with Applications, p. 119560, 2023. [86] G. Chen, Z. Mao, K. Wang, and J. Shen, “Htdet: A hybrid transformer-
[64] H. Yang, Z. Yang, A. Hu, C. Liu, T. J. Cui, and J. Miao, “Unifying based approach for underwater small object detection,” Remote Sensing,
convolution and transformer for efficient concealed object detection in vol. 15, no. 4, p. 1076, 2023.
passive millimeter-wave images,” IEEE Transactions on Circuits and [87] Z. Zong, G. Song, and Y. Liu, “Detrs with collaborative hybrid assign-
Systems for Video Technology, 2023. ments training,” arXiv preprint arXiv:2211.12860, 2022.
[65] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang, [88] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y.
“Conditional detr for fast training convergence,” in Proceedings of the Shum, “Dino: Detr with improved denoising anchor boxes for end-to-
IEEE/CVF International Conference on Computer Vision, 2021, pp. end object detection,” arXiv preprint arXiv:2203.03605, 2022.
3651–3660. [89] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “Dn-detr:
[66] X. Chen, F. Wei, G. Zeng, and J. Wang, “Conditional detr v2: Accelerate detr training by introducing query denoising,” in Proceed-
Efficient detection transformer with box queries,” arXiv preprint ings of the IEEE/CVF Conference on Computer Vision and Pattern
arXiv:2207.08914, 2022. Recognition, 2022, pp. 13 619–13 627.
[67] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, [90] L. Dai, H. Liu, H. Tang, Z. Wu, and P. Song, “Ao2-detr: Arbitrary-
“Dab-detr: Dynamic anchor boxes are better queries for detr,” arXiv oriented object detection transformer,” IEEE Transactions on Circuits
preprint arXiv:2201.12329, 2022. and Systems for Video Technology, 2022.
[68] L. He and S. Todorovic, “Destr: Object detection with split transformer,” [91] C. Zhang, L. Liu, X. Zang, F. Liu, H. Zhang, X. Song, and
in Proceedings of the IEEE/CVF conference on computer vision and J. Chen, “Detr++: Taming your multi-scale detection transformer,”
pattern recognition, 2022, pp. 9377–9386. arXiv preprint arXiv:2206.02977, 2022.
20
[92] Y. Tang, B. Wang, W. He, and F. Qian, “Pointdet++: an object detection [114] A. Coluccia, A. Fascista, A. Schumann, L. Sommer, A. Dimou,
framework based on human local features with transformer encoder,” D. Zarpalas, F. C. Akyon, O. Eryuksel, K. A. Ozfuttu, S. O. Altinuc
Neural Computing and Applications, pp. 1–12, 2022. et al., “Drone-vs-bird detection challenge at ieee avss2021,” in 2021
[93] S. Li, F. Sultonov, J. Tursunboev, J.-H. Park, S. Yun, and J.-M. Kang, 17th IEEE International Conference on Advanced Video and Signal
“Ghostformer: A ghostnet-based two-stage transformer for small object Based Surveillance (AVSS). IEEE, 2021, pp. 1–8.
detection,” Sensors, vol. 22, no. 18, p. 6939, 2022. [115] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in
[94] Y. Shou, T. Meng, W. Ai, C. Xie, H. Liu, and Y. Wang, “Object optical remote sensing images: A survey and a new benchmark,” ISPRS
detection in medical images based on hierarchical transformer and mask journal of photogrammetry and remote sensing, vol. 159, pp. 296–307,
mechanism,” Computational Intelligence and Neuroscience, vol. 2022, 2020.
2022. [116] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolu-
[95] S. Dubey, F. Olimov, M. A. Rafique, and M. Jeon, “Improving small tional neural networks for object detection in vhr optical remote sens-
objects detection using transformer,” Journal of Visual Communication ing images,” IEEE Transactions on Geoscience and Remote Sensing,
and Image Representation, vol. 89, p. 103620, 2022. vol. 54, no. 12, pp. 7405–7415, 2016.
[96] J. Chen, H. Hong, B. Song, J. Guo, C. Chen, and J. Xu, “Mdct: [117] Y. Zhang, Y. Yuan, Y. Feng, and X. Lu, “Hierarchical and robust
Multi-kernel dilated convolution and transformer for one-stage object convolutional neural network for very high-resolution remote sensing
detection of remote sensing images,” Remote Sensing, vol. 15, no. 2, p. object detection,” IEEE Transactions on Geoscience and Remote Sens-
371, 2023. ing, vol. 57, no. 8, pp. 5535–5548, 2019.
[118] Y. Gao, H. Shen, D. Zhong, J. Wang, Z. Liu, T. Bai, X. Long, and
[97] T. Ye, W. Qin, Z. Zhao, X. Gao, X. Deng, and Y. Ouyang, “Real-time
S. Wen, “A solution for densely annotated large scale object detection
object detection network in uav-vision based on cnn and transformer,”
task,” 2019.
IEEE Transactions on Instrumentation and Measurement, vol. 72, pp.
[119] Y. Chen, Z. Zhang, Y. Cao, L. Wang, S. Lin, and H. Hu, “Reppoints v2:
1–13, 2023.
Verification meets regression for object detection,” Advances in Neural
[98] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Information Processing Systems, vol. 33, pp. 5621–5631, 2020.
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large
[120] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between
scale visual recognition challenge,” International journal of computer
anchor-based and anchor-free detection via adaptive training sample
vision, vol. 115, pp. 211–252, 2015.
selection,” in Proceedings of the IEEE/CVF conference on computer
[99] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer, “Track- vision and pattern recognition, 2020, pp. 9759–9768.
former: Multi-object tracking with transformers,” in Proceedings of [121] W.-H. Lin, J.-X. Zhong, S. Liu, T. Li, and G. Li, “Roimix: proposal-
the IEEE/CVF conference on computer vision and pattern recognition, fusion among multiple images for underwater object detection,” in
2022, pp. 8844–8854. ICASSP 2020-2020 IEEE International Conference on Acoustics,
[100] X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 2588–2592.
tracking,” in Proceedings of the IEEE/CVF conference on computer [122] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu,
vision and pattern recognition, 2021, pp. 8126–8135. M. Pelillo, and L. Zhang, “Dota: A large-scale dataset for object
[101] C. Liu, S. Xu, and B. Zhang, “Aerial small object tracking with detection in aerial images,” in Proceedings of the IEEE conference on
transformers,” in 2021 IEEE International Conference on Unmanned computer vision and pattern recognition, 2018, pp. 3974–3983.
Systems (ICUS). IEEE, 2021, pp. 954–959. [123] J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu, “Learning roi
[102] L. He, Q. Zhou, X. Li, L. Niu, G. Cheng, X. Li, W. Liu, Y. Tong, transformer for oriented object detection in aerial images,” in Pro-
L. Ma, and L. Zhang, “End-to-end video object detection with spatial- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern
temporal transformers,” in Proceedings of the 29th ACM International Recognition, 2019, pp. 2849–2858.
Conference on Multimedia, 2021, pp. 1507–1516. [124] L. Wang and A. Tien, “Aerial image object detection with vision
[103] Q. Zhou, X. Li, L. He, Y. Yang, G. Cheng, Y. Tong, L. Ma, and transformer detector (vitdet),” arXiv preprint arXiv:2301.12058, 2023.
D. Tao, “Transvod: end-to-end video object detection with spatial- [125] J. Han, J. Ding, N. Xue, and G.-S. Xia, “Redet: A rotation-equivariant
temporal transformers,” IEEE Transactions on Pattern Analysis and detector for aerial object detection,” in Proceedings of the IEEE/CVF
Machine Intelligence, 2022. Conference on Computer Vision and Pattern Recognition, 2021, pp.
[104] H. Wang, J. Tang, X. Liu, S. Guan, R. Xie, and L. Song, “Ptseformer: 2786–2795.
Progressive temporal-spatial enhanced transformer towards video object [126] J. Li, G. Zhu, C. Hua, M. Feng, B. Bennamoun, P. Li, X. Lu, J. Song,
detection,” in Computer Vision–ECCV 2022: 17th European Confer- P. Shen, X. Xu et al., “A systematic collection of medical image datasets
ence, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VIII. for deep learning,” ACM Computing Surveys, 2021.
Springer, 2022, pp. 732–747. [127] H. Zhang, H. Chang, B. Ma, N. Wang, and X. Chen, “Dynamic r-
[105] K. A. Hashmi, D. Stricker, and M. Z. Afzal, “Spatio-temporal learn- cnn: Towards high quality object detection via dynamic training,” in
able proposals for end-to-end video object detection,” arXiv preprint Computer Vision–ECCV 2020: 16th European Conference, Glasgow,
arXiv:2210.02368, 2022. UK, August 23–28, 2020, Proceedings, Part XV 16. Springer, 2020,
[106] S.-D. Roh and K.-S. Chung, “Dafa: Diversity-aware feature aggregation pp. 260–275.
for attention-based video object detection,” IEEE Access, vol. 10, pp. [128] Y. Li, Y. Chen, N. Wang, and Z. Zhang, “Scale-aware trident networks
93 453–93 463, 2022. for object detection,” in Proceedings of the IEEE/CVF international
[107] M. Fujitake and A. Sugimoto, “Video sparse transformer with attention- conference on computer vision, 2019, pp. 6054–6063.
guided memory for video object detection,” IEEE Access, vol. 10, pp. [129] H. Wu, Y. Chen, N. Wang, and Z. Zhang, “Sequence level semantics ag-
65 886–65 900, 2022. gregation for video object detection,” in Proceedings of the IEEE/CVF
International Conference on Computer Vision, 2019, pp. 9217–9225.
[108] Y. Cui, “Faq: Feature aggregated queries for transformer-based video
object detectors,” arXiv preprint arXiv:2303.08319, 2023.
[109] M. Mueller, N. Smith, and B. Ghanem, “A benchmark and simulator for
uav tracking,” in Computer Vision–ECCV 2016: 14th European Confer-
ence, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings,
Part I 14. Springer, 2016, pp. 445–461.
[110] E. Goldman, R. Herzig, A. Eisenschtat, J. Goldberger, and T. Hassner,
“Precise detection in densely packed scenes,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2019, pp. 5227–5236.
[111] K. Yan, X. Wang, L. Lu, and R. M. Summers, “Deeplesion: automated
mining of large-scale lesion annotations and universal lesion detection
with deep learning,” Journal of medical imaging, vol. 5, no. 3, pp.
036 501–036 501, 2018.
[112] “Udacity self-driving car driving data, 2017 transformer,”
https://github.com/udacity/selfdriving- car/tree/master/annotations.
[113] T. Liu, Y. Zhao, Y. Wei, Y. Zhao, and S. Wei, “Concealed object
detection for activate millimeter wave image,” IEEE Transactions on
Industrial Electronics, vol. 66, no. 12, pp. 9909–9917, 2019.

Transformers in Small Object Detection - SOTA

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Transformers in Small Object Detection - SOTA

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Transformers in Small Object Detection - SOTA

Uploaded by

Copyright:

Available Formats

1

Transformers in Small Object Detection: A

explain transformers’ success. Specifically, we aim to analyze this

TABLE 1: A list of terminologies used in this paper with their meanings.

Full Term Description

TABLE 2: Commonly used datasets for SOD. NF: Not fixed.

Model Backbone GFLOPS↓/FPS ↑ #params↓ mAP@[0.5,0.95] ↑ Epochs↓ URL

Model Backbone FPS ↑ #params↓ mAP ↑ Epochs↓ URL

transformer experts. However, as indicated in Table 4 (results are

4.2.3 Small Object Detection in Medical Images

Fig. 14: Example of small abnormalities in DeepLesion image dataset [111].

Fig. 15: Examples of low quality images in URPC2018 image dataset.

Model Accuracy ↑ mAP0.5 ↑ Model #Params↓ mAP@[0.5,0.95] ↑ mAP0.5 ↑

You might also like