Abstract
Recent advancements have significantly improved the efficiency and effectiveness of deep learning methods for image-based remote sensing tasks. However, the requirement for large amounts of labeled data can limit the applicability of deep neural networks to existing remote sensing datasets. To overcome this challenge, few-shot learning has emerged as a valuable approach for enabling learning with limited data. While previous research has evaluated the effectiveness of few-shot learning methods on satellite-based datasets, little attention has been paid to exploring the applications of these methods to datasets obtained from Unmanned Aerial Vehicles (UAVs), which are increasingly used in remote sensing studies. In this review, we provide an up-to-date overview of both existing and newly proposed few-shot classification techniques, along with appropriate datasets that are used for both satellite-based and UAV-based data. We demonstrate that few-shot learning can effectively handle the diverse perspectives in remote sensing data. As an example application, we evaluate state-of-the-art approaches on a UAV disaster scene dataset, yielding promising results. Furthermore, we highlight the significance of incorporating explainable AI (XAI) techniques into few-shot models. In remote sensing, where decisions based on model predictions can have significant consequences, such as in natural disaster response or environmental monitoring, the transparency provided by XAI is crucial. Techniques like attention maps and prototype analysis can help clarify the decision-making processes of these complex models, enhancing their reliability. We identify key challenges, including the development of flexible few-shot methods that can handle diverse remote sensing data effectively. This review aims to equip researchers with an improved understanding of few-shot learning’s capabilities and limitations in remote sensing, while pointing out open issues to guide progress in efficient, reliable and interpretable data-efficient techniques.
1 Introduction
The last few decades have seen significant advancements in remote sensing imaging technology. Remote sensing technologies nowadays encompass not only the traditional satellite-based platforms, but also data collected from Unmanned Aerial Vehicles (UAVs). Figure 1 illustrates the typical height at which such platforms navigate as well as their estimated coverage area (Xiang et al. 2018) for an urban setting. The modern airborne sensors attached to such platforms can cover and map a significant portion of the earth’s surface with better spatial and temporal resolutions, making them essential for earth-based or environmental observations like geodesy and disaster relief. Automatic analysis of remote sensing images is usually multi-modal, meaning that optical, radar, or infrared sensors could be used, and such data can be distributed geographically and globally in an increasingly efficient manner. With advances in artificial intelligence, deep learning approaches have found their way into the remote sensing community, which, together with the increase in remote sensing data availability, has enabled more effective scene understanding, object identification, and tracking.
A simple pictorial illustration of the types of remote sensing platforms, as well as the typical heights at which such platforms navigate. The values for their possible coverage area in an urban setting are also illustrated and are adapted from Xiang et al. (2018)
Convolutional Neural Networks (CNNs) have become popular in object recognition, detection, and semantic or instance segmentation of remote sensing images, typically using RGB images as input, which undergo convolution, normalization, and pooling operations. The convolution operation is effective in accounting for the local interactions between features of a pixel. While the remote sensing community has made great strides in multi-spectral satellite-based image classification, tracking, and semantic and instance segmentation, the limited receptive field of CNNs makes it difficult to model long-range dependencies in an image. Vision transformers (ViTs) were proposed to address this issue by leveraging the self-attention mechanism to capture global interactions between different parts of a sequence. ViTs have demonstrated high performance on benchmark datasets, competing with the best CNN-based methods. Consequently, the remote sensing community has rapidly proposed ViT-based methods for classifying high-resolution images. With pre-trained weights and transfer learning techniques, CNNs and ViTs can retain their classification performance at a lower computational cost, which is essential for platforms with limited computational resources such as UAVs.
However, both CNNs and ViTs require large training data samples for accurate classification, and some of these methods may not be feasible for critical tasks such as UAV search-and-rescue. It would be beneficial, for instance, if the platforms were able to quickly identify and generalize disaster scenes solely from analyzing a small subset of the captured frames. Few-shot classification approaches address the above needs; in such approaches, the goal is to enable the network to quickly generalize to unseen test classes in a more diverse manner given a small set of training images. Such a framework closely resembles how the human brain learns in real life. Like ViTs, few-shot learning has also ignited new research in remote sensing, and its applications to land cover classification in the RGB domain (Deng et al. 2021; Zhang et al. 2021) and hyperspectral classification (He et al. 2019; Zhong et al. 2021) have been observed. The approaches have also been extended to object detection (Carion et al. 2020) and segmentation (Xu et al. 2021). These emerging works are as recent as those utilizing ViTs. Since a review of ViT approaches for various domains in remote sensing (Aleissaee et al. 2022) has been reported, a review of few-shot-based approaches in remote sensing is noteworthy to keep interested researchers up to pace with recent progress in this area.
We note that a related review has already been conducted by Sun et al. (2021). A notable omission in that review is the significance of interpretable machine learning models in this field. Integrating interpretable machine learning into remote sensing image classification can further enhance the performance of CNNs and ViTs. By providing insights into the decision-making process of these models, interpretable machine learning can increase their transparency and accountability, which is particularly relevant in applications where high-stakes decisions are made based on their outputs, such as disaster response and environmental monitoring. For instance, saliency maps can be generated to highlight the regions of an image that are most relevant to the model’s decision, providing visual explanations for its predictions. Interpretable machine learning can also aid in identifying potential biases and errors in the training data, as well as enhance the robustness and generalization of the model. In remote sensing, it can further facilitate the integration of expert knowledge into the model, enabling the inclusion of physical and environmental constraints in the classification process and allowing for more informed decision-making. In short, the integration of interpretable machine learning in remote sensing image classification provides a valuable tool for enhancing the transparency, accountability, and accuracy of CNNs and ViTs, helping build trust in their outputs and facilitating their use in critical applications.
The purpose of this additional review is to address several gaps that were not covered in the previous review by Sun et al. (2021). These gaps are as follows:
-
Their exploration of remote sensing datasets focused on satellite-based imagery and the few-shot learning techniques associated with such datasets. With the emergence of UAV-based remote sensing datasets, however, works that have been applied and evaluated on such datasets have received little consideration. Furthermore, datasets and learning-based techniques associated with UAVs could particularly benefit from few-shot learning approaches: given the limited computational resources of UAV platforms, the amount of data that can be collected and processed is constrained, emphasizing the need for data-efficient learning methods.
-
Satellite-based remote sensing datasets offer a considerably wider field of view and greater coverage, allowing multiple object classes or labels to be captured simultaneously in a single scene, a setting referred to as multi-label classification. On the other hand, the smaller coverage area of UAV-based remote sensing datasets often yields data suitable only for single-label image classification. Methods designed for such single-label UAV-based settings can therefore be readily distinguished from those designed for multi-label classification on satellite-based datasets. It is essential, therefore, to take the characteristics of the remote sensing dataset into account when devising and evaluating image classification methods in this field.
-
As has been emphasized and illustrated by Sun et al. (2021), the utilization of few-shot learning-based techniques for remote sensing has been on the rise since 2012. As the aforementioned work was published in 2021, we can envisage that there will be an even greater proliferation of such approaches for remote sensing. In light of the dynamic nature of this research domain, our review aims to disseminate the most current and up-to-date information available on the topic. Through this approach, we seek to provide an improved understanding of the recent advances in few-shot learning-based methods for remote sensing, allowing for a comprehensive assessment of their potential applications and limitations.
In summary, our main contributions in this review article are as follows:
-
In this work, we present and holistically summarize the applications of few-shot learning-based approaches in both satellite-based and UAV-based remote sensing images, focusing on image classification alone, but extending the review work conducted by Sun et al. (2021) in terms of the explored remote sensing datasets. Our analysis serves to assist readers and researchers alike, allowing them to bridge gaps between current state-of-the-art image-based classification techniques in remote sensing, which may aid in promoting further progress in the field.
-
As part of our discussion of recent progress in few-shot classification for remote sensing, we examined how CNN- and transformer-based approaches can be adapted to remote sensing datasets, expanding the potential of these methods in this domain.
-
Our work delved into a thorough discussion of the challenges and research directions concerning few-shot learning in remote sensing. We aimed to identify the feasibility and effectiveness of different learning approaches in this field, focusing on their potential applications in UAV-based classification datasets. Through this approach, we sought to shed light on the potential limitations and further research needed to improve the efficacy of few-shot learning-based techniques in the domain of remote sensing, paving the way for more advanced and sophisticated classification methods to be developed in the future.
-
We also emphasized the significance of integrating XAI to improve transparency and reliability of few-shot learning-based techniques in remote sensing. Our objective was to offer researchers and practitioners a better comprehension of the possible applications and constraints of these techniques. We also aimed to identify novel research directions to devise more effective and interpretable few-shot learning-based methods for image classification in remote sensing.
The remainder of this paper is structured as follows: In Sect. 2, we provide a quick background on few-shot learning and present example networks. Sections 3, 4 and 5 provide brief highlights of the types of remote sensing data, benchmark datasets commonly used, and common evaluation metrics, respectively. Section 6 delves into some up-to-date existing works on few-shot classification in the hyperspectral, Very High Resolution (VHR), and Synthetic Aperture Radar (SAR) data domains. In Sect. 7, we outline some implications and limitations of current approaches, and in Sect. 8, we quantitatively evaluate some existing methods on a UAV-based dataset, demonstrating the feasibility of such approaches for UAV applications. Finally, in Sect. 9, we conclude this review paper. An overview of the scope covered in our review of Explainable Few-Shot Learning for Remote Sensing is illustrated in Fig. 2.
2 Background
Few-shot learning (FSL) is an emerging approach in the field of machine learning that allows models to acquire knowledge and make accurate predictions with limited training examples per class or context in a specific problem domain. In contrast to conventional machine learning techniques that demand vast quantities of training data, FSL aims to achieve comparable levels of performance using substantially fewer training examples. This ability to learn from scarce data makes FSL well-suited for applications where gathering sizable training sets may be prohibitively expensive or otherwise infeasible.
In traditional machine learning, models are trained from scratch on large labeled datasets. In contrast, FSL aims to learn new concepts from just a few examples, leveraging transfer learning from models pre-trained on other tasks. First, a base model is pretrained on a large dataset for a task like image classification. This provides the model with general feature representations that can be transferred. Then for the new few-shot task, the pretrained model is used as a starting point. The support set of few labeled examples for the new classes is used to fine-tune the pretrained model. Typically only the last layer is retrained to adapt the model to the new classes, in order to leverage the pre-learned features. Finally, the adapted model is evaluated on the query set. The query set contains unlabeled examples that the model must make predictions for, based on the patterns learned from the small support set for each new class. This tests how well the model can generalize to new examples of the classes after adapting with only a few shots. To get a clearer view, this whole process is illustrated in Fig. 3.
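To make this pipeline concrete, the following is a minimal PyTorch sketch of the adaptation step described above, assuming an ImageNet-pretrained backbone (torchvision ≥ 0.13) and an illustrative 5-way 5-shot episode; the tensor shapes, learning rate, and number of fine-tuning steps are assumptions for demonstration, not settings from any reviewed work.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical 5-way 5-shot episode: 25 support images, 50 query images.
support_x = torch.randn(25, 3, 224, 224)
support_y = torch.arange(5).repeat_interleave(5)  # 5 labels per class
query_x = torch.randn(50, 3, 224, 224)

# 1. Start from a backbone pretrained on a large dataset (here ImageNet).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# 2. Freeze the feature extractor to keep its general representations.
for param in model.parameters():
    param.requires_grad = False

# 3. Replace and retrain only the last layer for the N = 5 new classes.
model.fc = nn.Linear(model.fc.in_features, 5)
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for _ in range(100):  # a few fine-tuning steps on the small support set
    optimizer.zero_grad()
    loss = criterion(model(support_x), support_y)
    loss.backward()
    optimizer.step()

# 4. Test generalization on the unlabeled query set.
model.eval()
with torch.no_grad():
    query_pred = model(query_x).argmax(dim=1)
```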
Approaches in FSL classification can often be categorized based on the number of novel categories needed for generalization, referred to as N, as well as the number of labeled samples available in the support set for each of the N novel classes, referred to as k. Generally, a lower value of k makes it more challenging for the few-shot model to achieve high classification accuracy, as there is less supporting information in the support set to aid the model in making accurate predictions. This scheme is commonly referred to as the ‘N-way k-shot learning scheme.’ In instances where k equals 1, such schemes are often referred to as one-shot learning; in instances where k equals 0, they are referred to as zero-shot learning.
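The episodic sampling behind the N-way k-shot scheme can be sketched as follows; the `dataset` structure (an iterable of image–label pairs) is a hypothetical placeholder.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=15):
    """Sample one N-way k-shot episode from (image, label) pairs.

    Returns a support set of k labeled samples per class and a
    query set of q samples per class for evaluation.
    """
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append(image)

    # Choose the N novel classes for this episode.
    episode_classes = random.sample(list(by_class), n_way)

    support, query = [], []
    for new_label, cls in enumerate(episode_classes):
        samples = random.sample(by_class[cls], k_shot + q_queries)
        support += [(img, new_label) for img in samples[:k_shot]]
        query += [(img, new_label) for img in samples[k_shot:]]
    return support, query
```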
Initial exploration of FSL in conjunction with unmanned aerial vehicle (UAV)-based thermal imagery was undertaken by Liu et al. (2018) and Masouleh and Shah-Hosseini (2019). Their pioneering work demonstrated the potential of FSL for UAV-based tasks where limited onboard computational resources impose stringent constraints on model complexity and training data volume. The primary goal of FSL is to construct models that can identify latent patterns within a certain field using limited training examples, then utilize this learned knowledge to effectively categorize and classify new input. This capability closely mirrors human learning, where people can often understand the core of a new concept from just one or two examples. By reducing reliance on extensive training sets, FSL facilitates the development of machine learning systems applicable to data-scarce real-world problems across a broad range of domains.
(Left) Similarity function as applied to each pair of images in the AIDER dataset (Kyrkou and Theocharides 2020). The left and middle images belong to a fire disaster class, and the right image belongs to a non-disaster (normal) class. (Right) A query image of the flood disaster class is compared with the images from the support set via the similarity function, and the correct class (flood) is predicted based on the similarity scores
2.1 Similarity functions for few-shot learning
A similarity function is a critical component of linking the support set and query set in few-shot learning. An example in the context of aerial disaster scene classification using the AIDER dataset (Kyrkou and Theocharides 2020) is illustrated in Fig. 4. The left side of the figure shows how the similarity function evaluation is performed between each pair of images, with the left and middle images representing a fire disaster class and the right image representing a non-disaster class (or normal class). The right side of the figure shows how the similarity function can be used in conjunction with a query image and those from the support set to make a prediction on the correct class (flood) based on the similarity scores.
In few-shot learning, the choice of loss function is critical for enabling effective generalization from limited examples. Some commonly used loss functions include triplet loss, contrastive loss, and cross-entropy loss. The triplet loss helps models learn useful feature representations by minimizing the distance between a reference sample and a positive sample of the same class, while maximizing the distance to a negative sample from a different class. This allows fine-grained discrimination between classes. Contrastive loss is useful for training encoders to capture semantic similarity between augmented views of the same example. This improves robustness to input variations. Cross-entropy loss is commonly used for classifier training in few-shot models, enabling efficient learning from scarce labeled data. However, it can suffer from overfitting due to limited examples. Regularization methods such as label smoothing can help mitigate this. Other advanced losses like meta-learning losses based on model parameters have shown promise for fast adaptation in few-shot tasks. Overall, the choice of loss function plays a key role in addressing critical few-shot learning challenges like overfitting, feature representation learning, and fast generalization. Further research on specialized losses could continue improving few-shot performance.
For the scenario depicted on the left side of Fig. 4, the triplet loss \(L_{triplet}\) (Hoffer and Ailon 2015) is an example of a similarity function that could be used. The triplet loss involves comparing an anchor sample class to a positive sample class and a negative sample class. The goal is to minimize the Euclidean distance between the anchor and the positive class based on the similarity function f and maximize the distance between the anchor and the negative class. This can be summarized mathematically in Eq. 1 as

\(L_{triplet} = \sum _{i=1}^{N} \max \Big ( \Vert f(a_{i}) - f(p_{i})\Vert _{2}^{2} - \Vert f(a_{i}) - f(n_{i})\Vert _{2}^{2} + \alpha ,\; 0\Big )\)   (1)
In Eq. 1, the anchor, positive, and negative class samples are denoted as a, p, and n, respectively. The index i refers to the input sample index, N denotes the total number of samples in the dataset, and \(\alpha\) is a bias term acting as a threshold. The subscript 2 indicates that the evaluated Euclidean distance is the L2 norm, and the superscript 2 corresponds to squaring each distance term. The second term, with its negative sign, allows the distance between the anchor and the negative class sample to be maximized.
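A minimal PyTorch sketch of Eq. 1 is shown below, assuming the embeddings f(a), f(p), and f(n) have already been computed by the network; a similar built-in, `nn.TripletMarginLoss`, is also available.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Eq. 1: squared L2 distance to the positive sample minus the
    distance to the negative sample, thresholded by the margin alpha.

    f_a, f_p, f_n: (N, d) embeddings f(a_i), f(p_i), f(n_i).
    """
    d_pos = (f_a - f_p).pow(2).sum(dim=1)  # ||f(a) - f(p)||_2^2
    d_neg = (f_a - f_n).pow(2).sum(dim=1)  # ||f(a) - f(n)||_2^2
    return F.relu(d_pos - d_neg + alpha).sum()
```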
Networks that use the triplet loss for few-shot learning are also referred to as triplet networks. On the other hand, for comparing pairs of images, Siamese networks are commonly used. In such cases, the contrastive loss function \(L_{contrastive}\) (Hadsell et al. 2006) can be a better choice for defining similarity or loss, although the triplet loss can also be employed. The contrastive loss can be expressed mathematically as shown in Eq. 2:

\(L_{contrastive} = \frac{1}{2N}\sum _{i=1}^{N}\Big [ (1-y_{i})\, D_{W}^{2} + y_{i}\, \big \{\max (0,\; m - D_{W})\big \}^{2}\Big ]\)   (2)
In Eq. 2, y denotes whether two data points, \(x_{1}\) and \(x_{2}\), in a given pair i are similar (y = 0) or dissimilar (y = 1). The margin term m is user-defined, while \(D_W\) is the similarity metric, which is given by:

\(D_{W}(x_{1}, x_{2}) = \Vert f(x_{1}) - f(x_{2})\Vert _{2}\)   (3)
As with the triplet loss, the L2-based Euclidean distance is used; the first term in Eq. 2 corresponds to similar data points and the second term corresponds to dissimilar ones.
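The contrastive loss of Eqs. 2 and 3 can likewise be sketched as follows, again assuming precomputed embeddings; the margin value is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_x1, f_x2, y, margin=1.0):
    """Eqs. 2-3: pull similar pairs (y = 0) together and push
    dissimilar pairs (y = 1) at least `margin` apart.

    f_x1, f_x2: (N, d) embeddings of the two images in each pair.
    y: (N,) pair labels, 0 = similar, 1 = dissimilar.
    """
    d_w = (f_x1 - f_x2).pow(2).sum(dim=1).sqrt()  # Eq. 3: Euclidean D_W
    loss = (1 - y) * d_w.pow(2) + y * F.relu(margin - d_w).pow(2)
    return loss.mean() / 2
```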
The third type of network for few-shot learning can be realized as a prototypical network, as introduced by Snell et al. (2017). This method utilizes an embedding space in which samples from the same class are clustered together. In Fig. 5, an example is provided to demonstrate this concept. For each cluster, a class prototype is computed as the mean of the data points in that group. The calculation of the class prototype can be expressed mathematically as shown in Eq. 4:

\(v^{(k)} = \frac{1}{\vert S_{k}\vert }\sum _{x_{i}^{k} \in S_{k}} f_{\phi }(x_{i}^{k})\)   (4)
Equation 4 represents the prototypical network, a third type of network for FSL. The class prototype, computed as the mean of the embedded data points belonging to the same group, is denoted by \(v^{(k)}\), where k represents the class. The set of support images \(x_{i}^{k}\) for class k is represented by \(S_{k}\), and the embedding function by \(f_{\phi }\), which is different from the similarity function f described earlier.
The prototypical network and Siamese or triplet networks are different few-shot learning approaches that compare query and support samples in different ways. While Siamese or triplet networks directly compare query and support samples in pairs or triplets, the prototypical network compares the query samples with the mean of their support set. This is achieved by calculating the prototype representation of each class in the embedded metric space, which is the average of the feature vectors of all the support samples for that class. This can be visualized in Fig. 5. However, for one-shot learning, where only a single support sample is available for each class, the three approaches become equivalent, as the prototype representation becomes identical to the support sample representation. Overall, the choice of few-shot learning approach may depend on the dataset’s specific characteristics and the available support samples.
The embedding space in a prototypical network comprising 3 classes \(c_{1}\), \(c_{2}\) and \(c_{3}\) (denoted as green, dark red and orange, respectively). The mean of the k samples (5 in the diagram) of each class serves as the center of the cluster, and a given data point x is assigned to a cluster by evaluating the Euclidean distance to each cluster center and applying the minimum-distance criterion
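A minimal sketch of the prototype computation in Eq. 4 together with the minimum-distance classification rule is given below; it assumes the support and query embeddings have already been produced by the embedding function \(f_{\phi }\).

```python
import torch

def prototypical_predict(support_emb, support_y, query_emb, n_way):
    """Classify queries by minimum Euclidean distance to class prototypes.

    support_emb: (N*k, d) embeddings f_phi(x_i^k) of the support set.
    support_y:   (N*k,) class indices in [0, n_way).
    query_emb:   (Q, d) embeddings of the query samples.
    """
    # Eq. 4: each prototype v^(k) is the mean support embedding of class k.
    prototypes = torch.stack(
        [support_emb[support_y == k].mean(dim=0) for k in range(n_way)]
    )
    # Euclidean distance of every query to every prototype.
    dists = torch.cdist(query_emb, prototypes)  # (Q, n_way)
    return dists.argmin(dim=1)  # minimum-distance criterion
```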
2.2 Importance of explainable AI in remote sensing
Remote sensing and analysis of satellite imagery have progressed rapidly thanks to artificial intelligence and machine learning. Machine learning models can identify objects and patterns in huge amounts of satellite data with accuracy that can surpass human performance. However, these complex machine learning models are often considered “black boxes”: they provide highly accurate predictions and detections, but it is unclear why they make those predictions.
Explainable AI is an emerging field of study focused on making machine learning models and their predictions more transparent and understandable to humans. Explainable AI techniques are essential for applications like remote sensing where decisions could have serious real-world consequences (Kakogeorgiou and Karantzalos 2021). For example, a machine learning model that detects signs of natural disasters like wildfires in satellite images needs to provide an explanation for its predictions so that human operators can verify the findings before taking action. There are several approaches to making machine learning models used for remote sensing more explainable.
-
Highlighting important features: Techniques like saliency maps can highlight the most important parts of an image for a machine learning model’s prediction (a minimal gradient-based sketch is given after this list). For example, computer vision models could highlight the features they use to detect objects in satellite images, allowing correction of errors. Similarly, anomaly detection models could point to regions that led them to flag unusual activity, enabling verification of true positives versus false alarms.
-
Simplifying complex models: Complex machine learning models can be converted into simplified explanations that humans can understand, like logical rules and decision trees. For instance, deep reinforcement learning policies for navigating satellites could be expressed as a simplified set of if-then rules, revealing any flawed assumptions. These simplified explanations make the sophisticated capabilities of machine learning more accessible to domain experts.
-
Varying inputs to understand responses: Another explainable AI technique is to systematically vary inputs to a machine learning model and observe how its outputs change in response. For example, generative models that create new realistic satellite images could be evaluated by generating images with different attributes to determine their capabilities and limitations. Analyzing how a model’s predictions vary based on changes to its inputs provides insights into how it works and when it may produce unreliable results.
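As a concrete illustration of the first technique above, the sketch below computes a vanilla gradient saliency map, highlighting the input pixels that most influence a chosen class score; `model` stands for any differentiable classifier, and the channel-wise reduction is one common choice among several.

```python
import torch

def gradient_saliency(model, image, target_class):
    """Vanilla gradient saliency: how strongly each input pixel
    influences the score of `target_class`.

    image: (1, C, H, W) input tensor. Returns an (H, W) importance map.
    """
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]  # scalar class score
    score.backward()                       # d(score) / d(input pixels)
    # Reduce over channels by taking the maximum absolute gradient.
    return image.grad.abs().max(dim=1)[0].squeeze(0)
```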
Overall, explainable AI has the potential to build trust in machine learning systems and empower humans to make the best use of AI for applications like remote sensing. Making machine learning models explainable also allows domain experts to provide feedback that can improve the models. For example, experts in remote sensing may notice biases or errors in a machine learning model’s explanations that could lead the model astray. By providing this feedback, the experts can help data scientists refine and retrain the machine learning model to avoid those issues going forward.
In short, explainable AI has significant promise for enabling machine learning and remote sensing to work together effectively. By making machine learning models and predictions transparent, explainable AI allows:
-
Humans to verify and trust the outputs of machine learning models before taking consequential actions based on them.
-
Domain experts to provide feedback that improves machine learning models and avoids potential issues.
-
A better understanding of the strengths, weaknesses and limitations of machine learning that can guide how the technology is developed and applied in remote sensing.
Explainable AI will be key to ensuring machine learning is used responsibly and to its full potential for remote sensing and beyond. Building partnerships between humans and AI can lead to a future with technology that enhances human capabilities rather than replacing them.
2.3 Taxonomy of few-shot learning approaches
Understanding the different approaches in few-shot learning, as illustrated in Fig. 6, can help improve how these models are interpreted. The figure categorizes the main families of few-shot learning techniques: metric learning methods like Prototypical Networks, Siamese Networks, and Triplet Networks, which focus on learning discriminative embeddings; optimization-based meta-learning, exemplified by Meta-SGD and MAML, for quick adaptation; memory-based meta-learning, with methods such as Matching Networks and Relation Networks that use support sets; data augmentation, through methods like data hallucination and generative/VAE-based transformations; transfer learning, through domain adaptation and pre-training plus fine-tuning; and model generalization methods that focus on inherent adaptability through regularization constraints and inductive bias encoding. Referring to this framework enables us to pinpoint the main focus of a few-shot learning method, whether it lies in measuring similarities, optimizing meta-level processes, employing non-parametric memory, expanding data, transferring knowledge, or adapting the architecture. This classification can direct the creation of transparent and easy-to-understand techniques, particularly for remote sensing applications where explainability is key. By understanding the foundational learning mechanisms associated with existing methods, we can make reasoned decisions to select, combine, and enhance techniques to achieve not only accurate but also interpretable few-shot remote sensing analysis.
2.4 Taxonomy of explainable few-shot learning approaches
Expanding on the previous discussion, the clarity of few-shot models can be improved by using explainable AI methods designed for this area. These methods of explainable few-shot learning can generally be divided into two main categories, as discussed below.
2.4.1 Explainable feature extraction
These methods aim to highlight influential features or inputs that drive the model’s predictions.
-
Attention mechanisms: Attention layers accentuate informative features and inputs by assigning context-specific relevance weights (Jetley et al. 2018). They produce activation maps visualizing influential regions (Wang et al. 2022a; Hong et al. 2021). However, they do not explain the overall reasoning process.
-
Explainable graph neural networks: Techniques like xGNNs (Yuan et al. 2020a; Moura et al. 2022) can identify important nodes and relationships in graph-structured data. Cheng et al. (2022) put forward attentive graph neural network modules that can provide visual and textual explanations illustrating which features are most crucial for few-shot learning. This provides feature-level transparency, but the complete decision logic remains unclear.
-
Concept activation visualization: Approaches like Grad-CAM produce saliency maps showing influential regions of input images (Selvaraju et al. 2016); a minimal sketch is given after this list. However, local feature importance may not fully represent the global decision process.
-
Rotation-invariant feature extraction: The proposed rotation-invariant feature extraction framework in Pintelas et al. (2023) introduces an interpretable approach for extracting features invariant to rotations. This provides intrinsic visual properties rather than extraneous rotation variations.
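To illustrate the concept activation visualization entry above, here is a minimal Grad-CAM-style sketch using PyTorch hooks; the choice of `target_layer` (e.g., the final convolutional block of a ResNet) is an assumption, and practical implementations usually normalize the resulting map before display.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, target_class):
    """Grad-CAM: weight a conv layer's activation maps by the
    spatially averaged gradients of the target class score."""
    acts, grads = {}, {}
    fwd = target_layer.register_forward_hook(
        lambda module, inputs, output: acts.update(v=output))
    bwd = target_layer.register_full_backward_hook(
        lambda module, grad_in, grad_out: grads.update(v=grad_out[0]))

    model.eval()
    score = model(image)[0, target_class]  # scalar class score
    model.zero_grad()
    score.backward()
    fwd.remove()
    bwd.remove()

    # Channel importance: global-average-pool the gradients.
    weights = grads["v"].mean(dim=(2, 3), keepdim=True)
    # Weighted combination of activation maps, rectified by ReLU.
    cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
    # Upsample the coarse map back to the input resolution.
    return F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                         align_corners=False)
```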
2.4.2 Explainable decision making
These methods aim to directly elucidate the model’s internal logic and reasoning.
-
Interpretable models: Decision trees (Rudin 2019) and rule lists (Letham et al. 2015) provide complete transparency into model logic in a simplified human-readable format. However, their accuracy is often lower than that of complex models.
-
Model-agnostic methods: Techniques like LIME (Ribeiro et al. 2016) and SHAP approximate complex models locally using interpretable representations. But generating explanations can be slow at prediction time.
-
Fairness constraints: By imposing fairness constraints during training (Agarwal et al. 2018) or transforming data into fair representations (Zemel et al. 2013), biases can be mitigated. However, constraints may overly restrict useful patterns.
-
Prototype analysis: Analyzing prototypical examples from each class provides intuition into a model’s reasoning (Snell et al. 2017). However, this is limited to simpler instance-based models.
Overall, choosing suitable explainable few-shot learning techniques requires trading off accuracy, transparency, and efficiency based on the application requirements and constraints. A combination of feature and decision explanation methods is often necessary for complete interpretability. The taxonomy provides an initial guide to navigating this complex landscape of approaches in remote sensing contexts. For clarity, the taxonomy is also illustrated in Fig. 7.
3 Type of remote sensing sensor data
Remote sensing data is typically acquired from satellite or unmanned aerial vehicle (UAV) platforms, and the characteristics of the data can vary greatly depending on the specific platform and sensor used. They can be classified according to their spatial, spectral, radiometric, and temporal resolutions, as discussed in Aleissaee et al. (2022) and Sun et al. (2021).
-
Spatial resolution: The spatial resolution of remote sensing data is often limited by the size and altitude of the sensor platform, as well as the resolution of the sensor itself. For example, satellite-based sensors typically have a lower spatial resolution than UAV-based sensors, due to their higher altitude and larger coverage area.
-
Spectral resolution: Spectral resolution refers to the range of wavelengths that a remote sensing sensor can detect, as well as the sampling rate at which it collects data across this range. Different sensors have different spectral characteristics, and the spectral resolution of a sensor can have a significant impact on its ability to distinguish different features or objects in the scene.
-
Radiometric resolution: Radiometric resolution is related to the sensitivity of the sensor and the number of bits utilized for signal representation. A higher radiometric resolution means that the sensor is able to capture a wider range of signal strengths and more accurately represent the scene being imaged.
-
Temporal resolution: Temporal resolution is a critical characteristic of remote sensing data, as it can enable the tracking of changes in a scene over time. The frequency with which images are collected, as well as the length of time over which they are collected, can impact the ability of remote sensing systems to detect and monitor changes in the scene, such as vegetation growth or land use changes.
Understanding the various characteristics of remote sensing data is important for developing effective machine learning approaches, as different methods may be better suited to different types of data. For example, models that perform well on high-resolution satellite imagery may not perform as well on lower-resolution UAV data, and vice versa. By considering the characteristics of the data and tailoring machine learning approaches to the specific problem at hand, researchers can develop more accurate and effective models for remote sensing applications.
A classification of image data types can also be made based on three categories, namely Very High-Resolution Imagery, Hyperspectral Imagery, and Synthetic Aperture Radar Imagery, as discussed in Aleissaee et al. (2022) and Sun et al. (2021). Figure 8 provides some examples of such imagery.
-
Very High Resolution (VHR) imagery: VHR imagery is often captured by VHR satellite sensors, which are designed to capture images with an extremely high level of detail. This level of detail can be particularly beneficial for applications such as object detection and tracking, as well as emergency response operations. As optical sensor technology continues to advance, the spatial resolution obtained from these sensors becomes even finer, allowing greater levels of detail to be captured. This, in turn, can lead to more accurate object detection and tracking, as well as more effective emergency response operations that can respond to events as they unfold in real time.
-
Hyperspectral imagery: In addition to the optical electromagnetic spectrum often represented by the RGB color channels, remote sensing signals and imagery can also be obtained and analyzed in other parts of the spectrum, including the infrared (IR) and ultraviolet (UV) regions. In particular, the IR spectrum can be further categorized into near-, mid- and far-infrared. Imagery that samples many narrow, contiguous bands across such spectral regions is known as hyperspectral imagery. This type of imagery goes beyond the three color channels of optical images and contains far more spectral information, enabling the composition of the object of interest to be unraveled, both physically and chemically. As such, hyperspectral images are particularly useful for environmental and earth-science research, as they can provide detailed information on factors such as vegetation health, mineral composition, and water quality. By analyzing this spectral information, researchers can gain a deeper understanding of the earth’s surface, as well as monitor changes and anomalies that may indicate potential issues.
-
Synthetic Aperture Radar (SAR) imagery: By utilizing the process of emission and reception of electromagnetic waves on the Earth, radar-based remote sensing can be accomplished. Such remote sensing techniques can acquire high spatial resolution images regardless of weather conditions, and are widely applicable in numerous domains. In particular, Synthetic Aperture Radar (SAR) based images have been utilized in diverse fields, including disaster management, hydrology and forestry, due to their ability to provide high-quality images regardless of atmospheric conditions, time of day or season. SAR-based imagery can thus be a valuable source of information for remote sensing-based research and applications.
4 Benchmark remote sensing datasets for evaluating learning models
In this section, we will provide a brief overview of commonly used benchmark datasets that evaluate algorithms in remote sensing. The datasets are categorized and listed based on the type of remote sensing data and platforms they were collected from. It is important to note that these datasets are frequently used by researchers to evaluate and benchmark their algorithms, and although not included in the survey works by Aleissaee et al. (2022), they are essential for this review.
4.1 Hyperspectral image dataset
4.1.1 Satellite-based data
Most of the datasets described here are tailored for multi-label image classification, although a few single-label classification datasets exist.
-
Pavia (G. de Inteligencia Computacional 2020; Dam 2022): The Pavia University research team created a hyperspectral image dataset with images consisting of 610 \(\times\) 610 pixels and 103 spectral bands. Each image in the dataset is a classification map with 9 classes that include mostly urban contexts such as bitumen, brick, and asphalt. The dataset comprises 42,776 labeled samples and is specifically designed for multi-label classification.
-
Indian Pines (G. de Inteligencia Computacional 2020; Dam et al. 2020): The dataset contains hyperspectral images of a particular landscape in Indiana. It is a multi-label classification dataset where each map consists of 145 \(\times\) 145 pixels and 224 spectral bands. There are 16 semantic labels available for each map, and the dataset has a total of 10,249 samples.
-
Salinas Valley (G. de Inteligencia Computacional 2020): The Salinas Valley dataset consists of hyperspectral images collected from California, with multi-label classification maps of pixel size 512 \(\times\) 217 and 224 spectral bands, similar to the Indian Pines dataset. There are 16 semantic classes with 54,129 samples. A subset of the Salinas dataset, referred to as Salinas-A, includes only 86 \(\times\) 86 image pixels of 6 classes, with a total of 5,348 samples.
-
Houston (Contest 2013): The Hyperspectral Image Analysis group in collaboration with the NSF Funded Center for Airborne Laser Mapping (NCALM) has acquired images across the University of Houston. This dataset comprises 16 semantic classes of urban objects such as highways, railways, and tennis courts, unlike the Botswana, Indian Pines, and Salinas Valley datasets. The images have 144 spectral bands in the 380 nm to 1050 nm region, and each image has a pixel size of 349 \(\times\) 1905. The dataset is designed for evaluating multi-label image classification.
-
BigEarthNet (Sumbul et al. 2019): The dataset consists of pairs of Sentinel-2 images captured by a multi-spectral sensor, with 590,326 pairs collected from 10 European countries. Each image in the pair has a size of 120 \(\times\) 120 pixels and covers 13 spectral bands. The dataset is annotated with multiple land-cover classes or labels, making it suitable for multi-label classification evaluation.
-
EuroSat (Helber et al. 2019): The dataset consists of images obtained from the Sentinel-2 satellite, covering 13 spectral bands with 10 classes and 27,000 labeled samples. It is utilized for evaluating single-label-based land cover and land use classification. Each image has a pixel size of 64 \(\times\) 64.
-
SEN12MS (Schmitt and Wu 2021): The dataset comprises 180,662 images captured from Sentinel-1 and Sentinel-2, with four cover types categorized using different classification schemes. Each image is of size 256 \(\times\) 256 and contains different spectral bands. The images are annotated by multiple land-cover labels, but the primary objective is to use these labels to infer the overall context of the scene, such as forest, grasslands, or savanna, making it suitable for single label-based scene classification. It is important to note that Sentinel-1 images are SAR images, making the dataset useful for SAR-based map classification as well.
4.1.2 UAV-based dataset
-
WHU-Hi (Hu et al. 2020): The WHU-Hi dataset, which stands for Wuhan UAV-borne Hyperspectral Image, consists of UAV-based images of various crop types gathered in farming areas in Hubei province, China. It is divided into three sub-datasets: WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-Honghu, each with different individual image sizes, numbers of labels/classes, and spectral bands, which are explained in Table 1. The dataset is suitable for evaluating multi-label classification algorithms.
4.2 VHR image-based dataset
4.2.1 Satellite-based datasets
-
UC Merced Landuse (Yang and Newsam 2010): The dataset was designed for single-label land use classification and comprises 2100 RGB images, each of size 256 \(\times\) 256 pixels. The dataset consists of 21 classes, predominantly related to urban land use.
-
ISPRS Potsdam (Gerke 2014): The International Society of Photogrammetry and Remote Sensing (ISPRS) developed a dataset for algorithmic evaluation of multi-label map classification. The dataset comprises 38 patches/images. The pixel size of each patch is 6000 \(\times\) 6000.
-
ISPRS Vaihingen (Gerke 2014): The dataset was created for multi-label map classification and includes 33 patches/images of varying sizes. The pixel size of each patch is 2494 \(\times\) 2064.
-
RESISC45 (Cheng et al. 2017): The Northwestern Polytechnical University (NWPU) created a dataset for single-label image scene classification. The dataset contains 31,500 images categorized into 45 classes, with each class consisting of 700 images. The pixel size of each image is 256 \(\times\) 256.
-
WHU-RS19 (Xia et al. 2010): The dataset is created using satellite images obtained from Google Earth and contains 19 semantic classes, with approximately 50 samples per class. Samples from the same class are extracted from different regions with varying resolutions, scales, orientations, and illuminations. Each image in the dataset is 600 \(\times\) 600 pixels in size. It is intended for single-label image scene classification.
-
AID (Xia et al. 2017): The Aerial Image Database (AID) is a collection of 10,000 satellite images gathered from Google Earth, each sized 600 \(\times\) 600 pixels. The dataset includes 30 classes primarily related to urban environments. As with the RESISC45 and WHU-RS19 datasets, AID is used for single-label image scene classification purposes.
4.2.2 UAV-based dataset
-
AIDER (Kyrkou and Theocharides 2020): The Aerial Image Database for Emergency Response (AIDER) is a collection of 8540 UAV images categorized into four disaster categories—collapsed buildings, fire, flood, and traffic accidents, along with a non-disaster category labeled as “normal” (Lee et al. 2023). This is one of the first UAV-based datasets that can be used as a benchmark for visual-based humanitarian aid or search-and-rescue operations in the RGB spectrum.
-
SAMA-VTOL (Bayanlou and Khoshboresh-Masouleh 2021): The SAMA-VTOL aerial image dataset is a new dataset developed from images captured by UAVs. This dataset was created to support a broad spectrum of scientific projects within the field of remote sensing. It is particularly useful for research projects focused on 3D object modeling, urban and rural mapping, and the processing of digital elevation and surface models. The objective is to provide high-resolution, low-cost data that contribute to a better understanding of both urban and rural scenes for various applications.
4.3 SAR image-based dataset
-
MSTAR (Wang et al. 2015): This dataset consists of 5950 X-band spectral images, each with a size of 128 \(\times\) 128 pixels, and categorized into 10 classes. It is designed specifically for military object recognition and classification.
-
OpenSARShip (Huang et al. 2017): The dataset includes 11,346 chips of ships captured by C-band SENTINEL-1 SAR imagery, belonging to 17 ship types, and collected from 41 images. Each chip is labeled with automatic identification system messages indicating different environmental conditions. The image sizes of the chips range from 30 \(\times\) 30 to 120 \(\times\) 120 pixels.
It is evident that there are fewer SAR-based benchmark datasets compared to hyperspectral or VHR-based image datasets. According to Fu et al. (2021), collecting SAR-based images with fine annotation is more challenging due to the difficulty of acquisition and the tedious, time-consuming process of interpreting and labeling such images. Furthermore, Rostami et al. (2019) stated that the devices used for generating SAR images are costly, and access to the data is strictly regulated owing to its classified nature.
In Table 1, we have summarized the discussion on the available datasets, highlighting the data type, number of images and classes, pixel sizes, spectral bands (if any), platform, and classification method.
5 Evaluation metrics for few-shot remote sensing tasks
Before delving into the various approaches to few-shot remote sensing tasks, we highlight in this section some evaluation metrics that are well suited to few-shot learning. Unlike typical learning-based tasks, small-sample settings often exhibit some degree of imbalance between the training set and the test set, and hence appropriate metrics addressing such imbalance need to be invoked. We illustrate in Table 2 the various metrics along with a brief overview. The metrics are the confusion matrix, precision, recall, F1 score, Overall Accuracy (OA), Average Accuracy (AA), Pixel Accuracy (PA), Average Precision (AP), Kappa coefficient \(\kappa\), PR curve and Intersection over Union (IoU, or Jaccard Index). Equations (5)–(10) mathematically describe some of these metrics.
The variables TP, FP, TN, and FN in the previous equations represent true positive, false positive, true negative, and false negative classes, respectively. In Eq. (6), \(N_{classes}\) refers to the total number of classes that are taken into consideration.
Depending on the remote sensing task at hand, appropriate metrics are needed to compare the performances of state-of-the-art algorithms on the same footing. For image classification, whereby one label is output per image as a whole, the confusion matrix, precision, recall, F1 score, OA, AA, \(\kappa\), and PR curve are suitable metrics. For image segmentation, whereby multiple labels can be assigned to different areas of interest (usually denoted by different colours) in an image, the IoU, F1 score, PA, precision, recall, confusion matrix, and PR curve can be utilized. For object detection, whereby an object in the image must be identified and localized, the IoU and the Average Precision stand out as the best metrics for the task since the bounding box localizes the object of interest, although other metrics like precision, recall, PR curve, and F1 score could be used.
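To make the relationship between several of these metrics concrete, the following NumPy sketch derives OA, AA, and the Kappa coefficient \(\kappa\) from a confusion matrix; it assumes rows index the true classes and columns the predicted classes.

```python
import numpy as np

def classification_metrics(conf):
    """Compute OA, AA, and Cohen's kappa from a confusion matrix.

    conf: (C, C) array where conf[i, j] counts samples of true
    class i predicted as class j.
    """
    total = conf.sum()
    # Overall Accuracy: fraction of all samples on the diagonal.
    oa = np.trace(conf) / total
    # Average Accuracy: mean of the per-class accuracies (recalls).
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))
    # Kappa: observed agreement corrected for chance agreement p_e.
    p_e = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2
    kappa = (oa - p_e) / (1 - p_e)
    return oa, aa, kappa
```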
6 Recent few-shot learning techniques in remote sensing
In the domain of remote sensing, the intersection with computer vision has received considerable attention and research interest, as evinced by numerous works such as those undertaken by Aleissaee et al. (2022) and Tuia et al. (2011), which delve into and assess diverse active machine learning frameworks. Moreover, the intricacies of hyperspectral image classification and contemporary developments in machine learning and computer vision methods are explored by Camps-Valls et al. (2013). A comprehensive and exhaustive analysis of deep learning algorithms utilized for processing remote sensing images, while detailing current practices and available resources, is provided by Zhu et al. (2017). Furthermore, Aleissaee et al. (2022) presents an overview of Vision Transformer-based approaches to remote sensing, with a specific focus on very high-resolution, hyperspectral, and radar imaging. In this review, our specific focus lies on recent breakthroughs in the realm of few-shot learning techniques for remote sensing imaging. We seek to provide an in-depth exploration of the implications of such advancements for scene classification and comprehension in both satellite-based and UAV-based data collection platforms. The incorporation of explainable AI can aid in understanding the reasoning behind classification results, providing more transparency and confidence in decision-making processes.
6.1 Few-shot learning in hyperspectral images classification
In the field of remote sensing, few-shot learning has gained significant traction, as highlighted in the introductory section. In this particular section, we shall concentrate on the techniques put forward for both single-label and multi-label remote sensing classification within the context of both satellite and UAV-based platforms. It is worth noting that, unless indicated otherwise, all the evaluation metrics employed in the studies under review in this section encompass OA, AA, and \(\kappa\).
The MDL4OW model, presented by Liu et al. (2020), employs a few-shot based deep learning architecture to classify five unknown classes through training on nine known classes. Notably, the proposed model departs from traditional centroid-based methods, instead utilizing extreme value theory from a statistical model, as depicted in Fig. 3. Furthermore, the authors introduced a novel evaluation metric, the mapping error, which is particularly sensitive to the imbalanced classification scenarios frequently encountered in hyperspectral remote sensing image datasets. The mathematical expression of the mapping error for C classes is provided in Eq. 11, subject to the constraints expressed in Eqs. 12 and 13.
Equation 11 represents the mapping error, where \(A_{p,i}\) signifies the predicted area of the ith class and \(A_{gt,i}\) denotes the corresponding ground-truth area. Here, C represents the total number of known classes (nine in their work), while \(C+1\) refers to the total number of unknown classes (five in their work). The Pavia dataset, the Indian Pines dataset, and the Salinas Valley dataset are evaluated in their study. Apart from the mapping error, the Openness metric (Geng et al. 2020), which evaluates the degree of openness of a given dataset in open-world classification, is also considered as a benchmark in their evaluation, in addition to OA and the micro F1 score.
Equation 14 elucidates the association between Openness and the number of training and testing data, \(N_{train}\) and \(N_{test}\), respectively:

\(Openness = 1 - \sqrt{\frac{2 \times N_{train}}{N_{train} + N_{test}}}\)   (14)
Illustration of how the MDL4OW can help to identify unknown semantic labels (the road highlighted in black and the house enclosed in the dark blue border) that the training network has not been exposed to beforehand. Any label that the network has not learned during training is highlighted in black in the MDL4OW scheme (Top), as compared to without it, where the house is labelled in orange (enclosed by the black circle). This may cause confusion as it may be misinterpreted as other classes (Bottom). Image is from the Salinas Valley dataset
Figure 9 presents an illustration of how the MDL4OW methodology effectively identifies unknown classes, as demonstrated through an example image. The top portion of the figure highlights the road (denoted by black) and the house (enclosed by a dark-blue border), both of which cannot be assigned to any known class, as they were not presented a priori. A standard deep learning model would nonetheless be forced to assign them a label, as it was not trained to recognize these specific classes. In contrast, the MDL4OW approach (depicted in the bottom portion of the figure) is adept at identifying and marking unknown classes (denoted by black), effectively applying the proposed scheme.
Employing adaptive subspace learning and feature-wise transformation (SSFT) techniques, Bai et al. (2022) aimed to enhance feature diversity and minimize overfitting. In particular, they incorporated a 3-D local channel attention residual network to extract features, and evaluated their algorithm against the SOTA using the Salinas, Pavia, and Indian Pines datasets. To compare with other SOTA, they performed 5-shot, 10-shot, 15-shot, 20-shot, and 25-shot evaluations. In the study conducted by Ding et al. (2021), a pseudo-labelling approach was adopted to augment the feature extraction procedure of their network using limited samples and also reduce overfitting. The soft pseudo label was computed from the Euclidean distance between the unlabelled samples and the labelled samples, with each labelled sample acting as a class reference. Two sub-networks, namely the 3D-CNN and the SSRN (based on the ResNet), were proposed to function as the feature extractor. The datasets used for evaluation comprised Pavia, Indian Pines and Salinas Valley. For comparison with other SOTA approaches, a 1-shot, 3-shot, and 5-shot evaluation approach was employed. Results showed that the proposed model outperformed all existing SOTA approaches in all three evaluation settings.
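As a simplified illustration of such distance-based soft pseudo-labelling (our reading of the general idea, not Ding et al.’s exact formulation), Euclidean distances between unlabeled samples and one labeled reference embedding per class can be converted into soft class probabilities:

```python
import torch

def soft_pseudo_labels(unlabeled_emb, labeled_emb, temperature=1.0):
    """Assign soft class probabilities to unlabeled samples from their
    Euclidean distances to one labeled reference embedding per class.

    unlabeled_emb: (M, d); labeled_emb: (C, d), one reference per class.
    Returns an (M, C) matrix of soft pseudo-labels.
    """
    dists = torch.cdist(unlabeled_emb, labeled_emb)  # (M, C)
    # Closer references receive higher probability mass.
    return torch.softmax(-dists / temperature, dim=1)
```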
Using a 3D residual convolutional block attention network (R3CBAM), the authors of Pal et al. (2022) demonstrated how to effectively learn spectral-spatial features in a more salient manner with small training samples. The CBAM is incorporated as an attention network. Meta-learning is employed, where a set of Euclidean distances of the test query set from known class prototypes is leveraged, and unknown class queries are labelled as outliers and recognized without setting a threshold value beforehand. The evaluation of their approach was performed on the Indian Pines, Pavia, Salinas, and Houston datasets. During the training of their network, a query set was generated from six randomly chosen base classes, and the support set was formed from samples of three randomly chosen query classes. 1-shot and 5-shot open-set recognition (OSR) performance evaluations were carried out and compared with SOTA methods. In both 1-shot and 5-shot OSR evaluations, the results indicated that the proposed method outperformed the SOTA methods.
Expanding upon previous works, Wang et al. (2021) proposed the Heterogeneous Few-Shot Learning (HFSL) approach for remote sensing classification with few samples per class. The method initially learns from data randomly sampled from the mini-ImageNet dataset to obtain transferable knowledge, followed by separating the data into support and query sets. A spectral-spatial fusion few-shot learning model is proposed that extracts spectral information through 1D mathematical operations and spatial information through a CNN with VGG16 pre-trained weights in the first layer. Their evaluation approach includes the Pavia University and Houston datasets, with a 5-shot performance evaluation against state-of-the-art methods. Building on this, Hu et al. (2022) add Knowledge Distillation (KD) to the approach, enabling important parts of small samples to be identified even with a shallower network. Further knowledge transfer and fine-tuning of the classifier model are performed, with evaluation on the Pavia University and Indian Pines datasets.
Using the Dirichlet-Net for feature extraction, Qu et al. (2019) suggested a few-shot multi-task transfer learning strategy that aims to maintain classification accuracy across several domains. The key concept is to extract fundamental representations that are common to the same type of object features across domains, with the aim of circumventing the requirement for more ground-truth labels from the target domain. The Pavia University dataset was employed to assess their approach, with a 5-sample per class evaluation strategy (i.e., 5-shot evaluation). Results showed that the proposed method was able to accurately classify unseen target domain samples, demonstrating the efficacy of the approach.
The authors of Tong et al. (2020) proposed a new Attention-weight Graph Convolutional Network (AwGCN) for quantifying and correlating internal features in hyperspectral data in a few-shot setting, followed by a semi-supervised label propagation of the node labels (features) from the support to the query set via the GCN using the trained weights of the attention graphs. Unlike other approaches, they did not rely on pre-trained CNN-based weights as feature extractors but instead utilized a graph-based approach. The proposed method was evaluated on the Indian Pines dataset using 1-shot, 3-shot, and 5-shot settings and on the Pavia University dataset using a 5-shot setting. In a similar vein, Yang et al. (2020) proposed a GraphSAGE-based approach that utilizes spectral and spatial feature information to greatly reduce algorithmic space and time complexity. Their approach was evaluated on the Pavia University, Indian Pines, and Kennedy Space Center datasets using 30 samples per class for training and 15 samples per class for validation. Their results demonstrated improved accuracy compared to other state-of-the-art methods, with increases of 6.7% on Pavia University, 5.2% on Indian Pines, and 7.1% on Kennedy Space Center, making their approach a strong contender for further research.
The proposed method by Huang et al. (2021a), Self-Attention and Mutual-Attention Few-Shot Learning (SMA-FSL), utilizes a 3D convolutional feature embedding network for spectral-spatial extraction, coupled with a self-attention module to extract prototypes from each class in the support set and a mutual-attention module that updates and aligns these category prototypes with the query set. Attention-based learning is emphasized, where crucial features are enhanced while noisy features are reduced. To assess the efficacy of their approach, it is evaluated on the Houston, Botswana (Gerke 2014), Chikusei (Yokoya and Iwasaki 2016) and Kennedy Space Center datasets using 1-shot, 5-shot, and 15-shot evaluation approaches. These datasets are chosen for their diverse range of terrain and vegetation, allowing for a comprehensive evaluation of the model’s performance. The results of the evaluation show that the approach is effective across all datasets, demonstrating its versatility in different environments.
The proposed work by Zhao et al. (2022) presents an incremental learning-based method that constantly updates the classifier by utilizing few-shot samples, allowing recognition of new classes while retaining knowledge of previous classes. The feature extractor module is implemented using a 20-layer ResNet, and the few-shot class incremental learning (FSCIL) is carried out via a constantly updated classifier (CUC), which is further enhanced by incorporating an attention mechanism for measuring the prototype similarity between each training and test sample class. The Pavia University dataset was employed for evaluating the performance of this approach using a 5-shot evaluation strategy. The results obtained showed that the proposed FSCIL with CUC and attention mechanism achieved superior performance compared to the baseline method. Furthermore, it was also observed that the performance improved with an increase in the number of shots.
Most previous works have employed CNN-based architectures for few-shot learning in hyperspectral image classification. However, CNNs can struggle with modeling long-range dependencies in spectral-spatial data when training samples are scarce. This has motivated recent interest in transformer architectures as an alternative.
In a notable contribution, Bai et al. (2022) addressed the challenge of performance degradation observed in hyperspectral image classification methods when only a limited number of labeled samples are available for training. They proposed a unified framework with a Transformer Encoder and Convolutional Blocks to enhance feature extraction without needing extra data. The Transformer Encoder provides global receptive fields to capture long-range dependencies, while the Convolutional Blocks model local relationships. Their method achieved state-of-the-art results on few-shot hyperspectral tasks using public datasets, demonstrating the potential of transformers to advance few-shot learning in this domain.
Huang et al. (2023) also recognized limitations of CNN-based models for few-shot hyperspectral image classification. They highlighted the inherent difficulty of CNNs in effectively capturing long-range spatial-spectral dependencies, especially in scenarios with limited training data. They proposed an improved spatial-spectral transformer (HFC-SST) to overcome this, inspired by transformers’ strong modeling capabilities for long-range relationships. HFC-SST generates local spatial-spectral sequences as input based on correlation analysis between spectral bands and adjacent pixels. A transformer-based network then extracts discriminative spatial-spectral features from this sequence using only a few labeled samples. Experiments on multiple datasets demonstrated that HFC-SST outperforms CNNs and prior few-shot learning methods by effectively modeling local long-range dependencies in limited training data. This further highlights the potential of transformers to advance few-shot hyperspectral classification through robust spatial-spectral feature learning.
The work by Peng et al. (2023) also explores cross-domain few-shot learning for hyperspectral image classification, where labeled samples in the target domain are scarce. They propose a convolutional transformer-based few-shot learning (CTFSL) approach within a meta-learning framework. Most prior cross-domain few-shot methods rely on CNNs to extract statistical features, which only capture local spatial information. To address this, CTFSL incorporates a convolutional transformer network to extract both local and global features. A domain aligner maps the source and target domains to the same space, while a discriminator reduces domain shift and distinguishes feature origins. By combining few-shot learning across domains, transformer-based feature extraction, and domain alignment, their method outperforms state-of-the-art techniques on public hyperspectral datasets. This demonstrates the potential of transformers and cross-domain learning strategies to advance few-shot hyperspectral classification with limited labeled data.
Recently, Ran et al. (2023) proposed a novel deep transformer and few-shot learning (DTFSL) framework for hyperspectral image classification that aims to overcome the limitations of CNNs. The DTFSL incorporates spatial attention and spectral query modules to capture long-range dependencies between non-local spatial samples. This helps reduce uncertainty and better represent underlying spectral-spatial features with limited training data. The network is trained using episode and task-based strategies to learn an adaptive metric space for few-shot classification. Domain adaptation is also integrated to align distributions and reduce variation across domains. Experiments on three public HSI datasets demonstrated that the transformer-based DTFSL approach outperforms state-of-the-art methods by effectively modeling relationships between non-local spatial samples in a few-shot context. This indicates transformers could be a promising alternative to CNNs for few-shot hyperspectral classification.
In another work, Liu et al. (2022) introduced a Vision Transformer (ViT)-based architecture for FSL that employs feedback learning. Their Few-Shot Transformer Network (FFTN) combines spatial and spectral attention over the features learned by the transformer component. The network incorporates meta-learning with reinforced feedback learning on the source set as a first step, improving its ability to identify misclassified samples through reinforcement; the second step is target-learning with transductive feedback training on the target samples to learn the distribution of unlabeled data. This two-step process helps the network adapt to the target domain, improving its accuracy and reducing the risk of overfitting. Incorporating XAI into such a model could make its decision-making process more transparent and interpretable, enhancing its trustworthiness and reducing the risk of biases.
Table 3 provides an overview of the existing few-shot approaches for hyperspectral image classification, listing the datasets and evaluation metrics used, the type of feature extractor, and the year of publication. A 5-shot protocol is the one most often used to measure how well the proposed methods work on the Chikusei, Salinas Valley, and Pavia University datasets, and across the works discussed, the Pavia, Indian Pines, and Salinas Valley datasets are those most frequently used to compare algorithms.
In Table 3, most of the few-shot approaches for hyperspectral image classification have not incorporated XAI for better interpretability. However, the addition of XAI techniques could enhance transparency and provide insights into the decision-making process of the models. One way to incorporate XAI is by using visualization techniques to highlight the features or regions of the image that contribute to the model’s prediction. Another approach is to use saliency maps to identify the most important regions of the input image that influence the model’s decision. Additionally, model-agnostic methods such as LIME or SHAP can provide insights into the decision-making process of the models. Overall, the incorporation of XAI techniques in few-shot approaches for hyperspectral image classification can improve the transparency and interpretability of the models and facilitate their adoption in real-world applications.
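As a concrete illustration of the visualization-based techniques mentioned above, the following sketch computes a vanilla gradient saliency map for an arbitrary Keras classifier. It is a minimal, model-agnostic example rather than code from any of the reviewed papers; `model`, `image`, and `class_index` are assumed inputs, and more refined attributions (e.g., Grad-CAM, LIME, or SHAP) would replace the raw gradient.

```python
import tensorflow as tf

def saliency_map(model, image, class_index):
    """Vanilla gradient saliency: the magnitude of the gradient of the
    class score with respect to each input pixel. `image` is a single
    preprocessed tensor of shape (H, W, C)."""
    x = tf.expand_dims(tf.convert_to_tensor(image, tf.float32), axis=0)
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = model(x, training=False)[0, class_index]
    grads = tape.gradient(score, x)[0]                 # (H, W, C)
    saliency = tf.reduce_max(tf.abs(grads), axis=-1)   # collapse channels
    # Normalize to [0, 1] so the map can be overlaid as a heat map
    smin, smax = tf.reduce_min(saliency), tf.reduce_max(saliency)
    return (saliency - smin) / (smax - smin + 1e-8)
```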
6.2 Few-shot learning in VHR image classification
All studies in this section employ the OA metric unless otherwise stated. Li et al. (2017) proposed a novel zero-shot scheme for scene classification (ZSSC) based on the visual similarity of images from the same class. Their work employs the UC Merced dataset for evaluation, where a number of classes were randomly chosen as observed classes while the rest were treated as unseen. Additionally, the authors incorporate the RSSCN7 dataset (Zou et al. 2015) and a VHR satellite-based image database consisting of instances from both observed and unknown classes in an unlabeled format. To relate observed and unseen classes, the authors adopt the word2vec (Church 2017) model to represent each class as a semantic vector and use a K-Nearest Neighbor (KNN) graph-based model to implement sparse learning for label refinement, which also helps to denoise noisy labels during the zero-shot classification scheme. Their proposed model achieves significant performance gains compared to existing SOTA zero-shot learning models with linear computational complexity, and can handle a large number of classes with minimal memory requirements. Some of the newest attempts to combine few-shot learning with UAV-based data were carried out by Al-Haddad and Jaber (2023) and Khoshboresh-Masouleh and Shah-Hosseini (2023).
The work presented by Hamzaoui et al. (2022) proposed a hierarchical prototypical network (HPN) as a novel approach for few-shot learning, which is evaluated on the RESISC45 dataset. The HPN model is designed to perform analysis of high-level aggregated information in the image, followed by fine-level aggregated information computation and prediction, utilizing prototypes associated with each level of the hierarchy as described in (4). The evaluation protocol involves a 5-way 1-shot and 5-way 5-shot classification approach within the standard meta-learning framework. In the proposed approach, the ResNet-12 model serves as the backbone for the first stage of feature extraction. The extracted features are then passed through a linear layer to obtain the final feature representation. This feature representation is used for the classification task in the second stage.
In a bid to augment the performance of few-shot task-specific contrastive learning (TSC), Zeng and Geng (2022) introduced a self-attention and mutual-attention module (SMAM) that analyzes feature correlations with the aim of reducing background interference. The adoption of a contrastive learning strategy facilitates the pairing of data using original images from diverse perspectives, ultimately enhancing the ability to distinguish intra-class and inter-class image features. The NWPU-RESISC45, WHU-RS19, and UC Merced datasets were leveraged for their algorithmic evaluation, using 5-way 1-shot and 5-way 5-shot classification settings. Furthermore, Yuan et al. (2020b) introduced a multiple-attention approach that concurrently focuses on the global and local feature scales as part of their Multi-Attention Deep Earth Mover Distance (MAEMD) network. Local attention captures significant and subtle local features while suppressing others, thereby improving representational learning and mitigating the effects of small inter-class and large intra-class differences. Their approach was evaluated on the UC-Merced, AID, and OPTIMAL-31 (Wang et al. 2018) datasets with 5-way 1-shot, 5-shot, and 10-shot evaluations. As evidence of the success of the local attention strategy, the results showed that the model achieved state-of-the-art performance across all datasets.
Another illustration of an attention-based model is presented by Kim and Chi (2021) with the introduction of the Self-Attention Feature Selection Network (SAFFNet). This model aims to integrate features across multiple scales using a self-attention module, in a similar manner to that of a spatial pyramid network. The Self-Attention Feature Selection (SAFS) module is employed to better match features from the query set with the fused features in the class-specified support set. Experimental analysis was conducted on the UC-Merced, RESISC45, and AID datasets, using a 1-shot and 5-shot classification evaluation approach. The results showed that SAFS was able to improve the performance of the baseline model for all datasets, with the largest improvement seen on the AID dataset.
Incorporating a feature encoder to learn the embedded features of input images as a pre-training step, Huang et al. (2021b) proposed the Task-Adaptive Embedding Network (TAE-Net). To choose the most informative embedded features during the learning task in an adaptive manner, a task-adaptive attention module is employed. By utilizing only limited support samples, the prediction is performed on the query set by the meta-trained network. For their algorithmic evaluation, they employed the NWPU-RESISC45, WHU-RS19, and UC Merced dataset. A 5-way 1-shot and 5-way 5-shot classification approach were implemented for comparison. The results were evaluated based on accuracy, precision, recall, and F1-score metrics. Furthermore, the models were compared in terms of their training time and memory usage.
The study conducted by Wang et al. (2022b) aims to achieve few-shot learning through deep economic networks, which incorporate a two-step simplification process to reduce training parameters and computational costs in deep neural networks by removing redundancy in the input image, channel, and spatial features of the deep layers. In addition, teacher knowledge is utilized to improve classification with limited samples. The last block in the model includes depth- and point-wise convolutions that effectively learn cross-channel interactions and enhance computational efficiency. The algorithmic evaluation of the model is conducted on three datasets, namely UC-Merced, RESISC45, and RSD46-WHU, using a 1-shot and 5-shot approach on RESISC45 and RSD46-WHU and an additional 10-shot evaluation on UC-Merced. The model shows promising performance across all datasets, with the highest accuracy coming from the 10-shot evaluation.
The introduction of the Discriminative Learning of Adaptive Match Network (DLA-MatchNet) for few-shot classification by Li et al. (2020a) incorporated the attention mechanism in the channel and spatial domains to identify the discriminative feature regions in the images through the examination of their inter-channel and inter-spatial relations. To address the challenges posed by large intra-class variance and inter-class similarity, the discriminative features of both the support and query sets were concatenated, and the most relevant pairs of samples were adaptively selected by a matcher implemented as a multi-layer perceptron. The UC-Merced, RESISC45, and WHU-RS19 datasets were employed for the state-of-the-art (SOTA) evaluation, utilizing a 5-way 1-shot and 5-shot approach for all datasets. The results confirmed the superior accuracy of the proposed method over the SOTA, proving its utility for few-shot remote sensing scene classification.
Graph-based methods have also been employed in the very high-resolution (VHR) domain for few-shot learning. In this regard, Jiang et al. (2022) proposed a multi-scale graph-based feature fusion (MGFF) approach that involves a feature construction model that converts typical pixel-based features to graph-based features. Subsequently, a feature fusion model combines the graph features across several scales, which enhances the distinguishing ability of the model via integrating the essential semantic feature information and thereby improving few-shot classification capability. The authors conducted the algorithmic evaluation on the RESISC45 and WHU-RS19 datasets using a 5-way 1-shot and 5-way 5-shot classification approach. In addition, Yuan et al. (2022) proposed the Graph Embedding Smoothness Network (GES-Net), which implements embedded smoothing to regularize the embedded features. This not only effectively extracts higher-order feature relations but also introduces a task-level relational representation that captures graph relations among the nodes at the level of the whole task, thereby enhancing the node relations and feature discerning capabilities of the network. The work is evaluated on the RESISC45, WHU-RS19, and UC Merced datasets using 5-way 1-shot and 5-shot comparison approaches. Episodic training was adopted, where each episode refers to a task and comprises N uniformly sampled categories (without replacement) together with a support and a query set: the support set contains K samples from each of the N categories, and the query set contains a single sample from each. The samples in the query and support sets are randomly selected from a larger pool of available samples, ensuring that each episode is unique.
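The episodic protocol just described is shared by most methods in this review and is easy to state in code. The following Python sketch samples one N-way K-shot episode under the convention above (one query sample per category, as in GES-Net's setup); it assumes each class has at least K + 1 samples, and all names are illustrative.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way=5, k_shot=5, n_query=1, seed=None):
    """Sample one N-way K-shot episode: draw N classes uniformly without
    replacement, then K support and `n_query` query samples per class.

    labels: sequence of integer class ids, indexed by sample position.
    Returns (support_indices, query_indices) into the dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    episode_classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for c in episode_classes:
        picks = rng.sample(by_class[c], k_shot + n_query)
        support.extend(picks[:k_shot])
        query.extend(picks[k_shot:])
    return support, query
```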
Few-shot strategies for VHR image classification can benefit greatly from the application of XAI methods like explainable graph neural networks and attention mechanisms. Transparency, accountability, bias detection, and fairness issues can all be improved with the help of xGNNs because they shed light on the model’s decision-making process. To make the model’s decisions more understandable and transparent, attention mechanisms can draw focus to key features and nodes in the graph. In principle, these methods could make graph-based few-shot classification models more reliable and easy to understand.
Table 4 presents a similar format to Table 3, depicting a comprehensive summary of the aforementioned methods in the very high-resolution (VHR) classification domain. It can be observed that the RESISC45, UC-Merced, and WHU-RS19 datasets are among the most frequently employed for algorithmic comparison, as seen in the subset of the existing works highlighted.
Furthermore, the majority of the approaches presented in Table 4 incorporate attention mechanisms and graph-based methods for few-shot VHR classification. The MGFF approach presented by Jiang et al. (2022) is an example of a graph-based method, while the DLA-MatchNet presented by Li et al. (2020a) is an example of an approach that utilizes attention mechanisms. Similarly, the GES-Net presented by Yuan et al. (2022) also uses graph-based methods in its approach. These techniques aim to extract more relevant and informative features from the input images, which can enhance the performance of few-shot VHR classification systems.
6.3 Few-shot learning in SAR image classification
SAR (Synthetic Aperture Radar) is a remote sensing technology for capturing high-resolution images of the Earth’s surface regardless of the weather conditions, making it a valuable tool in various applications such as agriculture, forestry, and land use management. However, the availability of SAR-based data is often limited in comparison to hyperspectral or VHR-based data, mainly due to the high cost of SAR sensors and the complexity of SAR data processing. As a result, traditional classification approaches for SAR data are often challenged by insufficient training data and the high intra-class variability, which leads to a pressing need for the development of few-shot learning methods that can effectively tackle these challenges. Therefore, a review of emerging few-shot learning methods in the SAR-based classification domain is highly desirable to advance the state-of-the-art and enable more accurate and efficient classification of SAR data.
Tai et al. (2022) proposed a few-shot transfer learning technique for SAR image classification that uses a connection-free attention module to selectively transfer shared features between the SAR and Electro-Optical (EO) image domains, reducing the dependence on additional SAR samples, which may not be obtainable in certain scenarios. The authors also implemented a Bayesian convolutional neural network to update only relevant parameters and discard those with high uncertainties. The evaluation was performed on three EO datasets containing ships, planes, and cars, with SAR images obtained from Hammell (2019), Schwegmann et al. (2017), and MSTAR. The classification accuracy (OA) was used as the performance metric, with the 10-way k-shot approach achieving an OA of approximately 70%, outperforming other approaches. Integrating XAI techniques, such as explainable graph neural networks (xGNNs) and attention visualization, could further enhance this approach: insights into the decision-making process would increase transparency and accountability, which is particularly important for SAR image classification given restricted data access and high acquisition costs, while highlighting the relevant features and nodes would improve the model's interpretability and, ultimately, its trustworthiness.
In the pursuit of effective few-shot classification, Gao et al. (2022) proposed a Hand-crafted Feature Insertion Module (HcFIM), which combines learned CNN features with hand-crafted features via weighted concatenation to incorporate more prior knowledge. Their Multi-scale Feature Fusion Module (MsFFM) aggregates information from different layers and scales, which helps distinguish target samples of the same class more easily. The combination of MsFFM and HcFIM forms their proposed Multi-Feature Fusion Network (MFFN). To tackle the challenge of high inter-class similarity in SAR images, the authors proposed the Weighted Distance Classifier (WDC), which computes class-specific weights for query samples in a data-driven manner guided by the Euclidean distance, and incorporated a weight-generation loss to guide the weight-generation process. The benchmark MSTAR dataset and their proposed Vehicles and Aircraft (VA) dataset were used for evaluation, with a 4-way 5-shot setting for MSTAR and a 4-way 1-shot setting for VA, and the Average Accuracy (AA) as the evaluation metric throughout. The evaluation results demonstrated a higher AA on the VA dataset than on MSTAR, indicating that the former is better suited for fine-grained classification tasks.
In the study by Fu et al. (2021), a novel approach to few-shot classification is introduced through the integration of meta-learning. This method is characterized by the synergistic use of two primary components: a meta-learner and a base-learner. The meta-learner’s primary function is to determine and store the learning rates, along with generalized parameters pertinent to both the feature extractor and classifier. Its objective is to discern an optimal initialization parameter, thereby refining update strategies by meticulously examining the distribution of few-shot tasks. This optimal initialization is instrumental in setting the algorithm on a path that potentially accelerates convergence and improves performance. Following this, the meta-learner plays a pivotal role in directing the base-learner. Here, the base-learner is conceptualized as a classifier model specifically tailored for SAR-based target detection. Its design ensures enhanced convergence efficiency under the guidance of the meta-learner.
Recognizing the challenges posed by more complex tasks, the study further augments its methodology with a hard-task mining technique, which emphasizes and addresses tasks that are inherently more challenging. For the acquisition of transferable knowledge, a crucial aspect of few-shot learning, the 4CONV network is employed during the meta-training phase. The efficacy of this approach, termed MSAR in the publication, was tested on two datasets: the MSTAR dataset and the newly proposed NIST-SAR dataset. Evaluations were carried out using both the 5-way 1-shot and 5-way 5-shot paradigms, with the average accuracy (AA) serving as the benchmark metric. The MSAR method surpassed the baseline methods on both tasks, achieving an AA of 86.2% for the 5-way 1-shot task and 97.5% for the 5-way 5-shot task.
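To clarify the meta-learner/base-learner interplay described above, the sketch below implements one meta-update of a simplified first-order MAML-style procedure in TensorFlow. It is a generic illustration rather than the MSAR implementation: it assumes `model` outputs logits, each task supplies a support batch and a query batch, and `meta_opt` is any Keras optimizer.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def maml_outer_step(model, tasks, meta_opt, inner_lr=0.01):
    """One first-order MAML-style meta-update. `tasks` is a list of
    ((xs, ys), (xq, yq)) support/query batch pairs."""
    meta_grads = [tf.zeros_like(w) for w in model.trainable_weights]
    for (xs, ys), (xq, yq) in tasks:
        start = [tf.identity(w) for w in model.trainable_weights]
        # Inner loop: the base-learner adapts to the support set
        with tf.GradientTape() as tape:
            inner_loss = loss_fn(ys, model(xs, training=True))
        for w, g in zip(model.trainable_weights,
                        tape.gradient(inner_loss, model.trainable_weights)):
            w.assign_sub(inner_lr * g)
        # Outer loss: the adapted parameters are judged on the query set
        with tf.GradientTape() as tape:
            outer_loss = loss_fn(yq, model(xq, training=True))
        for i, g in enumerate(tape.gradient(outer_loss,
                                            model.trainable_weights)):
            meta_grads[i] += g / len(tasks)
        # Restore the shared initialization before the next task
        for w, s in zip(model.trainable_weights, start):
            w.assign(s)
    meta_opt.apply_gradients(zip(meta_grads, model.trainable_weights))
```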
The paper by Rostami et al. (2019) proposed a novel few-shot cross-domain transfer learning approach to transfer knowledge from the electro-optical (EO) domain to the synthetic aperture radar (SAR) domain. This is accomplished by utilizing an encoder in each domain to extract and embed individual features into a shared embedded space. The encoded parameters are updated continuously by minimizing the discrepancies in the marginal probability distributions between the two embedded domains. Since the distributions are generally unknown in few-shot learning, the authors approximate the optimal transport discrepancy measurement metric using the Sliced Wasserstein Distance (SWD) for more efficient computation. The approach is evaluated on a dataset of SAR images acquired by Schwegmann et al. (2016) for detecting the presence or absence of ships. The classification accuracy (OA) is used as the evaluation metric for this approach. The results show that this approach can achieve an OA of over 90%, indicating that it is a reliable and accurate method for ship detection.
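The SWD sidesteps an intractable high-dimensional optimal transport problem by projecting both embedding sets onto random one-dimensional directions, where the Wasserstein distance has a closed form via sorting. The following Monte-Carlo sketch assumes two equally sized embedding sets; the function name and defaults are illustrative, not taken from Rostami et al. (2019).

```python
import numpy as np

def sliced_wasserstein_sq(X, Y, n_projections=50, seed=0):
    """Approximate squared Sliced Wasserstein Distance between two
    embedded sample sets X (n, d) and Y (n, d) of equal size."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)          # random unit direction
        # In 1-D, optimal transport simply matches sorted samples
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((px - py) ** 2)
    return total / n_projections
```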
In their study on few-shot ship recognition using the MSTAR dataset, Wang et al. (2022c) proposed a Deep Kernel Learning (DKL) approach that harnesses the non-parametric adaptability of Gaussian Processes (GP). The kernel function used in their approach is mathematically defined in (15) as a Gaussian kernel

\(k(x, \bar{x}) = \exp\left(-\frac{\Vert x - \bar{x}\Vert^{2}}{2l^{2}}\right)\)    (15)

which determines the similarity between pairs of embedded data samples \(x\) and \(\bar{x}\), with \(l\) serving as a hyperparameter characterizing the length scale. The DKL approach integrates such kernel functions with deep neural networks to enable effective few-shot classification. For K-shot C-way few-shot classification, the authors trained GPs on the C categories, where the ith GP is trained on positive samples from class \(C_{i}\) and negative samples from the remaining \(C-1\) classes. The class label is then assigned by the GP with the highest confidence, computed via the log-likelihood. They evaluated their approach using 1-shot and 5-shot classification, with the classification accuracy (OA) as the metric.
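To make the kernel in (15) concrete, the following NumPy sketch computes the Gaussian kernel matrix between two sets of embedded samples. The function name and array shapes are illustrative, not taken from Wang et al. (2022c).

```python
import numpy as np

def gaussian_kernel(X, X_bar, length_scale=1.0):
    """Gaussian (RBF) kernel matrix, as in eq. (15), between embeddings.

    X:     (n, d) query embeddings
    X_bar: (m, d) support embeddings
    Returns an (n, m) matrix of pairwise similarities."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(X_bar**2, axis=1)[None, :]
        - 2.0 * X @ X_bar.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against round-off
    return np.exp(-sq_dists / (2.0 * length_scale**2))
```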
Graph-based learning methods have gained popularity in SAR image classification, as in hyperspectral and VHR image classification. To learn feature similarity between query images and support samples more effectively using graphs, Yang et al. (2020) proposed a relation network that combines an embedding network for feature extraction with an attention-based Graph Neural Network (GNN) serving as the metric network (Sung et al. 2018); the channel attention module of CBAM is incorporated into the GNN. MSTAR is utilized for evaluation with a 5-way 1-shot comparison and classification accuracy (OA) as the metric. In addition, a Mixed-loss Graph Attention Network (MGA-Net) was proposed, which utilizes a multi-layer GAT combined with mixed-loss (embedding loss and classification loss) training to increase inter-class separability and speed up convergence. The MSTAR and OpenSARShip datasets were used for comparison, with 3-way 1-shot and 3-way 5-shot classification evaluations reported via the classification accuracy (OA) and confusion matrices. The results showed that MGA-Net achieved better performance than the baseline models on both datasets, indicating that the multi-layer GAT and mixed-loss training had a positive effect on classification accuracy.
Recently, Zhao et al. (2022) proposed an instance-aware transformer (IAT) model for few-shot synthetic aperture radar automatic target recognition (SAR-ATR), recognizing that modeling relationships between query and support images is critical in this setting. The IAT leverages transformers and attention to aggregate relevant support features for each query image, constructing attention maps based on similarities between query and support features to exploit information from all instances. Shared cross-transformer modules align query and support features, and an instance cosine distance applied during training pulls same-class instances closer to improve compactness. Experiments on few-shot SAR-ATR datasets show that IAT outperforms state-of-the-art methods, and visualizations demonstrate improved intra-class compactness and inter-class separation. This highlights the potential of transformers and attention for few-shot SAR classification by effectively relating queries to supports and learning discriminative alignments.
CNNs have been dominant for SAR-ATR but struggle with limited training data. To address this, Wang et al. (2022d) proposed a convolutional transformer (ConvT) architecture tailored for few-shot SAR ATR, recognizing that CNNs are hindered by narrow receptive fields and an inability to capture global dependencies in few-shot scenarios. ConvT constructs hierarchical features and models global relationships of local features at each layer for a more robust representation. A hybrid loss function based on recognition labels and contrastive image pairs provides sufficient supervision from limited data, and auto augmentation further enhances diversity while reducing overfitting. Without needing additional datasets, ConvT achieves state-of-the-art few-shot SAR ATR performance on MSTAR by effectively combining transformers with CNNs, demonstrating that transformers can overcome CNN limitations for few-shot SAR classification by integrating local and global dependencies within and across layers.
Table 5 provides an overview of the various few-shot learning methods that have been proposed for SAR classification, summarizing the key aspects of each approach, including the name of the method, the year of publication, the dataset used for evaluation, and the evaluation metric used. It is noteworthy that, among the subset of existing works described in this review, the MSTAR dataset is the most commonly used for algorithmic comparisons, owing to its relatively large size and the diversity of target types it contains. Overall, the methods discussed in Table 5 highlight the potential of few-shot learning approaches in the SAR domain and demonstrate the effectiveness of techniques such as graph-based learning, deep kernel learning, and meta-learning. These approaches can enable more efficient and accurate classification of SAR data, with important applications in fields such as remote sensing, surveillance, and defense. XAI techniques are also particularly valuable for SAR target recognition: because researchers usually have limited access to radar data and acquiring new radar imagery is expensive, techniques such as explainable graph neural networks and attention mechanisms help make the most of the available data.
7 Few-shot based object detection and segmentation in remote sensing
In the remote sensing domain, much of the focus has been on image classification tasks like land cover mapping. However, it is also essential to advance higher-level vision tasks like object detection and semantic segmentation, which extract richer information from imagery. For example, object detection can precisely localize and identify vehicles, buildings, and other entities within a scene. Meanwhile, segmentation can delineate land, vegetation, infrastructure, and water boundaries at the pixel level. Significant progress has been made in developing and evaluating object detection and segmentation techniques for remote sensing data. Various benchmarks and competitions have been organized using large-scale satellite and aerial datasets (Li et al. 2020b; Han et al. 2018). State-of-the-art deep learning models like R-CNNs, SSDs, and Mask R-CNNs (Su et al. 2019) have shown strong performance. However, many of these rely on extensive annotated training data, which can be costly and time-consuming to collect across diverse geographical areas. Therefore, advancing object detection and segmentation in remote sensing using limited supervision remains an open challenge. Few-shot learning offers a promising approach to enable effective generalization from scarce training examples. While some initial work has explored object detection for aerial images (Zhu et al. 2015; Yao et al. 2019; Mundhenk et al. 2016; Zhang et al. 2019; Lu et al. 2019), a comprehensive survey incorporating the latest advancements is still lacking. Furthermore, few-shot semantic segmentation has received relatively little attention for remote sensing thus far.
7.1 Few-shot object detection in remote sensing
The main challenge in few-shot object detection is to design a model that can generalize well from a small number of examples (Li et al. 2021a; Wolf et al. 2021; Cheng et al. 2021a; Gao et al. 2021). This is typically achieved by leveraging prior knowledge learned from a large number of examples from different classes (known as base classes). The model is then fine-tuned on a few examples from the new classes (known as novel classes) (Jeune and Mokraoui 2023).
There are various methods used in few-shot object detection, including metric learning, meta-learning, and data augmentation methods (Xiao et al. 2021; Li et al. 2022a; Wang et al. 2022e). Metric learning methods aim to learn a distance function that can measure the similarity between objects. Meta-learning methods aim to learn a model that can quickly adapt to new tasks from a few training examples with the help of information from other domains (Zhang et al. 2023). Data augmentation methods aim to generate more training examples by applying transformations to the existing examples (Liu et al. 2023). A more comprehensive analysis of aerial image-based FSOD is summarized in Table 6.
Explainability in few-shot object detection refers to the ability to understand and interpret the decisions made by the model, which is important for verifying the correctness of the model's predictions and for gaining insights into its behavior. Explainability can be achieved by visualizing the attention maps of the model, which show which parts of the image the model is focusing on when making a prediction. Other methods include saliency maps (Petsiuk et al. 2021), which highlight the most important pixels for a prediction, and decision trees, which provide a simple and interpretable representation of the model's decision process (Hu et al. 2023). Overall, few-shot object detection methods have shown promising results in detecting novel objects in aerial images with limited annotated samples, with mechanisms such as feature attention highlighting and two-phase training schemes contributing to effectiveness and adaptability in few-shot scenarios. However, challenges remain, such as the performance discrepancy between aerial and natural images and the confusion between some classes. Future research should focus on developing more versatile few-shot object detection techniques that can handle small, medium, and large objects effectively, while providing more interpretable and explainable results.
7.2 FSOD benchmark datasets for aerial remote sensing images
-
NWPU VHR contains 10 categories, with three chosen as novel classes. Researchers commonly use a partition that involves base training on images without novel objects and fine-tuning on a set with \(k\) annotated boxes (where \(k\) is 1, 2, 3, 5, or 10) for each novel class. The test set has about 300 images, each containing at least one novel object.
-
DIOR dataset features 20 classes and over 23,000 images. Five categories are designated as novel, and various few-shot learning approaches are applied. Fine-tuning is performed with \(k\) annotated boxes (where \(k\) can be 3, 5, 10, 20, or 30) for each novel class, and performance is evaluated on a comprehensive validation set.
-
RSOD dataset consists of four classes. One class is randomly selected as the novel class, with the remaining three as base classes. During base training, 60% of the samples for each base class are used for training and the rest for testing. Fine-tuning is performed on \(k\) annotated boxes in the novel classes, where \(k\) can be 1, 2, 3, 5, or 10.
-
iSAID dataset features 15 classes and employs three distinct base/novel splits designed according to data characteristics. Each split focuses on a different aspect, such as object size or variance in appearance; the third split specifically selects the six least frequent classes as novel. Base training uses all objects from base classes, and fine-tuning utilizes 10, 50, or 100 annotated boxes per class.
-
DOTA dataset features an increase from 15 to 18 categories and a nearly tenfold expansion to 1.79 million instances. It has two base/novel class splits, with three classes designated as novel. During episode construction, the number of shots for novel classes varies over 1, 3, 5, and 10.
-
DAN dataset is an amalgamation of DOTA and NWPU VHR datasets, comprising 15 categories. It designates three classes as novel, with the remaining as base classes.
-
xBD dataset is designed for detecting building damage and hence identifying the disaster that caused it via satellite imagery. It comprises pre- and post-disaster imagery of the affected areas for comparison, and the types of natural disaster covered include hurricanes, floods, and fires. In the work by Bowman and Yang (2021), data from xBD was divided into three subsets: one for images captured pre-event (for disasters in general), one for images captured post-event (also for disasters in general), and one for images captured specifically after a tornado. Since the aftermath images of a tornado are found only in the last subset, its training data is disjoint from the other two training subsets, making few-shot training attainable.
-
HRSC2016 + REMEX-FSSD is a combination of datasets proposed by Zhang et al. (2022) for few-shot ship detection using satellite imagery. This combination is distinct from the aforementioned datasets in that it opens up the feasibility of few-shot object detection in the maritime domain. The motivation stems from the large intra-class diversity and inter-class similarity among different ships, as well as the scarcity of training images for different ship types (more so when the ships are new). For the HRSC2016 dataset, 15 classes are selected, of which 5 are categorized as novel classes. For REMEX-FSSD, the destroyer class is categorized as the novel class.
7.3 Few-shot image segmentation in remote sensing
Few-shot image segmentation (FSIS) is a challenging task in computer vision, particularly in the context of aerial images. This task aims to segment objects in images with only a few labeled examples, which is especially important due to the high cost of collecting labeled data in the domain of aerial images. Recent advancements in few-shot image segmentation have been driven by deep learning techniques, which have shown promising results in various computer vision tasks. Metric-based meta-learning models, such as Siamese networks and prototype networks, have been widely used in few-shot segmentation (Yao et al. 2021; Chen et al. 2022). These models learn to compare the similarity between support and query images and use this information to segment novel classes (Cao et al. 2023).
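A common building block of such prototype-based segmentation models is masked average pooling: the support feature map is averaged over the pixels of the target class to form a prototype, and each query pixel is then labeled by its similarity to that prototype. The sketch below is a generic illustration under stated assumptions (pre-extracted feature maps, a single foreground class, an illustrative similarity threshold), not the implementation of any specific method reviewed here.

```python
import numpy as np

def masked_average_prototype(support_feat, support_mask):
    """Class prototype via masked average pooling.

    support_feat: (H, W, d) feature map of a support image
    support_mask: (H, W) binary mask of the target class"""
    mask = support_mask[..., None].astype(support_feat.dtype)
    return (support_feat * mask).sum(axis=(0, 1)) / (mask.sum() + 1e-8)

def segment_query(query_feat, prototype, threshold=0.5):
    """Label each query pixel by cosine similarity to the prototype."""
    q = query_feat / (np.linalg.norm(query_feat, axis=-1, keepdims=True) + 1e-8)
    p = prototype / (np.linalg.norm(prototype) + 1e-8)
    sim = q @ p                         # (H, W) cosine similarity map
    return (sim > threshold).astype(np.uint8), sim
```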
Another common approach in few-shot image segmentation is to adapt convolutional neural networks (CNNs), which have shown great success in generic image segmentation, to few-shot scenarios (Zhang et al. 2020). Researchers have explored different architectures and training strategies to improve CNN performance in this setting, and meta-learning, which involves training a model to learn how to learn, has also been applied with promising results (Zhang et al. 2020). Meta-learning algorithms aim to extract meta-knowledge from a set of tasks and use this knowledge to quickly adapt to new tasks with only a few labeled examples.

Few-shot image segmentation in aerial images has various potential applications. One is urban planning, where it can be used to identify and segment different types of buildings, roads, and other urban infrastructure (Puthumanaillam and Verma 2023; Lang et al. 2023a, b). Another is land-use and land-cover determination, where it can classify different types of land cover, such as forests, agricultural land, and water bodies. It can also be used in environmental monitoring and climate modeling to analyze changes in vegetation cover, water resources, and other environmental factors. In the field of wildfire recognition, detection, and segmentation, deep learning models have shown great potential (Ghali and Akhloufi 2023): they have been successfully applied to aerial and ground images to classify wildfires, detect their presence, and segment fire regions, with architectures ranging from CNNs and one-stage detectors (such as YOLO) to two-stage detectors (such as Faster R-CNN) and encoder-decoder models (such as U-Net and DeepLab).

In the context of UAV images, a framework has been proposed for removing spatiotemporal objects from UAV images before generating the orthomosaic. The framework consists of two main processes: image segmentation and image inpainting. Segmentation is performed using the Mask R-CNN algorithm, which detects and segments vehicles in the UAV images; the segmented areas are then masked for removal. Inpainting is carried out using the large mask inpainting (LaMa) method, a deep learning-based technique that reconstructs damaged or missing parts of an image (Park et al. 2022). A more extensive examination of aerial image-based FSIS can be found in Table 7.
Work on explainability in few-shot image segmentation, particularly in the context of remote sensing aerial images, has focused on the development of novel models and techniques that enhance segmentation performance while providing insight into the decision-making process of the models. One such advancement is the Self-Enhanced Mixed Attention Network (SEMANet) proposed by Song et al. (2023), which utilizes three-modal (Visible-Depth-Thermal) images for few-shot semantic segmentation tasks. The model consists of a backbone network, a self-enhanced (SE) module, and a mixed attention (MA) module: the SE module enhances the features of each modality by amplifying the differences between foreground and background features and strengthening weak connections, while the MA module fuses the three-modal features to obtain a better feature representation. Another advancement is the combination of a self-supervised background learner and contrastive representation learning to improve the performance of few-shot segmentation models (Cao et al. 2023). The self-supervised background learner mines the features of non-target classes in the background to learn latent background features, while the contrastive component learns general features between categories. This approach has shown potential for enhancing the performance and generalization ability of few-shot segmentation models. Nevertheless, open problems remain, such as handling performance differences caused by intra-class confusion and building models that are both interpretable and accurate. Future research should focus on flexible few-shot segmentation approaches based on lightweight models, with greater interpretability for each component and the ability to generalize across domains.
7.3.1 Few-shot image segmentation benchmark datasets for remote sensing aerial images
-
iSAID is a large-scale dataset for instance segmentation in aerial images. It contains 2,806 high-resolution images with annotations for 655,451 instances across 15 categories.
-
Vaihingen consists of true orthophoto (TOP) images captured over the town of Vaihingen an der Enz, Germany. The images have a spatial resolution of 9 cm, which is quite high compared to many other aerial image datasets. The dataset also includes corresponding ground truth data, which provides pixel-wise annotations for six classes: impervious surfaces (such as roads), buildings, low vegetation (such as grass), trees, cars, and clutter/background.
-
DLRSD contains images where the label data of each image is a pixel-level segmentation map, from which the multi-label annotation of the image is extracted. DLRSD offers rich annotation information, with 17 categories and corresponding label IDs.
-
SARShip-4i contains ship images in the SAR domain that can be utilized for few-shot segmentation, as performed by Li et al. (2023). The dataset comprises 139 high-resolution SAR images of ships from 14 regions, with resolutions ranging from 0.3 to 3 m.
8 Discussions
In this section, we aim to highlight interesting observations, common trends, and potential research gaps based on the in-depth analysis of the existing few-shot classification techniques across the three remote sensing data domains. The insights discussed in this section can serve as a guide for both current and future researchers in this field.
-
Most of the methods described in the literature use different feature extraction models, with CNN-based models often serving as the backbone, as discussed above. Convolution-based few-shot learning models remain popular for classification tasks in all three domains; they can quickly adapt to new classes with few training examples, making them suitable for real-world applications. However, graph-based methods are becoming more popular for classifying SAR images and have only recently been applied to VHR images. Graph-based methods are advantageous because they capture the spatial relationships between objects, which is essential for classifying SAR and VHR images. Recently, vision transformer-based and incremental learning-based methods have emerged as alternatives for hyperspectral image classification; these have shown promise in achieving high accuracy with minimal training data, making them attractive where labeled data is limited.
-
The evaluation of the discussed works in hyperspectral image classification generally employs three commonly utilized metrics: overall accuracy (OA), average accuracy (AA), and the kappa coefficient (\(\kappa\)); a short sketch of how these metrics are computed follows this list. In contrast, for VHR and SAR-based image classification, the classification accuracy (OA) is often utilized as the primary evaluation metric, with a few exceptions. Moreover, in most of the evaluation strategies adopted by researchers, the proposed algorithms are run multiple times alongside the state-of-the-art (SOTA) techniques, and the corresponding mean accuracy and its standard deviation are reported. This provides a more reliable and robust estimate of classification performance, accounting for variations in the results across runs.
-
In contrast to hyperspectral classification, it has been observed that there are currently few or no ViT-based few-shot classification methods proposed for SAR and VHR images. This could be attributed to the challenges associated with acquiring sufficient datasets for implementing effective and accurate ViT-based architectures for SAR images. Similarly, for VHR images, although there are existing models that use ViT-based classification, they are non-few-shot approaches, such as the vanilla ViT-based model proposed by Zhang et al. (2021). Consequently, there are considerable opportunities for researchers to explore the potential of few-shot ViT-based approaches for addressing the challenges associated with VHR remote sensing data classification.
-
The current state of research on few-shot classification approaches in the field of remote sensing does not seem to include much work on UAV or low-altitude aircraft-based images, as far as current knowledge suggests. This may be due to the unique nature of such images, which have been pointed out in previous studies such as Gao et al. (2021). The differences in object sizes and perspectives, as well as the limited computational resources available for UAV-based operations, may be contributing factors to the scarcity of research in this area. In addition, the relatively smaller size of the UAV-based datasets may have posed challenges for few-shot learning methods, which often require a sufficiently large dataset to learn meaningful feature representations. However, with the increasing availability of UAV-based data, there may be opportunities for developing novel few-shot classification methods that can effectively leverage such data.
-
Furthermore, while few-shot learning has been studied extensively in the context of supervised classification, there is also potential for exploring its application in other remote sensing tasks such as unsupervised or semi-supervised learning, object detection, and semantic segmentation. Few-shot learning can provide an effective means of leveraging limited labeled data in these tasks, which can potentially lead to more accurate and efficient algorithms for remote sensing applications. Overall, while significant progress has been made in the application of few-shot learning to remote sensing data, there are still many research gaps and opportunities for further investigation. The exploration of new few-shot learning approaches, as well as the extension of existing methods to new applications and domains, can lead to more accurate and efficient algorithms for remote sensing tasks.
-
The utilization of XAI methodologies in conjunction with few-shot learning models for remote sensing applications can considerably enhance the interpretability of such models, thereby increasing their applicability in domains that are sensitive to potential risks. However, despite the significant promise held by XAI for few-shot learning in remote sensing, the current body of research in this field remains relatively nascent and further endeavors are necessary to fully realize its potential benefits.
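Returning to the evaluation metrics highlighted earlier in this list, the sketch below shows how OA, AA, and \(\kappa\) are typically computed from a confusion matrix. This is a standard computation included for completeness; the function name is illustrative.

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA and Cohen's kappa from a confusion matrix `conf`,
    where conf[i, j] counts samples of true class i predicted as j."""
    n = conf.sum()
    oa = np.trace(conf) / n
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))  # mean per-class recall
    # Chance agreement from the row and column marginals
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n**2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```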
8.1 Computational considerations in few-shot learning
Few-shot learning, as a niche within the broader domain of machine learning, warrants unique computational requirements. These requirements become particularly pertinent when the applications have real-time constraints. One of the most critical real-time applications lies in disaster monitoring using UAVs. The immediacy of feedback in such scenarios can drastically affect outcomes, emphasizing the significance of processing time.
Deep learning, which forms the foundation for many few-shot learning techniques, inherently demands high computational resources. Techniques such as CNNs are notorious for their computational intensity during both the training and inference phases. This computational cost can sometimes be a bottleneck, especially when rapid responses are essential. However, the evolving landscape of few-shot learning has seen the emergence of strategies aiming to mitigate these computational challenges:
-
Meta-learning, exemplified by approaches like MAML (Finn et al. 2017), offers an innovative solution. By optimizing model parameters to allow swift adaptation to novel tasks, these methods significantly reduce the computational overhead. This ensures that models can be fine-tuned efficiently, even when faced with new datasets.
-
Wang et al.’s (2022b) proposition of employing lightweight model architectures coupled with knowledge distillation techniques emerges as another viable strategy. By minimizing redundancies and unnecessary parameters, these models are streamlined to be more computationally efficient without compromising their predictive power.
-
Graph-based methodologies, such as GraphSAGE (Yang et al. 2020), and further extensions into GNN-based approaches (Yang et al. 2020), provide alternatives to traditional CNNs. These methods, in certain dataset contexts, have demonstrated reduced computational complexity, making them attractive options.
Despite these advancements, it is noteworthy that a significant portion of few-shot learning methodologies has not been explicitly tailored for optimizing processing time. Recognizing this gap, future research could pivot towards crafting architectures specifically designed for real-time UAV applications. Several avenues could be pursued to enhance computational efficiency. These include embracing model compression techniques, such as pruning and quantization (Han et al. 2015), leveraging efficient neural architecture search methods (Pham et al. 2018), and exploring hardware-software co-design strategies (Ham et al. 2021) to fine-tune models for particular computational platforms. In all these endeavors, the overarching goal remains consistent: achieving rapid inference times without sacrificing model accuracy.
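As a simple illustration of the model compression direction mentioned above, the sketch below performs global magnitude pruning, zeroing the smallest-magnitude fraction of weights. Practical pipelines in the spirit of Han et al. (2015) interleave such pruning with retraining and combine it with quantization, both omitted here; the function is a toy example with illustrative names.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with the smallest
    magnitude, computed over all layers jointly (global pruning).

    weights: list of NumPy arrays (one per layer).
    Returns a pruned copy of the list."""
    flat = np.concatenate([np.abs(w).ravel() for w in weights])
    threshold = np.quantile(flat, sparsity)
    return [np.where(np.abs(w) < threshold, 0.0, w) for w in weights]
```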
9 Numerical experimentation of few-shot classification on UAV-based and satellite-based datasets
To address point 4 in the discussion section, several state-of-the-art (SOTA) few-shot methods were employed to classify disaster scenes using the publicly available AIDER subset dataset. The evaluation covered the Siamese and Triplet Networks, ProtoNet (Snell et al. 2017), Relation Network (Sung et al. 2018), Matching Network (Vinyals et al. 2016), SimpleShot (Wang et al. 2019), TAsk-Dependent Adaptive Metric (TADAM) (Oreshkin et al. 2018), MAML (Finn et al. 2017), Meta-Transfer Learning (MTL) (Sun et al. 2019), and Label Hallucination (Jian and Torresani 2022), all of which were originally proposed and evaluated on non-remote sensing datasets. The aim of the study was to evaluate the effectiveness of these methods in the remote sensing setting. To compare the results obtained on this dataset against those of satellite-based remote sensing image classification, we compared our findings with the methods used in the UC-Merced evaluation performed by Huang et al. (2021b); for the methods not listed there, we ran the simulations under the experimental conditions stipulated by Huang et al. (2021b).
We conducted a 5-way 1-shot and 5-way 5-shot classification evaluation. The AIDER subset consists of 6433 images in 5 categories, namely collapsed buildings, fires, floods, traffic accidents, and normal (non-disaster) scenes, containing 511, 521, 526, and 485 images, respectively, with the remaining 4390 images belonging to the normal class. The subset is therefore imbalanced, with more images in the normal class than in any disaster class, highlighting the potential benefits of few-shot learning approaches, as mentioned in previous sections. Table 8 depicts the train-valid-test split ratio adopted for each class. All images are cropped to 224 \(\times\) 224 pixels and pre-processed by dividing each original pixel value by 255. The learning rate for each algorithm is set to 0.001. ResNet12 is chosen as the feature extraction backbone for TADAM, ProtoNet, Matching Network, Relation Network, SimpleShot, MTL, and Label Hallucination. A common categorical cross-entropy loss is used for all methods except the Relation Network, which uses a mean-squared error loss. To tackle the class imbalance, the training and validation samples were under-sampled using the RandomUnderSampler module from the imblearn library, as illustrated in the sketch below. All simulations on this dataset were run for 200 epochs using the TensorFlow Keras library in Python, on the Google Colab Pro+ platform with Tesla A100, V100, and T4 Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs).
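As a minimal sketch of the under-sampling step, resampling indices rather than flattened image tensors keeps memory usage low; `X_train` and `y_train` are placeholder names for the AIDER training arrays and labels, and the random seed is illustrative.

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

# imblearn expects a 2-D feature matrix, so resample sample indices
# reshaped to (n, 1) instead of the full (n, 224, 224, 3) image array.
idx = np.arange(len(y_train)).reshape(-1, 1)
idx_res, y_res = RandomUnderSampler(random_state=0).fit_resample(idx, y_train)
X_res = X_train[idx_res.ravel()]  # balanced training images
```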
For the UC-Merced dataset, apart from the features mentioned in section 4.2, 10 classes are used as the base training set, 5 classes are set aside as the validation set, and the remaining 6 classes serve as the novel test set. In line with Huang et al. (2021b), all images are cropped to 84 \(\times\) 84 pixels for feature extraction with their proposed feature encoder, with the momentum factor set to 0.1 and the learning rate to 0.001. Because all UC-Merced classes contain equal numbers of samples, no class-imbalance handling is needed. Once again, categorical cross-entropy is used as the loss function, except for the Relation Network, which uses a mean-squared error loss.
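For clarity, the N-way K-shot episodic protocol used on both datasets can be summarized by a small sampling routine. The following is a minimal sketch, assuming `images_by_class` maps each class name to an array of pre-processed images; the function name and the 15-query default are illustrative rather than taken from the cited works.

```python
import numpy as np

def sample_episode(images_by_class, n_way=5, k_shot=1, n_query=15, seed=None):
    """Draw one N-way K-shot episode: a support set and a query set."""
    rng = np.random.default_rng(seed)
    classes = rng.choice(sorted(images_by_class), size=n_way, replace=False)
    support, query = [], []
    for label, cls in enumerate(classes):
        order = rng.permutation(len(images_by_class[cls]))
        for i in order[:k_shot]:                    # K labeled shots per class
            support.append((images_by_class[cls][i], label))
        for i in order[k_shot:k_shot + n_query]:    # held-out query images
            query.append((images_by_class[cls][i], label))
    return support, query
```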
Table 9 presents the results of the simulations carried out on the AIDER subset using the few-shot evaluation approach mentioned earlier, and Table 10 presents the corresponding results on the UC-Merced dataset. The mean accuracy and standard deviation over 10 runs are reported for each method. For the Siamese and Triplet networks, results are only reported for the 5-way 1-shot evaluation, as only one pair of images is compared per episodic training step (for the Triplet network, the anchor image is compared with the positive and the negative image one at a time, so one pair of images is still considered). The mean accuracy for the 5-way 5-shot approach is generally higher than that of the 5-way 1-shot approach for all methods on both datasets, in agreement with the earlier statement about the difficulty of few-shot learning with fewer shots. The Siamese network outperformed both the Triplet network and ProtoNet, demonstrating its effectiveness in feature extraction and embedding. Consistent with the trend observed in a previous study, MTL outperformed TADAM and ProtoNet on both the AIDER subset and UC-Merced, while Label Hallucination yielded the highest performance on the AIDER subset, with an accuracy above 81%.
10 Explainable AI (XAI) in remote sensing
XAI has become an increasingly crucial area of research and development in the field of remote sensing. As deep learning and other complex black-box models gain popularity for analysis of remote sensing data, there is a growing need to provide transparent and understandable explanations for how these models arrive at their predictions and decisions. Within remote sensing, explainability takes on heightened importance because model outputs often directly inform real-world actions with major consequences. For example, models identifying at-risk areas for natural disasters, pollution, or disease outbreaks can drive evacuations, remediation efforts, and public health interventions. If the reasoning behind these model outputs is unclear, stakeholders are less likely to trust and act upon the model’s recommendations.
To address these concerns, XAI techniques in remote sensing aim to shed light on what happens inside the black box (Mohan and Peeples 2023). Explanations can highlight which input features and patterns drive particular model outputs (Wang et al. 2019). Visualizations can illustrate a model's step-by-step logic (Arrieta et al. 2020). Uncertainty estimates can convey when a model is likely to be incorrect or unreliable (Ling and Templeton 2015). Prototypes and case studies have shown promise for increasing trust in and adoption of AI models for remote sensing applications ranging from climate monitoring to precision agriculture (Shaikh et al. 2022). As remote sensors continue producing ever-larger and more complex datasets, the role of XAI will likely continue growing in importance. With thoughtful XAI implementations, developers can enable deep learning models not only to make accurate predictions from remote sensing data, but also to provide the transparency and justification required for stakeholders to confidently use these tools for critical real-world decision making. Recent approaches to XAI in remote sensing are outlined below.
One notable development is the “What I Know” (WIK) method, which verifies the reliability of deep learning models by providing examples of similar instances from the training dataset to explain each new inference (Ishikawa et al. 2023). This technique demonstrates how the model arrived at its predictions.
XAI has also been applied to track the spread of infectious diseases like COVID-19 using remote sensing data (Temenos et al. 2022). By explaining disease prediction models, XAI enables greater trust and transparency. Additionally, XAI techniques have been used for climate adaptation monitoring in smart cities, where satellite imagery helps extract indicators of land use and environmental change (Sirmacek and Vinuesa 2022).
Several specific XAI methods show promise for remote sensing tasks, including Local Interpretable Model-Agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), and Gradient-weighted Class Activation Mapping (Grad-CAM) (Ishikawa et al. 2023). These methods highlight influential input features and image regions that led to a model’s outputs. Grad-CAM produces visual heatmaps to indicate critical areas in an input image for each inference made by a convolutional neural network.
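As an illustration, the following is a minimal sketch of Grad-CAM for a Keras CNN. The `layer_name` argument would typically name the last convolutional layer, and the helper itself is ours rather than part of any library; it follows the standard Grad-CAM recipe of weighting feature maps by spatially averaged gradients.

```python
import tensorflow as tf

def grad_cam(model, image, layer_name):
    """Grad-CAM heatmap for a single image and the model's predicted class."""
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])   # add batch dimension
        class_idx = int(tf.argmax(preds[0]))             # explain the top class
        score = preds[:, class_idx]
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))         # channel importances
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)
    cam = tf.nn.relu(cam)                                # keep positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()   # normalized heatmap
```

Overlaying the returned heatmap on the input image reveals which regions, such as a flooded area or a collapsed structure, most influenced the prediction.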
However, some challenges remain in fully integrating XAI into remote sensing frameworks. Practical difficulties exist in collecting labeled training data, extracting meaningful features, selecting appropriate models, ensuring generalization, and building reproducible and maintainable systems (Sirmacek and Vinuesa 2022). There are also inherent uncertainties in modeling complex scientific processes like climate change that limit the interpretability of model predictions (Sirmacek and Vinuesa 2022). Furthermore, the types of explanations provided by current XAI methods do not always match human modes of reasoning and explanation (Gevaert 2022). Despite the challenges, XAI methods hold promise for enhancing few-shot learning approaches in remote sensing. Few-shot learning aims to learn new concepts from very few labeled examples, which is important in remote sensing where labeled data is scarce across the diversity of land cover types. However, the complexity of few-shot learning models makes their predictions difficult to interpret.
10.1 XAI in few-shot learning for remote sensing
Most XAI methods for classification tasks are post-hoc and therefore cannot be incorporated into the model structure during training. Back-propagation-based (Chattopadhay et al. 2018; Selvaraju et al. 2017; Shrikumar et al. 2017; Wang et al. 2020) and perturbation-based methods (Schulz et al. 2020) are commonly used for XAI in classification tasks. However, few works have addressed XAI for few-shot learning. Initial work has explored techniques like attention maps and feature visualization to provide insights into few-shot model predictions for remote sensing tasks (Liu 2022). Recently, a new type of XAI called SCOUTER (Li et al. 2021b) has been proposed, in which a self-attention mechanism is applied to the classifier. This method extracts discriminative attentions for each category during the training phase, making the classification results explainable. Such techniques can provide valuable insight into the decision-making process of few-shot classification models, increasing transparency and accountability, which is particularly important in remote sensing given the high cost of acquiring and processing remote sensing data. In another recent work, Wang et al. (2022a) proposed a new approach to few-shot image classification that uses visual representations from a backbone model and weights generated by an explainable classifier. A minimal set of distinguishable features is incorporated into the weighted representations, and the visualized weights provide an informative hint for the few-shot learning process. Finally, a discriminator compares the representations of each pair of images in the support and query sets, and the pairs yielding the highest scores determine the classification results. Applied to three mainstream datasets, this approach achieved good accuracy and satisfactory explainability.
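As a concrete illustration of the prototype-analysis idea, the following is a minimal sketch of a prototypical-network classifier in the style of Snell et al. (2017); the per-class distances it computes are directly inspectable, offering a simple form of decision-level transparency. Here `embed` stands for any feature-extraction backbone, and all names are illustrative rather than taken from the works cited above.

```python
import tensorflow as tf

def prototype_logits(embed, support_x, support_y, query_x, n_way=5):
    """ProtoNet-style logits: negative squared distances to class prototypes."""
    s = embed(support_x)                                  # (n_support, d)
    q = embed(query_x)                                    # (n_query, d)
    prototypes = tf.stack([
        tf.reduce_mean(tf.boolean_mask(s, support_y == c), axis=0)
        for c in range(n_way)])                           # (n_way, d)
    d2 = tf.reduce_sum(
        (q[:, None, :] - prototypes[None, :, :]) ** 2, axis=-1)
    return -d2  # softmax(-d2) yields class probabilities
```

Because each logit is simply the negative distance to a class prototype, ranking these distances for a query image gives an interpretable account of which support class drove the prediction.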
11 Conclusions and future directions
In this review, we provided a comprehensive analysis of recent few-shot learning techniques for remote sensing across various data types and platforms. Compared to previous reviews (Sun et al. 2021), we expanded the scope to include UAV-based datasets. Our quantitative experiments demonstrated the potential of few-shot methods on various remote sensing datasets. We also emphasized the growing importance of XAI for increasing model transparency and trustworthiness.
While progress has been made, ample opportunities remain to advance few-shot learning for remote sensing. Future research could explore tailored few-shot approaches for UAV data that account for unique image characteristics and onboard computational constraints. Vision transformer architectures could also be investigated for few-shot classification of very high-resolution remote sensing data. A key challenge is reducing the performance discrepancy between aerial and satellite platforms. Developing flexible techniques that handle diverse data effectively is an open problem that warrants further investigation.
On the XAI front, further work is needed to address issues unique to remote sensing like scarce labeled data, complex earth systems, and integrating domain knowledge into models. Techniques tailored for few-shot learning specifically could benefit from more research into explainable feature extraction and decision making. Explainability methods that provide feature-level and decision-level transparency without sacrificing too much accuracy or efficiency are needed. There is also potential to apply few-shot learning and XAI to new remote sensing problems like object detection, semantic segmentation, and anomaly monitoring.
In closing, few-shot learning shows increasing promise for efficient and accurate analysis of remote sensing data at scale. Integrating XAI can further improve model transparency, trust, and adoption by providing human-understandable explanations. While progress has been made, ample challenges and opportunities remain to realize the full potential of few-shot learning and XAI across the diverse and rapidly evolving remote sensing application landscape. Advances in these interconnected fields can pave the way for remote sensing systems that learn quickly from limited data while remaining transparent, accountable, and fair.
References
Agarwal A, Beygelzimer A, Dudík M, Langford J, Wallach H (2018) A reductions approach to fair classification. In: International Conference on Machine Learning, pp 60–69. PMLR
Aleissaee AA, Kumar A, Anwer RM, Khan S, Cholakkal H, Xia G-S et al (2022) Transformers in remote sensing: a survey. arXiv:2209.01206
Al-Haddad LA, Jaber AA (2023) An intelligent fault diagnosis approach for multirotor uavs based on deep neural network of multi-resolution transform features. Drones 7(2):82
Arrieta AB, Díaz-Rodríguez N, Del Ser J, Bennetot A, Tabik S, Barbado A, García S, Gil-López S, Molina D, Benjamins R et al (2020) Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion 58:82–115
Bai J, Huang S, Xiao Z, Li X, Zhu Y, Regan AC, Jiao L (2022) Few-shot hyperspectral image classification based on adaptive subspaces and feature transformation. IEEE Trans Geosci Remote Sens 60:1–17
Bai J, Lu J, Xiao Z, Chen Z, Jiao L (2022) Generative adversarial networks based on transformer encoder and convolution block for hyperspectral image classification. Remote Sensing 14(14):3426
Bayanlou MR, Khoshboresh-Masouleh M (2021) Multi-task learning from fixed-wing uav images for 2d/3d city modeling. arXiv preprint arXiv:2109.00918
Bowman J, Yang L (2021) Few-shot learning for post-disaster structure damage assessment. In: Proceedings of the 4th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery, pp. 27–32
Camps-Valls G, Tuia D, Bruzzone L, Benediktsson JA (2013) Advances in hyperspectral image classification: Earth monitoring with statistical learning methods. IEEE Signal Process Mag 31(1):45–54
Cao Q, Chen Y, Ma C, Yang X (2023) Few-shot rotation-invariant aerial image semantic segmentation. arXiv preprint arXiv:2306.11734
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European Conference on Computer Vision, pp 213–229. Springer
Chattopadhay A, Sarkar A, Howlader P, Balasubramanian VN (2018) Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847. IEEE
Chen Y, Wei C, Wang D, Ji C, Li B (2022) Semi-supervised contrastive learning for few-shot segmentation of remote sensing images. Remote Sensing 14(17):4254
Cheng G, Han J, Lu X (2017) Remote sensing image scene classification: Benchmark and state of the art. Proc IEEE 105(10):1865–1883
Cheng G, Yan B, Shi P, Li K, Yao X, Guo L, Han J (2021) Prototype-cnn for few-shot object detection in remote sensing images. IEEE Trans Geosci Remote Sens 60:1–10
Cheng G, Lang C, Wu M, Xie X, Yao X, Han J (2021) Feature enhancement network for object detection in optical remote sensing images. Journal of Remote Sensing 2021
Cheng H, Zhou JT, Tay WP, Wen B (2022) Attentive graph neural networks for few-shot learning. In: 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 152–157. IEEE
Church KW (2017) Word2vec. Nat Lang Eng 23(1):155–162
Contest DF (2013) IEEE GRSS Data Fusion Contest Fusion of Hyperspectral and LiDAR Data
Dam T (2022) Developing generative adversarial networks for classification and clustering: Overcoming class imbalance and catastrophic forgetting. PhD thesis, UNSW Sydney
Dam T, Anavatti SG, Abbass HA (2020) Mixture of spectral generative adversarial networks for imbalanced hyperspectral image classification. IEEE Geosci Remote Sens Lett 19:1–5
Deng P, Xu K, Huang H (2021) When CNNS meet vision transformer: a joint framework for remote sensing scene classification. IEEE Geosci Remote Sens Lett 19:1–5
Ding C, Li Y, Wen Y, Zheng M, Zhang L, Wei W, Zhang Y (2021) Boosting few-shot hyperspectral image classification using pseudo-label learning. Remote Sensing 13(17):3539
Finn C, Abbeel P, Levine S (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, pp. 1126–1135. PMLR
Fu K, Zhang T, Zhang Y, Wang Z, Sun X (2021) Few-shot sar target classification via metalearning. IEEE Trans Geosci Remote Sens 60:1–14
G. de Inteligencia Computacional (GIC) (2020) Hyperspectral remote sensing scenes
Gao Y, Hou R, Gao Q, Hou Y (2021) A fast and accurate few-shot detector for objects with fewer pixels in drone image. Electronics 10(7):783
Gao F, Xu J, Lang R, Wang J, Hussain A, Zhou H (2022) A few-shot learning method for sar images based on weighted distance and feature fusion. Remote Sensing 14(18):4583
Geng C, Huang S-J, Chen S (2020) Recent advances in open set recognition: A survey. IEEE Trans Pattern Anal Mach Intell 43(10):3614–3631
Gerke M (2014) Use of the stair vision library within the isprs 2d semantic labeling benchmark (vaihingen)
Gevaert CM (2022) Explainable ai for earth observation: A review including societal and regulatory perspectives. Int J Appl Earth Obs Geoinf 112:102869
Ghali R, Akhloufi MA (2023) Deep learning approaches for wildland fires remote sensing: Classification, detection, and segmentation. Remote Sensing 15(7):1821
Hadsell R, Chopra S, LeCun Y (2006) Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol 2, pp 1735–1742. IEEE
Ham TJ, Lee Y, Seo SH, Kim S, Choi H, Jung SJ, Lee JW (2021) Elsa: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks. In: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 692–705. IEEE
Hammell R (2019) Data retrieved from kaggle. Accessed: Feb 1
Hamzaoui M, Chapel L, Pham M-T, Lefèvre S (2022) A hierarchical prototypical network for few-shot remote sensing scene classification. In: International Conference on Pattern Recognition and Artificial Intelligence, pp. 208–220. Springer
Han J, Zhang D, Cheng G, Liu N, Xu D (2018) Advanced deep-learning techniques for salient and category-specific object detection: a survey. IEEE Signal Process Mag 35(1):84–100
Han S, Mao H, Dally WJ (2015) Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149
He J, Zhao L, Yang H, Zhang M, Li W (2019) Hsi-bert: hyperspectral image classification using the bidirectional encoder representation from transformers. IEEE Trans Geosci Remote Sens 58(1):165–178
Helber P, Bischke B, Dengel A, Borth D (2019) Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12(7):2217–2226
Hoffer E, Ailon N (2015) Deep metric learning using triplet network. In: International Workshop on Similarity-based Pattern Recognition, pp 84–92. Springer
Hong J, Fang P, Li W, Zhang T, Simon C, Harandi M, Petersson L (2021) Reinforced attention for few-shot learning and beyond. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 913–923
Huang L, Liu B, Li B, Guo W, Yu W, Zhang Z, Yu W (2017) Opensarship: A dataset dedicated to sentinel-1 ship interpretation. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11(1):195–208
Huang W, Yuan Z, Yang A, Tang C, Luo X (2021) Tae-net: task-adaptive embedding network for few-shot remote sensing scene classification. Remote Sensing 14(1):111
Huang Z, Tang H, Li Y, Xie W (2023) Hfc-sst: improved spatial-spectral transformer for hyperspectral few-shot classification. J Appl Remote Sens 17(2):026509–026509
Huang K, Deng X, Geng J, Jiang W (2021) Self-attention and mutual-attention for few-shot hyperspectral image classification. In: 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, pp. 2230–2233. IEEE
Hu Y, Huang Y, Wei G, Zhu K (2022) Heterogeneous few-shot learning with knowledge distillation for hyperspectral image classification. In: 2022 2nd International Conference on Consumer Electronics and Computer Engineering (ICCECE), pp. 601–604. IEEE
Hu B, Tunison P, RichardWebster B, Hoogs A (2023) Xaitk-saliency: An open source explainable ai toolkit for saliency. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 15760–15766
Hu X, Zhong Y, Luo C, Wang X (2020) Whu-hi: Uav-borne hyperspectral with high spatial resolution (h2) benchmark datasets for hyperspectral image classification. arXiv preprint arXiv:2012.13920
Ishikawa S-N, Todo M, Taki M, Uchiyama Y, Matsunaga K, Lin P, Ogihara T, Yasui M (2023) Example-based explainable ai and its application for remote sensing image classification. Int J Appl Earth Obs Geoinf 118:103215
Jetley S, Lord NA, Lee N, Torr PH (2018) Learn to pay attention. arXiv:1804.02391
Jeune PL, Mokraoui A (2023) Rethinking intersection over union for small object detection in few-shot regime. arXiv preprint arXiv:2307.09562
Jiang N, Shi H, Geng J (2022) Multi-scale graph-based feature fusion for few-shot remote sensing image scene classification. Remote Sensing 14(21):5550
Jian Y, Torresani L (2022) Label hallucination for few-shot classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 7005–7014
Kakogeorgiou I, Karantzalos K (2021) Evaluating explainable artificial intelligence methods for multi-label deep learning classification tasks in remote sensing. Int J Appl Earth Obs Geoinf 103:102520
Khoshboresh-Masouleh M, Shah-Hosseini R (2023) Multimodal few-shot target detection based on uncertainty analysis in time-series images. Drones 7(2):66
Kim J, Chi M (2021) Saffnet: Self-attention-based feature fusion network for remote sensing few-shot scene classification. Remote Sensing 13(13):2532
Koukouraki E, Vanneschi L, Painho M (2021) Few-shot learning for post-earthquake urban damage detection. Remote Sensing 14(1):40
Kyrkou C, Theocharides T (2020) Emergencynet: efficient aerial image classification for drone-based emergency monitoring using atrous convolutional feature fusion. IEEE J Selected Top Appl Earth Observ Remote Sens 13:1687–1699
Lang C, Cheng G, Tu B, Han J (2023) Global rectification and decoupled registration for few-shot segmentation in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing
Lang C, Wang J, Cheng G, Tu B, Han J (2023) Progressive parsing and commonality distillation for few-shot remote sensing segmentation. IEEE Transactions on Geoscience and Remote Sensing
Le Jeune P, Mokraoui A (2022) Improving few-shot object detection through a performance analysis on aerial and natural images. In: 2022 30th European Signal Processing Conference (EUSIPCO), pp. 513–517. IEEE
Lee GY, Dam T, Ferdaus MM, Poenar DP, Duong VN (2023) Watt-effnet: A lightweight and accurate model for classifying aerial disaster images. IEEE Geoscience and Remote Sensing Letters
Letham B, Rudin C, McCormick TH, Madigan D (2015) Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model
Li A, Lu Z, Wang L, Xiang T, Wen J-R (2017) Zero-shot scene classification for high spatial resolution remote sensing images. IEEE Trans Geosci Remote Sens 55(7):4157–4167
Li L, Han J, Yao X, Cheng G, Guo L (2020) Dla-matchnet for few-shot remote sensing image scene classification. IEEE Trans Geosci Remote Sens 59(9):7844–7853
Li K, Wan G, Cheng G, Meng L, Han J (2020) Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J Photogramm Remote Sens 159:296–307
Li J, Tian Y, Xu Y, Hu X, Zhang Z, Wang H, Xiao Y (2022) Mm-rcnn: Toward few-shot object detection in remote sensing images with meta memory. IEEE Trans Geosci Remote Sens 60:1–14
Li L, Yao X, Cheng G, Han J (2022) Aifs-dataset for few-shot aerial image scene classification. IEEE Trans Geosci Remote Sens 60:1–11
Li X, Deng J, Fang Y (2021) Few-shot object detection on remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 60:1–14
Li R, Li J, Gou S, Lu H, Mao S, Guo Z (2023) Multi-scale similarity guidance few-shot network for ship segmentation in sar images
Ling J, Templeton J (2015) Evaluation of machine learning algorithms for prediction of regions of high reynolds averaged navier stokes uncertainty. Physics of Fluids, 27(8)
Liu J (2022) Few-shot object detection model based on meta-learning for uav. In: Fifth International Conference on Mechatronics and Computer Technology Engineering (MCTE 2022), vol. 12500, pp. 1468–1474. SPIE
Liu X, Yang T, Li J (2018) Real-time ground vehicle detection in aerial infrared imagery based on convolutional neural network. Electronics 7(6):78
Liu S, Shi Q, Zhang L (2020) Few-shot hyperspectral image classification with unknown classes using multitask deep learning. IEEE Trans Geosci Remote Sens 59(6):5085–5102
Liu L, Zuo D, Wang Y, Qu H (2022) Feedback-enhanced few-shot transformer learning for small-sized hyperspectral image classification. IEEE Geosci Remote Sens Lett 19:1–5
Liu N, Xu X, Celik T, Gan Z, Li H-C (2023) Transformation-invariant network for few-shot object detection in remote sensing images. arXiv preprint arXiv:2303.06817
Liu S, Zhang L, Hao S, Lu H, He Y (2021) Polar ray: A single-stage angle-free detector for oriented object detection in aerial images. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 3124–3132
Li L, Wang B, Verma M, Nakashima Y, Kawasaki R, Nagahara H (2021) Scouter: Slot attention-based classifier for explainable image recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1046–1055
Lu X, Zhang Y, Yuan Y, Feng Y (2019) Gated and axis-concentrated localization network for remote sensing object detection. IEEE Trans Geosci Remote Sens 58(1):179–192
Lu X, Sun X, Diao W, Mao Y, Li J, Zhang Y, Wang P, Fu K (2023) Few-shot object detection in aerial imagery guided by text-modal knowledge. IEEE Trans Geosci Remote Sens 61:1–19
Masouleh MK, Shah-Hosseini R (2019) Development and evaluation of a deep learning model for real-time ground vehicle semantic segmentation from UAV-based thermal infrared imagery. ISPRS J Photogramm Remote Sens 155:172–186
Mohan A, Peeples J (2023) Quantitative analysis of primary attribution explainable artificial intelligence methods for remote sensing image classification. arXiv preprint arXiv:2306.04037
Moura LVd, Mattjie C, Dartora CM, Barros RC, Marques da Silva AM (2022) Explainable machine learning for covid-19 pneumonia classification with texture-based features extraction in chest radiography. Front Digital Health 3:662343
Mundhenk TN, Konjevod G, Sakla WA, Boakye K (2016) A large contextual dataset for classification, detection and counting of cars with deep learning. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pp. 785–800. Springer
Nguyen T, Miller ID, Cohen A, Thakur D, Guru A, Prasad S, Taylor CJ, Chaudhari P, Kumar V (2021) Pennsyn2real: Training object recognition models without human labeling. IEEE Robotics and Automation Letters 6(3):5032–5039
Oreshkin B, Rodríguez López P, Lacoste A (2018) Tadam: Task dependent adaptive metric for improved few-shot learning. Advances in neural information processing systems, 31
Pal D, Bundele V, Sharma R, Banerjee B, Jeppu Y (2022) Few-shot open-set recognition of hyperspectral images with outlier calibration network. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3801–3810
Park J, Cho YK, Kim S (2022) Deep learning-based uav image segmentation and inpainting for generating vehicle-free orthomosaic. Int J Appl Earth Obs Geoinf 115:103111
Peng Y, Liu Y, Tu B, Zhang Y (2023) Convolutional transformer-based few-shot learning for cross-domain hyperspectral image classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 16:1335–1349
Petsiuk V, Jain R, Manjunatha V, Morariu VI, Mehra A, Ordonez V, Saenko K (2021) Black-box explanation of object detectors via saliency maps. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11443–11452
Pham H, Guan M, Zoph B, Le Q, Dean J (2018) Efficient neural architecture search via parameters sharing. In: International Conference on Machine Learning, pp. 4095–4104. PMLR
Pintelas E, Livieris IE, Pintelas P (2023) Explainable feature extraction and prediction framework for 3d image recognition applied to pneumonia detection. Electronics 12(12):2663
Puthumanaillam G, Verma U (2023) Texture based prototypical network for few-shot semantic segmentation of forest cover: Generalizing for different geographical regions. Neurocomputing 538:126201
Qu Y, Baghbaderani RK, Qi H (2019) Few-shot hyperspectral image classification through multitask transfer learning. In: 2019 10th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), pp. 1–5. IEEE
Ran Q, Zhou Y, Hong D, Bi M, Ni L, Li X, Ahmad M (2023) Deep transformer and few-shot learning for hyperspectral image classification. CAAI Transactions on Intelligence Technology
Ribeiro MT, Singh S, Guestrin C (2016) “why should i trust you?” explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144
Rostami M, Kolouri S, Eaton E, Kim K (2019) Sar image classification using few-shot cross-domain transfer learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0
Rudin C (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 1(5):206–215
Schmitt M, Wu Y-L (2021) Remote sensing image classification with the sen12ms dataset. arXiv preprint arXiv:2104.00704
Schulz K, Sixt L, Tombari F, Landgraf T (2020) Restricting the flow: Information bottlenecks for attribution. arXiv preprint arXiv:2001.00396
Schwegmann CP, Kleynhans W, Salmon BP, Mdakane LW, Meyer RG (2016) Very deep learning for ship discrimination in synthetic aperture radar imagery. In: 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 104–107. IEEE
Schwegmann C, Kleynhans W, Salmon B, Mdakane L, Meyer R (2017) A sar ship dataset for detection, discrimination and analysis. IEEE Dataport
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626
Selvaraju RR, Das A, Vedantam R, Cogswell M, Parikh D, Batra D (2016) Grad-cam: Why did you say that? arXiv:1611.07450
Shaikh TA, Rasool T, Lone FR (2022) Towards leveraging the role of machine learning and artificial intelligence in precision agriculture and smart farming. Comput Electron Agric 198:107119
Shrikumar A, Greenside P, Kundaje A (2017) Learning important features through propagating activation differences. In: International Conference on Machine Learning, pp. 3145–3153. PMLR
Sirmacek B, Vinuesa R (2022) Remote sensing and ai for building climate adaptation applications. Results in Engineering 15:100524
Snell J, Swersky K, Zemel R (2017) Prototypical networks for few-shot learning. Advances in neural information processing systems 30
Song K, Zhang Y, Bao Y, Zhao Y, Yan Y (2023) Self-enhanced mixed attention network for three-modal images few-shot semantic segmentation. Sensors 23(14):6612
Su B, Zhang H, Wu Z, Zhou Z (2022) Fsrdd: An efficient few-shot detector for rare city road damage detection. IEEE Trans Intell Transp Syst 23(12):24379–24388
Sumbul G, Charfuelan M, Demir B, Markl V (2019) Bigearthnet: A large-scale benchmark archive for remote sensing image understanding. In: IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, pp. 5901–5904. IEEE
Sun X, Wang B, Wang Z, Li H, Li H, Fu K (2021) Research progress on few-shot learning for remote sensing image interpretation. IEEE J Selected Top Appl Earth Observ Remote Sens 14:2387–2402
Sung F, Yang Y, Zhang L, Xiang T, Torr PH, Hospedales TM (2018) Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208
Sun Q, Liu Y, Chua T-S, Schiele B (2019) Meta-transfer learning for few-shot learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 403–412
Su H, Wei S, Yan M, Wang C, Shi J, Zhang X (2019) Object detection and instance segmentation in remote sensing imagery based on precise mask r-cnn. In: IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, pp. 1454–1457. IEEE
Su H, You Y, Meng G (2022) Multi-scale context-aware r-cnn for few-shot object detection in remote sensing images. In: IGARSS 2022-2022 IEEE International Geoscience and Remote Sensing Symposium, pp. 1908–1911. IEEE
Tai Y, Tan Y, Xiong S, Sun Z, Tian J (2022) Few-shot transfer learning for sar image classification without extra sar samples. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 15:2240–2253
Temenos A, Tzortzis IN, Kaselimi M, Rallis I, Doulamis A, Doulamis N (2022) Novel insights in spatial epidemiology utilizing explainable ai (xai) and remote sensing. Remote Sensing 14(13):3074
Tong X, Yin J, Han B, Qv H (2020) Few-shot learning with attention-weighted graph convolutional networks for hyperspectral image classification. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 1686–1690. IEEE
Tuia D, Volpi M, Copa L, Kanevski M, Munoz-Mari J (2011) A survey of active learning algorithms for supervised remote sensing image classification. IEEE Journal of Selected Topics in Signal Processing 5(3):606–617
Vinyals O, Blundell C, Lillicrap T, Wierstra D et al (2016) Matching networks for one shot learning. Advances in neural information processing systems 29
Wang Q, Liu S, Chanussot J, Li X (2018) Scene classification with recurrent attention of vhr remote sensing images. IEEE Trans Geosci Remote Sens 57(2):1155–1167
Wang B, Wang Z, Sun X, Wang H, Fu K (2021) Dmml-net: Deep metametric learning for few-shot geographic object segmentation in remote sensing imagery. IEEE Trans Geosci Remote Sens 60:1–18
Wang Y, Liu M, Yang Y, Li Z, Du Q, Chen Y, Li F, Yang H (2021) Heterogeneous few-shot learning for hyperspectral image classification. IEEE Geosci Remote Sens Lett 19:1–5
Wang K, Wang X, Cheng Y (2022) Few-shot aerial image classification with deep economic network and teacher knowledge. Int J Remote Sens 43(13):5075–5099
Wang K, Qiao Q, Zhang G, Xu Y (2022) Few-shot sar target recognition based on deep kernel learning. IEEE Access 10:89534–89544
Wang C, Huang Y, Liu X, Pei J, Zhang Y, Yang J (2022) Global in local: A convolutional transformer for sar atr fsl. IEEE Geosci Remote Sens Lett 19:1–5
Wang B, Ma G, Sui H, Zhang Y, Zhang H, Zhou Y (2023) Few-shot object detection in remote sensing imagery via fuse context dependencies and global features. Remote Sensing 15(14):3462
Wang Y, Chao W-L, Weinberger KQ, Van Der Maaten L (2019) Simpleshot: Revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623
Wang H, Chen S, Xu F, Jin Y-Q (2015) Application of deep-learning algorithms to mstar data. In: 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 3743–3745. IEEE
Wang Z, Jiang Z, Yuan Y (2022) Queue learning for multi-class few-shot semantic segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1721–1725. IEEE
Wang B, Li L, Verma M, Nakashima Y, Kawasaki R, Nagahara H (2022) Match them up: visually explainable few-shot image classification. Appl Intell. pp 1–22
Wang H, Wang Z, Du M, Yang F, Zhang Z, Ding S, Mardziel P, Hu X (2020) Score-cam: Score-weighted visual explanations for convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 24–25
Wang R, Wang Q, Yu J, Tong J (2022) Multi-scale self-attention-based few-shot object detection for remote sensing images. In: 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), pp. 1–7. IEEE
Wang D, Yang Q, Abdul A, Lim BY (2019) Designing theory-driven user-centric explainable ai. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–15
Wei S, Zeng X, Qu Q, Wang M, Su H, Shi J (2020) Hrsid: A high-resolution sar images dataset for ship detection and instance segmentation. IEEE Access 8:120234–120254
Wolf S, Meier J, Sommer L, Beyerer J (2021) Double head predictor based few-shot object detection for aerial imagery. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 721–731
Xia G-S, Yang W, Delon J, Gousseau Y, Sun H, Maître H (2010) Structural high-resolution satellite image indexing. In: ISPRS TC VII Symposium-100 Years ISPRS, vol. 38, pp. 298–303
Xia G-S, Hu J, Hu F, Shi B, Bai X, Zhong Y, Zhang L, Lu X (2017) Aid: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans Geosci Remote Sens 55(7):3965–3981
Xiang T, Xia G, Zhang L (2018) Mini-UAV-based remote sensing: Techniques, applications and prospectives. arXiv:1812.07770
Xiao Z, Qi J, Xue W, Zhong P (2021) Few-shot object detection with self-adaptive attention network for remote sensing images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14:4854–4865
Xu Z, Zhang W, Zhang T, Yang Z, Li J (2021) Efficient transformer for remote sensing image segmentation. Remote Sens 13(18):3585
Yang P, Tong L, Qian B, Gao Z, Yu J, Xiao C (2020) Hyperspectral image classification with spectral and spatial graph using inductive representation learning network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14:791–800
Yang M, Bai X, Wang L, Zhou F (2021) Mixed loss graph attention network for few-shot sar target classification. IEEE Trans Geosci Remote Sens 60:1–13
Yang Y, Newsam S (2010) Bag-of-visual-words and spatial extensions for land-use classification. In: Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 270–279
Yang R, Xu X, Li X, Wang L, Pu F (2020) Learning relation by graph neural network for sar image few-shot learning. In: IGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium, pp. 1743–1746. IEEE
Yao H, Qin R, Chen X (2019) Unmanned aerial vehicle for remote sensing applications-a review. Remote Sensing 11(12):1443
Yao X, Cao Q, Feng X, Cheng G, Han J (2021) Scale-aware detailed matching for few-shot aerial image semantic segmentation. IEEE Trans Geosci Remote Sens 60:1–11
Yokoya N, Iwasaki A (2016) Airborne hyperspectral data over chikusei. Space Appl. Lab., Univ. Tokyo, Tokyo, Japan, Tech. Rep. SAL-2016-05-27, 5
Yuan Z, Huang W, Li L, Luo X (2020) Few-shot scene classification with multi-attention deepemd network in remote sensing. IEEE Access 9:19891–19901
Yuan Z, Huang W, Tang C, Yang A, Luo X (2022) Graph-based embedding smoothing network for few-shot scene classification of remote sensing images. Remote Sensing 14(5):1161
Yuan H, Tang J, Hu X, Ji S (2020) Xgnn: Towards model-level explanations of graph neural networks. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp 430–438
Zemel R, Wu Y, Swersky K, Pitassi T, Dwork C (2013) Learning fair representations. In: International Conference on Machine Learning, pp 325–333. PMLR
Zeng Q, Geng J (2022) Task-specific contrastive learning for few-shot remote sensing image scene classification. ISPRS J Photogramm Remote Sens 191:143–154
Zhang Y, Yuan Y, Feng Y, Lu X (2019) Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection. IEEE Trans Geosci Remote Sens 57(8):5535–5548
Zhang P, Bai Y, Wang D, Bai B, Li Y (2020) Few-shot classification of aerial scene images via meta-learning. Remote Sensing 13(1):108
Zhang J, Zhao H, Li J (2021) Trs: transformers for remote sensing scene classification. Remote Sens 13(20):4143
Zhang H, Zhang X, Meng G, Guo C, Jiang Z (2022) Few-shot multi-class ship detection in remote sensing images using attention feature map and multi-relation detector. Remote Sensing 14(12):2790
Zhang S, Song F, Liu X, Hao X, Liu Y, Lei T, Jiang P (2023) Text semantic fusion relation graph reasoning for few-shot object detection on remote sensing images. Remote Sensing 15(5):1187
Zhao X, Lv X, Cai J, Guo J, Zhang Y, Qiu X, Wu Y (2022) Few-shot sar-atr based on instance-aware transformer. Remote Sensing 14(8):1884
Zhao Y, Ha L, Wang H, Ma X (2022) Few-shot class incremental learning for hyperspectral image classification based on constantly updated classifier. In: IGARSS 2022-2022 IEEE International Geoscience and Remote Sensing Symposium, pp. 1376–1379. IEEE
Zhong Z, Li Y, Ma L, Li J, Zheng W-S (2021) Spectral-spatial transformer network for hyperspectral image classification: a factorized architecture search framework. IEEE Trans Geosci Remote Sens 60:1–15
Zhu XX, Tuia D, Mou L, Xia G-S, Zhang L, Xu F, Fraundorfer F (2017) Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4):8–36
Zhu H, Chen X, Dai W, Fu K, Ye Q, Jiao J (2015) Orientation robust object detection in aerial images using deep convolutional neural network. In: 2015 IEEE International Conference on Image Processing (ICIP), pp. 3735–3739. IEEE
Zou Q, Ni L, Zhang T, Wang Q (2015) Deep learning based feature selection for remote sensing scene classification. IEEE Geosci Remote Sens Lett 12(11):2321–2325
Acknowledgements
This research/project is supported by the Civil Aviation Authority of Singapore and Nanyang Technological University, Singapore under their collaboration in the Air Traffic Management Research Institute. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Civil Aviation Authority of Singapore.
Author information
Contributions
Corresponding author: Gao Yu Lee. Contributing authors: Tanmoy Dam, Md Meftahul Ferdaus, Daniel Puiu Poenar, and Vu N. Duong. Gao Yu Lee, Tanmoy Dam, and Md Meftahul Ferdaus contributed equally to this work. All authors have reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.