Abstract
Continual learning strives to ensure stability in solving previously seen tasks while demonstrating plasticity in a novel domain. Recent advances in continual learning are mostly confined to a supervised learning setting, especially in the NLP domain. In this work, we consider a few-shot continual active learning setting, where labeled data are inadequate and unlabeled data are abundant but subject to a limited annotation budget. We exploit meta-learning and propose a method called Meta-Continual Active Learning. This method sequentially queries the most informative examples from a pool of unlabeled data for annotation to enhance task-specific performance, and tackles continual learning problems through a meta-objective. Specifically, we employ meta-learning and experience replay to address inter-task confusion and catastrophic forgetting. We further incorporate textual augmentations to avoid the memory over-fitting caused by experience replay and sample queries, thereby ensuring generalization. We conduct extensive experiments on benchmark text classification datasets from diverse domains to validate the feasibility and effectiveness of meta-continual active learning. We also analyze the impact of different active learning strategies on various meta-continual learning models. The experimental results demonstrate that introducing randomness into sample selection is the best default strategy for maintaining generalization in the meta-continual learning framework.
1 Introduction
Continual Learning (CL) aims to address the stability-plasticity dilemma in the context of sequential learning. Stability refers to the ability to alleviate the decline in model performance on previously learned tasks, in other words, to prevent catastrophic forgetting (McCloskey and Cohen 1989). Notably, inter-task confusion (Huang et al. 2023), a phenomenon in which the model confuses classes from different tasks, is one of the major causes of catastrophic forgetting. Plasticity refers to the ability to adapt to a new task. However, the majority of existing CL methods presume that labeled data are available in abundance and adequate for learning every task. Their performance relies heavily on a large quantity of high-quality labeled data, which is impractical in many real-world settings.
In real-world scenarios, labeled data are scarce while unlabeled data are abundant. The cost of annotation, particularly in the field of Natural Language Processing (NLP), tends to be prohibitively high. To narrow this gap, we address the problem in a setting that more closely aligns with real-world scenarios, namely few-shot Continual Active Learning (CAL) (Ayub and Fendley 2022). In this setting, only a small subset of labeled data is provided for each task, together with a limited annotation budget. The model should therefore sequentially select the most worthwhile examples from a pool of unlabeled data and request their labels, so as to enhance performance while simultaneously solving the continual learning problem. Introducing active learning into this scheme is challenging because active learning techniques are typically designed to query from a static data distribution; they may fail to dynamically capture the samples most relevant to preventing catastrophic forgetting.
Replay-based methods have been shown to be particularly effective for NLP tasks (Wang et al. 2022). These methods typically retain a certain amount of past samples to prevent catastrophic forgetting; consequently, they are prone to memory over-fitting, especially when labeled examples are scarce. Active learning can further escalate the problem by biasing sample selection towards the memory set. Hence, integrating active learning techniques into a CL model can be quite challenging.
Given the success of meta-learning in addressing low plasticity, several studies (Riemer et al. 2019; Gupta et al. 2020) extend meta-learning to the CL setting. In this work, we exploit the advantages of meta-learning and use the Model-Agnostic Meta-Learning (MAML) (Finn et al. 2017) framework to solve few-shot CAL problems; we call the resulting method Meta-Continual Active Learning. By applying active learning for task-specific tuning and casting the meta-objective as experience replay, the learning objective is deliberately formulated to learn an optimal or suboptimal initial model state that can rapidly adapt to a balanced subset of all encountered tasks. This method thereby enables fast adaptation while preventing catastrophic forgetting in few-shot learning. In addition, we apply consistency regularization via textual augmentations to address the memory over-fitting problems that are inherent in replay-based methods and exacerbated by active learning acquisition.
We conduct extensive experiments on benchmark datasets from Zhang et al. (2015), popularized by de Masson d’Autume et al. (2019) in lifelong language learning. This collection includes five text classification datasets from four diverse domains. We demonstrate the effectiveness of the proposed framework in a 5-shot CAL setup. This paper also examines how various active learning approaches impact the performance of meta-continual learning models.
The main contributions of this paper are fourfold:
- Leveraging the strengths of meta-learning, we introduce an optimization-based method, namely Meta-Continual Active Learning (Meta-CAL). This method reformulates the meta-objective so that it learns an optimal or suboptimal initial model state that can effectively adapt to all seen tasks, thereby providing a solution to inter-task confusion and catastrophic forgetting even with limited labeled samples.

- We integrate active learning into the proposed framework to enhance task-specific tuning. This allows the model to dynamically and selectively query the most informative samples from a pool of unlabeled data, thereby improving performance in a resource-constrained scenario.

- To address the memory over-fitting inevitably caused by experience replay and active learning, we apply consistency regularization to meta examples via data augmentations. This further ensures intra- and inter-task generalization.

- In the experiments, the proposed method achieves an accuracy of more than 62% while utilizing only 1.6% of the past samples and keeping the annotation budget as low as 500 samples per task. The results demonstrate the feasibility and effectiveness of meta-continual active learning. Furthermore, we observe that random sampling facilitates generalization in meta-continual learning, thereby addressing the stability-plasticity dilemma.
2 Related work
2.1 Continual learning
Existing approaches can be categorised into three main streams, i.e., regularization-based methods (Kirkpatrick et al. 2017; Li and Hoiem 2018; Lin et al. 2024), replay-based methods (de Masson d’Autume et al. 2019; Ho et al. 2023) and architecture-based methods (Adel et al. 2020; Yoon et al. 2018; Wang et al. 2023).
Regularization-based methods impose a penalty or regularization term on the loss function, typically penalising changes to non-trivial parameters or constraining variations in the gradients learned from previous tasks. However, most methods for NLP show a preference for replay-based approaches (Wang et al. 2022), which avoid the unexpected outputs that can result from tuning the parameters of deep neural networks (Wang et al. 2019).
Replay-based methods, also known as rehearsal-based or memory-based methods, involve revisiting a small amount of past samples (i.e., experience replay) or generating pseudo past samples while adapting to a new domain. Popular retrieval schemes for experience replay are random sampling (Chaudhry et al. 2019; Riemer et al. 2019; de Masson d’Autume et al. 2019; Holla et al. 2020), K-Means (Wang et al. 2019; Han et al. 2020) and Mean-of-Feature (Qin and Joty 2022; Chen et al. 2023).
Architecture-based methods dynamically change the model architecture to learn a new task. In general, these methods preserve or partially preserve past fine-tuned parameters and introduce task-specific parameters for the new domain. However, it is challenging to manage the scale of the model, since the parameter count continuously grows with the number of seen tasks.
2.2 Meta-continual learning
In recent years, meta-learning has emerged as an effective learning framework for CL. In particular, the bi-level optimization of meta-learning enables fast adaptation to training data while ensuring generalization across all observed samples. In meta-continual learning, meta-learning is often combined with memory replay. Meta-MbPA (Wang et al. 2020) uses MAML to augment episodic memory replay via local adaptation. OML-ER (Holla et al. 2020) and ANML-ER (Holla et al. 2020) utilise an online meta-learning model (OML) (Javed and White 2019) and a neuromodulated meta-learning model (ANML) (Beaulieu et al. 2020), respectively, for effective knowledge transfer through fast adaptation and sparse experience replay. PMR (Ho et al. 2023) also employs MAML to facilitate episodic memory replay but uses a prototypical memory sample selection approach. MER (Riemer et al. 2019) regularizes the objective of experience replay through gradient alignment between old and new tasks via a modified Reptile (Nichol et al. 2018). C-MAML (Gupta et al. 2020) utilizes OML to regulate CL objectives, and La-MAML (Gupta et al. 2020) optimizes the OML objective through the modulation of per-parameter learning rates. Meta-CL (Wu et al. 2024) further improves C-MAML and La-MAML by introducing a penalty to restrain unnecessary model updates and preserve non-trivial weights for knowledge consolidation. SB-MCL (Lee et al. 2024) enhances the advantages of meta-learning by integrating sequential Bayesian updates to bridge statistical models with meta-learned neural networks. To date, few models use gradient alignment for lifelong language learning, and most of these models are employed in a supervised setting. In this work, we focus on aligning meta-learning with CL objectives to provide a solution in a realistic and resource-constrained scenario for NLP.
2.3 Continual active learning
Continual active learning aims to sequentially label informative data to maximise model performance while solving continual learning problems. It defines a continual learning problem in which the available labeled data are insufficient and the annotation budgets are limited. CASA (Perkonigg et al. 2021) detects new pseudo-domains and selects data from them for annotation, while revisiting labeled samples to address catastrophic forgetting. Ayub and Fendley (2022) propose a method for few-shot CAL (FoCAL) that uses a Gaussian mixture model (GMM) for active learning and pseudo-rehearsal for CL, bypassing the need to store real past data. However, neither of these methods addresses continual active learning in NLP. CAL-SD (Das et al. 2023) tackles NLP tasks and uses model distillation to augment memory replay with diversity and uncertainty AL strategies. To date, continual active learning remains understudied, especially in NLP.
3 Preliminaries
In this work, we focus on the task-free class-incremental learning scenario, where the training data stream passes only once without the presence of “task boundaries” (van de Ven et al. 2021). Based on the task-free setting, we formulate the problem of few-shot continual active learning as follows. Assume that the training stream consists of T tasks, \(\{\mathcal {T}_1,\mathcal {T}_2,...,\mathcal {T}_t,...,\mathcal {T}_T\}\). Each task \(\mathcal {T}_{t}\) contains an \(N_t\)-way K-shot set \(\mathcal {D}^{label}_t = \{(x_{i},y_{i})\}_{i=1}^{N_t \times K}\) and a pool of unlabeled data \(\mathcal {D}^{pool}_t = \{(u_{i})\}_{i=1}^{|\mathcal {D}^{pool}_t|}\).
Label space: Based on the label space \(\mathcal {Y}\) of tasks, typical continual learning scenarios comprise domain-incremental learning, where \(\mathcal {Y}_t = \mathcal {Y}_{t'}, \forall t \ne t'\), and class-incremental learning, where \(\mathcal {Y}_t \cap \mathcal {Y}_{t'} = \emptyset , \forall t \ne t'\). Due to the unforeseen nature of sequential learning, the label space \(\mathcal {Y}_t\) of a task may or may not be disjoint from those of the other tasks. Hence, we allow both \(\mathcal {Y}_t \cap \mathcal {Y}_{t'} \ne \emptyset\) and \(\mathcal {Y}_t \cap \mathcal {Y}_{t'} = \emptyset\) to occur for \(t \ne t'\).
Annotation constraint: We consider each task \(\mathcal {T}_{t}\) to have an equal annotation budget \(B_{A}\). An acquisition function \(a(\cdot )\) dynamically queries informative data points for annotation until the acquisition process exhausts the annotation budget. We denote the samples selected for annotation as \(a(\mathcal {D}^{pool}_t, B_{A})\) and the newly annotated sample set as \(\mathcal {D}^{new}_t\).
Memory constraint: Replay-based models revisit a small subset of labeled data \(\mathcal {D}_{1:t-1}\) to regularize the model f while learning \(\mathcal {T}_t\). We limit the amount of past training data saved in the memory buffer \(\mathcal {M}\), which should not exceed the memory budget threshold \(B_{\mathcal {M}}\).
Objectives: Assume we have a learner \(f_{\theta }\) and current task \(\mathcal {T}_t\). The learning objectives are: (a) perform adaptation on \(\mathcal {D}_t = \mathcal {D}^{label}_t \bigcup \mathcal {D}^{new}_t\), i.e., plasticity:

$$\begin{aligned} \tilde{\theta }_{t} = \mathop {\mathrm {arg\,min}}\limits _{\theta _{t}} \mathcal {L}_{\mathcal {T}_t}\big (f_{\theta _{t}}, \mathcal {D}_t\big ), \end{aligned}$$

(1)

where \(\mathcal {L}_{\mathcal {T}_t}\) is the task loss, \(\theta _{t}\) is the initial state and \(\theta _{t} = \tilde{\theta }_{t-1}\); (b) prevent inter-task confusion and catastrophic forgetting of prior tasks, i.e., stability:

$$\begin{aligned} \min _{\tilde{\theta }_{t}} \sum _{k=1}^{t-1} \mathcal {L}_{\mathcal {T}_k}\big (f_{\tilde{\theta }_{t}}, \mathcal {D}_k\big ). \end{aligned}$$

(2)
3.1 Active learning strategies
In this work, we consider four popular active learning (AL) methods as follows.
Uncertainty: This method samples data \(\varvec{x}\) with high uncertainty, measured from the model outputs \(\hat{y}\). The least-confidence method (Culotta and McCallum 2005) evaluates uncertainty by the confidence in the prediction, where the lowest posterior probability indicates the greatest uncertainty: \(\alpha _{LC}(\varvec{x}, n_a)= -Pr(\hat{y}|\varvec{x})\). The margin-confidence method (Netzer et al. 2011) considers the confidence margin between the two most likely predictions \((\hat{y}_1, \hat{y}_2)\): \(\alpha _{Marg.}(\varvec{x}, n_a)= -|Pr(\hat{y}_1|\varvec{x})-Pr(\hat{y}_2|\varvec{x})|\), where a small difference indicates high uncertainty. The entropy-based method (Shannon 2001) uses the predictive entropy H as the indicator, where higher entropy indicates more uncertainty in the posterior probability: \(\alpha _{Entr.}(\varvec{x}, n_a)= H(Pr(\hat{y}|\varvec{x}))\).
Representative: This method selects data \(\varvec{x}\) that are geometrically representative in the vector space (Schröder et al. 2022). In this work, input data \(\varvec{x}\) with the shortest Euclidean distance to the centroid of a cluster are considered representative. KMeans applies unsupervised learning to partition the data into clusters, whose number equals the selection size \(n_a\), and uses the centroid of each cluster. We also introduce a Mean-vectors method as a baseline for comparison, which averages the representation vectors of each training batch to obtain the centroid.

Diversity: This method chooses data \(\varvec{x}\) that are geometrically seen as outliers in the vector space (Mosqueira-Rey et al. 2022). In this work, input data \(\varvec{x}\) with the longest Euclidean distance from a centroid are considered diverse. The centroid selections align with those used in the Representative method.
Random: This method randomly samples data \(\varvec{x}\) from unlabeled pool \(\mathcal {D}^{pool}\).
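To make these acquisition criteria concrete, the sketch below scores a pool of candidates with each strategy. It is an illustrative NumPy implementation of the scoring rules above, not the paper's code; for the KMeans variants, the centroids would come from clustering the pool embeddings into \(n_a\) clusters.

```python
import numpy as np

def least_confidence(probs):
    # alpha_LC: negative posterior of the most likely class; higher = more uncertain
    return -probs.max(axis=1)

def margin_confidence(probs):
    # alpha_Marg.: negative gap between the two most likely classes
    top2 = np.sort(probs, axis=1)[:, -2:]
    return -(top2[:, 1] - top2[:, 0])

def predictive_entropy(probs):
    # alpha_Entr.: Shannon entropy of the posterior distribution
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def representative_score(embeddings, centroid):
    # shortest Euclidean distance to the centroid = most representative
    return -np.linalg.norm(embeddings - centroid, axis=1)

def diversity_score(embeddings, centroid):
    # longest Euclidean distance from the centroid = most diverse
    return np.linalg.norm(embeddings - centroid, axis=1)

def acquire(scores, n_a):
    # indices of the n_a highest-scoring pool examples
    return np.argsort(scores)[-n_a:]

def random_query(pool_size, n_a, seed=0):
    # the Random strategy: uniform sampling without replacement
    return np.random.default_rng(seed).choice(pool_size, size=n_a, replace=False)
```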
4 Learning to learn for CAL
Model-Agnostic Meta-Learning (MAML) (Finn et al. 2017) is an optimization-based meta-learning approach, often referred to as the “learning-to-learn” algorithm. Learning to learn allows a model to adapt to different data distributions, which can be seen as a form of transfer learning that improves generalization (Andrychowicz et al. 2016). Therefore, we exploit the MAML framework to facilitate knowledge transfer and generalization across tasks. Moreover, we harness its fast adaptation ability to address the challenge of resource scarcity.
Learning to fast adapt: We approximately align the meta-objective with the objectives shown in Eqs. 1 and 2 as follows,

$$\begin{aligned} \min _{\theta _{t}} \mathcal {L}\big (f_{U_{\mathcal {T}_t, \mathcal {D}_t}(\theta _{t})}, \mathcal {D}_{1:t}\big ), \end{aligned}$$

(3)

where \(U_{\mathcal {T}_t, \mathcal {D}_t}(\theta _{t})\) is the update operation on \(\theta _{t}\) using the training set \(\mathcal {D}_t\). Specifically, \(U_{\mathcal {T}_t, \mathcal {D}_t}(\theta _{t})\) describes the optimization acting on \(\theta _{t}\) as

$$\begin{aligned} U_{\mathcal {T}_t, \mathcal {D}_t}(\theta _{t}) = \theta _{t} - \alpha \nabla _{\theta _{t}} \mathcal {L}_{\mathcal {T}_t}\big (f_{\theta _{t}}, \mathcal {D}_t\big ). \end{aligned}$$

(4)
Hereby, instead of finding the tuned \(\tilde{\theta }_t\), the model f learns an optimal initialization \(\theta _{t}\) that can effectively adapt to \(\mathcal {D}_{1:t}\) with few labeled examples.
Learning to continually learn: In general, the full datasets \(\mathcal {D}_{1:t-1}\) are not available while learning \(\mathcal {D}_{t}\). Thus, we retrieve a small subset of past samples \(\mathcal {M}\) for experience replay in the meta-objective,

$$\begin{aligned} \min _{\theta _{t}} \mathcal {L}\big (f_{U_{\mathcal {T}_t, \mathcal {D}_t}(\theta _{t})}, \mathcal {M}\big ). \end{aligned}$$

(5)
In such a way, we leverage the selected examples from the past to constrain the learning behaviour of f. As a result, it effectively tackles the problems of inter-task confusion and catastrophic forgetting.
Learning to generalize: We also exploit data augmentations to enhance the model's ability to generalize. In particular, we apply consistency regularization (Bachman et al. 2014), which is built on the assumption that perturbations of the same input should not change the output. Inspired by FixMatch (Sohn et al. 2020), we employ two types of data augmentation in meta-training, strong and weak, denoted by \(\mathcal {A}(\cdot )\) and \(\alpha (\cdot )\), respectively. In contrast to FixMatch, we apply textual augmentations under full supervision to ensure generalization with limited data availability. Specifically, weak and/or strong augmentations are applied in the inner loop to enhance intra-task generalization, while strong augmentations are used in the outer loop to improve both intra- and inter-task generalization. More details are given in §5.1.
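As a rough illustration of this fully supervised consistency term (a sketch with our own naming, not the paper's code), the augmented view is trained against the true label rather than a pseudo-label as in FixMatch; `model` is assumed to map a list of texts to logits, and `augment` stands for \(\alpha (\cdot )\) or \(\mathcal {A}(\cdot )\):

```python
import torch.nn.functional as F

def supervised_consistency_loss(model, texts, labels, augment, w=1.0):
    # clean view and augmented view both use the ground-truth labels
    clean_logits = model(texts)
    aug_logits = model([augment(t) for t in texts])
    return F.cross_entropy(clean_logits, labels) + w * F.cross_entropy(aug_logits, labels)
```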
5 Model
Model architecture: Following Online-aware Meta-Learning (OML) (Javed and White 2019), the proposed model \(f_{\theta }\) consists of a representation learning network \(h_{\theta _{\textrm{e}}}\) with a learnable parameter set \(\theta _{\textrm{e}}\) and a prediction network \(g_{\theta _{\textrm{clf}}}\) with a learnable parameter set \(\theta _{\textrm{clf}}\). The model f is described as \(f_{\theta }(x) = g_{\theta _{\textrm{clf}}} (h_{\theta _{\textrm{e}}}(x))\). The representation learning network acts as an encoder. The prediction network is a single linear layer followed by a softmax.
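A minimal sketch of this two-part architecture, assuming a Hugging Face bert-base-uncased encoder and the [CLS] token as the sentence representation (both our assumptions; the paper specifies only a pretrained \(\hbox {BERT}_{\textrm{BASE}}\) encoder and a single linear prediction layer):

```python
import torch.nn as nn
from transformers import BertModel

class MetaCALModel(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")        # h_{theta_e}
        self.clf = nn.Linear(self.encoder.config.hidden_size, num_classes)   # g_{theta_clf}

    def forward(self, input_ids, attention_mask):
        # [CLS] representation as the sentence encoding
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.clf(h)  # the softmax is folded into the cross-entropy loss
```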
5.1 Training
Each episode has m batches of examples instantaneously drawn from the data stream. For each task, our model is trained on \(\mathcal {D}^{label}\) first and then on \(\mathcal {D}^{pool}\). MAML consists of two optimization loops:
5.1.1 Inner-loop optimization
The inner-loop algorithm performs task-specific tuning. We introduce data augmentations as a regularization term to improve intra-task generalization. The inner-loop loss for training samples \(\mathcal {D}_{i}^{label}\) at time step i is

$$\begin{aligned} \mathcal {L}^{label}_{i} = \sum _{(x, y) \in \mathcal {D}_{i}^{label}} \Big [ \mathcal {L}_{CE}\big (f_{\theta }(x), y\big ) + w \, \mathcal {L}_{CE}\big (f_{\theta }(\alpha (x)), y\big ) \Big ], \end{aligned}$$
where w denotes the relative weight and \(\mathcal {L}_{CE}\) is the cross-entropy loss.
Annotation process
When the received training batches are unlabeled, we apply the acquisition function \(a(\mathcal {D}_{i}^{pool}, m \cdot n_{a})\) to select informative data points for annotation. The per-batch selection size is \(n_{a} = \lceil \frac{b\cdot B_{A}}{|\mathcal {D}^{pool}|} \rceil\), where b is the batch size and \(B_A\) is the annotation budget.
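With the experimental settings reported in §6.2 (b = 16, \(B_{A}\) = 2000, \(|\mathcal {D}^{pool}|\) = 10,000), this formula spreads the budget evenly over the stream at four queries per batch; a small sanity check:

```python
import math

def per_batch_selection_size(b, B_A, pool_size):
    # n_a = ceil(b * B_A / |D_pool|)
    return math.ceil(b * B_A / pool_size)

assert per_batch_selection_size(16, 2000, 10_000) == 4  # ceil(3.2) = 4
```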
Then, the inner-loop loss for newly annotated training batches \(\mathcal {D}^{new}_i\) is

$$\begin{aligned} \mathcal {L}^{new}_{i} = \sum _{(x, y) \in \mathcal {D}_{i}^{new}} \Big [ \mathcal {L}_{CE}\big (f_{\theta }(x), y\big ) + w \, \mathcal {L}_{CE}\big (f_{\theta }(\mathcal {A}(x)), y\big ) \Big ]. \end{aligned}$$
We use different inner-loop losses for the already-labeled data \(\mathcal {D}_{i}^{label}\) and the newly labeled data \(\mathcal {D}^{new}_i\), because newly labeled data may contain more accurate and up-to-date label information. Our model then performs SGD on the parameter set \(\theta _{\textrm{clf}}\) with learning rate \(\alpha\) as

$$\begin{aligned} \tilde{\theta }_{\textrm{clf}} = \theta _{\textrm{clf}} - \alpha \nabla _{\theta _{\textrm{clf}}} \mathcal {L}_{i}, \end{aligned}$$

where \(\mathcal {L}_{i}\) denotes the applicable inner-loop loss (\(\mathcal {L}^{label}_{i}\) or \(\mathcal {L}^{new}_{i}\)).
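A sketch of this task-specific update (our variable names): only the prediction-network parameters \(\theta _{\textrm{clf}}\) are touched, matching the OML-style inner loop, while the encoder stays fixed within the episode.

```python
import torch

def inner_sgd_step(model, inner_loss, alpha=1e-3):
    # gradients w.r.t. the prediction network only
    clf_params = list(model.clf.parameters())
    grads = torch.autograd.grad(inner_loss, clf_params)
    with torch.no_grad():
        for p, g in zip(clf_params, grads):
            p -= alpha * g
```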
Memory sample selection.
We dynamically update the memory buffer \(\mathcal {M}\) using reservoir sampling to ensure generalization while avoiding overfitting. Reservoir sampling (Riemer et al. 2019) randomly selects a fixed number of training samples without knowing the total number of samples in advance. We use reservoir sampling to select \(n_s\) examples per class from the incoming data stream \(\mathcal {D}^{label}_i \bigcup \mathcal {D}^{new}_i\) with an equal selection probability for all data seen so far. Note that the current label space \(\mathcal {Y}_i\) might overlap with previous label spaces; we automatically update the memory samples for \(\mathcal {Y}_i\) to maintain a fixed number of memory samples per class.
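A per-class reservoir sketch; the class-balanced bookkeeping is our reading of “\(n_s\) examples per class”, while the replacement rule itself is standard reservoir sampling.

```python
import random
from collections import defaultdict

class PerClassReservoir:
    """Keep at most n_s examples per class, uniformly over the stream seen so far."""
    def __init__(self, n_s):
        self.n_s = n_s
        self.buffer = defaultdict(list)   # label -> stored examples
        self.seen = defaultdict(int)      # label -> number of stream examples seen

    def add(self, example, label):
        self.seen[label] += 1
        slot = self.buffer[label]
        if len(slot) < self.n_s:
            slot.append(example)
        else:
            j = random.randrange(self.seen[label])
            if j < self.n_s:              # kept with probability n_s / seen[label]
                slot[j] = example
```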
5.1.2 Outer-loop optimization
The outer-loop algorithm optimizes the initial parameter set \(\theta\) towards a setting from which f can effectively adapt to \(\mathcal {D}_{1:t}\) via a few gradient updates. Due to memory constraints, only a small number of past samples is available. The model reads all examples from the memory buffer \(\mathcal {M}\). The outer-loop objective is then to have \(\tilde{\theta } = \theta _{\textrm{e}} \cup \tilde{\theta }_{\textrm{clf}}\) generalize well on \(\mathcal {D}_{1:t}\) using \(\mathcal {M}\), as shown in Eq. 5.
To improve both intra- and inter-task generalization, we apply strong augmentation to the memory samples in the meta-objective as

$$\begin{aligned} \mathcal {L}_{outer} = \sum _{(x, y) \in \mathcal {M}} \Big [ \mathcal {L}_{CE}\big (f_{\tilde{\theta }}(x), y\big ) + w \, \mathcal {L}_{CE}\big (f_{\tilde{\theta }}(\mathcal {A}(x)), y\big ) \Big ]. \end{aligned}$$
To reduce the complexity of the second-order computation in the outer loop, we use a first-order approximation, namely FOMAML. The outer-loop optimization process is

$$\begin{aligned} \theta \leftarrow \theta - \beta \nabla _{\tilde{\theta }} \mathcal {L}_{outer}, \end{aligned}$$
where \(\beta\) is the outer-loop learning rate. Algorithm 1 outlines the complete training procedure.
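A first-order outer step might look as follows: a sketch in which the meta-gradient is computed at the adapted weights \(\tilde{\theta }\) and applied directly to the initial weights \(\theta\), dropping second-order terms.

```python
import torch

def fomaml_outer_step(initial_params, adapted_params, meta_loss, beta=3e-5):
    # gradients at the adapted (fast) parameters...
    grads = torch.autograd.grad(meta_loss, adapted_params)
    with torch.no_grad():
        # ...applied to the initial (slow) parameters
        for theta, g in zip(initial_params, grads):
            theta -= beta * g
```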
5.2 Testing
The model randomly samples m batches of examples from \(\mathcal {M}\) as the support set S and performs SGD on these samples to fine-tune the parameter set \(\theta _{\textrm{clf}}\) with learning rate \(\alpha\). The inner-loop loss at test time is

$$\begin{aligned} \mathcal {L}_{test} = \sum _{(x, y) \in S} \mathcal {L}_{CE}\big (f_{\theta }(x), y\big ), \end{aligned}$$
where \(S \subseteq \mathcal {M}\). The optimization process is

$$\begin{aligned} \tilde{\theta }_{\textrm{clf}} = \theta _{\textrm{clf}} - \alpha \nabla _{\theta _{\textrm{clf}}} \mathcal {L}_{test}. \end{aligned}$$
Then, we output the prediction on a test sample \(x_{\textrm{test}}\) using the parameter set \(\tilde{\theta } = \theta _{\textrm{e}} \cup \tilde{\theta }_{\textrm{clf}}\) as

$$\begin{aligned} \hat{y} = \mathop {\mathrm {arg\,max}}\limits _{y} Pr\big (y \mid x_{\textrm{test}}; \tilde{\theta }\big ). \end{aligned}$$
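Test-time adaptation thus reuses the inner-loop machinery on a memory-drawn support set. A sketch under our naming, reusing `inner_sgd_step` from the inner-loop sketch above and assuming `model` maps a batch of texts to logits:

```python
import torch.nn.functional as F

def adapt_and_predict(model, support_batches, test_inputs, alpha=1e-3):
    # S ⊆ M: fine-tune theta_clf on m support batches sampled from memory
    for texts, labels in support_batches:
        loss = F.cross_entropy(model(texts), labels)
        inner_sgd_step(model, loss, alpha)
    return model(test_inputs).argmax(dim=-1)   # y_hat with theta_tilde
```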
6 Experiments
6.1 Datasets
We use the text classification benchmark datasets from Zhang et al. (2015), including AGNews (news classification; 4 classes), Yelp (sentiment analysis; 5 classes), Amazon (sentiment analysis; 5 classes), DBpedia (Wikipedia article classification; 14 classes) and Yahoo (questions and answers categorization; 10 classes). This collection contains 5 tasks from 4 different domains, which covers class- and domain-incremental learning in task sequence. We randomly sample 5 labeled instances per class, 10,000 unlabeled instances, and 7600 test examples from each of the datasets. Following prior studies (Wang et al. 2020; Holla et al. 2020; Ho et al. 2023), we concatenate training sets in 4 different orderings as shown in Table 1.
6.2 Implementation details
Our example encoder is a pretrained \(\hbox {BERT}_{\textrm{BASE}}\) model (Devlin et al. 2019). The parameter size of our model is 109 M. For learning rates, \(\alpha = 1e^{-3}\) and \(\beta = 3e^{-5}\). The training batch size is 16. The number of mini-batches in each episode is m = 5. The label budget is \(B_{A}\) = 2000 examples per task and the memory budget \(B_\mathcal {M}\) is 5 per class, i.e., \(n_s = 5\). For textual augmentation, we randomly swap words as the weak augmentation, and we apply the combination of randomly swapping words, randomly deleting words and substituting words with WordNet synonyms as the strong augmentation. We use nlpaug,Footnote 1 a Python package, to implement the augmentations. All models are executed on a Linux platform with 8 Nvidia Tesla A100 GPUs and 40 GB of RAM. All experiments are performed using PyTorchFootnote 2 (Paszke et al. 2019).
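For reference, the weak and strong augmenters described above can be composed with nlpaug roughly as follows; the exact augmenter settings are not specified here, so this is an illustrative configuration.

```python
import nlpaug.augmenter.word as naw
import nlpaug.flow as naf

weak_aug = naw.RandomWordAug(action="swap")      # alpha(.): random word swap
strong_aug = naf.Sequential([                    # A(.): swap + delete + synonym
    naw.RandomWordAug(action="swap"),
    naw.RandomWordAug(action="delete"),
    naw.SynonymAug(aug_src="wordnet"),
])

print(strong_aug.augment("the service was quick and the food was excellent"))
```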
6.3 Baselines
Baseline model:

1. MAML-SEQ: Online FOMAML algorithm.

2. OML-ER (Holla et al. 2020): OML with a 5% episodic experience replay rateFootnote 3 + reservoir sampling.

3. C-MAMLFootnote 4 (Gupta et al. 2020): OML with meta & CL objective alignment + reservoir sampling.

4. Meta-CAL (ours): OML with meta & CL objective alignment + consistency regularization + reservoir sampling.

5. FULL: Supervised C-MAML, trained on the full datasets.
Memory sample selection:

1. Prototype (Ho et al. 2023): Select representative samples that are closest to dynamically updated prototypes in the representation space.

2. Ring Buffer (Chaudhry et al. 2019): Use a ‘First-In, First-Out’ scheme to update the buffer.

3. Reservoir Sampling (Riemer et al. 2019): Randomly select data with an equal selection probability.
6.4 Evaluation metrics
Following prior work (Wang et al. 2024), we use three comprehensive and widely used metrics in CL, i.e., accuracy, backward transfer and forward transfer. Let \(R_{k,i}\) be the macro-averaged accuracy evaluated on the test set of the i-th task after sequentially learning the first k tasks.
Accuracy:
Overall accuracy is the weighted average accuracy of all seen tasks \(\mathcal {T}_{1:T}\).
Backward transfer (stability evaluation):
BWT measures how the updated parameters affect model performance on all previously seen tasks.
Forward transfer (plasticity evaluation):
FWT quantifies the average impact of all preceding tasks on the current task k.
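For clarity, the three metrics can be computed from the accuracy matrix R as below. The displayed equations are omitted in this version, so we use commonly adopted definitions as an assumption, reading forward transfer as the zero-shot accuracy on a task before it is trained (some variants additionally subtract a random-initialization baseline).

```python
import numpy as np

def cl_metrics(R):
    """R[k, i]: accuracy on task i's test set after training tasks 0..k (0-indexed)."""
    T = R.shape[0]
    # overall accuracy after the final task; with equal-sized test sets
    # (7600 examples each), the weighted and unweighted means coincide
    accuracy = R[-1].mean()
    # backward transfer: change on each task between learning it and the end of training
    bwt = np.mean([R[-1, i] - R[i, i] for i in range(T - 1)])
    # forward transfer: zero-shot accuracy on task k before training on it
    fwt = np.mean([R[k - 1, k] for k in range(1, T)])
    return accuracy, bwt, fwt
```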
6.5 Main results
In Table 2, we compare various models with four AL strategies: Random (denoted RAND), Representative via KMeans (REP), Diversity via KMeans (DIV), and Uncertainty via Least-confidence (UNC). The label budget is \(B_{A}\) = 2000 examples per task. Each record is the average of the three best results from five runs.
The results show that our method is comparable to the FULL baseline, which is trained on more than 10,000 labeled samples per task. This indicates that our method can effectively select 2000 informative samples from 10,000 unlabeled data points to maximize model performance. Compared to the other baselines within the same MAML framework, our method obtains the highest average accuracy across the four commonly used AL strategies, demonstrating its robustness to different AL approaches. It also indicates that Meta-CAL has a strong ability to prevent catastrophic forgetting after sequentially learning five tasks.
We also perform paired t-tests. Our model with the default setting is significantly better than C-MAML, with p-values < 0.04 for all four AL strategies. Comparing AL strategies for the proposed model, Random is significantly better than the other strategies, with p-values < 0.03.
As for the memory sample selection schemes, Ring Buffer outperforms Reservoir Sampling by less than 1% for RAND and REP, whereas Reservoir Sampling exhibits smaller standard deviations, indicating robustness to training set orders. Prototype sampling performs worst. We provide further analysis in §7.2.
Tables 2 and 3 demonstrate that Random is the best AL strategy for our model. As shown in Table 3, for both the Diversity- and Representative-based AL strategies, KMeans is the most effective approach. Uncertainty-based methods perform comparatively poorly. CL models revisit past samples to enhance generalization; however, replaying and annotating uncertain samples may not sufficiently improve generalization capabilities, which can hinder effective knowledge consolidation and consequently harm model performance.
7 Further analysis
We use the training set orderFootnote 5 Yelp \(\rightarrow\) AGNews \(\rightarrow\) DBpedia \(\rightarrow\) Amazon \(\rightarrow\) Yahoo to perform further analysis. In this section, we denote RAND, REP, DIV and UNC as Random, Representative (KMeans), Diversity (KMeans) and Uncertainty (Least-confidence), respectively.
7.1 Stability & plasticity
We test the performance of different AL strategies in terms of stability and plasticity.
Per-task accuracy at different learning stages. The dark color indicates high accuracy. From left to right, the color for each task progressively fades, indicating that forgetting happens as more tasks are learned. Note that Yelp and Amazon are from the same domain (sentiment analysis). UNC shows a lighter color compared to the other AL methods, indicating a higher degree of forgetting. The red-outlined box shows the accuracy on AGNews (Task 2) after learning DBpedia (Task 3). (Color figure online)
Stability: We employ the BWT metric as an indicator of catastrophic forgetting, evaluating the impact of the updated model parameters on the performance across all previously learned tasks. Negative BWT values indicate forgetting. As shown in Fig. 2, RAND shows the least forgetting, indicating the best stability on past tasks. However, all methods exhibit substantial forgetting after sequentially learning Task 3. Task 3 (DBpedia, 14 classes) has a relatively large label space compared to Task 1 (Yelp, 5 classes) and Task 2 (AGNews, 4 classes). As shown in Fig. 1, the accuracy on DBpedia exceeds the accuracy on the other tasks. Since MAML learns high-quality reusable features for fast adaptation (Raghu et al. 2020), it tends to find shared representations that specifically benefit tasks with large label spaces.
Plasticity: A positive FWT score indicates the model’s ability to leverage knowledge from previously learned tasks, facilitating zero-shot learning and efficient adaptation to the new task. As shown in Fig. 2, RAND, REP, and DIV show positive FWT values after training Task 4. This observation indicates that successful forward knowledge transfer from the previously learned tasks occurs. Figure 1 further supports that this transfer mainly occurs within the same domain, demonstrating effective domain-incremental learning capabilities. Specifically, the proposed method leverages prior knowledge acquired from a familiar domain to facilitate efficient adaptation to new tasks within that domain.
Overall, RAND shows the best stability while REP shows the best performance in plasticity.
7.2 Memory insight
The level of generalization can be inferred from data dispersion. In this section, we investigate the effect of memory samples resulting from active learning and memory sample selection strategies.
T-SNE visualization of memory samples at different learning stages using the training set order Yelp \(\rightarrow\) AGNews \(\rightarrow\) DBpedia \(\rightarrow\) Amazon \(\rightarrow\) Yahoo. The black-circled data points belong to the last task. Data points with darker colors represent samples from earlier tasks, except in (d), after learning Task 4. Task 4 shares its domain with Task 1; hence, the data points with darker colors belong to the latest task in (d). (Color figure online)
Different learning stages: Fig. 3 presents the T-SNE visualization of memory samples at different learning stages. As shown in Fig. 3a, b, when the proposed model has learned only a small number of tasks or encounters small label spaces, it focuses on ensuring intra-task generalization. However, as the number of seen tasks increases, the model shifts its focus towards ensuring inter-task generalization, as shown in Fig. 3d, e. Consequently, we observe that the memory samples from the last tasks cluster together.
T-SNE visualization of memory samples with different AL strategies. An equal dispersion of the data indicates a good memory representation. We provide accuracy for a better comparison. Data points with darker colors (purple, violet and pink) represent samples from earlier tasks. The black-circled data points belong to the last task. The clustering of memory samples from the last task suggests the model focuses more on inter-task generalization than on intra-task generalization. (Color figure online)
AL strategies: As shown in Fig. 4a–d, memory data in RAND show a good dispersion within the last task (i.e., intra-task generalization) and across multiple tasks (i.e., inter-task generalization), resulting in the best accuracy. In contrast, UNC in Fig. 4d shows subpar inter-task generalization. This result validates UNC's inability to achieve generalization and preserve knowledge. Both REP and DIV exhibit lower accuracy than RAND, with DIV showing noticeably lower sparsity. Therefore, it is important to find an optimal balance between representativeness and diversity to ensure generalization.
Memory sample selection methods: In Fig. 4a, e and f, the choice of memory sample selection method also affects the model’s performance. Prototype Sampling selects representative memory samples, potentially resulting in relatively less sparsity across multiple tasks. In contrast, Reservoir and Ring Buffer sampling strategies introduce randomness into the memory sample selection process, achieving better inter-task generalization.
Therefore, the randomness introduced by AL and memory sample selection methods can be beneficial in consolidating knowledge by ensuring generalization. Furthermore, it is noteworthy that inter-task confusion does not occur in the learned embedding space. This validates the superiority of our method in addressing confusion between old and new tasks.
7.3 Effect of augmentations
We conduct an ablation study to analyze the textual augmentations and their key roles in Meta-CAL. We examine the effect of inner-loop and outer-loop augmentation in Table 4. The meta samples in the outer loop constrain the model behaviour.
We hypothesize that if the meta samples exhibit sufficient generalization capabilities, they can facilitate effective knowledge retention from prior tasks, consequently mitigating the issue of catastrophic forgetting. Note that the meta samples are acquired through the combined effect of active learning acquisition and memory sample selection; we further enhance their generalization capabilities through data augmentation. The results validate this hypothesis, as we observe a significant gain from using outer-loop augmentation to enhance generalization. It improves accuracy by approximately 10% when data availability and the label budget are extremely limited.
While inner-loop augmentation proves effective in extreme cases, it might not be as advantageous as outer-loop augmentation in non-extreme scenarios. Nevertheless, the combination of inner- and outer-loop augmentations still significantly improves accuracy.
7.4 Annotation budgets
As shown in Fig. 5, in contrast to the other AL strategies, increasing the annotation budget for UNC degrades model performance. This demonstrates that annotating uncertain samples is not beneficial for knowledge retention. REP achieves the best performance when the label budget is 500, while RAND outperforms the other methods in most cases. Consequently, a certain degree of representativeness can aid knowledge consolidation when the annotation budget is extremely limited. In addition, our model achieves an accuracy of more than 62% when the label budget is only 500 samples per task, suggesting its fast adaptation ability.
7.5 Memory budgets
We evaluate the memory efficiency of our model. Table 5 compares the performance of the three models with the best average accuracy, i.e., our model with Reservoir sampling, our model with Ring Buffer and C-MAML. While all these models use the MAML framework, only the Meta-CAL models employ consistency regularization to enhance generalization. The results show that the Meta-CAL models outperform C-MAML in all three cases. Notably, Meta-CAL w/ Reservoir attains more than 50% accuracy while saving only one sample per seen class. This also indicates that consistency regularization through data augmentation improves performance.
8 Conclusion
This paper considers a realistic continual learning scenario, namely few-shot continual active learning. To address this resource-constrained continual learning problem, we propose a novel method that employs meta-learning and active learning techniques, called Meta-Continual Active Learning. Specifically, our model dynamically queries worthwhile unlabeled data for annotation and reformulates the meta-objective with experience rehearsal and consistency regularization. In such a way, the proposed method is able to prevent catastrophic forgetting and improve generalization capability in a low-resource scenario. We conduct extensive experiments on benchmark text classification datasets in a 5-shot continual active learning setting, and the results show the robustness of the proposed method. However, we only evaluate our method in a 5-shot case. In future work, we can extend the evaluation to a more realistic scenario where the amount of labeled data is {100, 1000, 5000} and to other NLP tasks, e.g., language model training, text generation, and knowledge base enrichment. Furthermore, the annotation budget is currently allocated equally to each task; annotation budget allocation strategies can be studied further.
Data availability
No datasets were generated or analysed during the current study.
Notes
3. The default 1% rate shows bad performance.

4. Being modified for NLP tasks.

5. The performance on this order is close to the average performance.
References
Adel T, Zhao H, Turner RE (2020) Continual learning with adaptive weights (CLAW). In: 8th international conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020
Andrychowicz M, Denil M, Colmenarejo SG, Hoffman MW, Pfau D, Schaul T, Freitas N (2016) Learning to learn by gradient descent by gradient descent. In: Lee DD, Sugiyama M, Luxburg U, Guyon I, Garnett R (eds.) Advances in neural information processing systems 29: annual conference on neural information processing systems 2016, Dec 5–10, 2016, Barcelona, Spain, pp. 3981–3989
Ayub A, Fendley C (2022) Few-shot continual active learning by a robot. In: Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A (eds.) Advances in neural information processing systems 35: annual conference on neural information processing systems 2022, NeurIPS 2022, New Orleans, LA, USA, Nov 28–Dec 9, 2022
Bachman P, Alsharif O, Precup D (2014) Learning with pseudo-ensembles. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds.) advances in neural information processing systems 27: annual conference on neural information processing systems 2014, Dec 8–13 2014, Montreal, Quebec, Canada, pp. 3365–3373
Beaulieu S, Frati L, Miconi T, Lehman J, Stanley KO, Clune J, Cheney N (2020) Learning to continually learn. In: ECAI 2020 - 24th European conference on artificial intelligence, Santiago de Compostela, Spain, Aug 29–Sept 8, 2020 - Including 10th conference on prestigious applications of artificial intelligence (PAIS 2020). Frontiers in Artificial Intelligence and Applications, vol. 325, pp. 992–1001. IOS Press
Chaudhry A, Ranzato M, Rohrbach M, Elhoseiny M (2019) Efficient lifelong learning with A-GEM. In: 7th International conference on learning representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019
Chaudhry A, Rohrbach M, Elhoseiny M, Ajanthan T, Dokania PK, Torr PHS, Ranzato M (2019) Continual learning with tiny episodic memories. ArXiv arXiv:1902.10486
Chen X, Wu H, Shi X (2023) Consistent prototype learning for few-shot continual relation extraction. In: Proceedings of the 61st annual meeting of the association for computational linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9–14, 2023, pp. 7409–7422
Culotta A, McCallum A (2005) Reducing labeling effort for structured prediction tasks. In: Proceedings, the twentieth national conference on artificial intelligence and the seventeenth innovative applications of artificial intelligence conference, July 9–13, 2005, Pittsburgh, Pennsylvania, USA, pp. 746–751
Das AM, Bhatt G, Bhalerao MM, Gao VR, Yang R, Bilmes J (2023) Continual active learning
Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186
Finn C, Abbeel P, Levine S (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017. Proceedings of machine learning research, vol. 70, pp. 1126–1135
Gupta G, Yadav K, Paull L (2020) La-maml: look-ahead meta learning for continual learning. In: Proceedings of the 34th international conference on neural information processing systems. NIPS’20. Curran Associates Inc., Red Hook, NY, USA
Han X, Dai Y, Gao T, Lin Y, Liu Z, Li P, Sun M, Zhou J (2020) Continual relation learning via episodic memory activation and reconsolidation. In: Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020, pp. 6429–6440
Ho S, Liu M, Du L, Gao L, Xiang Y (2023) Prototype-guided memory replay for continual learning. IEEE Trans Neural Netw Learn Syst 1–11
Holla N, Mishra P, Yannakoudakis H, Shutova E (2020) Meta-learning with sparse experience replay for lifelong language learning. ArXiv arXiv:2009.04891
Huang B, Chen Z, Zhou P, Chen J, Wu Z (2023) Resolving task confusion in dynamic expansion architectures for class incremental learning. In: Thirty-seventh AAAI conference on artificial intelligence, AAAI 2023, thirty-fifth conference on innovative applications of artificial intelligence, IAAI 2023, thirteenth symposium on educational advances in artificial intelligence, EAAI 2023, Washington, DC, USA, Feb 7–14, 2023, pp. 908–916
Javed K, White M (2019) Meta-learning representations for continual learning. In: Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R (eds.) Advances in neural information processing systems 32: D2019, NeurIPS 2019, Dec 8–14, 2019, Vancouver, BC, Canada, pp. 1818–1828
Kirkpatrick J, Pascanu R, Rabinowitz NC, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A, Hassabis D, Clopath C, Kumaran D, Hadsell R (2017) Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci USA 114:3521–3526
Lee S, Jeon H, Son J, Kim G (2024) Learning to continually learn with the Bayesian principle. Proc Mach Learn Res 235:26621–26639
Li Z, Hoiem D (2018) Learning without forgetting. IEEE Trans Pattern Anal Mach Intell 40:2935–2947
Lin W, Chen J, Huang R, Ding H (2024) An effective dynamic gradient calibration method for continual learning. Proc Mach Learn Res 235:29872–29889
de Masson d’Autume C, Ruder S, Kong L, Yogatama D (2019) Episodic memory in lifelong language learning. In: Wallach HM, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox EB, Garnett R (eds.) Advances in neural information processing systems 32: annual conference on neural information processing systems 2019, NeurIPS 2019, Dec 8–14, 2019, Vancouver, BC, Canada, pp. 13122–13131
McCloskey M, Cohen NJ (1989) Catastrophic interference in connectionist networks: the sequential learning problem. Psychol Learn Motiv 24:109–165
Mosqueira-Rey E, Hernández-Pereira E, Alonso-Ríos D, Bobes-Bascarán J, Fernández-Leal Á (2022) Human-in-the-loop machine learning: a state of the art. Artificial Intelligence Review, 1–50
Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY (2011) Reading digits in natural images with unsupervised feature learning. In: NIPS workshop on deep learning and unsupervised feature learning 2011
Nichol A, Achiam J, Schulman J (2018) On first-order meta-learning algorithms arXiv:1803.02999
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inform Process Syst 32:8024–8035
Perkonigg M, Hofmanninger J, Herold CJ, Prosch H, Langs G (2021) Continual active learning using pseudo-domains for limited labelling resources and changing acquisition characteristics. Machine Learning for Biomedical Imaging
Qin C, Joty SR (2022) Continual few-shot relation learning via embedding space regularization and data augmentation. In: Proceedings of the 60th annual meeting of the association for computational linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22–27, 2022, pp. 2776–2789
Raghu A, Raghu M, Bengio S, Vinyals O (2020) Rapid learning or feature reuse? towards understanding the effectiveness of MAML. In: 8th International conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020
Riemer M, Cases I, Ajemian R, Liu M, Rish I, Tu Y, Tesauro G (2019) Learning to learn without forgetting by maximizing transfer and minimizing interference. In: 7th International conference on learning representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019
Schröder C, Niekler A, Potthast M (2022) Revisiting uncertainty-based query strategies for active learning with transformers. In: Findings of the association for computational linguistics: ACL 2022, Dublin, Ireland, May 22–27, 2022, pp. 2194–2203
Shannon CE (2001) A mathematical theory of communication. SIGMOBILE Mob Comput Commun Rev 5(1):3–55
Sohn K, Berthelot D, Carlini N, Zhang Z, Zhang H, Raffel C, Cubuk ED, Kurakin A, Li C (2020) Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In: Advances in neural information processing systems 33: annual conference on neural information processing systems 2020, NeurIPS 2020, Dec 6–12, 2020, Virtual
van de Ven GM, Li Z, Tolias AS (2021) Class-incremental learning with generative classifiers. In: IEEE Conference on computer vision and pattern recognition workshops, CVPR Workshops 2021, Virtual, June 19–25, 2021, pp. 3611–3620
Wang Z, Liu Y, Ji T, Wang X, Wu Y, Jiang C, Chao Y, Han Z, Wang L, Shao X, Zeng W (2023) Rehearsal-free continual language learning via efficient parameter isolation. In: Proceedings of the 61st annual meeting of the association for computational linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9–14, 2023, pp. 10933–10946
Wang Z, Mehta SV, Póczos B, Carbonell JG (2020) Efficient meta lifelong-learning with limited memory. In: Proceedings of the 2020 conference on empirical methods in natural language processing, EMNLP 2020, Online, Nov 16–20, 2020, pp. 535–548
Wang P, Song Y, Liu T, Lin B, Cao Y, Li S, Sui Z (2022) Learning robust representations for continual relation extraction via adversarial class augmentation. In: Proceedings of the 2022 conference on empirical methods in natural language processing, pp. 6264–6278
Wang H, Xiong W, Yu M, Guo X, Chang S, Wang WY (2019) Sentence embedding alignment for lifelong relation extraction. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), pp. 796–806
Wang L, Zhang X, Su H, Zhu J (2024) A comprehensive survey of continual learning: theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–20
Wu Y, Huang L-K, Wang R, Meng D, Wei Y (2024) Meta continual learning revisited: Implicitly enhancing online hessian approximation via variance reduction. In: The Twelfth international conference on learning representations
Yoon J, Yang E, Lee J, Hwang SJ (2018) Lifelong learning with dynamically expandable networks. In: 6th International conference on learning representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Conference Track Proceedings
Zhang X, Zhao JJ, LeCun Y (2015) Character-level convolutional networks for text classification. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R (eds.) Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 649–657
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions.
Author information
Contributions
S. Ho designed methodology, conducted experiment, performed analysis, drafted and finalised the manuscript. M. Liu designed methodology, performed analysis, proofread the manuscript. S. Gao proofread the manuscript and provided supervision. L. Gao proofread the manuscript and provided supervision.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ho, S., Liu, M., Gao, S. et al. Learning to learn for few-shot continual active learning. Artif Intell Rev 57, 280 (2024). https://doi.org/10.1007/s10462-024-10924-x