0% found this document useful (0 votes)

11 views

Recent Advances of Foundation Language Models-Based Continual Learning - A Survey

Uploaded by

cardinalshan

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

11 views

Recent Advances of Foundation Language Models-Based Continual Learning - A Survey

Uploaded by

cardinalshan

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 40

Recent Advances of Foundation Language Models-based Continual Learning: A

Survey

YUTAO YANG, JIE ZHOU∗ , XUANWEN DING, TIANYU HUAI, SHUNYU LIU, QIN CHEN, LIANG
HE, and YUAN XIE, School of Computer Science and Technology, East China Normal University, China
Recently, foundation language models (LMs) have marked significant achievements in the domains of natural language processing
(NLP) and computer vision (CV). Unlike traditional neural network models, foundation LMs obtain a great ability for transfer learning
by acquiring rich commonsense knowledge through pre-training on extensive unsupervised datasets with a vast number of parameters.
arXiv:2405.18653v1 [cs.CL] 28 May 2024

However, they still can not emulate human-like continuous learning due to catastrophic forgetting. Consequently, various continual
learning (CL)-based methodologies have been developed to refine LMs, enabling them to adapt to new tasks without forgetting previous
knowledge. However, a systematic taxonomy of existing approaches and a comparison of their performance are still lacking, which is
the gap that our survey aims to fill. We delve into a comprehensive review, summarization, and classification of the existing literature
on CL-based approaches applied to foundation language models, such as pre-trained language models (PLMs), large language models
(LLMs) and vision-language models (VLMs). We divide these studies into offline CL and online CL, which consist of traditional methods,
parameter-efficient-based methods, instruction tuning-based methods and continual pre-training methods. Offline CL encompasses
domain-incremental learning, task-incremental learning, and class-incremental learning, while online CL is subdivided into hard task
boundary and blurry task boundary settings. Additionally, we outline the typical datasets and metrics employed in CL research and
provide a detailed analysis of the challenges and future work for LMs-based continual learning.

CCS Concepts: • Computing methodologies → Natural language generation; Scene understanding; Cognitive robotics; Cognitive
science; Intelligent agents.

Additional Key Words and Phrases: Continual Learning, Foundation Language Models, Pre-trained Language Models, Large Language
Models, Vision-Language Models, Survey

ACM Reference Format:

Yutao Yang, Jie Zhou, Xuanwen Ding, Tianyu Huai, Shunyu Liu, Qin Chen, Liang He, and Yuan Xie. 2018. Recent Advances of
Foundation Language Models-based Continual Learning: A Survey. In Proceedings of Make sure to enter the correct conference title from
your rights confirmation emai (Conference acronym ’XX). ACM, New York, NY, USA, 40 pages. https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION
Recent advancements in foundation language models (LMs) have set new benchmarks in both natural language
processing (NLP) [136, 226, 232] and computer vision (CV) [188]. Foundation LMs encompass three primary categories:
Pre-trained Language Models (PLMs) [136], Large Language Models (LLMs) [226], and Vision-Language Models
(VLMs) [42]. PLMs such as BERT [88], RoBERTa [120], and BART [102] focus on text-based tasks and are crucial for
∗ Corresponding authors.

Authors’ Contact Information: Yutao Yang; Jie Zhou, [email protected]; Xuanwen Ding; Tianyu Huai; Shunyu Liu; Qin Chen; Liang He; Yuan Xie,
School of Computer Science and Technology, East China Normal University, Shanghai, China.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on
servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Manuscript submitted to ACM

Manuscript submitted to ACM 1

2 Yutao Yang et al.

Traditional CL Task 1 … Task i … Task N

Model 1 … Model i … Model N

Neural
Text
Network
cannot predict before learning

Foundation Language Task 1 … Task i … Task N

Models (LMs)–Based CL Instruction: Create a Instruction: Extract Instruction: Translate
story with keywords characters from article sentences into French
Input: Explorers, Input: John Doe joins Input: The quick
Island … the climate summit … … brown fox jumps …
Output: Explorers Output: John Doe Output: Le renard
landed on an island … brun rapide saute …
Adapter 1 Adapter 2 Adapter 3

Foundation
Text
LMs
Zero/few-shot prediction without training

Fig. 1. Comparison between traditional CL and Foundation language models (LMs)-Based CL.

understanding and generating language by leveraging tasks like masked language modeling during pre-training. LLMs,
including models like GPT-4 [1] and LLaMA [173], extend the capabilities of PLMs by increasing the scale of model
architecture and training data, thus enhancing their generality and adaptability across a broader range of tasks. VLMs,
represented by VisualBERT [106], CLIP [154], LLaVA [113] and DALL-E [156], integrate text and image modalities to
enable complicated interactions between visual and textual information. The underlying paradigm of these models
involves pre-training on extensive, often unlabeled datasets to capture rich semantic information, which is subsequently
fine-tuned for specific tasks or domains. This methodology not only boosts performance across various applications but
also significantly enhances the models’ flexibility and task adaptability.
However, these foundation models often demonstrate limitations in dynamic environments with a sequence of
tasks, primarily due to their fixed parameters once training is completed. These models generally lack the capability
to integrate new data or concepts without undergoing a retraining process. A significant challenge associated with
training on a sequence of tasks is “catastrophic forgetting" [92], a phenomenon where a model loses previously acquired
knowledge upon learning new information. This is in stark contrast to human learning processes, which are inherently
continuous and adaptive. Despite the successes of multi-task learning (MTL) and transfer learning (TL) in certain
applications, they have limitations in real-world scenarios. MTL necessitates having all tasks and their data available
upfront, which poses a challenge when launching a new service as the model must be retrained with all the data.
Furthermore, TL is typically done with only two tasks, i.e., the source and the target, rendering it impractical for
real-world online platforms with multiple target tasks. To address these challenges, it is crucial for models to process
and learn the continuously expanding and diversifying datasets. This requires mechanisms that allow models to adapt
to new linguistic phenomena and trends without compromising the accuracy and sensitivity towards historical data.
Consequently, continual learning (CL) [175, 186], also referred to as lifelong learning [145] or incremental learning
[230], is a crucial area in artificial intelligence that seeks to develop systems capable of continuously updating themselves
and acquiring new knowledge, without forgetting previously learned information, similar to human learning [34]. This
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 3

LFPT5 [150], B-CL [86], ELLE [152], AdapterCL [128], RMR_ DSE [103], DEMIX [47] , CLASSIC [84],
PLMs-based DIL (§4.1.1)
CPT [81], C-PT [234], CL-KD [20], PlugLM [27], Pretr [30], AEWC [100], Continual DAP-training [85]

Domain-Incremental COPR [217], LAMOL [171], RVAE_LAMOL [183], Adapt-Retrieve-Revise [224], Lifelong-MoE [25],
Learning (§4.1) LLM-based DIL (§4.1.2)
CKL [73], DACP [201], CPPO [218], EcomGPT-CT [126]

VLMs-based DIL (§4.1.3) S-Prompt [191], Medical AI [206], VQACL [220]

PP [158], CTR [83], MeLL [182], LINC [111], ERDA [149], PCLL [227], CLIF [76], ConTinTin [208],
PLMs-based TIL (§4.2.1) HMI [129], Adaptive Compositional Modules [223], DYNAINST [137], Conure [214], TERACON [91],
Offline Continual ERNIE 2.0 [172], RecyclableTuning [151]
Learning (§4) Task-Incremental
Learning (§4.2) LLMs-based TIL (§4.2.2) Conpet [170], InstructAlign [15], Continual-T0 [163], DynaMind [41], ELM [71], O-LoRA [189], JARe [146]
Foundation LMs-based CL

VLMs-based TIL (§4.2.3) Medical AI [206], CTP [233], ZSCL [228], MoE-Adapters4CL [212], TRIPLET [148]

EPI [193], IDBR [68], PAGeR [177], ENTAILMENT [200], ExtendNER [138], PLE [105], DE&E [197],
PLMs-based CIL (§4.3.1)
SRC [116]
Class-Incremental
Learning (§4.3)
MoE-Adapters4CL [212], VLM-PL [90], Adaptation-CLIP [117], PROOF [231], LGCL [89], ZSCL [228],
VLMs-based CIL (§4.3.2)
CLAP [75], GMM [18]
Hard Task
PLMs-based HTB (§5.1.1) MBPA++ [35], Meta-MBPA++ [194], OML-ER [63], TPEM [47], CID [115], ProgModel [168]
Boundary (§5.1)
Online Continual
Learning (§5) PLMs-based BTB (§5.2.1) MBPA++ [35], Meta-MBPA++ [194], OML-ER [63], TPEM [47], CID [115]
Blurry Task
Boundary (§5.2)
VLMs-based BTB (§5.2.2) CBA [187], MVP [139], DKR [32]

Fig. 2. Taxonomy of foundation language models for continual learning.

paradigm is particularly relevant in the context of foundation language models (LMs), which are challenged by specific
issues such as catastrophic forgetting (CF) and cross-task knowledge transfer (KT). Catastrophic forgetting represents a
significant challenge, where a model tends to lose previously acquired knowledge upon learning new information. To
address this, language models must maintain a robust grasp of past language data while adapting to new linguistic
trends. Furthermore, cross-task knowledge transfer is essential for enhancing the continual learning process. Effective
KT not only accelerates the learning curve for new tasks (forward transfer) but also enhances the model’s performance
on prior tasks via the feedback of new knowledge (backward transfer).
Recent advancements in continual learning methodologies have substantially enhanced the adaptability and knowl-
edge retention capabilities of foundational language models (LMs). These developments are crucial for addressing
complex challenges previously observed in CL. Researchers have formulated innovative strategies to mitigate these
challenges, thereby enabling LMs to maintain high performance across a variety of tasks while continually integrating
new knowledge [30, 99, 134]. Notable successes have been documented in diverse downstream tasks, such as aspect-
based sentiment analysis, where continual learning enables dynamic adaptation to evolving aspects and sentiments [84].
Similarly, in dialogue generation, the novel technologies assist models in refining and expanding their conversational
capabilities through ongoing interactions [164]. In text classification, continual learning facilitates the incorporation of
new categories and adjustments to shifts in text distributions without the need for complete retraining [158]. Moreover,
in the realm of visual question answering, continual learning is essential for updating the models’ capabilities to process
and respond to new types of visual content and queries [148, 220]. The aforementioned works underscore the potential
of continual learning to significantly boost the performance of foundation LMs.
In the domain of continual learning, there has been a significant paradigm shift from traditional methodologies to
those that integrate foundation LMs (See Figure 1). First, foundation LMs demonstrate enhanced generalization and
transfer learning abilities across diverse tasks owing to their broad pre-training on large-scale datasets. The model
has specialized transfer capability to quickly adapt to downstream tasks with only a few samples. Consequently, it is
crucial to mitigate the degradation of both the zero-shot transfer and history task abilities in LMs while facilitating the
acquisition of new skills. Second, due to the substantial number of parameters in foundation LMs, it is crucial to employ
Manuscript submitted to ACM
4 Yutao Yang et al.

parameter-efficient techniques [59], such as prompt tuning [119] and adapters [140], to update parameters without
comprehensive retraining. Third, the foundation LMs possess the capability to follow instructions through instructional
learning [39, 144], enabling more dynamic and context-aware interactions.
This review systematically categorizes these strategies and technologies into two core areas: offline continual learning
and online continual learning (Figure 2). We first give detailed definitions and scenarios to format the setting of offline
and online CL, where offline CL includes domain-incremental, task-incremental and class-incremental CL, and online
CL includes hard task boundary and blurry task boundary. These learning strategies are further subdivided into methods
based on Pre-trained Language Models (PLMs), Large Language Models (LLMs), and Vision-Language Models (VLMs).
Then, we summarize the related papers about traditional methods, continual pre-training methods, parameter-efficient
tuning methods and instruction-based methods. Finally, we static the main datasets from various perspectives and
review the key metrics to evaluate the forgetting and transferring of the models.
The main contributions of this survey paper can be summarized as follows.

• We thoroughly review the existing literature on foundation LMs-based CL approaches, which integrate foundation
LMs with CL to learn new knowledge without retraining the models. It is quite different from traditional CL since
foundation LMs have great abilities of transfer learning, zero-shot and instruction following with huge parameters.
• We give the definitions of different settings and categorize these studies into various classes to better understand
the development of this domain. In addition to the traditional methods like replay, regularization and parameter-
isolation-based algorithms, we also summarize the works about continual pre-training methods, parameter-efficient
tuning methods and instruction tuning-based methods.
• We provide the characters of existing datasets for CL and present the main metrics to evaluate the performance of
preventing forgetting and knowledge transfer.
• We discuss the most challenging problems of foundation LMs-based CL and point out promising future research
directions in this field.

The paper is organized as follows. In Section 2, we review the mainly related surveys about continual learning. Then,
we introduce the base settings and learning modes of continual learning in Section 3, including the definitions and
scenarios of CL. Furthermore, we present the related studies about offline continual learning, which can be divided
into domain-incremental learning, task-incremental learning and class-incremental learning in Section 4. In Section 5,
we focus on online continual learning, including hard task boundary and blurry task boundary settings. The typical
datasets and metrics are provided in Section 6 and 7. Finally, we analyze the challenge and further work in Section 8
and give the conclusion in Section 9.

2 RELATED SURVEYS
2.1 Continual Learning
Early examinations of Continual Learning (CL) have provided broad coverage, as observed in surveys such as Parisi
et al. [145]. Recently, Wang et al. [186] conduct a comprehensive survey that categorizes five key strategies in CL:
regularization-based, replay-based, optimization-based, representation-based, and architecture-based approaches. This
survey reflects an effort to organize and understand the diverse methodologies employed in the field. Notably, there is a
growing focus on class-incremental setting [7, 131, 230] and replay-based approaches [60], reflecting the increasing
granularity of research interests within the CL domain.
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 5

2.2 Continual Learning for Computer Vision

In the realm of computer vision, De et al. [34] address the pressing challenge of continual learning, specifically in the task-
incremental setting, where tasks arrive sequentially with clear boundaries. They introduce a novel stability-plasticity
trade-off framework tailored for continual learners and undertake a comprehensive experimental analysis, comparing the
efficacy of 11 methodologies across three benchmarks. Qu et al. [153] present a comprehensive examination of continual
learning, highlighting its vital role in the accumulation of knowledge from sequential data streams. This research
investigates a range of approaches, encompassing regularization, knowledge distillation, memory-based methods, and
more. These approaches are systematically categorized by their characteristics and applications in computer vision.
Moreover, Mai et al. [130] focus on the realm of online continual learning within image classification, addressing
catastrophic forgetting. This study evaluates the efficacy of state-of-the-art methodologies across diverse memory and
data configurations. Masana et al. [131] present a comprehensive performance evaluation of class-incremental methods
applied to image classification. The experiments span a range of scenarios, incorporating large-scale datasets and varied
network architectures. Belouadah et al. [7] pay more attention to class-incremental learning algorithms for visual tasks.
This study defines the essential properties of incremental learning algorithms, offers a unified formalization of the
class-incremental learning problem, and provides an evaluation framework for detailed analysis.

2.3 Continual Learning for NLP

Biesialska et al. [8] address the challenge of continual learning within Natural Language Processing (NLP), wherein
conventional architectures struggle to accommodate new tasks without compromising previously acquired knowledge.
The research presents an extensive review of methods, including rehearsal, regularization, and architectural approaches,
all designed to mitigate the aforementioned challenge. In a similar vein, Ke et al. [82] offer a focused survey on continual
learning within the NLP domain, providing a comprehensive examination of various continual learning settings,
methodologies, and challenges. This work presents an in-depth analysis of state-of-the-art approaches and extends
original CL settings to be more general and up-to-date. Additionally, it emphasizes the significance of knowledge
transfer within NLP and the challenges posed by inter-task class separation.

2.4 Continual Learning for Other Domains

Recent surveys, surveys like [101, 167, 219] investigate the advancements in incremental learning for neural recommen-
dation systems and continual learning (CL) in robotics, respectively. Zhang et al. [219] make a notable contribution to
narrow the gap between academic research and industrial applications through the introduced Incremental Update
Recommendation Systems (IURS). They highlight the imperative for real-time updates with streaming data and focus
on the distinctive challenges posed by IURS in contrast to traditional Batch Update Recommendation Systems (BURS)
and offer a thorough examination of existing literature and evaluation methodologies in this domain. Shaheen et al.
[167] provide a comprehensive overview of contemporary approaches for CL within real-world contexts. Their analysis
focuses on learning algorithms that efficiently handle large sequential datasets within computational and memory
constraints. The survey also explores challenges in applying CL to autonomous systems, comparing methods across
metrics such as computational efficiency, memory utilization, and network complexity. In the field of robotics, it is
essential for agents to adapt and interact with their environment using a continuous stream of observations. Thus,
Lesort et al. [101] explore CL within this domain, defining CL as a paradigm where both data distribution and learning
Manuscript submitted to ACM
6 Yutao Yang et al.

Task-Incremental Learning
Task 1 Task i Task N Task - ID

... ... ... ...

Task 1 Task i Task N

Class-Incremental Learning

Task - ID
Continual Learner ... ...
Task 1 Task i Task N

Domain 1 Domain i Domain N

Domain-Incremental Learning

... ... Task - ID

... ...
Domain 1 Domain i Domain N

Training Testing
Fig. 3. The setting of different offline continual learning tasks, including task-incremental learning, class-incremental learning and
domain-incremental learning. The samples with different classes (domains) are marked with various shapes (colors).

Hard Task Boundary

Task 1 Task 2 Task 3

Blurry Task Boundary

Fig. 4. The setting of different online continual learning tasks, including hard task boundary arriving and blurry task boundary
arriving. The samples with different classes (domains) are marked with various shapes (colors).

objectives evolve dynamically. They emphasize the challenges in evaluating CL algorithms in robotic applications and
introduce a novel framework alongside metrics tailored to effectively present and assess CL methodologies.
This paper centers on the crucial advancements in CL as applied to foundational language models, which have
obtained significant success in the fields of NLP and multimodal. We categorize existing works into offline and online
CL based on PLMs, LLMs, and VLMs.

3 SETTINGS AND LEARNING MODES OF CL

3.1 Basic Formulation
Continual learning is an advanced method in machine learning. Within this framework, the model is sequentially
trained across a diverse array of tasks denoted as 𝑡 within the set 𝑇 = {1, 2, ..., 𝑁 }, where each task 𝑡 is associated with
(𝑡 ) (𝑡 ) |𝑋 | (𝑡 ) (𝑡 )
its individual dataset 𝑋𝑡 = {(𝑥𝑖 , 𝑦𝑖 )}𝑖=1𝑡 . Here, 𝑥𝑖 represents an individual training example, and 𝑦𝑖 denotes
the corresponding class label for task 𝑡, while |𝑋𝑡 | indicates the total number of samples in task 𝑡. However, the data
distributions between any two tasks 𝑡 and 𝑡 ′ are distinct (𝑝 (𝑋𝑡 ) ≠ 𝑝 (𝑋𝑡 ′ ) for all 𝑡 ≠ 𝑡 ′ ). This distinction presents a
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 7

fundamental challenge in managing the diversity of data distributions across multiple tasks. This setup necessitates the
model to learn new knowledge while retaining past information.
Continual learning encompasses two principal paradigms: offline and online continual learning. These paradigms
define how data arrives and how the model updates its knowledge over time.
• Offline Continual Learning: This setting involves learning across a series of tasks, with each task fully presented
before handling the next task. For each task 𝑡, the model trains on the entire dataset 𝐷𝑡 through multiple epochs. The
model progresses to task 𝑡 + 1 only upon achieving the desired proficiency on task 𝑡.
• Online Continual Learning: This setting operates within a dynamic framework wherein the model learns knowledge
from a stream of data points or mini-batches presented sequentially. Additionally, the model lacks access to the entire
dataset for a given task. This setting closely mirrors real-world scenarios characterized by continuous data flow,
compelling the model to adapt in real time.

3.2 Typical Scenarios

3.2.1 Offline Continual Learning. Offline CL (Figure 3) comprises three principal scenarios, each distinguished by
distinct characteristics: Domain-Incremental Learning, Task-Incremental Learning, and Class-Incremental Learning.
• Domain-Incremental Learning (DIL): The model aims to process diverse data distributions. Specifically, in DIL, while
the data distributions 𝑝 (𝑋𝑡 ) in task 𝑡 and 𝑝 (𝑋𝑡 ′ ) in task 𝑡 ′ are different, their task types and class labels remain
consistent. The task identities (task IDs) are not required.
• Task-Incremental Learning (TIL): The model is designed to handle a series of tasks, each with unique objectives.
The classes within these tasks may or may not be disjoint. The boundaries of each task are clear, and task IDs are
provided during both the training and testing phases.
• Class-Incremental Learning (CIL): The model is designed to continually learn new class information while retaining
knowledge of previously learned classes. For tasks 𝑡 and 𝑡 ′ , while they might share the same task type (such as
classification), their class sets 𝐶𝑡 and 𝐶𝑡 ′ are distinct. Moreover, the task IDs are only available during training.
In summary, Domain-Incremental Learning concentrates on adapting the model to the shifts in input data distributions
while maintaining consistency in tasks and classes. Task-Incremental Learning necessitates the model’s ability to learn
and retain task-specific knowledge over successive tasks. On the other hand, Class-Incremental Learning highlights
the gradual integration of new classes into the model’s recognition capabilities without compromising knowledge of
previously learned classes.

3.2.2 Online Continual Learning. In online continual learning (See Figure 4), existing research is categorized into two
configurations based on the arrival pattern of tasks: "Hard Task Boundary" and "Blurry Task Boundary":
• Hard Task Boundary: The arrival of tasks follows a strictly structured and sequential process. Data from the preceding
task is completely processed before transitioning to the next task, ensuring no overlap of data between tasks.
• Blurry Task Boundary: The distinction between tasks is less clear, similar to real-world scenarios. Data from different
tasks are intermixed, making it difficult to pinpoint when one task ends and another begins.
In both setups, the main challenge lies in achieving the balance of learning new data while preserving previously
gained knowledge, often termed as catastrophic forgetting. Numerous approaches, such as experience replay [152, 171],
elastic weight consolidation (EWC) [92], and progressive neural networks [83, 86], have emerged to address this issue.
Each method comes with its unique strengths and weaknesses upon the task arrival configuration.
Manuscript submitted to ACM
8 Yutao Yang et al.
Lifelong Language Pretraining with Distribution-Specialized Experts
states with features for task t to build a classifier.
MoE Outputs Lifelong Pretraining
CLASSIC uses three sub-systems to achieve
("%J) ("%&) (") ("@&)
𝒑 𝒑 𝒑 𝒑
its objectives (see Sec. 1):Regularization
Output (1) contrastive ensem- Non-Prompting Methods

s-3 s-2 s-1 s

L2P

s-3 s-2 s-1 s

DyTox

s-3 s-2 s-1 s

Proposed S-Prompts

s-3 s-2 s-1 s Classifiers

ble distillation (CED) for mitigating CF by dis- 𝓜 Methods Extractor Extractor Extractor Extractor
Pretrained & Fixed
Prompts
Expert ⋯⋯ Expert Expert ⋯ Expert Image Tokens
tilling the knowledge of previous tasks to the cur-
& 9 (;<=) 9 (;<=) @& 9 (;)
Previous trained & Now Fixed

s-2
s-3

s-1

s-2
s-3

s-1
s

s
Incremental Tuned / Learnable
𝜽N:"%& 𝜽"
rent task model; (2) contrastive knowledge sharing
Instance Selection

Expert/Gating Freeze Expert/Gating Expansion

(CKS) to encourage Top2−Gating knowledge transfer; and (3)
Top2−Gating
9 (;<=) 9 (;)
Data Stream
Session s-3 Session s-2 Session s-1 Current Session s Session s+1

contrastive supervised learning on the current task Previous Methods Proposed S-Prompts Instance features of
class +/- in Session s-1

model 𝒙 (CSC) to improve 𝒙

("%J) the current
("%&)
𝒙 task model 𝒙
(") ("@&) Feature Space Instance features of
class +/- in Session s
Domain centers of s-1

accuracy.Inputs We call this framework Streamcontrastive con-

of Corpus Distribution
Figure 1: Comparison
Incremental Update (Dependent) Incremental Insert (Independent) Domain centers of s

(t) of the proposed S-Prompts paradigm against non-prompting methods, and prompting
Figure 2: Overview oftinual
our lifelong learning, inspired
pretraining method for thebyMoE contrastive
model (M): 1)learning.
When pretraining on each data methods distribution
(L2P [66], (xDyTox ), [14]). Exiting methods generally learn sequential tasks/sessions/domains dependently,
Figure 1: CLASSIC adopts Adapter-BERT (Houlsby
(a) CLASSIC we expand
et al., 2019) and its adapters (yellow boxes) in a trans-
and gatings;
the number of experts and gatings (from(b)
3) We further Contrastive
regularize the MoE learning
on the
Lifelong-MoE
E (t−1)
output
to E (t) ) for larger model capacity; 2) We freeze the pretrained
uses
level multiple
to avoid the views
catastrophic of the
forgetting. Embedding, dense,
one and
subspace attention
for every
(c) S-Prompts
producing a single feature space where the classes from different domains are less separable (more forgetting or
old experts
worse transferring). In contrast, the proposed S-Prompts learn the tasks independently. This paradigm leads to
domain, making the classes more separable (less/no forgetting and better transferring).
layers (omitted in this figure) are shared across all data distributions. See details of our method in Section 3 and pretraining settings in
former (Vaswani
Fig. 5. etFrameworks
al., 2017) layer (above
in DIL:(CSC)).
Section 5.1. An
CLASSIC existing
We omit the (PLM-based)
interleaving data
densefor
layersrepresentation
[84],
to make this Lifelong-MoE learning
figure simple and clear.to (LLM-based)
group [25],
To alleviateS-Prompts (VLM-based)
catastrophic forgetting, numerous methods have been proposed [191]. for continual learning.
adapter is a 2-layer fully connected network with a skip- similar data together and push dissimilar data far One of the most effective methods is to leverage a buffer of exemplars from old tasks to perform either
a rehearsal or a distillation when tuning the whole network to learn new tasks (see Non-Prompting
We need to decide how to expand and initialize new ex- Implicit Regularization via DistillationMethods frominOld Fig.1).Ex-
connection. It is added twice to each Transformer layer. away, which makes it easier to learn a more accu- Both the rehearsal and the distillation typically decrease the forgetting degree.
perts and gatings. We empirically observed that randomly perts/Gatings We try to find possible ways to implicitly However, it is often more desired that the exemplars of old tasks are not stored for better data security
Only the adapters and layer norm (green boxes) layers
initializing expandedrate classifier.
experts and gatingsItleads
usestovarious
poor per- transformations
regularize parameters, of the including the newly
and privacy. Moreover, storing a large amount of exemplars might run out available memories.
expanded
Therefore, our focus ex-is shifted to the challenging exemplar-free DIL in this paper.
are trainable. The other modules (grey boxes) of BERT
formance, potentially due to mismatched gradient direc- perts, gating dimensions, embeddings, andOne dense/attention
4are frozen.
OFFLINE CONTINUAL tionsLEARNING
existing data to create useful views. Given a mini- of the most promising solutions to the exemplar-free DIL problem is to learn a set of prompts 4

(CSC): CSC loss is computed based onmagnitudes

and the from new experts/gatings and pre- layers. Inspired by (Li & Hoiem, 2017), weover choose to distill
transformers that achieve the state-of-the-arts in a wide range of areas. In this solution, the
trained dense/attention
current task model (details in Sec. 3.4). (CED): CED batchlayers. of Therefore,
N training examples,
inspired by the ifthewe create from
knowledge another old experts and gatings.
domain-specific knowledge is stored by a prompt pool, and hence a rehearsal buffer is no longer
Specifically,
mandatory to mitigate forgetting. For instance, two recent prompting methods [14, 66] aim at 5

“Net2WiderNet” approach view (Chen

for et al., 2015),
each example, a betterthe
waybatch denoting
will the
havemodel2N as ex-
M, we minimize the combination
learning of
task/domain-specific prompts dependently across domains in an incremental update fashion.
loss is computed
4.1 based on all previous tasks
Domain-Incremental is tofrom
Learning1 toeach new expert and gating dimension from perplexity loss LPerp (for the next-token prediction)
initialize
In [14], the prompts
attention learningandwith
are added task by task, and the newly added prompts share the same task-
the the old prompts. This requests a joint tuning on all the prompts and the entire
t − 1 (details in Sec. 3.2). (CKS): CKS pretrained
loss is com- amples.
ones, helping both theWe assume
preservation that
of old i and KL
knowl- j are two views
divergence LKL of of outputs from two models: transformer using a distillation loss to keep a balance between the new prompting and the old ones.
In [66], a pool of prompts is initialized at the right beginning, and a set of the prompts from the
puted based on previous and current tasksedge andand the training
the warming-up
a task- example.
for the subsequent If we use i as the anchor,
pretraining. pool are selected for each instance. The instance-level prompts generally share task-level knowledge
4.1.1 PLMs-based
based self-attention. Details areDIL.
given in Sec. 3.3. expansion (i,
A vanilla j) is
strategy called
would be to aduplicate
positive pair. All other
the num- pairs
among any similar instances, by jointly tuning the selected prompts and the whole classifier, with
L = LPerp + λL(i,KL
k) the pre-trained transformer being fixed. While using different prompting approaches, [14, 66] both
(1)
follow a commonly-respected principle that the prompting should keep sharing across tasks/domains.
ber of experts in order to fully leverage and inherit all the X
for k 6
= i are negative
pretrained knowledge. However, this will lead to an expo-
pairs. TheL contrastive
Perp = − log Ploss
(x i+1 |M (x 0:i , θ 0:t−1
However, the sharing-driven dependent prompt tuning paradigm as well as many of the non-prompting
, θt , θd ))generally(2)
paradigms result in a tug-of-war (or a zero-sum game), in which one side’s gain is equivalent

and review sentence (e.g., "The soundnentially quality for this

increasing model
is size, positive
which is notpair is (Chen
scalable. In our et al., 2020), x ∈X
X
i to the other’s loss, as studied in [56, 33]. In other words, these paradigms generally accumulate
new knowledge in the same feature space, which is very likely to mix up the subspaces of old/new
Traditional Methods. Continual
work, we chooseLearning
to partially expandmethodologies
the number of experts L are
= − frequently
M (x , θ employed
, θ ) log (M (x , θ , θ , θin
)) . the context of Pre-trained
KL i 0:t−1 d knowledge
i 0:t−1resulting
t d in less separable classes (more forgetting or worse transferring) (Fig.1).
great") are concatenated via [SEP]. Theandsentiment
gating dimensions. We study differ expansion choices,
exp((h · hj )/τ )
x ∈X
In this paper, we explore a rule-breaking idea to instead play a win-win game, i.e., learning the
i

polarity is predicted on (PLMs),

top of the [CLS]and will show that by L
token. expanding
i,j = −a log limited
Pnumber of ex- i , (1)regularization-based,prompts independently (3) across domains so that the prompting can achieve the best for each domain.
Language Models encompassing approaches
perts for each data distribution we can achieve 2N such
1
competitive
as replay-based, The excellent generalization capabilityand of theparameter-isolation-
transformers and the strong transfer learning ability
exp((h i · hk )/τ
indicates )
parameters for dense layersofthat
the recently appeared prompting mechanisms allow for the realization of this idea. In particular,
are shared
As indicated earlier, although BERT can achievewithout further introducing extra model size.
performance
k=1 k6 = j θ d we introduce a new learning paradigm that learns the prompts independently domain by domain,
across distributions, θ0:t−1 indicates parameters for old inserts
ex- the learned prompts into a pool. As shown in Fig.1, compared to the other
based algorithms. Li et al. [103]
state-of-the-art performance on a single task, its
That present
means, we a
selectively
where
regularization-centric
expand (and only
the dot product
expand) the
· hj isperts
hi data
framework
and gatings,
regarded as aandsim-
for Lifelong
and incrementally
prompting and
θt for parameters of newly expanded Learning
non-prompting methods, the new (LLL)
paradigm suggeststermed
merely tuning the RMR-
current domain-
experts when necessary to accommodate incoming new
architecture and fine-tuning are unsuitable for CL
distribution ilarity
that is not coveredfunction
by the olderin the
corpora. hidden
We do space
experts and gating dimensions. x is the embedding
and τ is tem-
ofand
For simplicity the 4
consistency, we use ‘prompt’ to represent ‘prompt token’ for context/domain knowledge
DSE, specifically tailored for
(see Sec. 1) and perform very poorly (Sec.
sequential
not 4.4).
increaseWe
operation across multiple
the number of dense layers.
current token and domains.
X represents theUnlike
whole corpusconventional
encoding, of current
and we utilize ‘token’ to denote those
5 strategies requiring
normal tokens on images and classes, throughout the paper.
A concurrent work [12] learns class-level prompts dependently on transformers for the CIL problem.
perature. The final loss for the data batch is calculated
distribution. This auxiliary loss LKL will implicitly
found that the BERT adapter
incremental memory allocation, idea in (Houlsby et al.,
RMR-DSE across all employs
positive pairs.aEq. recallavoid the model parameters from being updated too far from
1 is foroptimization
unsupervised mechanism, driven by 2 regularization, to
3.3. Expert/Gating Regularization pretrained ones. It is multiplied with a scaling factor λ to
2019) is a better fit for CL. contrastive learning. It can
The purpose of our expert/gating expansion is to enlarge the also beitsused
control impactfor su-
to the original pretraining loss value, and
selectively
BERT Adapter. retain important
The idea inparameters
was given modelAdapter- from
pervised
capacity for incoming preceding
new contrastive
data distributions.learning,
At thistasks.
we
where Furthermore,
will study different
any two in- λs. it incorporates a domain drift estimation
BERT (Houlsby et al., 2019), which inserts moment, two 2- experts
pretrained and gatings store the knowledge
stances/views from the same class form
Explicit a positivevia Partial Experts and Gatings
Regularization
algorithm to address
layer fully-connected networks embedding
(adapters) in space
about previous
each
erase these pretrained
shifts.
distributions.
pair, and any
knowledge
Castellucci
Continuous training will stillet al.
and instance
overfit on theofnew
[20]
a classFreezing
and any
propose
To instance
a Knowledge Distillation-based Continual
explicitly preserve pretrained knowledge, an
transformer layer of BERT (Figure 1(CSC)). Dur-
data, which is not desired.
from Inother
this section,
classeswe form
proposeatwo intuitive way is to completely freeze neurons specifically
negative pair.
Learning (CL-KD) approach with a Teacher-Student
approaches to effectively preserve old knowledge.
ing training for the end-task, only the adapters and
framework. When
responsible for the
previous data student
distributions, model
and only allowis trained on a new language,
parameters for the current distribution to be updated. In
normalization layers are updated. All the other 3.2
the teacher model also transfers its knowledge of supported languages Overcoming Forgetting via
our Contrastive
method, the dense/attention layers are
to the student model. always being op- Lee et al. [100] introduce a
timized, since they are trained to fit all data distributions.
BERT parameters are frozen. This is good for CL Ensemable Distillation (CED)
domain-agnostic
as fine-tuning the BERT framework that employs
causes serious forgetting. The CED a two-phase
objective is totraining 4 with CF.
deal process,
We firstinitially using synthetic data to learn general
Adapter-BERT achieves similar accuracy to the introduce task masks that CED relies on to preserve
conversational
fine-tuned BERT (Houlsbypatterns and subsequently using human-computer dialogs in customer support. Moreover, the Adaptive
et al., 2019). the previous task knowledge/models to be distilled
Elastic to the
Weight Consolidation algorithm is new task model
applied to avoid CF.
to adjust the loss function, ensuring the balance between acquiring
3.1 Overview of CLASSIC
new knowledge and retaining 3.2.1 learned
previously Task Masks (TMs)
information.
The architecture of CLASSIC is given in Figure 1,
which works in the DIL setting for ASC. It uses Given the input hidden states h(t) from the feed-
Gururangan et al. [57] develop a parameter-isolation-based method, named DEMIX layers, which consists of a
Adapter-BERT to avoid fine-tuning BERT. CLAS- forward layer of a transformer layer, the adapter
(t)
SIC takes two inputs
collection in training:
of experts. (1) hidden
The dynamic maps themorinto
states addition input kl of
removal viathese
a fully-connected
experts can enhance the model’s ability to adapt to
h(t) from the feed-forward layer of a transformer network, where l is the l-th layer of the adapter. A
(t)
new
layer ofdomains
BERT and (2)while
task id tmaintaining robust
(no task id is needed TM (aperformance
“soft” binary mask)inmpreviously
l is trained for established
each ones. PlugLM [27] is a pre-training
in testing, see Sec. 3.2.3). The outputs are hidden task t at each layer l in the adapter during training
model equipped with a differentiable 6873plug-in memory (DPM) designed for domain-adaptive continual training. The
core concept behind this approach is to separate the knowledge storage from model parameters using an adaptable
key-value memory structure. It enables the explicit retrieval and utilization of knowledge stored within the DPM.

Continual Pre-training Methods. Continual domain-adaptive pre-training (DAP-training) [85] is based on two main
ideas: (1) the general knowledge in the LM and the knowledge gained from prior domains are crucial to mitigating CF and
enhancing cross-task knowledge transfer. This is achieved through soft-masking units based on their importance, and (2)
the model is designed to develop complementary representations of both the current domain and prior domains, thereby
facilitating the integration of knowledge. The key novelty of Continual DAP-training is a soft-masking mechanism that
directly controls the update to the LM. Cossu et al. [30] formalize and explore the dynamics of continual pre-training
(Pretr) scenarios across language and vision domains. In this framework, models undergo continuous pre-training on a
sequential stream of data before subsequent fine-tuning for various downstream tasks.
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 9

Parameter-Efficient Tuning Methods. Due to the huge parameters of LMs, parameter-efficient tuning methods like
adaptors [64, 147] and p-tuning [119] are used for domain-incremental CL [84, 86, 128, 234].
The adapter architecture incorporates a skip-connection to minimize the number of parameters. A notable exemplar
of this approach is AdapterCL [128], which employs residual adapters tailored specifically for task-oriented dialogue
systems. This framework, comprising 37 domains, is structured to facilitate continual learning across four important
aspects: intent recognition, state tracking, natural language generation, and end-to-end processing. In a related vein,
Ke et al. [86] introduce B-CL model to tackle critical challenges in CL for Aspect-Based Sentiment Classification.
B-CL integrates continual learning adapters within capsule network architectures. Aiming at mitigating catastrophic
forgetting during fine-tuning, the CLASSIC model [84] presents an innovative solution by deploying adapters to tap
into BERT’s capabilities (Figure 5a). A novel contrastive continual learning strategy is used to facilitate the transfer of
knowledge across tasks and distill insights from previous tasks to subsequent ones. It also effectively eliminates the
necessity for task identifiers during testing. Furthermore, Continual PostTraining (CPT) [81] introduces two continual
learning plug-in modules, termed CL-plugins, embedded within each transformer layer of RoBERTa.
Prompt tuning [119], or P-tuning, introduces trainable continuous prompts into the sequence of input word embed-
dings, while the language model remains frozen. To tackle the challenge of CL under limited labeled data, Qin et al. [150]
propose a Lifelong Few-shot Language Learning framework (LFPT5). In this framework, prompt tuning, replay and
regularization strategies are leveraged. When presented with a new task, the model generates pseudo-labeled samples
representative of prior domains. The training process then incorporates these pseudo-labeled samples alongside new
task-specific data. Additionally, the KL divergence loss is employed to maintain label consistency between the previous
and the current model. Furthermore, Zhu et al. [234] introduced Continual Prompt Tuning (C-PT) as a methodology to
address the challenges of continual learning within dialogue systems. C-PT facilitates knowledge transfer between
tasks through continual prompt initialization, query fusion, memory replay, and a memory-guided technique.

Instruction Tuning-based Methods. Instruction tuning-based methods involve transforming a given task into natural
language instructions. Qin et al. [152] propose ELLE, a novel approach aimed at effectively incorporating continuously
expanding streaming data into pre-trained language models (PLMs). It consists of two fundamental components: (1)
function-preserved model expansion, which enhances knowledge acquisition efficiency by changing the width and depth
of an existing PLM, and (2) pre-trained domain prompts, which significantly enhance the adaptation for downstream
tasks by effectively segregating the diverse knowledge acquired during pre-training phases.

4.1.2 LLMs-based DIL.

Traditional Methods. In many practical scenarios, retraining Language Models (LMs) is challenging due to resource
constraints and data privacy concerns. Zhang et al. [218] introduce Continual Proximal Policy Optimization (CPPO) to
address this issue. CPPO integrates sample-wise weighting into the Proximal Policy Optimization (PPO) algorithm,
effectively balancing policy learning and knowledge retention. Zhang et al. [217] propose Continual Optimal Policy
Regularization (COPR), which calculates the optimal policy distribution without the partition function and uses the
previous optimal policy to regularize the current policy. Sun et al. [171] introduce LAMOL, which generates pseudo-
samples from previous tasks while training on a new task. It effectively mitigates knowledge loss without requiring
additional memory or computational resources. Building on this framework, Wang et al. [183] developed RVAE_LAMOL,
which integrates a residual variational autoencoder (RVAE) to encode input data into a unified semantic space, thereby
enhancing task representation. This model also incorporates an identity task to enhance the model’s discriminative
Manuscript submitted to ACM
10 Yutao Yang et al.

ability for task identification. To enhance training efficacy, the Alternate Lag Training (ALT) is devised to segment the
training process into multiple phases.
To reduce hallucinations in specialized domains such as the Chinese legal domain, Zhang et al. [224] propose a novel
domain adaptation framework, named Adapt-Retrieve-Revise (ARR). It consists of three steps: adapting a 7-billion-
parameter language model for initial responses, retrieving corroborative evidence from an external knowledge base,
and integrating these to refine the final response with GPT-4. Gogoulou et al. [48] study the pros and cons of updating a
language model when new data comes from new languages – the case of continual learning under language shift. They
feed various languages into the model to examine the impact of pre-training sequence and linguistic characteristics on
both forward and backward transfer effects across three distinct model sizes. A new continual learning (CL) problem,
named Continual Knowledge Learning (CKL), is introduced by Jang et al. [73]. To assess CKL approaches, the authors
establish a benchmark and metric measuring knowledge retention, updating, and acquisition.

Continual Pre-training Methods. LLMs have demonstrated remarkable proficiency in tackling open-domain tasks.
However, their application in specific domains faces notable challenges, encompassing the lack of domain-specific
knowledge, limited capacity to utilize such knowledge, and inadequate adaptation to domain-specific data formats. To
address these issues, researchers have explored a novel approach known as continual pre-training, aiming to adapt
LLMs to specific domains [26, 126, 201]. Among these studies, Cheng et al. [26] draw inspiration from human learning
patterns to develop a novel method that transforms raw corpora into reading comprehension texts. Furthermore, they
discover that while training directly on raw data enhances the model’s domain knowledge, it significantly hurts the
question-answering capability of the model.
Domain-adaptive Continual Pre-training (DACP) uses a large domain corpus, leading to high costs. To reduce these,
Xie et al. [201] propose two strategies: Efficient Task-Similar Domain-Adaptive Continual Pre-training (ETS-DACP)
and Efficient Task-Agnostic Domain-Adaptive Continual Pre-training (ETA-DACP). ETS-DACP is tailored to improve
performance on specific tasks, building task-specific foundational LLMs. Conversely, ETA-DACP selects the most
informative samples across the domain. Given the high cost of training LLMs from scratch and limited annotated data
in certain domains, Ma et al. [126] propose a novel model called EcomGPT-CT. It employs a fusion strategy to exploit
semi-structured E-commerce data. Moreover, multiple tasks are designed to assess LLMs’ few-shot in-context learning
ability and zero-shot performance after fine-tuning.

Parameter-Efficient Tuning Methods. Chen et al. [25] present an innovative Lifelong Learning framework, termed
Lifelong-MoE, which leverages a Mixture-of-Experts (MoE) architecture (Figure 5b). This architecture enhances the
model’s capacity by incorporating new experts, where previously trained experts and gating mechanisms are frozen.

4.1.3 VLMs-based DIL. Vision-language models (VLMs) have demonstrated superiority in domain-incremental learning
contexts. Yi et al. [206] integrate VLMS with continual learning methodologies to develop a general-purpose medical
AI. Moreover, their study highlights the significance of data-efficient adaptation algorithms in minimizing the necessity
for extensive labeling when transitioning to new domains or tasks. Furthermore, the prompt text is utilized to master
the pre-trained knowledge embedded within VLMs. Aiming to independently learn prompts across disparate domains
by using pre-trained VLMs, S-Prompt [191] is devised (Figure 5c). This method encompasses techniques for acquiring
image prompts and introduces an innovative methodology for language-image prompt acquisition. Prompt learning
is conducted separately, utilizing a unified cross-entropy loss function during training. During inference, a K-NN
(k-nearest neighbors) technique is employed to discern the domain.
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 11

Training Inference
Classiﬁer Classifier Classiﬁer

Fusion Encoder Con$nual Fusion Encoder Fusion Encoder

(a) LAMOL (b) HMI-LAMOL Visual Encoder Textual Encoder Learning Visual Encoder Textual Encoder Visual Encoder Textual Encoder
Time Persist
QA Hippocampus Module QA How many Is the TV Is this a
Answer [EOS] Answer [EOS] Knowledge Representation
🤔: The 1995 Tooheys 1000 driver who has second-to-last skiers? on? laptop ?
Long-term Embedding
in the Tooheys Top 10 was born where?
・・・ Memory Knowledge Input features ① ② ··· ··· ···
GPT-2 GPT-2
Knowledge

Results
Operators

Query
M1 M2 Mi ③
Retrieval key-prompt pair
Score
Update Inference Engine
[GEN] Context Question [ANS] Answer store retrieve [GEN] Context Question [ANS] Answer
Context <Operator, Args>
Coordinator Searcher
Task ! Task ! + # ··· ··· ···
Short-term
LM c decode (PQ)
h' LM
Feedback

Memory ① Modality-Interaction
encode (PQ) Knowledge Results Browser Responder
with Matrix !! : = × + × + ×( ⨀ ) query-key match
Context Question [ANS] Answer [EOS] h Context Question [ANS] Answer [EOS] Knowledge Metabolism
Time Decay LLMs Discriminator
More
Query
③ Query and Match: cos ( )
Operators
Qu
er
y
s
② Task-Interaction: − , Is this a
GPT-2 BERT GPT-2 External su
lt
!! !!"# F Input laptop?
Re
Resource 🗣: The driver who finished second-to-last in the 1995
Memory Manager Tooheys Top 10 is Tony Longhurst. He was born in Sydney, / / :Decoupled Visual / Textual / Fusion Prompts / :Decoupled Expert / General Prompts / / :Visual / Textual / Fusion Keys
[GEN] Context Question [ANS] Answer [CLS] Context Question [GEN] Context Question [ANS] Answer
Australia.

Figure 1: (a) is the overview of the LAMOL framework. The top is the learning QA to solve tasks and the bottom is Figure 1: The system overview of DynaMind. Figure 3: The TRIPLET framework. Left: during training, the pre-trained encoders are frozen, and parameters in classifier,

(a) HMI (b) DynaMind (c) TRIPLET

the learning LM to generate pseudo-samples. (b) is the proposed HMI implemented on LAMOL. We introduce a decoupled prompts, task-specific keys and interaction matrix are learnable. At task t + 1, we train the decoupled prompts
hippocampus module illustrated on the left.
The 1995 Tooheys 1000 driver who (Vaswani et al., 2017). To uphold the close align- (including three aspects, i.e., modality-wise, layer-wise and complementary). We further apply three interaction strategies
Coordinator has second-to-last in the
Tooheys Top 10 was born where? ment with the LLMs’ semantic space, we utilized (within the light blue colored rectangle) to model modality-wise prompt interaction, task-wise prompt interaction, and in-
the embeddings generated by LLMs as the knowl- teraction between input features and prompt keys. Right: during inference, we first calculate multi-modal representations
pal memory indexing (HMI) for improving gener- sentence representations (Xu et al., 2018; Liu et al., with the query function, which are used to match the most similar multi-modal keys. Then decoupled E-Prompts paired with

Fig. 6. Frameworks in TIL: HMI (PLM-based) [129], DynaMind (LLM-based) [41], TRIPLET (VLM-based) [148]. Searcher
List of drivers who participated edge representation. For OpenAI series models, we
in the 1995 Tooheys 1000 race. matched keys, together with decoupled G-Prompts, are appended to the inputs (or features) for answer generation.
ative reply. To remember training samples with a 2019), sentiment analysis (Chen et al., 2015; Xia can directly retrieve the embedding through their
small data usage, we introduce a hippocampus mod- et al., 2017), composition language learning (Li Driver who finished official APIs4 . For other open-source models, we Basically, Eq. (1) would be modified with: extract task-specific knowledge. For example, the visual
ule that encodes training samples into compressed et al., 2020b), relation learning (Han et al., 2020), Browser ⇣ ⌘ prompt P (v) = {G(v) ; {E (v) }} is composed of G-prompt
second-to-last in the Top 10. choose the last hidden state as the representation. ⇥ ⇤
ŷ(v, q) = F FT P (f ) ; VT([P (v) ; v]); TT([P (q) , q]) [0] , (v)
memory engrams using BERT (Devlin et al., 2019) dialogue systems (Lee, 2017; Madotto et al., 2021), Knowledge Retrieval utilizes vector search to (2)
G(v) shared for all tasks and E-prompt Et specialized for
and product quantization (PQ) (Jégou et al., 2011), text classification, and question-answering (QA) Birthplace of the driver who the t-th task . When the t-th task comes, we train the prompt
Coordinator finished second-to-last in the construct an index on DynaMind’s memory, en- where P (v) ,P (q) , P (f ) are the vision, question, and fusion (m) (m)
Pt = {G(m) ; Et } where m = v, q, f .
and stores them to generate conditioned samples (de Masson d'Autume et al., 2019; Wang et al., Tooheys Top 10.
abling the rapid identification and retrieval of prompt, respectively.
In our implementation, we combine all the three afore-
In the domain of visual question answering, Zhang et al. [220] introduce VQACL, a novel framework designed to
during the replay step (Fig. 1b). This method
makes it possible to generate specific training sam-
2020).
Regularization-based methods aim to constrain Figure 2: The pipeline of example "The 1995 Tooheys
1000 driver who has second-to-last in the Tooheys Top
the most relevant knowledge. By leveraging the
Langchain library (Chase, 2022), DynaMind can
Selective Deep Decoupling We then disentangle prompts
in a layer-wise format, and attaching it to selective lay-
ers. Rather than keeping attaching prompts to all the se-
mentioned decoupling designs. That is, we have three sets
of prompts for three modalities, where each set of prompts
ples from previously learned tasks. changes in model parameters important for previ- effortlessly access and shift between distinct vec- contains layer-wise deep-prompts and each layer-wise deep
10 was born where?". lected multi-head attention (MHA) layers [37], in this pa- prompt contains a G-prompt and a set of E-prompts. In
We evaluated HMI on two different CL scenarios ous tasks. Various methods have been proposed to tor indexing schemes. This module evaluates the
effectively integrate data from both visual and linguistic modalities. This integration is achieved through a dual-level task
using the original LAMOL as a baseline. The first
scenario is a sequence of different types of tasks,
estimate the importance of each parameter. For ex-
ample, elastic weight consolidation (EWC) (Kirk-
ries models3 . relevance of knowledge within the given context,
empowering DynaMind to access pertinent knowl-
per, we add prompts to some MHA layers in a replacing
schema, which is more memory-efficient. Given a trans-
former T containing K layers, T([P ; x]) = (LK LK 1 · · ·
summary, all the learnable prompts include:
n
(m)
P (m) = Gk 2 RLG ⇥D
o[n
(m)
o
Et,k 2 RLE ⇥D ,
3.2 Memory Manager L0 )([P ; x]) could be decomposed layer-by-layer: (4)
for which we used five natural language understand- patrick et al., 2017) uses the Fisher information edge throughout the reasoning process efficiently. m = v, q, f,
Memory Manager plays a pivotal role in storing
sequence that enhances the model’s performance on complex multimodal tasks. Central to VQACL is a compositionality
ing (NLU) tasks from DecaNLP (McCann et al.,
P
matrix. Synaptic intelligence (SI) (Zenke et al., h̄k = ↵k · hP
k + (1 ↵ k ) · Pk ,
In DynaMind’s memory, every piece of knowl- (3)
2018). The other scenario is a sequence of different 2017) estimates importance from the contribution and organizing the memories of DynaMind. It [hCLS P x CLS x
k+1 ; hk+1 ; hk+1 ] = Lk ([hk ; h̄k ; hk ]),
P with subscripts t for tasks, k for the k-th MHA layers, LG
edge is represented as a triple <Context, Key, /LE for G / E-Prompt’s length, D for embedding dimension.
domains in the same task, for which we used five to loss changes. Memory-aware synapses (MAS) consists of five interconnected modules: Knowl- P
Value>. The "Context" component houses the con- where [hCLS x
0 ; h̄0 ; h0 ]= [CLS, P0 , x] are the raw inputs, and
text classification datasets and single-pass setting, (Aljundi et al., 2018) computes the sensitivity of edge Representation, Knowledge Retrieval, Long- the output of LK is regarded as model output. Moreover,
textual data relating to the knowledge. The "Key"
test that evaluates the model’s ability to generalize new skill and concept combinations. The framework also incorporates
which is considered as an ideal scenario for CL.
The results indicate that HMI consistently outper-
parameters on the basis of the gradient of model
outputs.
term Memory, Short-term Memory, and Knowl-
edge Metabolism. Memory Manager works hand
signifies the vectorized representation of the "Con-
text", while the "Value" embodies the specific sub-
↵k 2 {0, 1} is a predefined switch that controls whether us-
ing the output prompt feature hPk or the k-th layer-specific
4.2.2 Prompt Interaction

With the proposed decoupled prompts, then we need inter-

in hand with Inference Engine, engaging in fre- prompt Pk as input.
forms LAMOL and improves robustness to training Architecture-based methods dynamically change action strategies to train them all together. We first have
stance of the knowledge. Complementary Decoupling Following the complemen- Query-and-Match Strategy to match between input features
task order and amount of replay samples. We also quent interactions to ensure a continuous acquisi-
a novel representation learning strategy that differentiates between sample-specific (SS) and sample-invariant (SI)
investigated the balance of previously learned tasks
in generated samples and found that HMI enables
the network structure to assign model parameters
for each task. Progressive neural networks (PNN) tion and updating of the knowledge necessary for
inferencing. This section provides an overview of
Long-term Memory acts as a permanent knowl-
edge repository within DynaMind, accumulating
tary design principle [37], each prompt is further split into
two parts: a General Prompt (G-Prompt) to extract task-
and related task-specific prompts. We further introduce
Modality-Interaction Strategy and Task-Interaction Strat-
(Rusu et al., 2016) freeze the current parameters and retaining large amounts of information over invariant knowledge, and an Expert Prompt (E-Prompt) to egy to promote interactions between prompts. The former
the generation of even old task samples, which indi- and add a new column of the network when train- the functionalities within Memory Manager and
time. The repository holds a collection of vari-
features. SS features capture distinctive attributes of individual inputs, enhancing output uniqueness, while SI features,
cates the controllability of sample generation with
HMI. Furthermore, we explored the potential of
ing a new task. Instead of extending the network,
PackNet (Mallya and Lazebnik, 2018) applies net-
elaborates on their interactions with Inference En-
gine. ous knowledge from a variety of sources, such as
previous interactions, external databases, and the 2956
further improvement of HMI with different sample work pruning using dynamic filters to separate the Knowledge Representation is responsible for
encoding knowledge in a format that can be Internet. DynaMind harnesses the power of long-
selection strategies for replay. neurons used for each task.
derived from category prototypes, ensure essential characteristics are retained.
2 Related Work Replay-based methods mitigate catastrophic for-
efficiently processed by DynaMind. Typically,
there are several common methods used for
term memory to enhance its reasoning ability by
integrating past experience and acquired knowl-
getting by retraining for previous tasks when train- edge. Furthermore, users can actively manipulate
knowledge representation, such as Bag of Words
CL, which involves learning from a stream of tasks ing for a new one. MbPA++ (de Masson d'Autume the knowledge engaged in the inference by explic-
(BOW) (Zhang et al., 2010),Word2Vec (Mikolov
without catastrophic forgetting, is a long-standing et al., 2019) introduces an episodic memory that itly updating the long-term memory, thereby grant-
et al., 2013), knowledge graph (Luu et al.,
issue in machine learning. In the NLP, CL has been stores real samples of previous tasks to use for ex- ing them heightened control over DynaMind’s cog-
2014, 2016), or fine-tuning a Transformer encoder
studied for diverse tasks, for example, word and perience replay and local adaptation. Meta-MbPA
4.2 Task-Incremental Learning 931
3
https://huggingface.co/tiiuae/falcon-40b 4
https://platform.openai.com/

4.2.1 PLMs-based TIL.

Traditional Methods. Drawing inspiration from neurobiological mechanisms, Maekawa et al. [129] present an
inventive approach known as Hippocampal Memory Indexing (HMI) to augment the generative replay technique.
HMI leverages hippocampal memory indexing to integrate compressed representations of prior training instances,
facilitating selective guidance for the generation of training samples. This methodological refinement contributes to
heightened specificity, balance, and overall quality of the replayed samples. In tackling the Continual Few-Shot Relation
Learning (CFRL) challenge, Qin et al. [149] propose ERDA as a solution, drawing upon replay- and regularization-based
techniques. The ERDA framework integrates embedding space regularization and data augmentation strategies to
effectively confront the task of acquiring new relational patterns from scarce labeled instances while mitigating the
risk of catastrophic forgetting. Wang et al. [184] introduced a memory-based approach to continual learning, termed
Episodic Memory Replay (EMR). This method leverages working memory by selectively replaying stored samples
during each iteration of learning new tasks, thereby facilitating the integration of new knowledge while preserving
previously acquired information.
Conure [214] is a framework that effectively manages multiple tasks by leveraging the redundancy of parameters in
deep user representation models. Initially, it prunes less critical parameters to make room for new, task-specific ones.
Subsequently, it incorporates these new parameters while retaining key parameters from prior tasks, facilitating positive
transfer learning. To prevent the loss of previously acquired knowledge, it maintains these essential parameters in a fixed
state. Notably, TERACON [91] utilizes task-specific soft masks to isolate parameters, which not only targets parameter
updates during training but also clarifies the relationships between tasks. This method includes a novel knowledge
retention module that utilizes pseudo-labeling to mitigate the well-known problem of catastrophic forgetting.
Ke et al. [83] introduced a model known as CTR, which employs innovative techniques such as CL-plugin and task
masking to tackle the issue of catastrophic forgetting and to facilitate knowledge transfer across tasks. These strategies
are particularly effective when utilized in conjunction with pre-trained language models, such as BERT, enhancing
their adaptability and efficacy. In the specific context of User Intent Classification (UIC) within large-scale industrial
Manuscript submitted to ACM
12 Yutao Yang et al.

applications, Wang et al. [182] introduce a novel methodology, MeLL, which utilizes a BERT-based text encoder to
generate robust, dynamically updated text representations for continual learning. MeLL combines global and local
memory networks to preserve prototype representations across tasks, acting as a meta-learner that rapidly adapts to
new challenges. It employs a Least Recently Used (LRU) policy for efficient global memory management and minimizes
parameter growth.
Recent advancements in conversational AI have concentrated on mitigating the limitations of traditional chatbots,
which are dependent on static knowledge bases and extensive manual data annotation. The Lifelong INteractive learning
in Conversation (LINC) methodology represents a significant stride in this direction, embodying a dynamic learning
framework that mimics human cognitive processes during interactions [110, 111, 132]. Structured around an Interaction
Module, a Task Learner, and a Knowledge Store, LINC enables chatbots to dynamically integrate and utilize knowledge
within conversational contexts (Figure ??). This approach not only facilitates real-time information extraction and
learning from user interactions but also addresses challenges such as managing erroneous inputs, refining conversational
skills, and maintaining user engagement, thereby enhancing the chatbots’ linguistic and interactive capabilities.

Continual Pre-training Methods. Continual pre-training represents a paradigm wherein pre-trained language models
(PLMs) are progressively enhanced through the assimilation of new knowledge from expanding datasets. Sun et al. [172]
introduced a framework called ERNIE 2.0, which incrementally constructs pre-training tasks, enabling the acquisition
of complex lexical, syntactic, and semantic nuances embedded within the training corpora. This model deviates from
traditional fixed-task training, instead employing continual multi-task learning to refine its capabilities continually.
Further advancing the field of continual pre-training, RecyclableTuning [151] introduces a novel concept of recyclable
tuning that features two distinct strategies: initialization-based and distillation-based methods. The former uses fine-
tuned weights of existing PLMs as the basis for further enhancements, capitalizing on the established parametric
relationships. Conversely, the distillation-based approach harnesses outdated weights to maintain knowledge continuity
and efficiency in successor models.

Parameter-Efficient Tuning Methods. Expounding upon the crucial need for more efficacious knowledge integration,
Zhang et al. [223] propose a pioneering strategy that entails the integration of Adaptive Compositional Modules
alongside a replay mechanism. These modules are designed to dynamically adjust to new tasks and are supplemented by
pseudo-experience replay, significantly enhancing knowledge transfer. This framework is distinguished by its adaptive
integration of modules within transformer architectures, skillfully orchestrating the interactions between existing
and new modules to address emerging tasks. Additionally, the implementation of pseudo-experience replay promotes
efficient knowledge transfer across these modules. Concurrently, Jin et al. [76] introduce the Continual Learning of
Few-Shot Learners (CLIF) challenge, wherein a model accumulates knowledge continuously across a series of NLP tasks.
Their investigation delves into the impact of continual learning algorithms on generalization capabilities and advances
a novel approach for generating regularized adapters.

Instruction Tuning-based Methods. Zhao et al. [227] propose the Prompt Conditioned VAE for Lifelong Learning (PCLL)
specifically designed for task-oriented dialogue (ToD) systems. PCLL employs a conditional variational autoencoder
influenced by natural language prompts to generate high-quality pseudo samples, effectively capturing task-specific
distributions. Additionally, a distillation process is integrated to refine past knowledge by reducing noise within pseudo
samples. Razdaibiedina et al. [158] introduce Progressive Prompts (PP), a novel approach to continual learning in
language models. PP addresses catastrophic forgetting without resorting to data replay or an excessive proliferation
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 13

of task-specific parameters. This method involves acquiring a fresh soft prompt for each task, gradually appending it
to previously learned prompts while keeping the base model unchanged. The ConTinTin paradigm [208] develops a
computational framework for sequentially mastering a series of new tasks guided by explicit textual instructions. It
synthesizes projected outcomes for new tasks based on instructions, while facilitating knowledge transfer from prior
tasks (forward-transfer) and maintaining proficiency in previous tasks (backward-transfer).
In the realm of lifelong in-context instruction learning aimed at enhancing the target PLM’s instance- and task-level
generalization performance as it observes more tasks, DYNAINST is devised by Mok et al. [137]. DYNAINST integrates
the principles of parameter regularization and experience replay. The regularization technique employed by DYNAINST
is tailored to foster broad local minima within the target PLM. In order to devise a memory- and computation-efficient
experience replay mechanism, they introduce Dynamic Instruction Replay, comprising Dynamic Instance Selection
(DIS) and Dynamic Task Selection (DTS). DIS and DTS dynamically determine the selection of instances and tasks to be
stored and replayed, respectively.

4.2.2 LLMs-based TIL. Recent attention has focused on the convergence of large language models (LLMs) with continual
learning methodologies, exemplified by significant contributions such as those by Wang et al. [190] and Peng et al.
[146]. Benefiting from vast corpora and advanced hardware infrastructure, LLMs showcase remarkable capabilities in
language comprehension and generation. However, challenges arise in scenarios involving sequential tasks, where
LLMs often exhibit a decline in performance known as catastrophic forgetting.

Traditional Methods. DynaMind, introduced by Du et al. [41], emerges as a pioneering framework that intricately
incorporates memory mechanisms and modular operators, enhancing the precision of LLM outputs. Comprising three
essential components, DynaMind includes a memory module dedicated to storing and updating acquired knowledge, a
modular operator for processing input data, and a continual learning module responsible for dynamically adjusting
LLM parameters in response to new knowledge. Furthermore, Luo et al. [125] conduct a thorough investigation into
catastrophic forgetting (CF) in Large Language Models (LLMs) during continual fine-tuning. Their experiments across
various domains, including domain knowledge, reasoning, and reading comprehension, reveal the prevalence of CF in
LLMs ranging from 1b to 7b scale, with severity increasing with model size. Comparative analysis between decoder-only
BLOOMZ and encoder-decoder mT0 indicates that BLOOMZ exhibits less forgetting. Additionally, LLMs demonstrate
the ability to mitigate language bias during continual fine-tuning. Contrasting ALPACA against LLAMA, the study
highlights ALPACA’s superiority in preserving knowledge and capacity, suggesting that general instruction tuning aids
in mitigating CF during subsequent fine-tuning phases. This research provides valuable insights into CF dynamics in
LLMs, offering strategies for knowledge retention and bias mitigation.
Wang et al. [190] present TRACE, an innovative benchmark meticulously designed to evaluate continual learning
capabilities in LLMs. Comprising eight distinct datasets spanning challenging tasks, including domain-specific challenges,
multilingual capabilities, code generation, and mathematical reasoning, TRACE serves as a comprehensive evaluation
platform. The authors rigorously examine the effectiveness of conventional Continual Learning (CL) methods when
applied to LLMs within the TRACE framework. Peng et al. [146] propose the Joint Adaptive ReParameterization (JARe)
framework, enhanced with Dynamic Task-related Knowledge Retrieval (DTKR), to facilitate adaptive adjustment of
language models tailored to specific downstream tasks. This innovative approach leverages task distribution within the
vector space, aiming to streamline and optimize the continual learning process seamlessly.
Manuscript submitted to ACM
14 Yutao Yang et al.

Parameter-Efficient Tuning Methods. Large language models (LLMs) encounter several substantial challenges that
limit their practical applications. These include high computational requirements, significant memory demands, and a
tendency toward catastrophic forgetting. Such limitations highlight the need for ongoing research into more efficient
and robust approaches to training and deploying these models. Continual Parameter-Efficient Tuning (ConPET) [170]
is designed for the continuous adaptation of LLMs across diverse tasks, leveraging parameter-efficient tuning (PET)
strategies to enhance both efficiency and performance. ConPET encompasses two primary modes: Static ConPET and
Dynamic ConPET. Static ConPET adapts techniques previously utilized in smaller models for LLMs, thus minimizing
tuning costs and reducing the risk of overfitting. Conversely, Dynamic ConPET enhances scalability by employing
distinct PET modules for varying tasks, supplemented by a sophisticated selector mechanism.
Moreover, the ELM strategy [71] involves initially training a compact expert adapter on the LLM for each specific task,
followed by deploying a retrieval method to select the most appropriate expert LLM for each new task. Furthermore,
Wang et al. [189] have proposed orthogonal low-rank adaptation (O-LoRA), a straightforward yet efficacious method
for facilitating continual learning in language models. O-LoRA mitigates catastrophic forgetting during task acquisition
by employing distinct low-rank vector subspaces maintained orthogonally to minimize task interference. This method
highlights the potential of orthogonal subspace techniques in improving the adaptability of language models to new
tasks without compromising previously acquired knowledge.

Instruction Tuning-based Methods. Scialom et al. [163] introduced Continual-T0, an innovative framework aimed
at exploring the capabilities of large language models (LLMs) through continual learning, incorporating rehearsal
techniques. A central aspect of this approach is the employment of instruction tuning, a key strategy designed to enhance
the adaptability and effectiveness of LLMs when encountering novel tasks. Leveraging self-supervised pre-training,
Continual-T0 demonstrates exceptional proficiency in mastering new language generation tasks while maintaining
high performance across a diverse range of 70 previously encountered datasets. Despite the demonstrated proficiency
of LLMs in adhering to instructions, their ability to generalize across underrepresented languages remains suboptimal.
In response, InstructAlign [15] is proposed to address this challenge by aligning newly introduced languages with those
previously learned, which possess abundant linguistic resources, thereby mitigating instances of catastrophic forgetting.
The core novelty of this approach lies in its advancement of language adaptation methodologies for instruction-tuned
LLMs, with particular emphasis on integrating underrepresented languages.

4.2.3 VLMs-based TIL. The long-term sustainability of pre-trained visual-language models (VLMs) is increasingly
under scrutiny due to their dependence on continually expanding datasets. Although these models demonstrate robust
performance across a diverse range of downstream tasks, the incessant growth of real-world data poses substantial
challenges to the sustainability of traditional offline training methodologies.

Traditional Methods. CTP [233] employs topology preservation and momentum contrast to maintain consistent
relationships within sample mini-batches across tasks, thereby preserving the distribution of prior embeddings. CTP also
introduces the P9D dataset, comprising over one million image-text pairs across nine domains, aimed at visual language
continuous pre-training (VLCP). Zheng et al. [228] address the issue of zero-shot transfer degradation in visual language
models by introducing the Zero-Shot Continual Learning (ZSCL) method. This novel approach utilizes a label-free
dataset to facilitate distillation in the feature space, coupled with the application of weight regularization within the
parameter space. Furthermore, they introduce a new benchmark, the Multi-Domain Task Incremental Learning (MTIL),
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 15

designed to evaluate incremental learning strategies across various domains, thereby enhancing the assessment of such
methods. Moreover, ZSCL has also been adapted for use in the CIL setting, further broadening its applicability

Instruction Tuning-based Methods. By decoupling prompts and prompt interaction strategies, TRIPLET [148] effectively
captures complex interactions between modalities. This includes specific designs for visual, textual, and fused prompts,
as well as how to interact between different tasks through these prompts and retain crucial information, thereby reducing
catastrophic forgetting. Decoupled prompts are designed to separate prompts in terms of multi-modality, layer-wise,
and complementary, with each type of prompt containing learnable parameters intended to capture modality-specific
knowledge from pre-trained visual-language models and training data. The prompt interaction strategies consist of
three main components: the Query-and-Match Strategy, the Modality-Interaction Strategy, and the Task-Interaction
Strategy. These components work together to enhance the model’s adaptability to different tasks and its memory for
old tasks.
Moreover, COIN [23] introduces a new continuous instruction tuning benchmark designed to evaluate the per-
formance of Multimodal Large Language Models (MLLMs) in the sequential instruction fine-tuning paradigm. CoIN
includes 10 commonly used data sets covering 8 task categories, ensuring the diversity of instructions and tasks.
In addition, the trained model is evaluated from two aspects: instruction following and general knowledge, which
respectively evaluate the consistency with human intention and the preservation of knowledge for reasoning. CoIN
converts commonly used visual-linguistic datasets into instruction fine-tuning data formats by using different templates
to explore the behavior of MLLMs in continuous instruction fine-tuning. This method takes into account the diversity
between different tasks and attempts to enhance the model’s adaptability to new and old tasks through diversified
instruction templates. To alleviate the catastrophic forgetting problem of MLLMs, CoIN introduces the MoELoRA
method. This method reduces forgetting by using different experts to learn knowledge on different tasks and using gate
functions to regulate the output of these experts.

Parameter-Efficient Methods. MoE-Adapters4CL [212] introduces a parameter-efficient continual learning method to

mitigate long-term forgetting in incremental learning of visual-language models. Their approach involves dynamically
extending the pre-trained CLIP model to accommodate new tasks by integrating a Mixture-of-Experts (MoE) adapter.
Specifically, MoE consists of several LoRA adapter experts and routers, where the router calculates gating weights and
uses the 𝑇𝑜𝑝𝐾 function to select the 𝑘 most relevant experts for learning the current task. To maintain the zero-shot
recognition capabilities of the visual-language model, a Distribution Discriminative Automatic Selector (DDAS) is
further introduced, which can automatically route in-distribution and out-of-distribution inputs to the MoE adapters
and the original CLIP, respectively. Furthermore, the MoE-Adapters4CL framework has also been adapted for use in the
CIL setting.

4.3 Class-Incremental Learning

4.3.1 PLMs-based CIL.

Traditional Methods. The study presented in [138] introduces ExtendNER, a novel framework for continual learning
in Named Entity Recognition (NER) that obviates the need for extensive re-annotation. This framework employs a
knowledge distillation (KD) technique, wherein a pre-existing named entity recognition (NER) model, termed the
"teacher", imparts knowledge to a newly developed model, termed the "student". The student model, designed to identify
new entity types, progressively learns by emulating the teacher model’s responses to a new dataset. This method
Manuscript submitted to ACM
16 Yutao Yang et al.

a) Teacher Knowledge Transfer (TKT)

Add & Norm %# : Teacher PLE
!!"#
Distill
+ Adapter
Dynamic Frozen
%#$% : PLE %#$& : PLE %# : PLE
+ Frozen
… …
Up-Projection Weighting
Image encoder Prediction Text encoder
… Dog
ReLU Replay
EL !$%$ Cat
Down-Projection
×L Pseudo Samples Adapter Classifier
Memory
FeedForward c) Dynamic Weighting Replay Update
b) Pseudo Samples Replay (PSR) (DWR)
Add & Norm
d) Inference Train Prune Update
Class Prototypes
Multi-Head Attention Training Latent Feature Adapted Feature
PLE
Set
$ "" # "! ! sim(⋅) * score
Prefix Prefix

Query PLE
Query Representation !!"# !! (∆!, %) !!’
Embedding Layer (EL)

Figure 2: Overview of the proposed framework. The left side shows the structure of our PLE, which consists of a Figure 2: Illustration of our proposed method. We add a Linear Adapter layer (in green) after the CLIP Image Encoder which
frozen PrLM and the Prefix (Pk , Pv ) and Continual Adapter inserted into each layer. The trainable parameters are generates projected features. The text feature is generated by encoding class labels using the CLIP Text Encoder. During training

(a) IDBR (b) PLE (c) Adaptation-CLIP

Figure 1: Our proposed model architecture. We disentangle the hidden representation into a task generic space and the Image and Text encoders are frozen and we only update the Linear Adapter. To obtain better continual learning performance,
in green. The top right shows the process of learning for a new task Ti with three strategies: Teacher Knowledge
a task specific space via different induction biases. When training on new tasks, different spaces are regularized we propose a parameter retention method to mitigate forgetting in the Adapter (see Parameter Retention Section).
Transfer (TKT), Pseudo Samples Replay (PSR), and Dynamic Weighting Replay (DWR). The lower right shows the
separately. Also, a small portion of previous data is stored and replayed. distance-based classification pipeline at inference.

Fig. 7. Frameworks
For new tasks, we learn the classifiers by utilizing ingeneric
predictor f on the CIL: featureIDBR (PLM-based) [68], PLE (PLM-based) [105], Adaptation-CLIP (VLM-based) [117].
extractor G(.):
nsp Summary. Existing works in few-shot intent de- advantages and defeats the goal of few-shot learn-
et al. 2022) classifies images by multiplying the encoded im-
age with text features extracted using the text encoder in a
frozen CLIP model using the labels of all categories seen
model has encountered. This set of prompts is given to the
text encoder to obtain the text features Bt . Finally, we per-
form a matrix multiplication between Ai and Bt to compute
information from both spaces, and we allow dif- tection mainly focus on the data scarcity and ignore ing (Gao et al., 2021). After learning Ti , the model
Lnsp = Ex2St [M (L(fnsp (G(B(x)), 0) up to the current task as prompts. The prompts are manually the output logits. We use the cross-entropy loss to update the
the ability to learn consecutive tasks, which is es- is evaluated separately on the test set of seen tasks.
ferent spaces to change to different extents to best sential for the online dialogue systems. The works The setup is aligned with the real scenario, where
determined and combined with the class name and the sim- parameters of the Linear Adapter layer:
+L(fnsp (G(B(x̃)), 1)) ilarity score between the input image and different prompts
retain knowledge from previous tasks. in continual intent detection do not consider the the data privacy of different users is protected while from all classes is used for classification. This method does 1X
n
LCE (x, y, t; W ) = yi · log Ai Bt . (3)
where L is the cross entropy loss, M is the memory data scarcity of emerging new intents. The newly the task information is available. not require storing any exemplars or updating any parame- n i=1
buffer and St is the t-th training set. proposed FSCIL-ID setting is also not aligned with ters. It outperforms most state-of-the-art methods with only
Task Generic Space Task generic space is the
allows the student to acquire the ability to recognize new entities while retaining knowledge of previously identified
hidden space containing information generic to dif- Task Specific Space Models also need task spe-
the online dialogue systems. In contrast to those
works, our work aims to recognize continually
3.2 Overall Framework
As shown in Figure 2, the framework of the pro-
zero-shot evaluation. Continual-CLIP performs no adapta-
tion ability on incoming task data and in the next section,
Adaptation via a Self-attention Adapter. We can use a
self-attention module in place of the single Linear Adapter
layer described above. Specifically, we apply three separate
ferent tasks in a task sequence. During switching we propose how CLIP pre-trained models can be updated
cific information to perform well over each task. emerging new intents from multiple domains with posed method consists of one main module and
and further improved for class incremental learning. linear transformations to the output Ii of the CLIP image
from one task to another, the generic information very few labeled examples. three strategies of continual learning: 1) The
ones. Additionally, Liu et al. [116] propose an innovative method for learning distributed representations of sentences.
should roughly remain the same, e.g., syntactic
For example, on sentiment classification words like
“good” or “bad” could be very informative, but they 3 Methodology
lightweight PLE is responsible for extracting se-
mantic features of the input utterances. 2) The
CLIP with Adaptation
encoder, resulting in the query (Q), key (K), and value (V )
matrices:
knowledge should not change too much across the might not generalize well for tasks like topic clas- Starting from the zero-shot evaluation protocol proposed in Q = Wq I i , K = Wk Ii , V = Wv I i , (4)
TKT aims to transfer task-specific knowledge to
learning process of a sequence of tasks. To extract sification. Thus we employ a simple task-identifier 3.1 Problem Formulation Continual-CLIP (Thengane et al. 2022), we aim to augment where Wq , Wk , and Wv are learned weight matrices. Next,
This method initiates with the configuration of sentence encoders using features independent of any specific corpus,
task generic information g from hidden represen-
tations r, we leverage the next sentence prediction
prediction task on the task specific representation s, In the CFID setting, given a sequence of n tasks
the current model. 3) The PSR first selects two key
samples per class and encodes them through the
the original architecture with new modules that enable better
adaptation to more downstream tasks. By learning on the
we compute the self-attention weights using the dot product
p
between the query and key matrices, scaled by a factor M :
which means for any given example we want to dis- {T1 , T2 , ..., Tn }, each task Ti contains its own train- frozen embedding layer (EL) to generate pseudo training set of each task and updating the parameters of these
task (Devlin et al., 2019) 1 to learn the generic in- tinguish which task this example belongs to. This
ing set Dtrain
i , development set Ddevi , and test set
samples and save them into the Memory. 4) The modules, we can achieve a better trade-off between stability QK T
followed by iterative updates through Boolean operations on conceptor matrices. This technique ensures that the
formation extractor G(.). More specifically, we
insert a [SEP] token into each training example
simple auxiliary setup will encourage s to embed
i . Each dataset D contains a series of sam-
Dtest
|D|
ples {(xi , yi )}|i=1 , where yi is the ground-truth
DWR is responsible for balancing the learning of
new tasks and replaying past tasks.
and plasticity.
Adaptation via a Linear Adapter. Assume we are at task t,
↵ = softmax( p ),
M
(5)

Finally, we compute the weighted sum of the value matrix

different information from different tasks. The loss intent class of the input utterance xi . In particular, (i)
the input image is xt , and denote the CLIP image encoder V, using the self-attention weights ↵:
during tokenization to form a sequence pair labeled for task-identifier predictor ftask is: we describe its few-shot nature that there are only 3.3 Prefix-guided Lightweight Encoder (PLE) as Fimage and the text encoder as Ftext . For the image modal-
encoders maintain their performance on existing datasets while adapting effectively to new data.
IsNext, and switch the first sequence and the second
sequence to form a sentence pair labeled NotNext. Ltask = E(x,z)2St [M L(ftask (S(B(x)), z)
K 2 {5, 10} samples for each class in the training
set. There are also a few (e.g., 10) samples for
PLE serves as the main module to alleviate catas-
trophic forgetting caused by over-parameterization.
ity, we input the images into the image encoder to extract the
latent embedding Ii = Fimage (xit ) 2 RM . Subsequently,
Ai = ↵V.
After obtaining the adapted image features Ai , we use
(6)

each class in the development set. This is because As a sub-module of PLE, the Continual Adapter we pass it through a Linear Adapter layer Ai = gW (Ii ) Eq. 3 to calculate the logits by multiplying the image fea-
In order to distinguish IsNext pairs and NotNext where z is the corresponding task id for x. parameterized by weight matrix W 2 RM ⇥M to compute tures with the text features and use the cross-entropy loss
Huang et al. [68] introduced an innovative methodology known as Information Disentanglement-based Regulariza-
pairs, extractor G(.) needs to learn the context de-
pendencies between two segments, which is bene- Text Classification To adapt to the t-th task,
using a larger development set brings significant
335
is a full continual learning lightweight module an adapted image feature with more capacity. For the text
modality, we amalgamate a manually specified, fixed set of
to update parameters. Unlike the previous approach, in this
Self-attention Adapter we need to update three linear projec-
we combine the task generic representation g = prompts (“a photo of [CLS]”) where [CLS] takes the value tion matrices instead of just one. This results in three times
ficial to understand every example and generic to of all class names Ct = {c1 , c2 , ..., ct } that the current the number of learnable parameters.
tion (IDBR) to address the enduring challenges associated with continual text classification. This method effectively
any individual task.
Denote x̃ as the NotNext example corresponding
G(B(x)) and task specific representation s =
S(B(x)) to perform text classification, where we
minimize the cross entropy loss:
to x (IsNext), and l 2 {0, 1} as the label for next
disentangles the hidden spaces of text into task-generic and task-specific representations, employing distinct regular-
sentence prediction. We build a sentence relation Lcls = E(x,y)2St [M L(fcls (g s), y))

1
Here y is the corresponding class label for x, fcls (.)
Note that the word "sentence" here refers to an arbitrary
ization strategies to enhance knowledge retention and facilitate generalization. Furthermore, the integration of two
span of continuous text (Devlin et al., 2019), it could be several
linguistic sentences or part of a linguistic sentence.
is the class predictor. denotes the concatenation
of the two representations.

auxiliary tasks, namely next sentence prediction and task-id prediction, serves to augment the learning process by
reinforcing the separation between generic and specific representational spaces.

Instruction Tuning-based Methods. Introduced by Varshney et al. [177], Prompt Augmented Generative Replay (PAGeR)
is a method that enables continual learning in intent detection without retaining past data. PAGeR leverages pre-trained
language models to generate intent-specific utterances specific to new intents while preserving accuracy on existing
ones. Unlike exemplar replay, which stores specific examples, PAGeR is structured to selectively maintain relevant
contexts for each specified intent, which are then employed as generation prompts. This approach combines utterance
generation and classification into one model, enhancing knowledge transfer and optimization.

Parameter-Efficient Tuning Methods. In the domain of task-oriented dialogue systems, Continual Few-shot Intent
Detection (CFID) focuses on recognizing new intents with few examples. Li et al. [105] propose a Prefix-guided
Lightweight Encoder (PLE) to address this by using a parameter-efficient tuning method that combines a Continual
Adapter module with a frozen Pre-trained Language Model (PLM) and a Prefix-guided Attention mechanism to reduce
forgetting. To further mitigate forgetting, the Pseudo Samples Replay (PSR) strategy reinforces prior knowledge by
replaying crucial samples from previous tasks. The Teacher Knowledge Transfer (TKT) strategy uses distillation to
transfer task-specific knowledge to maintain performance on new tasks. Additionally, the Dynamic Weighting Replay
(DWR) strategy dynamically adjusts weights of previous tasks to balance new knowledge acquisition with the revision
of old tasks, navigating the variability and potential negative impacts of prior tasks
The Efficient Parameter Isolation (EPI) method, introduced in Wang et al. [193], assigns unique subsets of private
parameters to each task alongside a shared pre-trained model. This approach ensures precise parameter retrieval and
has been shown to outperform non-caching methods in continual language learning benchmarks, while remaining
competitive with caching methods. Furthermore, EPI employs random static masking to reduce storage requirements,
increasing its viability in resource-constrained environments.
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 17

In complex system environments, efficient and adaptable machine learning architectures are crucial, especially
for classification tasks with sequentially presented data. Wójcik et al. [197] devise a novel architecture, Domain and
Expertise Ensemble (DE&E), comprising a feature extractor, classifier, and gating mechanism. The feature extractor
employs a multi-layer neural network to convert input data into embeddings, while the classifier, a mixture of binary
class-specific experts, leverages a gating mechanism to select the appropriate expert for the current input dynamically.
This Mixture of Experts-based method promotes incremental learning by training experts with class-specific samples
and combines their outputs during testing to derive the final classification.

4.3.2 VLMs-based CIL.

Traditional Methods. Kim et al. [90] introduce VLM-PL, a novel approach for class incremental object detection that
incorporates new object classes into a detection model without forgetting old ones. Utilizing a visual-language model
(VLM), this method enhances the pseudo-labeling process to improve the accuracy and performance of object detection
in continual learning settings. VLM-PL starts with a pre-trained detector to generate initial pseudo-labels, which are
then validated through a visual-language evaluation using a specially designed hint template. Accurate pseudo-labels
are retained and combined with ground truth labels to train the model, ensuring it remains proficient in recognizing
both new and previously learned categories. Cao et al. [18] introduce a framework for a Generative Multi-modal Model
(GMM) that leverages large language models for class-incremental learning. This innovative approach entails the
generation of labels for images by employing an adapted generative model. Following the production of detailed textual
descriptions, a text encoder is utilized to extract salient features from these descriptions. These extracted features are
subsequently aligned with existing labels to ascertain the most fitting label for classification predictions.
PROOF [231] develops a method to enhance model memory retention when adapting to downstream tasks. This
method involves a projection technique that maps pre-trained features into a new feature space designed to preserve
prior knowledge. Additionally, to effectively utilize cross-modal information, PROOF introduces a fusion module that
employs an attention mechanism. This module adjusts both visual and textual features simultaneously, enabling the
capture of semantic information with enhanced expressive power. Recently, Jha et al. [75] present a new approach for
adapting pre-trained vision-language models like CLIP to new tasks without forgetting previous knowledge. It employs
a Variational Inference framework to probabilistically model the distribution of visual-guided text features, enhancing
fine-tuning reliability by accounting for uncertainties in visual-textual interactions. Key to CLAP is the visual-guided
attention (VGA) module, which aligns text and image features to prevent catastrophic forgetting. Additionally, CLAP
includes lightweight, task-specific inference modules that learn unique stochastic factors for each task, allowing
continuous adaptation and knowledge retention.

Parameter-Efficient Tuning Methods. Liu et al. [117] introduce Adaptation-CLIP, which employs three strategies for
CLIP’s continual learning: linear adapter, self-attention adapter, and prompt tuning. The first two strategies add a linear
layer and a self-attention mechanism, respectively, after the image encoder while freezing the remaining architecture.
The third, prompt tuning, integrates trained prompts into the text encoder to enhance task comprehension and splices
these with prior prompts to maintain continuity. To prevent catastrophic forgetting, a parameter retention strategy
preserves significantly altered parameters from 𝑀𝑡 −1 to 𝑀𝑡 , ensuring stability and effective continual learning.

Instruction Tuning-based Methods. Khan et al. [89] introduce two notable advancements: an enhanced prompt pool key
query mechanism and category-level language guidance. The key query mechanism uses CLS features to improve prompt
selection, featuring key replacement with a fixed CLS tag and dynamic mapping to task-level language representations,
Manuscript submitted to ACM
18 Yutao Yang et al.
𝑘𝑘 − 1 iteration step 𝑘𝑘 𝑘𝑘 + 1

training
<latexit sha1_base64="5k6KY2YiUA9mWJi+0pNRdIUZ16Q=">AAAB+HicbZBNS8NAEIYn9avWj0Y9elksgqeSiKDHohePFewHtKFsttt26WYTdidiDf0lXjwo4tWf4s1/47bNQVtfWHh4Z4aZfcNECoOe9+0U1tY3NreK26Wd3b39sntw2DRxqhlvsFjGuh1Sw6VQvIECJW8nmtMolLwVjm9m9dYD10bE6h4nCQ8iOlRiIBhFa/Xcchf5I2aoqVBCDac9t+JVvbnIKvg5VCBXved+dfsxSyOukElqTMf3EgwyqlEwyaelbmp4QtmYDnnHoqIRN0E2P3xKTq3TJ4NY26eQzN3fExmNjJlEoe2MKI7Mcm1m/lfrpDi4CjKhkhS5YotFg1QSjMksBdIXmjOUEwuUaWFvJWxENWVosyrZEPzlL69C87zqW767qNSu8ziKcAwncAY+XEINbqEODWCQwjO8wpvz5Lw4787HorXg5DNH8EfO5w+kOJO2</latexit>
inference
<latexit sha1_base64="R9GYk4CLv12MSN7pxMFVL8fq0fI=">AAAB+XicbVDLSsNAFJ34rPUVdelmsAiuSiKCLotuXFawD2hDmUxv2qGTSZi5KZbQP3HjQhG3/ok7/8Zpm4W2Hhg4nHMvc88JUykMet63s7a+sbm1Xdop7+7tHxy6R8dNk2SaQ4MnMtHtkBmQQkEDBUpopxpYHEpohaO7md8agzYiUY84SSGI2UCJSHCGVuq5bhfhCXOhItCgOEx7bsWrenPQVeIXpEIK1HvuV7ef8CwGhVwyYzq+l2KQM42CS5iWu5mBlPERG0DHUsViMEE+v3xKz63Sp1Gi7VNI5+rvjZzFxkzi0E7GDIdm2ZuJ/3mdDKObwMZKM7SxFh9FmaSY0FkNtC80cJQTSxjXwt5K+ZBpxtGWVbYl+MuRV0nzsupb/nBVqd0WdZTIKTkjF8Qn16RG7kmdNAgnY/JMXsmbkzsvzrvzsRhdc4qdE/IHzucPTROUEw==</latexit>
𝜃𝜃 𝑘𝑘 , 𝜔𝜔𝑘𝑘 𝜃𝜃 𝑘𝑘+1 , 𝜔𝜔𝑘𝑘+1

y memory memory
local adaptation
ŷ
Inner Update
<latexit sha1_base64="Vi95YwknFrFzcB5LyqgiYSoMf0U=">AAAB7nicbZBNS8NAEIYn9avWr6pHL4tF8FQSEeqx6MVjBfsBbSib7aZdutmE3YkQQn+EFw+KePX3ePPfuG1z0NYXFh7emWFn3iCRwqDrfjuljc2t7Z3ybmVv/+DwqHp80jFxqhlvs1jGuhdQw6VQvI0CJe8lmtMokLwbTO/m9e4T10bE6hGzhPsRHSsRCkbRWt3BhGKezYbVmlt3FyLr4BVQg0KtYfVrMIpZGnGFTFJj+p6boJ9TjYJJPqsMUsMTyqZ0zPsWFY248fPFujNyYZ0RCWNtn0KycH9P5DQyJosC2xlRnJjV2tz8r9ZPMbzxc6GSFLliy4/CVBKMyfx2MhKaM5SZBcq0sLsSNqGaMrQJVWwI3urJ69C5qnuWH65rzdsijjKcwTlcggcNaMI9tKANDKbwDK/w5iTOi/PufCxbS04xcwp/5Hz+ALHDj8o=</latexit>

<latexit sha1_base64="iul1io9TJUC6IQ5nvyk14Yqm8+M=">AAAB9HicbZBNS8NAEIY39avWr6pHL4tF8FQSEfRY9OKxgv2ANpTNdtIu3U3i7qRYQn+HFw+KePXHePPfuG1z0NYXFh7emWFm3yCRwqDrfjuFtfWNza3idmlnd2//oHx41DRxqjk0eCxj3Q6YASkiaKBACe1EA1OBhFYwup3VW2PQRsTRA04S8BUbRCIUnKG1/C7CE2YKVKwn01654lbduegqeDlUSK56r/zV7cc8VRAhl8yYjucm6GdMo+ASpqVuaiBhfMQG0LEYMQXGz+ZHT+mZdfo0jLV9EdK5+3siY8qYiQpsp2I4NMu1mflfrZNieO1nIkpShIgvFoWppBjTWQK0LzRwlBMLjGthb6V8yDTjaHMq2RC85S+vQvOi6lm+v6zUbvI4iuSEnJJz4pErUiN3pE4ahJNH8kxeyZszdl6cd+dj0Vpw8plj8kfO5w+vRZKu</latexit>

<latexit sha1_base64="kBWpEtbVW3xlz0t9EJbdPDfcZn0=">AAAB6HicbZBNS8NAEIYnftb6VfXoZbEInkoigh6LXjy2YD+gDWWznbRrN5uwuxFC6C/w4kERr/4kb/4bt20O2vrCwsM7M+zMGySCa+O6387a+sbm1nZpp7y7t39wWDk6bus4VQxbLBax6gZUo+ASW4Ybgd1EIY0CgZ1gcjerd55QaR7LB5Ml6Ed0JHnIGTXWamaDStWtuXORVfAKqEKhxqDy1R/GLI1QGiao1j3PTYyfU2U4Ezgt91ONCWUTOsKeRUkj1H4+X3RKzq0zJGGs7JOGzN3fEzmNtM6iwHZG1Iz1cm1m/lfrpSa88XMuk9SgZIuPwlQQE5PZ1WTIFTIjMguUKW53JWxMFWXGZlO2IXjLJ69C+7LmWW5eVeu3RRwlOIUzuAAPrqEO99CAFjBAeIZXeHMenRfn3flYtK45xcwJ/JHz+QPnv4z9</latexit>

model
<latexit sha1_base64="JkUTSMXnfCk2UXxCF8HACrSliKI=">AAACAHicbZC7SgNBFIZn4y3G26qFhc1gEKzCrghaBm0sI5gLJCGcncwmQ2Znl5mzYli28VVsLBSx9THsfBsnl0ITDwx8/P85nDl/kEhh0PO+ncLK6tr6RnGztLW9s7vn7h80TJxqxusslrFuBWC4FIrXUaDkrURziALJm8HoZuI3H7g2Ilb3OE54N4KBEqFggFbquUcd5I+YyZiBpNCHBKdG3nPLXsWbFl0Gfw5lMq9az/3q9GOWRlwhk2BM2/cS7GagUTDJ81InNTwBNoIBb1tUEHHTzaYH5PTUKn0axto+hXSq/p7IIDJmHAW2MwIcmkVvIv7ntVMMr7qZUEmKXLHZojCVFGM6SYP2heYM5dgCMC3sXykbggaGNrOSDcFfPHkZGucV3/LdRbl6PY+jSI7JCTkjPrkkVXJLaqROGMnJM3klb86T8+K8Ox+z1oIznzkkf8r5/AGBjZb6</latexit>

<latexit sha1_base64="FPEaNKVbgZ8mF1hFf4dNriuk5+Y=">AAAB83icbZBNS8NAEIY39avWr6pHL8EieCqJCHosevFYwX5AE8pmM22X7m7C7kQsoX/DiwdFvPpnvPlv3LY5aOsLCw/vzDCzb5QKbtDzvp3S2vrG5lZ5u7Kzu7d/UD08apsk0wxaLBGJ7kbUgOAKWshRQDfVQGUkoBONb2f1ziNowxP1gJMUQkmHig84o2itIEB4wlwmMYhpv1rz6t5c7ir4BdRIoWa/+hXECcskKGSCGtPzvRTDnGrkTMC0EmQGUsrGdAg9i4pKMGE+v3nqnlkndgeJtk+hO3d/T+RUGjORke2UFEdmuTYz/6v1MhxchzlXaYag2GLRIBMuJu4sADfmGhiKiQXKNLe3umxENWVoY6rYEPzlL69C+6LuW76/rDVuijjK5IScknPikyvSIHekSVqEkZQ8k1fy5mTOi/PufCxaS04xc0z+yPn8Abbakhw=</latexit>

𝑥𝑥 𝑡𝑡𝑡𝑡𝑡𝑡 𝑓𝑓𝜃𝜃𝑘𝑘 𝑦𝑦� 𝑡𝑡𝑡𝑡𝑡𝑡 𝑔𝑔𝜔𝜔𝑘𝑘 𝑦𝑦� 𝑡𝑡𝑡𝑡𝑡𝑡 ℒ 𝑡𝑡𝑡𝑡𝑡𝑡

𝜃𝜃 𝑘𝑘+1 = 𝜃𝜃 𝑘𝑘 − 𝛼𝛼 ⋅ 𝛻𝛻𝜃𝜃𝑘𝑘 ℒ 𝑡𝑡𝑡𝑡𝑡𝑡

Outer Update

𝑥𝑥 𝑏𝑏𝑏𝑏𝑏𝑏 𝑓𝑓𝜃𝜃𝑘𝑘+1 𝜔𝜔 𝑦𝑦� 𝑏𝑏𝑏𝑏𝑏𝑏 ℒ 𝑏𝑏𝑏𝑏𝑏𝑏 𝜔𝜔𝑘𝑘+1 = 𝜔𝜔𝑘𝑘 − 𝛽𝛽 ⋅ 𝛻𝛻𝜔𝜔𝑘𝑘 ℒ 𝑏𝑏𝑏𝑏𝑏𝑏 forward
model
x
<latexit sha1_base64="LlcxfOHeXPIQwEm33jqcOn/n7lo=">AAAB9XicbVDLSgMxFL1TX7W+qi7dBIvgqsyIoMuiG5cV7APasWQymTY0kwxJRi1D/8ONC0Xc+i/u/Bsz7Sy09UDI4Zx7yckJEs60cd1vp7Syura+Ud6sbG3v7O5V9w/aWqaK0BaRXKpugDXlTNCWYYbTbqIojgNOO8H4Ovc7D1RpJsWdmSTUj/FQsIgRbKx03w8kD/Uktlf2NB1Ua27dnQEtE68gNSjQHFS/+qEkaUyFIRxr3fPcxPgZVoYRTqeVfqppgskYD2nPUoFjqv1slnqKTqwSokgqe4RBM/X3RoZjnUezkzE2I73o5eJ/Xi810aWfMZGkhgoyfyhKOTIS5RWgkClKDJ9YgoliNisiI6wwMbaoii3BW/zyMmmf1T3Lb89rjauijjIcwTGcggcX0IAbaEILCCh4hld4cx6dF+fd+ZiPlpxi5xD+wPn8AVNOkwk=</latexit>
<latexit

experience replay retrieve

<latexit sha1_base64="uXUqb0CpKO2A3XOwNrjSmUN8jRQ=">AAAB+HicbZBNSwMxEIaz9avWj6569BIsgqeyK4Iei148VrCt0C4lm07b0Gx2SWaLdekv8eJBEa/+FG/+G9N2D9r6QuDhnRlm8oaJFAY979sprK1vbG4Vt0s7u3v7ZffgsGniVHNo8FjG+iFkBqRQ0ECBEh4SDSwKJbTC0c2s3hqDNiJW9zhJIIjYQIm+4Ayt1XXLHYRHzDSgFjCGadeteFVvLroKfg4Vkqvedb86vZinESjkkhnT9r0Eg4xpFFzCtNRJDSSMj9gA2hYVi8AE2fzwKT21To/2Y22fQjp3f09kLDJmEoW2M2I4NMu1mflfrZ1i/yrIhEpSBMUXi/qppBjTWQq0JzRwlBMLjGthb6V8yDTjaLMq2RD85S+vQvO86lu+u6jUrvM4iuSYnJAz4pNLUiO3pE4ahJOUPJNX8uY8OS/Ou/OxaC04+cwR+SPn8wezgpPA</latexit>
<latexit sha1_base64="FPEaNKVbgZ8mF1hFf4dNriuk5+Y=">AAAB83icbZBNS8NAEIY39avWr6pHL8EieCqJCHosevFYwX5AE8pmM22X7m7C7kQsoX/DiwdFvPpnvPlv3LY5aOsLCw/vzDCzb5QKbtDzvp3S2vrG5lZ5u7Kzu7d/UD08apsk0wxaLBGJ7kbUgOAKWshRQDfVQGUkoBONb2f1ziNowxP1gJMUQkmHig84o2itIEB4wlwmMYhpv1rz6t5c7ir4BdRIoWa/+hXECcskKGSCGtPzvRTDnGrkTMC0EmQGUsrGdAg9i4pKMGE+v3nqnlkndgeJtk+hO3d/T+RUGjORke2UFEdmuTYz/6v1MhxchzlXaYag2GLRIBMuJu4sADfmGhiKiQXKNLe3umxENWVoY6rYEPzlL69C+6LuW76/rDVuijjK5IScknPikyvSIHekSVqEkZQ8k1fy5mTOi/PufCxaS04xc0z+yPn8Abbakhw=</latexit>

x
<latexit sha1_base64="LlcxfOHeXPIQwEm33jqcOn/n7lo=">AAAB9XicbVDLSgMxFL1TX7W+qi7dBIvgqsyIoMuiG5cV7APasWQymTY0kwxJRi1D/8ONC0Xc+i/u/Bsz7Sy09UDI4Zx7yckJEs60cd1vp7Syura+Ud6sbG3v7O5V9w/aWqaK0BaRXKpugDXlTNCWYYbTbqIojgNOO8H4Ovc7D1RpJsWdmSTUj/FQsIgRbKx03w8kD/Uktlf2NB1Ua27dnQEtE68gNSjQHFS/+qEkaUyFIRxr3fPcxPgZVoYRTqeVfqppgskYD2nPUoFjqv1slnqKTqwSokgqe4RBM/X3RoZjnUezkzE2I73o5eJ/Xi810aWfMZGkhgoyfyhKOTIS5RWgkClKDJ9YgoliNisiI6wwMbaoii3BW/zyMmmf1T3Lb89rjauijjIcwTGcggcX0IAbaEILCCh4hld4cx6dF+fd+ZiPlpxi5xD+wPn8AVNOkwk=</latexit>
<latexit
backward
<latexit sha1_base64="W09jrrLx4Rs9m8jTxoFbHt+OKCA=">AAACAXicbVC7SgNBFJ2NrxhfqzaCzWAQrMKuCFoGbSwjmAckS5id3E2GzD6YuSsJS2z8FRsLRWz9Czv/xkmyhSYeGDiccy93zvETKTQ6zrdVWFldW98obpa2tnd29+z9g4aOU8WhzmMZq5bPNEgRQR0FSmglCljoS2j6w5up33wApUUc3eM4AS9k/UgEgjM0Utc+6iCMMINRAkpAxIEqSCQbT7p22ak4M9Bl4uakTHLUuvZXpxfzNIQIuWRat10nQS9jCgWXMCl1Ug0J40PWh7ahEQtBe9kswYSeGqVHg1iZFyGdqb83MhZqPQ59MxkyHOhFbyr+57VTDK68TERJiibd/FCQSooxndZBe0IBRzk2hHElzF8pHzDFOJrSSqYEdzHyMmmcV1zD7y7K1eu8jiI5JifkjLjkklTJLamROuHkkTyTV/JmPVkv1rv1MR8tWPnOIfkD6/MHfKSXiQ==</latexit>

Figure 1: An illustration of our model and how it interacts with the key-value memory module during
(a) MBPA++
training (left) and inference (right). During training, newly seen examples are used to update the Figure 2: Method overview. At the iteration step k, for (b)theCBA
inner loop, the forward process computes the training loss in
base model and stored in the memory. At certain intervals, we sample examples from the memory Eq. (3) and the backward process updates the classification model parameter ✓(!) by Eq. (5). And for the outer loop, the
and perform gradient updates on the base model (experience replay). During inference, we retrieve
Fig. 8. Frameworks in Online Continual Learning: MBPA++ (PLM-based HTB/BTB) [35], CBA (VLM-base BTB) [187].
examples whose keys are similar to a test example under consideration to fine-tune the model (local
forward process computes the loss in Eq. (4) and the backward process updates the CBA parameter ! by Eq. (6)

adaptation). We use the fine-tuned model to make a prediction and then discard it—keeping the base
thereby enhancing accuracy and robustness across different
model for other predictions. tasks. Meanwhile, category-level On the other hand, our ultimate objective is to protect the
language guidance is inputs. As such, the original classifier network f✓ is aug-
original classifier network f from catastrophic distribution mented by cascading the CBA module after its classification ✓
into the model. We show that a 1% experience replay to learning new examples ratio is sufficient.
shift, while accomplishing a stable consolidation of knowl- layer, as shown in Fig. 2. To capture catastrophic distri-
implemented in the vision transformer to better align output features
edge across withTocategory-specific
Such a process bears some similarity to memory consolidation in human learning (McGaugh, 2000).
different tasks. this end, we further keep language
bution representations,
change during continual learning, we parameterize
In local adaptation, we follow Memory-based Parameter Adaptation (MbPA; Sprechmann et al., 2018) tracking the performance of the classifier network to prevent g! as a multi-layer perceptron (MLP) network with a sin-
significantly improving task handling and category differentiation, leading
catastrophic forgetting, to improved
and use examples retrieved from memory to update model parameters used to make a prediction of a
which requires that f returnedmodel performance.
gle hidden layer containing 256 nodes. Each hidden node is ✓ ⇤ (!)
particular test example. by minimizing the rehearsal-based empirical risk Eq. (3), equipped with a ReLU [32] activation function, and the out-
should maximize the performance of previously seen data. put employs the Softmax activation function to guarantee a
Our setup is different from a typical lifelong learning setup. We assume that the model only makes one
Since accessing all of this historical data is unfeasible, it posterior probability output. Albeit simple, this network is
pass over the training examples, similar to Chaudhry et al. (2019). However, we also assume neither
5 ONLINE CONTINUAL LEARNING can be approximated by the empirical risk over the memory known as a universal approximator, capable of fitting al-
our training nor test examples have dataset identifying information (e.g., a dataset identity, a dataset
buffer data, i.e., most any continuous function [18] and thus can fit various
descriptor). We argue that our lifelong language learning setup—where a model is presented with
posterior distribution changes. Additionally, we introduce a
a stream of examples without an explicit identifier about which dataset (distribution) the examples ⇣ ⌘
5.1
come from—is a realistic setup to learn a general linguistic intelligence model.1 Our experiments Hard Task Boundary ! ⇤ , arg min Lbuf Bbuf ; f✓⇤ (!) ,
skip-layer connection within the CBA module, connecting
the outputs of f✓ to the MLP outputs, which aids the model
focus on lifelong language learning on two tasks—text classification and question answering.2 !
X (4)
= arg min
1
L f✓⇤ (!) (x), y . convergence and facilitates the gradient backward propaga-
Our main contributions in this paper are:
5.1.1 PLMs-based HTB. The Hard Task Boundary (HTB) setting has|Bbeen
|
developed to enable continuous knowledge ! buf
x,y2Bbuf
tion, which has been verified in previous works [19, 21].
Note that the proposed CBA module is only used in the
• We introduce a lifelong language learning setup where the model needs to learn from a This formulation attempts to find the optimal parame- training stage. In the test stage, the test sample xtst is pre-
acquisition byfrom
stream of examples learning models
many datasets from a indynamically
(presented sequentially) changing
one pass, and no dataset stream
ters such that ofnetwork
the classifier textual
returned data, without the need for dataset
by optimiz- dicted by f✓ , that is ŷ tst = f✓ (xtst ). This indicates that our
boundary or dataset identity is given to the model. ing Eq. (3), should also have a good performance on the method does not introduce any calculation overhead in the
identifiers.
• We present an For example,
episodic memory modelShen
(§2) that et al. [168]
augments have implemented
an encoder-decoder model with a HTB in slot filling, Michieli et al. [135] have utilized it in memory buffer data which acts as a stable consolidation of test stage. As a result, our method can be inferred at any
memory module. Our memory is a key-value memory that stores previously seen examples knowledge from the learned tasks. time in the learning process of online CL [23].
In fact, Eq. (3) and Eq. (4) formulate a bi-level learning
audio classification,
for sparse and
experience replay and localVander
adaptation. et al. [176] have explored its use in automatic speech recognition.
The learning of CBA. The Eq. (3) and Eq. (4) in the framework. In the inner loop Eq. (3), the augmented classi-
• We leverage progress in unsupervised pretraining to obtain good memory key representations fier network F✓,! is updated to learn new knowledge and re-
and discuss strategies to manage the space complexity of the memory module. bi-level optimization are nested with each other. Con-
hearse old knowledge from B trn . In the outer loop Eq. (4), cretely, the solution ✓⇤ (!) of Eq. (3) depends on the hyper-
Traditional Methods. Continual learning (CL) methodologies, particularly pertinent to online
• We compare our proposed method to baseline and state-of-the-art continual learning methods scenarios,
parameter encompass
! and the optimum solution is obtained at ! . the parameters of g! are updated from B buf to consolidate ⇤
and demonstrate its efficacy on text classification and question answering tasks (§4). the previously learned knowledge, which has been updated However, the required optimum ! ⇤ is solved by Eq. (4)
in the inner loop, against catastrophic posterior change.
a variety of approaches. These include parameter-isolation-based methods [35, 194], replay-based
which relies onmethods
the best ✓ (!). [63]
Indeed, and
there is generally
no closed-form solution to the bi-level optimization frame- We herein detail the proposed CBA module and bi-level
⇤

2 Model optimization algorithm in the following aspects:

work [6] and we approximately update the ✓ and ! using a
regularization-based methods [115, 176]. MBPA++ [35] introduces a framework for lifelong language
Continual Bias Adaptor: The proposed CBA module g
learning,
gradient-optimization-based enabling
method following [36, 38].
We consider a continual (lifelong) learning setup where a model needs to learn from a stream of !

training examples {xt , yt }Tt=1 . We assume that all our training examples in the series come from is designed to be a nonlinear transformation that takes the (1) Update ✓. Given the CBA parameter ! k at iteration
a pre-trained model to learn continually from textual examples without
outputs (logits) of requiring
the original classifier network f labeled
multiple datasets of the same task (e.g., a text classification task, a question answering task), and each as its stepdatasets. It employs
k, the CBA parameter ! is fixed andan
we formulate ✓
k

dataset comes one after the other. Since all examples come from the same task, the same model can
episodic memory system with sparse experience replay and local adaptation techniques to prevent catastrophic
be used to make predictions on all examples. A crucial difference between our continual learning
1
Contrast this with a more common setup where the model learns in a multitask setup (Ruder, 2017; McCann
forgetting. Extending this framework, Meta-MBPA++ [194] integrates three core lifelong learning principles, enhancing
et al., 2018).
2
McCann et al. (2018) show that many language processing tasks (e.g., classification, summarization, natural
performance in text classification and question-answering tasks while using only 1% of the typical memory usage.
language inference, etc.) can be formulated as a question answering problem.

Liu et al. [115] introduce a2 regularization-based strategy, CID, for lifelong intent detection. This method uses cosine
normalization, hierarchical knowledge distillation, and inter-class margin loss to tackle the challenges of data imbalances
in the lifelong intent detection (LID) task, aiming to mitigate the negative impacts associated with these imbalances.

5.2 Blurry Task Boundary

5.2.1 PLMs-based BTB. MBPA++ [35] and Meta-MBPA++ [194] can also exemplify models capable of adapting to
environments with indistinct task boundaries. TPEM [47] employs a tripartite approach within an encoder-decoder
framework, consisting of pruning, expanding, and masking techniques. Pruning helps preserve essential information
from previous tasks, expansion increases the model’s capacity to accommodate new tasks, and masking reduces
interference from previous tasks’ weights, thus enhancing learning efficiency. Originally, online meta-learning (OML)
[74] and a neuromodulatory meta-learning algorithm (ANML) [6] were intended to continuously learn new sequences of
tasks during the testing phase. Holla et al. [63] adapt these algorithms for a conventional continual learning context,
where the focus is on evaluating performance on previously encountered tasks. The enhanced versions, named OML-ER
and ANML-ER, incorporate an episodic memory module designed for experience replay.

5.2.2 VLMs-based BTB.

Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 19

Traditional Methods. Cui et al. [32] propose the Dynamic Knowledge Rectification (DKR) framework, designed
to mitigate the propagation of incorrect information in foundation LMs. The DKR framework operates by initially
leveraging an existing model to identify and exclude obsolete or erroneous knowledge when confronted with new data.
Subsequently, a rectification process is employed to amend these inaccuracies while ensuring the preservation of valid
data associations. This process is especially vital when integrating new data, as it prevents the perpetuation of outdated
or incorrect information. In cases where data is inaccessible through existing models, DKR utilizes paired ground-truth
labels to support the continuous evolution of knowledge bases, thereby enhancing the model’s accuracy.

Parameter-Efficient Tuning Methods. Wang et al. [187] introduce the Continual Bias Adaptor (CBA) (Figure 8b), a
novel method designed to enhance the efficacy of online CL by mitigating catastrophic forgetting. The CBA method
dynamically modifies the classifier’s network to adapt to significant shifts in data distribution observed during train-
ing, thereby preserving knowledge from earlier tasks. Notably, the CBA module can be deactivated during testing
phases, eliminating additional computational burdens or memory demands. This feature highlights the practicality and
applicability of the CBA in real-world scenarios.

Instruction Tuning-based Methods. In the BTB scenario, the Mask and Visual Prompt tuning (MVP) method, as detailed
by Moon et al. [139], addresses challenges like intra- and inter-task forgetting and class imbalance effectively. MVP
features instance-wise logit masking to prevent irrelevant information retention, contrastive visual prompt tuning loss
to ensure consistent prompt selection, gradient similarity-based focal loss to focus on overlooked samples, and adaptive
feature scaling to balance the integration of new knowledge with existing data retention.

Online Learning for CV. Recent advancements in class-incremental online learning predominantly address computer
vision tasks, encapsulating various methodologies such as regularization-based [5, 52, 54, 207], replay-based [3, 14,
107, 169, 222], distillation-based [46, 61, 95], and gradient-based approaches [24, 55, 77, 143, 210]. Among these, Koh
et al. [94] introduced the Class-Incremental Blurry (CLIB) model, which distinguishes itself through its task-free
nature, adaptability to class increments, and prompt response to inference queries, showing superior performance over
traditional continual learning methods. In a related vein, Gunasekara et al. [53] explore the Online Streaming Continual
Learning (OSCL), a hybrid of Stream Learning (SL) and Online Continual Learning (OCL), which integrates aspects of
both domains. Furthermore, issues of shortcut learning and bias in online continual learning have been tackled by Wei
et al. [195] and Chrysalis et al. [29]. Additionally, Semola et al. [166] propose Continual-Learning-as-a-Service (CLaaS),
a service model that leverages continual learning to monitor shifts in data distribution and update models efficiently.
This array of developments highlights the dynamic capabilities of online continual learning frameworks to effectively
address complex challenges across a variety of computer vision tasks.

6 DATASETS
6.1 Offline Datasets for NLP
6.1.1 Datasets for Classification.

Text Classification. The most typical task for continual learning is text classification. The foundational text classi-
fication benchmark encompasses five text classification datasets introduced by [221], including AG News, Amazon
Reviews, Yelp Reviews, DBpedia, and Yahoo Answers [171]. Particularly, the AG News dataset has 4 classes for news
classification; the Amazon and Yelp dataset has 5 classes for sentiment analysis; the DBpedia dataset has 14 classes
Manuscript submitted to ACM
20 Yutao Yang et al.

Table 1. The statistics information of the existing CL datasets. #D/T/C means the number of domains/tasks/classes, respectively.
Datasets #Train #Val #Test #Total CL Settings NLP Problems Language #D/T/C
Offline
Progressive Prompts [158] - - - - TIL Sentiment analysis, topic classification, boolean QA, QA, English 15 tasks
paraphrase detection, word sense disambiguation, natural
language inference
MeLL [182] 1,430,880 173,781 118,240 1,722,901 TIL Intent classification English 1184 tasks
Continual-T0 [164] 800,000 - 33,382 833,382 TIL Text Simplification, Headline Generation with Constraint, English 8 tasks
Haiku Generation, Covid QA, Inquisitive Question Genera-
tion, Empathetic Dialogue Generation, Explanation Gener-
ation, Twitter Stylometry
COPR [217] - - - - TIL QA tasks, summary task, positive file review generation English 3 tasks
task
Adaptive Compositional Mod- 50,725 - 27,944 78,669 TIL Natural language generation, SQL query generation, Sum- English 8 tasks
ules [223] marization and Task-oriented dialogue
CODETASKCL [204] 181,000 9,700 10,000 200,700 TIL Code generation, code translation, code summarization, and Hybrid 4 tasks
code refinement
Lifelong Simple Questions - - - - TIL single-relation questions English 20 tasks
[184]
Lifelong FewRel [184] - - - - TIL few-shot relation detection English 10 tasks
InstrDialog [225] 9,500 950 1,900 12,350 TIL Dialogue state tracking, dialogue generation, intent identi- English 19 tasks
fication
InstrDialog++ [225] 3,800 1,900 3,800 9,500 TIL Dialogue Generation, Intent Identification, Dialogue State English 38 tasks
Tracking, Style Transfer, Sentence Ordering, Word Seman-
tics, Text Categorization, Pos Tagging, Fill in The Blank,
Program Execution, Question Generation, Misc, Coherence
Classification, Question Answering, Summarization, Com-
monsense Classification, Wrong Candidate Generation and
Toxic Language Detection
ConTinTin [208] - - - - TIL Question generation tasks (QG), answer generation tasks English 61 tasks
(AG), classification tasks (CF), incorrect answer generation
tasks (IAG), minimal modification tasks (MM) and verifica-
tion tasks (VF)
Tencent TL [214] - - - - TIL personalized recommendations and profile predictions English 6 tasks
Movielens [214] - - - - TIL personalized recommendations and profile predictions English 3 tasks
NAVER Shopping [91] - - - - TIL search query prediction tasks English 6 tasks
TRACE [190] 40,000 - 16,000 56,000 TIL Domain-specific task, Multi-lingual task, Code completion Hybrid 8 tasks
task, Mathematical reasoning task
ABSC [86] 3,452 150 1,120 4,722 DIL Aspect-based sentiment classification English 19 Domains
DecaNLP [171] 169,824 - 32,116 201,940 DIL Question answering, semantic parsing, sentiment analysis, English 5 domains
semantic role labeling, and goal-oriented dialogue
Foundational text classifica- 115,000 - 7,600 122,600 DIL News classification, sentiment analysis, Wikipedia article English 5 domains
tion [171] classification, and question-and-answer categorization
RVAE_LAMOL [183] 15,870 - 5,668 21,538 DIL Oriented dialogue of the restaurant reservation task, seman- English 3 domains
tic role labeling, sentiment classification
COPR [217] - - - - DIL - English 18 domains
SGD [234] 38,745 5,210 11,349 40,287 DIL Dialogue state tracking English 19 Domains
CPT [81] 3,121,926 - - 3,121,926 DIL Domain-adaptive pre-training task English 4 Domains
CKL[73] - - - 30,372 DIL Domain-adaptive pre-training task English 3 Domains
ELLE [152] - - - - DIL Domain-adaptive pre-training task English 5 Domains
Domain-incremental Paper - - - - DIL Relation extraction and named entity recognition English 4 domains
Stream [78]
Chronologically-ordered - - - - DIL multi-label hashtag prediction and single-label emoji pre- English 4 domains
Tweet Stream [78] diction
AdapterCL [128] 31,426 4,043 4,818 40,287 DIL Intent classification, Dialogue State Tracking (DST), Natural English 37 Domains
Language Generation (NLG), end-to-end (E2E) modeling
DE&E [197] 28,982 - 12,089 41,071 CIL Text classification English 3 tasks
EPI [193] 12,840 3,524 6,917 23,281 CIL Text classification, topic classification English 13 Classes
PAGeR [177] 59,754 7,115 15,304 82,173 CIL Intent classification English 355 Classes
PLE [105] 4,669 4,650 31,642 40,961 CIL Intent classification English 477 Classes
CoNLL-03 [138] 23,326 5,902 5,613 34,841 CIL Named Entity Recognition English 4 Classes
OntoNotes [138] 107,169 16,815 10,464 134,448 CIL Named Entity Recognition English 6 Classes
Online
Foundational text classifica- 115,000 - 7,600 122,600 Hard and Blurry News classification, sentiment analysis, Wikipedia article English 5 tasks
tion [35] classification, and question-and-answer categorization
MBPA++ [35] 881,000 35,000 38,000 954,000 Hard and Blurry News classification, sentiment analysis, Wikipedia article English 9 tasks
classification, questions and answers categorization, ques-
tion answering
Lifelong FewRel [63] - - - - Hard and Blurry few-shot relation detection English 10 tasks
Firehose [66] - - - 110,000,000 Blurry Personalized online language learning English 1 tasks
TemporalWiki [72] - - - - - - English -

for Wikipedia text classification; and the Yahoo dataset has 10 classes for Q&A classification. The text classification
benchmark includes 115,000 training and 7,600 test examples for each task, holding out 500 samples per class from
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 21

the training set for validation. Building upon this, Razdaibiedina et al. [158] developed a novel continual learning
(CL) benchmark. This benchmark not only utilizes the foundational text classification benchmark but also integrates
additional datasets from the GLUE benchmark [181], SuperGLUE benchmark [180], and the IMDB dataset [127]. Specifi-
cally, the GLUE benchmark datasets included are MNLI, QQP, RTE, and SST2, focusing on tasks like natural language
inference, paraphrase detection, and sentiment analysis. Similarly, the SuperGLUE datasets—WiC, CB, COPA, MultiRC,
and BoolQ—encompass tasks ranging from word sense disambiguation to question answering. DE&E [197] uses three
common text classification data sets with different characteristics-News-groups, BBC News, and Consumer Finance
Complaints2. Such datasets can be used to evaluate the models on tasks with different difficulty levels.
The datasets introduced in [193] further are categorized into two groups based on the domain relevance between tasks:
far-domain and near-domain. The far-domain group comprises two text classification tasks, which are foundational
benchmarks [221] divided into topic classification (AG News, Yahoo Answers, DBpedia) and sentiment classification
(Yelp, Amazon Reviews). In contrast, the near-domain group uses the Web of Science (WOS) [96] and 20 Newsgroups
[97], which are restructured according to their high inter-task relevance. The WOS dataset comprises seven parent
classes, each with five closely related sub-classes, while the 20 Newsgroups dataset, containing six news topics, is
reorganized into four tasks to maximize inter-task correlation.

Intent Classification. Some studies focus on intent classification tasks, where the classes are quite different in various
domains or scenarios. In the realm of intent classification and detection, several datasets have been specifically designed
to advance the field by addressing different challenges and providing diverse environments for model training and
evaluation. The dataset, as introduced in PAGeR [177], aims to tackle the lifelong intent detection problem by combining
three public intent classification datasets (CLINC150 [98], HWU64 [118], BANKING77 [19]), one text classification
dataset (Stackoverflow S20 [203]), and two public multidomain dialog intent detection datasets (SGD [157], MWOZ
[12]). Moreover, FewRel [58] is also incorporated to tackle the lifelong relation extraction problem. This integration is
intended to simulate real-world applications by encompassing a broad spectrum of domains and query distributions,
thereby facilitating the development of more robust and versatile intent detection systems.
Conversely, the dataset compiled in PLE [105] consolidates nine well-regarded intent detection datasets, including
CLINC150 [98] and HWU64 [118], among others, arranged in a fixed random sequence to form a standardized benchmark.
This dataset emphasizes the importance of consistency and comparability in performance evaluations across different
intent detection models, providing a platform for assessing and enhancing various methodologies. The dataset described
by MeLL [182] specifically addresses intent detection within two distinct contexts: task-oriented dialogues (TaskDialog-
EUIC) and real-world e-commerce interactions (Hotline-EUIC). TaskDialog-EUIC integrates data from Snips [31], TOP
semantic parsing [56], and Facebook’s Multilingual Task Oriented Dataset [162] into 90 tasks with overlapping label
sets, amounting to over ten thousand samples. Hotline-EUIC is derived from an e-commerce dialogue system [104] and
the hotline audios are transcribed to text by a high-accuracy industrial Automatic Speech Recognition (ASR) system.

Fine-grained Sentiment Analysis. Ke et al. [86] developed a task incremental learning dataset for aspect-based
sentiment classification (ABSC). This dataset aggregates reviews from four distinct sources, thereby enhancing its
diversity and applicability across multiple domains. The sources include the L5Domains dataset by Hu et al. [67],
which features consumer reviews for five different products; the Liu3Domains dataset by Liu [114], comprising reviews
pertaining to three products; the Ding9Domains dataset by Ding et al. [38], which includes reviews of nine varied
products; and the SemEval14 dataset, which is focused on reviews of two specific products—laptops and restaurants.
Manuscript submitted to ACM
22 Yutao Yang et al.

6.1.2 Datasets for Generation. In the rapidly advancing field of machine learning, diverse datasets function as crucial
benchmarks for exploring various dimensions of language and code generation. These datasets address both universal
and task-specific challenges, enabling a comprehensive evaluation of model capabilities. A particularly significant
dataset highlighted in the work by Continual-T0 [164] focuses on English language generation tasks, including text
simplification and empathetic dialogue generation, among others [9, 17]. The design of this dataset maintains uniformity
in size, facilitating effective comparative analyses of performance across distinct tasks by ensuring a consistent volume
of data for training. In a subsequent study, Luo et al. [125] conduct an analysis of catastrophic forgetting on Bloomz
[161] using Continual T0 datasets.
The dataset, introduced in LAMOL [171], integrates elements from both DecaNLP [133] and the foundational text
classification benchmark [221]. This dataset encompasses five distinct NLP tasks originally sourced from DecaNLP:
question answering, semantic parsing, sentiment analysis, semantic role labeling, and goal-oriented dialogue. For the
purposes of this dataset, all tasks, whether derived from DecaNLP or the foundational text classification benchmark,
are restructured into a uniform format, conceptualized under the framework of a question answering task. Moreover,
the dataset devised in RVAE_LAMOL [183], employs three tasks from DecaNLP: the English Wizard of Oz (WOZ) for
goal-oriented dialogue, QA-SRL for semantic role labeling in a SQuAD-style format, and SST, which is a binary version
of the Stanford Sentiment Treebank categorizing sentiments as positive or negative. These tasks are specifically treated
as sequence generation tasks.
The dataset introduced in COPR [217] represents a pioneering effort in applying both Task Incremental Learning
(TIL) and Domain Incremental Learning (DIL) within the context of benchmarks that utilize existing human preferences.
Specifically, the TIL framework in this dataset mandates that the model sequentially acquires knowledge from three
distinct tasks. These include the question-answering task utilizing the HH-RLHF dataset [4], the summarization task
based on the Reddit TL, DR dataset with human feedback [179], and the positive film review generation task using the
IMDB dataset [127]. Meanwhile, the DIL framework requires the model to adapt to three distinct segments from the
SHP dataset, as described by Ethayarajh et al. [44].
The dataset described in Adaptive Compositional Modules [223] explores sequence generation and categorizes tasks
into "similar" and "dissimilar" groups based on their characteristics. Tasks classified as similar, including E2ENLG
[142] and four domains (restaurant, hotel, TV, laptop) from RNNLG [196], demonstrate shared patterns and are tested
across four sequence orders, comprising a total of five tasks. In contrast, dissimilar tasks such as WikiSQL (SQL
query generation) [229], CNN/DailyMail (news article summarization) [165], and MultiWOZ (semantic state sequence
generation) [12] exhibit significant distributional shifts from previously encountered tasks. The CODETASKCL dataset,
explored by Yadav et al. [204], encompasses a diverse array of code-centric tasks, including code generation [70],
summarization [69], translation [123], and refinement [174] across various programming languages. This dataset
significantly enhances the breadth of language processing applications within technical fields.

6.1.3 Datasets for Information Extraction. In the realm of natural language processing (NLP), various datasets are
tailored to specific aspects of the task under continual learning paradigms. The dataset introduced in ExtendNER
[138], exemplifies a continual learning approach to Named Entity Recognition (NER). This dataset amalgamates the
CoNLL-03 English NER [160] and OntoNotes [65], covering a broad spectrum of entity types and sources. This hybrid
dataset is structured to challenge the adaptability and generalization capabilities of NER systems across varied contexts.
Unlike the static nature of text in NER tasks, the Schema-Guided Dialog (SGD) [157] dataset, utilized in C-PT [234],
serves the Dialog State Tracking aspect of IE, which involves maintaining the context of a dialog over time. The SGD
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 23

dataset features 44 services across 19 domains, each treated as a separate task, and is designed to evaluate models on
their ability to manage and extract information across conversational turns. Lastly, the lifelong SimpleQuestions and
lifelong FewRel datasets, devised in [184] is crafted for the task of relation extraction. It merges elements from the
SimpleQuestions [11] and FewRel [58] to form a lifelong learning benchmark that confronts the challenges of relation
detection in a few-shot context.

6.1.4 Datasets for Continual Pre-training. In the realm of continual pre-training for large language models (LMs), the
development and utilization of specialized benchmarks play a pivotal role in evaluating and enhancing the effectiveness
of continual learning systems. The dataset, introduced in CPT [81], primarily focuses on the continual post-training of
LMs across a series of domain-specific, unlabeled datasets. It provides a rigorous test environment by using diverse
corpora such as Yelp Restaurant Reviews [202], AI and ACL Papers [121], and AGNews articles [221]. Its main objective
is to gauge how well an LM can incrementally integrate domain-specific knowledge without forgetting previously
learned information, thereby enhancing its few-shot learning capabilities in these domains. Contrary to the datasets
employed in CPT [81], which evaluate domain-specific adaptability and incremental learning, the CKL benchmark [73]
is meticulously designed to measure the LM’s ability to retain timeless knowledge, update obsolete information, and
acquire new knowledge. It comprises subsets like INVARIANTLAMA, UPDATEDLAMA, and NEWLAMA, which are
crafted to probe specific types of knowledge that an LM may encounter in its learning trajectory.
Whereas the aforementioned two datasets assess more controlled dimensions of knowledge integration and retention,
the dataset introduced in ELLE [152] focuses on the dynamic scenario of accumulating streaming data from diverse
sources in a lifelong learning context. This dataset mirrors the real-world challenge of a language model (LM) that must
continuously adapt to new data inflows from multiple domains, including BOOKCORPUS (WB) [235], NEWS ARTICLES
(NS) [215], AMAZON REVIEWS (REV) [62], BIOMEDICAL PAPERS (BIO) [121] and COMPUTER SCIENCE PAPERS
(CS) [121]. The benchmark evaluates the LM’s capacity to effectively integrate new information from these varied
sources over time, highlighting the essential need for LMs to evolve in response to continual data growth and shifts in
data distribution. Jin et al. [78] construct data streams to represent two prevalent types of domain shifts observed in
practical scenarios. The first, a Domain-incremental Paper Stream, simulates the sequential evolution of research areas
within academic papers, encompassing diverse disciplines such as biomedical and computer science. The second, a
Chronologically-ordered Tweet Stream, models the temporal progression of tweets over time.

6.1.5 Datasets for Hybrid Tasks. An increasing number of datasets are adopting a hybrid task approach that integrates
multiple learning paradigms and task types, aimed at testing and enhancing the adaptability of models. A notable
example is the dataset introduced in AdapterCL [128], which is tailored for task-oriented dialogue systems. This dataset
incorporates four task-oriented datasets: TaskMaster 2019 (TM19) [13], TaskMaster 2020 (TM20) [13], Schema Guided
Dialogue (SGD) [157], and MultiWoZ [12]. These datasets have been pre-processed to form a curriculum encompassing
37 domains, structured under four continual learning settings: INTENT classification, Dialogue State Tracking (DST),
Natural Language Generation (NLG), and end-to-end (E2E) modeling.
Continual Instruction Tuning Benchmark (CITB) [225] extends the concept of continual learning by focusing on
instruction-based NLP tasks. Built on the comprehensive SuperNI [192] dataset, it includes over 1,600 tasks across
diverse NLP categories. CITB differentiates itself by formulating two distinct streams—InstrDialog and InstrDialog++—to
examine how models integrate and retain new dialogue-oriented and varied NLP tasks under continual learning settings.
This benchmark suite not only tests task retention and adaptability but also explores how instruction tuning can be
optimized for a continual learning framework. The ConTinTin [208] is an adaptation of the NATURAL-INSTRUCTIONS
Manuscript submitted to ACM
24 Yutao Yang et al.

dataset, specifically restructured to facilitate a continual learning framework. This adaptation involves decomposing the
original crowdsourcing instructions into smaller, distinct sub-tasks to create a new dataset. Additionally, the new dataset
incorporates a novel experimental design where tasks are selected randomly to create diverse sequences, enabling the
evaluation of a model’s adaptability to novel instructions without prior exposure.
The dataset, used in Conure [214], consists of Tencent TL (TTL) [213] and Movielens (ML). The TTL dataset is
designed to address three item recommendation tasks and three user profiling tasks, whereas the ML dataset exclusively
focuses on three item recommendation tasks. Both datasets have been pre-processed to facilitate a continual learning
framework, simulating environments where models must adapt to evolving data streams. Furthermore, Kim et al. [91]
introduced the proprietary NAVER Shopping dataset, which builds upon the previously mentioned datasets. The NAVER
Shopping dataset features six tasks: two for search query prediction, two for purchased item category prediction, and
two for user profiling, all designed to meet real-world industry requirements.
Finally, the TRACE dataset, introduced by Wang et al. [190], is specifically designed to bridge the existing gap
in the evaluation of large language models (LLMs) within the continual learning framework, encompassing a wide
range of complex and specialized tasks. Distinguished from other datasets, TRACE targets domain-specific tasks that
are multilingual and technical, including code completion and mathematical reasoning. This diversity presents a
unique set of challenges that span both specialized and broad dimensions. Moreover, TRACE rigorously assesses the
models’ ability to sustain performance across tasks that demand different knowledge bases and cognitive skills. This
evaluation highlights the essential need for adaptability in LLMs trained to operate under continual learning conditions,
underscoring their potential in dynamic real-world applications.

6.2 Online Datasets for NLP

6.2.1 Datasets for Classification. The foundational text classification benchmark, as introduced by Zhang et al. [221]
has traditionally been applied in offline continual learning settings. Recent advancements have adapted this benchmark
for online continual learning, notably in studies such as MBPA++ [35] and OML-ER [63]

6.2.2 Datasets for Generation. The dataset, used in MBPA++ [35], comprises three distinct question-answering col-
lections: SQuAD 1.1 [155], TriviaQA [79], and QuAC [28]. SQuAD 1.1 is a reading comprehension dataset based on
Wikipedia articles, designed to assess the ability to derive answers from structured text. TriviaQA consists of question-
answer pairs developed by trivia enthusiasts, accompanied by corroborative evidence sourced from both the web and
Wikipedia, testing the model’s capability to handle diverse information sources. QuAC adopts a dialog-style format in
which a student queries about information in a Wikipedia article and a teacher responds using text directly from the
article, challenging the model’s interactive response generation.

6.2.3 Datasets for Information Extraction. The lifelong relation extraction benchmark, used in OML-ER [63], is struc-
tured by Wang et al. [184] based on FewRel. Unlike the original application by Wang et al., the benchmark in OML-ER
is adapted for online continuous learning scenarios.

6.2.4 Datasets for Other Tasks. Hu et al. [66] compile the Firehose dataset, consisting of 110 million tweets from over
920,000 users between January 2013 and September 2019. This dataset is split into FIREHOSE 10M and FIREHOSE 100M.
TemporalWiki [72] addresses temporal misalignment by serving as a lifelong benchmark that trains and evaluates LMs
using consecutive snapshots of Wikipedia and Wikidata. This methodology assists in assessing an LM’s capacity to
both retain previously acquired knowledge and assimilate new information over time.
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 25

Test task
T1 T2 T3 T4 TN-1 TN
R0,1 R0,2 R0,3 R0,4 ... R0,N-1 R0,N

...
T1
R1,1 R1,2 R1,3 R1,4 R1,N-1 R1,N
Training on task Tk individually, test on Tk
T2 T3 T4
R2,1 R2,2 R2,3 R2,4 ... R2,N-1 R2,N
Training Task

Training on task {T1 , ..., Tk }, test on Ti>k

R3,1 R3,2 R3,3 R3,4 ... R3,N-1 R3,N
Training on task {T1 , ..., Tk }, test on Ti<=k
R4,1 R4,2 R4,3 R4,4 ... R4,N-1 R4,N
Training on task {T1 , ..., TN }, test on Ti<=N
...
...
...
...
...
...
...
...
TN-1 TN

RN-1,1 RN-1,2 RN-1,3 RN-1,4 RN-1,N-1 RN-1,N

RN,1 RN,2 RN,3 RN,4 ... RN,N-1 RN,N

Fig. 9. Illustration of calculating metrics.

6.3 Offline CL Datasets for Multi-modal Tasks

The P9D dataset [233] consists of over one million image-text pairs from e-commerce data, organized into nine industry
sector-based training tasks. It includes 1,014,599 training pairs, 2,846 for cross-modal retrieval tests, and 4,615 query
pairs with 46,855 gallery pairs for multi-modal retrieval. Qian et al. [148] introduce two novel benchmarks for continual
learning, namely CL-TDIUC and CL-VQA2.0, which are derived from the TDIUC [80] and VQA2.0 [50], respectively.
These benchmarks are categorized into three scenarios: the Continual Vision Scenario, which deals with new visual
scenes; the Continual Language Scenario, focusing on new questions in existing scenes; and the Continual Vision-
Language Scenario, addressing changes in both questions and visuals. DKR [32] comprises five benchmark datasets:
MS-COCO Caption (MS-COCO) [108], Flickr30K [211], IAPR TC-12 [51], ECommerce-T2I (EC) [205], and RSICD [124].
Furthermore, two experimental scenarios are established. The first scenario involves a sequential processing of the
datasets, specifically MS-COCO, Flickr30K, IAPR TC-12, EC, and RSICD, in that order. The second scenario, which
builds on the approach proposed by Ni et al. [141], partitions the EC dataset into five sub-datasets for the training phase.
The model’s performance is subsequently tested on the Flickr30K, MS-COCO, and EC datasets.

6.4 Offline and Online Datasets for Other Tasks

In this paper, we provide a comprehensive review of recent studies in natural language processing (NLP) and multi-
modal tasks, with a particular focus on the use of PLMs, LLMs and VLMs. Additionally, we introduce a range of offline
continual learning (CL) datasets employed across various applications, including automatic speech recognition [36, 176],
autonomous driving [178], disease classification [37], reinforcement learning [198], graph data [93], and computer
vision [45, 109]. Furthermore, some online CL benchmarks are also proposed for computer version tasks, such as
continual visual learning [16], continual object detection [185].

7 METRICS
In this section, we review the principal metrics commonly used to evaluate continual learning. These metrics can be
categorized into three main types: (1) overall performance, which assesses the algorithm’s effectiveness across all tasks;
(2) memory stability, which measures the extent to which an algorithm retains previously acquired knowledge; and (3)
Manuscript submitted to ACM
26 Yutao Yang et al.

learning plasticity, which evaluates the algorithm’s capacity to acquire new skills or knowledge. Each of these metrics
provides insights into different aspects of the algorithm’s performance in a continual learning context.
To begin, we establish the notation (Figure 9) used throughout the learning and evaluation phases of the model. Once
the model completes a learning task, denoted as 𝑇𝑖 , it evaluates its performance on a test set that encompasses all 𝑁
tasks, where 𝑁 is the total number of tasks in the set 𝑇 . This evaluation is represented by a matrix 𝑅 ∈ R𝑁 ×𝑁 , wherein
each element 𝑅𝑖,𝑗 indicates the model’s test classification accuracy on task 𝑇 𝑗 after training on task 𝑇𝑖 .

7.1 Overall Performance.

The metric termed “Last" [122, 228] evaluates the overall performance of a continual learning (CL) method upon the
completion of all tasks. Specifically, it computes the average score from the last row in the performance matrix 𝑅.
𝑁
1 ∑︁
𝐿𝑎𝑠𝑡 = 𝑅𝑁 ,𝑖 (1)
𝑁 𝑖=1

Also, Zheng et al. [228] devise the “Avg" score metric, which computes the mean accuracy across all datasets and
timestamps.

𝑁 𝑁
1 ∑︁ © 1 ∑︁
𝐴𝑣𝑔 = (2)
ª
𝑅𝑖,𝑗 ®
𝑁 𝑖=1 𝑁 𝑗=1
« ¬
In the seminal works of Rebuffi et al. [159] and Douillard et al. [40], the concept of Average Incremental Accuracy
(AIA) is introduced. This metric is specifically designed to quantify the historical performance across different tasks. It
calculates the average performance for each task by considering the lower triangular portion of the matrix 𝑅, effectively
capturing the evolving competence of the system as new tasks are learned.

𝑁 𝑖
1 ∑︁ © 1 ∑︁
𝐴𝐼𝐴 = (3)
ª
𝑅𝑖,𝑗 ®
𝑁 𝑖=1 𝑖 𝑗=1
« ¬
The metric, termed Transfer, is derived by computing the average of the performance values for tasks that are
represented in the upper-right triangle of matrix 𝑅. This approach uniformly weights each dataset by averaging
their performance across different tasks, thereby assessing the preservation of zero-shot transfer capabilities. Prior to
commencing learning on task 𝑇𝑖 , no fine-tuning is performed on tasks that precede 𝑇𝑖 .
𝑁 𝑖 −1
1 ∑︁ © 1 ∑︁
𝑇𝑟𝑎𝑛𝑠 𝑓 𝑒𝑟 = (4)
ª
𝑅 𝑗,𝑖 ®
𝑁 − 1 𝑖=2 𝑖 − 1 𝑗=1

« ¬
Moreover, Chaudhry et al. [22] devise a metric known as Learning Curve Area (LCA), which quantifies the speed of
learning in a model. Qin et al. [152] propose two metrics designed to evaluate pre-trained language models (PLMs)
based on their performance within learned domains: Average Perplexity (𝐴𝑃) and Average Increased Perplexity (𝐴𝑃 + ).

7.2 Memory Stability.

Memory stability is typically evaluated by backward transfer (BWT) [122] and forgetting measure (FM) [21].
Backward Transfer (BWT) emerges as a pivotal concept extensively documented in the literature, notably by Lopez
et al. [122] and Wu et al. [199]. BWT measures the performance degradation on previously mastered tasks after the
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 27

model is trained on new tasks. This performance degradation phenomenon is often referred to as “forgetting".
𝑁 −1
1 ∑︁
𝐵𝑊𝑇 = 𝑅𝑁 ,𝑖 − 𝑅𝑖,𝑖 (5)
𝑁 − 1 𝑖=1

Additionally, Chaudhry et al. [21] introduce the Forgetting Measure (FM), a metric designed to quantify the extent
of forgetting a model experiences for a specific task. A lower FM indicates better retention of previous tasks. Davari
et al. [33] propose a method named linear probes (LP) to assess representation forgetting. This approach measures
the effectiveness of learned representations via an optimal linear classifier trained on the frozen activations of a base
network. Representation forgetting is quantified by evaluating the change in Language Processing (LP) performance
before and after the introduction of a new task. Kemker et al. [87] introduce three metrics, where Ωbase assesses
retention of initial learning, Ωnew measures recall of new tasks, and Ωall evaluates overall proficiency in maintaining old
knowledge and acquiring new information. Additionally, researchers [95] devise a novel metric, termed the Knowledge
Loss Ratio (KLR), quantifies knowledge degradation using principles from information theory.

7.3 Learning Plasticity.

Evaluating learning plasticity can be effectively accomplished through two key metrics: forward transfer (FWT) [122]
and intransigence measure (IM) [21].
Forward Transfer (FWT) [122] assesses the beneficial effects on the performance of subsequent tasks following a
model’s training on prior tasks.
𝑁
1 ∑︁
𝐹𝑊𝑇 = 𝑅𝑖 −1,𝑖 − 𝑅0,𝑖 (6)
𝑁 − 1 𝑖=2
where 𝑅0,𝑖 denotes the performance metric associated with training on task 𝑖 independently. Higher values of FWT
indicate superior model performance. It is important to note that discussing backward transfer for the initial task is not
applicable, as there are no preceding tasks to influence its performance.
Intransigence measure (IM), as defined by Chaudhry et al. [21], quantifies a model’s inability to learn new tasks. This
measure is calculated by comparing the performance difference of a task when trained jointly with other tasks versus
when trained in a continual learning setting. Moreover, Koh et al. [95] introduce novel metrics, known as Knowledge
Gain Ratio (KGR), which quantifies the capacity to acquire new knowledge through the calculation of knowledge gain.

7.4 Metrics for Continual Pre-training.

7.5 Online CL-Specific Metrics.

Near-future accuracy (NFA) [2] is introduced as a novel evaluation metric for OCL problem. Unlike traditional evaluation
methods that assess models on immediately subsequent samples, NFA evaluates models on samples slightly further into
the future, using a minimal shift 𝑆. Such operation can mitigate label correlation effects, which can adversely impact the
accuracy of model adaptability assessments. The smallest shift 𝑆 is selected to ensure that the test sample aligns closely
Manuscript submitted to ACM
28 Yutao Yang et al.

with the distribution of recently observed training data. Yogatama et al. [209] proposed a novel online codelength,
inspired by prequential encoding [10], to quantify how quickly an existing model can adapt to a new task.

8 CHALLENGES AND FURTHER WORK

Autonomous Continual Learning. Most existing studies in the domain of continual learning assume static datasets
with known distributions in a relatively closed environment. Moreover, these studies mainly focus on simple tasks
(e.g., text classification, sentiment analysis and intent classification) with clear labels. These assumptions do not hold in
real-world applications, where environments continually evolve and introduce novel stimuli. A key challenge is to
develop continual learning models that operate effectively in complex, noisy environments where clear labels are not
always available, and task domains frequently change. Liu et al. [112] recently proposed the SOLA framework to address
these limitations by facilitating autonomous adaptation in AI systems. Despite this progress, significant challenges
remain in enabling these systems to independently adjust to new, dynamic environments without ongoing human
oversight. Future research should focus on developing algorithms capable of autonomously detecting and adapting to
shifts in data distribution, thereby improving the applicability of AI in dynamic real-world scenarios.

Learning Knowledge from Conversation. Traditional AI systems are typically trained on static data sets, which starkly
contrasts with human conversational learning that dynamically updates knowledge through interaction [111]. The
challenge for AI lies in transitioning from static data learning to more dynamic, conversational engagements. The future
direction in this area could involve the development of models that mimic human conversational learning processes,
capable of context adaptation, new concept inference, and dynamic knowledge application within ongoing interactions.

Multi-modal Continual Learning. Continual learning research has predominantly concentrated on natural language
processing tasks such as sentiment analysis and text classification. Recent studies have begun exploring basic multi-
modal tasks, such as text-to-image retrieval, text-image classification, and visual question answering. The integration of
diverse data types—textual, visual, and auditory—poses a substantial challenge. Future studies should expand to more
complex multi-modal datasets and strive to devise methodologies that effectively synthesize these varied modalities,
thereby enhancing the model’s capability to maintain continuous learning across different sensory inputs.

Privacy Protection in Continual Learning. Privacy protection in continual learning systems poses a significant
challenge, particularly as these systems are designed to continuously update and refine their models based on incoming
data streams. Unlike traditional static machine learning models, continual learning systems frequently access and
process sensitive data across different contexts and time periods, raising substantial concerns about data confidentiality
and user privacy. Effective privacy-preserving mechanisms must be integrated into the architecture of these systems to
ensure that they do not inadvertently expose or misuse personal data. Techniques such as differential privacy [43],
federated learning [216], and secure multi-party computation [49] offer promising solutions by allowing models to
learn from decentralized data sources without needing to access the actual data directly. Future research in continual
learning should not only focus on enhancing learning efficiency and adaptability but also prioritize the development of
robust frameworks that safeguard user privacy across all phases of data handling and model updating.

Robust Continual Learning. The existing studies mainly focus on designing a continual learning model to improve the
performance of forgetting and transferring with various metrics while the robustness of continual learning systems is
not well studied. It is critical, especially in applications where safety and reliability are paramount. The main challenges
include evaluating the robustness of these systems against adversarial attacks or when faced with drastically changing
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 29

environments. Future research could focus on developing evaluation metrics for robustness in continual learning and
designing systems that maintain performance reliability over time despite environmental changes.

Large-Scale and High-Quality Datasets and Benchmarks. As discussed in Section 6, most of the datasets are constructed
by merging the existing datasets. This often results in datasets that lack diversity and real-world complexity, which
hampers the development of robust and adaptable continual learning models. The creation of large-scale, high-quality
datasets that accurately reflect real-world complexities represents a critical challenge. Moving forward, the development
of such datasets and benchmarks will be essential not only for assessing the efficacy of continual learning algorithms
but also for pushing the limits of what these algorithms can achieve in practical settings.

9 CONCLUSIONS
This survey provides an in-depth exploration of continual learning (CL) methodologies tailored for foundation language
models (LMs), such as pre-trained language models (PLMs), large language models (LLMs), and vision-language models
(VLMs). By integrating the dynamic adaptability of CL with the robust foundational capabilities of LMs, this field
promises to significantly advance the state of artificial intelligence. We categorize existing research into offline and
online continual learning paradigms, offering a clear distinction between the settings and methodologies used within
these frameworks. Offline CL is discussed in terms of domain-incremental, task-incremental, and class-incremental
learning. Meanwhile, online CL is analyzed with a focus on the delineation between hard and blurry task boundaries,
providing insights into how these approaches handle real-time data streams. Our review of the literature not only
clarifies the current landscape of CL approaches for foundation LMs but also emphasizes the innovative integration
of continual pre-training, parameter-efficient tuning, and instruction tuning methods that are specifically designed
to leverage the vast capabilities of foundation LMs. Furthermore, we highlight the main characteristics of datasets
used in this domain and the metrics that effectively measure both the mitigation of catastrophic forgetting and the
enhancement of knowledge transfer. This work hopes to inspire further research that will ultimately lead to more
robust, efficient, and intelligent systems capable of lifelong learning.

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam
Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[2] Hasan Abed Al Kader Hammoud, Ameya Prabhu, Ser-Nam Lim, Philip HS Torr, Adel Bibi, and Bernard Ghanem. 2023. Rapid Adaptation in Online
Continual Learning: Are We Evaluating It Right?. In ICCV. 18852–18861.
[3] Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, and Lucas Page-Caccia. 2019. Online Continual
Learning with Maximal Interfered Retrieval. In NeurIPS, Vol. 32.
[4] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan,
et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022).
[5] Jihwan Bang, Hyunseo Koh, Seulki Park, Hwanjun Song, Jung-Woo Ha, and Jonghyun Choi. 2022. Online continual learning on a contaminated
data stream with blurry task boundaries. In CVPR. 9275–9284.
[6] Shawn Beaulieu, Lapo Frati, Thomas Miconi, Joel Lehman, Kenneth O Stanley, Jeff Clune, and Nick Cheney. 2020. Learning to continually learn.
arXiv preprint arXiv:2002.09571 (2020).
[7] Eden Belouadah, Adrian Popescu, and Ioannis Kanellos. 2021. A comprehensive study of class incremental learning algorithms for visual tasks.
Neural Networks 135 (2021), 38–54.
[8] Magdalena Biesialska, Katarzyna Biesialska, and Marta R Costa-jussà. 2020. Continual Lifelong Learning in Natural Language Processing: A Survey.
In COLING. 6523–6541.
[9] Raad Bin Tareaf. 2017. Tweets Dataset - Top 20 most followed users in Twitter social platform.
[10] Léonard Blier and Yann Ollivier. 2018. The description length of deep learning models. NeurIPS 31 (2018).

Manuscript submitted to ACM

30 Yutao Yang et al.

[11] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv
preprint arXiv:1506.02075 (2015).
[12] Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ–a
large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278 (2018).
[13] Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Daniel Duckworth, Semih Yavuz, Ben Goodrich, Amit Dubey,
Andy Cedilnik, and Kyu-Young Kim. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. arXiv preprint arXiv:1909.05358 (2019).
[14] Lucas Caccia, Rahaf Aljundi, Nader Asadi, Tinne Tuytelaars, Joelle Pineau, and Eugene Belilovsky. 2021. New insights on reducing abrupt
representation change in online continual learning. arXiv preprint arXiv:2104.05025 (2021).
[15] Samuel Cahyawijaya, Holy Lovenia, Tiezheng Yu, Willy Chung, and Pascale Fung. 2023. InstructAlign: High-and-Low Resource Language
Alignment via Continual Crosslingual Instruction Tuning. In the First Workshop in South East Asian Language Processing. 55–78.
[16] Zhipeng Cai, Ozan Sener, and Vladlen Koltun. 2021. Online continual learning with natural distribution shifts: An empirical study with visual data.
In ICCV. 8281–8290.
[17] Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-snli: Natural language inference with natural language
explanations. NeurIPS 31 (2018).
[18] Xusheng Cao, Haori Lu, Linlan Huang, Xialei Liu, and Ming-Ming Cheng. 2024. Generative Multi-modal Models are Good Class-Incremental
Learners. arXiv preprint arXiv:2403.18383 (2024).
[19] Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. 2020. Efficient intent detection with dual sentence encoders.
arXiv preprint arXiv:2003.04807 (2020).
[20] Giuseppe Castellucci, Simone Filice, Danilo Croce, and Roberto Basili. 2021. Learning to Solve NLP Tasks in an Incremental Number of Languages.
In ACL. 837–847.
[21] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. 2018. Riemannian walk for incremental learning: Understanding
forgetting and intransigence. In ECCV. 532–547.
[22] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. 2019. Efficient Lifelong Learning with A-GEM. In ICLR.
[23] Cheng Chen, Junchen Zhu, Xu Luo, Hengtao Shen, Lianli Gao, and Jingkuan Song. 2024. CoIN: A Benchmark of Continual Instruction tuNing for
Multimodel Large Language Model. arXiv preprint arXiv:2403.08350 (2024).
[24] Hung-Jen Chen, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, and Min Sun. 2020. Mitigating forgetting in online continual learning via instance-aware
parameterization. NeurIPS 33 (2020), 17466–17477.
[25] Wuyang Chen, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, and Claire Cui. 2023. Lifelong language pretraining with
distribution-specialized experts. In ICML. PMLR, 5383–5395.
[26] Daixuan Cheng, Shaohan Huang, and Furu Wei. 2023. Adapting large language models via reading comprehension. arXiv preprint arXiv:2309.09530
(2023).
[27] Xin Cheng, Yankai Lin, Xiuying Chen, Dongyan Zhao, and Rui Yan. 2023. Decouple knowledge from paramters for plug-and-play language
modeling. In ACL. 14288–14308.
[28] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question Answering in
Context. In EMNLP. 2174–2184.
[29] Aristotelis Chrysakis and Marie-Francine Moens. 2023. Online bias correction for task-free continual learning. ICLR (2023).
[30] Andrea Cossu, Tinne Tuytelaars, Antonio Carta, Lucia Passaro, Vincenzo Lomonaco, and Davide Bacciu. 2022. Continual pre-training mitigates
forgetting in language and vision. arXiv preprint arXiv:2205.09357 (2022).
[31] Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco
Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice
interfaces. arXiv preprint arXiv:1805.10190 (2018).
[32] Zhenyu Cui, Yuxin Peng, Xun Wang, Manyu Zhu, and Jiahuan Zhou. 2024. Continual Vision-Language Retrieval via Dynamic Knowledge
Rectification. In AAAI, Vol. 38. 11704–11712.
[33] MohammadReza Davari, Nader Asadi, Sudhir Mudur, Rahaf Aljundi, and Eugene Belilovsky. 2022. Probing representation forgetting in supervised
and unsupervised continual learning. In CVPR. 16712–16721.
[34] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. 2021. A continual
learning survey: Defying forgetting in classification tasks. TPAMI 44, 7 (2021), 3366–3385.
[35] Cyprien de Masson D’Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. 2019. Episodic memory in lifelong language learning. NeurIPS
32 (2019).
[36] Luca Della Libera, Pooneh Mousavi, Salah Zaiem, Cem Subakan, and Mirco Ravanelli. 2023. CL-MASR: A Continual Learning Benchmark for
Multilingual ASR. arXiv preprint arXiv:2310.16931 (2023).
[37] Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Tom van Sonsbeek, Xiantong Zhen, Dwarikanath Mahapatra, Marcel Worring, and Cees GM
Snoek. 2022. Lifelonger: A benchmark for continual disease classification. In International Conference on Medical Image Computing and Computer-
Assisted Intervention. 314–324.
[38] Xiaowen Ding, Bing Liu, and Philip S Yu. 2008. A holistic lexicon-based approach to opinion mining. In WSDM. 231–240.

Manuscript submitted to ACM

Recent Advances of Foundation Language Models-based Continual Learning: A Survey 31

[39] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey on in-context
learning. arXiv preprint arXiv:2301.00234 (2022).
[40] Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, and Eduardo Valle. 2020. Podnet: Pooled outputs distillation for small-tasks
incremental learning. In ECCV. 86–102.
[41] Mingzhe Du, Anh Tuan Luu, Bin Ji, and See-kiong Ng. 2023. From Static to Dynamic: A Continual Learning Framework for Large Language
Models. arXiv preprint arXiv:2310.14248 (2023).
[42] Yifan Du, Zikang Liu, Junyi Li, and Wayne Xin Zhao. 2022. A Survey of Vision-Language Pre-Trained Models. In IJCAI. 5436–5443.
[43] Cynthia Dwork. 2006. Differential privacy. In International colloquium on automata, languages, and programming. 1–12.
[44] Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. 2022. Understanding dataset difficulty with V-usable information. In ICML. 5988–6008.
[45] Kamil Faber, Dominik Zurek, Marcin Pietron, Nathalie Japkowicz, Antonio Vergari, and Roberto Corizzo. 2023. From MNIST to ImageNet and Back:
Benchmarking Continual Curriculum Learning. arXiv preprint arXiv:2303.11076 (2023).
[46] Enrico Fini, Stéphane Lathuiliere, Enver Sangineto, Moin Nabi, and Elisa Ricci. 2020. Online continual learning under extreme memory constraints.
In ECCV 2020. 720–735.
[47] Binzong Geng, Fajie Yuan, Qiancheng Xu, Ying Shen, Ruifeng Xu, and Min Yang. 2021. Continual Learning for Task-oriented Dialogue System
with Iterative Network Pruning, Expanding and Masking. In ACL. 517–523.
[48] Evangelia Gogoulou, Timothée Lesort, Magnus Boman, and Joakim Nivre. 2023. A study of continual learning under language shift. arXiv preprint
arXiv:2311.01200 (2023).
[49] Oded Goldreich. 1998. Secure multi-party computation. Manuscript. Preliminary version 78, 110 (1998), 1–108.
[50] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image
understanding in visual question answering. In CVPR. 6904–6913.
[51] Michael Grubinger, Paul Clough, Henning Müller, and Thomas Deselaers. 2006. The iapr tc-12 benchmark: A new evaluation resource for visual
information systems. In International workshop ontoImage, Vol. 2.
[52] Yanan Gu, Xu Yang, Kun Wei, and Cheng Deng. 2022. Not just selection, but exploration: Online class-incremental continual learning via dual view
consistency. In CVPR. 7442–7451.
[53] Nuwan Gunasekara, Bernhard Pfahringer, Heitor Murilo Gomes, and Albert Bifet. 2023. Survey on Online Streaming Continual Learning. In IJCAI.
[54] Yiduo Guo, Bing Liu, and Dongyan Zhao. 2022. Online continual learning through mutual information maximization. In ICML. 8109–8126.
[55] Yiduo Guo, Bing Liu, and Dongyan Zhao. 2023. Dealing with Cross-Task Class Discrimination in Online Continual Learning. In CVPR. 11878–11887.
[56] Sonal Gupta, Rushin Shah, Mrinal Mohit, Anuj Kumar, and Mike Lewis. 2018. Semantic parsing for task oriented dialog using hierarchical
representations. arXiv preprint arXiv:1810.07942 (2018).
[57] Suchin Gururangan, Mike Lewis, Ari Holtzman, Noah A Smith, and Luke Zettlemoyer. 2022. DEMix Layers: Disentangling Domains for Modular
Language Modeling. In NAACL. 5557–5576.
[58] Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. FewRel: A large-scale supervised few-shot relation
classification dataset with state-of-the-art evaluation. arXiv preprint arXiv:1810.10147 (2018).
[59] Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. 2024. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey. arXiv
preprint arXiv:2403.14608 (2024).
[60] Tyler L Hayes, Giri P Krishnan, Maxim Bazhenov, Hava T Siegelmann, Terrence J Sejnowski, and Christopher Kanan. 2021. Replay in deep learning:
Current approaches and missing biological elements. Neural computation 33, 11 (2021), 2908–2950.
[61] Jiangpeng He, Runyu Mao, Zeman Shao, and Fengqing Zhu. 2020. Incremental learning in online scenario. In CVPR. 13926–13935.
[62] Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In
WWW. 507–517.
[63] Nithin Holla, Pushkar Mishra, Helen Yannakoudakis, and Ekaterina Shutova. 2020. Meta-learning with sparse experience replay for lifelong
language learning. arXiv preprint arXiv:2009.04891 (2020).
[64] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain
Gelly. 2019. Parameter-efficient transfer learning for NLP. In ICML. 2790–2799.
[65] Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% Solution. In NAACL. 57–60.
[66] Hexiang Hu, Ozan Sener, Fei Sha, and Vladlen Koltun. 2022. Drinking from a firehose: Continual learning with web-scale natural language. TPAMI
45, 5 (2022), 5684–5696.
[67] Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In SIGKDD. 168–177.
[68] Yufan Huang, Yanzhe Zhang, Jiaao Chen, Xuezhi Wang, and Diyi Yang. 2021. Continual Learning for Text Classification with Information
Disentanglement Based Regularization. In ACL. 2736–2746.
[69] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet challenge: Evaluating the state of
semantic code search. arXiv preprint arXiv:1909.09436 (2019).
[70] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. arXiv preprint
arXiv:1808.09588 (2018).
[71] Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung Kim, Lajanugen Logeswaran, Moontae Lee, Kyungjae Lee, and Minjoon Seo. 2023. Exploring
the benefits of training expert language models over instruction tuning. In ICML. 14702–14729.
Manuscript submitted to ACM
32 Yutao Yang et al.

[72] Joel Jang, Seonghyeon Ye, Changho Lee, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, and Minjoon Seo. 2022. TemporalWiki: A
Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models. In EMNLP. 6237–6250.
[73] Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, and Minjoon Seo. 2022. Towards
Continual Knowledge Learning of Language Models. In ICLR.
[74] Khurram Javed and Martha White. 2019. Meta-learning representations for continual learning. NeurIPS 32 (2019).
[75] Saurav Jha, Dong Gong, and Lina Yao. 2024. CLAP4CLIP: Continual Learning with Probabilistic Finetuning for Vision-Language Models. arXiv
preprint arXiv:2403.19137 (2024).
[76] Xisen Jin, Bill Yuchen Lin, Mohammad Rostami, and Xiang Ren. 2021. Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation
for Few-shot Learning. In EMNLP. 714–729.
[77] Xisen Jin, Arka Sadhu, Junyi Du, and Xiang Ren. 2021. Gradient-based editing of memory examples for online task-free continual learning. NeurIPS
34 (2021), 29193–29205.
[78] Xisen Jin, Dejiao Zhang, Henghui Zhu, Wei Xiao, Shang-Wen Li, Xiaokai Wei, Andrew Arnold, and Xiang Ren. 2022. Lifelong Pretraining:
Continually Adapting Language Models to Emerging Corpora. In NAACL. 4764–4780.
[79] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading
Comprehension. In ACL. 1601–1611.
[80] Kushal Kafle and Christopher Kanan. 2017. An analysis of visual question answering algorithms. In ICCV. 1965–1973.
[81] Zixuan Ke, Haowei Lin, Yijia Shao, Hu Xu, Lei Shu, and Bing Liu. 2022. Continual Training of Language Models for Few-Shot Learning. In EMNLP.
10205–10216.
[82] Zixuan Ke and Bing Liu. 2022. Continual learning of natural language processing tasks: A survey. arXiv preprint arXiv:2211.12701 (2022).
[83] Zixuan Ke, Bing Liu, Nianzu Ma, Hu Xu, and Lei Shu. 2021. Achieving Forgetting Prevention and Knowledge Transfer in Continual Learning. In
NeurIPS, Vol. 34. 22443–22456.
[84] Zixuan Ke, Bing Liu, Hu Xu, and Lei Shu. 2021. CLASSIC: Continual and Contrastive Learning of Aspect Sentiment Classification Tasks. In EMNLP.
[85] Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. 2023. Continual Pre-training of Language Models. In ICLR.
[86] Zixuan Ke and Hu Xu. 2021. Adapting BERT for Continual Learning of a Sequence of Aspect Sentiment Classification Tasks. In NAACL.
[87] Ronald Kemker, Marc McClure, Angelina Abitino, Tyler Hayes, and Christopher Kanan. 2018. Measuring catastrophic forgetting in neural networks.
In AAAI, Vol. 32.
[88] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding. In NAACL. 4171–4186.
[89] Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Didier Stricker, Federico Tombari, and Muhammad Zeshan Afzal. 2023.
Introducing language guidance in prompt-based continual learning. In ICCV. 11463–11473.
[90] Junsu Kim, Yunhoe Ku, Jihyeon Kim, Junuk Cha, and Seungryul Baek. 2024. VLM-PL: Advanced Pseudo Labeling approach Class Incremental
Object Detection with Vision-Language Model. arXiv preprint arXiv:2403.05346 (2024).
[91] Sein Kim, Namkyeong Lee, Donghyun Kim, Minchul Yang, and Chanyoung Park. 2023. Task Relation-aware Continual User Representation
Learning. In SIGKDD. 1107–1119.
[92] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho,
Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. the national academy of sciences 114, 13 (2017),
3521–3526.
[93] Jihoon Ko, Shinhwan Kang, and Kijung Shin. 2022. BeGin: Extensive Benchmark Scenarios and An Easy-to-use Framework for Graph Continual
Learning. arXiv preprint arXiv:2211.14568 (2022).
[94] Hyunseo Koh, Dahyun Kim, Jung-Woo Ha, and Jonghyun Choi. 2021. Online continual learning on class incremental blurry task configuration
with anytime inference. arXiv preprint arXiv:2110.10031 (2021).
[95] Hyunseo Koh, Minhyuk Seo, Jihwan Bang, Hwanjun Song, Deokki Hong, Seulki Park, Jung-Woo Ha, and Jonghyun Choi. 2022. Online Boundary-Free
Continual Learning by Scheduled Data Prior. In ICLR.
[96] Kamran Kowsari, Donald E Brown, Mojtaba Heidarysafa, Kiana Jafari Meimandi, Matthew S Gerber, and Laura E Barnes. 2017. Hdltex: Hierarchical
deep learning for text classification. In ICMLA. 364–371.
[97] Ken Lang. 1995. Newsweeder: Learning to filter netnews. In Machine learning proceedings. 331–339.
[98] Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A
Laurenzano, Lingjia Tang, et al. 2019. An evaluation dataset for intent classification and out-of-scope prediction. arXiv preprint arXiv:1909.02027
(2019).
[99] Kuan-Ying Lee, Yuanyi Zhong, and Yu-Xiong Wang. 2023. Do pre-trained models benefit equally in continual learning?. In WCACV. 6485–6493.
[100] Sungjin Lee. 2017. Toward continual learning for conversational agents. arXiv preprint arXiv:1712.09943 (2017).
[101] Timothée Lesort, Vincenzo Lomonaco, Andrei Stoian, Davide Maltoni, David Filliat, and Natalia Díaz-Rodríguez. 2020. Continual learning for
robotics: Definition, framework, learning strategies, opportunities and challenges. Information fusion 58 (2020), 52–68.
[102] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL.

Manuscript submitted to ACM

Recent Advances of Foundation Language Models-based Continual Learning: A Survey 33

[103] Dingcheng Li, Zheng Chen, Eunah Cho, Jie Hao, Xiaohu Liu, Fan Xing, Chenlei Guo, and Yang Liu. 2022. Overcoming catastrophic forgetting
during domain adaptation of seq2seq language generation. In NAACL. 5441–5454.
[104] Feng-Lin Li, Minghui Qiu, Haiqing Chen, Xiongwei Wang, Xing Gao, Jun Huang, Juwei Ren, Zhongzhou Zhao, Weipeng Zhao, Lei Wang, et al.
2017. Alime assist: An intelligent assistant for creating an innovative e-commerce experience. In CIKM. 2495–2498.
[105] Guodun Li, Yuchen Zhai, Qianglong Chen, Xing Gao, Ji Zhang, and Yin Zhang. 2022. Continual few-shot intent detection. In COLING. 333–343.
[106] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and
language. arXiv preprint arXiv:1908.03557 (2019).
[107] Huiwei Lin, Baoquan Zhang, Shanshan Feng, Xutao Li, and Yunming Ye. 2023. PCR: Proxy-based Contrastive Replay for Online Class-Incremental
Continual Learning. In CVPR. 24246–24255.
[108] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft
coco: Common objects in context. In ECCV. 740–755.
[109] Zhiqiu Lin, Jia Shi, Deepak Pathak, and Deva Ramanan. 2021. The clear benchmark: Continual learning on real-world imagery. In NeurIPS.
[110] Bing Liu. 2020. Learning on the job: Online lifelong and continual learning. In AAAI, Vol. 34. 13544–13549.
[111] Bing Liu and Sahisnu Mazumder. 2021. Lifelong and continual learning dialogue systems: learning during conversation. In AAAI, Vol. 35.
15058–15063.
[112] Bing Liu, Sahisnu Mazumder, Eric Robertson, and Scott Grigsby. 2023. AI Autonomy: Self-initiated Open-world Continual Learning and Adaptation.
AI Magazine (2023).
[113] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. NeurIPS 36 (2024).
[114] Qian Liu, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. 2015. Automated rule selection for aspect extraction in opinion mining. In IJCAI.
[115] Qingbin Liu, Xiaoyan Yu, Shizhu He, Kang Liu, and Jun Zhao. 2021. Lifelong intent detection via multi-strategy rebalancing. arXiv preprint
arXiv:2108.04445 (2021).
[116] Tianlin Liu, Lyle Ungar, and João Sedoc. 2019. Continual Learning for Sentence Representations Using Conceptors. In NAACL. 3274–3279.
[117] Xialei Liu, Xusheng Cao, Haori Lu, Jia-wen Xiao, Andrew D Bagdanov, and Ming-Ming Cheng. 2023. Class Incremental Learning with Pre-trained
Vision-Language Models. arXiv preprint arXiv:2310.20348 (2023).
[118] Xingkun Liu, Arash Eshghi, Pawel Swietojanski, and Verena Rieser. 2021. Benchmarking natural language understanding services for building
conversational agents. In International Workshop on Spoken Dialogue Systems. 165–183.
[119] Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2022. P-Tuning: Prompt Tuning Can Be Comparable to
Fine-tuning Across Scales and Tasks. In ACL. 61–68.
[120] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019.
Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[121] Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC: The Semantic Scholar Open Research Corpus. In ACL.
4969–4983.
[122] David Lopez-Paz and Marc’Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. NeurIPS 30 (2017).
[123] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al.
2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).
[124] Xiaoqiang Lu, Binqiang Wang, Xiangtao Zheng, and Xuelong Li. 2017. Exploring models and data for remote sensing image caption generation.
IEEE Transactions on Geoscience and Remote Sensing 56, 4 (2017), 2183–2195.
[125] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. 2023. An Empirical Study of Catastrophic Forgetting in Large Language
Models During Continual Fine-tuning. arXiv e-prints (2023).
[126] Shirong Ma, Shen Huang, Shulin Huang, Xiaobin Wang, Yangning Li, Hai-Tao Zheng, Pengjun Xie, Fei Huang, and Yong Jiang. 2023. EcomGPT-CT:
Continual pre-training of e-commerce large language models with semi-structured data. arXiv preprint arXiv:2312.15696 (2023).
[127] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment
Analysis. In ACL. 142–150.
[128] Andrea Madotto, Zhaojiang Lin, Zhenpeng Zhou, Seungwhan Moon, Paul Crook, Bing Liu, Zhou Yu, Eunjoon Cho, and Zhiguang Wang. 2020.
Continual learning in task-oriented dialogue systems. arXiv preprint arXiv:2012.15504 (2020).
[129] Aru Maekawa, Hidetaka Kamigaito, Kotaro Funakoshi, and Manabu Okumura. 2023. Generative Replay Inspired by Hippocampal Memory Indexing
for Continual Language Learning. In EACL. 930–942.
[130] Zheda Mai, Ruiwen Li, Jihwan Jeong, David Quispe, Hyunwoo Kim, and Scott Sanner. 2022. Online continual learning in image classification: An
empirical survey. Neurocomputing 469 (2022), 28–51.
[131] Marc Masana, Xialei Liu, Bartłomiej Twardowski, Mikel Menta, Andrew D Bagdanov, and Joost Van De Weijer. 2022. Class-incremental learning:
survey and performance evaluation on image classification. TPAMI 45, 5 (2022), 5513–5533.
[132] Sahisnu Mazumder and Bing Liu. 2024. Lifelong and Continual Learning Dialogue Systems. Springer Nature.
[133] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. [n. d.]. The natural language decathlon: Multitask learning as question
answering. arXiv 2018. arXiv preprint arXiv:1806.08730 ([n. d.]).
[134] Sanket Vaibhav Mehta, Darshan Patil, Sarath Chandar, and Emma Strubell. 2021. An Empirical Investigation of the Role of Pre-training in Lifelong
Learning. (2021).
Manuscript submitted to ACM
34 Yutao Yang et al.

[135] Umberto Michieli, Pablo Peso Parada, and Mete Ozay. 2023. Online Continual Learning in Keyword Spotting for Low-Resource Devices via Pooling
High-Order Temporal Statistics. arXiv preprint arXiv:2307.12660 (2023).
[136] Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth. 2023.
Recent advances in natural language processing via large pre-trained language models: A survey. Comput. Surveys 56, 2 (2023), 1–40.
[137] Jisoo Mok, Jaeyoung Do, Sungjin Lee, Tara Taghavi, Seunghak Yu, and Sungroh Yoon. 2023. Large-scale Lifelong Learning of In-context Instructions
and How to Tackle It. In ACL. 12573–12589.
[138] Natawut Monaikul, Giuseppe Castellucci, Simone Filice, and Oleg Rokhlenko. 2021. Continual learning for named entity recognition. In AAAI,
Vol. 35. 13570–13577.
[139] Jun-Yeong Moon, Keon-Hee Park, Jung Uk Kim, and Gyeong-Moon Park. 2023. Online Class Incremental Learning on Stochastic Blurry Task
Boundary via Mask and Visual Prompt Tuning. In ICCV. 11731–11741.
[140] Nandini Mundra, Sumanth Doddapaneni, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully, and Mitesh M Khapra. 2024. A Comprehensive
Analysis of Adapter Efficiency. In IKDD. 136–154.
[141] Zixuan Ni, Longhui Wei, Siliang Tang, Yueting Zhuang, and Qi Tian. 2023. Continual vision-language representation learning with off-diagonal
information. In ICML. 26129–26149.
[142] Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. arXiv preprint
arXiv:1706.09254 (2017).
[143] Alex Ororbia, Ankur Mali, C Lee Giles, and Daniel Kifer. 2022. Lifelong neural predictive coding: Learning cumulatively online without forgetting.
NeurIPS 35 (2022), 5867–5881.
[144] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex
Ray, et al. 2022. Training language models to follow instructions with human feedback. NeurIPS 35 (2022), 27730–27744.
[145] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. 2019. Continual lifelong learning with neural networks: A
review. Neural networks 113 (2019), 54–71.
[146] Bohao PENG, Zhuotao Tian, Shu Liu, Ming-Chang Yang, and Jiaya Jia. 2024. Scalable Language Model with Generalized Continual Learning. In
ICLR.
[147] Clifton Poth, Hannah Sterz, Indraneil Paul, Sukannya Purkayastha, Leon Engländer, Timo Imhof, Ivan Vulić, Sebastian Ruder, Iryna Gurevych, and
Jonas Pfeiffer. 2023. Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning. In EMNLP. 149–160.
[148] Zi Qian, Xin Wang, Xuguang Duan, Pengda Qin, Yuhong Li, and Wenwu Zhu. 2023. Decouple before interact: Multi-modal prompt learning for
continual visual question answering. In ICCV. 2953–2962.
[149] Chengwei Qin and Shafiq Joty. 2022. Continual Few-shot Relation Learning via Embedding Space Regularization and Data Augmentation. In ACL.
2776–2789.
[150] Chengwei Qin and Shafiq Joty. 2022. LFPT5: A Unified Framework for Lifelong Few-shot Language Learning Based on Prompt Tuning of T5. In
ICLR.
[151] Yujia Qin, Cheng Qian, Xu Han, Yankai Lin, Huadong Wang, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2023. Recyclable Tuning for
Continual Pre-training. In ACL. 11403–11426.
[152] Yujia Qin, Jiajie Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2022. ELLE: Efficient Lifelong Pre-training for Emerging
Data. In ACL. 2789–2810.
[153] Haoxuan Qu, Hossein Rahmani, Li Xu, Bryan Williams, and Jun Liu. 2021. Recent advances of continual learning in computer vision: An overview.
arXiv preprint arXiv:2109.11369 (2021).
[154] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack
Clark, et al. 2021. Learning transferable visual models from natural language supervision. In ICML. PMLR, 8748–8763.
[155] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In
EMNLP. 2383–2392.
[156] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image
generation. In ICML. Pmlr, 8821–8831.
[157] Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020. Towards scalable multi-domain conversational agents:
The schema-guided dialogue dataset. In AAAI, Vol. 34. 8689–8696.
[158] Anastasia Razdaibiedina, Yuning Mao, Rui Hou, Madian Khabsa, Mike Lewis, and Amjad Almahairi. 2023. Progressive Prompts: Continual Learning
for Language Models. In ICLR.
[159] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. 2017. iCaRL: Incremental Classifier and Representation
Learning. In CVPR.
[160] Erik F Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv
preprint cs/0306050 (2003).
[161] Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François
Yvon, Matthias Gallé, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2022).
[162] Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2018. Cross-lingual transfer learning for multilingual task oriented dialog. arXiv
preprint arXiv:1810.13327 (2018).
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 35

[163] Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. 2022. Fine-tuned Language Models are Continual Learners. In EMNLP. 6107–6122.
[164] Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. 2022. Fine-tuned language models are continual learners. In EMNLP. 6107–6122.
[165] Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint
arXiv:1704.04368 (2017).
[166] Rudy Semola, Vincenzo Lomonaco, and Davide Bacciu. 2022. Continual-learning-as-a-service (claas): On-demand efficient adaptation of predictive
models. arXiv preprint arXiv:2206.06957 (2022).
[167] Khadija Shaheen, Muhammad Abdullah Hanif, Osman Hasan, and Muhammad Shafique. 2022. Continual learning for real-world autonomous
systems: Algorithms, challenges and frameworks. Journal of Intelligent & Robotic Systems 105, 1 (2022), 9.
[168] Yilin Shen, Xiangyu Zeng, and Hongxia Jin. 2019. A progressive model to enable continual learning for semantic slot filling. In EMNLP. 1279–1284.
[169] Dongsub Shim, Zheda Mai, Jihwan Jeong, Scott Sanner, Hyunwoo Kim, and Jongseong Jang. 2021. Online class-incremental continual learning
with adversarial shapley value. In AAAI, Vol. 35. 9630–9638.
[170] Chenyang Song, Xu Han, Zheni Zeng, Kuai Li, Chen Chen, Zhiyuan Liu, Maosong Sun, and Tao Yang. 2023. ConPET: Continual Parameter-Efficient
Tuning for Large Language Models. arXiv preprint arXiv:2309.14763 (2023).
[171] Fan-Keng Sun, Cheng-Hao Ho, and Hung-Yi Lee. 2020. LAMOL: LAnguage MOdeling for Lifelong Language Learning. In ICLR. OpenReview.net.
[172] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2020. Ernie 2.0: A continual pre-training framework for
language understanding. In AAAI, Vol. 34. 8968–8975.
[173] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric
Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[174] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on
learning bug-fixing patches in the wild via neural machine translation. TOSEM 28, 4 (2019), 1–29.
[175] Gido M Van de Ven, Tinne Tuytelaars, and Andreas S Tolias. 2022. Three types of incremental learning. Nature Machine Intelligence 4, 12 (2022),
1185–1197.
[176] Steven Vander Eeckt et al. 2023. Rehearsal-Free Online Continual Learning for Automatic Speech Recognition. arXiv e-prints (2023), arXiv–2306.
[177] Vaibhav Varshney, Mayur Patidar, Rajat Kumar, Lovekesh Vig, and Gautam Shroff. 2022. Prompt augmented generative replay via supervised
contrastive learning for lifelong intent detection. In NAACL. 1113–1127.
[178] Eli Verwimp, Kuo Yang, Sarah Parisot, Lanqing Hong, Steven McDonagh, Eduardo Pérez-Pellitero, Matthias De Lange, and Tinne Tuytelaars. 2023.
Clad: A realistic continual learning benchmark for autonomous driving. Neural Networks 161 (2023), 659–669.
[179] Michael Volske, Martin Potthast, Shahbaz Syed, and Benno Stein. 2017. TL;DR: Mining Reddit to Learn Automatic Summarization. In the Workshop
on New Frontiers in Summarization. 59–63.
[180] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A
stickier benchmark for general-purpose language understanding systems. NeurIPS 32 (2019).
[181] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis
Platform for Natural Language Understanding. In EMNLP. 353–355.
[182] Chengyu Wang, Haojie Pan, Yuan Liu, Kehan Chen, Minghui Qiu, Wei Zhou, Jun Huang, Haiqing Chen, Wei Lin, and Deng Cai. 2021. Mell:
Large-scale extensible user intent classification for dialogue systems with meta lifelong learning. In SIGKDD. 3649–3659.
[183] Han Wang, Ruiliu Fu, Xuejun Zhang, and Jun Zhou. 2022. RVAE-LAMOL: Residual Variational Autoencoder to Enhance Lifelong Language
Learning. In IJCNN. IEEE, 1–9.
[184] Hong Wang, Wenhan Xiong, Mo Yu, Xiaoxiao Guo, Shiyu Chang, and William Yang Wang. 2019. Sentence Embedding Alignment for Lifelong
Relation Extraction. In NAACL. 796–806.
[185] Jianren Wang, Xin Wang, Yue Shang-Guan, and Abhinav Gupta. 2021. Wanderlust: Online continual object detection in the real world. In ICCV.
10829–10838.
[186] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. 2024. A comprehensive survey of continual learning: Theory, method and application.
TPAMI (2024).
[187] Quanziang Wang, Renzhen Wang, Yichen Wu, Xixi Jia, and Deyu Meng. 2023. CBA: Improving Online Continual Learning via Continual Bias
Adaptor. In ICCV. 19082–19092.
[188] Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, and Wen Gao. 2023. Large-scale
multi-modal pre-trained models: A comprehensive survey. Machine Intelligence Research 20, 4 (2023), 447–482.
[189] Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. 2023. Orthogonal Subspace
Learning for Language Model Continual Learning. In EMNLP. 10658–10671.
[190] Xiao Wang, Yuansen Zhang, Tianze Chen, Songyang Gao, Senjie Jin, Xianjun Yang, Zhiheng Xi, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, and
Xuanjing Huang. 2024. TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models.
[191] Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. 2022. S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental
learning. NeurIPS 35 (2022), 5682–5695.
[192] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan
Dhanasekaran, Atharva Naik, David Stap, et al. 2022. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks.
arXiv preprint arXiv:2204.07705 (2022).
Manuscript submitted to ACM
36 Yutao Yang et al.

[193] Zhicheng Wang, Yufang Liu, Tao Ji, Xiaoling Wang, Yuanbin Wu, Congcong Jiang, Ye Chao, Zhencong Han, Ling Wang, Xu Shao, et al. 2023.
Rehearsal-free Continual Language Learning via Efficient Parameter Isolation. In ACL. 10933–10946.
[194] Zirui Wang, Sanket Vaibhav Mehta, Barnabas Poczos, and Jaime Carbonell. 2020. Efficient Meta Lifelong-Learning with Limited Memory. In
EMNLP. 535–548.
[195] Yujie Wei, Jiaxin Ye, Zhizhong Huang, Junping Zhang, and Hongming Shan. 2023. Online Prototype Learning for Online Continual Learning. In
ICCV. 18764–18774.
[196] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned lstm-based natural
language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745 (2015).
[197] Mateusz Wójcik, Witold Kościukiewicz, Mateusz Baran, Tomasz Kajdanowicz, and Adam Gonczarek. 2023. Domain-Agnostic Neural Architecture
for Class Incremental Continual Learning in Document Processing Platform. In ACL. 527–537.
[198] Maciej Wołczyk, Michał Zając, Razvan Pascanu, Łukasz Kuciński, and Piotr Miłoś. 2021. Continual world: A robotic benchmark for continual
reinforcement learning. NeurIPS 34 (2021), 28496–28510.
[199] Yuhao Wu, Tongjun Shi, Karthick Sharma, Chun Wei Seah, and Shuhao Zhang. 2023. Online Continual Knowledge Learning for Language Models.
arXiv preprint arXiv:2311.09632 (2023).
[200] Congying Xia, Wenpeng Yin, Yihao Feng, and Philip Yu. 2021. Incremental few-shot text classification with multi-round new classes: Formulation,
dataset and system. arXiv preprint arXiv:2104.11882 (2021).
[201] Yong Xie, Karan Aggarwal, and Aitzaz Ahmad. 2023. Efficient continual pre-training for building domain specific large language models. arXiv
preprint arXiv:2311.08545 (2023).
[202] Hu Xu, Bing Liu, Lei Shu, and Philip S Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis. arXiv
preprint arXiv:1904.02232 (2019).
[203] Jiaming Xu, Bo Xu, Peng Wang, Suncong Zheng, Guanhua Tian, and Jun Zhao. 2017. Self-taught convolutional neural networks for short text
clustering. Neural Networks 88 (2017), 22–31.
[204] Prateek Yadav, Qing Sun, Hantian Ding, Xiaopeng Li, Dejiao Zhang, Ming Tan, Xiaofei Ma, Parminder Bhatia, Ramesh Nallapati, Murali Krishna
Ramanathan, et al. 2023. Exploring continual learning for code generation models. arXiv preprint arXiv:2307.02435 (2023).
[205] An Yang, Junyang Lin, Rui Men, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Jiamang Wang, Yong Li, et al. 2021. M6-t: Exploring
sparse expert models and beyond. arXiv preprint arXiv:2105.15082 (2021).
[206] Huahui Yi, Ziyuan Qin, Qicheng Lao, Wei Xu, Zekun Jiang, Dequan Wang, Shaoting Zhang, and Kang Li. 2023. Towards General Purpose Medical
AI: Continual Learning Medical Foundation Model. arXiv preprint arXiv:2303.06580 (2023).
[207] Haiyan Yin, Ping Li, et al. 2021. Mitigating forgetting in online continual learning with neuron calibration. NeurIPS 34 (2021), 10260–10272.
[208] Wenpeng Yin, Jia Li, and Caiming Xiong. 2022. ConTinTin: Continual Learning from Task Instructions. In ACL. 3062–3072.
[209] Dani Yogatama, Cyprien de Masson d’Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang
Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373 (2019).
[210] Jaehong Yoon, Divyam Madaan, Eunho Yang, and Sung Ju Hwang. 2021. Online coreset selection for rehearsal-based continual learning. arXiv
preprint arXiv:2106.01085 (2021).
[211] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for
semantic inference over event descriptions. TACL 2 (2014), 67–78.
[212] Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. 2024. Boosting Continual Learning of Vision-Language
Models via Mixture-of-Experts Adapters. In CVPR.
[213] Fajie Yuan, Xiangnan He, Alexandros Karatzoglou, and Liguang Zhang. 2020. Parameter-efficient transfer from sequential behaviors for user
modeling and recommendation. In SIGIR. 1469–1478.
[214] Fajie Yuan, Guoxiao Zhang, Alexandros Karatzoglou, Joemon Jose, Beibei Kong, and Yudong Li. 2021. One Person, One Model, One World: Learning
Continual User Representation without Forgetting. 696–705.
[215] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake
news. NeurIPS 32 (2019).
[216] Chen Zhang, Yu Xie, Hang Bai, Bin Yu, Weihong Li, and Yuan Gao. 2021. A survey on federated learning. Knowledge-Based Systems 216 (2021),
106775.
[217] Han Zhang, Lin Gui, Yuanzhao Zhai, Hui Wang, Yu Lei, and Ruifeng Xu. 2023. Copf: Continual learning human preference through optimal policy
fitting. arXiv preprint arXiv:2310.15694 (2023).
[218] Han Zhang, Yu Lei, Lin Gui, Min Yang, Yulan He, Hui Wang, and Ruifeng Xu. 2024. CPPO: Continual Learning for Reinforcement Learning with
Human Feedback. In ICLR.
[219] Peiyan Zhang and Sunghun Kim. 2023. A Survey on Incremental Update for Neural Recommender Systems. arXiv preprint arXiv:2303.02851 (2023).
[220] Xi Zhang, Feifei Zhang, and Changsheng Xu. 2023. Vqacl: A novel visual question answering continual learning setting. In CVPR. 19102–19112.
[221] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. NeurIPS 28 (2015).
[222] Yaqian Zhang, Bernhard Pfahringer, Eibe Frank, Albert Bifet, Nick Jin Sean Lim, and Yunzhe Jia. 2022. A simple but strong baseline for online
continual learning: Repeated augmented rehearsal. NeurIPS 35 (2022), 14771–14783.
[223] Yanzhe Zhang, Xuezhi Wang, and Diyi Yang. 2022. Continual Sequence Generation with Adaptive Compositional Modules. In ACL. 3653–3667.
Manuscript submitted to ACM
Recent Advances of Foundation Language Models-based Continual Learning: A Survey 37

[224] Yating Zhang, Yexiang Wang, Fei Cheng, Sadao Kurohashi, et al. 2023. Reformulating Domain Adaptation of Large Language Models as Adapt-
Retrieve-Revise. arXiv preprint arXiv:2310.03328 (2023).
[225] Zihan Zhang, Meng Fang, Ling Chen, and Mohammad-Reza Namazi-Rad. 2023. Citb: A benchmark for continual instruction tuning. arXiv preprint
arXiv:2310.14510 (2023).
[226] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al.
2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
[227] Yingxiu Zhao, Yinhe Zheng, Zhiliang Tian, Chang Gao, Jian Sun, and Nevin L. Zhang. 2022. Prompt Conditioned VAE: Enhancing Generative
Replay for Lifelong Learning in Task-Oriented Dialogue. In EMNLP. 11153–11169.
[228] Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, and Yang You. 2023. Preventing zero-shot transfer degradation in continual
learning of vision-language models. In ICCV. 19125–19136.
[229] Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement
learning. arXiv preprint arXiv:1709.00103 (2017).
[230] Da-Wei Zhou, Qi-Wei Wang, Zhi-Hong Qi, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. 2023. Deep class-incremental learning: A survey. arXiv
preprint arXiv:2302.03648 (2023).
[231] Da-Wei Zhou, Yuanhan Zhang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. 2023. Learning without forgetting for vision-language
models. arXiv preprint arXiv:2305.19270 (2023).
[232] Jie Zhou, Pei Ke, Xipeng Qiu, Minlie Huang, and Junping Zhang. 2023. ChatGPT: potential, prospects, and limitations. Frontiers of Information
Technology & Electronic Engineering (2023), 1–6.
[233] Hongguang Zhu, Yunchao Wei, Xiaodan Liang, Chunjie Zhang, and Yao Zhao. 2023. Ctp: Towards vision-language continual pretraining via
compatible momentum contrast and topology preservation. In ICCV. 22257–22267.
[234] Qi Zhu, Bing Li, Fei Mi, Xiaoyan Zhu, and Minlie Huang. 2022. Continual Prompt Tuning for Dialog State Tracking. In ACL. 1124–1137.
[235] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies:
Towards story-like visual explanations by watching movies and reading books. In ICCV. 19–27.

A DETAILS OF METRICS
A.1 Overall Performance.
Moreover, Chaudhry et al. [22] devise a metric known as Learning Curve Area (LCA), which quantifies the speed of
learning in a model. It first defines an average 𝑏-shot performance, where 𝑏 represents the number of mini-batches,
subsequent to the completion of training across all 𝑇 tasks as follows:

𝑁
1 ∑︁
𝑍𝑏 = 𝑅𝑁 ,𝑖 (7)
𝑁 𝑖=1
𝐿𝐶𝐴 at 𝛽 is the area of the convergence curve 𝑍𝑏 as a function of 𝑏 ∈ [0, 𝛽]:

∫ 𝛽 𝛽
1 1 ∑︁
𝐿𝐶𝐴𝛽 = 𝑍𝑏 𝑑𝑏 = 𝑍𝑏 (8)
𝛽 +1 0 𝛽 +1
𝑏=0
The Learning Curve Area (LCA) provides insights into model learning dynamics. 𝐿𝐶𝐴0 measures the average 0-shot
performance, similar to forward transfer ([122]). 𝐿𝐶𝐴𝛽 , quantifying the area under the 𝑍𝑏 curve, evaluates both average
0-shot performance and learning speed. Although two models may achieve similar 𝑍𝑏 or 𝐴𝑇 values, they can differ
significantly in 𝐿𝐶𝐴𝛽 due to variations in learning rates. This metric is crucial for identifying models that quickly learn
from few examples, particularly when 𝛽 is small.
Qin et al. [152] propose two metrics designed to evaluate pre-trained language models (PLMs) based on their perfor-
mance within learned domains: Average Perplexity (𝐴𝑃) and Average Increased Perplexity (𝐴𝑃 + ). The aforementioned
metrics are utilized to assess key capabilities of PLMs, such as instruction following and safety, as discussed in Wang et
al. [190].
Manuscript submitted to ACM
38 Yutao Yang et al.

A.2 Memory Stability.

Chaudhry et al. [21] introduce the Forgetting Measure (FM), a metric designed to quantify the extent of forgetting a
model experiences for a specific task. The forgetting for a given task 𝑇 𝑗 after sequential training on tasks up to 𝑇𝑁 is
quantified as the difference between the highest proficiency (𝑚𝑎𝑥 (𝑅𝑙,𝑗 )) achieved on task 𝑇 𝑗 during initial training and
its proficiency (𝑅𝑁 ,𝑗 ) after subsequent learning phases:

𝑓𝑗 = max (𝑅𝑙,𝑗 − 𝑅𝑁 ,𝑗 ), ∀𝑗 < 𝑁 . (9)

𝑙 ∈ {1,...,𝑁 −1}
For the purpose of quantifying forgetting in previous tasks, the function 𝑓 𝑗 is defined within the interval [−1, 1] for
𝑗 < 𝑁.
Furthermore, to account for the number of tasks previously encountered, the Forgetting Measure (FM) at the 𝑁 -th
task represents the mean level of forgetting across all preceding tasks:

𝑁 −1
1 ∑︁
𝐹𝑀 = 𝑓𝑗 (10)
𝑁 − 1 𝑗=1
A lower FM indicates better retention of previous tasks. Here, the expansion or 𝑅 𝑗,𝑗 serves as a more effective quantifier
of retained knowledge concerning past tasks, as opposed to using max. Nonetheless, max remains a valuable estimator
for assessing the extent of forgetting that occurs throughout the learning process.
Davari et al. [33] propose a method named linear probes (LP) to assess representation forgetting. This approach
measures the effectiveness of learned representations via an optimal linear classifier trained on the frozen activations
of a base network. Representation forgetting is quantified by evaluating the change in Language Processing (LP)
performance before and after the introduction of a new task. Formally, for each model (𝑓𝜃𝑖 ) at time step 𝑖 of a task
sequence, the classifier (𝑊𝑖∗ ) is optimized as: 𝑊𝑖∗ = arg min𝑊𝑖 L (𝑊𝑖 ; 𝑓𝜃𝑖 (𝑋𝑖 ), 𝑌𝑖 ), where L, 𝑋𝑖 , and 𝑌𝑖 represent the
objective function, input data, and labels for task 𝑖, respectively. The degree of representational forgetting between two
model states, 𝜃 𝑎 and 𝜃𝑏 , where 𝜃𝑏 is derived later in the sequence, is evaluated by calculating the difference in scores:
𝑆𝑐𝑜𝑟𝑒 (𝑊𝑎 𝑓𝜃𝑎 (𝑋𝑎 ), 𝑌𝑎 ) − 𝑆𝑐𝑜𝑟𝑒 (𝑊𝑏 𝑓𝜃𝑏 (𝑋𝑎 ), 𝑌𝑎 ), where 𝑆𝑐𝑜𝑟𝑒 represents the performance metric, such as accuracy, on
the task.
Kemker et al. [87] introduce three metrics designed to CF: Ωbase , Ωnew , and Ωall . Ωbase assesses retention of initial
learning, Ωnew measures recall of new tasks, and Ωall evaluates overall proficiency in maintaining old knowledge and
acquiring new information.

𝑁
1 ∑︁ 𝛼 base,𝑖
Ωbase = (11)
𝑁 − 1 𝑖=2 𝛼 ideal
𝑁
1 ∑︁ 𝛼 new,𝑖
Ωnew = (12)
𝑁 − 1 𝑖=2 𝛼 ideal
𝑁
1 ∑︁ 𝛼 all,𝑖
Ωall = (13)
𝑁 − 1 𝑖=2 𝛼 ideal
where 𝑁 represents the total number of sessions, 𝛼 new,𝑖 is the test accuracy after learning session 𝑖, 𝛼 base,𝑖 denotes
the accuracy on the initial session after 𝑖 sessions, and 𝛼 all,𝑖 refers to the test accuracy across all test data for classes
encountered up to point 𝑖. The ideal performance (𝛼 ideal ) is defined as the offline MLP accuracy on the base set. To

Manuscript submitted to ACM

Recent Advances of Foundation Language Models-based Continual Learning: A Survey 39

facilitate comparative analysis across different datasets, Ωbase and Ωall are normalized by 𝛼 ideal . Consequently, unless a
model surpasses 𝛼 ideal , normalized results will range from 0 to 1, enabling consistent cross-dataset comparisons.
Additionally, researchers [95] devise a novel metric, termed the Knowledge Loss Ratio (KLR), quantifies knowledge
degradation using principles from information theory.

A.3 Learning Plasticity.

Intransigence measure (IM), as defined by Chaudhry et al. [21], quantifies a model’s inability to learn new tasks. This
measure is calculated by comparing the performance difference of a task when trained jointly with other tasks versus
when trained in a continual learning setting. Then the intransigence for the 𝑁 -th task can be defined as:

∗
𝐼𝑀 = 𝑅𝑁 − 𝑅𝑁 ,𝑁 , (14)
∗
where 𝑅𝑁 represents the accuracy achieved on the held-out dataset of the 𝑁 -th task, 𝑅𝑁 ,𝑁 indicates the accuracy on
the 𝑁 -th task upon completion of training in an incremental sequence up to and including task 𝑁 . Note, 𝐼𝑀𝑁 ∈ [−1, 1],
and lower values indicate superior performance.

A.4 Metrics for Continual Pre-training.

CKL [73] introduces a novel metric, named FUAR (FORGOTTEN / (UPDATED + ACQUIRED) RATIO), which quantita-
tively measures the efficiency of each CKL method. It calculates the number of instances of time-invariant knowledge
that a model forgets in order to learn or update one instance of new knowledge. When FUAR is equal to 1.0, it signifies
an equilibrium where one time-invariant knowledge instance is forgotten on average to obtain a new or updated
knowledge instance. Formally, FUAR is defined as:
𝑁 −1
max(0, Gap(𝑇𝑖𝐹 , 𝐷𝑖 , 𝐷 𝑁 )) 1 {𝑇 𝐹 ≠𝑛.𝑑.}
∑︁
𝐸𝑞 1 = (15)
𝑖
𝑖=0
𝑁 −1
max(0, Gap(𝑇𝐵𝑈 , 𝐷 𝑁 , 𝐷𝑖 )) 1 {𝑇 𝐹 ≠𝑛.𝑑.}
∑︁
𝐸𝑞 2 =
𝑖=0
𝑖
(16)
+ max(0, Gap(𝑇𝑁𝐴 , 𝐷 𝑁 , 𝐷𝑖 )) 1 {𝑇 𝐹 ≠𝑛.𝑑.}
𝑖
 𝐸𝑞 1

 if 𝑑𝑒𝑛𝑜𝑚𝑖𝑛𝑎𝑡𝑜𝑟 > 0
𝐹𝑈 𝐴𝑅( T 𝐹


,𝑇𝑁𝑈 ,𝑇𝑁𝐴 )
= 𝐸𝑞 2 (17)


𝑛𝑜 𝑔𝑎𝑖𝑛 otherwise

𝑁 is a sequence of corpora for LM pretraining. Gap(𝑇 , 𝐷 , 𝐷 ) is Score(𝑇 )
where 𝑇 represents an arbitrary task, and (𝐷𝑖 )𝑖=0 𝑎 𝑏
of 𝐿𝑀𝑎 - Score(𝑇 ) of 𝐿𝑀𝑏 , where 𝐿𝑀𝑎 is pretrained on 𝐷𝑎 . T𝐹 = (𝑇𝑖𝐹 )𝑖=0
𝑁 −1 measures forgetting of invariant-knowledge
𝑁 −1 . If no task is from 𝐷 , 𝑇 𝐹 is "n.d." (not defined). 𝑇 𝑈 and 𝑇 𝐴 from 𝐷 measure update and acquisition of
from (𝐷𝑖 )𝑖=0 𝑖 𝑖 𝑁 𝑁 𝑁
new knowledge, respectively.

A.5 Online CL-Specific Metrics.

with the distribution of recently observed training data. The calculation of NFA involves first checking if the model
correctly predicts the label of a future sample, which can be expressed as 𝑎𝑡 = 1{𝑓𝜃𝑡 (𝑥𝑡 +1+𝑆 ) = 𝑦𝑡 +1+𝑆 }. Subsequently,
the running average is updated using the formula 𝐴𝑅𝐴 𝑡 = 𝑡1 (𝐴𝑅𝐴 𝑡 − 1 · (𝑡 − 1) + 𝑎𝑡 ).
Yogatama et al. [209] proposed a novel online codelength (ℓ (𝐷)), inspired by prequential encoding [10], to quantify
how quickly an existing model can adapt to a new task.
𝑁
∑︁
ℓ (𝐷) = log2 |𝑦| − log2 𝑝 (𝑦𝑖 |𝑥𝑖 ; 𝜃 𝐷𝑖 −1 ) (18)
𝑖=2
where |𝑌 | is the number of possible labels (classes), and 𝜃 𝐷𝑖 represents a particular subset of the dataset 𝐷. Similar to
the approach in Latent Contextual Allocation (LCA) [22], the concept of online codelength is associated with the area
under the learning curve.

Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009

Manuscript submitted to ACM

The Draconic Chart (Pam Crane) Englishturkish (Z-Library)
100% (3)
The Draconic Chart (Pam Crane) Englishturkish (Z-Library)
643 pages
G10LAS - Q3 - Week 6 - Illustrates Events, and Union and Intersection of Events PDF
100% (9)
G10LAS - Q3 - Week 6 - Illustrates Events, and Union and Intersection of Events PDF
5 pages
Behavioral Learning Theory
67% (3)
Behavioral Learning Theory
26 pages
SAP HANA Essentials Book PDF
No ratings yet
SAP HANA Essentials Book PDF
73 pages
03:18:1950 Article The Nation - Communism in The Caribbean?
No ratings yet
03:18:1950 Article The Nation - Communism in The Caribbean?
2 pages
2024_Multi-Task Learning in Natural Language Processing - An Overview_Chen et al_ACM Computing Surveys
No ratings yet
2024_Multi-Task Learning in Natural Language Processing - An Overview_Chen et al_ACM Computing Surveys
31 pages
Efficient Prompting Methods For Large Language Models - A Survey
100% (1)
Efficient Prompting Methods For Large Language Models - A Survey
18 pages
lecture1-intro
No ratings yet
lecture1-intro
54 pages
1719720399971
No ratings yet
1719720399971
51 pages
Unifying Large Language Models and Knowledge Graphs: A Roadmap
No ratings yet
Unifying Large Language Models and Knowledge Graphs: A Roadmap
28 pages
Unifying Large Language Models and Knowledge Graphs: A Roadmap
No ratings yet
Unifying Large Language Models and Knowledge Graphs: A Roadmap
29 pages
Large Language Models Meet NLP: A Survey
No ratings yet
Large Language Models Meet NLP: A Survey
20 pages
Multilingual Machine Translation With Large Language Models: Empirical Results and Analysis
No ratings yet
Multilingual Machine Translation With Large Language Models: Empirical Results and Analysis
14 pages
duan2020
No ratings yet
duan2020
6 pages
Federated and edge learning for large language models
No ratings yet
Federated and edge learning for large language models
19 pages
4
No ratings yet
4
5 pages
IT-361 Theory of Automata Outline
No ratings yet
IT-361 Theory of Automata Outline
4 pages
Course Material- Artificail Intelligence-Week1_update
No ratings yet
Course Material- Artificail Intelligence-Week1_update
78 pages
A O L M: R D O: N Verview On Anguage Odels Ecent Evelopments and Utlook
No ratings yet
A O L M: R D O: N Verview On Anguage Odels Ecent Evelopments and Utlook
24 pages
Applied Natural Language Processing
No ratings yet
Applied Natural Language Processing
3 pages
Multilingual Machine Translation With Large Language Models
No ratings yet
Multilingual Machine Translation With Large Language Models
16 pages
Statistical Natural Language Processing
No ratings yet
Statistical Natural Language Processing
2 pages
ULMfit Universal Language Model Fine-Tuning for Text Classification
No ratings yet
ULMfit Universal Language Model Fine-Tuning for Text Classification
9 pages
A Survey of Large Language Models
No ratings yet
A Survey of Large Language Models
97 pages
mmE5- Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
No ratings yet
mmE5- Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
21 pages
LLM Survey
100% (1)
LLM Survey
43 pages
2023-Large Models For Time Series and Spatio-Temporal Data - A Survey and Outlook
No ratings yet
2023-Large Models For Time Series and Spatio-Temporal Data - A Survey and Outlook
25 pages
2022 Acl-Demo 10
No ratings yet
2022 Acl-Demo 10
9 pages
Understanding LLMs: A Comprehensive Overview from Training to Inference
No ratings yet
Understanding LLMs: A Comprehensive Overview from Training to Inference
30 pages
A Comprehensive Overview of Large Language Models - 2307.06435v9
No ratings yet
A Comprehensive Overview of Large Language Models - 2307.06435v9
46 pages
TGDK 1 1 2
No ratings yet
TGDK 1 1 2
38 pages
Large Language Models: A Survey
No ratings yet
Large Language Models: A Survey
43 pages
Large Language Models in Finance
No ratings yet
Large Language Models in Finance
11 pages
Unit 5
No ratings yet
Unit 5
20 pages
1 s2.0 S0925231221010997 Main
No ratings yet
1 s2.0 S0925231221010997 Main
14 pages
WizardLM - Empowering Large Language Models To Follow Complex Instructions
No ratings yet
WizardLM - Empowering Large Language Models To Follow Complex Instructions
39 pages
Understanding LLMS: A Comprehensive Overview From Training To Inference
No ratings yet
Understanding LLMS: A Comprehensive Overview From Training To Inference
30 pages
Assessing the Capability
No ratings yet
Assessing the Capability
10 pages
N19-1213
No ratings yet
N19-1213
7 pages
Rishabh Sharma (Anantika Johari)
No ratings yet
Rishabh Sharma (Anantika Johari)
8 pages
ACD-11
0% (1)
ACD-11
67 pages
Comparative Study On Spoken Language Identification Based On Deep Learning
No ratings yet
Comparative Study On Spoken Language Identification Based On Deep Learning
5 pages
6. Large Language Models for Code Analysis - Do LLMs Really Do Their Job
No ratings yet
6. Large Language Models for Code Analysis - Do LLMs Really Do Their Job
18 pages
UER: An Open-Source Toolkit For Pre-Training Models
No ratings yet
UER: An Open-Source Toolkit For Pre-Training Models
6 pages
ASE2024_CodeGenSurvey-7
No ratings yet
ASE2024_CodeGenSurvey-7
17 pages
2412.08821v2
No ratings yet
2412.08821v2
49 pages
A Review On Large Language Models Architectures Applications Taxonomies Open Issues and Challenges
No ratings yet
A Review On Large Language Models Architectures Applications Taxonomies Open Issues and Challenges
36 pages
PAL: Program-Aided Language Models
No ratings yet
PAL: Program-Aided Language Models
34 pages
CS698V/CS779: Statistical Natural Language Processing Course Handout 1
No ratings yet
CS698V/CS779: Statistical Natural Language Processing Course Handout 1
2 pages
Through The Lens of Core Competency: Survey On Evaluation of Large Language Models
No ratings yet
Through The Lens of Core Competency: Survey On Evaluation of Large Language Models
22 pages
2020-PAL-wu
No ratings yet
2020-PAL-wu
16 pages
Survey On LLM
No ratings yet
Survey On LLM
9 pages
Sem 321
No ratings yet
Sem 321
24 pages
ADS 1st Semester Course Outlines
No ratings yet
ADS 1st Semester Course Outlines
6 pages
A Survey of Large Language Models
No ratings yet
A Survey of Large Language Models
124 pages
Trend
No ratings yet
Trend
47 pages
Survey
No ratings yet
Survey
19 pages
Day 2 Module 2 - Understanding LLMs
No ratings yet
Day 2 Module 2 - Understanding LLMs
14 pages
MRKL Systems
No ratings yet
MRKL Systems
19 pages
2312.06562v2
No ratings yet
2312.06562v2
20 pages
Lm-I: S O - F L G - L L M: Nfinite Imple N THE LY Ength Ener Alization For Arge Anguage Odels
No ratings yet
Lm-I: S O - F L G - L L M: Nfinite Imple N THE LY Ength Ener Alization For Arge Anguage Odels
14 pages
Report - PDF 20240827 210738 0000
No ratings yet
Report - PDF 20240827 210738 0000
23 pages
TOC-Lesson Plan
No ratings yet
TOC-Lesson Plan
11 pages
Literature Review On Vulnerability Detection Using
No ratings yet
Literature Review On Vulnerability Detection Using
10 pages
Explanation Based Learning: Fundamentals and Applications
From Everand
Explanation Based Learning: Fundamentals and Applications
Fouad Sabry
No ratings yet
Rucha Bapat Master Thesis Report
No ratings yet
Rucha Bapat Master Thesis Report
91 pages
IB SL 3.4 g.FinalAcc BS SummaryExcercises (S)
No ratings yet
IB SL 3.4 g.FinalAcc BS SummaryExcercises (S)
10 pages
ABC of Clinical Reasoning (ABC Series), 2e (Dec 19, 2022)_(1119871514)_(Wiley-Blackwell) 2nd Edition Nicola Cooper 2024 Scribd Download
100% (3)
ABC of Clinical Reasoning (ABC Series), 2e (Dec 19, 2022)_(1119871514)_(Wiley-Blackwell) 2nd Edition Nicola Cooper 2024 Scribd Download
41 pages
How Islamic Are Islamic Countries
No ratings yet
How Islamic Are Islamic Countries
40 pages
Newrest Remote Algerie
No ratings yet
Newrest Remote Algerie
47 pages
Pembahasan Soal UN Bahasa Inggris SMP 2012 (Paket Soal C29) PDF
No ratings yet
Pembahasan Soal UN Bahasa Inggris SMP 2012 (Paket Soal C29) PDF
15 pages
TM Transformer
No ratings yet
TM Transformer
40 pages
PDF Reflective Knowledge Apt Belief and Reflective Knowledge First Edition Ernest Sosa download
100% (10)
PDF Reflective Knowledge Apt Belief and Reflective Knowledge First Edition Ernest Sosa download
82 pages
Conference Checklist: 18-24 Months Prior To The Conference
No ratings yet
Conference Checklist: 18-24 Months Prior To The Conference
4 pages
Website DevelopmentProposal
No ratings yet
Website DevelopmentProposal
6 pages
A List of Forage Grasses and Scoentific Name
80% (5)
A List of Forage Grasses and Scoentific Name
5 pages
Lol
No ratings yet
Lol
2 pages
Fulfilling A Dream
No ratings yet
Fulfilling A Dream
22 pages
7:9 Planespotting at Los Angeles International Airport: Everything You Need To Know
No ratings yet
7:9 Planespotting at Los Angeles International Airport: Everything You Need To Know
4 pages
imageAction
No ratings yet
imageAction
12 pages
Jean DEspagnet
100% (1)
Jean DEspagnet
1 page
2.2 - Measurement of Turbine Efficiency - Report - OCT23
No ratings yet
2.2 - Measurement of Turbine Efficiency - Report - OCT23
5 pages
Sasmo 2020 G7
100% (1)
Sasmo 2020 G7
16 pages
Language Mock Assessment 1 Attempt Review
No ratings yet
Language Mock Assessment 1 Attempt Review
53 pages
ĐỀ SỐ 1- key
No ratings yet
ĐỀ SỐ 1- key
20 pages
ai-snake-id
No ratings yet
ai-snake-id
5 pages
Daftar Harga 11 September 2023
No ratings yet
Daftar Harga 11 September 2023
58 pages
A Dangerous Fiction - by Barbara Rogan
No ratings yet
A Dangerous Fiction - by Barbara Rogan
18 pages
Community Engagement, Solidarity, & Citizenship: Week 1-2
No ratings yet
Community Engagement, Solidarity, & Citizenship: Week 1-2
8 pages
Profiling of Learners V Anahaw
No ratings yet
Profiling of Learners V Anahaw
8 pages

Recent Advances of Foundation Language Models-Based Continual Learning - A Survey

Uploaded by

Recent Advances of Foundation Language Models-Based Continual Learning - A Survey

Uploaded by

Recent Advances of Foundation Language Models-based Continual Learning: A

ACM Reference Format:

Manuscript submitted to ACM 1

Traditional CL Task 1 … Task i … Task N

Model 1 … Model i … Model N

Foundation Language Task 1 … Task i … Task N

VLMs-based DIL (§4.1.3) S-Prompt [191], Medical AI [206], VQACL [220]

Fig. 2. Taxonomy of foundation language models for continual learning.

2.2 Continual Learning for Computer Vision

2.3 Continual Learning for NLP

2.4 Continual Learning for Other Domains

... ... ... ...

Domain 1 Domain i Domain N

... ... Task - ID

Hard Task Boundary

Task 1 Task 2 Task 3

3 SETTINGS AND LEARNING MODES OF CL

3.2 Typical Scenarios

s-3 s-2 s-1 s

s-3 s-2 s-1 s

s-3 s-2 s-1 s

s-3 s-2 s-1 s Classifiers

Expert/Gating Freeze Expert/Gating Expansion

model 𝒙 (CSC) to improve 𝒙

accuracy.Inputs We call this framework Streamcontrastive con-

(CSC): CSC loss is computed based onmagnitudes

“Net2WiderNet” approach view (Chen

and review sentence (e.g., "The soundnentially quality for this

polarity is predicted on (PLMs),

4.1.2 LLMs-based DIL.

Fusion Encoder Con$nual Fusion Encoder Fusion Encoder

(a) HMI (b) DynaMind (c) TRIPLET

With the proposed decoupled prompts, then we need inter-

4.2.1 PLMs-based TIL.

Parameter-Efficient Methods. MoE-Adapters4CL [212] introduces a parameter-efficient continual learning method to

4.3 Class-Incremental Learning

a) Teacher Knowledge Transfer (TKT)

(a) IDBR (b) PLE (c) Adaptation-CLIP

Finally, we compute the weighted sum of the value matrix

4.3.2 VLMs-based CIL.

𝑥𝑥 𝑡𝑡𝑡𝑡𝑡𝑡 𝑓𝑓𝜃𝜃𝑘𝑘 𝑦𝑦� 𝑡𝑡𝑡𝑡𝑡𝑡 𝑔𝑔𝜔𝜔𝑘𝑘 𝑦𝑦� 𝑡𝑡𝑡𝑡𝑡𝑡 ℒ 𝑡𝑡𝑡𝑡𝑡𝑡

𝜃𝜃 𝑘𝑘+1 = 𝜃𝜃 𝑘𝑘 − 𝛼𝛼 ⋅ 𝛻𝛻𝜃𝜃𝑘𝑘 ℒ 𝑡𝑡𝑡𝑡𝑡𝑡

experience replay retrieve

2 Model optimization algorithm in the following aspects:

5.2 Blurry Task Boundary

5.2.2 VLMs-based BTB.

6.2 Online Datasets for NLP

Training on task {T1 , ..., Tk }, test on Ti>k

RN-1,1 RN-1,2 RN-1,3 RN-1,4 RN-1,N-1 RN-1,N

RN,1 RN,2 RN,3 RN,4 ... RN,N-1 RN,N

Fig. 9. Illustration of calculating metrics.

6.3 Offline CL Datasets for Multi-modal Tasks

6.4 Offline and Online Datasets for Other Tasks

7.1 Overall Performance.

7.2 Memory Stability.

7.3 Learning Plasticity.

7.4 Metrics for Continual Pre-training.

7.5 Online CL-Specific Metrics.

8 CHALLENGES AND FURTHER WORK

Manuscript submitted to ACM

Manuscript submitted to ACM

Manuscript submitted to ACM

A.2 Memory Stability.

𝑓𝑗 = max (𝑅𝑙,𝑗 − 𝑅𝑁 ,𝑗 ), ∀𝑗 < 𝑁 . (9)

Manuscript submitted to ACM

A.3 Learning Plasticity.

A.4 Metrics for Continual Pre-training.

A.5 Online CL-Specific Metrics.

Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009

Manuscript submitted to ACM

You might also like