Architecture Based Continual Learning
Pengxiang Wang
1 / 23
Introduction
2 / 23
Problem Definition
Continual Learning (CL): learning a sequence of tasks $t = 1, \dots, N$ in order, with
datasets $D_t = \{x_t, y_t\}$
Task-Incremental Learning (TIL): a continual learning scenario whose aim is to train a model $f$
that performs well on all learned tasks:

$$\max_f \; \sum_{t=1}^{N} \operatorname{metric}\big(f(x_t), y_t\big), \qquad \{x_t, y_t\} \in D_t$$
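As a rough illustration of this objective, here is a minimal Python sketch (hypothetical names: `model`, `metric`, and `task_datasets` are placeholders, not from the slides):

```python
# Minimal sketch of the TIL objective: evaluate a single model on all tasks seen so far.
def til_objective(model, task_datasets, metric):
    total = 0.0
    for dataset in task_datasets:            # tasks t = 1, ..., N
        for x_t, y_t in dataset:             # samples {x_t, y_t} in D_t
            total += metric(model(x_t), y_t)
    return total                             # quantity to maximize over models f
```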
3 / 23
Existing Approaches for TIL
Replay-based Approaches
▶ Prevent forgetting by storing part of the data from previous tasks
▶ Replay algorithms use these stored samples to consolidate previous knowledge (see the sketch below)
▶ E.g. iCaRL, GEM, DER, DGR ...
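A minimal sketch of the replay idea, assuming a reservoir-style memory; `ReplayBuffer` and the way it is mixed into training are illustrative, not a specific published method:

```python
import random

class ReplayBuffer:
    """Fixed-size memory of samples from previous tasks (reservoir sampling)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, x, y):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:
            i = random.randrange(self.seen)   # replace a slot with decreasing probability
            if i < self.capacity:
                self.data[i] = (x, y)

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))
```

During training on a new task, each batch would be mixed with `buffer.sample(k)` so the loss also covers stored samples from previous tasks.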
Regularization-based Approaches
▶ Add regularization terms, constructed from information about previous tasks, to the loss function when training new tasks (see the penalty sketch below)
▶ E.g. LwF, EWC, SI, IMM, VCL, ...
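For instance, an EWC-style penalty anchors parameters that were important for previous tasks near their old values (a sketch assuming flattened numpy arrays and precomputed per-parameter importances `fisher`):

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam=1.0):
    """EWC-style regularization term, added to the new task's loss:
    (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * np.sum(fisher * (params - old_params) ** 2)
```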
Architecture-based Approaches (what we are talking about)
▶ Dedicate network parameters in different parts of the network to different tasks
▶ Keep the parameters for previous tasks from being significantly changed
▶ E.g. Progressive Networks, PackNet, DEN, Piggyback, HAT, CPG, UCL, ...
4 / 23
Existing Approaches for TIL
Optimization-based Approaches
▶ Explicitly design and manipulate the optimization step
▶ For example, project the gradient so that it does not interfere with previous tasks (see the sketch below)
▶ E.g. GEM, A-GEM, OWM, OGD, GPM, RGO, TAG, ...
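A minimal sketch of A-GEM-style projection, assuming flattened numpy gradients (`g` from the current task, `g_ref` computed on stored data from previous tasks):

```python
import numpy as np

def project_gradient(g, g_ref):
    """If the update would increase the loss on previous tasks
    (negative dot product), project it so that it no longer does."""
    dot = np.dot(g, g_ref)
    if dot < 0:
        g = g - (dot / np.dot(g_ref, g_ref)) * g_ref
    return g
```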
Representation-based Approaches
▶ Use special architecture or training procedure to create powerful representations
▶ Inspired by self-supervised learning and large-scale pre-training (e.g. LLMs)
▶ E.g. Co2L, DualNet, prompt-based approaches (L2P, CODAPrompt, ...), CPT
(continual pre-training)...
5 / 23
Architecture-based Approaches
6 / 23
Architecture-based Approaches
7 / 23
Modular Networks: Progressive Networks
8 / 23
Modular Networks: Progressive Networks
9 / 23
Modular Networks: PathNet
PathNet, 2017
▶ Prepare a large pool of modules for the algorithm to select from
▶ Several candidate modules at each position; the chosen ones are concatenated and form a subnet for the task
▶ The path is chosen by a tournament genetic algorithm that compares different paths during the training of a task (see the sketch below)
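A heavily simplified sketch of the tournament step; `evaluate` and `mutate` are hypothetical helpers standing in for PathNet's training/evaluation of a path and its mutation operator:

```python
import random

def tournament_step(paths, evaluate, mutate):
    """Compare two randomly chosen paths; overwrite the loser
    with a mutated copy of the winner."""
    a, b = random.sample(range(len(paths)), 2)
    winner, loser = (a, b) if evaluate(paths[a]) >= evaluate(paths[b]) else (b, a)
    paths[loser] = mutate(list(paths[winner]))
    return paths
```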
10 / 23
Parameter Allocation: Overview
Parameter Allocation
▶ Refines the granularity from modules down to individual parameters or neurons
▶ Selects a collection of parameters or neurons to allocate to each task
▶ These again form a subnet for the task (a generic masking sketch follows)
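The shared mechanism can be pictured as a per-task binary mask over a shared weight tensor; a minimal numpy sketch (how the mask is obtained is exactly what differs between the methods below):

```python
import numpy as np

def masked_linear(x, W, b, task_mask):
    """Forward pass of one layer restricted to the subnet allocated to a task.
    task_mask is a binary array with the same shape as W."""
    return x @ (W * task_mask) + b
```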
11 / 23
Parameter Allocation: Overview
12 / 23
Parameter Allocation: PackNet
PackNet, 2018
▶ Select non-overlapping weight masks and allocate them to tasks
▶ Parameters masked for a task are fixed once that task is trained; at test time the task's subnet is used
▶ Post-hoc selection by pruning (by absolute weight values) after training
▶ Retraining is needed after pruning because the network structure changes
▶ Allocation is manual, via percentage hyperparameters (see the pruning sketch below)
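A sketch of the pruning-based allocation, assuming `W` is one layer's weight array, `free_mask` marks weights not yet claimed by earlier tasks, and `keep_ratio` is the manual percentage hyperparameter:

```python
import numpy as np

def packnet_allocate(W, free_mask, keep_ratio):
    """Keep the top `keep_ratio` fraction of the still-free weights
    (by absolute value) for the current task; the rest remain free."""
    free_vals = np.abs(W[free_mask])
    k = int(keep_ratio * free_vals.size)
    threshold = np.sort(free_vals)[-k] if k > 0 else np.inf
    return free_mask & (np.abs(W) >= threshold)
```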
13 / 23
Parameter Allocation: DEN
DEN (Dynamically Expandable Networks), 2018
▶ Finds the important neurons of each task, used as feature masks at testing, and duplicates them when needed
▶ Importance is found by training with equal L2 regularisation: neurons whose connected parameters change a lot are considered important
▶ The network expands dynamically when performance cannot be improved further, and is pruned afterwards
▶ Each task selects its own important neurons by L1-regularised training, then trains only those under L2 regularisation
▶ Allocation is manual, via threshold hyperparameters, which works slightly better than percentages (see the sketch below)
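A rough sketch of the importance-by-change idea from the bullets above (hypothetical: rows of `W` index neurons, and a neuron is important if its incoming weights moved more than a threshold during training on the new task):

```python
import numpy as np

def important_neurons(W_before, W_after, threshold):
    """Neurons whose connected parameters changed a lot are treated as
    important for the new task (and kept as its feature mask)."""
    change = np.linalg.norm(W_after - W_before, axis=1)
    return change > threshold
```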
14 / 23
Parameter Allocation: Piggyback
Piggyback, 2018
▶ Learnable allocation: binary masks are gated from real-valued scores, which makes them differentiable so they can be learned by gradient descent like ordinary parameters (see the sketch below)
▶ Masks are still binary during testing
▶ Sacrifice: the network parameters stay fixed, which reduces representation ability
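A minimal sketch of the gating trick with numpy-style arrays; in the real method a straight-through estimator passes gradients through the thresholding to the real-valued scores, which is omitted here:

```python
def piggyback_forward(x, W_fixed, scores, tau=0.0):
    """Binary mask gated from real-valued `scores`; the backbone weights
    W_fixed stay frozen, only the scores are learned."""
    mask = (scores > tau).astype(W_fixed.dtype)
    return x @ (W_fixed * mask)
```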
SupSup, 2020
▶ Extends the idea to task-agnostic testing
15 / 23
Parameter Allocation: HAT
HAT (Hard Attention to the Task), 2018
▶ Masks and parameters are both learnable (see the gating sketch below)
▶ Parameters masked for a task are fixed once that task is trained; at test time the task's subnet is used
▶ Sparsity regularization on the masks
AdaHAT, 2024 (my work)
▶ Allows minor adaptive adjustments to masked parameters
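A sketch of HAT-style gating (assuming PyTorch; `e_t` is a learnable per-task embedding over a layer's units and `s` is a positive scale annealed during training so the gate becomes nearly binary):

```python
import torch

def hat_gate(h, e_t, s):
    """Per-task, per-unit attention gate applied to a layer's activations h."""
    a_t = torch.sigmoid(s * e_t)   # in (0, 1); close to binary for large s
    return h * a_t
```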
16 / 23
Parameter Allocation: CPG
CPG (Compacting, Picking and Growing), 2019
▶ Combines post-hoc pruning and retraining + network expansion + learnable masks (on previous tasks' weights)
17 / 23
Model Decomposition: ACL
ACL (Adversarial Continual Learning), 2020
▶ Shared and task-specific modules and features
▶ The shared module is adversarially trained against a discriminator that predicts task labels, so it learns to produce task-invariant features (see the sketch below)
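One common way to implement such an adversarial objective, as a hedged PyTorch-style sketch (`shared` and `discriminator` are modules, `task_id` the task-label tensor; ACL's exact formulation may differ):

```python
import torch.nn.functional as F

def adversarial_losses(shared, discriminator, x, task_id):
    """The discriminator learns to predict the task from shared features;
    the shared module is trained to make that prediction fail."""
    z = shared(x)
    d_loss = F.cross_entropy(discriminator(z.detach()), task_id)   # update discriminator
    s_loss = -F.cross_entropy(discriminator(z), task_id)           # update shared module
    return d_loss, s_loss
```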
18 / 23
Model Decomposition: APD
APD (Additive Parameter Decomposition), 2020
▶ Decomposes each task's parameters into a shared part $\sigma$, masked per task, plus a sparse task-specific part $\tau_t$:

$$\theta_t = \sigma \otimes \mathcal{M}_t + \tau_t, \qquad \mathcal{M}_t = \operatorname{Sigmoid}(v_t)$$

$$\min_{\sigma,\,\tau_t,\,v_t} \; \mathcal{L}\big(\{\sigma \otimes \mathcal{M}_t + \tau_t\};\, \mathcal{D}_t\big) \;+\; \lambda_1 \|\tau_t\|_1 \;+\; \lambda_2 \big\|\sigma - \sigma^{(t-1)}\big\|_2^2$$
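Read directly from the formulas above, a minimal PyTorch-style sketch (tensor names mirror the symbols; illustrative only):

```python
import torch

def apd_task_params(sigma, v_t, tau_t):
    """theta_t = sigma * Sigmoid(v_t) + tau_t: shared parameters sigma,
    masked per task, plus a sparse task-specific part tau_t."""
    return sigma * torch.sigmoid(v_t) + tau_t

def apd_regularizer(tau_t, sigma, sigma_prev, lam1, lam2):
    """lam1 * ||tau_t||_1 keeps the task-specific part sparse;
    lam2 * ||sigma - sigma_prev||_2^2 keeps the shared part from drifting."""
    return lam1 * tau_t.abs().sum() + lam2 * (sigma - sigma_prev).pow(2).sum()
```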
19 / 23
Model Decomposition: PGMA
PGMA (Parameter Generation and Model Adaptation), 2019
▶ Task-specific parameters $p_t$ are generated by the DPG (dynamic parameter generator)
▶ Shared parameters $\theta_0$ (in the solver $S$) adapt to task $t$ using the generated task-specific $p_t$ (see the sketch below)
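A very rough sketch of the structure (hypothetical: `dpg` generates the task-specific parameters, and how the solver combines them with the shared parameters is method-specific and omitted):

```python
def pgma_forward(x, solver, theta_0, dpg, task_embedding):
    """The DPG generates task-specific parameters p_t; the solver S runs
    with the shared theta_0 adapted by p_t."""
    p_t = dpg(task_embedding)
    return solver(x, theta_0, p_t)
```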
20 / 23
Challenges
21 / 23
Challenge: Network Capacity and Plasticity
22 / 23
Thank You
Thank you for your attention!
23 / 23