
Architecture-Based Approaches in Continual Learning

Pengxiang Wang

Peking University, School of Mathematical Sciences

University of Bristol, School of Engineering Mathematics and Technology

1 / 23
Introduction

2 / 23
Problem Definition
Continual Learning (CL): learning a sequence of tasks 𝑡 = 1, ⋯ , 𝑁 in order, with
datasets 𝐷𝑡 = {𝑥𝑡 , 𝑦𝑡 }
Task-Incremental Learning (TIL): a continual learning scenario, aiming to train a model 𝑓
that performs well on all learned tasks

\max_{f} \; \sum_{t=1}^{N} \mathrm{metric}\big(f(x_t), y_t\big), \qquad \{x_t, y_t\} \in D_t

Key assumptions when training and testing task 𝑡:


▶ No access to the whole data from previous tasks 1, ⋯ , 𝑡 − 1
▶ Testing on all seen tasks 1, ⋯ , 𝑡
▶ For TIL testing, task ID 𝑡 of each test sample is known by the model. Otherwise,
it is task-agnostic testing

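A minimal sketch of this TIL protocol in Python; the helpers `make_model`, `train_one_task`, and `evaluate` are hypothetical placeholders, not from the slides:

```python
# Sketch of the TIL protocol under the assumptions above. Only the current
# task's data is used for training; evaluation covers all tasks seen so far,
# and the task ID is provided at test time.

def til_protocol(datasets, make_model, train_one_task, evaluate):
    model = make_model()
    average_accuracies = []
    for t, (train_split, _) in enumerate(datasets):
        train_one_task(model, train_split, task_id=t)          # no access to tasks < t
        accs = [evaluate(model, datasets[s][1], task_id=s)      # task ID known at test
                for s in range(t + 1)]
        average_accuracies.append(sum(accs) / len(accs))
    return average_accuracies
```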
3 / 23
Existing Approaches for TIL
Replay-based Approaches
▶ Prevent forgetting by storing parts of the data from previous tasks
▶ Replay algorithms use them to consolidate previous knowledge
▶ E.g. iCaRL, GEM, DER, DGR ...
Regularization-based Approaches
▶ Add regularization terms constructed using information about previous tasks to
the loss function when training new tasks
▶ E.g. LwF, EWC, SI, IMM, VCL, ...
Architecture-based Approaches (what we are talking about)
▶ Dedicate network parameters in different parts of the network to different tasks
▶ Keep the parameters for previous tasks from being significantly changed
▶ E.g. Progressive Networks, PackNet, DEN, Piggyback, HAT, CPG, UCL, ...

4 / 23
Existing Approaches for TIL

Optimization-based Approaches
▶ Explicitly design and manipulate the optimization step
▶ For example, project the gradient so that it does not interfere with previous tasks
▶ E.g. GEM, A-GEM, OWM, OGD, GPM, RGO, TAG, ...
Representation-based Approaches
▶ Use special architecture or training procedure to create powerful representations
▶ Inspired by self-supervised learning and large-scale pre-training such as LLMs
▶ E.g. Co2L, DualNet, prompt-based approaches (L2P, CODAPrompt, ...), CPT
(continual pre-training)...

5 / 23
Architecture-based Approaches

6 / 23
Architecture-based Approaches

▶ Leverage the separability of the neural network architecture


▶ Treat the network as decomposable resources for tasks, rather than as a whole
▶ Dedicate different parts of a neural network to different tasks to minimize
inter-task interference
▶ Focus on reducing representational overlap between tasks
The “part” of a network can be regarded in various ways:
▶ Modular Networks: work with network modules such as layers and blocks
▶ Parameter Allocation: allocate groups of parameters or neurons to each task as a
subnet
▶ Model Decomposition: decompose the network, from various aspects, into shared
and task-specific components

7 / 23
Modular Networks: Progressive Networks

Progressive Networks, 2016


▶ Expand the network with a new column module for each new task
▶ Model memory grows linearly with the number of tasks
▶ Similar to independent training: train an independent network for each task

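A minimal sketch of the idea, assuming a two-layer MLP column per task with simple linear lateral adapters; sizes and details are assumptions, not the original implementation:

```python
import torch
import torch.nn as nn

class ProgressiveNet(nn.Module):
    """Two-layer progressive network: one column per task, with linear lateral
    adapters from the hidden layers of all earlier (frozen) columns."""

    def __init__(self, in_dim=784, hidden=256, out_dim=10):
        super().__init__()
        self.in_dim, self.hidden, self.out_dim = in_dim, hidden, out_dim
        self.columns = nn.ModuleList()   # one column per task
        self.laterals = nn.ModuleList()  # per task: adapters from earlier columns

    def add_column(self):
        # Freeze everything trained so far before growing a new column.
        for p in self.parameters():
            p.requires_grad_(False)
        column = nn.ModuleDict({
            "l1": nn.Linear(self.in_dim, self.hidden),
            "l2": nn.Linear(self.hidden, self.out_dim),
        })
        adapters = nn.ModuleList(nn.Linear(self.hidden, self.out_dim, bias=False)
                                 for _ in self.columns)
        self.columns.append(column)
        self.laterals.append(adapters)

    def forward(self, x, task_id):
        # Hidden activations of the current column and of all earlier ones.
        hiddens = [torch.relu(self.columns[c]["l1"](x)) for c in range(task_id + 1)]
        out = self.columns[task_id]["l2"](hiddens[task_id])
        for c, adapter in enumerate(self.laterals[task_id]):
            out = out + adapter(hiddens[c])  # lateral input from earlier columns
        return out
```

Calling `add_column()` before each new task freezes all previously trained columns, so only the new column and its adapters receive gradients.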
8 / 23
Modular Networks: Progressive Networks

Expert Gate, 2017


▶ A new independent expert (network) for each new task
▶ Similar to independent training, but works in task-agnostic testing
▶ A gate serves as the task ID selector at test time
▶ The gate is a network learned throughout the task sequence

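A minimal sketch of such a gate, here realized with per-task autoencoders that select the expert by reconstruction error; `autoencoders` and `experts` are assumed to be lists of trained per-task networks:

```python
def gate_select(x, autoencoders):
    """Pick the task whose autoencoder reconstructs the batch x best; the
    corresponding expert is then used to make the prediction."""
    errors = [((ae(x) - x) ** 2).mean().item() for ae in autoencoders]
    return min(range(len(errors)), key=errors.__getitem__)  # predicted task ID

# Usage sketch: logits = experts[gate_select(x, autoencoders)](x)
```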
9 / 23
Modular Networks: PathNet
PathNet, 2017
▶ Prepare a large pool of modules for the algorithm to select from
▶ Several module options at each position; the selected modules are concatenated to
form a subnet for a task
▶ The path is chosen by a tournament genetic algorithm over candidate paths during
the training of a task

10 / 23
Parameter Allocation: Overview

Parameter Allocation
▶ Refines the granularity from modules down to parameters or neurons
▶ Selects a collection of parameters or neurons to allocate to each task
▶ Also forms a subnet for the task

11 / 23
Parameter Allocation: Overview

Parameter Allocation methods differ in several ways:


▶ Methods to allocate
▶ Manually set through hyperparameters
▶ Learned together with the learning process
▶ Application of masks during training
▶ Forward pass
▶ Backward pass
▶ Parameter update step
▶ Application of masks during testing
▶ Most methods fix the selected subnet after it is trained on its task and use it as
the only model to predict for that task during testing
▶ Weight masks are far larger in scale than feature masks
▶ A decent number of neurons should be kept in each layer

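A minimal sketch of the three places a mask can act, in PyTorch-style code; `layer`, `feature_mask`, and `weight_mask` are assumed to be given:

```python
import torch

def forward_with_feature_mask(layer, x, feature_mask):
    # Forward pass: gate the layer's output features for the current task.
    return layer(x) * feature_mask

def mask_gradients(layer, weight_mask):
    # Backward pass: after loss.backward(), zero the gradients of weights
    # that do not belong to the current task's subnet.
    if layer.weight.grad is not None:
        layer.weight.grad.mul_(weight_mask)

def masked_sgd_update(param, weight_mask, lr):
    # Update step: a plain SGD step applied only to the allowed weights.
    with torch.no_grad():
        param -= lr * param.grad * weight_mask
```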
12 / 23
Parameter Allocation: PackNet
PackNet, 2018
▶ Select non-overlapping weight masks and allocate them to tasks
▶ Fix masked parameters once trained; at test time use the task's subnet
▶ Post-hoc selection by pruning (by absolute weight values) after training
▶ Retraining after pruning, since the network structure changes
▶ Manual allocation through percentage hyperparameters (sketched below)

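A minimal sketch of the pruning step, assuming a `free_mask` marking weights not yet claimed by earlier tasks and an assumed `keep_ratio` hyperparameter; retraining of the kept weights would follow, as noted above:

```python
import torch

def packnet_prune(weight, free_mask, keep_ratio=0.5):
    """Keep the largest-magnitude fraction of the weights still free for the
    current task; release the rest (zero them) for future tasks."""
    free_values = weight[free_mask.bool()].abs()
    k = int(keep_ratio * free_values.numel())
    threshold = free_values.topk(k).values.min() if k > 0 else float("inf")
    task_mask = (weight.abs() >= threshold) & free_mask.bool()
    keep = task_mask | ~free_mask.bool()        # previous tasks' weights stay intact
    weight.data.mul_(keep.to(weight.dtype))     # zero only the released free weights
    return task_mask  # fixed afterwards and used as this task's subnet at test time
```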
13 / 23
Parameter Allocation: DEN
DEN (Dynamically Expandable Networks), 2018
▶ Find the important neurons, used as feature masks for testing, and duplicate them
▶ Importance is found by training with uniform L2 regularisation: neurons whose
connected parameters change a lot are important
▶ Dynamic network expansion when performance cannot be improved, with pruning afterwards
▶ Each task selects its own important neurons by L1-regularised training, then trains
only them with L2 regularisation
▶ Manual allocation through threshold hyperparameters, slightly better than percentages

14 / 23
Parameter Allocation: Piggyback
Piggyback, 2018
▶ Learnable allocation: binary masks are gated from real-valued scores, which are
differentiable and can be learned like ordinary parameters (sketched below)
▶ Masks are still binary at test time
▶ Sacrifice: the network parameters stay fixed, which reduces representation ability
SupSup, 2020
▶ Extends to task-agnostic
testing

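A minimal sketch of a Piggyback-style masked linear layer; the score initialization and threshold are assumed values, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class PiggybackLinear(nn.Module):
    """Masked linear layer: the pretrained weight is frozen; only a real-valued
    score per weight is learned. The binary mask is obtained by thresholding,
    and gradients reach the scores via a straight-through estimator."""

    def __init__(self, pretrained: nn.Linear, threshold: float = 5e-3):
        super().__init__()
        self.weight = nn.Parameter(pretrained.weight.detach().clone(),
                                   requires_grad=False)                 # frozen backbone
        self.scores = nn.Parameter(torch.full_like(self.weight, 1e-2))  # learnable scores
        self.threshold = threshold

    def forward(self, x):
        hard = (self.scores > self.threshold).float()
        # Straight-through: use the binary mask in the forward pass, but let
        # gradients flow as if the mask were the real-valued scores.
        mask = hard + self.scores - self.scores.detach()
        return nn.functional.linear(x, self.weight * mask)
```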
15 / 23
Parameter Allocation: HAT
HAT (Hard Attention to the Task), 2018
▶ Masks and parameters are both learnable
▶ Fix masked parameters once trained; at test time use the task's subnet
▶ Sparsity regularization for masks
AdaHAT, 2024 (my work)
▶ Allows minor, adaptive adjustments to masked parameters

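A minimal sketch of HAT-style gating for one layer; the layer type, sizes, and scale value are assumptions (HAT anneals the scale during training rather than fixing it):

```python
import torch
import torch.nn as nn

class HATLayer(nn.Module):
    """Linear layer gated by a hard-attention mask: a learnable per-task
    embedding is passed through a scaled sigmoid to produce a (nearly binary)
    mask over the layer's units."""

    def __init__(self, in_dim, out_dim, num_tasks):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.task_embedding = nn.Embedding(num_tasks, out_dim)  # one mask per task

    def forward(self, x, task_id, scale=400.0):
        # A large scale pushes the sigmoid towards a hard 0/1 mask; HAT anneals
        # this scale from soft to hard within each training epoch.
        e = self.task_embedding.weight[task_id]
        mask = torch.sigmoid(scale * e)
        return torch.relu(self.linear(x)) * mask
```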
16 / 23
Parameter Allocation: CPG
CPG (Compacting, Picking and Growing), 2019
▶ Post-hoc pruning and retraining + network expansion + learnable masks (on the
weights of previous tasks)

17 / 23
Model Decomposition: ACL
ACL (Adversarial Continual Learning), 2020
▶ Shared and task-specific modules and features
▶ The shared module is adversarially trained against a discriminator to generate
task-invariant features; the discriminator predicts task labels

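A minimal sketch of the adversarial part, using a generic GAN-style confusion loss rather than ACL's exact objective; the sizes, task count, and optimizers are assumptions:

```python
import torch
import torch.nn as nn

num_tasks = 3                                               # assumed task count
shared = nn.Sequential(nn.Linear(784, 128), nn.ReLU())      # shared module (sizes assumed)
discriminator = nn.Linear(128, num_tasks)                   # predicts the task label
opt_shared = torch.optim.Adam(shared.parameters(), lr=1e-3)
opt_disc = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

def adversarial_step(x, task_labels):
    # 1) Train the discriminator to recognise which task each feature came from.
    opt_disc.zero_grad()
    ce(discriminator(shared(x).detach()), task_labels).backward()
    opt_disc.step()
    # 2) Train the shared module to confuse the discriminator (maximise its
    #    loss), pushing the shared features towards task-invariance.
    opt_shared.zero_grad()
    (-ce(discriminator(shared(x)), task_labels)).backward()
    opt_shared.step()
```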
18 / 23
Model Decomposition: APD

APD (Additive Parameter Decomposition), 2020


▶ Decomposes the parameter matrix of a layer mathematically:

\theta_t = \sigma \odot \mathcal{M}_t + \tau_t, \qquad \mathcal{M}_t = \mathrm{Sigmoid}(v_t)

▶ Apply different regularisation strategies to shared 𝜎 and task-specific 𝜏𝑡 , v𝑡

\min_{\sigma, \tau_t, v_t} \; \mathcal{L}\big(\{\sigma \odot \mathcal{M}_t + \tau_t\}; \mathcal{D}_t\big) + \lambda_1 \|\tau_t\|_1 + \lambda_2 \big\|\sigma - \sigma^{(t-1)}\big\|_2^2

1. The shared parameters 𝜎 should not deviate far from their previous values

2. The capacity of the task-specific 𝜏𝑡 should be as small as possible, enforced by making it sparse

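A minimal sketch of the decomposition and the two regularizers for a single linear layer; the shape of ℳ𝑡 (one mask entry per output unit) and the λ values are assumptions:

```python
import torch
import torch.nn as nn

class APDLinear(nn.Module):
    """Decomposed linear layer: the effective weight for task t is
    shared * Sigmoid(v_t) + tau_t, with the mask applied per output unit."""

    def __init__(self, in_dim, out_dim, num_tasks):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)            # sigma
        self.mask_logits = nn.Parameter(torch.zeros(num_tasks, out_dim))            # v_t
        self.task_specific = nn.Parameter(torch.zeros(num_tasks, out_dim, in_dim))  # tau_t

    def forward(self, x, t):
        mask = torch.sigmoid(self.mask_logits[t]).unsqueeze(1)   # M_t
        weight = self.shared * mask + self.task_specific[t]      # theta_t
        return nn.functional.linear(x, weight)

def apd_penalty(layer, t, shared_prev, lam1=1e-4, lam2=1e-2):
    # Regularizers from the objective above: keep tau_t sparse and keep the
    # shared sigma close to its value after the previous task.
    sparsity = layer.task_specific[t].abs().sum()
    drift = ((layer.shared - shared_prev) ** 2).sum()
    return lam1 * sparsity + lam2 * drift
```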
19 / 23
Model Decomposition: PGMA
PGMA (Parameter Generation and Model Adaptation), 2019
▶ Task-specific parameters 𝑝𝑡 are generated by a DPG (dynamic parameter generator)
▶ Shared parameters 𝜃0 (in the solver 𝑆) adapt to task 𝑡 with the generated
task-specific 𝑝𝑡

20 / 23
Challenges

21 / 23
Challenge: Network Capacity and Plasticity

Network Capacity Problem


▶ Any fixed model will eventually fill up, leading to a performance drop, given a
potentially infinite task sequence
▶ This becomes explicit in architecture-based approaches
▶ It can be sidestepped by expanding the network, but such a shortcut is not a fair comparison
Stability-Plasticity Trade-Off
▶ Continual learning seeks a balance between stability and plasticity
▶ Approaches that fix parts of the model for previous tasks lack plasticity, because
they stress stability too much
▶ Approaches with task-shared components still face the classic catastrophic
forgetting problem, which results from a lack of stability
▶ Both lead to poor average performance

22 / 23
Thank You
Thank you for your attention!

Please feel free to ask any questions.

My blog post provides detailed information about this:


Architecture-based Continual Learning Algorithms

23 / 23
