Representational Continuity for Unsupervised Continual Learning

Divyam Madaan, Jaehong Yoon, Yuanchun Li, Yunxin Liu, and Sung Ju Hwang
ICLR 2022, Oral Presentation

Data in the Wild
With the advent of technology-driven applications, a large amount of unlabelled data is
generated every minute of the day.

Data Driven AI Progress
Large-scale datasets have greatly influenced the progress of deep-learning models on
multiple benchmark applications.

Motivation
However, in many real-life applications, the collected datasets are limited in size during
the initial training phase of the network.

Motivation
The set of classes/tasks dynamically grows in size and changes continuously with time.

Annotation Challenges
The annotation process is time-consuming (e.g., bounding boxes in videos) and expensive, as it
requires expert knowledge in various applications.
Therefore, it is essential to learn continual representations on unlabelled data streams.

Objective
In this work, we take a step towards training scalable unsupervised CL representations.

Related Work
CURL learned task-specific representations on top of shared parameters using MLP
encoders/decoders and a simple MoG generative replay.
However, the method was restricted to simple low-resolution tasks and was not scalable
to standard CL benchmark datasets.
[Rao et al. 2019] Continual Unsupervised Representation Learning. NeurIPS 2019

We learn the feature representations on an unlabelled sequence of tasks.

We use self-supervised learning to learn the feature representations.
[Figure: feature extractors trained with a self-supervised loss]
However, a trivial combination of self-supervised learning with a sequence of
tasks can result in catastrophic forgetting.

To mitigate catastrophic forgetting, we revisit representations learnt on previous tasks.
Our quantitative and qualitative empirical analysis shows that reliance on
annotated data is not necessary for continual learning.
[Figure: feature extractors trained with a self-supervised loss]
● Structural Regularization
● Unsupervised Replay
● Architectural Expansion
● Lifelong Unsupervised Mixup

Self-Supervised Learning
Representational learning methods have shown huge potential to tackle the problem
of learning without supervision.

SimSiam maximizes the similarity between two augmented views of an image,
subject to a stop-gradient operation.
[Chen et al. 2021] Exploring Simple Siamese Representation Learning. CVPR 2021
SimSiam (Chen et al. 2021)

Barlow Twins pushes the cross-correlation matrix computed from the twin embeddings toward
the identity matrix.
[Chen et al. 2021] Exploring Simple Siamese Representation Learning. CVPR 2021
[Zbontar and Jing et al. 2021] Barlow Twins: Self-Supervised Learning via Redundancy Reduction. ICML 2021
SimSiam (Chen et al. 2021), Barlow Twins (Zbontar and Jing et al. 2021)
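A rough sketch of the redundancy-reduction objective, assuming embeddings arrive as lists of rows (one row per sample). The helper names and the off-diagonal weight `lam` are illustrative assumptions, not values taken from these slides.

```python
import math


def standardize(batch):
    """Normalize each embedding dimension to zero mean and unit variance over the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        std = std if std > 0 else 1.0  # guard against constant dimensions
        for b in range(n):
            out[b][j] = (batch[b][j] - mean) / std
    return out


def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Push the cross-correlation matrix of the two views toward the identity."""
    za, zb = standardize(z_a), standardize(z_b)
    n, d = len(za), len(za[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(za[b][i] * zb[b][j] for b in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2  # invariance term (diagonal)
            else:
                loss += lam * c_ij ** 2    # redundancy-reduction term (off-diagonal)
    return loss
```

The loss is zero when matching dimensions of the two views are perfectly correlated and all other pairs are uncorrelated.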

Unsupervised Structural Regularization
Encourage the current task parameters to stay close to the parameters of the old task.
Synaptic Intelligence [Zenke et al. 2017]
[Equation labels: task B loss, surrogate loss; UCL loss with SimSiam for the current task]
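The idea of keeping parameters close to their old values can be illustrated with a quadratic penalty in the style of Synaptic Intelligence. This is a hedged sketch: the function name, the flat parameter lists, and the strength `c` are assumptions for illustration, and the per-parameter importance weights are taken as given rather than accumulated during training as SI does.

```python
def si_regularized_loss(task_loss, params, old_params, importance, c=1.0):
    """Add a weighted quadratic penalty to the current task loss.

    task_loss:  loss on the current task (for UCL, a self-supervised loss value)
    params:     current parameter values
    old_params: parameter values stored after the previous task
    importance: per-parameter importance weights (assumed precomputed)
    c:          regularization strength (hypothetical default)
    """
    penalty = sum(w * (p - q) ** 2
                  for w, p, q in zip(importance, params, old_params))
    return task_loss + c * penalty
```

Parameters with a large importance weight are penalized more for drifting, which is how the surrogate term protects earlier tasks.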

Unsupervised Replay
To evaluate DER (Buzzega et al. 2020) for unsupervised replay, we minimize the Euclidean
distance between the projected outputs.
[Buzzega et al. 2020] Dark Experience for General Continual Learning: a Strong, Simple Baseline. NeurIPS 2020
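A minimal sketch of the replay term, assuming the buffer stores the projected outputs recorded when each example was first seen. The function name, the weight `alpha`, and the use of the squared distance averaged over the replayed batch are illustrative assumptions.

```python
def replay_loss(ssl_loss, current_outputs, stored_outputs, alpha=0.1):
    """Self-supervised loss plus a distance penalty on replayed examples.

    current_outputs: projected outputs of the current model on buffer samples
    stored_outputs:  projected outputs saved in the buffer for the same samples
    alpha:           weight of the replay term (hypothetical default)
    """
    distances = [
        sum((a - b) ** 2 for a, b in zip(cur, old))  # squared Euclidean distance
        for cur, old in zip(current_outputs, stored_outputs)
    ]
    return ssl_loss + alpha * sum(distances) / len(distances)
```

Because the penalty compares outputs rather than labels, the replay term needs no annotations.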

Lifelong Unsupervised Mixup (LUMP)
LUMP interpolates between the examples of the current task and random examples
selected using uniform sampling from the replay buffer.
[Figure label: Self-Supervised Loss]
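The interpolation step can be sketched as follows. This is an illustration under stated assumptions: examples are flat lists of floats, the mixing coefficient is drawn from a Beta(alpha, alpha) distribution as in standard mixup, and the function name and default `alpha` are hypothetical.

```python
import random


def lump_mix(current_batch, replay_buffer, alpha=0.4, rng=random):
    """Interpolate current-task examples with uniformly sampled buffer examples."""
    lam = rng.betavariate(alpha, alpha)    # mixing coefficient in [0, 1]
    mixed = []
    for x in current_batch:
        x_old = rng.choice(replay_buffer)  # uniform sampling from the buffer
        mixed.append([lam * a + (1.0 - lam) * b for a, b in zip(x, x_old)])
    return mixed, lam
```

The mixed batch would then be augmented and passed to the self-supervised loss in place of the raw current-task batch.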

Experimental Setup
Datasets
1) Split CIFAR-10 [Krizhevsky 2012]: A dataset with 60,000 images composed of five tasks
from ten animal and vehicle classes.
2) Split CIFAR-100 [Krizhevsky 2012]: A dataset with 60,000 images composed of 20 tasks
from 100 generic object classes.
3) Split Tiny-ImageNet [Russakovsky 2015]: A subset of the ImageNet dataset. We construct
20 tasks using 100 classes.
[Krizhevsky 2012] Krizhevsky, A. Learning multiple layers of features from tiny images. University of Toronto 2012
[Russakovsky 2015] ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015
Metrics
1) Accuracy is the average test accuracy of all the tasks completed until the continual learning of task $\tau$.
2) Forgetting is the average performance decrease of each task between its maximum accuracy and its accuracy at the
completion of training.
Here $a_{\tau,i}$ is the test accuracy of task $i$ after learning task $\tau$, using a KNN on frozen pre-trained representations on task $\tau$.
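The two metrics can be computed from a table of per-task accuracies. In this sketch, `acc[t][i]` is assumed to hold the test accuracy on task i after learning task t (0-indexed, so row t has t + 1 entries); the function names are hypothetical.

```python
def average_accuracy(acc, t):
    """Mean test accuracy over tasks 0..t after learning task t."""
    return sum(acc[t][i] for i in range(t + 1)) / (t + 1)


def average_forgetting(acc):
    """Mean drop from each task's best accuracy to its accuracy after the last task."""
    num_tasks = len(acc)
    drops = []
    for i in range(num_tasks - 1):  # the last task cannot be forgotten yet
        best = max(acc[t][i] for t in range(i, num_tasks))
        drops.append(best - acc[num_tasks - 1][i])
    return sum(drops) / len(drops)
```

For example, with three tasks and `acc = [[0.9], [0.8, 0.85], [0.7, 0.8, 0.9]]`, the final average accuracy is 0.8 and the average forgetting is 0.125.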

Continual Learning Evaluation
Unsupervised CL methods improve accuracy with LUMP outperforming all methods.
[Figure: accuracy with SimSiam on Split CIFAR-100 and Split Tiny-ImageNet]

We observe similar gains for UCL with Barlow Twins across all the methods.
[Figure: accuracy with SimSiam and Barlow Twins on Split CIFAR-100 and Split Tiny-ImageNet]

Further, UCL significantly reduces catastrophic forgetting on all the datasets.
[Figure: forgetting with SimSiam and Barlow Twins on Split CIFAR-100 and Split Tiny-ImageNet]

Out of Distribution Evaluation
Evaluation on various OOD datasets also shows consistent improvements.
[Figure: accuracy of representations trained with Sequential CIFAR-100, evaluated on OOD
datasets (including FMNIST) using a KNN classifier]
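KNN evaluation on frozen representations can be sketched as below. This is an illustration only: features are assumed to be precomputed lists of floats from the frozen encoder, and the function names and the choice of `k` are hypothetical.

```python
from collections import Counter


def knn_predict(train_feats, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training features."""
    scored = sorted(
        ((sum((a - b) ** 2 for a, b in zip(f, query)), label)
         for f, label in zip(train_feats, train_labels)),
        key=lambda pair: pair[0],  # sort by squared Euclidean distance only
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=3):
    """Fraction of test features whose KNN prediction matches the true label."""
    correct = sum(
        knn_predict(train_feats, train_labels, q, k) == y
        for q, y in zip(test_feats, test_labels)
    )
    return correct / len(test_labels)
```

Since the encoder stays frozen, the accuracy reflects the quality of the learned representations rather than any classifier training.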

Few-Shot Training Evaluation
Next, we evaluate on a limited number of training instances, where UCL improves
accuracy and mitigates forgetting in comparison to supervised continual learning (SCL).
This is an outcome of the discriminative feature embeddings learned by UCL.
[Figure: few-shot training with Split CIFAR-100]

Visualization of Feature Space
SCL features are noisy and lack coherent patterns.
This is because SCL is prone to catastrophic forgetting, which hurts the representations of past tasks.

Visualization of Feature Space
UCL features are more relevant, with LUMP learning the most distinctive features.
We believe this is because UCL captures more than class-specific features, and captures
generic information independent of the class labels.

Visualization of Loss Landscape
UCL also obtains a flatter and smoother loss landscape than SCL.
This indicates that UCL is stable and robust to forgetting.

Similarity Analysis
Representations between two independent models are highly similar in the lower
layers, but are dissimilar for the higher modules.
[Figure: representation similarity between two UCL models, two SCL models, and UCL and SCL
models, for Finetune, SI, and DER]
This highlights that the difference in learning objective between SCL and UCL leads
to the difference in their learnt representations (mostly higher layers).

Conclusion
• We attempt to bridge the gap between continual learning and representation
learning and tackle two crucial problems: continual learning with
unlabelled data and representation learning on a sequence of tasks.
• We show that UCL achieves better performance than SCL due to its
characteristic ability to learn discriminative, human-perceptual patterns, which
makes it transfer better and be more robust to catastrophic forgetting.
• Furthermore, we propose Lifelong Unsupervised Mixup (LUMP) for UCL, which
further alleviates catastrophic forgetting and provides better interpretations.
• We believe that our paper can be an essential step toward continually
learning unsupervised representations.
Code available at https://github.com/divyam3897/UCL
