
learning with few data

bit.ly/2023-nldl-tutorial
Marcus Liwicki, Machine Learning
Luleå University of Technology



are you working on your PhD, or finished recently?

did you ever

feel insignificant
doubt your skills
or
feel unchallenged?
You are not alone!
Marcus Liwicki, Machine Learning
Luleå University of Technology
bit.ly/2023-nldl-tutorial
ELLIS member, WASP member
IEEE senior member, IAPR award winner, …
agenda

motivation
prior
approaches
end to end learning
transfer learning
clustering
representation learning
auto-encoding
contrastive learning
comparative summary
remarks on contrastive learning

and some spices in-between:


what I have learned during my life as a presenter
machine learning needs data

11
machine learning (ideal)

Data Labels

Priors

12
reality

Data

Priors

Labels

13
minimize human supervision
(diagram: Data, Labels, Priors — with the human-supervised parts minimized)

how?
1. adding more unlabeled data or synthetic data
2. incorporating more prior (knowledge)
14
there are so many priors hidden in structure

15
there are so many priors hidden in structure

including priors
92.15% (SotA 88.2%)

Better than
Google

16
prior

experience (from earlier experiments)


proven architectures, meta parameters, …

knowledge (human reasoning)


correlating the given input details and identifying discriminative features

data (intrinsic or human induced)


sequential correlation, local correlation
filenames, folder structures, taxonomies

x001-t14.xml
x001-t15.xml
17
time to learn something about presentations ;)
should we use dark background?
or white?
ok, enough of the torture

but why did so many of you torture each other?


Contrast is important
equity in the machine learning group

(group members: Marcus, Gustav, Pedro, Konstantina, Fotini, Christian, Kanjar, Vibha, Fredrik, Priyamvada, Saleha, György, Rajkumar, Oluwatosin, Homam, Mattias, Nosheen, Sana, Ali, András, Richa, Karl, Carl, Prakash, Lama, Elisa)

Notice something? Almost 40% women

machine learning for the welfare of society


thanks to previous and current PhDs

Michele Alberti, Vinay Pondenkandath, Gustav G. Pihlgren, Prakash Ch. Chhipa
overview of approaches

end to end learning


• transfer learning (A Survey on Deep Transfer Learning, 2018)
• Utilizes pretrained models and fine-tunes them on application-specific data
• Requires less data to fine-tune than training from scratch

• clustering (Deep Clustering for Unsupervised Learning of Visual Features, 2018)
• Labelled data not required

representation learning
• auto-encoding (Variational Autoencoder for Deep Learning of Images, Labels and Captions, 2016)
• Questionable if this is a good way to go (A Pitfall of Unsupervised Pre-Training, 2017)

• contrastive learning (SimCLR, July 2020; SwAV, October 2020)
• Pretraining mechanism that utilizes application-specific unlabeled data
• Also compute-intensive, but possible to scale down
25
transfer learning

Source: https://ruder.io/transfer-learning/
Source: https://machinelearningmastery.com/transfer-learning-for-deep-learning/

remarks
• successful, but only the initial layers with low-level features are common and useful across applications
• no way to exploit unlabeled data

26
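A minimal fine-tuning sketch of this idea, assuming a torchvision ResNet-18 backbone and a hypothetical application-specific train_loader with NUM_CLASSES labels (an illustration, not the cited surveys' exact recipe):

import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10   # placeholder for the application-specific number of classes

# start from ImageNet-pretrained weights (older torchvision versions use pretrained=True)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# freeze the backbone: the early layers carry the generic low-level features
for param in model.parameters():
    param.requires_grad = False

# replace the classification head and fine-tune only that part
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def finetune_epoch(train_loader):
    # train_loader is assumed to yield (images, labels) from the small labelled target set
    for images, labels in train_loader:
        loss = criterion(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()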
ImageNet pretraining works outside of natural images

footsteps for person identification


(88 % for 13 persons, previous SotA 77 %)

MS Singh, V Pondenkandath, B Zhou, P Lukowicz, M Liwicki


Transforming sensor data to the image domain for deep learning—An application to footstep detection, IJCNN 2017

27
ImageNet pre-training works often well

Linda Studer, Michele Alberti, Vinaychandran Pondenkandath, Pinar Goktepe, Thomas Kolonko, Andreas Fischer, Marcus Liwicki, Rolf Ingold:
A Comprehensive Study of ImageNet Pre-Training for Historical Document Image Analysis, ICDAR, 2019

28
shortcomings – ImageNet transfer learning
ImageNet-trained CNNs are biased towards texture
– Strongly biased towards recognizing textures rather than shapes
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2018, September). ImageNet-trained CNNs are biased towards texture; increasing shape bias
improves accuracy and robustness. In International Conference on Learning Representations.

29
ImageNet transfer learning in medical images
(diagram: transfer learning from ImageNet to the medical image domain — Retina DR and CheXpert datasets)

ImageNet transfer learning does not significantly affect performance on medical imaging tasks
– Ref: Transfusion: Understanding Transfer Learning for Medical Imaging
Raghu, M., Zhang, C., Kleinberg, J., & Bengio, S. (2019). Transfusion: Understanding transfer learning for medical imaging. Advances in Neural Information Processing Systems, 32.
– Task-specific learning: only the initial layers with low-level features are useful

Adapted from https://ai.googleblog.com/2019/12/understanding-transfer-learning-for.html

30
ImageNet transfer learning in histopathology
Sharma, Y., Ehsan, L., Syed, S., & Brown, D. E. (2021, July). HistoTransfer: Understanding Transfer Learning for Histopathology. In 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI) (pp. 1-4). IEEE.

Gastrointestinal, breast cancer


ImageNet vs. SSL

Why is supervised ImageNet transfer learning sub-optimal?

Possibly, the ImageNet-trained model is overfitted to natural scenes
and optimized for dataset-specific characteristics

31
clustering
group features with k-means and update the weights to optimize for these assignments

Source: https://neurohive.io/en/state-of-the-art/deep-clustering-approach/
remarks
• Compute-intensive when applied to images
• Non-robust feature representations when features are extracted with pretrained models

32
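A rough DeepCluster-style sketch of this loop (simplified, not the paper's exact recipe): extract features, cluster them with k-means, and train on the cluster IDs as pseudo-labels. encoder, classifier_head, and unlabeled_loader are assumed placeholders; classifier_head must output k logits.

import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deepcluster_epoch(encoder, classifier_head, unlabeled_loader, k=100, lr=0.05):
    # 1) feature extraction over the unlabeled set
    encoder.eval()
    batches, feats = [], []
    with torch.no_grad():
        for images in unlabeled_loader:
            batches.append(images)
            feats.append(encoder(images))
    feats = torch.cat(feats).cpu().numpy()

    # 2) group features with k-means; cluster assignments become pseudo-labels
    pseudo_labels = torch.as_tensor(KMeans(n_clusters=k).fit_predict(feats), dtype=torch.long)

    # 3) update the weights to optimize for these assignments
    encoder.train()
    params = list(encoder.parameters()) + list(classifier_head.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)
    criterion = nn.CrossEntropyLoss()
    start = 0
    for images in batches:
        targets = pseudo_labels[start:start + images.size(0)]
        start += images.size(0)
        loss = criterion(classifier_head(encoder(images)), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()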
agenda

motivation
prior
approaches
end to end learning
transfer learning
clustering
representation learning
auto-encoding – and alternatives
contrastive learning
comparative summary
remarks on contrastive learning

and some spices in-between:


what I have learned during my life as a presenter
Auto-Encoding – pre-training

INPUT → ENCODER → FEATURES → DECODER → OUTPUT

34
Auto-Encoding – classification

INPUT → ENCODER → FEATURES → CLASSIFIER → OUTPUT

“cat”

35
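A minimal sketch of these two stages, assuming flattened 28×28 inputs and 10 classes (illustrative sizes only): pre-train an auto-encoder without labels, then keep the encoder and attach a classifier head.

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784))

# stage 1: reconstruction-only pre-training (no labels needed)
ae_opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
def pretrain_step(x):                       # x: images, shape (B, 1, 28, 28)
    loss = F.mse_loss(decoder(encoder(x)), x.flatten(1))
    ae_opt.zero_grad(); loss.backward(); ae_opt.step()
    return loss.item()

# stage 2: drop the decoder, train a small classifier on the (few) labelled samples
classifier = nn.Linear(64, 10)              # e.g. 10 classes ("cat", ...)
clf_opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
def classify_step(x, y):
    with torch.no_grad():                   # keep the encoder frozen, or fine-tune it as well
        features = encoder(x)
    loss = F.cross_entropy(classifier(features), y)
    clf_opt.zero_grad(); loss.backward(); clf_opt.step()
    return loss.item()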
a pitfall of unsupervised pre-training, 2017

a good auto-encoder (low reconstruction error) does not necessarily lead to better accuracy

alternative: use PCA or LDA for initialization

Will they converge? No! Better local minima?

Michele Alberti, Mathias Seuret, Vinaychandran Pondenkandath, Rolf Ingold, Marcus Liwicki
Historical Document Image Segmentation with LDA-Initialized Deep Neural Networks. ICDAR 2017
37
auto-encoding limitation

what we want what we might get

38
variational auto-encoders

X → Encoder → N(μ, σ²) → z → Decoder → X′

Kingma, Diederik P., and Max Welling. "Auto-Encoding Variational Bayes", 2013
40
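A sketch of this forward pass, assuming flattened inputs and illustrative layer sizes: the encoder outputs μ and log σ², z is sampled with the reparameterization trick, and the decoder reconstructs X′.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, hidden=256, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # KL divergence between N(mu, sigma^2) and N(0, I), as in Kingma & Welling (2013)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl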
perceptual loss

X → Encoder → z → Decoder → X′
X′ → another neural network → y′
X → the same other neural network → y
the loss is computed between y and y′ instead of between X and X′

Thorough investigation: Improving Image Autoencoder Embeddings with Perceptual Loss, 2020
and Oskar Sjögren (yesterday)
41
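A sketch of this perceptual-loss idea: both X and X′ are passed through another, frozen pretrained network, and the loss compares its feature maps y and y′. The choice of VGG16 features is an assumption for illustration, not necessarily the cited paper's setup.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

# frozen feature extractor ("another neural network" in the diagram)
# older torchvision versions use pretrained=True instead of weights=...
feature_net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16]
feature_net.eval()
for p in feature_net.parameters():
    p.requires_grad = False

def perceptual_loss(x, x_hat):
    # x, x_hat: image batches of shape (B, 3, H, W)
    y = feature_net(x)
    y_hat = feature_net(x_hat)
    return F.mse_loss(y_hat, y)

# used in place of (or in addition to) the pixel-wise reconstruction loss
# when training the auto-encoder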
try it out …

bit.ly/2023-nldl-tutorial
https://github.com/guspih/Perceptual-Autoencoders

https://github.com/guspih/Perceptual-Encoding

https://github.com/guspih/deep_perceptual_similarity_analysis

43
Contrastive Learning (CL)

Self-supervised method: allows the model to learn generic representations from unlabeled data

Method:
learn similarity between augmented representations of the same image
learn dissimilarity otherwise

Source: https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html
44
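A compact sketch of the NT-Xent loss used in SimCLR-style contrastive learning (a simplified illustration of the method above): the two augmented views of an image form the positive pair, and all other samples in the batch act as negatives. encoder, projector, and aug in the usage comment are assumed placeholders.

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    # z1, z2: projections of the two augmented views, shape (B, D)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2B, D)
    sim = z @ z.t() / temperature                             # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                         # exclude self-similarity
    n = z.size(0)
    targets = torch.arange(n, device=z.device).roll(n // 2)   # the positive of i sits B positions away
    return F.cross_entropy(sim, targets)

# usage sketch:
# z1 = projector(encoder(aug(images)))
# z2 = projector(encoder(aug(images)))
# loss = nt_xent(z1, z2)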
(not so) recent work in Contrastive Learning

Simple Framework for Contrastive Learning (SimCLR)
A Simple Framework for Contrastive Learning of Visual Representations (SimCLR v1), ICML 2020
Big Self-Supervised Models are Strong Semi-Supervised Learners (SimCLR v2), NeurIPS 2020

Momentum Contrast Learning (MoCo)
Momentum Contrast for Unsupervised Visual Representation Learning (MoCo v1), CVPR 2020
Improved Baselines with Momentum Contrastive Learning (MoCo v2), arXiv 2020

Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (BYOL), NeurIPS 2020

Contrastive Learning with Clustering
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (SwAV), arXiv 2020

45
Comparative Summary on SOTA

Contrastive Learning

Clustering + Self-supervised

Self-Labelling

Source (IARAI): https://www.youtube.com/watch?v=Bn66HnBxXFM

• Remarks
• Priors (the augmentation mechanism) are more important than the learning method
• Obtains performance approximately equal to supervised methods with 10% labelled data
it’s easy on natural images

distorted views (augmented views) of input visual


Human prior for visual        Relevant augmentation
Size                          Resize
Shape                         Crop, Flip
Foreground-Background         Blur, Noise, Color schemes, filtering
Angle                         Flip, Rotation
Color spectrum                Contrast, saturation
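A sketch of an augmentation pipeline that encodes the human priors in the table above and produces two distorted views of the same image for contrastive learning (parameter values are illustrative, roughly following SimCLR-style defaults).

import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224),                                    # size + shape priors
    T.RandomHorizontalFlip(),                                    # shape / angle priors
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),   # color spectrum prior
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),                              # foreground/background prior
    T.ToTensor(),
])

def two_views(pil_image):
    # two independent augmentations of the same image form a positive pair
    return augment(pil_image), augment(pil_image)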


but does not work in other domains

distorted views (augmented views) of the input visual
(same table of human priors and relevant augmentations as on the previous slide)

medical images, remote sensing imagery, non-obvious visual concepts

insufficiency of human priors for generating distorted views

49
use two views of same patient

Azizi, S., Mustafa, B., Ryan, F., Beaver, Z., Freyberg, J., Deaton, J., ... & Norouzi, M. (2021). Big self-supervised models advance medical image classification. In
Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3478-3488).

50
but wait … did we just use labels ?

51
our approach: shifting focus from human prior to data prior

(diagram comparing three setups:
supervised approach — data + labels + human priors, with the goal of minimizing human supervision;
self-supervised approach on natural visual concepts — data + human priors (augmentations) + few labels;
adapting the self-supervised approach to a specialized domain — reduce the human prior (augmentation) and incorporate the data prior)

52
let us use the data prior

data prior: the magnification levels (in the BreakHis data) are utilized to generate both views for the SSL input

the only human prior used is magnification sampling

achieves state-of-the-art classification results with only 20% of the labels

Chhipa, P. C., Upadhyay, R., Pihlgren, G. G., Saini, R., Uchida, S., & Liwicki, M. (2022). Magnification Prior: A Self-Supervised Method for Learning Representations on
Breast Cancer Histopathological Images. arXiv preprint arXiv:2203.07707.

53
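A hypothetical sketch of this magnification-prior idea: the two views for the contrastive objective come from two magnification levels of the same sample instead of from handcrafted augmentations. The slides dictionary (sample_id -> {magnification: image tensor}) and the reuse of the NT-Xent sketch from earlier are assumptions for illustration, not the paper's exact pipeline.

import random

MAGNIFICATIONS = (40, 100, 200, 400)       # BreakHis magnification factors

def sample_positive_pair(slides, sample_id):
    # the only human prior here is how the two magnification levels are sampled
    m1, m2 = random.sample(MAGNIFICATIONS, 2)
    return slides[sample_id][m1], slides[sample_id][m2]

# the two views then feed the usual contrastive objective, e.g.:
# view1, view2 = sample_positive_pair(slides, sample_id)
# loss = nt_xent(projector(encoder(view1)), projector(encoder(view2)))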
ideas for data prior

temporal proximity

spatial proximity

sequential co-occurrence (BERT)

different modalities

more ?

54
curious what more we can learn about presentation techniques?

btw, should we use slide numbers?


typical issues I observe at scientific conferences:

unconfident posture
filler sounds
angle and interaction
agenda

motivation
prior
approaches
end to end learning
transfer learning
clustering
representation learning
auto-encoding
contrastive learning
comparative summary
remarks on contrastive learning

and some spices in-between:

what I have learned during my life as a presenter
97’123’452
summary

end to end learning


• transfer learning
• clustering

representation learning
• auto-encoding
• PCA, LDA
• perceptual loss
• contrastive learning

meta learning (not covered today)

63
remarks on contrastive learning
SimCLR v1.0
Key factor: K1 (similarity learning for positive pairs) + K2 (dissimilarity learning for negative pairs)
Contribution: established benchmark performance for unsupervised contrastive learning
Limitations: 1. large batch size due to positive + negative pairs; 2. heavy gradient computation and backpropagation over all (+ve and -ve) pairs

SimCLR v2.0
Key factor: K1 + K2 on a task-agnostic big network, which is then distilled into a task-specific small network
Contribution: added enablement of semi-supervised learning through distillation
Limitations: same as SimCLR v1.0, plus the use of bigger networks

MoCo v1.0
Key factor: K1 + K2 over a momentum encoder, where CL works as a dynamic dictionary lookup
Contribution: showed unsupervised contrastive learning with a smaller batch size and less backpropagation of gradients
Limitations: 1. heavy gradient computation and backpropagation over all (+ve and -ve) pairs (as in SimCLR, because the q-encoder backpropagates); 2. overhead of the dynamic dictionary queue

MoCo v2.0
Key factor: MoCo v1.0 + a 2-layer MLP projection head
Contribution: stronger baseline, outperformed SimCLR and MoCo v1.0
Limitations: 1. heavy gradient computation and backpropagation over all (+ve and -ve) pairs; 2. overhead of the dynamic dictionary queue

BYOL
Key factor: K1 + momentum encoding + two separate networks (online and target)
Contribution: achieves self-supervised CL without negative pairs; establishes benchmarks in the semi-supervised setting; robust to smaller batch sizes
Limitations: 1. complex pipeline with many components, which makes the concept challenging to utilize

SwAV
Key factor: K1 + "swapped" prediction mechanism + cluster assignment
Contribution: achieves self-supervised CL without negative pairs; claims state of the art in unsupervised image clustering
Limitations: 1. relatively complex loss computation due to swapped prediction; 2. additional online cluster assignment swapping

DINO
Key factor: distillation, transformers
Contribution: self-attention without supervision; moderate computational power
Limitations: 1. more research required; 2. authors are not self-critical

Barlow Twins
Key factor: redundancy reduction
Contribution: minimize covariance across embedding dimensions; maximize invariance across samples
64
remarks on contrastive learning

CL is leading self-supervision and is a potential push for semi-supervised learning

CL in its current state is compute-intensive

65
batch size is huge

SimCLR: performance increases with a batch size of 2048
reason: large number of negative pairs
requires an array of GPUs and sophisticated parallel processing

knowledge-distillation-style methods (BYOL 2020, SimSiam 2020) do not use negative pairs
batch size 512
however, the embedding output size is in the range of 4096

for non-natural images, a smaller batch size (128) is already good
reason: not RGB images, but simpler
68
Remarks on Contrastive Learning

CL is leading self-supervision and is a potential push for semi-supervised learning

CL in its current state is compute-intensive (batch size, negative pairs, and gradients), which makes direct (as-is) application challenging. It needs to be tailored (research potential) to custom and small-scale application requirements.
Contrastive methods are sensitive to the choice of image/data augmentation.
Leverage application-specific but unlabeled data.

CL can serve as a benchmarking framework (different methods for different applications) for semi-supervised and even supervised tasks.

69
thanks to my colleagues

there is so much more I could share: bit.ly/2023-nldl-tutorial

https://irdta.eu/deeplearn/2023su/
