Masked Autoencoders
Are Scalable Vision Learners
Kaiming He et al., “Masked Autoencoders Are Scalable Vision Learners”
14th November, 2021
PR12 Paper Review
JinWon Lee
Introduction
• Deep learning has witnessed an explosion of architectures of
continuously growing capability and capacity.
• Aided by the rapid gains in hardware, models today can easily overfit
one million images and begin to demand hundreds of millions of—
often publicly inaccessible—labeled images.
• This appetite for data has been successfully addressed in natural
language processing (NLP) by self-supervised pretraining.
Introduction
• The solutions, based on autoregressive language modeling in GPT and
masked autoencoding in BERT, are conceptually simple: they remove
a portion of the data and learn to predict the removed content.
• These methods now enable training of generalizable NLP models
containing over one hundred billion parameters.
Introduction
• The idea of masked autoencoders, a form of more general denoising
autoencoders, is natural and applicable in computer vision as well.
• However, despite significant interest in this idea following the success
of BERT, progress of autoencoding methods in vision lags behind NLP.
What makes masked autoencoding different
between vision and language?
• Until recently, architectures were different. In vision, convolutional
networks were dominant over the last decade.
• Convolutions typically operate on regular grids and it is not
straightforward to integrate ‘indicators’ such as mask tokens or
positional embeddings into convolutional networks.
• This architectural gap, however, has been addressed with the
introduction of Vision Transformers (ViT) and should no longer
present an obstacle.
What makes masked autoencoding different
between vision and language?
• Information density is different between language and vision.
• Languages are human-generated signals that are highly semantic and
information-dense.
• When training a model to predict only a few missing words per
sentence, this task appears to induce sophisticated language
understanding.
• Images, on the contrary, are natural signals with heavy spatial
redundancy—e.g., a missing patch can be recovered from neighboring
patches with little high-level understanding of parts, objects, and
scenes.
What makes masked autoencoding different
between vision and language?
• The autoencoder’s decoder, which maps the latent representation
back to the input, plays a different role between reconstructing text
and images.
• In vision, the decoder reconstructs pixels, hence its output is of a
lower semantic level than common recognition tasks.
• This is in contrast to language, where the decoder predicts missing
words that contain rich semantic information.
• While in BERT the decoder can be trivial (an MLP), we found that for
images, the decoder design plays a key role in determining the
semantic level of the learned latent representations.
Related Work
• Masked language modeling
▪ BERT and GPT are highly successful methods for pre-training in NLP.
▪ These methods hold out a portion of the input sequence and train models to
predict the missing content.
▪ These methods have been shown to scale excellently, and abundant evidence indicates that these pre-trained representations generalize well to various downstream tasks.
Related Work
• Autoencoding
▪ It has an encoder that maps an input to a latent representation and a decoder
that reconstructs the input.
▪ Denoising autoencoders (DAE) are a class of autoencoders that corrupt an
input signal and learn to reconstruct the original, uncorrupted signal.
▪ A series of methods can be thought of as a generalized DAE under different
corruptions, e.g., masking pixels or removing color channels.
Related Work
• Masked image encoding
▪ Pioneering work presents masking as a noise type in a DAE.
▪ Context Encoder inpaints large missing regions using convolutional networks.
▪ Motivated by the success in NLP, related recent methods are based on
Transformers. iGPT operates on sequences of pixels and predicts unknown
pixels. The ViT paper studies masked patch prediction for self-supervised
learning. Most recently, BEiT proposes to predict discrete tokens.
BEiT: BERT Pre-Training of Image Transformers
Related Work
• Self-supervised learning (SSL)
▪ SSL approaches have seen significant interest in computer vision, often
focusing on different pretext tasks for pre-training.
▪ Recently, contrastive learning has been popular, which models image
similarity and dissimilarity (or only similarity) between two or more views.
▪ Contrastive and related methods strongly depend on data augmentation.
▪ Autoencoding pursues a conceptually different direction, and it exhibits
different behaviors.
Approach
• The proposed masked autoencoder (MAE) is a simple autoencoding approach that reconstructs the original signal given its partial observation.
• Like all autoencoders, MAE has an encoder that maps the observed
signal to a latent representation, and a decoder that reconstructs the
original signal from the latent representation.
• Unlike classical autoencoders, the authors adopt an asymmetric design in which the encoder operates only on the partial, observed signal (without mask tokens), while a lightweight decoder reconstructs the full signal from the latent representation and mask tokens.
MAE Architecture
• During pre-training, a large random
subset of image patches (e.g., 75%)
is masked out.
• The encoder is applied to the small
subset of visible patches.
• Mask tokens are introduced after
the encoder, and the full set of
encoded patches and mask tokens
is processed by a small decoder
that reconstructs the original
image in pixels.
• After pre-training, the decoder is
discarded and the encoder is
applied to uncorrupted images to
produce representations for
recognition tasks.
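To make the asymmetric flow concrete, here is a minimal PyTorch-style sketch of the pipeline. It is an illustration only: the module choices, dimensions, and names (TinyMAE, enc_to_dec, pred) are assumptions and not the authors' released implementation, and the keep/restore indices come from the random masking sketched later under "Simple implementation".

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Illustrative MAE skeleton: encode visible patches only, decode the full token set."""
    def __init__(self, num_patches=196, patch_dim=16 * 16 * 3, dim=768, dec_dim=512):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)                 # linear patch projection
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), num_layers=12)
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))   # shared, learned vector
        self.dec_pos_embed = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True), num_layers=2)
        self.pred = nn.Linear(dec_dim, patch_dim)                    # predict pixels per patch

    def forward(self, patches, keep_idx, restore_idx):
        # patches: (B, N, patch_dim); keep_idx/restore_idx: long tensors from random masking
        B, N, _ = patches.shape
        x = self.patch_embed(patches) + self.pos_embed
        x_vis = torch.gather(x, 1, keep_idx[..., None].expand(-1, -1, x.size(-1)))
        z = self.enc_to_dec(self.encoder(x_vis))                     # encoder sees visible tokens only
        mask_tokens = self.mask_token.expand(B, N - z.size(1), -1)
        full = torch.cat([z, mask_tokens], dim=1)                    # [encoded visible, mask tokens]
        full = torch.gather(full, 1, restore_idx[..., None].expand(-1, -1, full.size(-1)))
        full = full + self.dec_pos_embed                             # positions added to all tokens
        return self.pred(self.decoder(full))                         # reconstruction in pixel space
```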
Results
MAE Details
• Masking
▪ Following ViT, an image is divided into regular non-overlapping patches. Then
a subset of patches is sampled and masked.
▪ Sampling strategy is straightforward: random patches without replacement,
following a uniform distribution.
▪ Random sampling with a high masking ratio largely eliminates redundancy,
thus creating a task that cannot be easily solved by extrapolation from visible
neighboring patches.
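For reference, the ViT-style patching step this builds on might look as follows (a sketch; the helper name and the assumption of a square patch size are mine). The random selection of visible patches is sketched later under "Simple implementation".

```python
import torch

def patchify(imgs, p=16):
    # (B, 3, H, W) -> (B, N, p*p*3): regular, non-overlapping patches as in ViT
    B, C, H, W = imgs.shape
    h, w = H // p, W // p
    x = imgs.reshape(B, C, h, p, w, p)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, p * p * C)
    return x
```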
MAE Details
• MAE encoder
▪ MAE encoder is a ViT but applied only on visible, unmasked patches.
▪ Just as in a standard ViT, MAE encoder embeds patches by a linear projection
with added positional embeddings.
▪ However, this encoder only operates on a small subset (e.g., 25%) of the full
set.
▪ This can reduce overall pre-training time by 3x or more and likewise reduce
memory consumption, enabling us to easily scale MAE to large models.
MAE Details
• MAE decoder
▪ The input to the MAE decoder is the full set of tokens consisting of (i)
encoded visible patches, and (ii) mask tokens.
▪ Each mask token is a shared, learned vector that indicates the presence of a
missing patch to be predicted.
▪ Positional embeddings are added to all tokens in this full set.
▪ The MAE decoder is only used during pre-training to perform the image reconstruction task, so the decoder architecture can be flexibly designed in a manner that is independent of the encoder design.
▪ MAE’s default decoder has <10% computation per token vs. the encoder. With this asymmetric design, the full set of tokens is processed only by the lightweight decoder, which significantly reduces pre-training time.
MAE Details
• Reconstruction target
▪ MAE reconstructs the input by predicting the pixel values for each masked
patch.
▪ The loss function computes the mean squared error (MSE) between the reconstructed and original images in pixel space. The loss is computed only on masked patches, similar to BERT.
• Simple implementation
▪ First, generate a token for every input patch (by linear projection with an
added positional embedding). Next, randomly shuffle the list of tokens and
remove the last portion of the list, based on the masking ratio.
▪ After encoding, append a list of mask tokens to the list of encoded patches,
and unshuffle this full list (inverting the random shuffle operation) to align all
tokens with their targets.
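A sketch of this shuffle/unshuffle bookkeeping, together with the masked-patch MSE loss from the previous slide (function and variable names are illustrative, not the released code):

```python
import torch

def random_masking_indices(B, N, mask_ratio=0.75):
    # shuffle patch order per image, keep the first part, and remember how to undo the shuffle
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation per image
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation ("unshuffle")
    ids_keep = ids_shuffle[:, :num_keep]             # visible patches, sampled without replacement
    mask = torch.ones(B, N)
    mask[:, :num_keep] = 0                           # 0 = visible, 1 = masked (shuffled order)
    mask = torch.gather(mask, 1, ids_restore)        # back to the original patch order
    return ids_keep, ids_restore, mask

def masked_mse(pred, target, mask):
    # pred/target: (B, N, p*p*3); mask: (B, N) with 1 on masked patches
    per_patch = ((pred - target) ** 2).mean(dim=-1)  # MSE per patch
    return (per_patch * mask).sum() / mask.sum()     # average over masked patches only
```

With the TinyMAE sketch above, ids_keep and ids_restore would be passed as keep_idx and restore_idx.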
ImageNet Experiments
• The authors do self-supervised pre-training on the ImageNet-1K (IN1K)
training set.
• Then they do supervised training to evaluate the representations with
(i) end-to-end fine-tuning or (ii) linear probing.
• Baseline: ViT-Large
▪ ViT-Large (ViT-L/16) is used as the backbone in the ablation study.
▪ ViT-L is very big and tends to overfit.
▪ It is nontrivial to train supervised ViT-L from scratch and a good recipe with
strong regularization is needed.
Masking Ratio
• The optimal ratios are surprisingly high.
• 75% is good for both linear probing and
fine-tuning.
• This is in contrast with BERT (15%) and also much higher than those in related works in CV (20%–50%).
• Reasoning-like behavior is linked to the
learning of useful representations.
• For linear probing, the accuracy increases
steadily until the sweet point, but for fine-
tuning, the results are less sensitive to the
ratios.
• All fine-tuning results are better than training from scratch (82.5%).
Decoder Design
• A sufficiently deep decoder is important for linear probing. This can be explained
by the gap between a pixel reconstruction task and a recognition task: the last
several layers in an autoencoder are more specialized for reconstruction, but are
less relevant for recognition.
• However, if fine-tuning is used, the last layers of the encoder can be tuned to
adapt to the recognition task. The decoder depth is less influential.
• Interestingly, MAE with a single-block decoder can perform strongly with fine-
tuning (84.8%). Note that a single Transformer block is the minimal requirement
to propagate information from visible tokens to mask tokens.
• Overall, the default MAE decoder is lightweight. It has 8 blocks and a width of 512-d. It has only 9% FLOPs per token vs. ViT-L (24 blocks, 1024-d).
Mask Token
• If the encoder uses mask tokens, it
performs worse.
• In this case, there is a gap between
pre-training and deploying: this
encoder has a large portion of mask
tokens in its input in pretraining,
which does not exist in uncorrupted
images. This gap may degrade
accuracy in deployment.
Mask Token
• By skipping the mask token in the encoder,
training computation is greatly reduced.
• Note that the speedup can be >4x for a
masking ratio of 75%, partially because the
self-attention complexity is quadratic.
• In addition, memory is greatly reduced, which
can enable training even larger models or
speeding up more by large-batch training.
• The time and memory efficiency makes MAE
favorable for training very large models.
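As a rough back-of-envelope check on this, using standard Transformer FLOP approximations (the constants below are my assumptions, not measured numbers from the paper):

```python
# Per-block FLOPs for a ViT-L-like encoder (d = 1024, 14x14 = 196 patches):
# QKV/output projections ~ 4*N*d^2, attention scores ~ 2*N^2*d, MLP ~ 8*N*d^2.
def block_flops(N, d=1024):
    return 4 * N * d**2 + 2 * N**2 * d + 8 * N * d**2

full = block_flops(196)      # encoder that also processes mask tokens
visible = block_flops(49)    # encoder that sees only the 25% visible patches
print(full / visible)        # ~4.1: slightly more than 4x, because attention is quadratic in N
```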
Reconstruction Target
• Using pixels with normalization improves accuracy. This per-patch normalization
enhances the contrast locally.
• In another variant, the authors perform PCA in the patch space and use the largest PCA
coefficients (96 here) as the target. Doing so degrades accuracy.
• The authors also compare an MAE variant that predicts tokens, the target used in BEiT. Specifically for this variant, the DALLE pre-trained dVAE is used as the tokenizer, following BEiT.
• This tokenization improves fine-tuning accuracy vs. unnormalized pixels, but has no
advantage vs. normalized pixels.
• The dVAE tokenizer requires one more pre-training stage, which may depend on extra
data (250M images). The dVAE encoder is a large convolutional network (40% FLOPs of
ViT-L) and adds nontrivial overhead. Using pixels does not suffer from these problems.
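A minimal sketch of the per-patch normalized target (the epsilon value and helper name are assumptions):

```python
import torch

def normalized_pixel_target(patches, eps=1e-6):
    # patches: (B, N, p*p*3); normalize each patch by its own mean and standard deviation
    mean = patches.mean(dim=-1, keepdim=True)
    var = patches.var(dim=-1, keepdim=True)
    return (patches - mean) / (var + eps).sqrt()
```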
Data Augmentation
• MAE works well using cropping-only augmentation, either fixed-size or
random-size (both having random horizontal flipping).
• Adding color jittering degrades the results.
• Surprisingly, MAE behaves decently even if using no data augmentation
(only center-crop, no flipping). This property is dramatically different from
contrastive learning and related methods, which heavily rely on data
augmentation.
• In MAE, the role of data augmentation is mainly performed by random
masking. The masks are different for each iteration and so they generate
new training samples regardless of data augmentation.
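For illustration, a cropping-only pipeline of this kind could be written with torchvision as below; the crop scale range and normalization statistics are common defaults and are assumptions here, not necessarily the paper's exact settings.

```python
from torchvision import transforms

# random-size crop + horizontal flip only; no color jittering
pretrain_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```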
Mask Sampling Strategy
• MAE with block-wise masking works reasonably well at a ratio of 50%,
but degrades at a ratio of 75%.
• Grid-wise sampling regularly keeps one of every four patches; this is an easier task and has lower training loss. The reconstruction is sharper, but the representation quality is lower.
• Simple random sampling works the best for MAE. It allows for a
higher masking ratio, which provides a greater speedup benefit.
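A small sketch of the grid-wise variant for a 14x14 patch grid (layout and helper name are assumptions):

```python
import torch

def grid_keep_indices(h=14, w=14):
    # keep one patch from every 2x2 cell: 25% visible, 75% masked on a regular grid
    idx = torch.arange(h * w).reshape(h, w)
    return idx[0::2, 0::2].reshape(-1)   # flat indices of the kept patches
```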
Training Schedule
• The accuracy improves steadily with longer training. Indeed, saturation of linear probing accuracy has not been observed even at 1600 epochs.
• This behavior is unlike contrastive learning
methods, e.g., MoCo v3 saturates at 300
epochs for ViT-L.
• Note that the MAE encoder only sees 25% of patches per epoch, while in contrastive learning the encoder sees 200% (two crops) or even more (multi-crop) of the patches per epoch.
Comparisons with Self-supervised Methods
• For ViT-B, all methods perform closely.
• For ViT-L, the gaps among methods are bigger, suggesting that a challenge for bigger models is to reduce overfitting.
• MAE can scale up easily and has shown steady
improvement from bigger models.
• By fine-tuning at a 448 input size, MAE achieves 87.8% accuracy, using only IN1K data. The previous best accuracy, among all methods using only IN1K data, is 87.1% (512 size), based on advanced networks.
• Comparing with BEiT, MAE is more accurate while
being simpler and faster.
Comparisons with Supervised Pre-training
• In the original ViT paper, ViT-L degrades when trained on IN1K. The authors' improved supervised recipe works better for training from scratch, but the accuracy saturates.
• MAE pre-training, using only IN1K, can
generalize better: the gain over training
from scratch is bigger for higher-capacity
models.
• It follows a trend similar to the JFT-300M
supervised pre-training. This comparison
shows that MAE can help scale up model
sizes.
Partial Fine-tuning
• Table 1 shows that linear probing and fine-tuning results are largely uncorrelated.
• Linear probing has been a popular protocol in the past few years; however, it
misses the opportunity of pursuing strong but non-linear features—which is
indeed a strength of deep learning.
• As a middle ground, we study a partial fine-tuning protocol: fine-tune the last
several layers while freezing the others.
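A sketch of this protocol, assuming a timm-style ViT object where model.blocks holds the Transformer blocks and model.head is the classifier (these attribute names are assumptions about the model, not something specified in the paper):

```python
def partial_finetune(model, num_tunable_blocks=1):
    # freeze everything, then unfreeze only the last few blocks and the classification head
    for p in model.parameters():
        p.requires_grad = False
    if num_tunable_blocks > 0:                      # 0 tunable blocks ~ linear probing
        for block in model.blocks[-num_tunable_blocks:]:
            for p in block.parameters():
                p.requires_grad = True
    for p in model.head.parameters():
        p.requires_grad = True
    return model
```

Setting num_tunable_blocks to 0 recovers linear probing; 1 corresponds to the one-block setting discussed on the next slide.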
Partial Fine-tuning
• Notably, fine-tuning only one Transformer block
boosts the accuracy significantly from 73.5% to
81.0%. Moreover, if we fine-tune only “half” of
the last block (i.e., its MLP sub-block), we can get
79.1%, much better than linear probing.
• Compared with MoCo v3, a contrastive method: MoCo v3 has higher linear probing accuracy than MAE; however, all of its partial fine-tuning results are worse than MAE's.
• These results show that the MAE representations
are less linearly separable, but they are stronger
non-linear features and perform well when a non-
linear head is tuned. These observations suggest
that linear separability is not the sole metric for
evaluating representation quality.
Transfer Learning Experiments
• Object detection and instance segmentation
▪ Mask R-CNN is fine-tuned on COCO. The ViT backbone is adapted for use with
FPN.
• Semantic segmentation
▪ Experiments on ADE20K use UperNet following the code in BEiT.
Transfer Learning Experiments
• Pixels vs. tokens
▪ While using dVAE tokens is better than using unnormalized pixels, it is statistically similar to just using normalized pixels across all tasks and models. This again shows that tokenization is not necessary for MAE.
Discussion and Conclusion
• Simple algorithms that scale well are the core of deep learning.
• In NLP, simple self-supervised learning methods enable benefits from
exponentially scaling models. In computer vision, practical pre-
training paradigms are dominantly supervised despite progress in
self-supervised learning.
• Self-supervised learning in vision may now be embarking on a similar trajectory as in NLP.
Discussion and Conclusion
• On the other hand, images and languages are signals of a different
nature and this difference must be addressed carefully. Images are
merely recorded light without a semantic decomposition into the
visual analogue of words.
• MAE removes random patches that most likely do not form semantic segments and reconstructs pixels, which are not semantic entities. Nevertheless, MAE infers complex, holistic reconstructions, suggesting that it has learned numerous visual concepts, i.e., semantics.
• The authors hypothesize that this behavior occurs by way of a rich
hidden representation inside the MAE.
Thank you