Vision Transformers (ViT) in Image Recognition - Full Guide
Transformer models have become the de-facto standard in Natural Language Processing (NLP). In computer vision research, there has recently been a surge of interest in Vision Transformers (ViTs) and Multilayer Perceptrons (MLPs) (https://viso.ai/deep-learning/deep-neural-network-three-popular-types/).
While the Transformer architecture has become the standard for tasks involving Natural Language Processing (NLP) (https://viso.ai/deep-learning/natural-language-processing/), its applications in Computer Vision (CV) (https://viso.ai/computer-vision/what-is-computer-vision/) remain relatively limited. In computer vision, attention is either used in conjunction with convolutional neural networks (CNNs) or used to replace certain components of convolutional networks while keeping their overall structure intact. Popular image recognition algorithms include ResNet (https://viso.ai/deep-learning/resnet-residual-neural-network/), VGG (https://viso.ai/deep-learning/vgg-very-deep-convolutional-networks/), YOLOv3 (https://viso.ai/deep-learning/yolov3-overview/), and YOLOv7 (https://viso.ai/deep-learning/yolov7-guide/).
However, this dependency on CNNs is not mandatory, and a pure transformer applied directly to sequences of image patches can perform exceptionally well on image classification (https://viso.ai/computer-vision/image-classification/) tasks.
The fine-tuning code and pre-trained ViT models are available on the GitHub of Google Research. You can find them here (https://github.com/google-research/vision_transformer). The ViT models were pre-trained on the ImageNet and ImageNet-21k datasets.
Vision Transformer (ViT) achieves remarkable results compared to convolutional neural networks (CNNs) while requiring fewer computational resources for pre-training. Compared to CNNs, Vision Transformers show a generally weaker inductive bias, resulting in an increased reliance on model regularization or data augmentation (https://viso.ai/computer-vision/image-data-augmentation-for-computer-vision/) (AugReg) when training on smaller datasets.
The ViT is a visual model based on the transformer architecture originally designed for text-based tasks. The ViT model represents an input image as a sequence of image patches, analogous to the sequence of word embeddings used when applying transformers to text, and directly predicts class labels for the image. ViT exhibits extraordinary performance when trained on enough data, surpassing a comparable state-of-the-art CNN while using 4x fewer computational resources.
Transformers have seen high success rates in NLP and are now also applied to images for image recognition tasks. A CNN operates on arrays of pixels, whereas a ViT splits the image into visual tokens: the vision transformer divides an image into fixed-size patches, embeds each of them, and adds positional embeddings before passing the sequence to the transformer encoder. Moreover, ViT models can outperform CNNs (https://arxiv.org/abs/2105.07581) by almost a factor of four in computational efficiency and accuracy.
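To make the patch-and-position step concrete, here is a minimal PyTorch-style sketch of patch embedding. The class name, the default sizes (224x224 images, 16x16 patches, 768-dimensional embeddings), and the learnable class token are assumptions chosen to mirror a typical ViT-Base setup, not code taken from the official repository.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project them to token embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to slicing patches and applying a linear layer.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable class token plus one positional embedding per token.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, embed_dim) sequence of patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend the class token
        return x + self.pos_embed               # add positional information
```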
The self-attention layers in ViT make it possible to embed information globally across the entire image. The model also learns from the training data to encode the relative positions of the image patches, allowing it to reconstruct the structure of the image.
Multi-Head Self-Attention Layer (MSA): This layer concatenates the outputs of all attention heads and linearly projects them to the right dimensions. The multiple attention heads help the model learn local and global dependencies in an image.
Multi-Layer Perceptron (MLP) Layer: This layer contains two fully connected layers with a Gaussian Error Linear Unit (GELU) non-linearity between them.
Layer Norm (LN): This is added prior to each block, as it does not introduce any new dependencies between the training images. It thereby helps improve training time and overall performance.
Moreover, residual connections are included after each block as they allow the components to
flow through the network directly without passing through non-linear activations.
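Putting these components together, one encoder block can be sketched as follows. This is a generic pre-norm transformer block written with standard PyTorch modules, with ViT-Base layer sizes assumed for illustration rather than taken from a specific published implementation.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder block: LayerNorm -> multi-head self-attention -> residual,
    then LayerNorm -> two-layer MLP with GELU -> residual."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        hidden = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, x):                 # x: (batch, num_tokens, embed_dim)
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y)  # self-attention over all tokens
        x = x + attn_out                  # residual connection around attention
        x = x + self.mlp(self.norm2(x))   # residual connection around the MLP
        return x
```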
In the case of image classification, an MLP head implements the classification head. It does so with one hidden layer at pre-training time and a single linear layer at fine-tuning time.
Raw images (left) with attention maps of the ViT-S/16 model (right). – Source
(https://arxiv.org/abs/2106.01548)
Attention, more specifically self-attention, is one of the essential building blocks of machine learning transformers. It is a computational primitive used to quantify pairwise entity interactions, helping a network learn the hierarchies and alignments present in the input data. Attention has proven to be a key element for vision networks to achieve higher robustness.
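As a point of reference, the pairwise interaction that self-attention quantifies can be written in a few lines. This is the standard scaled dot-product formulation, and it assumes the query, key, and value projections have already been applied.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, num_tokens, head_dim). Each output token is a weighted
    sum of all value vectors, with weights given by query-key similarity."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # pairwise token interactions
    weights = torch.softmax(scores, dim=-1)                   # normalize over all tokens
    return weights @ v
```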
The overall architecture of the vision transformer model is given as follows in a step-by-step manner:
1. Split an image into patches of fixed size.
2. Flatten the image patches.
3. Create lower-dimensional linear embeddings from the flattened patches.
4. Add positional embeddings.
5. Feed the sequence as input to a standard transformer encoder.
6. Pre-train the model with image labels, fully supervised on a huge dataset.
7. Fine-tune on the downstream dataset for image classification.
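A compact end-to-end sketch of these steps (excluding the pre-training and fine-tuning stages) might look like the following. It relies on PyTorch's built-in transformer encoder and illustrative ViT-Base hyperparameters, so it should be read as a schematic rather than the reference implementation.

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """Minimal ViT for classification: patchify, add class token and positions,
    run a stack of transformer encoder blocks, classify from the class token."""
    def __init__(self, img_size=224, patch_size=16, embed_dim=768,
                 depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            embed_dim, num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                               # x: (B, 3, 224, 224)
        x = self.patchify(x).flatten(2).transpose(1, 2) # (B, 196, embed_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed # prepend class token, add positions
        x = self.encoder(x)                             # transformer encoder stack
        return self.head(x[:, 0])                       # classify from the class token
```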
While the ViT full-transformer architecture is a promising option for vision processing tasks, the
performance of ViTs is still inferior to that of similar-sized CNN alternatives (such as ResNet)
when trained from scratch on a mid-sized dataset such as ImageNet.
The performance of a vision transformer model depends on decisions such as the choice of optimizer, network depth, and dataset-specific hyperparameters. Compared to ViT, CNNs are easier to optimize.
One remedy for the shortcomings of a pure transformer is to marry the transformer to a CNN front end. The usual ViT stem uses a 16x16 convolution with a stride of 16. In comparison, a stem built from 3x3 convolutions with stride 2 increases training stability and improves precision.
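To illustrate the difference between the two stems, here is a sketch in PyTorch. The channel widths of the 3x3 stem are an assumption made for the example (both stems downsample the input by 16x), not a prescribed configuration.

```python
import torch.nn as nn

embed_dim = 768

# Standard ViT "patchify" stem: a single 16x16 convolution with stride 16.
patchify_stem = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

# Alternative convolutional stem: a stack of 3x3 convolutions with stride 2,
# reaching the same 16x downsampling but reported to train more stably.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(512, embed_dim, kernel_size=1),  # final 1x1 projection to the embedding dimension
)
```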
The CNN first turns the raw pixels into a feature map. A tokenizer then translates the feature map into a sequence of tokens that are fed into the transformer, which applies the attention mechanism to produce a sequence of output tokens. Finally, a projector reconnects the output tokens to the feature map, which lets the model retain potentially crucial pixel-level details. This lowers the number of tokens that need to be processed, lowering costs significantly.
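A simplified sketch of such a tokenizer, loosely following the filter-based tokenizer idea, is shown below. The module name, the number of tokens, and the 1x1-convolution attention maps are illustrative assumptions, and the projector that maps tokens back to the feature map is omitted.

```python
import torch
import torch.nn as nn

class FilterTokenizer(nn.Module):
    """Convert a CNN feature map into a short sequence of visual tokens by
    computing one spatial attention map per token and pooling features with it."""
    def __init__(self, channels, num_tokens=16):
        super().__init__()
        self.token_attn = nn.Conv2d(channels, num_tokens, kernel_size=1)

    def forward(self, feat):                     # feat: (B, C, H, W)
        attn = self.token_attn(feat).flatten(2)  # (B, L, H*W): one attention map per token
        attn = torch.softmax(attn, dim=-1)       # normalize over spatial positions
        feat = feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        return attn @ feat                       # (B, L, C) visual tokens for the transformer
```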
In particular, if the ViT model is trained on huge datasets of over 14M images, it can outperform CNNs. If not, the best option is to stick to ResNet (https://viso.ai/deep-learning/resnet-residual-neural-network/) or EfficientNet. The vision transformer model is first pre-trained on a huge dataset and only then fine-tuned. The only change for fine-tuning is to discard the pre-training MLP head and attach a newly initialized D x K feed-forward layer, where D is the embedding dimension and K is the number of classes of the small dataset.
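A minimal sketch of this head replacement, assuming a hypothetical pre-trained model object `vit` that exposes its classification head as a `head` attribute (the attribute name is an assumption for illustration):

```python
import torch.nn as nn

def prepare_for_finetuning(vit, num_classes):
    """Replace the pre-training head with a fresh D x K linear layer,
    where K is the number of classes in the downstream dataset."""
    embed_dim = vit.head.in_features              # D, the token embedding dimension
    vit.head = nn.Linear(embed_dim, num_classes)  # new K-way classification head
    nn.init.zeros_(vit.head.weight)               # zero-initialized, as commonly done for ViT fine-tuning
    nn.init.zeros_(vit.head.bias)
    return vit
```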
Video processing tasks such as video forecasting and activity recognition rely on ViT as well. Moreover, image enhancement, colorization, and image super-resolution also use ViT models. Last but not least, ViTs have numerous applications in 3D analysis, such as segmentation and point cloud classification.
Conclusion
The vision transformer model uses multi-head self-attention in Computer Vision without requiring image-specific biases. The model splits an image into a series of positionally embedded patches, which are processed by the transformer encoder in order to capture the local and global features of the image. Last but not least, ViT achieves a higher accuracy on large datasets with reduced training time.
What’s next
Read more about related topics and other state-of-the-art methods in machine learning, image
processing, and recognition.
Fall Detection with Vision and Deep Learning (https://viso.ai/applications/fall-detection-vision-deep-learning-application/)
Optical Character Recognition (OCR) (https://viso.ai/computer-vision/optical-character-recognition-ocr/)