Lecture 13

These slides cover computer vision techniques for embedded systems, including convolutional neural networks (CNNs) and vision transformers. They describe large image datasets such as JFT, the effects of training-data volume and model capacity, the limitations of CNNs, and how transformers can be applied to image recognition by treating images as sequences of patches.


Computer Vision for

Embedded Systems

Yung-Hsiang Lu
Purdue University
[email protected]



Revisiting Unreasonable Effectiveness of Data in
Deep Learning Era
Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta, ICCV 2017

JFT dataset: 300M images, 18,291 categories



JFT Dataset
● 300M images
● 375M labels
● 18,291 categories
○ 1,165 types of animals
○ 5,720 types of vehicles
○ maximum depth of hierarchy is 12
○ maximum number of children is 2,876
● heavy-tailed distribution: 3K categories have fewer than 100 images each
● image sizes: 340 x 340, cropped to 299 x 299, pixel values normalized to [-1, 1] (see the sketch below)
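A minimal sketch of that preprocessing step in Python, assuming the image is already decoded into a NumPy uint8 array (the function name and the center-crop choice are illustrative; the paper's exact cropping policy may differ):

import numpy as np

def preprocess(image, crop=299):
    # Center-crop a 340 x 340 uint8 image to 299 x 299.
    h, w, _ = image.shape
    top, left = (h - crop) // 2, (w - crop) // 2
    patch = image[top:top + crop, left:left + crop].astype(np.float32)
    # Map pixel values from [0, 255] onto [-1, 1].
    return patch / 127.5 - 1.0

x = preprocess(np.zeros((340, 340, 3), dtype=np.uint8))
print(x.shape)   # (299, 299, 3)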



Effects of Training Examples
[figures: object detection results on COCO and PASCAL VOC; performance grows roughly logarithmically with the number of training examples]



Effect of Model Capacity
[figure: COCO results for ResNet backbones of different capacities; higher-capacity models benefit more from the larger training set]



Limitations of Convolutional Neural Networks
● Convolution considers neighboring pixels, but only at fixed distances
● The same parameters are applied to all pixels, even though objects may appear at different sizes
● Hyperparameters (stride, filter size, number of layers, ...) are fixed in advance (though they may be chosen by neural architecture search); the sketch after this list illustrates the fixed receptive field
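To see why the "fixed distances" limitation matters: the receptive field of a CNN is determined entirely by its architecture and cannot adapt to the object in the image. A minimal sketch (the layer configuration is a made-up example, not from the slides):

def receptive_field(layers):
    # layers: list of (kernel_size, stride) pairs, from input to output.
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k - 1) * jump
        jump *= s              # strides compound the sampling step
    return rf

# Three 3x3 stride-1 convolutions always see a 7x7 neighborhood,
# whether the object spans 10 pixels or 300.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # -> 7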



An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn,
Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg
Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, 2020



Create Image Patches

An input image of height H and width W with C channels is split into square patches of size P x P; each patch is flattened into a vector of length P^2 * C, giving a sequence of N = HW / P^2 patches.
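A minimal NumPy sketch of this patching step (the function name is illustrative):

import numpy as np

def image_to_patches(image, P):
    # Reshape an (H, W, C) image into (N, P*P*C) flattened patches, N = HW / P^2.
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0, "H and W must be multiples of P"
    x = image.reshape(H // P, P, W // P, P, C)   # split both spatial axes
    x = x.transpose(0, 2, 1, 3, 4)               # bring the patch grid to the front
    return x.reshape(-1, P * P * C)              # (N, P^2 * C)

patches = image_to_patches(np.zeros((224, 224, 3)), P=16)
print(patches.shape)   # (196, 768): a 14 x 14 grid of 16x16x3 patches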



Options of Position Embedding
● no position information (a bag of patches)
● 1D position embedding (a sequence of patches; sketched below)
● 2D position embedding
● relative position embedding
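A minimal sketch of the 1D option, using NumPy arrays as stand-ins for trainable parameters (shapes follow ViT-Base/16 at 224 x 224 input; all names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 768   # patches per image, embedding dimension

# One learnable vector per sequence position: the [class] token plus N patches.
pos_embedding = rng.normal(scale=0.02, size=(N + 1, D))

patch_embeddings = np.zeros((N + 1, D))     # stand-in for the projected patches
tokens = patch_embeddings + pos_embedding   # position information is simply added
print(tokens.shape)                         # (197, 768)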



Datasets
● ImageNet 1K: 1K classes, 1.3M images
● ImageNet 21K: 21K classes, 14M images
● JFT-300M: 18K classes, 300M images
● CIFAR-10 and CIFAR-100
● Oxford-IIIT Pets
● Oxford Flowers-102



Model Variants
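For reference, the ViT paper (Table 1) defines three variants:

ViT-Base: 12 layers, hidden size 768, MLP size 3072, 12 heads, 86M parameters
ViT-Large: 24 layers, hidden size 1024, MLP size 4096, 16 heads, 307M parameters
ViT-Huge: 32 layers, hidden size 1280, MLP size 5120, 16 heads, 632M parameters

A name such as ViT-L/16 denotes the Large variant operating on 16 x 16 patches.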



Comparison



How to train your ViT? Data, Augmentation,
and Regularization in Vision Transformers
Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob
Uszkoreit, Lucas Beyer, 2021



ViViT: A Video Vision Transformer
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, ICCV 2021



The Kinetics Human Action Video Dataset, 2017



Kinetics Dataset
● 400 human action classes
● at least 400 clips per action
● each clip lasts about 10 seconds and comes from a YouTube video
● single-person activities: drawing, laughing, drinking
● person-person activities: shaking hands, hugging
● person-object activities: washing dishes, mowing lawn



Crowdsourcing to Label Data



[figures: optical flow illustrations; source: https://nanonets.com/blog/optical-flow/]

Lucas–Kanade method
The method assumes that the optical flow is essentially constant within a small neighborhood of each pixel.

https://en.wikipedia.org/wiki/Lucas%E2%80%93Kanade_method
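A minimal sketch of sparse optical flow using OpenCV's pyramidal Lucas–Kanade implementation (the frame file names and parameter values are illustrative):

import cv2

prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)   # two consecutive frames
curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Corners make good points to track: both image gradients are strong there.
pts = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=7)

# Each window is assumed to move with a single (u, v) displacement, solved by
# least squares; the image pyramid handles motions larger than the window.
next_pts, status, err = cv2.calcOpticalFlowPyrLK(
    prev, curr, pts, None, winSize=(21, 21), maxLevel=3)

for p, q, ok in zip(pts.reshape(-1, 2), next_pts.reshape(-1, 2), status.ravel()):
    if ok:
        print("flow:", q - p)   # per-point displacement between the two frames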
[figures: action recognition results on UCF-101 and HMDB-51]
ViViT: A Video Vision Transformer
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, ICCV 2021



