SlideShare a Scribd company logo
Multimodal Residual Learning
for Visual QA
NamHyuk Ahn
Table of Contents
1. Visual QA
2. Stacked Attention Network (SAN)
3. Residual Learning
4. Multimodal Residual Network (MRN)
Visual QA
Evaluation Metric
- Robust to variabilityinter-
human
- Human accuracy is almost 90
- 248,349 Training questions
(82,783 Images)
- 121,512 Validation questions
(40,504 Images)
- 244,302 Testing questions
(81,434 Images)
Stacked Attention Network
Motivation
- Answering question requires
multi-step reasoning
- With {bicycles, window, street,
baskets, dogs} objects
- To answer good question,
pinpoint relevant region.
Q: what are sitting in the basket
on a bicycle
Stacked Attention Network (SAN)
- SAN allows multi-step reasoning for visual QA
- Extension of Attention mechanism which
successfully applied in captioning, translation etc.
Q: what are sitting in the basket on a bicycle
Stacked Attention Network
- Image Model
• Extract image feature using
CNN
- Question Model
• Extract semantic vector
using CNN or LSTM
- Stacked Attention
• Multi-step reasoning
with attention layer
Stacked Attention
Multi-step reasoning
using attention layer
Image / Question Model
- Image Model
• Get feature map from
raw pixel Image
• Rescale image to 448x448,
take feature from pool5 of
VGGNet (14x14x512)
• Additional layer to fit to
question feature
- Question Model
•
Stacked Attention Model
- Global image feature leads to
suboptimal due to noise from
irrelevant object / region.
- Instead use SAM to pinpoint
relevant region
- Given image feature matrix
and question vector ,
14x14 attention distribution
- Get weighted sum of image
vectors from each region.
-
refined query vector
Result
Multimodal Residual Learning for Visual QA
Residual Learning
Problem of degradation
- More depth, more accurate but deep network can
vanish/explode gradient
• BN, Xavier Init, Dropout can handle (~30 layer)
- More deeper, degradation problem occur
• Not only overfit, but also increase training error
Residual Network (ResNet)
Residual Block
- To avoid degradation
problem, add shortcut
connection.
- Element-wise addition with
F(x) and shortcut connection,
and pass through ReLU.
- Similar to LSTM
http://torch.ch/blog/2016/02/04/resnets.html
Shortcut connection
Multimodal Residual
Network
Introduction
- Extend deep residual learning for visual QA
- Achieving the state-of-the-art results on visual QA
dataset (not today :(.
- Introducing a method to visualize spatial attention
effect of joint residual mappings
Background
SAN
- But question info contribute
weakly, it cause bottleneck
Baseline [Lu et al.]
- With just elem-wise multiple,
visual and question feature
embed very well.
MRN
- Shortcut mapping and
stacking architecture
- No weighted-sum
- Instead use global
multiplication [Lu et al.] does.
Multimodal Residual Learning for Visual QA
Multimodal Residual Learning for Visual QA
Quantitative Analysis
- (a) shows large improvement
over SAN, (b) is better.
- (c) add extra embedding in
question cause overfitting.
- (d) identity shortcut cause
degradation (extra linear
mapping is needed).
- (e) performs reasonable, but
extra shortcut is not essential.
Quantitative Analysis
# of Learning blocks
- 58.85% (L=1), 59.44% (L=2),
60.53% (L=3), 60.42% (L=4)
Visual Features
- ResNet-152 is significantly
better than VGGNet
- Even though ResNet has less
feature dim (2048 vs 4096).
# of Answer Class
- Trade-off relation among
answer type, but 2k is best
- Implicit attention with multiplication
- Get high-resolution attention map
Multimodal Residual Learning for Visual QA
Reference
- Yang, Zichao, et al. "Stacked attention networks for image question
answering." arXiv preprint arXiv:1511.02274 (2015).
- Kim, Jin-Hwa, et al. "Multimodal Residual Learning for Visual QA." arXiv
preprint arXiv:1606.01455 (2016).
- Antol, Stanislaw, et al. "Vqa: Visual question answering." Proceedings of
the IEEE International Conference on Computer Vision. 2015.
Ad

More Related Content

What's hot (20)

Convolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular ArchitecturesConvolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular Architectures
ananth
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep Learning
Mohamed Loey
 
Review-image-segmentation-by-deep-learning
Review-image-segmentation-by-deep-learningReview-image-segmentation-by-deep-learning
Review-image-segmentation-by-deep-learning
Trong-An Bui
 
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation..."Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
Edge AI and Vision Alliance
 
Cnn
CnnCnn
Cnn
Nirthika Rajendran
 
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
changedaeoh
 
Computer Vision for Beginners
Computer Vision for BeginnersComputer Vision for Beginners
Computer Vision for Beginners
Sanghamitra Deb
 
Case Study of Convolutional Neural Network
Case Study of Convolutional Neural NetworkCase Study of Convolutional Neural Network
Case Study of Convolutional Neural Network
NamHyuk Ahn
 
PR-351: Adaptive Aggregation Networks for Class-Incremental Learning
PR-351: Adaptive Aggregation Networks for Class-Incremental LearningPR-351: Adaptive Aggregation Networks for Class-Incremental Learning
PR-351: Adaptive Aggregation Networks for Class-Incremental Learning
Sunghoon Joo
 
Convolutional Neural Network and Its Applications
Convolutional Neural Network and Its ApplicationsConvolutional Neural Network and Its Applications
Convolutional Neural Network and Its Applications
Kasun Chinthaka Piyarathna
 
Convolutional neural networks
Convolutional neural networks Convolutional neural networks
Convolutional neural networks
Roozbeh Sanaei
 
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersEmerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision Transformers
Sungchul Kim
 
Cnn
CnnCnn
Cnn
Mehrnaz Faraz
 
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
Jinwon Lee
 
Modern Convolutional Neural Network techniques for image segmentation
Modern Convolutional Neural Network techniques for image segmentationModern Convolutional Neural Network techniques for image segmentation
Modern Convolutional Neural Network techniques for image segmentation
Gioele Ciaparrone
 
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Universitat Politècnica de Catalunya
 
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
Jinwon Lee
 
Deep Learning - CNN and RNN
Deep Learning - CNN and RNNDeep Learning - CNN and RNN
Deep Learning - CNN and RNN
Ashray Bhandare
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
Jinwon Lee
 
Understanding Convolutional Neural Networks
Understanding Convolutional Neural NetworksUnderstanding Convolutional Neural Networks
Understanding Convolutional Neural Networks
Jeremy Nixon
 
Convolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular ArchitecturesConvolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular Architectures
ananth
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep Learning
Mohamed Loey
 
Review-image-segmentation-by-deep-learning
Review-image-segmentation-by-deep-learningReview-image-segmentation-by-deep-learning
Review-image-segmentation-by-deep-learning
Trong-An Bui
 
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation..."Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
Edge AI and Vision Alliance
 
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
changedaeoh
 
Computer Vision for Beginners
Computer Vision for BeginnersComputer Vision for Beginners
Computer Vision for Beginners
Sanghamitra Deb
 
Case Study of Convolutional Neural Network
Case Study of Convolutional Neural NetworkCase Study of Convolutional Neural Network
Case Study of Convolutional Neural Network
NamHyuk Ahn
 
PR-351: Adaptive Aggregation Networks for Class-Incremental Learning
PR-351: Adaptive Aggregation Networks for Class-Incremental LearningPR-351: Adaptive Aggregation Networks for Class-Incremental Learning
PR-351: Adaptive Aggregation Networks for Class-Incremental Learning
Sunghoon Joo
 
Convolutional Neural Network and Its Applications
Convolutional Neural Network and Its ApplicationsConvolutional Neural Network and Its Applications
Convolutional Neural Network and Its Applications
Kasun Chinthaka Piyarathna
 
Convolutional neural networks
Convolutional neural networks Convolutional neural networks
Convolutional neural networks
Roozbeh Sanaei
 
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision TransformersEmerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision Transformers
Sungchul Kim
 
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
Jinwon Lee
 
Modern Convolutional Neural Network techniques for image segmentation
Modern Convolutional Neural Network techniques for image segmentationModern Convolutional Neural Network techniques for image segmentation
Modern Convolutional Neural Network techniques for image segmentation
Gioele Ciaparrone
 
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Universitat Politècnica de Catalunya
 
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
Jinwon Lee
 
Deep Learning - CNN and RNN
Deep Learning - CNN and RNNDeep Learning - CNN and RNN
Deep Learning - CNN and RNN
Ashray Bhandare
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
Jinwon Lee
 
Understanding Convolutional Neural Networks
Understanding Convolutional Neural NetworksUnderstanding Convolutional Neural Networks
Understanding Convolutional Neural Networks
Jeremy Nixon
 

Viewers also liked (19)

807 103康八上 my comic book
807 103康八上 my comic book807 103康八上 my comic book
807 103康八上 my comic book
Ally Lin
 
Postavte zeď mezi svoje vývojáře
Postavte zeď mezi svoje vývojářePostavte zeď mezi svoje vývojáře
Postavte zeď mezi svoje vývojáře
Ladislav Prskavec
 
Giveandget.com
Giveandget.comGiveandget.com
Giveandget.com
Hegedűs Zsolt
 
Reclaiming the idea of the University
Reclaiming the idea of the UniversityReclaiming the idea of the University
Reclaiming the idea of the University
Richard Hall
 
Ingles
InglesIngles
Ingles
Isaias Yañez
 
Gamification review 1
Gamification review 1Gamification review 1
Gamification review 1
Daria Axelrod Marmer
 
Frede space up paris 2013
Frede space up paris 2013Frede space up paris 2013
Frede space up paris 2013
Jędrzej Górski
 
6º básico a semana 09 al 13 de mayo (1)
6º básico a semana 09  al 13 de  mayo (1)6º básico a semana 09  al 13 de  mayo (1)
6º básico a semana 09 al 13 de mayo (1)
Colegio Camilo Henríquez
 
Recruit, Retain, Realize - How Third Party Transactional Data Can Power Your ...
Recruit, Retain, Realize - How Third Party Transactional Data Can Power Your ...Recruit, Retain, Realize - How Third Party Transactional Data Can Power Your ...
Recruit, Retain, Realize - How Third Party Transactional Data Can Power Your ...
Doug Oldfield
 
User experience eBay
User experience eBayUser experience eBay
User experience eBay
MariaSerrano655
 
Marquette Social Listening presentation
Marquette Social Listening presentationMarquette Social Listening presentation
Marquette Social Listening presentation
7Summits
 
อุปกรณ์เครือข่ายงคอมพิวเตอร์
อุปกรณ์เครือข่ายงคอมพิวเตอร์อุปกรณ์เครือข่ายงคอมพิวเตอร์
อุปกรณ์เครือข่ายงคอมพิวเตอร์
ooh Pongtorn
 
Scala play-framework
Scala play-frameworkScala play-framework
Scala play-framework
Abdhesh Kumar
 
4º básico a semana 03 de junio al 10 de junio
4º básico a  semana  03 de junio al 10 de junio4º básico a  semana  03 de junio al 10 de junio
4º básico a semana 03 de junio al 10 de junio
Colegio Camilo Henríquez
 
Introduzione a Netwrix Auditor 8.5
Introduzione a Netwrix Auditor 8.5Introduzione a Netwrix Auditor 8.5
Introduzione a Netwrix Auditor 8.5
Maurizio Taglioretti
 
iProductive Environment Platform
iProductive Environment PlatformiProductive Environment Platform
iProductive Environment Platform
Productive Environment Institute
 
Bateria e contrabaixo na música popular brasileira
Bateria e contrabaixo na música popular brasileiraBateria e contrabaixo na música popular brasileira
Bateria e contrabaixo na música popular brasileira
manda555
 
9. konsolidasi database_di_pusat
9. konsolidasi database_di_pusat9. konsolidasi database_di_pusat
9. konsolidasi database_di_pusat
Rosyid Musthofa
 
The blended learning research: What we now know about high quality faculty de...
The blended learning research: What we now know about high quality faculty de...The blended learning research: What we now know about high quality faculty de...
The blended learning research: What we now know about high quality faculty de...
EDUCAUSE
 
807 103康八上 my comic book
807 103康八上 my comic book807 103康八上 my comic book
807 103康八上 my comic book
Ally Lin
 
Postavte zeď mezi svoje vývojáře
Postavte zeď mezi svoje vývojářePostavte zeď mezi svoje vývojáře
Postavte zeď mezi svoje vývojáře
Ladislav Prskavec
 
Reclaiming the idea of the University
Reclaiming the idea of the UniversityReclaiming the idea of the University
Reclaiming the idea of the University
Richard Hall
 
Recruit, Retain, Realize - How Third Party Transactional Data Can Power Your ...
Recruit, Retain, Realize - How Third Party Transactional Data Can Power Your ...Recruit, Retain, Realize - How Third Party Transactional Data Can Power Your ...
Recruit, Retain, Realize - How Third Party Transactional Data Can Power Your ...
Doug Oldfield
 
Marquette Social Listening presentation
Marquette Social Listening presentationMarquette Social Listening presentation
Marquette Social Listening presentation
7Summits
 
อุปกรณ์เครือข่ายงคอมพิวเตอร์
อุปกรณ์เครือข่ายงคอมพิวเตอร์อุปกรณ์เครือข่ายงคอมพิวเตอร์
อุปกรณ์เครือข่ายงคอมพิวเตอร์
ooh Pongtorn
 
Scala play-framework
Scala play-frameworkScala play-framework
Scala play-framework
Abdhesh Kumar
 
4º básico a semana 03 de junio al 10 de junio
4º básico a  semana  03 de junio al 10 de junio4º básico a  semana  03 de junio al 10 de junio
4º básico a semana 03 de junio al 10 de junio
Colegio Camilo Henríquez
 
Introduzione a Netwrix Auditor 8.5
Introduzione a Netwrix Auditor 8.5Introduzione a Netwrix Auditor 8.5
Introduzione a Netwrix Auditor 8.5
Maurizio Taglioretti
 
Bateria e contrabaixo na música popular brasileira
Bateria e contrabaixo na música popular brasileiraBateria e contrabaixo na música popular brasileira
Bateria e contrabaixo na música popular brasileira
manda555
 
9. konsolidasi database_di_pusat
9. konsolidasi database_di_pusat9. konsolidasi database_di_pusat
9. konsolidasi database_di_pusat
Rosyid Musthofa
 
The blended learning research: What we now know about high quality faculty de...
The blended learning research: What we now know about high quality faculty de...The blended learning research: What we now know about high quality faculty de...
The blended learning research: What we now know about high quality faculty de...
EDUCAUSE
 
Ad

Similar to Multimodal Residual Learning for Visual QA (20)

ResNeSt: Split-Attention Networks
ResNeSt: Split-Attention NetworksResNeSt: Split-Attention Networks
ResNeSt: Split-Attention Networks
Seunghyun Hwang
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
Mohit Garg
 
IRJET-Image Question Answering: A Review
IRJET-Image Question Answering: A ReviewIRJET-Image Question Answering: A Review
IRJET-Image Question Answering: A Review
IRJET Journal
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearning
Hoa Le
 
Enabling Real-Time Adaptivity in MOOCs with a Personalized Next-Step Recommen...
Enabling Real-Time Adaptivity in MOOCs with a Personalized Next-Step Recommen...Enabling Real-Time Adaptivity in MOOCs with a Personalized Next-Step Recommen...
Enabling Real-Time Adaptivity in MOOCs with a Personalized Next-Step Recommen...
Daniel Davis
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
Databricks
 
IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...
IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...
IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...
IEEEFINALYEARSTUDENTPROJECTS
 
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
IEEEFINALYEARSTUDENTPROJECT
 
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
IEEEMEMTECHSTUDENTSPROJECTS
 
Applying Deep Learning with Weak and Noisy labels
Applying Deep Learning with Weak and Noisy labelsApplying Deep Learning with Weak and Noisy labels
Applying Deep Learning with Weak and Noisy labels
Darian Frajberg
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用
CHENHuiMei
 
Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F...
Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F...Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F...
Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F...
Simone Ercoli
 
Face Recognition: From Scratch To Hatch
Face Recognition: From Scratch To HatchFace Recognition: From Scratch To Hatch
Face Recognition: From Scratch To Hatch
Eduard Tyantov
 
Face Recognition: From Scratch To Hatch / Эдуард Тянтов (Mail.ru Group)
Face Recognition: From Scratch To Hatch / Эдуард Тянтов (Mail.ru Group)Face Recognition: From Scratch To Hatch / Эдуард Тянтов (Mail.ru Group)
Face Recognition: From Scratch To Hatch / Эдуард Тянтов (Mail.ru Group)
Ontico
 
IRJET - Multi-Label Road Scene Prediction for Autonomous Vehicles using Deep ...
IRJET - Multi-Label Road Scene Prediction for Autonomous Vehicles using Deep ...IRJET - Multi-Label Road Scene Prediction for Autonomous Vehicles using Deep ...
IRJET - Multi-Label Road Scene Prediction for Autonomous Vehicles using Deep ...
IRJET Journal
 
Surveillance scene classification using machine learning
Surveillance scene classification using machine learningSurveillance scene classification using machine learning
Surveillance scene classification using machine learning
Utkarsh Contractor
 
IRJET - Gender Recognition from Facial Images
IRJET - Gender Recognition from Facial ImagesIRJET - Gender Recognition from Facial Images
IRJET - Gender Recognition from Facial Images
IRJET Journal
 
Mining weakly labeled web facial images for search based face annotation
Mining weakly labeled web facial images for search based face annotation Mining weakly labeled web facial images for search based face annotation
Mining weakly labeled web facial images for search based face annotation
Adz91 Digital Ads Pvt Ltd
 
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
Ioan Toma
 
Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenches
Vinay Shukla
 
ResNeSt: Split-Attention Networks
ResNeSt: Split-Attention NetworksResNeSt: Split-Attention Networks
ResNeSt: Split-Attention Networks
Seunghyun Hwang
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
Mohit Garg
 
IRJET-Image Question Answering: A Review
IRJET-Image Question Answering: A ReviewIRJET-Image Question Answering: A Review
IRJET-Image Question Answering: A Review
IRJET Journal
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearning
Hoa Le
 
Enabling Real-Time Adaptivity in MOOCs with a Personalized Next-Step Recommen...
Enabling Real-Time Adaptivity in MOOCs with a Personalized Next-Step Recommen...Enabling Real-Time Adaptivity in MOOCs with a Personalized Next-Step Recommen...
Enabling Real-Time Adaptivity in MOOCs with a Personalized Next-Step Recommen...
Daniel Davis
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
Databricks
 
IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...
IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...
IEEE 2014 JAVA DATA MINING PROJECTS Mining weakly labeled web facial images f...
IEEEFINALYEARSTUDENTPROJECTS
 
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
IEEEFINALYEARSTUDENTPROJECT
 
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
2014 IEEE JAVA DATA MINING PROJECT Mining weakly labeled web facial images fo...
IEEEMEMTECHSTUDENTSPROJECTS
 
Applying Deep Learning with Weak and Noisy labels
Applying Deep Learning with Weak and Noisy labelsApplying Deep Learning with Weak and Noisy labels
Applying Deep Learning with Weak and Noisy labels
Darian Frajberg
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用
CHENHuiMei
 
Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F...
Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F...Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F...
Vision and Multimedia Reading Group: DeCAF: a Deep Convolutional Activation F...
Simone Ercoli
 
Face Recognition: From Scratch To Hatch
Face Recognition: From Scratch To HatchFace Recognition: From Scratch To Hatch
Face Recognition: From Scratch To Hatch
Eduard Tyantov
 
Face Recognition: From Scratch To Hatch / Эдуард Тянтов (Mail.ru Group)
Face Recognition: From Scratch To Hatch / Эдуард Тянтов (Mail.ru Group)Face Recognition: From Scratch To Hatch / Эдуард Тянтов (Mail.ru Group)
Face Recognition: From Scratch To Hatch / Эдуард Тянтов (Mail.ru Group)
Ontico
 
IRJET - Multi-Label Road Scene Prediction for Autonomous Vehicles using Deep ...
IRJET - Multi-Label Road Scene Prediction for Autonomous Vehicles using Deep ...IRJET - Multi-Label Road Scene Prediction for Autonomous Vehicles using Deep ...
IRJET - Multi-Label Road Scene Prediction for Autonomous Vehicles using Deep ...
IRJET Journal
 
Surveillance scene classification using machine learning
Surveillance scene classification using machine learningSurveillance scene classification using machine learning
Surveillance scene classification using machine learning
Utkarsh Contractor
 
IRJET - Gender Recognition from Facial Images
IRJET - Gender Recognition from Facial ImagesIRJET - Gender Recognition from Facial Images
IRJET - Gender Recognition from Facial Images
IRJET Journal
 
Mining weakly labeled web facial images for search based face annotation
Mining weakly labeled web facial images for search based face annotation Mining weakly labeled web facial images for search based face annotation
Mining weakly labeled web facial images for search based face annotation
Adz91 Digital Ads Pvt Ltd
 
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015
Ioan Toma
 
Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenches
Vinay Shukla
 
Ad

Recently uploaded (20)

Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)
rccbatchplant
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Data Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptxData Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptx
RushaliDeshmukh2
 
theory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptxtheory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptx
sanchezvanessa7896
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
Introduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptxIntroduction to Zoomlion Earthmoving.pptx
Introduction to Zoomlion Earthmoving.pptx
AS1920
 
QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)
rccbatchplant
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
new ppt artificial intelligence historyyy
new ppt artificial intelligence historyyynew ppt artificial intelligence historyyy
new ppt artificial intelligence historyyy
PianoPianist
 
Metal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistryMetal alkyne complexes.pptx in chemistry
Metal alkyne complexes.pptx in chemistry
mee23nu
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Data Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptxData Structures_Introduction to algorithms.pptx
Data Structures_Introduction to algorithms.pptx
RushaliDeshmukh2
 
theory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptxtheory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptx
sanchezvanessa7896
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 

Multimodal Residual Learning for Visual QA

  • 1. Multimodal Residual Learning for Visual QA NamHyuk Ahn
  • 2. Table of Contents 1. Visual QA 2. Stacked Attention Network (SAN) 3. Residual Learning 4. Multimodal Residual Network (MRN)
  • 3. Visual QA Evaluation Metric - Robust to variabilityinter- human - Human accuracy is almost 90 - 248,349 Training questions (82,783 Images) - 121,512 Validation questions (40,504 Images) - 244,302 Testing questions (81,434 Images)
  • 5. Motivation - Answering question requires multi-step reasoning - With {bicycles, window, street, baskets, dogs} objects - To answer good question, pinpoint relevant region. Q: what are sitting in the basket on a bicycle
  • 6. Stacked Attention Network (SAN) - SAN allows multi-step reasoning for visual QA - Extension of Attention mechanism which successfully applied in captioning, translation etc. Q: what are sitting in the basket on a bicycle
  • 7. Stacked Attention Network - Image Model • Extract image feature using CNN - Question Model • Extract semantic vector using CNN or LSTM - Stacked Attention • Multi-step reasoning with attention layer Stacked Attention Multi-step reasoning using attention layer
  • 8. Image / Question Model - Image Model • Get feature map from raw pixel Image • Rescale image to 448x448, take feature from pool5 of VGGNet (14x14x512) • Additional layer to fit to question feature - Question Model •
  • 9. Stacked Attention Model - Global image feature leads to suboptimal due to noise from irrelevant object / region. - Instead use SAM to pinpoint relevant region - Given image feature matrix and question vector , 14x14 attention distribution - Get weighted sum of image vectors from each region. - refined query vector
  • 13. Problem of degradation - More depth, more accurate but deep network can vanish/explode gradient • BN, Xavier Init, Dropout can handle (~30 layer) - More deeper, degradation problem occur • Not only overfit, but also increase training error
  • 14. Residual Network (ResNet) Residual Block - To avoid degradation problem, add shortcut connection. - Element-wise addition with F(x) and shortcut connection, and pass through ReLU. - Similar to LSTM http://torch.ch/blog/2016/02/04/resnets.html Shortcut connection
  • 16. Introduction - Extend deep residual learning for visual QA - Achieving the state-of-the-art results on visual QA dataset (not today :(. - Introducing a method to visualize spatial attention effect of joint residual mappings
  • 17. Background SAN - But question info contribute weakly, it cause bottleneck Baseline [Lu et al.] - With just elem-wise multiple, visual and question feature embed very well. MRN - Shortcut mapping and stacking architecture - No weighted-sum - Instead use global multiplication [Lu et al.] does.
  • 20. Quantitative Analysis - (a) shows large improvement over SAN, (b) is better. - (c) add extra embedding in question cause overfitting. - (d) identity shortcut cause degradation (extra linear mapping is needed). - (e) performs reasonable, but extra shortcut is not essential.
  • 21. Quantitative Analysis # of Learning blocks - 58.85% (L=1), 59.44% (L=2), 60.53% (L=3), 60.42% (L=4) Visual Features - ResNet-152 is significantly better than VGGNet - Even though ResNet has less feature dim (2048 vs 4096). # of Answer Class - Trade-off relation among answer type, but 2k is best
  • 22. - Implicit attention with multiplication - Get high-resolution attention map
  • 24. Reference - Yang, Zichao, et al. "Stacked attention networks for image question answering." arXiv preprint arXiv:1511.02274 (2015). - Kim, Jin-Hwa, et al. "Multimodal Residual Learning for Visual QA." arXiv preprint arXiv:1606.01455 (2016). - Antol, Stanislaw, et al. "Vqa: Visual question answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.