Multimodal Residual Learning for Visual QA

Nov 28, 2016

Multimodal Residual Network (MRN) extends residual learning to visual question answering to achieve state-of-the-art results. MRN introduces shortcut connections between question and image embeddings to avoid degradation from very deep networks. Evaluation shows MRN outperforms stacked attention networks and improves with increased depth up to 3 blocks. Implicit attention maps reveal spatial focus without weighted sums.

Multimodal Residual Learning for Visual QA
NamHyuk Ahn

Table of Contents
1. Visual QA
2. Stacked Attention Network (SAN)
3. Residual Learning
4. Multimodal Residual Network (MRN)

Visual QA
Evaluation Metric
- Robust to inter-human variability
- Human accuracy is almost 90%
Dataset
- 248,349 training questions (82,783 images)
- 121,512 validation questions (40,504 images)
- 244,302 testing questions (81,434 images)
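
The slide does not spell out the evaluation metric. A minimal sketch, assuming the commonly cited form from the VQA paper (Antol et al.), in which an answer gets full credit once at least three of the ten annotators gave it. The function name and the sample answers are hypothetical, and the official evaluation script also normalizes answer strings, which is omitted here.

```python
def vqa_accuracy(predicted, human_answers):
    """Soft accuracy: min(number of annotators who gave this answer / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)

# Ten hypothetical annotator answers for a single question.
answers = ["dogs"] * 6 + ["puppies"] * 2 + ["dog"] * 2

print(vqa_accuracy("dogs", answers))     # full credit: 6 annotators agree
print(vqa_accuracy("puppies", answers))  # partial credit: only 2 annotators agree
print(vqa_accuracy("cats", answers))     # no credit: nobody gave this answer
```

Because partial agreement still earns partial credit, a prediction is not penalized fully when annotators disagree among themselves, which is what makes the metric robust to inter-human variability.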

Motivation
- Answering a question requires multi-step reasoning
- The image contains the objects {bicycles, window, street, baskets, dogs}
- To answer the question well, the model must pinpoint the relevant region.
Q: what are sitting in the basket on a bicycle

Stacked Attention Network (SAN)
- SAN allows multi-step reasoning for visual QA
- Extension of the attention mechanism, which has been successfully applied to captioning, translation, etc.
Q: what are sitting in the basket on a bicycle

Stacked Attention Network
- Image Model
• Extract image features using a CNN
- Question Model
• Extract a semantic vector using a CNN or LSTM
- Stacked Attention
• Multi-step reasoning with attention layers

Image / Question Model
- Image Model
• Get a feature map from the raw-pixel image
• Rescale the image to 448x448 and take features from pool5 of VGGNet (14x14x512)
• An additional layer maps the image features to fit the question feature
- Question Model
•
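
The 14x14x512 shape on the image-model bullets follows from VGGNet's five 2x2 max-pooling layers, each of which halves the spatial size (the padded 3x3 convolutions leave it unchanged). A small sketch of that arithmetic; the function name is hypothetical.

```python
def pool5_grid_side(image_size, num_pools=5, stride=2):
    """Spatial side length after `num_pools` stride-2 pooling layers."""
    size = image_size
    for _ in range(num_pools):
        size //= stride
    return size

side = pool5_grid_side(448)
num_regions = side * side
print(side, num_regions)  # 14 196
```

Each of the 196 regions is described by a 512-dimensional vector, and these region vectors are what the attention layers later weight.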

Stacked Attention Model
- A global image feature leads to suboptimal results due to noise from irrelevant objects / regions.
- Instead, use a Stacked Attention Model (SAM) to pinpoint the relevant region
- Given the image feature matrix and the question vector, compute a 14x14 attention distribution
- Get the weighted sum of the image vectors from each region.
- Combine it with the question vector to form the refined query vector
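
The equations on this slide did not survive extraction. The steps above can be sketched as follows, assuming the formulation in Yang et al.: project regions and query into a shared space, apply tanh, score each region, softmax the scores, and add the attention-weighted image vector to the query. All names and the toy sizes (4 regions instead of 196) are hypothetical, biases are omitted, and the two layers share weights here only for brevity.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_layer(regions, query, W_i, W_q, w_p):
    """One attention step: returns (attention distribution, refined query)."""
    q_proj = matvec(W_q, query)
    scores = []
    for v in regions:
        h = [math.tanh(a + b) for a, b in zip(matvec(W_i, v), q_proj)]
        scores.append(sum(w * hj for w, hj in zip(w_p, h)))
    p = softmax(scores)  # one weight per image region
    attended = [sum(p[r] * regions[r][j] for r in range(len(regions)))
                for j in range(len(query))]
    refined = [a + q for a, q in zip(attended, query)]  # refined query vector
    return p, refined

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, k, n_regions = 3, 2, 4  # toy sizes, not the real 512-d, 196-region setup
regions = rand_matrix(rng, n_regions, d)
query = [rng.uniform(-1, 1) for _ in range(d)]
W_i, W_q = rand_matrix(rng, k, d), rand_matrix(rng, k, d)
w_p = [rng.uniform(-1, 1) for _ in range(k)]

u = query
for _ in range(2):  # two stacked attention layers
    p, u = attention_layer(regions, u, W_i, W_q, w_p)
print([round(x, 3) for x in p])
```

Feeding the refined query back into the next layer is what gives the multi-step reasoning: the second layer attends with a query that already carries image evidence from the first.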


Problem of degradation
- Deeper networks are more accurate, but gradients in a deep network can vanish or explode
• BN, Xavier initialization, and Dropout can handle this (up to ~30 layers)
- With even deeper networks, the degradation problem occurs
• This is not merely overfitting: the training error increases as well

Residual Network (ResNet)
Residual Block
- To avoid the degradation problem, add a shortcut connection.
- Add F(x) and the shortcut connection element-wise, then pass the sum through ReLU.
- Similar to LSTM
http://torch.ch/blog/2016/02/04/resnets.html
Shortcut connection
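
A minimal sketch of the residual block described on this slide, y = ReLU(F(x) + x), assuming F is two linear layers with a ReLU in between and no batch normalization or bias. The weights and input are hypothetical toy values. With all-zero weights, F(x) is zero and the block simply passes a non-negative input through, which illustrates why an identity mapping is easy for such a block to represent.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, xi) for xi in x]

def residual_block(x, W1, W2):
    fx = matvec(W2, relu(matvec(W1, x)))           # residual mapping F(x)
    return relu([f + xi for f, xi in zip(fx, x)])  # add shortcut, then ReLU

x = [1.0, 2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]
print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 0.5]
```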

Introduction
- Extends deep residual learning to visual QA
- Achieves state-of-the-art results on the visual QA dataset (not today :( ).
- Introduces a method to visualize the spatial attention effect of joint residual mappings

Background
SAN
- But the question information contributes only weakly, which causes a bottleneck
Baseline [Lu et al.]
- With just element-wise multiplication, the visual and question features embed very well.
MRN
- Shortcut mapping and stacking architecture
- No weighted sum
- Instead, uses global multiplication, as the baseline of [Lu et al.] does.
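
The MRN bullets above can be sketched as one learning block, assuming the formulation in Kim et al.: a joint residual F(q, v) = tanh(W_q q) ⊙ tanh(W_2 tanh(W_1 v)) added to a linearly mapped shortcut of the question vector. All names, sizes, and random weights below are hypothetical, and biases are omitted.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def tanh_vec(x):
    return [math.tanh(xi) for xi in x]

def mrn_block(q, v, W_q, W_1, W_2, W_short):
    """One learning block: linear shortcut on q plus a joint residual of q and v."""
    q_emb = tanh_vec(matvec(W_q, q))
    v_emb = tanh_vec(matvec(W_2, tanh_vec(matvec(W_1, v))))
    joint = [a * b for a, b in zip(q_emb, v_emb)]  # element-wise multiplication
    shortcut = matvec(W_short, q)                  # linear mapping, not identity
    return [s + j for s, j in zip(shortcut, joint)]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, m, num_blocks = 4, 6, 3  # toy sizes; L=3 blocks as in the best reported setting
q = [rng.uniform(-1, 1) for _ in range(d)]  # question vector
v = [rng.uniform(-1, 1) for _ in range(m)]  # global visual feature vector

h = q
for _ in range(num_blocks):
    h = mrn_block(h, v,
                  rand_matrix(rng, d, d),   # W_q
                  rand_matrix(rng, d, m),   # W_1
                  rand_matrix(rng, d, d),   # W_2
                  rand_matrix(rng, d, d))   # W_short
print([round(x, 3) for x in h])
```

Each block's output replaces the question input of the next block while the same visual vector is reused, so no attention-weighted sum over regions is ever computed explicitly.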

Quantitative Analysis
- (a) shows a large improvement over SAN; (b) is better still.
- (c) adds an extra embedding for the question, which causes overfitting.
- (d) uses an identity shortcut, which causes degradation (an extra linear mapping is needed).
- (e) performs reasonably, but the extra shortcut is not essential.

Quantitative Analysis
# of Learning Blocks
- 58.85% (L=1), 59.44% (L=2), 60.53% (L=3), 60.42% (L=4)
Visual Features
- ResNet-152 is significantly better than VGGNet
- Even though ResNet has fewer feature dimensions (2048 vs. 4096).
# of Answer Classes
- There is a trade-off among answer types, but 2k classes is best

- Implicit attention with multiplication
- Get high-resolution attention map

Reference
- Yang, Zichao, et al. "Stacked attention networks for image question answering." arXiv preprint arXiv:1511.02274 (2015).
- Kim, Jin-Hwa, et al. "Multimodal Residual Learning for Visual QA." arXiv preprint arXiv:1606.01455 (2016).
- Antol, Stanislaw, et al. "VQA: Visual question answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.





Face Recognition: From Scratch To HatchEduard Tyantov

Face Recognition: From Scratch To Hatch / Эдуард Тянтов (Mail.ru Group)Ontico

HighLoad++ 2017 Зал «Найроби+Касабланка», 7 ноября, 15:00 Тезисы: http://www.highload.ru/2017/abstracts/3044.html Мы разработали технологию по детекту и распознаванию лиц для продуктов компании Mail.ru, которая показывает высокие результаты на известных тестах. Технология на данный момент используется в Мобильном Облаке@Mail.ru для кластеризации фотографий по людям, а также во внутренних сервисах компании. ...

IRJET - Multi-Label Road Scene Prediction for Autonomous Vehicles using Deep ...IRJET Journal

This document proposes a multi-label neural network for road scene recognition in autonomous vehicles. It introduces a large-scale dataset called Driving Scene, which contains over 110,000 images across 52 road scene classes. The challenges of this dataset include multi-class prediction, data imbalance with varying image resolutions. The proposed neural network architecture incorporates both single-label and multi-label classification to address data imbalance. It utilizes a deep data integration strategy based on AdaBoost to focus training on minority classes and misclassified samples. Additionally, lane detection and lane departure warning systems are included to provide more context for autonomous driving. The network is trained and evaluated on the Driving Scene dataset to recognize road scenes.

Surveillance scene classification using machine learningUtkarsh Contractor

The problem of scene classification in surveillance footage is of great importance for ensuring security in public areas. With challenges such as low quality feeds, occlusion, viewpoint variations, background clutter etc. The task is both challenging and error-prone. Therefore it is important to keep the false positives low to maintain a high accuracy of detection. In this paper, we adapt high performing CNN architectures to identify abandoned luggage in a surveillance feed. We explore several CNN based approaches, from Transfer Learning on the Imagenet dataset to object classification using Faster R-CNNs on the COCO dataset. Using network visualization techniques, we gain insight into what the neural network sees and the basis of classification decision. The experiments have been conducted on real world datasets, and highlights the complexity in such classifications. Obtained results indicate that a combination of proposed techniques outperforms the individual approaches.

IRJET - Gender Recognition from Facial ImagesIRJET Journal

This document discusses gender recognition from facial images using a Wide Residual Network model. The model is trained on a dataset from Kaggle to predict gender from live video stream faces detected by a webcam. When a male face is detected, it draws a red border and sounds an alarm, as the purpose is for male-restricted areas video surveillance. It preprocesses detected faces before predicting gender with the WideResNet model, which reduces depth and increases width compared to standard residual networks for faster training. Experimental results found it achieved good performance for male restricted area video monitoring.

Mining weakly labeled web facial images for search based face annotation Adz91 Digital Ads Pvt Ltd

This document discusses a framework for search-based face annotation by mining weakly labeled facial images from the web. It proposes an unsupervised label refinement (ULR) approach to refine the noisy and incomplete labels of web images using machine learning. The learning problem is formulated as a convex optimization and efficient algorithms are developed to solve the large-scale task. Additionally, a clustering-based approximation algorithm is proposed to improve scalability. The proposed system achieves promising results in experiments by enhancing label quality compared to other approaches.

The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015Ioan Toma

The document discusses the LDBC Social Network Benchmark for evaluating database and graph processing systems. It describes the benchmark's social network data generator which produces realistic data following power law distributions and correlations. It also outlines the benchmark's three workloads: interactive, business intelligence, and graph analytics. The focus is on the interactive workload, which includes complex read queries, simple read queries, and concurrent updates. It aims to identify choke points and measure the acceleration factor a system can sustain for the query mix while meeting a maximum query latency. Parameter curation is used to select query parameters that produce stable performance. The parallel query driver respects dependencies between queries to evaluate a system's ability to handle the workload concurrently.

Apache con big data 2015 - Data Science from the trenchesVinay Shukla

ResNeSt: Split-Attention NetworksSeunghyun Hwang

Big learning 1.2Mohit Garg

IRJET-Image Question Answering: A ReviewIRJET Journal

B4UConference_machine learning_deeplearningHoa Le

Enabling Real-Time Adaptivity in MOOCs with a Personalized Next-Step Recommen...Daniel Davis

Multimodal Residual Learning for Visual QA

1. Multimodal Residual Learning for Visual QA NamHyuk Ahn

2. Table of Contents
1. Visual QA
2. Stacked Attention Network (SAN)
3. Residual Learning
4. Multimodal Residual Network (MRN)

3. Visual QA
Evaluation Metric
- Robust to inter-human variability
- Human accuracy is almost 90%
- 248,349 training questions (82,783 images)
- 121,512 validation questions (40,504 images)
- 244,302 testing questions (81,434 images)
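The slide does not spell out the metric itself. As a rough sketch, the accuracy defined for the VQA dataset (Antol et al.) counts a predicted answer as fully correct when at least 3 of the 10 human annotators gave it, which is what makes the score tolerant of disagreement between humans. The function name and the toy answers below are illustrative assumptions, and the official evaluation also normalizes answer strings, which is omitted here.

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: min(#humans who gave this answer / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# Hypothetical example: 10 human answers for one question.
humans = ["dogs"] * 7 + ["dog"] * 2 + ["puppies"]

print(vqa_accuracy("dogs", humans))     # agreed with 7 humans
print(vqa_accuracy("dog", humans))      # agreed with 2 humans
print(vqa_accuracy("puppies", humans))  # agreed with 1 human
```

A minority answer still earns partial credit, so a model is not penalized in full for picking a phrasing that only some annotators used.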

4. Stacked Attention Network

5. Motivation
- Answering a question requires multi-step reasoning
- The image contains the objects {bicycles, window, street, baskets, dogs}
- To answer the question well, pinpoint the relevant region
Q: What are sitting in the basket on a bicycle?

6. Stacked Attention Network (SAN)
- SAN allows multi-step reasoning for visual QA
- Extension of the attention mechanism, which has been successfully applied in captioning, translation, etc.
Q: What are sitting in the basket on a bicycle?

7. Stacked Attention Network
- Image Model
• Extract image features using a CNN
- Question Model
• Extract a semantic vector using a CNN or LSTM
- Stacked Attention
• Multi-step reasoning with attention layers

8. Image / Question Model
- Image Model
• Get a feature map from the raw pixel image
• Rescale the image to 448x448 and take features from pool5 of VGGNet (14x14x512)
• Additional layer to match the image feature to the question feature
- Question Model
•

9. Stacked Attention Model
- A global image feature leads to suboptimal results due to noise from irrelevant objects / regions.
- Instead, use the stacked attention model (SAM) to pinpoint the relevant region
- Given the image feature matrix and the question vector, compute a 14x14 attention distribution
- Get the weighted sum of the image vectors from each region.
- Combine the weighted sum with the question vector to form a refined query vector
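The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SAN authors' implementation: the names `attention_hop`, `W_i`, `W_q` and `w_p` are hypothetical, bias terms are left out, and the toy input uses 3 regions of dimension 2 instead of the 14x14 grid. Each hop scores every region against the current query, turns the scores into a distribution with softmax, takes the weighted sum of region vectors, and adds it to the query.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_hop(regions, query, W_i, W_q, w_p):
    """One attention layer: returns (attention distribution, refined query)."""
    q_proj = matvec(W_q, query)
    scores = []
    for v in regions:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W_i, v), q_proj)]
        scores.append(sum(w * h for w, h in zip(w_p, hidden)))
    p = softmax(scores)
    # Weighted sum of the region vectors under the attention distribution.
    attended = [sum(p_i * v[j] for p_i, v in zip(p, regions))
                for j in range(len(query))]
    # Refined query: attended image vector plus the current query.
    refined = [a + q for a, q in zip(attended, query)]
    return p, refined


# Toy data (assumed for illustration only).
identity = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [1.0, 0.0]

# Stacking: the refined query of hop 1 becomes the query of hop 2.
p1, u1 = attention_hop(regions, question, identity, identity, [1.0, 1.0])
p2, u2 = attention_hop(regions, u1, identity, identity, [1.0, 1.0])
print(p1, u1)
print(p2, u2)
```

Feeding the refined query into a second hop is what the slides call multi-step reasoning: the second distribution is conditioned on what the first hop attended to.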

10. Result

12. Residual Learning

13. Problem of degradation
- More depth gives more accuracy, but deep networks can suffer from vanishing/exploding gradients
• BN, Xavier init, and Dropout can handle this (up to ~30 layers)
- With even deeper networks, the degradation problem occurs
• This is not just overfitting: training error also increases

14. Residual Network (ResNet)
Residual Block
- To avoid the degradation problem, add a shortcut connection.
- F(x) and the shortcut are added element-wise, then passed through ReLU.
- Similar to LSTM
http://torch.ch/blog/2016/02/04/resnets.html
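A residual block of the kind described above can be sketched as follows. This is an illustrative toy, assuming two fully connected layers for F(x) with hypothetical weights `W1` and `W2`; real ResNets use convolutions and batch normalization, which are omitted here.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(xs):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in xs]


def residual_block(x, W1, W2):
    """y = ReLU(F(x) + x), with F(x) = W2 * ReLU(W1 * x)."""
    f = matvec(W2, relu(matvec(W1, x)))
    return relu([fi + xi for fi, xi in zip(f, x)])


zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]

# With F(x) = 0 the block passes a non-negative input through unchanged.
y = residual_block(x, zeros, zeros)
print(y)
```

Because the block reduces to the identity when F(x) is zero, adding more blocks should not make the training error worse in principle, which is the motivation for the shortcut.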

15. Multimodal Residual Network

16. Introduction
- Extends deep residual learning to visual QA
- Achieved state-of-the-art results on the Visual QA dataset (but not today :( )
- Introduces a method to visualize the spatial attention effect of joint residual mappings

17. Background
SAN
- Question information contributes only weakly, which causes a bottleneck
Baseline [Lu et al.]
- With just element-wise multiplication, visual and question features are embedded very well
MRN
- Shortcut mapping and stacking architecture
- No weighted sum
- Instead uses global multiplication, as [Lu et al.] does
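The MRN idea above (joint residual function by element-wise multiplication, plus a shortcut on the question side, stacked over several blocks) can be sketched as below. This loosely follows the formulation in Kim et al. with tanh nonlinearities; the function names, the parameter layout, and the toy 2-dimensional vectors are assumptions for illustration, and details such as dropout and the final classifier are left out.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def tanh_vec(xs):
    """Element-wise tanh."""
    return [math.tanh(x) for x in xs]


def mrn_block(q, v, W_q, W_1, W_2, W_short):
    """One learning block: linear shortcut on q plus a joint residual F(q, v)."""
    q_branch = tanh_vec(matvec(W_q, q))
    v_branch = tanh_vec(matvec(W_2, tanh_vec(matvec(W_1, v))))
    # Joint residual function: element-wise multiplication, no weighted sum.
    joint = [a * b for a, b in zip(q_branch, v_branch)]
    # Shortcut with a linear mapping rather than a pure identity.
    shortcut = matvec(W_short, q)
    return [s + j for s, j in zip(shortcut, joint)]


def mrn(q, v, blocks):
    """Stack blocks: each block's output is the next block's question input."""
    h = q
    for params in blocks:
        h = mrn_block(h, v, *params)
    return h


# Toy parameters (assumed): identity matrices everywhere, 3 stacked blocks.
identity = [[1.0, 0.0], [0.0, 1.0]]
blocks = [(identity, identity, identity, identity)] * 3
question = [0.5, -0.5]
visual = [1.0, 2.0]

out = mrn(question, visual, blocks)
print(out)
```

The same visual feature vector is fed to every block, while the question-side representation is the one carried through the shortcuts.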

20. Quantitative Analysis
- (a) shows a large improvement over SAN; (b) is better.
- (c) adding an extra embedding for the question causes overfitting.
- (d) an identity shortcut causes degradation (an extra linear mapping is needed).
- (e) performs reasonably, but the extra shortcut is not essential.

21. Quantitative Analysis
# of Learning Blocks
- 58.85% (L=1), 59.44% (L=2), 60.53% (L=3), 60.42% (L=4)
Visual Features
- ResNet-152 is significantly better than VGGNet
- Even though the ResNet feature has fewer dimensions (2048 vs 4096).
# of Answer Classes
- There is a trade-off among answer types, but 2k is best

22.
- Implicit attention with multiplication
- Get a high-resolution attention map

24. Reference
- Yang, Zichao, et al. "Stacked attention networks for image question answering." arXiv preprint arXiv:1511.02274 (2015).
- Kim, Jin-Hwa, et al. "Multimodal Residual Learning for Visual QA." arXiv preprint arXiv:1606.01455 (2016).
- Antol, Stanislaw, et al. "VQA: Visual question answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.
