Generation of Synthetic Referring Expressions
for Object Segmentation in Videos
Author: Ioannis Kazakos
Table of Contents
1. Topic & Background
2. Relevant Literature
3. Motivation
4. Method
5. Experiments & Results
6. Conclusions
1. Topic & Background
Vision & Language
● Recently emerged research area
● Driven by the deep learning revolution and independent successes in CV and NLP
○ CNNs, Object detection/segmentation models
○ LSTMs, Word embeddings
● Many applications
○ Autonomous driving
○ Assistance of visually impaired individuals
○ Interactive video editing
○ Navigation from vision and language etc.
Vision & Language Tasks
● Visual Question Answering, Agrawal et al. 2015
● Caption Generation, Vinyals et al. 2015
● Text to Images, Zhang et al. 2016
● And many more!
● Referring Expression
○ A description that accurately identifies a specific object, and no other object, in the current scene
○ Example:
■ “a woman” ❌
■ “a woman in red” ❌
■ “a woman in red on the right” ✅
■ “a woman in red top and blue shorts” ✅
● Object Segmentation
○ Assign a label to every pixel corresponding to the target object
Object Segmentation with Referring Expressions
Referring Expression Video Object Segmentation
2. Relevant Literature
Many works on images
● First work: “Segmentation from Natural Language Expressions”, Hu et al. 2016
● Subsequent works jointly model vision and language features and leverage attention to better capture the dependencies between them
● Most of these works use the Refer-It collection of datasets for training and evaluation
○ Three large-scale image datasets with referring expressions and segmentation masks
○ Collected on top of Microsoft COCO (Common Objects in Context)
○ RefCOCO, RefCOCO+ and RefCOCOg
● 142,209 referring expressions
● 50,000 objects
● 19,994 images
RefCOCO dataset
Expression = “right kid”
Expression = “left elephant”
Few works on videos
● “Video Object Segmentation with
Language Referring Expressions”,
Khoreva et al. 2018
○ DAVIS-2017: Large set of 78 object classes
○ Too few videos (150 in total)
○ They use a frame-based model
○ Pre-training on RefCOCO is used
● “Actor and action segmentation from a
sentence”, Gavrilyuk et al. 2018
○ A2D: Small set of object classes (only 8 actors)
○ J-HMDB: Single object in each video
DAVIS-2017
3. Motivation
Main Challenges
● Models
○ Temporal consistency across frames
○ Models’ size and complexity
● Data
○ No large-scale datasets for videos
○ Poor quality of crowdsourced referring expressions
■ ~10% fail to correctly describe the target object (i.e., are not valid referring expressions)
Analysis from Bellver et al. 2020 on the A2D and DAVIS-2017 datasets
Method Inspiration
(Example frames from A2D and DAVIS-2017)
● Existing datasets include trivial cases where a single object from each class
appears
● In such cases an object can be identified using only its class, e.g. “a person” or “a horse”
● Existing large datasets for video object segmentation are labeled in terms of
object classes
● Annotating a large dataset with referring expressions requires tremendous
human effort
Basic Idea
Generate (automatically) synthetic referring expressions starting from an object’s
class and enhancing them with other cues without any human annotation cost
Thesis Purpose
1. Propose a method for generating synthetic referring expressions for a large-scale video object segmentation dataset
2. Evaluate the effectiveness of the generated synthetic referring expressions for the task of video object segmentation with referring expressions
4. Method
YouTube-VIS Dataset
YouTube-VOS
→ Large-scale dataset for video object segmentation
→ Short YouTube videos of 3-6 seconds
→ 4,453 videos in total
→ 94 object categories
YouTube-VIS
→ Created on top of YouTube-VOS
→ 2,883 videos
→ 40 object classes
→ Exhaustively annotated = All objects belonging to
the 40 classes are labeled with pixel-wise masks.
● The formulation of our method allows its application to any other object detection/segmentation dataset
● We apply the proposed method to the YouTube-VIS dataset
Overview
1. Ground-truth annotations
● Object class
● Bounding boxes
○ Relative size
○ Relative location
2. Faster R-CNN, Ren et al. 2015
● Enhanced with attribute head by Tang et al. 2020
● Pre-trained on Visual Genome dataset for attribute detection
○ Able to detect a predefined set of 201 attributes
○ Includes color and non-color attributes
○ Non-color attributes can be adjectives (“large”, “spotted”) or verbs (“surfing”)
Cues
1. Object Class (e.g. “a person”)
○ It is sufficient on its own only when a single object of this class is present in the video frame
○ However, in most cases more cues are necessary
Cues
2. Relative Size
○ The areas At and Ao of the target and of another object's bounding boxes are computed:
■ At >= 2Ao : “bigger” is added to the referring expression
■ At <= 0.5Ao : “smaller”, respectively
■ 0.5Ao < At < 2Ao : the relative size cue is not applicable
○ Similarly for more objects, “biggest”/“smallest” if the target is “bigger”/“smaller” than all other objects (see the code sketch below)
“a bigger dog”
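To make the rule above concrete, here is a minimal Python sketch, assuming boxes are given as (x1, y1, x2, y2) pixel coordinates; the function and variable names are illustrative, not the thesis implementation.

```python
def box_area(box):
    """Area of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def relative_size_word(target_box, other_boxes):
    """Return 'bigger'/'smaller' (or 'biggest'/'smallest' when there are
    several other objects), or None when the size cue is not applicable."""
    at = box_area(target_box)
    other_areas = [box_area(b) for b in other_boxes]
    n = len(other_areas)
    if n and all(at >= 2 * ao for ao in other_areas):
        return "bigger" if n == 1 else "biggest"
    if n and all(at <= 0.5 * ao for ao in other_areas):
        return "smaller" if n == 1 else "smallest"
    return None  # 0.5*Ao < At < 2*Ao for some object: cue not added

# e.g. relative_size_word((0, 0, 200, 150), [(0, 0, 80, 60)]) -> "bigger",
# which yields a referring expression like "a bigger dog"
```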
Cues
3. Relative Location (1 or 2 other objects of the same class)
○ The most discriminative axis (X or Y) is determined using the bounding box boundaries
○ The maximum non-overlapping distance between the bounding boxes is calculated
○ If the distance is above a certain threshold, the relative location is computed according to the selected axis:
■ If X-axis: “on the left” / “on the right”
■ If Y-axis: “in the front” / “in the back”
○ For 3 objects, the relative locations of each pair of objects are combined (e.g. “in the middle”, “in the front left”, etc.); see the code sketch below
“rabbit on the left”
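Below is a hedged sketch of the relative-location rule for a pair of objects of the same class: it picks the axis with the larger non-overlapping gap between the two boxes and, if that gap exceeds a threshold, emits the corresponding phrase. The threshold value and the front/back convention are assumptions.

```python
def relative_location_phrase(target_box, other_box, min_gap=20):
    """Boxes are (x1, y1, x2, y2) with the Y axis growing downwards;
    objects lower in the frame are assumed to be 'in the front'."""
    tx1, ty1, tx2, ty2 = target_box
    ox1, oy1, ox2, oy2 = other_box

    # Non-overlapping gap along each axis (negative if the boxes overlap there)
    gap_x = max(ox1 - tx2, tx1 - ox2)
    gap_y = max(oy1 - ty2, ty1 - oy2)

    # The most discriminative axis is the one with the larger separation
    if max(gap_x, gap_y) < min_gap:
        return None  # objects too close along both axes: cue not applicable

    if gap_x >= gap_y:  # X axis is more discriminative
        return "on the left" if tx2 <= ox1 else "on the right"
    return "in the back" if ty2 <= oy1 else "in the front"

# e.g. relative_location_phrase((10, 50, 60, 120), (200, 40, 260, 130))
# -> "on the left", giving "rabbit on the left"
```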
Cues
4. Attributes
○ Each Faster R-CNN detection is matched to the target object using Intersection-over-Union
○ An attribute is added to the referring expression only if it is unique to the target object
○ Attributes can be colors, other adjectives (“spotted”, “large”) or verbs (“walking”, “surfing”)
○ We select up to 2 color attributes (e.g. “brown and black dog”) and 1 non-color attribute (e.g. “walking”); a sketch follows below
Detected Attributes:
'white' : 0.9250
'black' : 0.8844
'brown' : 0.8062
“a white rabbit”
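A minimal sketch of the attribute cue, assuming each attribute detection is a box plus a dict of attribute scores: the detection is matched to the target's box by IoU, and an attribute is kept only if no other object shares it. The 0.5 IoU threshold and the small color list are illustrative assumptions.

```python
def iou(a, b):
    """Intersection-over-Union between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def matched_attributes(target_box, detections, iou_thr=0.5):
    """detections: list of (box, {attribute: score}) from the attribute head.
    Returns the attribute scores of the best-matching detection, if any."""
    best = max(detections, key=lambda d: iou(target_box, d[0]), default=None)
    return best[1] if best and iou(target_box, best[0]) >= iou_thr else {}

# Illustrative color vocabulary; the real attribute set has 201 entries
COLORS = {"white", "black", "brown", "red", "blue", "green", "grey", "yellow"}

def unique_attributes(target_attrs, other_attrs_list, max_colors=2, max_other=1):
    """Keep attributes not shared by other objects: up to 2 colors + 1 non-color."""
    shared = set().union(*other_attrs_list) if other_attrs_list else set()
    ranked = sorted(target_attrs, key=target_attrs.get, reverse=True)
    unique = [a for a in ranked if a not in shared]
    colors = [a for a in unique if a in COLORS][:max_colors]
    non_colors = [a for a in unique if a not in COLORS][:max_other]
    return colors + non_colors
```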
SynthRef-YouTube-VIS
Examples of referring expressions generated with the proposed method
5. Experiments & Results
Model
We use the RefVOS model (Bellver et al. 2020) for the experiments:
● Frame-based model
● DeepLabv3 visual encoder
● BERT language encoder
● Multi-modal embedding obtained via multiplication (see the sketch below)
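The fusion step is sketched below in PyTorch, assuming the BERT sentence embedding is linearly projected to the visual channel dimension and multiplied element-wise with the DeepLabv3 feature map; shapes and layer names are assumptions rather than the RefVOS code.

```python
import torch
import torch.nn as nn

class MultiplicativeFusion(nn.Module):
    """Fuse a visual feature map with a sentence embedding by multiplication."""
    def __init__(self, visual_dim=256, lang_dim=768):
        super().__init__()
        self.proj = nn.Linear(lang_dim, visual_dim)  # BERT dim -> visual channels

    def forward(self, visual_feat, lang_emb):
        # visual_feat: (B, C, H, W) from the DeepLabv3 encoder
        # lang_emb:    (B, L) sentence embedding from BERT
        lang = self.proj(lang_emb)              # (B, C)
        lang = lang[:, :, None, None]           # (B, C, 1, 1), broadcast over H, W
        return visual_feat * lang               # (B, C, H, W) multi-modal features

# fused = MultiplicativeFusion()(torch.randn(2, 256, 60, 60), torch.randn(2, 768))
```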
Training Details
● Batch size of 8 video frames (2 GPUs)
● Frames are cropped/padded to 480x480
● SGD optimizer
● Learning rate policy depends on the target dataset
Evaluation Metrics
1. Region Similarity (J)
Jaccard Index (Intersection-over-Union) between the predicted and ground-truth masks
2. Contour Accuracy (F)
F1-score of the contour-based precision Pc and recall Rc between the contour points of the predicted mask c(M) and the ground truth c(G), computed via bipartite graph matching
3. Precision@X
Given a threshold X in the range [0.5, 0.9], a predicted mask for an object is counted as a true positive if its J is larger than X, and as a false positive otherwise. Precision is then computed as the ratio between the number of true positives and the total number of instances (see the sketch below)
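A small NumPy sketch of Region Similarity (J) and Precision@X for binary masks follows; Contour Accuracy (F) is omitted because it requires the bipartite matching of contour points. The array-based mask format is an assumption.

```python
import numpy as np

def region_similarity(pred_mask, gt_mask):
    """J = Jaccard index (IoU) between two binary (H, W) masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def precision_at(jaccards, threshold):
    """Precision@X: fraction of instances whose J exceeds the threshold X."""
    return float((np.asarray(jaccards) > threshold).mean())

# e.g. average precision over the usual thresholds:
# np.mean([precision_at(js, x) for x in (0.5, 0.6, 0.7, 0.8, 0.9)])
```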
Experiments
1. Extra pre-training of the model using the generated synthetic data and
evaluating on DAVIS-2017 and A2D Sentences datasets
Results on DAVIS-2017
(Results table comparing pre-training setups on DAVIS-2017 Validation and DAVIS-2017 Train & Validation, with and without fine-tuning on the target dataset)
Qualitative Results on DAVIS-2017
Pre-trained only on RefCOCO vs. pre-trained on RefCOCO + SynthRef-YouTube-VIS
Results on A2D Sentences
Referring expressions in A2D Sentences focus on actions, including mostly verbs and fewer attributes
Experiments
1. Pre-training the model using the generated synthetic data and evaluating on the DAVIS-2017 and A2D Sentences datasets
2. Training on human vs synthetic referring expressions on the same videos
Refer-YouTube-VOS
● Seo et al. 2020 annotated the YouTube-VOS dataset with referring expressions
● This allowed a direct comparison of our synthetic referring expressions with human-produced
ones
Human vs Synthetic
Training:
1. Synthetic referring expressions from SynthRef-YouTube-VIS (our synthetic dataset)
2. Human-produced referring expressions from Refer-YouTube-VOS
Evaluation: On the test split of SynthRef-YouTube-VIS using human-produced referring expressions
from Refer-YouTube-VOS
Experiments
1. Pre-training the model using the generated synthetic data and evaluating on the DAVIS-2017 and A2D Sentences datasets
2. Training on human vs synthetic referring expressions on the same videos
3. Ablation study
Ablation Study
● Impact of Synthetic Referring Expression Information (DAVIS-2017)
● Freezing the language branch for synthetic pre-training
6. Conclusions
Conclusions
1. Pre-training a model on the synthetic referring expressions, in addition to training it on real ones, increases its ability to generalize across different datasets.
2. Gains are higher when no fine-tuning is performed on the target dataset.
3. Synthetic referring expressions do not achieve better results than human-produced ones, but they can be used complementarily without any additional annotation cost.
4. More information in the referring expressions yields better segmentation accuracy.
Future work
● Extend the proposed method by adding more cues
○ Use scene-graph generation models to add relationships between objects (image from Xu et al. 2017)
● Apply the proposed method to other existing object detection/segmentation datasets
○ Create synthetic expressions for Microsoft COCO images to be used interchangeably with RefCOCO
Thank you!
Questions?