Lecture-27-Introduction to VLM

Research Paper

• Each paper has the following parts


1. Title
2. Abstract
3. Introduction
4. Rest of the paper
• Related Work
• Method
• Results
• Conclusion

• Each part is equally important (25% each!)


How to read a research paper?
• You must read the paper several times to understand it.
• When you read the paper the first time, if you do not understand something, do not get stuck; keep reading and assume you will figure it out later.
• When you read it the second time you will understand much more, and the third time even more.
• Read the abstract first, then look at the figures with their captions, and then the conclusion.
How to read a research paper?
• First try to get a general idea of the paper:
• What problem is being solved?
• What are the main steps?
• How can I implement the method, even if I do not yet understand why each step is performed the way it is?
• Try to relate the method to other methods you know, and conceptually find similarities and differences.
How to read a research paper?
• On the first reading it may be a good idea to skip the related work.
• Do not rely on a dictionary to simply look up the meaning of technical terms.
• Try to understand each concept in isolation, and then integrate them to understand the whole paper.
Useful Blogs about how to read a paper?
• https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPaper.pdf
• https://blogs.lse.ac.uk/impactofsocialsciences/2016/05/09/how-to-read-and-understand-a-scientific-paper-a-guide-for-non-scientists/



Visual-Language Models: Short Introduction
• Material from:
• A Dive into Vision-Language Models: https://huggingface.co/blog/vision_language_pretraining
• Beginner’s Guide to Large Language Models | by Digitate | Medium: https://medium.com/@igniobydigitate/a-beginners-guide-to-large-language-models-e5e9e63d84a
• Introduction to Visual-Language Model | by Navendu Brajesh | Medium: https://medium.com/@navendubrajesh/vision-language-models-an-introduction-37853f535415



Computer Vision Tasks
• Object Classification
• Object Detection
• Object Segmentation
• Instance Segmentation
• Object Retrieval
• Semantic Segmentation
• Action Classification
• Object Tracking
• ….
Limitations
• Computer Vision techniques output images, bounding boxes, classes, …

• They don’t communicate through text

• Humans are good at communicating with language and text



Computer Vision Tasks requiring language
• Images
• Image Captioning
• Visual Question Answering
• Image-to-Text Retrieval
• Text-to-Image Retrieval
• Text-guided image generation
• Video
• Video Captioning
• Video Q&A
• Video-text Retrieval
• …
Image Captioning

O. Vinyals et al., Show and Tell: A Neural Image Caption Generator, 2014
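A quick way to see image captioning in action is the Hugging Face image-to-text pipeline. The snippet below is a minimal sketch, not the Show and Tell model itself; it assumes a publicly available community checkpoint (nlpconnect/vit-gpt2-image-captioning) and a hypothetical local image file.

```python
from transformers import pipeline

# Minimal sketch: caption a single image with a pretrained
# vision-encoder / text-decoder model (assumed checkpoint).
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
result = captioner("dog_in_grass.jpg")  # path or URL of the image (hypothetical file)
print(result)  # e.g. [{'generated_text': 'a dog laying in the grass ...'}]
```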
VQA

VQA: Visual Question Answering, Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh, 2015
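Visual question answering can be tried in a similarly compact way. This is a minimal sketch, assuming the visual-question-answering pipeline and the dandelin/vilt-b32-finetuned-vqa community checkpoint are available; the image file and question are illustrative, not from the cited paper.

```python
from transformers import pipeline

# Minimal sketch: answer a natural-language question about an image
# with a pretrained ViLT model fine-tuned for VQA (assumed checkpoint).
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
answers = vqa(image="kitchen.jpg", question="What is on the table?")
print(answers)  # e.g. [{'answer': 'food', 'score': ...}, ...]
```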
Text to Video Retrieval

Sirnam, Swetha; Rizve, Mamshad Nayeem; Kuehne, Hilde; Shah, Mubarak. Preserving Modality Structure Improves Multi-Modal Learning, ICCV, 2023
Text-to-Image

Gu, Shuyang, et al. "Vector quantized diffusion model for text-to-image synthesis."
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
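For a hands-on feel for text-guided image generation, the sketch below uses the diffusers library with a widely available latent diffusion checkpoint (runwayml/stable-diffusion-v1-5). Note this is a different model family than the VQ-Diffusion paper cited above; the checkpoint, prompt, and output filename are illustrative assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline

# Minimal sketch: generate an image from a text prompt with a pretrained
# diffusion model (assumed checkpoint; not the VQ-Diffusion model cited above).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU is available

image = pipe("a dog lying in grass").images[0]
image.save("generated_dog.png")
```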
Natural Language Processing (NLP)

• Search engines
• Spam filtering
• Machine translation
• Sentiment analysis
• Document summarization
• …..



Natural Language Processing (NLP)

• Limitations
• Not able to decode visual cues

• Linguistic Ambiguities

• Inability to verify interpretations against real-world visual references



Natural Language Processing (NLP)
• Limitations
• Not able to decode visual cues
• Linguistic Ambiguities
• Visual Interpretations

• While they exhibit a flair for text analytics and generation, they fall short in decoding visual cues.
• They grapple with linguistic ambiguities and are handicapped when it comes to verifying their interpretations against real-world visual references.



Language Models
• Understand and Generate text
• Learn from raw text
• Transformer architecture

Source: Beginner’s Guide to Large Language Models | by Digitate | Medium


Large Language Models
• Pre-trained on large datasets

• They have a large number of parameters



LLM Datasets (GPT-3 training mix)
• Common Crawl makes up ~60% of the training data.
• WebText2 (OpenAI, derived from Reddit links) makes up ~22% of the training data.
• Books1 makes up ~8% of the training data.
• Books2 makes up ~8% of the training data.
• Wikipedia makes up ~3% of the training data.


Source: Beginner’s Guide to Large Language Models | by Digitate | Medium
LLMs
• LaMDA: Developed by Google, trained on 1.56 trillion words of public dialog data. It powers the Bard chatbot.
• LLaMA: Developed by Meta, a relatively small model (7B parameters) yet competitive in accuracy with GPT-3.
• BLOOM: An open-source, multilingual model trained on data from 46 natural languages and 13 programming languages.
• Galactica: Developed by Meta; can store, combine, and reason about scientific knowledge.
• Codex: The model that powers GitHub Copilot. Proficient in more than a dozen programming languages, Codex can interpret simple commands in natural language and execute them.
• PaLM-E: Developed by Google, an LLM focused on robot sensor data.
• Chinchilla: Developed by DeepMind; considerably simplifies downstream use because it requires much less compute for inference and fine-tuning.

Source: Beginner’s Guide to Large Language Models | by Digitate | Medium


GPT (Generative Pre-trained Transformer)

Source: Beginner’s Guide to Large Language Models | by Digitate | Medium


Pre-Training
• Self-supervised
• Auto-regressive
• Unidirectional
• The model learns the relationships between words in the given context (a minimal sketch of this objective follows below)

Source: Beginner’s Guide to Large Language Models | by Digitate | Medium
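To make the self-supervised, auto-regressive, unidirectional objective concrete, here is a minimal PyTorch sketch (not taken from any of the cited sources): the raw text itself provides the supervision, and the model is trained to predict each token from the tokens before it.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of self-supervised, auto-regressive pre-training:
# predict token t from tokens < t (unidirectional); the raw text is the label.
def next_token_loss(logits, token_ids):
    # logits:    (batch, seq_len, vocab_size) from a causal (left-to-right) Transformer
    # token_ids: (batch, seq_len) integer ids of the training text
    pred = logits[:, :-1, :]      # predictions for positions 0 .. T-2
    target = token_ids[:, 1:]     # the "next" token at each of those positions
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

# Toy usage with random tensors standing in for a real model's output.
logits = torch.randn(2, 8, 100)          # batch of 2, sequence length 8, vocab of 100
tokens = torch.randint(0, 100, (2, 8))
print(next_token_loss(logits, tokens))
```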


Limitations of LLMs
• LLMs are large
• LLMs are black boxes
• LLMs can have bias
• LLMs can hallucinate
• LLMs may have IP issues


Source: Beginner’s Guide to Large Language Models | by Digitate | Medium
Applications of LLMs
• Code generation
• Content generation tools
• Copywriting
• Conversational tools
• Educational tools
• Enterprise search
• Information retrieval

Source: Beginner’s Guide to Large Language Models | by Digitate | Medium


Visual Language Models
• Vision systems are fundamental to understanding our world
• However, humans are good at communicating through language
• Complex relations between objects and their locations can be better described in human language (text)
• Visual-Language models bridge the gap between vision and language
• VLMs understand both images and text
• The output of a VLM can be modified through human-provided prompts, e.g.,
• segmenting a particular object by providing a bounding box,
• having interactive dialogues by asking questions about an image or video scene,
• manipulating a robot’s behavior through language instructions

Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Fahad Khan,
Foundational Models Defining a New Era in Vision: A Survey and Outlook, https://arxiv.org/pdf/2307.13721.pdf





• Models that understand both images and text



Language and Senses
• Humans are the only known species where much of knowledge learning happens symbolically, through language.
• This is in addition to information received directly from the five senses:
• Vision
• Hearing
• Touch
• Taste
• Smell

Source: Wikipedia
Large Multimodal Models (LMMs)

• Image
• Video
• Text
• Audio (speech, music)
• Physiological signals
• ….



Visual-Language Tasks
• Image retrieval from natural language text
• Phrase grounding, i.e., performing object detection from an input image and
natural text (example: A young person swings a bat).
• Visual question answering, i.e., finding answers from an input image and a
question in natural language
• Generate a caption for a given image
• Detection of hate speech from social media content involving both images and
text modalities
• Visual-Language Navigation
Credit: A Dive into Vision-Language Models
https://huggingface.co/blog/vision_language_pretraining
Contrastive Language-Image Pre-training (CLIP)

[Figure: an input image and an input text (“A dog lying in grass”) are each encoded into an image representation and a text representation, which are then compared.]

Learning Transferable Visual Models From Natural Language Supervision, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
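The idea behind CLIP pre-training is a symmetric contrastive objective: within a batch of (image, caption) pairs, matching pairs should get high similarity and all other combinations low similarity. The snippet below is a minimal sketch of that loss, not the authors’ code; the embeddings are random stand-ins for the outputs of an image encoder and a text encoder, and the temperature value is an illustrative choice.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a CLIP-style symmetric contrastive loss over a batch of
# matching (image, caption) pairs, assuming L2-normalized embeddings.
def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    logits = image_emb @ text_emb.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))            # matching pairs lie on the diagonal
    loss_img = F.cross_entropy(logits, targets)       # image -> text direction
    loss_txt = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_img + loss_txt) / 2

# Stand-ins for encoder outputs (batch of 4 pairs, 512-dim embeddings).
img = F.normalize(torch.randn(4, 512), dim=-1)
txt = F.normalize(torch.randn(4, 512), dim=-1)
print(clip_contrastive_loss(img, txt))
```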
GPT-4

https://cdn.openai.com/papers/gpt-4.pdf



MiniGPT-4

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models, Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
LLaVA

Visual Instruction Tuning, Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
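Instruction-tuned VLMs such as LLaVA can be queried with an image plus a free-form prompt. The snippet below is a rough sketch, assuming a recent version of the transformers library and the community-hosted llava-hf/llava-1.5-7b-hf checkpoint (neither is mentioned in the lecture); the image file and prompt are illustrative.

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image

# Rough sketch (assumed checkpoint and API version): ask a LLaVA-style model
# a question about an image and decode its free-form text answer.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

prompt = "USER: <image>\nWhat is happening in this picture? ASSISTANT:"
image = Image.open("scene.jpg")  # hypothetical local image
inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```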


Video-ChatGPT

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
PG-Video-LLaVA

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models, Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan


Thank you
