Lecture-27-Introduction to VLM
• https://blogs.lse.ac.uk/impactofsocialsciences/2016/05/09/how-to-read-and-understand-a-scientific-paper-a-guide-for-non-scientists/
O. Vinyals et al., Show and Tell: A Neural Image Caption Generator, 2014
1/10/2024 CAP6412 - Lecture 1 Introduction 23
VQA
VQA: Visual Question Answering, Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C.
Lawrence Zitnick, Dhruv Batra, Devi Parikh, 2015
Text to Video Retrieval
Gu, Shuyang, et al. "Vector quantized diffusion model for text-to-image synthesis."
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
Natural Language Processing (NLP)
• Search engines
• Spam filtering
• Machine translation
• Sentiment analysis
• Document summarization
• …..
• Limitations
• Not able to decode visual cues
• Linguistic Ambiguities
• NLP models show a flair for text analytics and generation, but they fall short in decoding visual cues.
• They grapple with linguistic ambiguities and are handicapped when it comes to verifying their interpretations against real-world visual references.
• Auto-regressive
• Unidirectional
Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Fahad Khan,
Foundational Models Defining a New Era in Vision: A Survey and Outlook, https://arxiv.org/pdf/2307.13721.pdf
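The "Auto-regressive" and "Unidirectional" bullets above can be illustrated with a minimal sketch. It is not taken from the lecture or the survey: `causal_mask`, `generate`, `toy_next_token`, and the tiny lookup table are hypothetical names used only for illustration. The mask shows that position i may only look at positions 0..i, and the loop shows that each new token is produced from the tokens generated so far.

```python
def causal_mask(n):
    """Unidirectional (causal) mask: row i has 1 where position i may
    attend, i.e. only to positions j <= i, never to future tokens."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def generate(next_token, prompt, max_new_tokens, eos="<eos>"):
    """Auto-regressive decoding: repeatedly feed the sequence produced so
    far to `next_token` and append its prediction."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == eos:
            break
    return tokens


# Hypothetical stand-in for a language model: looks only at the last token.
TOY_TABLE = {"a": "dog", "dog": "runs", "runs": "<eos>"}


def toy_next_token(tokens):
    return TOY_TABLE.get(tokens[-1], "<eos>")


mask = causal_mask(3)
sentence = generate(toy_next_token, ["a"], max_new_tokens=5)
print(mask)      # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(sentence)  # ['a', 'dog', 'runs', '<eos>']
```

A real model would replace `toy_next_token` with a network that scores every vocabulary item given the full prefix; the control flow stays the same.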
Source: Wikipedia
Large Multimodal Models (LMMs)
• Image
• Video
• Text
• Audio (speech, music)
• Physiological signals
• ….
Contrastive Language Image Pre-training (CLIP)
Input Image → Image Representation
Input Text → Text Representation
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
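To make the contrastive idea concrete, here is a rough sketch of a CLIP-style symmetric loss over a batch of paired image and text representations. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the embeddings are hand-written toy vectors rather than encoder outputs, and the temperature is fixed at 0.07 (the paper treats it as a learned parameter). Pure Python is used so the example runs without extra libraries.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image i should match text i and vice versa.
    The fixed temperature is an assumption made for this sketch."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits_t))


# Toy batch: two images and two captions in a 2-D embedding space.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(matched < shuffled)  # True: aligned pairs give the lower loss
```

Training pushes the loss down, which pulls each image representation toward its own caption and away from the other captions in the batch.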
GPT-4
https://cdn.openai.com/papers/gpt-4.pdf
Video ChatGPT
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
PG-Video-LLaVA