SlideShare a Scribd company logo
DALL-E 3: "A detailed graphic that visualizes a multimodal vector embedding space"
Multimodal LLMs
• What are Multimodal Language Models
• Background / How do they work
• LLaVA papers/projects
• LLaVA model demonstration
• Image classification project with LLaVA
Robert McDermott (he/him)
Director: Solutions, Engineering & Architecture (SEA)
rmcdermo@fredhutch.org
Deep Learning Affinity Group (DLAG)
https://research.fredhutch.org/dlag/en.html
Feb 20, 2024
Who Am I?
2
Link: AI Robert
3
The Papers to Read
LLaVA 1.0
https://arxiv.org/abs/2304.08485
https://arxiv.org/pdf/2304.08485.pdf
https://llava-vl.github.io/
https://github.com/haotian-liu/LLaVA
https://arxiv.org/abs/2310.03744
https://arxiv.org/pdf/2310.03744.pdf
https://huggingface.co/liuhaotian/llava-v1.5-13b
https://arxiv.org/abs/2306.00890
https://arxiv.org/pdf/2306.00890.pdf
https://huggingface.co/microsoft/llava-med-7b-delta
https://github.com/microsoft/LLaVA-Med
LLaVA 1.5 LLaVA-Med
4
Multimodal Language Models
Multimodal language models are AI systems designed to understand, interpret, and generate information across different
forms of data, such as text and images. These models leverage large datasets of annotated examples to learn associations
between text and visual content, enabling them to perform tasks that require comprehension of both textual and visual
information.
Why is the
sky blue?
A person wearing a red cap and
sleeveless outfit is soaring through
a cloudless sky on a brightly
colored hang glider.
The sky appears blue because
molecules in the Earth's
atmosphere scatter sunlight the
shorter wavelength of blue more
than other colors.
Multimodal
Language
Model
I like pizza
5
Multimodal Language Models
Source: https://twitter.com/GregKamradt/status/1711772496159252981
Use Case Breakdown
Describe
• Animal Identification
• What's in this photo
Interpret
• Technical Flame Graph Interpretation
• Schematic Interpretation
• Twitter Thread Explainer
Recommend
• Food Recommendations
• Website Feedback
• Painting Feedback
Convert
• Figma Screens
• Adobe Lightroom Settings
• Suggest ad copy based on a webpage
Extract
• Structured Data From Driver's License
• Extract structured itemsfrom an image
• Handwriting Extraction
Assist
• Excel Formula Helper
• Find My Glasses
• Live Poker Advice
• Video game recommendations
Evaluate
• Dog Cuteness Evaluator
• Bounding Box Evaluator
• Thumbnail Testing
Links to Examples
6
AI Vision has come a long way.
GPT-4 Vision
LLaVA 1.6 34B
Research scientist and a founding member
at OpenAI. Sr. Director of AI at Telsa.
source: https://karpathy.github.io/2012/10/22/state-of-computer-vision/
2024
2012
7
What’s funny about this?
Image source: https://www.reddit.com/r/hmmm/comments/ubab5v/hmmm/
LLaVA 1.6 34B
GPT-4 Vision
What’s unusual about this image?
LLaVA 1.6 34B
GPT-4 Vision
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Quick Introduction to Tokens and Embeddings
required to understand how LLMs process text and
images.
9
Text Tokenization
10
Tokenization is a foundational step in the preprocessing of text for many natural language processing (NLP) tasks, including for language
models like GPT-4 and Llama-2. Tokenization involves breaking down text into smaller chunks, or "tokens", which can be as short as one
character or as long as one word (or even longer in some cases). These tokens can then be processed, analyzed, and used as input for
machine learning models.
https://platform.openai.com/tokenizer
Tokenization
Visualized
Resulting
Token IDs
11
Vector Embeddings
Applications
• Natural Language Processing tasks: sentiment analysis,
named entity recognition, etc.
• Information retrieval: search engines, recommendation
systems.
• Visualization: using dimensionality reduction to visualize
semantic relationships
https://huggingface.co/spaces/mteb/leaderboard
5.41765615e-02 4.20716889e-02 -2.41547506e-02 1.11813843e-01
-9.33169946e-02 -7.56109739e-03 6.54651076e-02 -1.54011259e-02
-2.80906167e-02 1.97344255e-02 -1.58324391e-02 -8.46638903e-02
-1.31631363e-02 1.98841579e-02 -1.26802064e-02 -9.36008468e-02
-4.51933630e-02 -1.20324306e-02 -2.48974599e-02 4.87890420e-03
-2.54017510e-03 4.92022634e-02 5.12179844e-02 2.54505035e-02
-9.70738381e-02 1.42842624e-02 -3.46412621e-02 -8.45314115e-02
-7.38010108e-02 -2.72879936e-02 -2.81507652e-02 -5.01780510e-02
5.35405474e-03 2.96438616e-02 -5.18742464e-02 -6.24342896e-02
6.04359470e-02 -2.22260728e-02 3.36266570e-02 5.17647602e-02
-3.09484527e-02 -8.72448832e-02 -1.53413722e-02 9.27508809e-03
-4.92608221e-03 -4.97105941e-02 -1.04904985e-02 -4.15333314e-03
1.55722797e-02 -2.66851094e-02 -6.49709478e-02 -5.94373941e-02
-2.10976638e-02 3.59102758e-03 5.88850211e-03 -1.03685725e-02
5.03626876e-02 -3.31290103e-02 -7.70502910e-02 1.53052341e-02
*
"A fat tuxedo cat" =
* The "all-MiniLM-L6-v2" embedding model has 384 dimensions
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Definition
• Representations of text in numerical form.
• Convert variable-length text into fixed-size vectors in high-
dimensional space.
Purpose
• Capture semantic meaning and relationships between words,
phrases, or longer text.
• Enable mathematical operations on text (e.g., similarity
measurement, arithmetic operations).
Characteristics
• Words with similar meanings are close in vector space.
• Allows for operations like "king" - "man" + "woman" ≈ "queen".
There are many embedding models:
12
Vector Embeddings
• There are several dozen embedding models
• They range in complexity from 384 to 1536 dimensions
• They range in max sequence length from 512 to 8191 tokens
• Embedding models are generally not compatible with each other
Interactive embedding explorer:
https://blog.echen.me/embedding-explorer/
Semantic Text Similarity
13
Sentence 1 Sentence 2 Cosine Similarity
The cat sits outside The dog plays in the garden 0.2838
A man is playing guitar A woman watches TV -0.0327
The new movie is awesome The new movie is so great 0.8939
Jim can run very fast James is the fastest runner 0.6844
My goldfish is hungry Pluto is a planet! 0.0454
• Measures the cosine of the angle between two vectors.
• Value between -1 and 1; where 1 means vectors are identical, 0 means
orthogonal, and -1 means diametrically opposite (rare in text embeddings).
These clearly used different
embedding models
https://gist.github.com/robert-mcdermott/67cf2623237989bc2315d35a108246ef
Embeddings Plot Tool
14
https://github.com/robert-mcdermott/embeddings_plot
A command line utility I created to
visualize word embeddings
Embedding-plot, is a command line
utility that can visualize word
embeddings in either 2D or 3D scatter
plots using dimensionality reduction
techniques (PCA, t-SNE or UMAP) and
clustering in a scatter plot.
Image Embedding Example
15
source: https://www.researchgate.net/publication/282181243_Learning_Visual_Clothing_Style_with_Heterogeneous_Dyadic_Co-Occurrences
“Visualization of a 2D embedding of the
style space trained with strategic sampling
computed with t-SNE. The embedding is
based on 200,000 images from the test set.
For a clear visual representation, we
discretize the style space into a grid and
pick one image from each grid cell at
random.”
CLIP (Contrastive Language-Image Pre-Training)
16
Source: Paper: https://arxiv.org/pdf/2103.00020.pdf Code: https://github.com/OpenAI/CLIP
High-Level Architecture
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
The LLaVA Papers
17
LLaVA 1.0 – Large Language and Vision Assistant
18
• https://arxiv.org/abs/2304.08485
• https://arxiv.org/pdf/2304.08485.pdf
• https://llava-vl.github.io/
• https://github.com/haotian-liu/LLaVA
Instruction tuning large language models (LLMs) using machine-
generated instruction-following data has improved zero-shot
capabilities on new tasks, but the idea is less explored in the
multimodal field. In this paper, we present the first attempt to use
language-only GPT-4 to generate multimodal language-image
instruction-following data. By instruction tuning on such generated
data, we introduce LLaVA: Large Language and Vision Assistant, an
end-to-end trained large multimodal model that connects a vision
encoder and LLM for general-purpose visual and language
understanding. Our early experiments show that LLaVA
demonstrates impressive multimodel chat abilities, sometimes
exhibiting the behaviors of multimodal GPT-4 on unseen
images/instructions and yields a 85.1% relative score compared
with GPT-4 on a synthetic multimodal instruction-following dataset.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4
achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4
generated visual instruction tuning data, our model and code base
publicly available.
Abstract
LLaVA 1.0 Training
19
LLaVA 1.0 Architecture
20
Tokenizing + Embedding Tokenizing + Embedding
LLaVA 1.0 Performance
21
22
• https://arxiv.org/abs/2310.03744
• https://arxiv.org/pdf/2310.03744.pdf
• https://huggingface.co/liuhaotian/llava
-v1.5-13b
Large multimodal models (LMM) have recently shown
encouraging progress with visual instruction tuning. In this
note, we show that the fully-connected vision-language cross-
modal connector in LLaVA is surprisingly powerful and data-
efficient. With simple modifications to LLaVA, namely, using
CLIP-ViT-L-336px with an MLP projection and adding academic-
task-oriented VQA data with simple response formatting
prompts, we establish stronger baselines that achieve state-of-
the-art across 11 benchmarks.
Our final 13B checkpoint uses merely 1.2M publicly available
data, and finishes full training in ~1 day on a single 8-A100
node.
We hope this can make state-of-the-art LMM research more
accessible. Code and model will be publicly available.
Abstract
LLaVA (1.5) – Large Language and Vision Assistant
23
LLaVA 1.5 Changes
Modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented
VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art
across 11 benchmarks.
LLaVA 1.5
24
Training Data Sources Performance
25
LLaVA 1.5 Comparison Examples
LLaVA 1.6
26
LLaVA-1.6-34B outperforms Gemini Pro on several benchmarks
https://llava-vl.github.io/blog/2024-01-30-llava-1-6/
Benchmarks Example
27
• https://arxiv.org/abs/2306.00890
• https://arxiv.org/pdf/2306.00890.pdf
Conversational generative AI has demonstrated remarkable promise for empowering
biomedical practitioners, but current investigationsfocus on unimodal text.
Multimodalconversational AI has seen rapid progress by leveraging billions of
image-text pairs from the public web, but such general-domain vision-language
models still lack sophisticationin understanding and conversing about biomedical
images. In this paper, we propose a cost-efficient approach for training a vision language
conversational assistant that can answer open-ended research questions
of biomedical images. The key idea is to leverage a large-scale, broad-coverage
biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to
self-instruct open-ended instruction-followingdata from the captions, and then
fine-tune a large general-domain vision-language model using a novel curriculum
learning method. Specifically,the model first learns to align biomedical vocabulary
using the figure-caption pairs as is, then learns to master open-ended conversational
semantics using GPT-4 generated instruction-followingdata, broadly mimicking
how a layperson gradually acquires biomedical knowledge. This enables us to train
a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less
than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational
capability and can follow open-ended instruction to assist with inquiries
about a biomedical image. On three standard biomedical visual question answering
datasets, fine-tuning LLaVA-Med outperforms previous supervised state-of-the-art
on certain metrics. To facilitate biomedical multimodal research, we will release
our instruction-followingdata and the LLaVA-Med model.
Abstract
LLaVA-Med
https://github.com/microsoft/LLaVA-Med
https://huggingface.co/microsoft/llava-med-7b-delta
28
LLaVA-Med
29
LLaVA-Med Comparison Examples
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
My LLaVA based Image Classifier Experiment
30
Full details, results and code: https://github.com/robert-mcdermott/LLM-Image-Classification
31
LLaVA 1.5: Handwritten Digit Classification
Abstract
https://github.com/robert-mcdermott/LLM-Image-Classification
32
LLaVA 1.5: Animal Classification
https://github.com/robert-mcdermott/LLM-Image-Classification
33
LLaVA 1.5: Chess Piece Identification
https://github.com/robert-mcdermott/LLM-Image-Classification
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Live Demo Time
34
Thank you
Robert McDermott (he/him)
Director: Solutions, Engineering & Architecture (SEA)
rmcdermo@fredhutch.org
Ad

More Related Content

What's hot (20)

AI Development with H2O.ai
AI Development with H2O.aiAI Development with H2O.ai
AI Development with H2O.ai
Yalçın Yenigün
 
How Does Generative AI Actually Work? (a quick semi-technical introduction to...
How Does Generative AI Actually Work? (a quick semi-technical introduction to...How Does Generative AI Actually Work? (a quick semi-technical introduction to...
How Does Generative AI Actually Work? (a quick semi-technical introduction to...
ssuser4edc93
 
The AI Revolution - Aaron Stelle WFG Title
The AI Revolution - Aaron Stelle WFG TitleThe AI Revolution - Aaron Stelle WFG Title
The AI Revolution - Aaron Stelle WFG Title
Aaron Stelle
 
EMC Atmos Cloud Storage Architecture
EMC Atmos Cloud Storage ArchitectureEMC Atmos Cloud Storage Architecture
EMC Atmos Cloud Storage Architecture
EMC
 
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Mihai Criveti
 
ChatGPT_Prompts.pptx
ChatGPT_Prompts.pptxChatGPT_Prompts.pptx
ChatGPT_Prompts.pptx
Chakrit Phain
 
Deploying Machine Learning Models to Production
Deploying Machine Learning Models to ProductionDeploying Machine Learning Models to Production
Deploying Machine Learning Models to Production
Anass Bensrhir - Senior Data Scientist
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMs
SylvainGugger
 
ChatGPT Evaluation for NLP
ChatGPT Evaluation for NLPChatGPT Evaluation for NLP
ChatGPT Evaluation for NLP
XiachongFeng
 
Benchmark comparison of Large Language Models
Benchmark comparison of Large Language ModelsBenchmark comparison of Large Language Models
Benchmark comparison of Large Language Models
Matej Varga
 
LangChain Intro by KeyMate.AI
LangChain Intro by KeyMate.AILangChain Intro by KeyMate.AI
LangChain Intro by KeyMate.AI
OzgurOscarOzkan
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
Zilliz
 
presentation.pdf
presentation.pdfpresentation.pdf
presentation.pdf
caa28steve
 
Conversational AI and Chatbot Integrations
Conversational AI and Chatbot IntegrationsConversational AI and Chatbot Integrations
Conversational AI and Chatbot Integrations
Cristina Vidu
 
Using Generative AI in the Classroom .pptx
Using Generative AI in the Classroom .pptxUsing Generative AI in the Classroom .pptx
Using Generative AI in the Classroom .pptx
JonathanDietz3
 
Neural Language Generation Head to Toe
Neural Language Generation Head to Toe Neural Language Generation Head to Toe
Neural Language Generation Head to Toe
Hady Elsahar
 
Regulating Generative AI - LLMOps pipelines with Transparency
Regulating Generative AI - LLMOps pipelines with TransparencyRegulating Generative AI - LLMOps pipelines with Transparency
Regulating Generative AI - LLMOps pipelines with Transparency
Debmalya Biswas
 
Chunking, Embeddings, and Vector Databases
Chunking, Embeddings, and Vector DatabasesChunking, Embeddings, and Vector Databases
Chunking, Embeddings, and Vector Databases
Zilliz
 
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
Bhaskar Mitra
 
Neo4j – The Fastest Path to Scalable Real-Time Analytics
Neo4j – The Fastest Path to Scalable Real-Time AnalyticsNeo4j – The Fastest Path to Scalable Real-Time Analytics
Neo4j – The Fastest Path to Scalable Real-Time Analytics
Neo4j
 
How Does Generative AI Actually Work? (a quick semi-technical introduction to...
How Does Generative AI Actually Work? (a quick semi-technical introduction to...How Does Generative AI Actually Work? (a quick semi-technical introduction to...
How Does Generative AI Actually Work? (a quick semi-technical introduction to...
ssuser4edc93
 
The AI Revolution - Aaron Stelle WFG Title
The AI Revolution - Aaron Stelle WFG TitleThe AI Revolution - Aaron Stelle WFG Title
The AI Revolution - Aaron Stelle WFG Title
Aaron Stelle
 
EMC Atmos Cloud Storage Architecture
EMC Atmos Cloud Storage ArchitectureEMC Atmos Cloud Storage Architecture
EMC Atmos Cloud Storage Architecture
EMC
 
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Mihai Criveti
 
ChatGPT_Prompts.pptx
ChatGPT_Prompts.pptxChatGPT_Prompts.pptx
ChatGPT_Prompts.pptx
Chakrit Phain
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMs
SylvainGugger
 
ChatGPT Evaluation for NLP
ChatGPT Evaluation for NLPChatGPT Evaluation for NLP
ChatGPT Evaluation for NLP
XiachongFeng
 
Benchmark comparison of Large Language Models
Benchmark comparison of Large Language ModelsBenchmark comparison of Large Language Models
Benchmark comparison of Large Language Models
Matej Varga
 
LangChain Intro by KeyMate.AI
LangChain Intro by KeyMate.AILangChain Intro by KeyMate.AI
LangChain Intro by KeyMate.AI
OzgurOscarOzkan
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
Zilliz
 
presentation.pdf
presentation.pdfpresentation.pdf
presentation.pdf
caa28steve
 
Conversational AI and Chatbot Integrations
Conversational AI and Chatbot IntegrationsConversational AI and Chatbot Integrations
Conversational AI and Chatbot Integrations
Cristina Vidu
 
Using Generative AI in the Classroom .pptx
Using Generative AI in the Classroom .pptxUsing Generative AI in the Classroom .pptx
Using Generative AI in the Classroom .pptx
JonathanDietz3
 
Neural Language Generation Head to Toe
Neural Language Generation Head to Toe Neural Language Generation Head to Toe
Neural Language Generation Head to Toe
Hady Elsahar
 
Regulating Generative AI - LLMOps pipelines with Transparency
Regulating Generative AI - LLMOps pipelines with TransparencyRegulating Generative AI - LLMOps pipelines with Transparency
Regulating Generative AI - LLMOps pipelines with Transparency
Debmalya Biswas
 
Chunking, Embeddings, and Vector Databases
Chunking, Embeddings, and Vector DatabasesChunking, Embeddings, and Vector Databases
Chunking, Embeddings, and Vector Databases
Zilliz
 
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
Bhaskar Mitra
 
Neo4j – The Fastest Path to Scalable Real-Time Analytics
Neo4j – The Fastest Path to Scalable Real-Time AnalyticsNeo4j – The Fastest Path to Scalable Real-Time Analytics
Neo4j – The Fastest Path to Scalable Real-Time Analytics
Neo4j
 

Similar to Introduction to Multimodal LLMs with LLaVA (20)

Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Robert McDermott
 
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Robert McDermott
 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical
Dhruv Gohil
 
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAutomate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Anant Corporation
 
Emotion recognition from facial expression using fuzzy logic
Emotion recognition from facial expression using fuzzy logicEmotion recognition from facial expression using fuzzy logic
Emotion recognition from facial expression using fuzzy logic
Finalyear Projects
 
DMDS Winter 2015 Workshop 1 slides
DMDS Winter 2015 Workshop 1 slidesDMDS Winter 2015 Workshop 1 slides
DMDS Winter 2015 Workshop 1 slides
Paige Morgan
 
RAG Techniques – for engineering student
RAG Techniques – for engineering studentRAG Techniques – for engineering student
RAG Techniques – for engineering student
ÑïshĶãrsʜ Shäh
 
Large Language Modelsjjjhhhjjjjjjbbbbbbj.pdf
Large Language Modelsjjjhhhjjjjjjbbbbbbj.pdfLarge Language Modelsjjjhhhjjjjjjbbbbbbj.pdf
Large Language Modelsjjjhhhjjjjjjbbbbbbj.pdf
BrsioftBlogger
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Dony Riyanto
 
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
Daniel Zivkovic
 
SaaStr Annual 2024: Outsmarting LLMs: 5 Strategies for Founders & Technologis...
SaaStr Annual 2024: Outsmarting LLMs: 5 Strategies for Founders & Technologis...SaaStr Annual 2024: Outsmarting LLMs: 5 Strategies for Founders & Technologis...
SaaStr Annual 2024: Outsmarting LLMs: 5 Strategies for Founders & Technologis...
saastr
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
Gérard Dupont
 
A Strong Object Recognition Using Lbp, Ltp And Rlbp
A Strong Object Recognition Using Lbp, Ltp And RlbpA Strong Object Recognition Using Lbp, Ltp And Rlbp
A Strong Object Recognition Using Lbp, Ltp And Rlbp
Rikki Wright
 
wang-Leveraging-the-Power-of-ChatGPT-and-Vector-Databases-in-the-FreeBSD-Expe...
wang-Leveraging-the-Power-of-ChatGPT-and-Vector-Databases-in-the-FreeBSD-Expe...wang-Leveraging-the-Power-of-ChatGPT-and-Vector-Databases-in-the-FreeBSD-Expe...
wang-Leveraging-the-Power-of-ChatGPT-and-Vector-Databases-in-the-FreeBSD-Expe...
Svetlin Ivanov
 
Marvin_Capstone
Marvin_CapstoneMarvin_Capstone
Marvin_Capstone
Marvin Bertin
 
Introduction to LLM Post-Training - MIT 6.S191 2025
Introduction to LLM Post-Training - MIT 6.S191 2025Introduction to LLM Post-Training - MIT 6.S191 2025
Introduction to LLM Post-Training - MIT 6.S191 2025
Maxime Labonne
 
My Curriculum Vitae
My Curriculum VitaeMy Curriculum Vitae
My Curriculum Vitae
adil raja
 
Master LLMs with LangChain -the basics of LLM
Master LLMs with LangChain -the basics of LLMMaster LLMs with LangChain -the basics of LLM
Master LLMs with LangChain -the basics of LLM
ssuser3d8087
 
DeepPavlov 2019
DeepPavlov 2019DeepPavlov 2019
DeepPavlov 2019
Mikhail Burtsev
 
Rajesh - CV
Rajesh - CVRajesh - CV
Rajesh - CV
Rajesh Muddana
 
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Robert McDermott
 
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Robert McDermott
 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical
Dhruv Gohil
 
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAutomate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Anant Corporation
 
Emotion recognition from facial expression using fuzzy logic
Emotion recognition from facial expression using fuzzy logicEmotion recognition from facial expression using fuzzy logic
Emotion recognition from facial expression using fuzzy logic
Finalyear Projects
 
DMDS Winter 2015 Workshop 1 slides
DMDS Winter 2015 Workshop 1 slidesDMDS Winter 2015 Workshop 1 slides
DMDS Winter 2015 Workshop 1 slides
Paige Morgan
 
RAG Techniques – for engineering student
RAG Techniques – for engineering studentRAG Techniques – for engineering student
RAG Techniques – for engineering student
ÑïshĶãrsʜ Shäh
 
Large Language Modelsjjjhhhjjjjjjbbbbbbj.pdf
Large Language Modelsjjjhhhjjjjjjbbbbbbj.pdfLarge Language Modelsjjjhhhjjjjjjbbbbbbj.pdf
Large Language Modelsjjjhhhjjjjjjbbbbbbj.pdf
BrsioftBlogger
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Dony Riyanto
 
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
Daniel Zivkovic
 
SaaStr Annual 2024: Outsmarting LLMs: 5 Strategies for Founders & Technologis...
SaaStr Annual 2024: Outsmarting LLMs: 5 Strategies for Founders & Technologis...SaaStr Annual 2024: Outsmarting LLMs: 5 Strategies for Founders & Technologis...
SaaStr Annual 2024: Outsmarting LLMs: 5 Strategies for Founders & Technologis...
saastr
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
Gérard Dupont
 
A Strong Object Recognition Using Lbp, Ltp And Rlbp
A Strong Object Recognition Using Lbp, Ltp And RlbpA Strong Object Recognition Using Lbp, Ltp And Rlbp
A Strong Object Recognition Using Lbp, Ltp And Rlbp
Rikki Wright
 
wang-Leveraging-the-Power-of-ChatGPT-and-Vector-Databases-in-the-FreeBSD-Expe...
wang-Leveraging-the-Power-of-ChatGPT-and-Vector-Databases-in-the-FreeBSD-Expe...wang-Leveraging-the-Power-of-ChatGPT-and-Vector-Databases-in-the-FreeBSD-Expe...
wang-Leveraging-the-Power-of-ChatGPT-and-Vector-Databases-in-the-FreeBSD-Expe...
Svetlin Ivanov
 
Introduction to LLM Post-Training - MIT 6.S191 2025
Introduction to LLM Post-Training - MIT 6.S191 2025Introduction to LLM Post-Training - MIT 6.S191 2025
Introduction to LLM Post-Training - MIT 6.S191 2025
Maxime Labonne
 
My Curriculum Vitae
My Curriculum VitaeMy Curriculum Vitae
My Curriculum Vitae
adil raja
 
Master LLMs with LangChain -the basics of LLM
Master LLMs with LangChain -the basics of LLMMaster LLMs with LangChain -the basics of LLM
Master LLMs with LangChain -the basics of LLM
ssuser3d8087
 
Ad

Recently uploaded (20)

Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Ad

Introduction to Multimodal LLMs with LLaVA

  • 1. DALL-E 3: "A detailed graphic that visualizes a multimodal vector embedding space" Multimodal LLMs • What are Multimodal Language Models • Background / How do they work • LLaVA papers/projects • LLaVA model demonstration • Image classification project with LLaVA Robert McDermott (he/him) Director: Solutions, Engineering & Architecture (SEA) [email protected] Deep Learning Affinity Group (DLAG) https://research.fredhutch.org/dlag/en.html Feb 20, 2024
  • 2. Who Am I? 2 Link: AI Robert
  • 3. 3 The Papers to Read LLaVA 1.0 https://arxiv.org/abs/2304.08485 https://arxiv.org/pdf/2304.08485.pdf https://llava-vl.github.io/ https://github.com/haotian-liu/LLaVA https://arxiv.org/abs/2310.03744 https://arxiv.org/pdf/2310.03744.pdf https://huggingface.co/liuhaotian/llava-v1.5-13b https://arxiv.org/abs/2306.00890 https://arxiv.org/pdf/2306.00890.pdf https://huggingface.co/microsoft/llava-med-7b-delta https://github.com/microsoft/LLaVA-Med LLaVA 1.5 LLaVA-Med
  • 4. 4 Multimodal Language Models Multimodal language models are AI systems designed to understand, interpret, and generate information across different forms of data, such as text and images. These models leverage large datasets of annotated examples to learn associations between text and visual content, enabling them to perform tasks that require comprehension of both textual and visual information. Why is the sky blue? A person wearing a red cap and sleeveless outfit is soaring through a cloudless sky on a brightly colored hang glider. The sky appears blue because molecules in the Earth's atmosphere scatter sunlight the shorter wavelength of blue more than other colors. Multimodal Language Model I like pizza
  • 5. 5 Multimodal Language Models Source: https://twitter.com/GregKamradt/status/1711772496159252981 Use Case Breakdown Describe • Animal Identification • What's in this photo Interpret • Technical Flame Graph Interpretation • Schematic Interpretation • Twitter Thread Explainer Recommend • Food Recommendations • Website Feedback • Painting Feedback Convert • Figma Screens • Adobe Lightroom Settings • Suggest ad copy based on a webpage Extract • Structured Data From Driver's License • Extract structured itemsfrom an image • Handwriting Extraction Assist • Excel Formula Helper • Find My Glasses • Live Poker Advice • Video game recommendations Evaluate • Dog Cuteness Evaluator • Bounding Box Evaluator • Thumbnail Testing Links to Examples
  • 6. 6 AI Vision has come a long way. GPT-4 Vision LLaVA 1.6 34B Research scientist and a founding member at OpenAI. Sr. Director of AI at Telsa. source: https://karpathy.github.io/2012/10/22/state-of-computer-vision/ 2024 2012
  • 7. 7 What’s funny about this? Image source: https://www.reddit.com/r/hmmm/comments/ubab5v/hmmm/ LLaVA 1.6 34B GPT-4 Vision
  • 8. What’s unusual about this image? LLaVA 1.6 34B GPT-4 Vision
  • 9. Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Quick Introduction to Tokens and Embeddings required to understand how LLMs process text and images. 9
  • 10. Text Tokenization 10 Tokenization is a foundational step in the preprocessing of text for many natural language processing (NLP) tasks, including for language models like GPT-4 and Llama-2. Tokenization involves breaking down text into smaller chunks, or "tokens", which can be as short as one character or as long as one word (or even longer in some cases). These tokens can then be processed, analyzed, and used as input for machine learning models. https://platform.openai.com/tokenizer Tokenization Visualized Resulting Token IDs
  • 11. 11 Vector Embeddings Applications • Natural Language Processing tasks: sentiment analysis, named entity recognition, etc. • Information retrieval: search engines, recommendation systems. • Visualization: using dimensionality reduction to visualize semantic relationships https://huggingface.co/spaces/mteb/leaderboard 5.41765615e-02 4.20716889e-02 -2.41547506e-02 1.11813843e-01 -9.33169946e-02 -7.56109739e-03 6.54651076e-02 -1.54011259e-02 -2.80906167e-02 1.97344255e-02 -1.58324391e-02 -8.46638903e-02 -1.31631363e-02 1.98841579e-02 -1.26802064e-02 -9.36008468e-02 -4.51933630e-02 -1.20324306e-02 -2.48974599e-02 4.87890420e-03 -2.54017510e-03 4.92022634e-02 5.12179844e-02 2.54505035e-02 -9.70738381e-02 1.42842624e-02 -3.46412621e-02 -8.45314115e-02 -7.38010108e-02 -2.72879936e-02 -2.81507652e-02 -5.01780510e-02 5.35405474e-03 2.96438616e-02 -5.18742464e-02 -6.24342896e-02 6.04359470e-02 -2.22260728e-02 3.36266570e-02 5.17647602e-02 -3.09484527e-02 -8.72448832e-02 -1.53413722e-02 9.27508809e-03 -4.92608221e-03 -4.97105941e-02 -1.04904985e-02 -4.15333314e-03 1.55722797e-02 -2.66851094e-02 -6.49709478e-02 -5.94373941e-02 -2.10976638e-02 3.59102758e-03 5.88850211e-03 -1.03685725e-02 5.03626876e-02 -3.31290103e-02 -7.70502910e-02 1.53052341e-02 * "A fat tuxedo cat" = * The "all-MiniLM-L6-v2" embedding model has 384 dimensions https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 Definition • Representations of text in numerical form. • Convert variable-length text into fixed-size vectors in high- dimensional space. Purpose • Capture semantic meaning and relationships between words, phrases, or longer text. • Enable mathematical operations on text (e.g., similarity measurement, arithmetic operations). Characteristics • Words with similar meanings are close in vector space. • Allows for operations like "king" - "man" + "woman" ≈ "queen". There are many embedding models:
  • 12. 12 Vector Embeddings • There are several dozen embedding models • They range in complexity from 384 to 1536 dimensions • They range in max sequence length from 512 to 8191 tokens • Embedding models are generally not compatible with each other Interactive embedding explorer: https://blog.echen.me/embedding-explorer/
  • 13. Semantic Text Similarity 13 Sentence 1 Sentence 2 Cosine Similarity The cat sits outside The dog plays in the garden 0.2838 A man is playing guitar A woman watches TV -0.0327 The new movie is awesome The new movie is so great 0.8939 Jim can run very fast James is the fastest runner 0.6844 My goldfish is hungry Pluto is a planet! 0.0454 • Measures the cosine of the angle between two vectors. • Value between -1 and 1; where 1 means vectors are identical, 0 means orthogonal, and -1 means diametrically opposite (rare in text embeddings). These clearly used different embedding models https://gist.github.com/robert-mcdermott/67cf2623237989bc2315d35a108246ef
  • 14. Embeddings Plot Tool 14 https://github.com/robert-mcdermott/embeddings_plot A command line utility I created to visualize word embeddings Embedding-plot, is a command line utility that can visualize word embeddings in either 2D or 3D scatter plots using dimensionality reduction techniques (PCA, t-SNE or UMAP) and clustering in a scatter plot.
  • 15. Image Embedding Example 15 source: https://www.researchgate.net/publication/282181243_Learning_Visual_Clothing_Style_with_Heterogeneous_Dyadic_Co-Occurrences “Visualization of a 2D embedding of the style space trained with strategic sampling computed with t-SNE. The embedding is based on 200,000 images from the test set. For a clear visual representation, we discretize the style space into a grid and pick one image from each grid cell at random.”
  • 16. CLIP (Contrastive Language-Image Pre-Training) 16 Source: Paper: https://arxiv.org/pdf/2103.00020.pdf Code: https://github.com/OpenAI/CLIP High-Level Architecture
  • 17. Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center The LLaVA Papers 17
  • 18. LLaVA 1.0 – Large Language and Vision Assistant 18 • https://arxiv.org/abs/2304.08485 • https://arxiv.org/pdf/2304.08485.pdf • https://llava-vl.github.io/ • https://github.com/haotian-liu/LLaVA Instruction tuning large language models (LLMs) using machine- generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodel chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions and yields a 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available. Abstract
  • 20. LLaVA 1.0 Architecture 20 Tokenizing + Embedding Tokenizing + Embedding
  • 22. 22 • https://arxiv.org/abs/2310.03744 • https://arxiv.org/pdf/2310.03744.pdf • https://huggingface.co/liuhaotian/llava -v1.5-13b Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross- modal connector in LLaVA is surprisingly powerful and data- efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic- task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of- the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available. Abstract LLaVA (1.5) – Large Language and Vision Assistant
  • 23. 23 LLaVA 1.5 Changes Modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks.
  • 24. LLaVA 1.5 24 Training Data Sources Performance
  • 26. LLaVA 1.6 26 LLaVA-1.6-34B outperforms Gemini Pro on several benchmarks https://llava-vl.github.io/blog/2024-01-30-llava-1-6/ Benchmarks Example
  • 27. 27 • https://arxiv.org/abs/2306.00890 • https://arxiv.org/pdf/2306.00890.pdf Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigationsfocus on unimodal text. Multimodalconversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophisticationin understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-followingdata from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically,the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-followingdata, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, fine-tuning LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-followingdata and the LLaVA-Med model. Abstract LLaVA-Med https://github.com/microsoft/LLaVA-Med https://huggingface.co/microsoft/llava-med-7b-delta
  • 30. Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center My LLaVA based Image Classifier Experiment 30 Full details, results and code: https://github.com/robert-mcdermott/LLM-Image-Classification
  • 31. 31 LLaVA 1.5: Handwritten Digit Classification Abstract https://github.com/robert-mcdermott/LLM-Image-Classification
  • 32. 32 LLaVA 1.5: Animal Classification https://github.com/robert-mcdermott/LLM-Image-Classification
  • 33. 33 LLaVA 1.5: Chess Piece Identification https://github.com/robert-mcdermott/LLM-Image-Classification
  • 34. Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Live Demo Time 34
  • 35. Thank you Robert McDermott (he/him) Director: Solutions, Engineering & Architecture (SEA) [email protected]