
Hands-on Guide to Multimodal RAG Systems

Dipanjan (DJ)


Multimodal RAG System

Traditional RAG systems are constrained to text data, making them ineffective for multimodal data, which includes text, images, tables, and more.
Multimodal RAG systems integrate multimodal data processing (text, images, tables) and use multimodal LLMs, such as GPT-4o, to provide more contextual and accurate answers.
This guide walks through building a multimodal RAG system with LangChain, integrating intelligent document loaders, vector databases, and multi-vector retrievers.

Source: A Comprehensive Guide to Building Multimodal RAG Systems


Multimodal Datasets

Multimodal data consists of a mixture of text, tables, images, graphs, and optionally audio and video.
The idea is to detect, parse, and extract these different elements separately, and then generate downstream artifacts such as embeddings (a minimal extraction sketch follows below).

Source: A Comprehensive Guide to Building Multimodal RAG Systems
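
A rough sketch of the extraction step using the open-source unstructured library is shown below. The filename report.pdf is a placeholder, and the exact parameter names can vary between unstructured releases, so treat this as an illustrative sketch rather than a definitive recipe.

```python
# Sketch: extract text, table, and image elements from a PDF with `unstructured`.
# Assumes a local "report.pdf" and a recent unstructured[pdf] install;
# parameter names may differ slightly across versions.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="report.pdf",
    strategy="hi_res",                    # layout-aware parsing for tables and images
    infer_table_structure=True,           # keep an HTML rendering of each table
    extract_image_block_types=["Image"],
    extract_image_block_to_payload=True,  # store images as base64 in element metadata
)

# Split the parsed elements into the three buckets used later in the guide.
texts, tables, images = [], [], []
for el in elements:
    kind = type(el).__name__
    if kind == "Table":
        tables.append(el.metadata.text_as_html)   # HTML version of the table
    elif kind == "Image":
        images.append(el.metadata.image_base64)   # base64-encoded image payload
    else:
        texts.append(el.text)
```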


Multimodal RAG Workflow

Option 1: Use multimodal embeddings (such as CLIP) to embed images and text together. Retrieve both using similarity search, but simply link to images in a docstore. Pass raw images and text chunks to a multimodal LLM for answer synthesis.
Option 2: Use a multimodal LLM (such as GPT-4o, GPT-4V, or LLaVA) to produce text summaries from images. Embed and retrieve the text summaries using a text embedding model. Again, reference raw text chunks or tables from a docstore for answer synthesis by a regular LLM; in this case, we exclude images from the docstore.
Option 3: Use a multimodal LLM (such as GPT-4o, GPT-4V, or LLaVA) to produce text, table, and image summaries (text chunk summaries are optional). Embed and retrieve the text, table, and image summaries with references to the raw elements, as in Option 1. Again, raw images, tables, and text chunks are passed to a multimodal LLM for answer synthesis.

Option 3 works best, especially if you have charts as images; otherwise, you can also generate multimodal embeddings from combined text and image content (see the image summarization sketch below).

Source: A Comprehensive Guide to Building Multimodal RAG Systems
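
For Options 2 and 3, the sketch below shows one way to produce a text summary for a single base64-encoded image with GPT-4o via LangChain. The helper name summarize_image and the prompt wording are illustrative, and an OPENAI_API_KEY is assumed to be set in the environment.

```python
# Sketch: summarize a base64-encoded image with a multimodal LLM (Options 2 and 3).
# Assumes OPENAI_API_KEY is set and langchain-openai is installed.
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o", temperature=0)


def summarize_image(image_b64: str) -> str:
    """Ask the multimodal LLM for a retrieval-friendly description of the image."""
    msg = HumanMessage(content=[
        {"type": "text",
         "text": ("Describe this image in detail, including any chart or table data, "
                  "so the description can be used for retrieval.")},
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
    ])
    return llm.invoke([msg]).content


# Example: one summary per extracted image from the parsing step.
# image_summaries = [summarize_image(img) for img in images]
```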


MultiVector Retriever
Workflow

We first use a document parsing tool like Unstructured to extract the text, table, and image elements separately.
Then we pass each extracted element into an LLM and generate a detailed text summary, as depicted above.
Next, we store the summaries and their embeddings in a vector database using any popular embedding model, such as the OpenAI embedding models. We also store the corresponding raw document element (text, table, or image) for each summary in a document store, which can be any database platform, such as Redis.
The multi-vector retriever links each summary and its embedding to the original document's raw element (text, table, or image) using a common document identifier (doc_id).
When a user question comes in, the multi-vector retriever first retrieves the summaries most similar to the question; then, using the common doc_ids, it returns the original text, table, and image elements, which are passed on to the RAG system's LLM as the context to answer the question (a minimal wiring sketch follows below).

Source: A Comprehensive Guide to Building Multimodal RAG Systems
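
A minimal sketch of this wiring with LangChain's MultiVectorRetriever is shown below. It assumes the lists summaries and raw_elements (of equal length) from the earlier steps, and uses Chroma plus an in-memory docstore for brevity; a Redis-backed store would follow the same pattern.

```python
# Sketch: link summaries (vector DB) to raw elements (docstore) via a shared doc_id.
# Assumes `summaries` and `raw_elements` from earlier steps, plus langchain,
# langchain-openai, and langchain-chroma installed.
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore          # a Redis store could be used instead
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

id_key = "doc_id"
vectorstore = Chroma(collection_name="mm_rag_summaries",
                     embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()

retriever = MultiVectorRetriever(
    vectorstore=vectorstore, docstore=docstore, id_key=id_key
)

# One shared doc_id per (summary, raw element) pair.
doc_ids = [str(uuid.uuid4()) for _ in raw_elements]
summary_docs = [
    Document(page_content=summary, metadata={id_key: doc_ids[i]})
    for i, summary in enumerate(summaries)
]

retriever.vectorstore.add_documents(summary_docs)          # embed and index the summaries
retriever.docstore.mset(list(zip(doc_ids, raw_elements)))  # keep raw text/tables/images

# Retrieval returns the raw elements whose summaries best match the question.
docs = retriever.invoke("What does the revenue chart show?")  # hypothetical query
```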


Multimodal RAG Architecture

Load all documents and use a document loader like unstructured.io to extract text chunks, images, and tables.
If necessary, convert HTML tables to Markdown; Markdown tables are often very effective with LLMs.
Pass each text chunk, image, and table into a multimodal LLM like GPT-4o and get a detailed summary.
Store the summaries in a vector DB and the raw document pieces in a document DB like Redis.
Connect the two databases with a common document_id using a multi-vector retriever to identify which summary maps to which raw document piece.
Connect this multi-vector retrieval system to a multimodal LLM like GPT-4o.
Query the system; based on the summaries most similar to the query, retrieve the raw document pieces, including tables and images, as the context.
Using that context, generate a response to the question with the multimodal LLM (an end-to-end answer synthesis sketch follows below).

Source: A Comprehensive Guide to Building Multimodal RAG Systems
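
To close the loop, the sketch below shows one way to do the final answer synthesis: retrieve the raw pieces, split them into text/tables and base64 images, and send both to GPT-4o in a single multimodal prompt. It reuses the retriever and llm objects from the earlier sketches; the helper looks_like_base64_image is a crude, purely illustrative heuristic.

```python
# Sketch: answer a question using retrieved raw elements (text, tables, images)
# as multimodal context. Assumes `retriever` and `llm` from the earlier sketches.
import base64

from langchain_core.messages import HumanMessage


def looks_like_base64_image(s: str) -> bool:
    """Crude heuristic to separate base64 image payloads from text/HTML strings."""
    try:
        base64.b64decode(s[:64], validate=True)
        return len(s) > 1000  # long, decodable strings are treated as image payloads
    except Exception:
        return False


def answer(question: str) -> str:
    raw_docs = retriever.invoke(question)
    texts, images_b64 = [], []
    for d in raw_docs:
        content = d if isinstance(d, str) else getattr(d, "page_content", str(d))
        (images_b64 if looks_like_base64_image(content) else texts).append(content)

    # One multimodal prompt: the question, the text/table context, then the images.
    parts = [{"type": "text",
              "text": ("Answer the question using only the context below.\n"
                       f"Question: {question}\n\nContext:\n" + "\n\n".join(texts))}]
    for img in images_b64:
        parts.append({"type": "image_url",
                      "image_url": {"url": f"data:image/jpeg;base64,{img}"}})

    return llm.invoke([HumanMessage(content=parts)]).content


# Example (hypothetical question):
# print(answer("Summarize the key findings, including trends shown in the charts."))
```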


Hands-on Guide

Check out the HANDS-ON GUIDE here.
