Hands-On Guide to Multimodal RAG Systems
Dipanjan (DJ)

Multimodal RAG System
Option 1: Use multimodal embeddings (such as CLIP) to embed images and text
together. Retrieve both using similarity search, but simply link to images in a
docstore. Pass raw images and text chunks to a multimodal LLM for answer synthesis.
Option 2: Use a multimodal LLM (such as GPT-4o, GPT-4V, LLaVA) to produce text
summaries from images. Embed and retrieve the text summaries using a text
embedding model. Again, reference raw text chunks or tables from a docstore for
answer synthesis by a regular LLM; in this case, we exclude images from the
docstore.
Option 3: Use a multimodal LLM (such as GPT-4o, GPT-4V, LLaVA) to produce text,
table, and image summaries (text chunk summaries are optional). Embed and
retrieve the text, table, and image summaries with references to the raw elements, as
in Option 1. Again, raw images, tables, and text chunks are passed to a multimodal
LLM for answer synthesis.
Option 3 works best, especially when your documents contain charts stored as images;
otherwise, you can also generate multimodal embeddings from combined text and images, as in Option 1.
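For reference, Option 1's joint text-image embedding step can be prototyped with CLIP. The sketch below is a minimal illustration, assuming the Hugging Face transformers and Pillow packages; the checkpoint and helper names are illustrative, not part of the original guide.

```python
# Minimal sketch of Option 1: embed text and images into CLIP's shared space.
# The checkpoint name and helper functions are assumptions for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_texts(texts):
    """Embed text chunks so they can be compared against image embeddings."""
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return model.get_text_features(**inputs)

def embed_images(image_paths):
    """Embed images into the same vector space for cross-modal retrieval."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs)
```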
We will first use a document parsing tool like Unstructured to extract the text, table and
image elements separately
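A minimal extraction sketch with Unstructured is shown below; the file path is a placeholder, and the parameter names follow recent versions of the unstructured library, so they may differ in older releases.

```python
# Sketch: extract text, table, and image elements from a PDF with Unstructured.
# "report.pdf" is a placeholder; parameters reflect recent unstructured releases.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="report.pdf",
    strategy="hi_res",                    # layout model needed for tables and images
    infer_table_structure=True,           # keeps an HTML rendering of each table
    extract_image_block_types=["Image"],  # crop detected images
    extract_image_block_to_payload=True,  # store crops as base64 in element metadata
)

texts, tables, images = [], [], []
for el in elements:
    if el.category == "Table":
        tables.append(el.metadata.text_as_html)
    elif el.category == "Image":
        images.append(el.metadata.image_base64)
    elif el.text:
        texts.append(el.text)
```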
Then we will pass each extracted element into a multimodal LLM and generate a
detailed text summary, as depicted above.
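A hedged sketch of this summarization step, assuming LangChain's OpenAI integration and access to GPT-4o (the prompt wording is illustrative):

```python
# Sketch: summarize extracted elements with GPT-4o via LangChain.
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o", temperature=0)

def summarize_text_or_table(element: str) -> str:
    """Summarize a text chunk or an HTML table for retrieval."""
    prompt = (
        "Summarize the following document element concisely for retrieval. "
        "Preserve key facts, figures, and entities.\n\n" + element
    )
    return llm.invoke(prompt).content

def summarize_image(image_b64: str) -> str:
    """Describe an image (e.g. a chart) in detail so the description can be embedded."""
    message = HumanMessage(content=[
        {"type": "text",
         "text": "Describe this image in detail, including any data in charts or tables."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
    ])
    return llm.invoke([message]).content
```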
Next, we will store the summaries and their embeddings in a vector database, using any
popular embedding model such as OpenAI's embedding models. We will also store the
corresponding raw document element (text, table, image) for each summary in a document
store, which can be any database platform like Redis.
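One way to set up the two stores, sketched here with Chroma as the vector database and an in-memory docstore for simplicity (a Redis-backed store can be swapped in for persistence); the collection and model names are illustrative:

```python
# Sketch: vector DB for summary embeddings, docstore for raw elements.
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.storage import InMemoryStore  # swap for a Redis-backed store in production

vectorstore = Chroma(
    collection_name="multimodal_rag_summaries",  # illustrative name
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)
docstore = InMemoryStore()   # holds the raw text, table HTML, and base64 images
id_key = "doc_id"            # metadata key linking summaries to raw elements
```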
The multi-vector retriever links each summary and its embedding to the original
document’s raw element (text, table, image) using a common document identifier (doc_id).
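Continuing the sketch above, LangChain's MultiVectorRetriever performs this linking; the index_elements helper below is a hypothetical convenience function:

```python
# Sketch: link summaries (vector DB) to raw elements (docstore) via a shared doc_id.
import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_core.documents import Document

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,   # defined in the previous sketch
    docstore=docstore,
    id_key=id_key,
)

def index_elements(raw_elements, summaries):
    """Add summaries to the vector store and raw elements to the docstore,
    tied together by the same doc_id."""
    doc_ids = [str(uuid.uuid4()) for _ in raw_elements]
    summary_docs = [
        Document(page_content=s, metadata={id_key: doc_ids[i]})
        for i, s in enumerate(summaries)
    ]
    retriever.vectorstore.add_documents(summary_docs)
    retriever.docstore.mset(list(zip(doc_ids, raw_elements)))

# Index each modality separately, e.g.:
# index_elements(texts, text_summaries)
# index_elements(tables, table_summaries)
# index_elements(images, image_summaries)
```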
Now, when a user question comes in, the multi-vector retriever first retrieves the
summaries that are most similar to the question, and then, using the common doc_ids,
returns the original text, table, and image elements, which are passed on to the RAG
system's LLM as the context for answering the user question.
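At query time, the retriever handles both hops (summary similarity search, then the doc_id lookup). A brief continuation of the sketch, with a hypothetical query:

```python
# Sketch: retrieval returns the raw elements, not the summaries.
question = "What trend does the quarterly revenue chart show?"  # hypothetical query
raw_context = retriever.invoke(question)
# The retriever matches summaries by embedding similarity, follows their doc_ids
# into the docstore, and returns the original text, table HTML, and base64 images.
```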
Load all documents and use a document parser like unstructured.io to extract text
chunks, images, and tables.
If necessary, convert HTML tables to Markdown; Markdown tables often work very well
with LLMs.
Pass each text chunk, image, and table into a multimodal LLM like GPT-4o and get a
detailed summary.
Store the summaries in a vector DB and the raw document pieces in a document DB like
Redis.
Connect the two databases with a common document_id using a multi-vector retriever
to identify which summary maps to which raw document piece.
Connect this multi-vector retrieval system with a multimodal LLM like GPT-4o.
Query the system and, based on the summaries most similar to the query, fetch the raw
document pieces, including tables and images, as the context.
Using the above context, generate a response to the question with the multimodal
LLM.
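Putting the last few steps together, a hedged end-to-end sketch that builds on the retriever from the earlier sketches; the base64 check is a crude, hypothetical heuristic for separating image payloads from text and tables:

```python
# Sketch: answer a question with retrieved text, tables, and images via GPT-4o.
import base64
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o", temperature=0)

def looks_like_base64_image(item: str) -> bool:
    """Crude heuristic: long strings whose prefix decodes as base64 are treated as images."""
    try:
        base64.b64decode(item[:64], validate=True)
        return len(item) > 1000
    except Exception:
        return False

def answer(question: str) -> str:
    raw_context = retriever.invoke(question)   # retriever from the earlier sketch
    text_parts, image_parts = [], []
    for el in raw_context:
        content = el if isinstance(el, str) else getattr(el, "page_content", str(el))
        if looks_like_base64_image(content):
            image_parts.append({"type": "image_url",
                                "image_url": {"url": f"data:image/jpeg;base64,{content}"}})
        else:
            text_parts.append(content)
    message = HumanMessage(content=[
        {"type": "text",
         "text": ("Answer the question using only the context below.\n\n"
                  f"Question: {question}\n\nContext:\n" + "\n\n".join(text_parts))},
        *image_parts,
    ])
    return llm.invoke([message]).content

print(answer("Summarize the key trend shown in the revenue chart."))
```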