
Deploying a Multimodal RAG System Using vLLM and Milvus

Zilliz · 7 min read
Imagine you’ve spent months fine-tuning your AI application around a specific LLM
through an API provider. Then, out of the blue, you receive an email: “We’re
deprecating the model you’re using in favor of our new version.” Sound familiar?
While cloud API providers offer the convenience of powerful, ready-to-use AI
capabilities, relying solely on them also introduces several significant risks:

Lack of Control: You have no control over model versions or updates.


Unpredictability: You may face sudden changes in model behavior or capabilities.

Limited Insight: There is often limited visibility into performance and usage patterns.

Privacy Concerns: Data privacy can be a critical issue, especially when handling
sensitive information.

So, what’s the solution? How can you take back control? How can you mitigate these
risks while enhancing your system’s capabilities? The answer lies in building a more
robust, independent system using open-source solutions.

This blog will guide you through creating a multimodal RAG system with Milvus and vLLM.
By combining the power of an open-source vector database with open-source LLM
inference, you can design a system capable of processing and understanding multiple
types of data: text, images, audio, and even video. This approach not only puts you in
complete control of the technology but also gives you a system that is both powerful
and versatile, going beyond traditional text-only solutions.

What We’ll Build: a Multimodal RAG Fully Under Your Control


We’re going to build a multimodal RAG system using Milvus and vLLM, illustrating
how you can self-host your LLM and gain full control over your AI applications. Our
tutorial will guide you through creating a Streamlit application that demonstrates
the power of integrating multiple data types. Here’s what we’ll cover:

1. Process video input by extracting frames and transcribing the audio

2. Store and efficiently index multimodal data using Milvus

3. Encode the extracted frames into embeddings with OpenAI CLIP so they can be searched with Milvus (a brief embedding sketch follows this list)

4. Encode the transcribed text into embeddings with the Mistral embedding model

5. Retrieve relevant context based on user queries using Milvus

6. Generate responses using Pixtral running with vLLM, leveraging both visual and
textual understanding
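To make steps 3 and 4 concrete, here is a minimal sketch of calling the two embedding models directly through their LlamaIndex wrappers. The frame path is a placeholder, and in the full application the multimodal index makes these calls for us:

from llama_index.embeddings.clip import ClipEmbedding
from llama_index.embeddings.mistralai import MistralAIEmbedding

# CLIP (ViT-B/32 by default) turns an image frame into a 512-dimensional vector
image_embed_model = ClipEmbedding()
image_vector = image_embed_model.get_image_embedding("output_frames/frame0001.png")

# mistral-embed turns text into a 1024-dimensional vector
text_embed_model = MistralAIEmbedding("mistral-embed", api_key="<MISTRAL_API_KEY>")
text_vector = text_embed_model.get_text_embedding("A short transcript snippet")

print(len(image_vector), len(text_vector))  # 512, 1024

These dimensions are the reason the two Milvus collections created later use dim=512 and dim=1024.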


By the end of this tutorial, you’ll have developed a flexible, scalable system entirely
under your control, with no more worrying about API deprecations or unexpected changes.

What is Milvus?
Milvus is an open-source, high-performance, and highly scalable vector database
that can store, index, and search billion-scale unstructured data through high-
dimensional vector embeddings. It is perfect for building modern AI applications
such as retrieval augmented generation (RAG), semantic search, multimodal search,
and recommendation systems. Milvus runs efficiently across various environments,
from laptops to large-scale distributed systems.
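The tutorial below accesses Milvus through its LlamaIndex integration, but a minimal, standalone example of the underlying client helps illustrate that flexibility. The collection name and dimension here are placeholders; pointing the client at a local file uses the embedded Milvus Lite engine, while a server URI would target a standalone or distributed deployment:

from pymilvus import MilvusClient

# A local file path starts Milvus Lite; a URI such as "http://localhost:19530"
# would connect to a full Milvus server instead.
client = MilvusClient("milvus_local.db")

client.create_collection(collection_name="demo_collection", dimension=512)
client.insert(
    collection_name="demo_collection",
    data=[{"id": 0, "vector": [0.1] * 512, "caption": "a sample frame"}],
)

results = client.search(
    collection_name="demo_collection",
    data=[[0.1] * 512],
    limit=3,
    output_fields=["caption"],
)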
What is vLLM?
The core idea of vLLM (Virtual Large Language Model) is to optimize the serving
and execution of LLMs by utilizing efficient memory management techniques. Here
are the key aspects:

Optimized Memory Management: vLLM implements advanced memory allocation and
management techniques, most notably PagedAttention for the KV cache, to fully leverage
available hardware resources. This optimization helps run large language models
efficiently, preventing memory bottlenecks that can hinder performance.

Dynamic Batching: vLLM adapts batch sizes and sequences based on the
memory and compute capabilities of the underlying hardware. This dynamic
adjustment enhances processing throughput and minimizes latency during
model inference.

Modular Design: The architecture of vLLM is modular, facilitating straightforward
integration with various hardware accelerators. This modularity also enables easy
scaling across multiple devices or clusters, making it highly adaptable to different
deployment scenarios.

Efficient Resource Utilization: vLLM optimizes the use of critical resources such
as CPUs, GPUs, and memory. This efficiency allows the system to support larger
models and handle increased numbers of simultaneous requests, which is
essential in production environments where both scalability and performance
are key.

Seamless Integration: Designed to integrate smoothly with existing machine learning
frameworks and libraries, vLLM provides a user-friendly interface. This ensures that
developers can easily deploy and manage large language models across a range of
applications without extensive reconfiguration.
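In practice, the easiest way to consume these capabilities is vLLM's OpenAI-compatible server. As a rough sketch (the exact flags depend on your vLLM version, GPU memory, and context-length needs), serving the Pixtral model used later in this tutorial looks something like:

# Launch an OpenAI-compatible endpoint for Pixtral; flags may vary by vLLM release
vllm serve mistralai/Pixtral-12B-2409 --tokenizer-mode mistral --max-model-len 16384

Any OpenAI-compatible client, including the OpenAI Python SDK used later in this post, can then talk to that endpoint, which is exactly how the Koyeb deployment is consumed below.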

Core Components of Our Multimodal RAG

The multimodal RAG app we are building comprises the following key components:

vLLM is the library we will use for inference and serving of the Pixtral multimodal model.

Koyeb provides the infrastructure layer for our deployment, offering a serverless
platform specialized for AI workloads. With native vLLM integration and automated GPU
resource management, it makes it easy to deploy LLMs while maintaining production-grade
performance and scalability.

Pixtral from Mistral AI acts as our multimodal brain, combining a 400M-parameter vision
encoder with a 12B-parameter multimodal decoder. This architecture allows it to process
both images and text within the same context window.

Milvus provides the vector storage foundation, efficiently managing embeddings from
different modalities. Its ability to handle multiple vector types and perform fast
similarity search makes it perfect for multimodal applications.


Figure: The multimodal RAG architecture

Getting Started
First, let’s install our dependencies:

# Core LlamaIndex packages
pip install -U llama-index-vector-stores-milvus llama-index-multi-modal-llms-mistralai llama-index-embeddings-mistralai llama-index-multi-modal-llms-openai llama-index-embeddings-clip llama_index

# Video and audio processing
pip install moviepy pytube pydub SpeechRecognition openai-whisper ffmpeg-python soundfile

# Image processing and visualization
pip install torch torchvision matplotlib scikit-image git+https://github.com/openai/CLIP.git

# Utilities and infrastructure
pip install pymilvus streamlit ftfy regex tqdm

Setting Up the Environment


We’ll start by configuring our environment and importing necessary libraries:


import os
import base64
import json
from pathlib import Path
from dotenv import load_dotenv
from llama_index.core import Settings
from llama_index.embeddings.mistralai import MistralAIEmbedding

# Load environment variables
load_dotenv()

# Configure default embedding model
Settings.embed_model = MistralAIEmbedding(
    "mistral-embed",
    api_key=os.getenv("MISTRAL_API_KEY")
)

Video Processing Pipeline


The heart of our system is the video processing pipeline, which transforms raw
video content into data that our RAG system can understand and process efficiently.

def process_video(video_path: str, output_folder: str, output_audio_path: str) -> dict:
    # Create output directory if it doesn't exist
    Path(output_folder).mkdir(parents=True, exist_ok=True)

    # Extract frames from video
    video_to_images(video_path, output_folder)

    # Extract and transcribe audio
    video_to_audio(video_path, output_audio_path)
    text_data = audio_to_text(output_audio_path)

    # Save transcription
    with open(os.path.join(output_folder, "output_text.txt"), "w") as file:
        file.write(text_data)

    os.remove(output_audio_path)
    return {"Author": "Example Author", "Title": "Example Title", "Views": "1000000"}

This pipeline breaks down videos into:

Image frames (extracted at 0.2 FPS)

Audio transcription using Whisper



Metadata about the video

Building the Vector Index


We use Milvus to store our multimodal embeddings. Here’s how we create our
index:

from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore

def create_index(output_folder: str):
    # Create different collections for text and images
    text_store = MilvusVectorStore(
        uri="milvus_local.db",
        collection_name="text_collection",
        overwrite=True,
        dim=1024  # mistral-embed produces 1024-dimensional vectors
    )

    image_store = MilvusVectorStore(
        uri="milvus_local.db",
        collection_name="image_collection",
        overwrite=True,
        dim=512  # CLIP produces 512-dimensional vectors
    )

    storage_context = StorageContext.from_defaults(
        vector_store=text_store,
        image_store=image_store
    )

    # Load and index documents
    documents = SimpleDirectoryReader(output_folder).load_data()
    return MultiModalVectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context
    )
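Before handing anything to Pixtral, we need to pull the relevant transcript text and frames back out of this index. A minimal sketch of that retrieval step, assuming a top-k of 5 for both modalities (the exact values are a tuning choice):

from llama_index.core.schema import ImageNode

index = create_index("output_frames")
retriever_engine = index.as_retriever(similarity_top_k=5, image_similarity_top_k=5)

def retrieve_context(query: str):
    # Separate retrieved results into transcript text and image frames
    results = retriever_engine.retrieve(query)
    texts = [n.node.text for n in results if not isinstance(n.node, ImageNode)]
    images = [n.node for n in results if isinstance(n.node, ImageNode)]
    return "\n".join(texts), images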

Query Processing with Pixtral


When a user asks a question, we need to:

1. Retrieve relevant context from our vector store

2. Process the query with Pixtral using both text and images

Here’s our query processing function:

from openai import OpenAI

def process_query_with_image(query_str, context_str, metadata_str, image_document):
    # The Koyeb deployment exposes an OpenAI-compatible endpoint served by vLLM
    client = OpenAI(
        base_url=os.getenv("KOYEB_ENDPOINT"),
        api_key=os.getenv("KOYEB_TOKEN")
    )

    with open(image_document.image_path, "rb") as image_file:
        image_base64 = base64.b64encode(image_file.read()).decode("utf-8")

    qa_tmpl_str = """
    Given the provided information, including relevant images and retrieved context
    from the video, accurately and precisely answer the query without any
    additional prior knowledge.
    ---------------------
    Context: {context_str}
    Metadata: {metadata_str}
    ---------------------
    Query: {query_str}
    Answer: """

    # Prepare messages for Pixtral
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": qa_tmpl_str.format(
                        context_str=context_str,
                        query_str=query_str,
                        metadata_str=metadata_str
                    )
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_base64}"
                    }
                },
            ],
        }
    ]

    completion = client.chat.completions.create(
        model="mistralai/Pixtral-12B-2409",
        messages=messages,
        max_tokens=300
    )
    return completion.choices[0].message.content
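To connect retrieval and generation, the retrieved context and the best-matching frame are passed into this function. A minimal sketch, assuming the retrieve_context helper from the retrieval section above, the metadata dict returned by process_video, placeholder file paths, and that at least one retrieved frame has its image_path populated:

metadata = process_video("input_video.mp4", "output_frames", "output_audio.wav")

query = "What does the video say about the Gaussian distribution?"
context_str, image_nodes = retrieve_context(query)

answer = process_query_with_image(
    query_str=query,
    context_str=context_str,
    metadata_str=json.dumps(metadata),
    image_document=image_nodes[0]  # best-matching retrieved frame
)
print(answer)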

Building the Streamlit Interface


Finally, we create a user-friendly interface with Streamlit:

import streamlit as st

def main():
    st.title("MultiModal RAG with Pixtral & Milvus")

    # Initialize session state
    if 'index' not in st.session_state:
        st.session_state.index = None
        st.session_state.retriever_engine = None
        st.session_state.metadata = None

    # Video input
    video_path = st.text_input("Enter video path:")

    if video_path and not st.session_state.index:
        with st.spinner("Processing video..."):
            # Process video and create index
            [... processing code ...]

    if st.session_state.index:
        st.subheader("Chat with the Video")
        query = st.text_input("Ask a question about the video:")

        if query:
            with st.spinner("Generating response..."):
                # Generate and display response
                [... query processing code ...]

if __name__ == "__main__":
    main()

Running the Application


Before starting the application, ensure you have:

1. Set up your environment variables in a .env file (see the example below)

2. Installed all required dependencies
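The code above reads three environment variables. A minimal .env along these lines should work; the values shown are placeholders, and KOYEB_ENDPOINT should point at the OpenAI-compatible base URL of your vLLM deployment on Koyeb:

MISTRAL_API_KEY=your-mistral-api-key
KOYEB_ENDPOINT=https://<your-koyeb-app>.koyeb.app/v1
KOYEB_TOKEN=your-koyeb-api-token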

Then launch the application:

streamlit run app.py

You’ll see the homepage where you can:

Upload videos for processing


Ask questions about the content of the video

Read the responses from Pixtral with relevant video frames

Figure: The interface of your multimodal RAG app built with Milvus and Pixtral

From here, you can interact with the video and, for example, learn more about the
Gaussian distribution.


Figure: Performing the multimodal search


Conclusion
In this blog post, we’ve demonstrated how to build a powerful multimodal RAG
system using Milvus, Pixtral, and vLLM. Through the combination of Milvus’s
efficient vector storage capabilities and Pixtral’s advanced multimodal
understanding, we’ve created a system that can process, understand, and respond to
queries about video content, and the entire stack remains fully under your control.

We’d Love to Hear What You Think!


If you like this blog post, please consider:

⭐ Giving us a star on GitHub


💬 Joining our Milvus Discord community to share your experiences
🔍 Exploring our Bootcamp repository for more examples of multimodal
applications with Milvus
