Deploying a Multimodal RAG System Using vLLM and Milvus _ by Zilliz _ Nov, 2024 _ Medium
Deploying a Multimodal RAG System Using vLLM and Milvus _ by Zilliz _ Nov, 2024 _ Medium
Get unlimited access to the best of Medium for less than $1/week. Become a member
Zilliz · Following
7 min read · 1 day ago
Imagine you’ve spent months fine-tuning your AI application around a specific LLM
through an API provider. Then, out of the blue, you receive an email: “We’re
deprecating the model you’re using in favor of our new version.” Sound familiar?
While cloud API providers offer the convenience of powerful, ready-to-use AI
capabilities, relying solely on them also introduces several significant risks:
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 1/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
Limited Insight: There’s often limited visibility into performance and usage
patterns.
Privacy Concerns: Data privacy can be a critical issue, especially when handling
sensitive information.
So, what’s the solution? How can you take back control? How can you mitigate these
risks while enhancing your system’s capabilities? The answer lies in building a more
robust, independent system using open-source solutions.
This blog will guide you through creating a Multimodal RAG with Milvus and vLLM.
By leveraging the power of an open-source vector database combined open-source
LLM inference, you can design a system capable of processing and understanding
multiple types of data — text, images, audio, and even videos. This approach not
only puts you in complete control of the technology but also ensures a system that’s
both powerful and versatile, surpassing traditional text-based solutions.
3. We use OpenAI CLIP to encode the images into embeddings that can then be
searched with Milvus
4. We use the Mistral Embedding model to encode the text into embeddings.
6. Generate responses using Pixtral running with vLLM, leveraging both visual and
textual understanding
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 2/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
By the end of this tutorial, you’ll have developed a flexible, scalable system entirely
under your control-no more worrying about API deprecations or unexpected
changes.
What is Milvus?
Milvus is an open-source, high-performance, and highly scalable vector database
that can store, index, and search billion-scale unstructured data through high-
dimensional vector embeddings. It is perfect for building modern AI applications
such as retrieval augmented generation (RAG), semantic search, multimodal search,
and recommendation systems. Milvus runs efficiently across various environments,
from laptops to large-scale distributed systems.
What is vLLM?
The core idea of vLLM (Virtual Large Language Model) is to optimize the serving
and execution of LLMs by utilizing efficient memory management techniques. Here
are the key aspects:
Dynamic Batching: vLLM adapts batch sizes and sequences based on the
memory and compute capabilities of the underlying hardware. This dynamic
adjustment enhances processing throughput and minimizes latency during
model inference.
Efficient Resource Utilization: vLLM optimizes the use of critical resources such
as CPUs, GPUs, and memory. This efficiency allows the system to support larger
models and handle increased numbers of simultaneous requests, which is
essential in production environments where both scalability and performance
are key.
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 3/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
ensures that developers can easily deploy and manage large language models
across a range of applications without extensive reconfiguration.
vLLM is the inference library we will use for the inference and serving of the
Pixtral multimodal model.
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 4/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
Getting Started
First, let’s install our dependencies:
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 5/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
import os
import base64
import json
from pathlib import Path
from dotenv import load_dotenv
from llama_index.core import Settings
from llama_index.embeddings.mistralai import MistralAIEmbedding
# Save transcription
with open(os.path.join(output_folder, "output_text.txt"), "w")
as file:
file.write(text_data)
os.remove(output_audio_path)
return {"Author": "Example Author", "Title": "Example Title",
"Views": "1000000"}
image_store = MilvusVectorStore(
uri="milvus_local.db",
collection_name="image_collection",
overwrite=True,
dim=512
)
storage_context = StorageContext.from_defaults(
vector_store=text_store,
image_store=image_store
)
2. Process the query with Pixtral using both text and images
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 7/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
base_url=os.getenv("KOYEB_ENDPOINT"),
api_key=os.getenv("KOYEB_TOKEN")
)
qa_tmpl_str = """
Given the provided information, including relevant images and
retrieved context
from the video, accurately and precisely answer the query
without any
additional prior knowledge.
---------------------
Context: {context_str}
Metadata: {metadata_str}
---------------------
Query: {query_str}
Answer: """
completion = client.chat.completions.create(
model="mistralai/Pixtral-12B-2409",
messages=messages,
max_tokens=300
)
return completion.choices[0].message.content
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 8/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
def main():
st.title("MultiModal RAG with Pixtral & Milvus")
# Video input
video_path = st.text_input("Enter video path:")
if st.session_state.index:
st.subheader("Chat with the Video")
query = st.text_input("Ask a question about the video:")
if query:
with st.spinner("Generating response..."):
# Generate and display response
[... query processing code ...]
if __name__ == "__main__":
main()
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 9/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
Figure: The interface of your multimodal RAG app built with Milvus and Pixtral
From now on, you can interact with the video and for example, learn more about
the the Gaussian Distribution.
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 10/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 11/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
Engineering
Following
Written by Zilliz
322 Followers · 14 Following
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 12/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
Zilliz
May 29 5 1
Zilliz
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 13/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
Jan 8 120
Zilliz
Zilliz
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 14/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
Sep 10
Rohan Ahir
Nov 3 20
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 15/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
ully
6d ago 7
Lists
Leadership
61 stories · 485 saves
Leadership upgrades
7 stories · 109 saves
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 16/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
Shrinivasan Sankar
Nov 6 3
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 17/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
6d ago 595 4
Nov 4 544 5
Byte-Sized AI Blog
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 18/19
2024/11/14 晚上11:42 Deploying a Multimodal RAG System Using vLLM and Milvus | by Zilliz | Nov, 2024 | Medium
Oct 8 100
https://medium.com/@zilliz_learn/deploying-a-multimodal-rag-system-using-vllm-and-milvus-033482a1cb0d 19/19