Preface
In the past few days, Alibaba quietly released two new models in the Qwen3 family: Qwen3-Embedding and Qwen3-Reranker (each available in three sizes: a lightweight 0.6B, a balanced 4B, and a high-performance 8B). Both are trained on the Qwen3 base model and inherit its strong multilingual capabilities, supporting 119 languages covering mainstream natural languages as well as programming languages.
I took a quick look at the numbers and feedback on Hugging Face, and a few points are worth sharing:
- Qwen3-Embedding-8B scores 70.58 on the MTEB multilingual leaderboard, ahead of star models such as BGE, E5, and even Google Gemini.
- Qwen3-Reranker-8B scores 69.02 on multilingual ranking tasks and 77.45 on Chinese, placing it at the top of existing open-source rerankers.
- Text vectors live in a single shared semantic space, so a Chinese query can directly hit English results, which makes the models well suited to intelligent search or customer-service systems in globalized scenarios.
This means the two models are not merely "decent for open source"; they fully match, and in places surpass, mainstream commercial APIs. For RAG retrieval, cross-lingual search, and code search, especially in Chinese contexts, they are ready for production use.

So how do you use them to build a RAG system? This post provides an in-depth tutorial.
01 RAG Tutorial (Qwen3-Embedding-0.6B + Qwen3-Reranker-0.6B)

Tutorial highlights: a hands-on walkthrough of building a RAG system with Qwen3's newly released embedding and reranker models; the two-stage retrieval design (recall + rerank) balances efficiency and precision!
Environment setup
! pip install --upgrade pymilvus openai requests tqdm sentence-transformers transformers
Note: this requires transformers>=4.51.0 and sentence-transformers>=2.7.0.
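To confirm the installed versions meet these requirements, you can run a quick check (a minimal sketch; the version floors are the ones stated above):

import transformers
import sentence_transformers
from packaging import version

# Verify the minimum versions required by the Qwen3 models
assert version.parse(transformers.__version__) >= version.parse("4.51.0")
assert version.parse(sentence_transformers.__version__) >= version.parse("2.7.0")
print(transformers.__version__, sentence_transformers.__version__)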
In this example we use OpenAI as the large language model for text generation, so you need to provide your API key OPENAI_API_KEY as an environment variable for the LLM.
import os
os.environ["OPENAI_API_KEY"] = "sk-************"
Data preparation

We use the FAQ pages from the Milvus 2.4.x documentation as the private knowledge in our RAG, which is a good data source for building a basic RAG pipeline.

Download the zip file and extract the docs into the folder milvus_docs:
! wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
! unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs
We load all markdown files from the folder milvus_docs/en/faq. For each document, simply splitting the content on "# " roughly separates the main sections of the markdown file.
from glob import glob

text_lines = []
for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as file:
        file_text = file.read()
    text_lines += file_text.split("# ")
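As a quick sanity check, you can print the number of chunks produced; for this FAQ set it should match the 72 entries shown in the insert log later in the tutorial:

print(f"{len(text_lines)} text chunks loaded")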
Prepare the LLM and the embedding model

This example uses Qwen3-Embedding-0.6B for text embeddings and Qwen3-Reranker-0.6B to rerank the retrieved results.
from openai import OpenAI
from sentence_transformers import SentenceTransformer
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Initialize OpenAI client for LLM generation
openai_client = OpenAI()
# Load Qwen3-Embedding-0.6B model for text embeddings
embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
# Load Qwen3-Reranker-0.6B model for reranking
reranker_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-0.6B", padding_side='left')
reranker_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B").eval()
# Reranker configuration
token_false_id = reranker_tokenizer.convert_tokens_to_ids("no")
token_true_id = reranker_tokenizer.convert_tokens_to_ids("yes")
max_reranker_length = 8192
prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
prefix_tokens = reranker_tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = reranker_tokenizer.encode(suffix, add_special_tokens=False)
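If you have a GPU, you can optionally load the reranker in half precision to reduce memory use and latency. This variant is not required by the tutorial and assumes a CUDA device is available:

# Optional: load the reranker on GPU in fp16 (assumes CUDA is available)
reranker_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Reranker-0.6B", torch_dtype=torch.float16
).cuda().eval()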
Define a function that uses the Qwen3-Embedding-0.6B model to generate text embeddings. It will be used for both document embeddings and query embeddings.
def emb_text(text, is_query=False):
    """
    Generate text embeddings using Qwen3-Embedding-0.6B model.

    Args:
        text: Input text to embed
        is_query: Whether this is a query (True) or document (False)

    Returns:
        List of embedding values
    """
    if is_query:
        # For queries, use the "query" prompt for better retrieval performance
        embeddings = embedding_model.encode([text], prompt_name="query")
    else:
        # For documents, use default encoding
        embeddings = embedding_model.encode([text])
    return embeddings[0].tolist()
Define the reranking functions to improve retrieval quality. Together they implement the full reranking pipeline with Qwen3-Reranker, scoring and reordering candidate documents by their relevance to the query. The roles of the individual functions:
- format_instruction(): formats the query, document, and task instruction into the reranker's standard input format
- process_inputs(): tokenizes the formatted text and adds the special tokens the model uses for its judgment
- compute_logits(): uses the reranker to compute a relevance score (between 0 and 1) for each query-document pair
- rerank_documents(): reorders the documents by query relevance and returns them sorted by relevance score in descending order
def format_instruction(instruction, query, doc):
    """Format instruction for reranker input"""
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    output = "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(
        instruction=instruction, query=query, doc=doc
    )
    return output

def process_inputs(pairs):
    """Process inputs for reranker"""
    inputs = reranker_tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False,
        max_length=max_reranker_length - len(prefix_tokens) - len(suffix_tokens)
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = reranker_tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_reranker_length)
    for key in inputs:
        inputs[key] = inputs[key].to(reranker_model.device)
    return inputs

@torch.no_grad()
def compute_logits(inputs, **kwargs):
    """Compute relevance scores using reranker"""
    batch_scores = reranker_model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores

def rerank_documents(query, documents, task_instruction=None):
    """
    Rerank documents based on query relevance using Qwen3-Reranker

    Args:
        query: Search query
        documents: List of documents to rerank
        task_instruction: Task instruction for reranking

    Returns:
        List of (document, score) tuples sorted by relevance score
    """
    if task_instruction is None:
        task_instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    # Format inputs for reranker
    pairs = [format_instruction(task_instruction, query, doc) for doc in documents]
    # Process inputs and compute scores
    inputs = process_inputs(pairs)
    scores = compute_logits(inputs)
    # Combine documents with scores and sort by score (descending)
    doc_scores = list(zip(documents, scores))
    doc_scores.sort(key=lambda x: x[1], reverse=True)
    return doc_scores
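Here is a toy illustration of how rerank_documents behaves; the two documents are hypothetical and the exact scores will vary with hardware and model version:

# Toy usage example (hypothetical documents, for illustration only)
docs = [
    "Milvus stores inserted data in object storage as incremental logs.",
    "Paris is the capital of France.",
]
for doc, score in rerank_documents("How is data stored in milvus?", docs):
    print(f"{score:.4f}  {doc}")
# The Milvus passage should score far higher than the unrelated one.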
Generate a test embedding and print its dimensionality and first few elements.
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
Sample output:
1024
[-0.009923271834850311, -0.030248118564486504, -0.011494234204292297, -0.05980192497372627, -0.0026795873418450356, 0.016578301787376404, -0.04073038697242737, 0.03180320933461189, -0.024417787790298462, 2.1764861230622046e-05]
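The collection created below uses inner product (IP) as its similarity metric, which is equivalent to cosine similarity only when the embeddings are L2-normalized. A quick diagnostic (not part of the original tutorial) is to check the vector's norm:

import numpy as np

# A norm of ~1.0 means IP search behaves like cosine similarity
print(np.linalg.norm(test_embedding))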
Load the data into Milvus

Create the collection
from pymilvus import MilvusClient
milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"
About the MilvusClient arguments:
- Setting the URI to a local file (e.g. ./milvus.db) is the most convenient option, as it automatically uses Milvus Lite to store all data in that file.
- If you have large-scale data, you can set up a more capable Milvus server on Docker or Kubernetes. In that case, use the server's URI (e.g. http://localhost:19530) instead, as sketched below.
- If you want to use Zilliz Cloud (the fully managed cloud service for Milvus), adjust the URI and token to match your Public Endpoint and API key in Zilliz Cloud.
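For reference, the latter two options look roughly like this; the endpoint and token below are placeholders you would replace with your own:

# Self-hosted Milvus server (e.g. started via Docker)
milvus_client = MilvusClient(uri="http://localhost:19530")

# Zilliz Cloud (placeholder endpoint and API key)
milvus_client = MilvusClient(
    uri="https://your-endpoint.zillizcloud.com",
    token="your-api-key",
)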
Check whether the collection already exists and drop it if it does.

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)
Create a new collection with the specified parameters.

If no field information is specified, Milvus automatically creates a default id field as the primary key and a vector field to store the vector data. A reserved JSON field is used to store fields not defined in the schema, along with their values.
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)
Insert data

Iterate over the lines of text, create the embeddings, and insert the data into Milvus.

Note the new field text below, which is not defined in the collection schema. Milvus automatically creates a corresponding text field for it (under the hood it is handled by the reserved dynamic JSON field; you don't need to care about the implementation).
from tqdm import tqdm

data = []
for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

milvus_client.insert(collection_name=collection_name, data=data)
Sample output:
Creating embeddings: 100%|██████████████████████████████████████████████████████████████████████████| 72/72 [00:08<00:00, 8.68it/s]
{'insert_count': 72, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'cost': 0}
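Embedding one chunk per call is simple but slow. sentence-transformers can encode the whole list in batches; here is a sketch of an equivalent, faster insert (run instead of the loop above, not in addition to it; batch_size is a tunable assumption):

# Batched alternative: encode all chunks in one call
vectors = embedding_model.encode(text_lines, batch_size=32, show_progress_bar=True)
data = [
    {"id": i, "vector": vec.tolist(), "text": line}
    for i, (line, vec) in enumerate(zip(text_lines, vectors))
]
milvus_client.insert(collection_name=collection_name, data=data)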
Enhancing RAG with reranking

Retrieve data

Let's specify a frequently asked question about Milvus.
question = "How is data stored in milvus?"
Search the collection for the question and retrieve the top 10 semantically closest candidates, then use the reranker to select the best 3 matches.
# Step 1: Initial retrieval with larger candidate set
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[
        emb_text(question, is_query=True)
    ],  # Use the `emb_text` function with query prompt to convert the question to an embedding vector
    limit=10,  # Return top 10 candidates for reranking
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

# Step 2: Extract candidate documents for reranking
candidate_docs = [res["entity"]["text"] for res in search_res[0]]

# Step 3: Rerank documents using Qwen3-Reranker
print("Reranking documents...")
reranked_docs = rerank_documents(question, candidate_docs)

# Step 4: Select top 3 reranked documents
top_reranked_docs = reranked_docs[:3]
print(f"Selected top {len(top_reranked_docs)} documents after reranking")
Let's take a look at the reranked results for this query!
import json

# Display reranked results with reranker scores
reranked_lines_with_scores = [
    (doc, score) for doc, score in top_reranked_docs
]
print("Reranked results:")
print(json.dumps(reranked_lines_with_scores, indent=4))

# Also show original embedding-based results for comparison
print("\n" + "=" * 80)
print("Original embedding-based results (top 3):")
original_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0][:3]
]
print(json.dumps(original_lines_with_distances, indent=4))
Sample output:

From the results we can see that Qwen3-Reranker's reranking is clearly effective, with good separation between the relevance scores:
Reranked results (top 3):
[
    [
        " Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###",
        0.9997891783714294
    ],
    [
        "How does Milvus flush data?\n\nMilvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the disk. Then Milvus' data node writes the data in the message queue to persistent storage as incremental logs. If `flush()` is called, the data node is forced to write all data in the message queue to persistent storage immediately.\n\n###",
        0.9989748001098633
    ],
    [
        "Does the query perform in memory? What are incremental data and historical data?\n\nYes. When a query request comes, Milvus searches both incremental data and historical data by loading them into memory. Incremental data are in the growing segments, which are buffered in memory before they reach the threshold to be persisted in storage engine, while historical data are from the sealed segments that are stored in the object storage. Incremental data and historical data together constitute the whole dataset to search.\n\n###",
        0.9984032511711121
    ]
]
================================================================================
Original embedding-based results (top 3):
[
    [
        " Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###",
        0.8306853175163269
    ],
    [
        "How does Milvus flush data?\n\nMilvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the disk. Then Milvus' data node writes the data in the message queue to persistent storage as incremental logs. If `flush()` is called, the data node is forced to write all data in the message queue to persistent storage immediately.\n\n###",
        0.7302717566490173
    ],
    [
        "How does Milvus handle vector data types and precision?\n\nMilvus supports Binary, Float32, Float16, and BFloat16 vector types.\n\n- Binary vectors: Store binary data as sequences of 0s and 1s, used in image processing and information retrieval.\n- Float32 vectors: Default storage with a precision of about 7 decimal digits. Even Float64 values are stored with Float32 precision, leading to potential precision loss upon retrieval.\n- Float16 and BFloat16 vectors: Offer reduced precision and memory usage. Float16 is suitable for applications with limited bandwidth and storage, while BFloat16 balances range and efficiency, commonly used in deep learning to reduce computational requirements without significantly impacting accuracy.\n\n###",
        0.7003671526908875
    ]
]
Build the retrieval-augmented generation (RAG) response with a large language model (LLM)

Convert the retrieved documents into a string.
context = "\n".join(
    [doc for doc, _score in top_reranked_docs]
)
Provide the system prompt and user prompt for the LLM. The user prompt is assembled from the documents retrieved from Milvus.
SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
Use OpenAI's gpt-4o model to generate a response based on the prompts.
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)
Sample output:
In Milvus, data is stored in two main forms: inserted data and metadata. Inserted data, which includes vector data, scalar data, and collection-specific schema, is stored in persistent storage as incremental logs. Milvus supports multiple object storage backends for this purpose, including MinIO, AWS S3, Google Cloud Storage, Azure Blob Storage, Alibaba Cloud OSS, and Tencent Cloud Object Storage. Metadata for Milvus is generated by its various modules and stored in etcd.
02 Summary

As the tutorial and its outputs show, the embedding and reranker models the Qwen team shipped in the Qwen3 series perform remarkably well. Used together, they give RAG systems a fairly complete and practical solution.

In terms of design, the embedding model handles queries and documents differently, reflecting a solid understanding of retrieval tasks; the reranker uses a cross-encoder architecture that captures fine-grained query-document interactions; and the two-stage retrieval design in this tutorial (recall + rerank) balances efficiency and precision. Notably, Qwen3-Embedding-0.6B (1024 dimensions) and Qwen3-Reranker-0.6B both use relatively lightweight parameter counts and support local deployment, reducing dependence on external APIs and keeping hardware requirements low without sacrificing performance, which suits small businesses and individual developers.

In fact, Qwen3 shipping embedding and reranker models is neither an isolated case nor a coincidence; it reflects an industry consensus.

The reason is simple: these two components determine whether a large model can be turned into a product.

The biggest problems with generative LLMs are high uncertainty, difficult evaluation, and heavy cost.

Solving them, whether through RAG, LLM memory, or agents, ultimately rests on one prerequisite: can semantics be compressed into vector representations that machines can retrieve and judge efficiently?

Embedding and ranking are currently the best path to that goal: clear standards, measurable performance, controllable cost, and easy gradual rollout. Embedding determines whether you can find it; ranking determines whether you can pick the right one. That makes them among the first API modules to be commoditized: high call frequency (every retrieval needs them), high switching cost (they are bound to the index), and high commercial value (they can serve as underlying infrastructure).