Chonkie is a lightweight text-chunking library designed specifically for large language model (LLM) applications, offering an efficient solution for splitting and managing text. The library follows a minimal-dependency design philosophy, which makes it well suited to real-world natural language processing pipelines. This article covers Chonkie's core features, design philosophy, and its five main chunking strategies.
Chonkie's core idea is to simplify document chunking so that developers can focus on their core business logic rather than low-level text-processing details.
Overview of Text Chunking
Text chunking is the process of breaking a large text document into smaller, more manageable pieces that can be used effectively in retrieval-augmented generation (RAG) applications and LLM processing. In modern NLP systems, chunking is a key preprocessing step that directly affects downstream task performance.
Criteria for an Ideal Chunking Strategy
A good chunker, and the chunks it produces, should satisfy three core criteria: reconstructability, independence, and sufficiency. A minimal check for the first criterion is sketched after the definitions below.
Reconstructability means the chunks must preserve their logical relationship to the original text: combining all chunks should fully reproduce the original content. The chunker must not lose or alter information while splitting.
Independence requires each chunk to be a self-contained semantic unit that expresses a complete concept or idea on its own. Extracting information from a single chunk should not leave out information that was important in the original text.
Sufficiency ensures each chunk carries enough information to be genuinely useful, avoiding fragments so small that they carry no real meaning.
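As a quick illustration of the reconstructability criterion, the following minimal sketch (assuming the base-install TokenChunker with overlap disabled; with overlap enabled, plain concatenation would duplicate text) checks that the chunk texts reassemble into the source:

from chonkie import TokenChunker

# Minimal reconstructability check: with zero overlap, concatenating the
# chunk texts should reproduce the original document exactly.
text = "Chunking splits a long document into smaller, self-contained pieces."
chunker = TokenChunker(chunk_size=8, chunk_overlap=0)  # tiny size, illustration only
chunks = chunker.chunk(text)
assert "".join(chunk.text for chunk in chunks) == text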
Why Chunk Documents?
Feeding an entire document to a large language model as a single input runs into several limitations.
The first is the context window. Every LLM has a fixed context length that caps how many tokens the model can process in a single inference call. Chunking breaks an overly long document into token-sized pieces the model can actually handle.
The second is computational efficiency. In practice, loading and processing a very large document (say, on the order of 100 GB) on every query is unrealistic in both compute and time. Attention mechanisms remain expensive even when optimized, and chunking markedly improves both compute and memory efficiency.
Better semantic representation is another key factor. As noted above, well-formed chunks represent distinct concepts and ideas as independent semantic units. An unchunked document can cause the model to conflate concepts. Because representation models typically perform lossy compression, keeping chunks concise helps the model understand the context more accurately.
Finally, reducing hallucination is an important consideration. Feeding too much context at once makes models more likely to hallucinate, i.e., to draw on irrelevant information when answering a query. Smaller, more focused chunks effectively lower this risk.
Chonkie's Technical Advantages
Chonkie applies several optimizations in its design and implementation that yield significant performance gains.
The library uses a pipelined processing architecture: documents flow through a pipeline that applies stronger heuristics to the chunking process. This design speeds up processing substantially without compromising chunk quality.
Caching and precomputation are another important feature. Chonkie caches intermediate chunking results to avoid repeated computation, delivering faster processing while preserving chunk quality.
Chonkie implements a smart token estimate-validate feedback loop: it estimates where a chunk boundary should fall, then verifies the estimate against an actual token count. This keeps chunk sizes close to optimal while sidestepping some of the efficiency bottlenecks of conventional tokenizers.
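Chonkie does not expose this loop publicly; purely as an illustration of the idea (not the library's actual internals), the following sketch guesses a character cut-point from an average characters-per-token ratio and then corrects it against a real token count:

import tiktoken

# Illustrative sketch only -- not Chonkie's internal code.
enc = tiktoken.get_encoding("gpt2")

def estimate_validate_split(text, chunk_size, chars_per_token=4.0):
    """Return a prefix of `text` that fits within `chunk_size` tokens."""
    cut = min(len(text), int(chunk_size * chars_per_token))  # cheap estimate
    while cut > 0:
        n_tokens = len(enc.encode(text[:cut]))  # validate with a real tokenizer
        if n_tokens <= chunk_size:
            return text[:cut]
        cut = int(cut * chunk_size / n_tokens)  # feedback: shrink proportionally
    return ""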
For tokenization, Chonkie defaults to the high-performance tiktoken tokenizer, which is considerably faster than typical defaults. Because tiktoken only supports certain model types, Chonkie uses the AutoTikTokenizer wrapper to extend full support to all Hugging Face models.
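As a rough usage sketch (assuming the standalone autotiktokenizer package, installable with pip install autotiktokenizer), the wrapper mirrors Hugging Face's AutoTokenizer interface while returning a tiktoken-style encoder:

from autotiktokenizer import AutoTikTokenizer

# Load the vocabulary of any Hugging Face model, but count tokens at tiktoken speed.
tokenizer = AutoTikTokenizer.from_pretrained("gpt2")
token_ids = tokenizer.encode("Chonkie keeps token counting fast.")
print(len(token_ids))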
For semantic similarity, Chonkie defaults to Model2Vec static embeddings: precomputed vectors stored in a lookup table, which removes the cost of running an embedding model at query time and makes embedding computation extremely fast.
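A minimal sketch of what this looks like with the model2vec package directly (pip install model2vec), using the same model the semantic chunkers default to later in this article:

from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-base-8M")
# encode() here is a table lookup plus pooling -- no transformer forward pass.
embeddings = model.encode(["Donkeys are sturdy animals.", "Pigs have curly tails."])
print(embeddings.shape)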
In addition, Chonkie supports parallel processing: chunking multiple documents in parallel makes better use of available compute and further increases throughput without affecting chunk quality.
Environment Setup and Installation
To avoid conflicts with an existing Python environment, it is advisable to create a dedicated virtual environment for Chonkie:
python3 -m venv chonkie-tutorial
source chonkie-tutorial/bin/activate
pip install chonkie[all]
Chonkie offers flexible installation options, letting you install only the feature modules you need:
# Basic install (TokenChunker, SentenceChunker, RecursiveChunker)
# pip install chonkie
# For Hugging Face Hub support
# pip install "chonkie[hub]"
# For visualization support (e.g., rich text output)
# pip install "chonkie[viz]"
# For the default semantic provider (includes Model2Vec)
# pip install "chonkie[semantic]"
# For OpenAI embeddings
# pip install "chonkie[openai]"
# For Cohere embeddings
# pip install "chonkie[cohere]"
# For Jina embeddings
# pip install "chonkie[jina]"
# For SentenceTransformer embeddings (required by LateChunker)
# pip install "chonkie[st]"
# For CodeChunker support
# pip install "chonkie[code]"
# For NeuralChunker support (BERT-based)
# pip install "chonkie[neural]"
# For SlumberChunker support (Genie/LLM interface)
# pip install "chonkie[genie]"
# To install several extras at once
# pip install "chonkie[st, code, genie]"
Core Chunking Strategies in Detail
Chonkie provides five main chunking strategies, each optimized for particular use cases: semantic chunking, SDPM chunking, late chunking, neural chunking, and recursive chunking.
Semantic Chunking
Semantic chunking processes text based on semantic similarity, keeping semantically related content within the same chunk. This approach is especially well suited to RAG applications where preserving context is critical.
The following example shows how the semantic chunker handles a text containing two distinct topics:
from chonkie import SemanticChunker

# Basic initialization with default parameters
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # default model
    threshold=0.5,        # similarity threshold: (0-1), (1-100), or "auto"
    chunk_size=20,        # maximum tokens per chunk
    min_sentences=1       # initial sentences per chunk
)

text = """First paragraph about Donkeys.
Why donkey are colored black and white.
This paragraph is about Pigs.
Pigs are pink and have curly tails."""

chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")

Output:
Chunk text: First paragraph about Donkeys.
Why donkey are colored black and white.
Token count: 15
Number of sentences: 2
Chunk text: This paragraph is about Pigs.
Pigs are pink and have curly tails.
Token count: 14
Number of sentences: 2
SDPM Chunker
The SDPM chunker is an enhanced version of semantic chunking that splits text using Semantic Double-Pass Merging (SDPM), providing better context preservation.
The method first groups content by semantic similarity, then merges similar groups within a skip window, allowing it to connect related content that may not be contiguous in the text. This technique is particularly useful for documents with recurring themes or concepts scattered throughout.
from chonkie import SDPMChunker

# Basic initialization with default parameters
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",  # default model
    threshold=0.5,        # similarity threshold (0-1)
    chunk_size=50,        # maximum tokens per chunk
    min_sentences=1,      # initial sentences per chunk
    skip_window=1         # number of chunks to skip when looking for similarity
)

text = """The neural network processes input data through layers.
Training data is essential for model performance.
GPUs accelerate neural network computations significantly.
Quality training data improves model accuracy.
TPUs provide specialized hardware for deep learning.
Data preprocessing is a crucial step in training."""

chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")

Output:
Chunk text: The neural network processes input data through layers.
Training data is essential for model performance.
GPUs accelerate neural network computations significantly.
Quality training data improves model accuracy.
TPUs provide specialized hardware for deep learning.
Token count: 42
Number of sentences: 5
Chunk text: Data preprocessing is a crucial step in training.
Token count: 12
Number of sentences: 1
Late Chunker
The late chunker implements the strategy described in the late-chunking paper. It builds on the RecursiveChunker and uses document-level embeddings to create semantically richer chunk representations.
Unlike the conventional approach of embedding each chunk independently, the late chunker first encodes the entire text in one pass to obtain token-level embeddings, then splits the text using recursive rules and derives each chunk's embedding by averaging the corresponding portion of the full-document embeddings. Each chunk thereby carries broader contextual information, which improves retrieval in RAG systems.
# !pip install numpy==1.26.4
# !pip install tf-keras
from chonkie import LateChunker, RecursiveRules

chunker = LateChunker(
    embedding_model="all-MiniLM-L6-v2",
    chunk_size=10,
    rules=RecursiveRules(),
    min_characters_per_chunk=24,
)

text = """First paragraph about a specific topic.
Second paragraph continuing the same topic.
Third paragraph switching to a different topic.
Fourth paragraph expanding on the new topic."""

chunks = chunker(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
Output:
Chunk text: First paragraph about a specific topic.
Token count: 8
Chunk text: Second paragraph continuing the same topic.
Token count: 7
Chunk text: Third paragraph switching to a different topic.
Token count: 8
Chunk text: Fourth paragraph expanding on the new topic.
Token count: 9
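To make the mechanism concrete, here is a simplified, illustrative sketch (not Chonkie's internal code) of the pooling step using sentence-transformers directly: embed the full text once to obtain token embeddings, then mean-pool the token vectors whose character spans fall inside each chunk:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
text = "First topic sentence. Second topic sentence."
chunk_spans = [(0, 21), (22, 44)]  # (start, end) character offsets of two chunks

# One forward pass over the whole document yields per-token embeddings.
token_embs = model.encode(text, output_value="token_embeddings")  # (num_tokens, dim)
offsets = model.tokenizer(text, return_offsets_mapping=True)["offset_mapping"]

chunk_vectors = []
for start, end in chunk_spans:
    # Pick the tokens whose character span overlaps this chunk's span.
    idx = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    chunk_vectors.append(token_embs[idx].mean(dim=0))  # mean-pool -> chunk embedding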
Neural Chunker
The neural chunker uses a deep learning model to detect semantic shift points in text, splitting the document wherever the topic or context changes significantly. It relies on a fine-tuned BERT model and produces highly coherent chunks, making it particularly well suited to RAG applications.
The following example shows how the neural chunker handles a long text containing several topic transitions:
from chonkie import NeuralChunker

# Basic initialization with default parameters
chunker = NeuralChunker(
    model="mirth/chonky_modernbert_base_1",  # default model
    device_map="cuda",                       # device to run the model on ('cpu', 'cuda', etc.)
    min_characters_per_chunk=10,             # minimum characters per chunk
    return_type="chunks"                     # output type
)
text="""Limited context window - All LLMs have a limit on how much text they can process at once. This is referred to as the Context Window. Chunking helps in breaking down the large text document into processable tokens
Computational Efficiency - It is not possible to load a 100GB document every time you make a query. Attention mechanisms, even when optimized, are computationally expensive O(n). Chunking keeps things efficient and memory-friendly.
Better Representation - As mentioned earlier, chunks represent each idea as an independent entity. Not chunking your document will likely cause your model to conflate concepts and get confused.
Representation models use lossy compression, so keeping chunks concise ensures the model understands the context better.
Reduced Hallucination - Feeding too much context at once causes the models to hallucinate. They start using irrelevant information to answer queries, and that's a big no-no. Smaller, focused chunks reduce this risk."""
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")
Output:
Chunk text: Limited context window - All LLMs have a limit on how much text they can process at once. This is referred to as the Context Window. Chunking helps in breaking down the large text document into processable tokens
Token count: 48
Start index: 0
End index: 213
Chunk text: Computational Efficiency - It is not possible to load a 100GB document every time you make a query. Attention mechanisms, even when optimized, are computationally expensive O(n). Chunking keeps things efficient and memory-friendly.
Token count: 51
Start index: 213
End index: 445
Chunk text: Better Representation - As mentioned earlier, chunks represent each idea as an independent entity. Not chunking your document will likely cause your model to conflate concepts and get confused.
Representation models use lossy compression, so keeping chunks concise ensures the model understands the context better.
Token count: 60
Start index: 445
End index: 761
Chunk text: Reduced Hallucination - Feeding too much context at once causes the models to hallucinate. They start using irrelevant information to answer queries, and that's a big no-no. Smaller, focused chunks reduce this risk.
Token count: 49
Start index: 761
End index: 976
Among the available strategies, the neural chunker delivers the highest efficiency, and it supports CUDA acceleration for fast chunking.
Recursive Chunker
The recursive chunker splits documents recursively using a predefined set of RecursiveRules, and also supports custom rules. A rule set describes exactly how chunks should be created, for example which delimiters to split on. Chonkie ships predefined rule recipes for different text types, including Markdown and language-specific delimiters (a sketch of recipes and custom rules follows the initialization example below).
The recursive chunker is particularly well suited to long, well-structured documents such as books or research papers.
from chonkie import RecursiveChunker, RecursiveRules

chunker = RecursiveChunker(
    tokenizer_or_token_counter="gpt2",
    chunk_size=12,
    rules=RecursiveRules(),
    min_characters_per_chunk=24,
    return_type="chunks",
)
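As a hedged sketch of the recipe and custom-rule options mentioned above (the from_recipe call needs pip install "chonkie[hub]", and exact parameter names may vary across Chonkie versions):

from chonkie import RecursiveChunker, RecursiveRules, RecursiveLevel

# Load a predefined recipe, e.g. Markdown-aware delimiters.
md_chunker = RecursiveChunker.from_recipe("markdown", lang="en")

# Or define custom rules: split on blank lines first, then on sentence punctuation.
rules = RecursiveRules(levels=[
    RecursiveLevel(delimiters=["\n\n"]),
    RecursiveLevel(delimiters=[". ", "! ", "? "]),
])
custom_chunker = RecursiveChunker(chunk_size=64, rules=rules)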
For every chunking strategy, Chonkie offers three ways to process documents: single-text chunking, batch chunking, and a callable interface.
Single-Text Recursive Chunking
Single-text mode passes one document to the chunker for processing:
text="""This is the first sentence. This is the second sentence.
And here's a third one with some additional context."""
chunks=chunker.chunk(text)
forchunkinchunks:
print(f"Chunk text: {chunk.text}")
print(f"Token count: {chunk.token_count}")
Output:
Chunk text: This is the first sentence.
Token count: 7
Chunk text: This is the second sentence.
Token count: 8
Chunk text: And here's a third one with some additional context.
Token count: 12
Batch Recursive Chunking
When several documents need to be processed at once, use batch chunking:
texts = [
    "This is the first sentence. This is the second sentence. And here's a third one with some additional context.",
    "This is the first sentence. This is the second sentence. And here's a third one with some additional context.",
]

chunks = chunker.chunk_batch(texts)

for chk in chunks:
    for chunk in chk:
        print(f"Chunk text: {chunk.text}")
        print(f"Token count: {chunk.token_count}")
Output:
🦛 choooooooooooooooooooonk 100% • 2/2 docs chunked [00:00<00:00, 11.96doc/s] 🌱
Chunk text: This is the first sentence.
Token count: 7
Chunk text: This is the second sentence.
Token count: 7
Chunk text: And here's a third one with some additional context.
Token count: 11
Chunk text: This is the first sentence.
Token count: 7
Chunk text: This is the second sentence.
Token count: 7
Chunk text: And here's a third one with some additional context.
Token count: 11
Callable-Interface Recursive Chunking
Chonkie also supports chunking through a callable interface:
# Single text
chunks = chunker("This is the first sentence. This is the second sentence.")

# Multiple texts
batch_chunks = chunker(["Text 1. More text.", "Text 2. More."])
Summary
Chonkie is a purpose-built text-chunking library that offers a comprehensive, efficient solution for LLM applications. With its range of chunking strategies and optimized processing architecture, developers can pick the method best suited to their use case and improve the performance and efficiency of the whole NLP pipeline. Its design philosophy and technical implementation make it a solid piece of infrastructure for modern natural language processing applications.
Source
https://avoid.overfit.cn/post/8867d3eb734f4e71935718082240797f
Author: Mayur Jain