Anthropic-cookbook:Skills:Contextual-embeddings:Guide.ipynb at Main · Anthropics
In traditional RAG, documents are typically split into smaller chunks for efficient
retrieval. While this approach works well for many applications, it can lead to
problems when individual chunks lack sufficient context. Contextual
Embeddings solve this problem by adding relevant context to each chunk before
embedding. This method improves the quality of each embedded chunk,
allowing for more accurate retrieval and thus better overall performance.
Averaged across all data sources we tested, Contextual Embeddings reduced
the top-20-chunk retrieval failure rate by 35%.
The same chunk-specific context can also be used with BM25 search to further
improve retrieval performance. We introduce this technique in the “Contextual
BM25” section.
In this guide, we'll demonstrate how to build and optimize a Contextual Retrieval
system using a dataset of 9 codebases as our knowledge base. We'll walk
through:
1. Basic RAG: setting up a baseline retrieval pipeline to measure performance against.
2. Contextual Embeddings: what it is, why it works, and how prompt caching
makes it practical for production use cases.
Additional Notes:
Prompt caching is helpful in managing costs when using this retrieval method.
This feature is currently available on Anthropic's 1P API, and is coming soon to
our 3P partner environments in AWS Bedrock and GCP Vertex. We know that
many of our customers leverage AWS Knowledge Bases and GCP Vertex AI APIs
when building RAG solutions, and this method can be used on either platform
with a bit of customization. Consider reaching out to Anthropic or your
AWS/GCP account team for guidance on this!
To make it easier to use this method on Bedrock, the AWS team has provided us
with code that you can use to implement a Lambda function that adds context
to each document. If you deploy this Lambda function, you can select it as a
custom chunking option when configuring a Bedrock Knowledge Base. You can
find this code in contextual-rag-lambda-function . The main lambda
function code is in lambda_function.py .
Table of Contents
1. Setup
2. Basic RAG
3. Contextual Embeddings
4. Contextual BM25
5. Reranking
Setup
We'll need a few libraries, including anthropic, voyageai, cohere, elasticsearch, numpy, and tqdm.
You'll also need API keys from Anthropic, Voyage AI, and Cohere.
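If you're starting from a fresh environment, here is a minimal install sketch (these are just the packages imported in the cells below; pin versions to match your setup):

%pip install anthropic voyageai cohere elasticsearch numpy tqdm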
In [2]: import os
import anthropic

client = anthropic.Anthropic(
    # This is the default and can be omitted
    api_key=os.getenv("ANTHROPIC_API_KEY"),
)
In [4]: import os
import pickle
import json
import numpy as np
import voyageai
from typing import List, Dict, Any
from tqdm import tqdm
class VectorDB:
    def __init__(self, name: str, api_key=None):
        if api_key is None:
            api_key = os.getenv("VOYAGE_API_KEY")
        self.client = voyageai.Client(api_key=api_key)
        self.name = name
        self.embeddings = []
        self.metadata = []
        self.query_cache = {}
        self.db_path = f"./data/{name}/vector_db.pkl"

    def load_data(self, dataset: List[Dict[str, Any]]):
        texts_to_embed = []
        metadata = []
        total_chunks = sum(len(doc['chunks']) for doc in dataset)

        # Collect every chunk (and its metadata) across all documents
        with tqdm(total=total_chunks, desc="Processing chunks") as pbar:
            for doc in dataset:
                for chunk in doc['chunks']:
                    texts_to_embed.append(chunk['content'])
                    metadata.append({
                        'doc_id': doc['doc_id'],
                        'original_uuid': doc['original_uuid'],
                        'chunk_id': chunk['chunk_id'],
                        'original_index': chunk['original_index'],
                        'content': chunk['content']
                    })
                    pbar.update(1)

        self._embed_and_store(texts_to_embed, metadata)
        self.save_db()

    def _embed_and_store(self, texts: List[str], data: List[Dict[str, Any]]):
        batch_size = 128
        result = [
            self.client.embed(
                texts[i : i + batch_size],
                model="voyage-2"
            ).embeddings
            for i in range(0, len(texts), batch_size)
        ]
        self.embeddings = [embedding for batch in result for embedding in batch]
        self.metadata = data

    def search(self, query: str, k: int = 20) -> List[Dict[str, Any]]:
        # Embed the query (with a small cache so repeated queries are free)
        if query in self.query_cache:
            query_embedding = self.query_cache[query]
        else:
            query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
            self.query_cache[query] = query_embedding

        if not self.embeddings:
            raise ValueError("No data loaded in the vector database.")

        # Similarity via dot product, then take the top k indices
        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[::-1][:k]

        top_results = []
        for idx in top_indices:
            result = {
                "metadata": self.metadata[idx],
                "similarity": float(similarities[idx]),
            }
            top_results.append(result)
        return top_results

    def save_db(self):
        data = {
            "embeddings": self.embeddings,
            "metadata": self.metadata,
            "query_cache": json.dumps(self.query_cache),
        }
        os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
        with open(self.db_path, "wb") as file:
            pickle.dump(data, file)

    def load_db(self):
        if not os.path.exists(self.db_path):
            raise ValueError("Vector database file not found. Use load_data to create a new database.")
        with open(self.db_path, "rb") as file:
            data = pickle.load(file)
        self.embeddings = data["embeddings"]
        self.metadata = data["metadata"]
        self.query_cache = json.loads(data["query_cache"])

    def validate_embedded_chunks(self):
        unique_contents = set()
        for meta in self.metadata:
            unique_contents.add(meta['content'])

        print("Validation results:")
        print(f"Total embedded chunks: {len(self.metadata)}")
        print(f"Unique embedded contents: {len(unique_contents)}")

        if len(self.metadata) != len(unique_contents):
            print("Warning: There may be duplicate chunks in the embedded data.")
        else:
            print("All embedded chunks are unique.")
Basic RAG
To get started, we'll set up a basic RAG pipeline using a bare-bones approach,
sometimes called 'Naive RAG' in the industry. A basic RAG pipeline includes the
following 3 steps:
1. Chunk each document into smaller pieces
2. Embed each chunk and store it in a vector database
3. At query time, embed the query and retrieve the most similar chunks
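As a minimal usage sketch of the VectorDB class defined above (the dataset path, database name, and query are illustrative assumptions; any dataset in the same chunked format will work):

# Load a chunked-codebase dataset and build the baseline vector index
with open('data/codebase_chunks.json', 'r') as f:   # hypothetical path for illustration
    dataset = json.load(f)

base_db = VectorDB("base_db")
base_db.load_data(dataset)

# Retrieve the top-k chunks for a sample query
results = base_db.search("How is the vector database persisted to disk?", k=5)
for r in results:
    print(round(r['similarity'], 3), r['metadata']['doc_id'], r['metadata']['chunk_id'])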
# Excerpt from the per-query retrieval evaluation loop: gather the golden
# chunk contents, then count how many appear in the top-k retrieved documents.
    golden_contents.append(golden_chunk['content'].strip())

if not golden_contents:
    print(f"Warning: No golden contents found for query: {query}")
    continue

# Count how many golden chunks are in the top k retrieved documents
chunks_found = 0
for golden_content in golden_contents:
    for doc in retrieved_docs[:k]:
        retrieved_content = doc['metadata'].get('original_content', doc['metadata'].get('content', '')).strip()
        if retrieved_content == golden_content:
            chunks_found += 1
            break
Contextual Embeddings
With basic RAG, each embedded chunk contains a potentially useful piece of
information, but these chunks lack context. With Contextual Embeddings, we add
more context to each text chunk before embedding it. Specifically, we use Claude
to create a concise context that situates the chunk within the overall document.
In the case of our codebases dataset, we can provide both the chunk and the full
file that the chunk was found within to an LLM, and have it produce the context.
Then, we combine this context and the raw text chunk into a single text block
prior to creating each embedding.
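Concretely, the string that gets embedded is just the raw chunk with its generated context appended. A tiny sketch (both strings are made up for illustration):

chunk_text = "def save_db(self):\n    ...\n    pickle.dump(data, file)"
generated_context = (
    "This chunk is from the VectorDB class and shows how embeddings, metadata, "
    "and the query cache are persisted to disk as a pickle file."
)

# This combined block, not the bare chunk, is what we embed
text_to_embed = f"{chunk_text}\n\n{generated_context}"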
Prompt caching also makes this much more cost effective. Creating contextual
embeddings requires us to pass the same document to the model for every
chunk we want to generate extra context for. With prompt caching, we can write
the overall document to the cache once, and then, because we're doing our
ingestion job all in sequence, we can simply read the document from the cache as
we generate context for each chunk within that document (the information you
write to the cache has a 5 minute time to live). This means that the first time we
pass a document to the model, we pay a bit more to write it to the cache, but for
each subsequent API call that contains that document, we receive a 90% discount
on all of the input tokens read from the cache. Assuming 800 token chunks, 8k
token documents, 50 token context instructions, and 100 tokens of context per
chunk, the cost to generate contextualized chunks is $1.02 per million document
tokens.
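As a rough check of that figure, here's a back-of-the-envelope sketch. The per-token prices are assumptions based on Claude 3 Haiku pricing at the time of writing ($0.25/MTok input, $1.25/MTok output, roughly $0.30/MTok cache writes, $0.03/MTok cache reads), and the result lands within a few cents of the $1.02 above depending on how the cache-write premium is counted:

# Per 8k-token document with 800-token chunks -> 10 chunks per document
doc_tokens, chunk_tokens, instruction_tokens, context_tokens = 8000, 800, 50, 100
chunks_per_doc = doc_tokens // chunk_tokens

cache_write = doc_tokens * 0.30 / 1e6                         # document written to the cache once
cache_read = doc_tokens * (chunks_per_doc - 1) * 0.03 / 1e6   # re-read from cache for the remaining chunks
fresh_input = chunks_per_doc * (chunk_tokens + instruction_tokens) * 0.25 / 1e6
output = chunks_per_doc * context_tokens * 1.25 / 1e6

cost_per_doc = cache_write + cache_read + fresh_input + output
print(f"~${cost_per_doc * 1e6 / doc_tokens:.2f} per million document tokens")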
When you load data into your ContextualVectorDB below, you'll see in logs just
how big this impact is.
Warning: some smaller embedding models have a fixed input token limit.
Contextualizing the chunk makes it longer, so if you notice much worse
performance from contextualized embeddings, the contextualized chunk is
likely getting truncated.
CHUNK_CONTEXT_PROMPT = """
Here is the chunk we want to situate within the whole document
<chunk>
{chunk_content}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.
"""
# Helper to read one JSON object per line from the evaluation set
def load_jsonl(file_path: str) -> List[Dict[str, Any]]:
    with open(file_path, 'r') as f:
        return [json.loads(line) for line in f]

jsonl_data = load_jsonl('data/evaluation_set.jsonl')

# Example usage
doc_content = jsonl_data[0]['golden_documents'][0]['content']
chunk_content = jsonl_data[0]['golden_chunks'][0]['content']
In [318]: import os
import pickle
import json
import numpy as np
import voyageai
from typing import List, Dict, Any
from tqdm import tqdm
import anthropic
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

class ContextualVectorDB:
    def __init__(self, name: str, voyage_api_key=None, anthropic_api_key=None):
        if voyage_api_key is None:
            voyage_api_key = os.getenv("VOYAGE_API_KEY")
        if anthropic_api_key is None:
            anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
        self.voyage_client = voyageai.Client(api_key=voyage_api_key)
        self.anthropic_client = anthropic.Anthropic(api_key=anthropic_api_key)
        self.name = name
        self.embeddings = []
        self.metadata = []
        self.query_cache = {}
        self.db_path = f"./data/{name}/contextual_vector_db.pkl"

        self.token_counts = {
            'input': 0,
            'output': 0,
            'cache_read': 0,
            'cache_creation': 0
        }
        self.token_lock = threading.Lock()

    def situate_context(self, doc: str, chunk: str):
        DOCUMENT_CONTEXT_PROMPT = """
        <document>
        {doc_content}
        </document>
        """

        CHUNK_CONTEXT_PROMPT = """
        Here is the chunk we want to situate within the whole document
        <chunk>
        {chunk_content}
        </chunk>
        Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk.
        Answer only with the succinct context and nothing else.
        """

        response = self.anthropic_client.beta.prompt_caching.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=1000,
            temperature=0.0,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": DOCUMENT_CONTEXT_PROMPT.format(doc_content=doc),
                            "cache_control": {"type": "ephemeral"}  # we will make use of prompt caching for the full documents
                        },
                        {
                            "type": "text",
                            "text": CHUNK_CONTEXT_PROMPT.format(chunk_content=chunk),
                        },
                    ]
                },
            ],
            extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
        )
        return response.content[0].text, response.usage

    def load_data(self, dataset: List[Dict[str, Any]], parallel_threads: int = 1):
        if os.path.exists(self.db_path):
            print("Loading vector database from disk.")
            self.load_db()
            return

        texts_to_embed = []
        metadata = []
        total_chunks = sum(len(doc['chunks']) for doc in dataset)

        def process_chunk(doc, chunk):
            # for each chunk, produce the context with Claude and track token usage
            contextualized_text, usage = self.situate_context(doc['content'], chunk['content'])
            with self.token_lock:
                self.token_counts['input'] += usage.input_tokens
                self.token_counts['output'] += usage.output_tokens
                self.token_counts['cache_read'] += usage.cache_read_input_tokens
                self.token_counts['cache_creation'] += usage.cache_creation_input_tokens

            return {
                # append the context to the original text chunk
                'text_to_embed': f"{chunk['content']}\n\n{contextualized_text}",
                'metadata': {
                    'doc_id': doc['doc_id'],
                    'original_uuid': doc['original_uuid'],
                    'chunk_id': chunk['chunk_id'],
                    'original_index': chunk['original_index'],
                    'original_content': chunk['content'],
                    'contextualized_content': contextualized_text
                }
            }

        print(f"Processing {total_chunks} chunks with {parallel_threads} threads")
        with ThreadPoolExecutor(max_workers=parallel_threads) as executor:
            futures = [
                executor.submit(process_chunk, doc, chunk)
                for doc in dataset
                for chunk in doc['chunks']
            ]
            for future in tqdm(as_completed(futures), total=total_chunks, desc="Processing chunks"):
                result = future.result()
                texts_to_embed.append(result['text_to_embed'])
                metadata.append(result['metadata'])

        self._embed_and_store(texts_to_embed, metadata)
        self.save_db()

        # these logs show how much of the document input was served from the prompt cache
        print(f"Total input tokens (uncached): {self.token_counts['input']}")
        print(f"Total output tokens: {self.token_counts['output']}")
        print(f"Total input tokens written to cache: {self.token_counts['cache_creation']}")
        print(f"Total input tokens read from cache: {self.token_counts['cache_read']}")

    # we use voyage AI here for embeddings. Read more here: https://docs.voyag
    def _embed_and_store(self, texts: List[str], data: List[Dict[str, Any]]):
        batch_size = 128
        result = [
            self.voyage_client.embed(
                texts[i : i + batch_size],
                model="voyage-2"
            ).embeddings
            for i in range(0, len(texts), batch_size)
        ]
        self.embeddings = [embedding for batch in result for embedding in batch]
        self.metadata = data

    def search(self, query: str, k: int = 20) -> List[Dict[str, Any]]:
        if query in self.query_cache:
            query_embedding = self.query_cache[query]
        else:
            query_embedding = self.voyage_client.embed([query], model="voyage-2").embeddings[0]
            self.query_cache[query] = query_embedding

        if not self.embeddings:
            raise ValueError("No data loaded in the vector database.")

        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[::-1][:k]

        top_results = []
        for idx in top_indices:
            result = {
                "metadata": self.metadata[idx],
                "similarity": float(similarities[idx]),
            }
            top_results.append(result)
        return top_results

    def save_db(self):
        data = {
            "embeddings": self.embeddings,
            "metadata": self.metadata,
            "query_cache": json.dumps(self.query_cache),
        }
        os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
        with open(self.db_path, "wb") as file:
            pickle.dump(data, file)

    def load_db(self):
        if not os.path.exists(self.db_path):
            raise ValueError("Vector database file not found. Use load_data to create a new database.")
        with open(self.db_path, "rb") as file:
            data = pickle.load(file)
        self.embeddings = data["embeddings"]
        self.metadata = data["metadata"]
        self.query_cache = json.loads(data["query_cache"])
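A minimal usage sketch of the class above (the database name, thread count, and query are illustrative; dataset is the same chunked data used for the basic VectorDB earlier):

# Build the contextual vector DB; the token-count logs show the prompt-caching savings
contextual_db = ContextualVectorDB("contextual_db")
contextual_db.load_data(dataset, parallel_threads=5)

# Retrieve using the contextualized embeddings
results = contextual_db.search("Where is the query cache serialized?", k=20)
print(results[0]['metadata']['contextualized_content'])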
Contextual BM25
Contextual embeddings are an improvement on traditional semantic-search RAG,
but we can improve performance further. In this section we'll show how to use
contextual embeddings and contextual BM25 together. While you can see
performance gains by pairing these techniques without the added context,
applying context to both methods reduces the top-20-chunk retrieval failure
rate by 42%.
One difference from a typical BM25 search is that, for each chunk, we'll run the
BM25 search over both the chunk content and the additional context we
generated in the previous section. From there, we'll use a technique called
reciprocal rank fusion to merge the results of the BM25 search with our semantic
search results. This lets us perform a hybrid search across both our BM25 corpus
and our vector DB and return the most relevant documents for a given query.
In the function below, we give you the option to weight the semantic search and
BM25 results as you merge them with reciprocal rank fusion. By default, we set
these to 0.8 for the semantic search results and 0.2 for the BM25 results. We'd
encourage you to experiment with different values here.
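Here's a minimal sketch of that weighted reciprocal rank fusion step in isolation (the helper name, the rank-smoothing constant, and the example chunk ids are illustrative; the full hybrid retrieval function appears in the cells below):

from typing import List, Tuple, Dict

def weighted_rrf(semantic_ids: List[Tuple], bm25_ids: List[Tuple],
                 semantic_weight: float = 0.8, bm25_weight: float = 0.2,
                 smoothing: int = 60) -> List[Tuple]:
    # Each list is ordered best-first; score each chunk id by 1/(smoothing + rank), weighted per source
    scores: Dict[Tuple, float] = {}
    for rank, chunk_id in enumerate(semantic_ids):
        scores[chunk_id] = scores.get(chunk_id, 0.0) + semantic_weight * (1.0 / (smoothing + rank + 1))
    for rank, chunk_id in enumerate(bm25_ids):
        scores[chunk_id] = scores.get(chunk_id, 0.0) + bm25_weight * (1.0 / (smoothing + rank + 1))
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: chunk ids are (doc_id, original_index) pairs, as in the retrieval code below
fused = weighted_rrf([("doc_1", 0), ("doc_2", 3)], [("doc_2", 3), ("doc_9", 7)])
print(fused)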
In [369]: import os
import json
from typing import List, Dict, Any
from tqdm import tqdm
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

class ElasticsearchBM25:
    def __init__(self, index_name: str = "contextual_bm25_index"):
        self.es_client = Elasticsearch("http://localhost:9200")
        self.index_name = index_name
        self.create_index()

    def create_index(self):
        index_settings = {
            "settings": {
                "analysis": {"analyzer": {"default": {"type": "english"}}},
                "similarity": {"default": {"type": "BM25"}},
                "index.queries.cache.enabled": False  # Disable query cache
            },
            "mappings": {
                "properties": {
                    "content": {"type": "text", "analyzer": "english"},
                    "contextualized_content": {"type": "text", "analyzer": "english"},
                    "doc_id": {"type": "keyword", "index": False},
                    "chunk_id": {"type": "keyword", "index": False},
                    "original_index": {"type": "integer", "index": False},
                }
            },
        }
        if not self.es_client.indices.exists(index=self.index_name):
            self.es_client.indices.create(index=self.index_name, body=index_settings)
            print(f"Created index: {self.index_name}")
# Excerpt from the hybrid retrieval function: run semantic search and BM25,
# then combine the two ranked lists of chunk ids before scoring them.
# Semantic search
semantic_results = db.search(query, k=num_chunks_to_recall)
ranked_chunk_ids = [(result['metadata']['doc_id'], result['metadata']['original_index']) for result in semantic_results]

# Combine results
chunk_ids = list(set(ranked_chunk_ids + ranked_bm25_chunk_ids))
chunk_id_to_score = {}
# Excerpt from the end-to-end evaluation: warm up Elasticsearch, score each
# query against its golden chunks, then report stats and clean up the index.
try:
    # Warm-up queries
    warm_up_queries = original_data[:10]
    for query_item in warm_up_queries:
        _ = retrieve_advanced(query_item['query'], db, es_bm25, k)

    total_score = 0
    total_semantic_count = 0
    total_bm25_count = 0
    total_results = 0

    for query_item in tqdm(original_data, desc="Evaluating retrieval"):
        query = query_item['query']
        golden_chunk_uuids = query_item['golden_chunk_uuids']

        golden_contents = []
        for doc_uuid, chunk_index in golden_chunk_uuids:
            golden_doc = next((doc for doc in query_item['golden_documents'] if doc.get('original_uuid') == doc_uuid), None)
            if golden_doc:
                golden_chunk = next((chunk for chunk in golden_doc['chunks'] if chunk.get('index') == chunk_index), None)
                if golden_chunk:
                    golden_contents.append(golden_chunk['content'].strip())

        if not golden_contents:
            print(f"Warning: No golden contents found for query: {query}")
            continue

        # assumes retrieve_advanced returns (docs, semantic_count, bm25_count)
        retrieved_docs, semantic_count, bm25_count = retrieve_advanced(query, db, es_bm25, k)

        chunks_found = 0
        for golden_content in golden_contents:
            for doc in retrieved_docs[:k]:
                retrieved_content = doc['chunk']['original_content'].strip()
                if retrieved_content == golden_content:
                    chunks_found += 1
                    break

        total_score += chunks_found / len(golden_contents)
        total_semantic_count += semantic_count
        total_bm25_count += bm25_count
        total_results += len(retrieved_docs)

    total_queries = len(original_data)
    average_score = total_score / total_queries
    pass_at_n = average_score * 100
    semantic_percentage = (total_semantic_count / total_results) * 100 if total_results > 0 else 0
    bm25_percentage = (total_bm25_count / total_results) * 100 if total_results > 0 else 0

    results = {
        "pass_at_n": pass_at_n,
        "average_score": average_score,
        "total_queries": total_queries
    }

    print(f"Pass@{k}: {pass_at_n:.2f}%")
    print(f"Average Score: {average_score:.2f}")
    print(f"Total queries: {total_queries}")
    print(f"Percentage of results from semantic search: {semantic_percentage:.2f}%")
    print(f"Percentage of results from BM25: {bm25_percentage:.2f}%")
finally:
    # Delete the Elasticsearch index
    if es_bm25.es_client.indices.exists(index=es_bm25.index_name):
        es_bm25.es_client.indices.delete(index=es_bm25.index_name)
        print(f"Deleted Elasticsearch index: {es_bm25.index_name}")
Reranking
Below, we'll demonstrate only the re-ranking step (skipping the hybrid search
technique for now). You'll see that we retrieve 10x as many documents as the
final number of documents we want to return, then use a re-ranking model from
Cohere to select the 10 most relevant results from that list. Adding the
re-ranking step delivers a modest additional gain in performance. In our case,
Pass@10 improves from 92.81% to 94.79%.
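The cell below is an excerpt from that re-ranking retrieval function. As a sketch of the setup it assumes (the client construction, sample query, over-retrieval factor, and the choice of which text to hand the re-ranker are illustrative):

import cohere

co = cohere.Client(os.getenv("COHERE_API_KEY"))

k = 10
query = "How is the query cache persisted?"   # sample query for illustration

# Over-retrieve from the contextual vector DB, then let the re-ranker pick the best k
semantic_results = contextual_db.search(query, k=k*10)
documents = [res['metadata']['original_content'] for res in semantic_results]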
response = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=documents,
    top_n=k
)
time.sleep(0.1)  # light rate limiting between rerank calls

final_results = []
for r in response.results:
    original_result = semantic_results[r.index]
    final_results.append({
        "chunk": original_result['metadata'],
        "score": r.relevance_score
    })
return final_results
# Excerpt from the re-ranking evaluation loop: the same golden-chunk scoring
# as before, but retrieved docs now carry their metadata under 'chunk'.
    golden_contents = []
    for doc_uuid, chunk_index in golden_chunk_uuids:
        golden_doc = next((doc for doc in query_item['golden_documents'] if doc.get('original_uuid') == doc_uuid), None)
        if golden_doc:
            golden_chunk = next((chunk for chunk in golden_doc['chunks'] if chunk.get('index') == chunk_index), None)
            if golden_chunk:
                golden_contents.append(golden_chunk['content'].strip())

    if not golden_contents:
        print(f"Warning: No golden contents found for query: {query}")
        continue

    chunks_found = 0
    for golden_content in golden_contents:
        for doc in retrieved_docs[:k]:
            retrieved_content = doc['chunk']['original_content'].strip()
            if retrieved_content == golden_content:
                chunks_found += 1
                break