Vector Database Essentials
Pavan Kumar M K
First Edition
Foreword
"Explore the ins and outs of Vector Databases in this insigh:ul book. Unlike others, it goes
beyond product talk, offering a deep dive into the fundamentals. Discover the unique
contribuDon of Chroma DB, with pracDcal use cases woven seamlessly into the narraDve.
It's a natural, hands-on approach to understanding the core of Vector DBs and their role in
the ever-evolving data landscape."
Sashank Pappu
CEO, Antz.ai
Preface
In the vast landscape of data, a new type of database has emerged, surrounded by
intrigue. These databases, called vector databases, promise quick data retrieval and clever
similarity detection. However, for those unfamiliar, exploring this realm might seem like
navigating a complex maze blindfolded.
Traditional databases provide a sense of familiarity with their organized tables and rows. Yet,
when dealing with complex data like images, text, and user preferences, these structures fall
short. Here enters the vector database, specifically designed for the intricate nature of such
high-dimensional data.
Picture each data point as a constellation, its essence captured in the angles and distances
between various attributes. Vector databases grasp this celestial language, storing data
points as vectors—mathematical entities encoding the essence of each "star."
The true marvel lies not just in storage but in retrieval. Unlike traditional databases
struggling with similarity nuances, vector databases possess a nearly magical ability to
recognize patterns and connections. They unveil hidden relationships between seemingly
unrelated data points, revealing insights that might elude the keenest human eye.
Imagine having a million unique photographs. A traditional database might let you search by
tags, but finding all images of, for instance, a sunrise over a calm ocean could be challenging.
A vector database, on the other hand, effortlessly pinpoints these hidden gems, guided by
the subtle dance of vectors.
So, dear reader, get ready for an enlightening journey. Let's embark on this quest together,
and by the end, the once-mysterious world of vector databases will be an open book, ready
to be explored and harnessed for the greater good.
Chapter 2: Real-Time Use Cases of Vector Databases
Automotive Industry:
• Multi-modal search can aid in identifying automotive parts. Users can capture images
of components, and the system can retrieve relevant information and documentation
from a vector database, facilitating repairs and maintenance.
Drug Discovery:
• Researchers can employ multi-modal search to analyze chemical structures and
biological images related to drug discovery. Vector databases can store information
about compounds, their properties, and potential applications in medicine.
Autoencoders (AEs) and Variational Autoencoders (VAEs) are types of neural networks used
in unsupervised learning for dimensionality reduction and generative tasks. Autoencoders
consist of an encoder and a decoder. The encoder compresses the input data into a lower-
dimensional representation, known as the latent space, while the decoder reconstructs the
input from this compressed representation. The network is trained to minimize the
difference between the input and the reconstructed output, forcing the encoder to learn a
meaningful representation of the data.
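To make the idea concrete, the following is a minimal sketch of an autoencoder, written here in PyTorch as an assumed library choice (the book does not show this code); the dimensions are illustrative. The encoder maps the input to a latent vector, the decoder reconstructs the input, and training minimizes the reconstruction error.

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compresses the input into the latent space
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        # Decoder: reconstructs the input from the latent vector
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)              # a batch of dummy inputs
optimizer.zero_grad()
loss = nn.MSELoss()(model(x), x)     # input-vs-reconstruction difference
loss.backward()
optimizer.step()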
1. Import Libraries: The code imports the required modules, using
SentenceTransformer for working with pre-trained models and List for type hinting.
2. Class Definition: Define a class TextualEmbeddings that initializes an instance of the
SentenceTransformer model. The model is specified by the model_name_or_path
parameter, and the default is set to 'paraphrase-MiniLM-L6-v2'.
3. Encode Method: Define a method encode that takes a list of sentences (data) as
input and returns the corresponding embeddings using the encode method of the
pre-trained model.
4. Main Block: Specify a list of sentences that you want to encode.
5. Instantiation of the Class: Create an instance of the TextualEmbeddings class and
use it to encode the specified sentences, as shown in the sketch below.
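Since the original code figure is not reproduced here, the following is a hedged reconstruction of the class based on the walkthrough above; the example sentences are illustrative.

from typing import List
from sentence_transformers import SentenceTransformer

class TextualEmbeddings:
    def __init__(self, model_name_or_path: str = 'paraphrase-MiniLM-L6-v2'):
        # Initialize the pre-trained sentence-transformer model
        self.model = SentenceTransformer(model_name_or_path)

    def encode(self, data: List[str]):
        # Return one embedding vector per input sentence
        return self.model.encode(data)

if __name__ == '__main__':
    sentences = ["Vector databases store embeddings.",
                 "Semantic search finds similar meanings."]
    embeddings = TextualEmbeddings().encode(sentences)
    print(embeddings.shape)  # (2, 384) for this model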
Chapter 4: Measuring Distance between Vector Embeddings
1. Euclidean Distance:
Euclidean distance measures the straight-line distance between two vectors,
computed as d(A, B) = √((A1 − B1)² + (A2 − B2)² + … + (An − Bn)²). Euclidean distance is
sensitive to magnitude and direction, making it suitable for scenarios where both magnitude
and orientation matter.
2. Manhattan Distance:
Manhattan distance, also known as L1 norm or taxicab distance, measures the sum
of absolute differences between corresponding elements of two vectors. In the context of
vector databases, Manhattan distance is calculated as the sum of the absolute differences
between the coordinates of two vectors. Mathematically, it is expressed as
d(A, B) = |A1 − B1| + |A2 − B2| + … + |An − Bn|.
Unlike Euclidean distance, Manhattan distance is less influenced by outliers and is often
preferred when the impact of extreme values should be minimized.
3. Dot Product:
The dot product is a mathematical operation that quantifies the similarity between
two vectors. In the context of vector databases, the dot product reflects the cosine of the
angle between two vectors, scaled by their magnitudes. If the vectors are orthogonal, the dot
product is zero; if they point in the same direction, the dot product is positive, and if they
point in opposite directions, the dot product is negative. Mathematically, the dot product of
vectors A and B is given by
A · B = A1B1 + A2B2 + … + AnBn = |A| |B| cos θ.
The dot product is valuable for measuring the alignment of vectors and is often used in tasks
such as similarity and relevance scoring.
4. Cosine Distance:
Cosine distance is a measure of similarity between two vectors based on the cosine
of the angle between them. In the context of vector databases, cosine distance is often used
to assess the similarity of vectors regardless of their magnitude. It is particularly useful in
scenarios where the magnitude of vectors is not a significant factor, such as text data. Cosine
similarity is calculated as the cosine of the angle between two vectors A and B, represented
as
cos θ = (A · B) / (|A| |B|), with cosine distance defined as 1 − cos θ.
Cosine similarity produces a value between −1 and 1, where 1 indicates identical orientation,
0 indicates orthogonality, and −1 indicates opposite orientation; the corresponding cosine
distance therefore ranges from 0 to 2. Cosine distance is widely employed in information
retrieval and recommendation systems for assessing document or item similarity.
Let us see the above concepts in action with some examples. We will combine the
earlier code, which produces embeddings for sentences, with a utility class that we are
going to design now, so that we can compute the distance between the embeddings.
The code below has two methods, numbered 1 and 2 respectively. Method 1 is the
constructor (__init__), which initializes the class instance:
• self.vector1 and self.vector2: randomly generated dense vectors of size 30.
• self.distance_metric: an enumeration (Enum) representing distance metrics,
including Manhattan, Euclidean, and Cosine distances.
Method 2 is the business logic, which computes the distance measure based on the
given criterion:
• It checks the value of distance_metric_name against the enumeration values to
determine the desired distance metric.
• If distance_metric_name is 'manhattan_distance', it calculates and returns the
Manhattan distance.
• If distance_metric_name is 'euclidean_distance', it calculates and returns the
Euclidean distance.
• If distance_metric_name is 'cosine_distance' (or any other value), it calculates and
returns the Cosine distance.
Code Figure 2
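Since the original figure is not reproduced here, the following is a hedged reconstruction of the described utility; the class and enum names are assumptions based on the text above.

from enum import Enum
import numpy as np

class DistanceMetric(Enum):
    MANHATTAN = 'manhattan_distance'
    EUCLIDEAN = 'euclidean_distance'
    COSINE = 'cosine_distance'

class DistanceMeasures:
    def __init__(self):
        # Method 1: randomly generated dense vectors of size 30
        self.vector1 = np.random.rand(30)
        self.vector2 = np.random.rand(30)
        self.distance_metric = DistanceMetric

    def compute(self, distance_metric_name: str) -> float:
        # Method 2: compute the distance for the requested metric
        if distance_metric_name == DistanceMetric.MANHATTAN.value:
            return float(np.sum(np.abs(self.vector1 - self.vector2)))
        if distance_metric_name == DistanceMetric.EUCLIDEAN.value:
            return float(np.linalg.norm(self.vector1 - self.vector2))
        # Default branch: cosine distance = 1 - cosine similarity
        cos_sim = np.dot(self.vector1, self.vector2) / (
            np.linalg.norm(self.vector1) * np.linalg.norm(self.vector2))
        return float(1.0 - cos_sim)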
If we modify Code Figure 1 shown above as below, the logic will compute the
distance between the vectors and finally display the similarity between the passed-in
vectors.
Code Figure 3
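Since the original figure is not reproduced here, a minimal sketch of the combined logic, with illustrative sentences, might look as follows.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
v1, v2 = model.encode(["Vector databases enable semantic search.",
                       "Semantic search relies on vector embeddings."])
# Cosine similarity between the two sentence embeddings
cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"cosine similarity: {cos_sim:.4f}")
print(f"cosine distance:   {1 - cos_sim:.4f}")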
Chapter 5: Brute Force Distance Measure using the KNeighbours Algorithm
Vectors, serving as conduits for the intrinsic meaning embedded within our data,
become instrumental in seeking data points that resonate in meaning with our queries. This
process, known as semantic or vector search, hinges on the identification and retrieval of the
closest objects within vector space. The tutorial elaborates on semantic search,
emphasizing its reliance on the meaning encapsulated in words or images. A detailed
walkthrough of the brute-force approach is provided, involving sequential steps: calculating
distances between all vectors and the query vector, sorting these distances, and finally
returning the top K best-matching objects based on the smallest distances—a paradigm
recognized in classical machine learning as the K nearest neighbour algorithm.
To illustrate this point, a speed test function assesses the time complexity of the
brute-force algorithm across various scales, ranging from 20 objects to millions. The
observed results demonstrate a noticeable increase in query time as the number of objects
expands, revealing the inherent limitations of the brute-force approach, particularly with
substantial datasets.
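A minimal sketch of such a speed test, with illustrative dataset sizes and a small dimensionality to bound memory, might look like this.

import time
import numpy as np

def brute_force_knn(data, query, k=3):
    # Compute distances to every vector, sort, return the k closest indices
    distances = np.linalg.norm(data - query, axis=1)
    return np.argsort(distances)[:k]

for n in (20, 10_000, 1_000_000):
    data = np.random.randn(n, 32)
    query = np.random.randn(32)
    start = time.perf_counter()
    brute_force_knn(data, query)
    print(f"{n:>9} vectors: {time.perf_counter() - start:.4f} s")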
In conclusion, the tutorial accentuates the intricate relationship between the number
of vectors and query time. The steady growth in query duration, which scales roughly
linearly with the number of vectors, particularly in scenarios mirroring real-world
complexities, necessitates the exploration of alternative methodologies to ensure timely and
efficient results. The subsequent lesson promises an exploration of diverse methods to
navigate these challenges and facilitate effective queries across numerous vectors.
Code Figure 4
random_vector Method:
The random_vector method is responsible for generating random 2-dimensional vectors. It
utilizes NumPy's randn function to create 50 data points in two dimensions (a 50×2 array).
plot_data Method:
The plot_data method facilitates the visualization of embeddings and query vectors. It takes
two arguments: data_vector, representing the embeddings, and query_vector, representing
the vector used as a query. The method generates a scatter plot, marking the embeddings
and the query vector. Each point on the plot corresponds to an embedding, and the query
vector is highlighted in blue. Text annotations on the plot correspond to the indices of the
embeddings.
nearest_neighbours Method:
The nearest_neighbours method performs k-Nearest Neighbors search. It takes several
parameters:
• k: The number of neighbours to retrieve (default is 3).
• algorithm: The algorithm used for the nearest-neighbour search (default is 'brute').
• metric: The distance metric used for calculating distances (default is 'euclidean').
• data_vector: The embeddings dataset.
• query_vector: The vector for which nearest neighbours are to be found.
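Based on the walkthrough above, a hedged reconstruction of Code Figure 4 might look as follows; the class name is an assumption.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

class KNNDemo:
    def random_vector(self):
        # 50 random data points in two dimensions (a 50x2 array)
        return np.random.randn(50, 2)

    def plot_data(self, data_vector, query_vector):
        # Scatter plot of the embeddings; the query vector is shown in blue
        plt.scatter(data_vector[:, 0], data_vector[:, 1], color='gray')
        plt.scatter(query_vector[0], query_vector[1], color='blue')
        for i, point in enumerate(data_vector):
            plt.annotate(str(i), (point[0], point[1]))  # index labels
        plt.show()

    def nearest_neighbours(self, data_vector, query_vector, k=3,
                           algorithm='brute', metric='euclidean'):
        # Brute-force k-nearest-neighbour search over the embeddings
        nn = NearestNeighbors(n_neighbors=k, algorithm=algorithm, metric=metric)
        nn.fit(data_vector)
        distances, indices = nn.kneighbors([query_vector])
        return distances, indices

if __name__ == '__main__':
    demo = KNNDemo()
    data, query = demo.random_vector(), np.random.randn(2)
    print(demo.nearest_neighbours(data, query))
    demo.plot_data(data, query)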
If we plot the randomly generated 2-dimensional vectors using a driver program, we obtain
a scatter plot of the embeddings with the query vector highlighted.
Vector databases, on the other hand, are adept at indexing and swiftly searching for
similar vectors through the application of advanced similarity algorithms. This capability
empowers applications to identify and retrieve vectors that bear resemblance to a specified
target vector query. In essence, vector stores provide an optimized environment for
managing and querying vector data, enabling efficient exploration and retrieval of related
vectors in response to specific queries.
In this book we will consider Chroma DB as our vector store and explain all the
concepts related to vector stores. Chroma DB, an open-source vector store, is designed for
the storage and retrieval of vector embeddings, primarily serving the purpose of preserving
embeddings and associated metadata. This stored information proves valuable for
subsequent utilization by large language models. Notably, Chroma DB finds application
in semantic search engines dealing with textual data. Key features of Chroma DB are as
follows.
1. Diverse Storage Options:
• Chroma DB supports various underlying storage alternatives, including
DuckDB for standalone setups and ClickHouse for enhanced scalability.
2. Software Development Kits (SDKs):
• It furnishes Software Development Kits (SDKs) for Python and
JavaScript/TypeScript, facilitating seamless integration into projects
developed in these programming languages.
3. Emphasis on Simplicity and Speed:
• Chroma DB prioritizes simplicity, speed, and analytical capabilities, aligning its
design with the objectives of straightforward usage, rapid performance, and
data analysis.
4. Self-Hosted Server Option:
• An additional feature of Chroma DB is the availability of a self-hosted server
option, providing users with the flexibility to host and manage the vector
store infrastructure according to their specific requirements.
You can also build the Docker image yourself from the Dockerfile in the Chroma GitHub
repository.
The Chroma client can then be configured to connect to the server running in the Docker
container.
import chromadb
chroma_client = chromadb.HttpClient(host='localhost', port=8000)
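If needed, you can verify connectivity with the client's heartbeat method:

chroma_client.heartbeat()  # returns a timestamp if the server is reachable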
Chapter 7: Implementing our first Vector Search
The encapsulated functionalities within the ChromadbHelper class delineate a utility
designed for seamless interaction with Chroma DB, an adept database tailored for the
efficient management of vector embeddings. The instantiation process initializes a
connection to the Chroma DB server, configured with the host as 'localhost' and the port as
'8000'. The class offers a repertoire of methods to navigate key operations: fetching
collections, creating and deleting collections, saving data along with metadata, and querying
the database. Notably, the method for saving data internally employs Chroma DB, relying on
the all-MiniLM-L6-v2 model for embedding handling. This cohesive design allows for an
intuitive and structured approach to interacting with Chroma DB, abstracting complexities
associated with HTTP client interactions and database operations. In the academic spirit, the
class serves as a pedagogical tool, facilitating a lucid understanding of how to navigate and
utilize Chroma DB effectively for vector embedding applications.
Code Figure 5
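Since the original figure is not reproduced here, the following is a hedged reconstruction of the ChromadbHelper class; the method names are assumptions based on the description above.

import chromadb
from chromadb.utils import embedding_functions

class ChromadbHelper:
    def __init__(self, host: str = 'localhost', port: int = 8000):
        # Connect to the Chroma DB server
        self.client = chromadb.HttpClient(host=host, port=port)
        # Embedding function backed by the all-MiniLM-L6-v2 model
        self.embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name='all-MiniLM-L6-v2')

    def fetch_collections(self):
        return self.client.list_collections()

    def create_collection(self, name: str):
        return self.client.get_or_create_collection(
            name=name, embedding_function=self.embedding_fn)

    def delete_collection(self, name: str):
        self.client.delete_collection(name=name)

    def save(self, collection_name, ids, documents, metadatas):
        # Documents are embedded internally via the embedding function
        collection = self.create_collection(collection_name)
        collection.add(ids=ids, documents=documents, metadatas=metadatas)

    def query(self, collection_name, query_texts, n_results=3):
        collection = self.create_collection(collection_name)
        return collection.query(query_texts=query_texts, n_results=n_results)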
Exploring Chroma collections involves diverse querying techniques facilitated by the .query
method. One approach entails querying with a set of query_embeddings, where each query
embedding is a numerical representation of a search query. By invoking the .query method
with parameters such as query_embeddings, n_results, where, and where_document,
users can retrieve the top matching results for each query embedding, allowing for
metadata-based and content-based filtering.
Code Figure 6
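Since the original figure is not reproduced here, a minimal sketch of such a query follows; the collection name, filter fields, and embedding values are illustrative assumptions.

import chromadb

client = chromadb.HttpClient(host='localhost', port=8000)
collection = client.get_or_create_collection(name="demo")  # assumed name
collection.add(ids=["doc1"],
               documents=["Vector databases index embeddings."],
               metadatas=[{"category": "science"}])
query_embeddings = [[0.1] * 384]  # one illustrative 384-dimensional query
results = collection.query(
    query_embeddings=query_embeddings,
    n_results=1,
    where={"category": "science"},            # metadata-based filter
    where_document={"$contains": "vector"},   # content-based filter
)
print(results["ids"], results["distances"])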
If the dimensions of the supplied query embeddings do not align with those of the
collection, an exception will be raised. Alternatively, users can opt to query by a set of
query_texts. Chroma first embeds each query text using the collection's embedding function
and subsequently performs the query with the generated embeddings. The .query method, in
this scenario, also supports parameters like n_results, where, and where_document.
Retrieving items from a collection by their unique identifiers (ids) is achievable through the
.get method, where users can specify the desired ids and apply optional where and
where_document filters.
Code Figure 7
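Since the original figure is not reproduced here, a minimal sketch of querying by text and fetching by id follows; it assumes the "demo" collection populated in the previous sketch.

import chromadb

client = chromadb.HttpClient(host='localhost', port=8000)
collection = client.get_or_create_collection(name="demo")  # assumed name

# Query by text: Chroma embeds each query with the collection's embedding function
results = collection.query(query_texts=["What is a vector database?"], n_results=1)

# Retrieve specific items by id, with an optional metadata filter
items = collection.get(ids=["doc1"], where={"category": "science"})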
The .get method, if invoked without specific ids, returns all items in the collection that
match the specified filters. Notably, when using .get or .query, the include parameter allows
users to selectively retrieve data fields such as embeddings, documents, metadatas, and
distances. By default, Chroma returns documents, metadatas, and distances for query
results, while excluding embeddings for performance reasons. Users can customize the
returned data fields by providing an array of field names to the include parameter
of the .query or .get method, tailoring the output to their specific requirements.
Code Figure 8
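Since the original figure is not reproduced here, a minimal sketch of customizing the returned fields follows; it again assumes the populated "demo" collection.

import chromadb

client = chromadb.HttpClient(host='localhost', port=8000)
collection = client.get_or_create_collection(name="demo")  # assumed name
# Request embeddings explicitly, alongside documents and distances
results = collection.query(
    query_texts=["vector databases"],
    n_results=1,
    include=["embeddings", "documents", "distances"],
)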
Metadata filtering in Chroma supports a range of operators, providing users with versatile
options for refining queries based on metadata attributes. The $eq operator enables filtering
for equality, accommodating strings, integers, and floats. Conversely, the $ne operator
matches items that are not equal to the specified value, supporting string, integer, and float
comparisons. For numeric comparisons, Chroma offers the $gt operator to filter for values
greater than the specified threshold, and the $gte operator for values greater than or equal
to the given threshold. On the other hand, the $lt operator facilitates filtering for values less
than the specified threshold, while the $lte operator includes values less than or equal to
the specified threshold. This array of operators provides users with a comprehensive toolkit
to precisely tailor their metadata-based filters, promoting flexibility and precision in
querying Chroma collections.
Code Figure 9
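Since the original figure is not reproduced here, illustrative where filters for each operator follow; the field names are assumptions.

import chromadb

client = chromadb.HttpClient(host='localhost', port=8000)
collection = client.get_or_create_collection(name="demo")  # assumed name
collection.get(where={"year": {"$eq": 2023}})       # equal to
collection.get(where={"status": {"$ne": "draft"}})  # not equal to
collection.get(where={"score": {"$gt": 0.5}})       # greater than
collection.get(where={"score": {"$gte": 0.5}})      # greater than or equal
collection.get(where={"price": {"$lt": 100}})       # less than
collection.get(where={"price": {"$lte": 100}})      # less than or equal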
The StudentDB class encapsulates a simple yet illustrative Python application, showcasing
the fundamental CRUD operations within the domain of Chroma DB. Initiated with an
instantiation of Chroma DB and an OpenAI embedding function, the class seamlessly
integrates the capabilities of both technologies. Leveraging this integration, the class defines
methods for creating, reading, updating, and deleting student records within a collection
named "students." The create_student method orchestrates the addition of new student
information, generating a unique identifier for each student. Subsequently, the
read_student method retrieves and displays the information of a specified student,
demonstrating the read operation. The update_student method allows for the modification
of an existing student's information, exemplifying the update operation. Finally, the
delete_student method facilitates the removal of a student record based on a provided
identifier, illustrating the delete operation. This concise yet comprehensive demonstration
underscores the seamless integration of Chroma DB and OpenAI embeddings, offering a
tangible illustration of CRUD operations within the context of a vector database.
Code Figure 10
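Since the original figure is not reproduced here, a hedged reconstruction of the StudentDB class follows; the method bodies are assumptions based on the description above.

import uuid
import chromadb
from chromadb.utils import embedding_functions

class StudentDB:
    def __init__(self, openai_api_key: str):
        self.client = chromadb.HttpClient(host='localhost', port=8000)
        self.embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
            api_key=openai_api_key)
        self.collection = self.client.get_or_create_collection(
            name="students", embedding_function=self.embedding_fn)

    def create_student(self, info: str, metadata: dict) -> str:
        student_id = str(uuid.uuid4())   # unique identifier per student
        self.collection.add(ids=[student_id], documents=[info],
                            metadatas=[metadata])
        return student_id

    def read_student(self, student_id: str):
        return self.collection.get(ids=[student_id])

    def update_student(self, student_id: str, info: str, metadata: dict):
        self.collection.update(ids=[student_id], documents=[info],
                               metadatas=[metadata])

    def delete_student(self, student_id: str):
        self.collection.delete(ids=[student_id])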
Chapter 8: Going From CRUD to Semantic Search
In the development of our forthcoming application, we embark on the task of
creating a straightforward yet efficient system. This system involves the storage of two
distinct documents, namely "student_info" and "university_info," within the vector database
known as Chroma DB. Leveraging the OpenAI embeddings, we employ a custom embedding
function intrinsic to Chroma DB to facilitate the incorporation of these documents. The
embedding function plays a pivotal role in encapsulating the semantic nuances and
contextual information of the documents. As the documents find their residence within
Chroma DB, we subsequently proceed to pose queries to this database. Chroma DB, armed
with its inherent capability to compute minimum distances based on context, eventually
responds by returning the document that exhibits the closest contextual match to the posed
questions. This application thus underscores the seamless synergy between document
storage, embedding functions, and query responses within the realm of vector databases,
exemplifying the practicality and efficacy of such systems in real-world applications. The
below code exemplifies the above scenario.
Code Figure 10
Code Figure 11
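Since the original figures are not reproduced here, a hedged sketch of the scenario follows; the document texts, collection name, and query are illustrative assumptions.

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.HttpClient(host='localhost', port=8000)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key="YOUR_API_KEY")
collection = client.get_or_create_collection(
    name="campus_docs", embedding_function=openai_ef)  # assumed name

student_info = "A computer science student focusing on vector databases."
university_info = "A public university known for its engineering programmes."
collection.add(ids=["student_info", "university_info"],
               documents=[student_info, university_info])

# Chroma embeds the question and returns the closest contextual match
results = collection.query(query_texts=["Tell me about the university"],
                           n_results=1)
print(results["documents"][0][0])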
Final Chapter: Conclusion
In the landscape of large language model systems, vector stores such as Chroma DB
have become indispensable components. Their specialized storage capabilities and efficient
retrieval of vector embeddings play a pivotal role in facilitating swift access to pertinent
semantic information, thereby empowering the functionality of Large Language Models
(LLMs).
This tutorial on Chroma DB delves into the fundamental aspects of its utilization.
Topics covered encompass the foundational steps of creating a collection, incorporating
documents, transforming text into embeddings, executing queries for semantic similarity,
and proficiently managing the collections. This comprehensive tutorial serves as a valuable
resource for individuals seeking to grasp the essentials of employing Chroma DB in their
language model endeavours.
As part of the continuous learning journey, the subsequent phase involves the
seamless integration of vector databases into generative AI applications. The LlamaIndex
framework is an invaluable tool for users aiming to effortlessly ingest, manage, and retrieve
private or domain-specific data for their AI applications within the framework of Large
Language Model (LLM)-based systems. Furthermore, enthusiasts can explore the intricacies
of LLMOps and its areas of application. This progression allows practitioners to deepen
their understanding and application of vector databases, fostering a more nuanced approach
to harnessing their capabilities for advanced language model development.