Vector Database Essentials
Pavan Kumar M K
First Edition
Foreword
"Explore the ins and outs of Vector Databases in this insigh:ul book. Unlike others, it goes
beyond product talk, offering a deep dive into the fundamentals. Discover the unique
contribuDon of Chroma DB, with pracDcal use cases woven seamlessly into the narraDve.
It's a natural, hands-on approach to understanding the core of Vector DBs and their role in
the ever-evolving data landscape."
Sashank Pappu
CEO, Antz.ai
Preface
In the vast landscape of data, a new type of database has emerged, surrounded by
intrigue. These databases, called vector databases, promise quick data retrieval and clever
similarity detection. However, for those unfamiliar, exploring this realm might seem like
navigating a complex maze blindfolded.
Traditional databases provide a sense of familiarity with their organized tables and rows. Yet,
when dealing with complex data like images, text, and user preferences, these structures fall
short. Here enters the vector database, specifically designed for the intricate nature of such
high-dimensional data.
Picture each data point as a constellation, its essence captured in the angles and distances
between various attributes. Vector databases grasp this celestial language, storing data
points as vectors—mathematical entities encoding the essence of each "star."
The true marvel lies not just in storage but in retrieval. Unlike traditional databases
struggling with similarity nuances, vector databases possess a nearly magical ability to
recognize patterns and connections. They unveil hidden relationships between seemingly
unrelated data points, revealing insights that might elude the keenest human eye.
Imagine having a million unique photographs. A traditional database might let you search by
tags, but finding all images of, for instance, a sunrise over a calm ocean could be challenging.
A vector database, on the other hand, effortlessly pinpoints these hidden gems, guided by
the subtle dance of vectors.
So, dear reader, get ready for an enlightening journey. Let's embark on this quest together,
and by the end, the once-mysterious world of vector databases will be an open book, ready
to be explored and harnessed for the greater good.
Chapter 2: Real-Time Use Cases of Vector Databases
Automotive Industry:
• Multi-modal search can aid in identifying automotive parts. Users can capture images
of components, and the system can retrieve relevant information and documentation
from a vector database, facilitating repairs and maintenance.
Drug Discovery:
• Researchers can employ multi-modal search to analyze chemical structures and
biological images related to drug discovery. Vector databases can store information
about compounds, their properties, and potential applications in medicine.
Autoencoders (AEs) and Variational Autoencoders (VAEs) are types of neural networks used
in unsupervised learning for dimensionality reduction and generative tasks. Autoencoders
consist of an encoder and a decoder. The encoder compresses the input data into a lower-
dimensional representation, known as the latent space, while the decoder reconstructs the
input from this compressed representation. The network is trained to minimize the
difference between the input and the reconstructed output, forcing the encoder to learn a
meaningful representation of the data.
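To make the idea concrete, the following is a minimal sketch of an autoencoder, written here in PyTorch as an assumed library choice (the book does not show this code); the dimensions are illustrative. The encoder maps the input to a latent vector, the decoder reconstructs the input, and training minimizes the reconstruction error.

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compresses the input into the latent space
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        # Decoder: reconstructs the input from the latent vector
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)              # a batch of dummy inputs
optimizer.zero_grad()
loss = nn.MSELoss()(model(x), x)     # input-vs-reconstruction difference
loss.backward()
optimizer.step()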
1. Import Libraries: The code imports the required modules, using
SentenceTransformer for working with pre-trained models and List for type hinting.
2. Class Definition: Define a class TextualEmbeddings that initializes an instance of the
SentenceTransformer model. The model is specified by the model_name_or_path
parameter, and the default is set to 'paraphrase-MiniLM-L6-v2'.
3. Encode Method: Define a method encode that takes a list of sentences (data) as
input and returns the corresponding embeddings using the encode method of the
pre-trained model.
4. Main Block: Specify a list of sentences that you want to encode.
5. Instantiation of the Class: Create an instance of the TextualEmbeddings class and
use it to encode the specified sentences, as shown in the sketch below.
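Since the original code figure is not reproduced here, the following is a hedged reconstruction of the class based on the walkthrough above; the example sentences are illustrative.

from typing import List
from sentence_transformers import SentenceTransformer

class TextualEmbeddings:
    def __init__(self, model_name_or_path: str = 'paraphrase-MiniLM-L6-v2'):
        # Initialize the pre-trained sentence-transformer model
        self.model = SentenceTransformer(model_name_or_path)

    def encode(self, data: List[str]):
        # Return one embedding vector per input sentence
        return self.model.encode(data)

if __name__ == '__main__':
    sentences = ["Vector databases store embeddings.",
                 "Semantic search finds similar meanings."]
    embeddings = TextualEmbeddings().encode(sentences)
    print(embeddings.shape)  # (2, 384) for this model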
Chapter 4: Measuring Distance between Vector Embeddings
1. Euclidean Distance:
Euclidean distance measures the straight-line distance between two vectors,
computed as d(A, B) = √((A1 − B1)² + (A2 − B2)² + … + (An − Bn)²). Euclidean distance is
sensitive to magnitude and direction, making it suitable for scenarios where both magnitude
and orientation matter.
2. Manhattan Distance:
Manhattan distance, also known as L1 norm or taxicab distance, measures the sum
of absolute differences between corresponding elements of two vectors. In the context of
vector databases, Manhattan distance is calculated as the sum of the absolute differences
between the coordinates of two vectors. Mathematically, it is expressed as
d(A, B) = |A1 − B1| + |A2 − B2| + … + |An − Bn|.
Unlike Euclidean distance, Manhattan distance is less influenced by outliers and is often
preferred when the impact of extreme values should be minimized.
3. Dot Product:
The dot product is a mathematical operation that quantifies the similarity between
two vectors. In the context of vector databases, the dot product reflects the cosine of the
angle between two vectors, scaled by their magnitudes. If the vectors are orthogonal, the dot
product is zero; if they point in the same direction, the dot product is positive, and if they
point in opposite directions, the dot product is negative. Mathematically, the dot product of
vectors A and B is given by
A · B = A1B1 + A2B2 + … + AnBn = |A| |B| cos θ.
The dot product is valuable for measuring the alignment of vectors and is often used in tasks
such as similarity and relevance scoring.
4. Cosine Distance:
Cosine distance is a measure of similarity between two vectors based on the cosine
of the angle between them. In the context of vector databases, cosine distance is often used
to assess the similarity of vectors regardless of their magnitude. It is particularly useful in
scenarios where the magnitude of vectors is not a significant factor, such as text data. Cosine
similarity is calculated as the cosine of the angle between two vectors A and B, represented
as
cos θ = (A · B) / (|A| |B|), with cosine distance defined as 1 − cos θ.
Cosine similarity produces a value between −1 and 1, where 1 indicates identical orientation,
0 indicates orthogonality, and −1 indicates opposite orientation; the corresponding cosine
distance therefore ranges from 0 to 2. Cosine distance is widely employed in information
retrieval and recommendation systems for assessing document or item similarity.
Let us see the above concepts in action with some examples. We will combine the
earlier code, which produces embeddings for sentences, with a utility class that we are
going to design now, so that we can compute the distance between the embeddings.
The code below has two methods, numbered 1 and 2 respectively. Method 1 is the
constructor (__init__), which initializes the class instance:
• self.vector1 and self.vector2: randomly generated dense vectors of size 30.
• self.distance_metric: an enumeration (Enum) representing distance metrics,
including Manhattan, Euclidean, and Cosine distances.
Method 2 is the business logic, which computes the distance measure based on the
given criterion:
• It checks the value of distance_metric_name against the enumeration values to
determine the desired distance metric.
• If distance_metric_name is 'manhattan_distance', it calculates and returns the
Manhattan distance.
• If distance_metric_name is 'euclidean_distance', it calculates and returns the
Euclidean distance.
• If distance_metric_name is 'cosine_distance' (or any other value), it calculates and
returns the Cosine distance.
Code Figure 2
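Since the original figure is not reproduced here, the following is a hedged reconstruction of the described utility; the class and enum names are assumptions based on the text above.

from enum import Enum
import numpy as np

class DistanceMetric(Enum):
    MANHATTAN = 'manhattan_distance'
    EUCLIDEAN = 'euclidean_distance'
    COSINE = 'cosine_distance'

class DistanceMeasures:
    def __init__(self):
        # Method 1: randomly generated dense vectors of size 30
        self.vector1 = np.random.rand(30)
        self.vector2 = np.random.rand(30)
        self.distance_metric = DistanceMetric

    def compute(self, distance_metric_name: str) -> float:
        # Method 2: compute the distance for the requested metric
        if distance_metric_name == DistanceMetric.MANHATTAN.value:
            return float(np.sum(np.abs(self.vector1 - self.vector2)))
        if distance_metric_name == DistanceMetric.EUCLIDEAN.value:
            return float(np.linalg.norm(self.vector1 - self.vector2))
        # Default branch: cosine distance = 1 - cosine similarity
        cos_sim = np.dot(self.vector1, self.vector2) / (
            np.linalg.norm(self.vector1) * np.linalg.norm(self.vector2))
        return float(1.0 - cos_sim)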
If we modify Code Figure 1 shown above as below, the logic will compute the
distance between the vectors and finally display the similarity between the passed-in
vectors.
Code Figure 3
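Since the original figure is not reproduced here, a minimal sketch of the combined logic, with illustrative sentences, might look as follows.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
v1, v2 = model.encode(["Vector databases enable semantic search.",
                       "Semantic search relies on vector embeddings."])
# Cosine similarity between the two sentence embeddings
cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"cosine similarity: {cos_sim:.4f}")
print(f"cosine distance:   {1 - cos_sim:.4f}")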
Chapter 5: Brute Force Distance Measure using the KNeighbours Algorithm
Vectors, serving as conduits for the intrinsic meaning embedded within our data,
become instrumental in seeking data points that resonate in meaning with our queries. This
process, known as semantic or vector search, hinges on the identification and retrieval of the
closest objects within vector space. The tutorial elaborates on semantic search,
emphasizing its reliance on the meaning encapsulated in words or images. A detailed
walkthrough of the brute-force approach is provided, involving sequential steps: calculating
distances between all vectors and the query vector, sorting these distances, and finally
returning the top K best-matching objects based on the smallest distances—a paradigm
recognized in classical machine learning as the K nearest neighbour algorithm.
To illustrate this point, a speed test function assesses the time complexity of the
brute-force algorithm across various scales, ranging from 20 objects to millions. The
observed results demonstrate a noticeable increase in query time as the number of objects
expands, revealing the inherent limitations of the brute-force approach, particularly with
substantial datasets.
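A minimal sketch of such a speed test, with illustrative dataset sizes and a small dimensionality to bound memory, might look like this.

import time
import numpy as np

def brute_force_knn(data, query, k=3):
    # Compute distances to every vector, sort, return the k closest indices
    distances = np.linalg.norm(data - query, axis=1)
    return np.argsort(distances)[:k]

for n in (20, 10_000, 1_000_000):
    data = np.random.randn(n, 32)
    query = np.random.randn(32)
    start = time.perf_counter()
    brute_force_knn(data, query)
    print(f"{n:>9} vectors: {time.perf_counter() - start:.4f} s")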
In conclusion, the tutorial accentuates the intricate relationship between the number
of vectors and query time. The steady growth in query duration, which scales roughly
linearly with the number of vectors, particularly in scenarios mirroring real-world
complexities, necessitates the exploration of alternative methodologies to ensure timely and
efficient results. The subsequent lesson promises an exploration of diverse methods to
navigate these challenges and facilitate effective queries across numerous vectors.
Code Figure 4
random_vector Method:
The random_vector method is responsible for generating random 2-dimensional vectors. It
utilizes NumPy's randn function to create 50 data points in two dimensions (a 50×2 array).
plot_data Method:
The plot_data method facilitates the visualization of embeddings and query vectors. It takes
two arguments: data_vector, representing the embeddings, and query_vector, representing
the vector used as a query. The method generates a scatter plot, marking the embeddings
and the query vector. Each point on the plot corresponds to an embedding, and the query
vector is highlighted in blue. Text annotations on the plot correspond to the indices of the
embeddings.
nearest_neighbours Method:
The nearest_neighbours method performs k-Nearest Neighbors search. It takes several
parameters:
• k: The number of neighbours to retrieve (default is 3).
• algorithm: The algorithm used for the nearest-neighbour search (default is 'brute').
• metric: The distance metric used for calculating distances (default is 'euclidean').
• data_vector: The embeddings dataset.
• query_vector: The vector for which nearest neighbours are to be found.
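Based on the walkthrough above, a hedged reconstruction of Code Figure 4 might look as follows; the class name is an assumption.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

class KNNDemo:
    def random_vector(self):
        # 50 random data points in two dimensions (a 50x2 array)
        return np.random.randn(50, 2)

    def plot_data(self, data_vector, query_vector):
        # Scatter plot of the embeddings; the query vector is shown in blue
        plt.scatter(data_vector[:, 0], data_vector[:, 1], color='gray')
        plt.scatter(query_vector[0], query_vector[1], color='blue')
        for i, point in enumerate(data_vector):
            plt.annotate(str(i), (point[0], point[1]))  # index labels
        plt.show()

    def nearest_neighbours(self, data_vector, query_vector, k=3,
                           algorithm='brute', metric='euclidean'):
        # Brute-force k-nearest-neighbour search over the embeddings
        nn = NearestNeighbors(n_neighbors=k, algorithm=algorithm, metric=metric)
        nn.fit(data_vector)
        distances, indices = nn.kneighbors([query_vector])
        return distances, indices

if __name__ == '__main__':
    demo = KNNDemo()
    data, query = demo.random_vector(), np.random.randn(2)
    print(demo.nearest_neighbours(data, query))
    demo.plot_data(data, query)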
If we plot the randomly generated 2-dimensional vectors using a driver program, we obtain
a scatter plot of the embeddings with the query vector highlighted.
Vector databases, on the other hand, are adept at indexing and swiftly searching for
similar vectors through the application of advanced similarity algorithms. This capability
empowers applications to identify and retrieve vectors that bear resemblance to a specified
target vector query. In essence, vector stores provide an optimized environment for
managing and querying vector data, enabling efficient exploration and retrieval of related
vectors in response to specific queries.
In this book we will consider Chroma DB as our vector store and explain all the
concepts related to vector stores. Chroma DB, an open-source vector store, is designed for
the storage and retrieval of vector embeddings, primarily serving the purpose of preserving
embeddings and associated metadata. This stored information proves valuable for
subsequent utilization by large language models. Notably, Chroma DB finds application
in semantic search engines dealing with textual data. Key features of Chroma DB are as
follows.
1. Diverse Storage Options:
• Chroma DB supports various underlying storage alternatives, including
DuckDB for standalone setups and ClickHouse for enhanced scalability.
2. Software Development Kits (SDKs):
• It furnishes Software Development Kits (SDKs) for Python and
JavaScript/TypeScript, facilitating seamless integration into projects
developed in these programming languages.
3. Emphasis on Simplicity and Speed:
• Chroma DB prioritizes simplicity, speed, and analytical capabilities, aligning its
design with the objectives of straightforward usage, rapid performance, and
data analysis.
4. Self-Hosted Server Option:
• An additional feature of Chroma DB is the availability of a self-hosted server
option, providing users with the flexibility to host and manage the vector
store infrastructure according to their specific requirements.
You can also build the Docker image yourself from the Dockerfile in the Chroma GitHub
repository.
The Chroma client can then be configured to connect to the server running in the Docker
container.
import chromadb
chroma_client = chromadb.HttpClient(host='localhost', port=8000)
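If needed, you can verify connectivity with the client's heartbeat method:

chroma_client.heartbeat()  # returns a timestamp if the server is reachable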
Chapter 7: Implementing our first Vector Search
The encapsulated functionalities within the ChromadbHelper class delineate a utility
designed for seamless interaction with Chroma DB, an adept database tailored for the
efficient management of vector embeddings. The instantiation process initializes a
connection to the Chroma DB server, configured with the host as 'localhost' and the port as
'8000'. The class offers a repertoire of methods to navigate key operations: fetching
collections, creating and deleting collections, saving data along with metadata, and querying
the database. Notably, the method for saving data internally employs Chroma DB, relying on
the all-MiniLM-L6-v2 model for embedding handling. This cohesive design allows for an
intuitive and structured approach to interacting with Chroma DB, abstracting complexities
associated with HTTP client interactions and database operations. In the academic spirit, the
class serves as a pedagogical tool, facilitating a lucid understanding of how to navigate and
utilize Chroma DB effectively for vector embedding applications.
Code Figure 5
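Since the original figure is not reproduced here, the following is a hedged reconstruction of the ChromadbHelper class; the method names are assumptions based on the description above.

import chromadb
from chromadb.utils import embedding_functions

class ChromadbHelper:
    def __init__(self, host: str = 'localhost', port: int = 8000):
        # Connect to the Chroma DB server
        self.client = chromadb.HttpClient(host=host, port=port)
        # Embedding function backed by the all-MiniLM-L6-v2 model
        self.embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name='all-MiniLM-L6-v2')

    def fetch_collections(self):
        return self.client.list_collections()

    def create_collection(self, name: str):
        return self.client.get_or_create_collection(
            name=name, embedding_function=self.embedding_fn)

    def delete_collection(self, name: str):
        self.client.delete_collection(name=name)

    def save(self, collection_name, ids, documents, metadatas):
        # Documents are embedded internally via the embedding function
        collection = self.create_collection(collection_name)
        collection.add(ids=ids, documents=documents, metadatas=metadatas)

    def query(self, collection_name, query_texts, n_results=3):
        collection = self.create_collection(collection_name)
        return collection.query(query_texts=query_texts, n_results=n_results)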
Exploring Chroma collections involves diverse querying techniques facilitated by the .query
method. One approach entails querying with a set of query_embeddings, where each query
embedding is a numerical representation of a search query. By invoking the .query method
with parameters such as query_embeddings, n_results, where, and where_document,
users can retrieve the top matching results for each query embedding, allowing for
metadata-based and content-based filtering.
Code Figure 6
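Since the original figure is not reproduced here, a minimal sketch of such a query follows; the collection name, filter fields, and embedding values are illustrative assumptions.

import chromadb

client = chromadb.HttpClient(host='localhost', port=8000)
collection = client.get_or_create_collection(name="demo")  # assumed name
collection.add(ids=["doc1"],
               documents=["Vector databases index embeddings."],
               metadatas=[{"category": "science"}])
query_embeddings = [[0.1] * 384]  # one illustrative 384-dimensional query
results = collection.query(
    query_embeddings=query_embeddings,
    n_results=1,
    where={"category": "science"},            # metadata-based filter
    where_document={"$contains": "vector"},   # content-based filter
)
print(results["ids"], results["distances"])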
If the dimensions of the supplied query embeddings do not align with those of the
collection, an exception will be raised. Alternatively, users can opt to query by a set of
query_texts. Chroma first embeds each query text using the collection's embedding function
and subsequently performs the query with the generated embeddings. The .query method, in
this scenario, also supports parameters like n_results, where, and where_document.
Retrieving items from a collection by their unique identifiers (ids) is achievable through the
.get method, where users can specify the desired ids and apply optional where and
where_document filters.
Code Figure 7
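Since the original figure is not reproduced here, a minimal sketch of querying by text and fetching by id follows; it assumes the "demo" collection populated in the previous sketch.

import chromadb

client = chromadb.HttpClient(host='localhost', port=8000)
collection = client.get_or_create_collection(name="demo")  # assumed name

# Query by text: Chroma embeds each query with the collection's embedding function
results = collection.query(query_texts=["What is a vector database?"], n_results=1)

# Retrieve specific items by id, with an optional metadata filter
items = collection.get(ids=["doc1"], where={"category": "science"})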
The .get method, if invoked without specific ids, returns all items in the collection that
match the specified filters. Notably, when using .get or .query, the include parameter allows
users to selectively retrieve data fields such as embeddings, documents, metadatas, and
distances. By default, Chroma returns documents, metadatas, and distances for query
results, while excluding embeddings for performance reasons. Users can customize the
returned data fields by providing an array of field names to the include parameter
of the .query or .get method, tailoring the output to their specific requirements.
Code Figure 8
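Since the original figure is not reproduced here, a minimal sketch of customizing the returned fields follows; it again assumes the populated "demo" collection.

import chromadb

client = chromadb.HttpClient(host='localhost', port=8000)
collection = client.get_or_create_collection(name="demo")  # assumed name
# Request embeddings explicitly, alongside documents and distances
results = collection.query(
    query_texts=["vector databases"],
    n_results=1,
    include=["embeddings", "documents", "distances"],
)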
Metadata filtering in Chroma supports a range of operators, providing users with versatile
options for refining queries based on metadata attributes. The $eq operator enables filtering
for equality, accommodating strings, integers, and floats. Conversely, the $ne operator
matches items that are not equal to the specified value, supporting string, integer, and float
comparisons. For numeric comparisons, Chroma offers the $gt operator to filter for values
greater than the specified threshold, and the $gte operator for values greater than or equal
to the given threshold. On the other hand, the $lt operator facilitates filtering for values less
than the specified threshold, while the $lte operator includes values less than or equal to
the specified threshold. This array of operators provides users with a comprehensive toolkit
to precisely tailor their metadata-based filters, promoting flexibility and precision in
querying Chroma collections.
Code Figure 9
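Since the original figure is not reproduced here, illustrative where filters for each operator follow; the field names are assumptions.

import chromadb

client = chromadb.HttpClient(host='localhost', port=8000)
collection = client.get_or_create_collection(name="demo")  # assumed name
collection.get(where={"year": {"$eq": 2023}})       # equal to
collection.get(where={"status": {"$ne": "draft"}})  # not equal to
collection.get(where={"score": {"$gt": 0.5}})       # greater than
collection.get(where={"score": {"$gte": 0.5}})      # greater than or equal
collection.get(where={"price": {"$lt": 100}})       # less than
collection.get(where={"price": {"$lte": 100}})      # less than or equal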
The StudentDB class encapsulates a simple yet illustrative Python application, showcasing
the fundamental CRUD operations within the domain of Chroma DB. Initiated with an
instantiation of Chroma DB and an OpenAI embedding function, the class seamlessly
integrates the capabilities of both technologies. Leveraging this integration, the class defines
methods for creating, reading, updating, and deleting student records within a collection
named "students." The create_student method orchestrates the addition of new student
information, generating a unique identifier for each student. Subsequently, the
read_student method retrieves and displays the information of a specified student,
demonstrating the read operation. The update_student method allows for the modification
of an existing student's information, exemplifying the update operation. Finally, the
delete_student method facilitates the removal of a student record based on a provided
identifier, illustrating the delete operation. This concise yet comprehensive demonstration
underscores the seamless integration of Chroma DB and OpenAI embeddings, offering a
tangible illustration of CRUD operations within the context of a vector database.
Code Figure 10
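Since the original figure is not reproduced here, a hedged reconstruction of the StudentDB class follows; the method bodies are assumptions based on the description above.

import uuid
import chromadb
from chromadb.utils import embedding_functions

class StudentDB:
    def __init__(self, openai_api_key: str):
        self.client = chromadb.HttpClient(host='localhost', port=8000)
        self.embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
            api_key=openai_api_key)
        self.collection = self.client.get_or_create_collection(
            name="students", embedding_function=self.embedding_fn)

    def create_student(self, info: str, metadata: dict) -> str:
        student_id = str(uuid.uuid4())   # unique identifier per student
        self.collection.add(ids=[student_id], documents=[info],
                            metadatas=[metadata])
        return student_id

    def read_student(self, student_id: str):
        return self.collection.get(ids=[student_id])

    def update_student(self, student_id: str, info: str, metadata: dict):
        self.collection.update(ids=[student_id], documents=[info],
                               metadatas=[metadata])

    def delete_student(self, student_id: str):
        self.collection.delete(ids=[student_id])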
Chapter 8: Going From CRUD to Semantic Search
In the development of our forthcoming application, we embark on the task of
creating a straightforward yet efficient system. This system involves the storage of two
distinct documents, namely "student_info" and "university_info," within the vector database
known as Chroma DB. Leveraging the OpenAI embeddings, we employ a custom embedding
function intrinsic to Chroma DB to facilitate the incorporation of these documents. The
embedding function plays a pivotal role in encapsulating the semantic nuances and
contextual information of the documents. As the documents find their residence within
Chroma DB, we subsequently proceed to pose queries to this database. Chroma DB, armed
with its inherent capability to compute minimum distances based on context, eventually
responds by returning the document that exhibits the closest contextual match to the posed
questions. This application thus underscores the seamless synergy between document
storage, embedding functions, and query responses within the realm of vector databases,
exemplifying the practicality and efficacy of such systems in real-world applications. The
below code exemplifies the above scenario.
Code Figure 10
Code Figure 11
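Since the original figures are not reproduced here, a hedged sketch of the scenario follows; the document texts, collection name, and query are illustrative assumptions.

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.HttpClient(host='localhost', port=8000)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key="YOUR_API_KEY")
collection = client.get_or_create_collection(
    name="campus_docs", embedding_function=openai_ef)  # assumed name

student_info = "A computer science student focusing on vector databases."
university_info = "A public university known for its engineering programmes."
collection.add(ids=["student_info", "university_info"],
               documents=[student_info, university_info])

# Chroma embeds the question and returns the closest contextual match
results = collection.query(query_texts=["Tell me about the university"],
                           n_results=1)
print(results["documents"][0][0])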
Final Chapter: Conclusion
In the landscape of large language model systems, vector stores such as Chroma DB
have become indispensable components. Their specialized storage capabilities and efficient
retrieval of vector embeddings play a pivotal role in facilitating swift access to pertinent
semantic information, thereby empowering the functionality of Large Language Models
(LLMs).
This tutorial on Chroma DB delves into the fundamental aspects of its utilization.
Topics covered encompass the foundational steps of creating a collection, incorporating
documents, transforming text into embeddings, executing queries for semantic similarity,
and proficiently managing the collections. This comprehensive tutorial serves as a valuable
resource for individuals seeking to grasp the essentials of employing Chroma DB in their
language model endeavours.
As part of the continuous learning journey, the subsequent phase involves the
seamless integration of vector databases into generative AI applications. The LlamaIndex
framework is an invaluable tool for users aiming to effortlessly ingest, manage, and retrieve
private or domain-specific data for their AI applications within the framework of Large
Language Model (LLM)-based systems. Furthermore, enthusiasts can explore the intricacies
of LLMOps and its areas of application. This progression allows practitioners to deepen
their understanding and application of vector databases, fostering a more nuanced approach
to harnessing their capabilities for advanced language model development.