Milvus Overview
Realized By :
- Abderrahmane Boucenna
Year : 2022-2023
Abstract
The rapid growth of Artificial Intelligence (AI) and Big Data has revolutionized
various sectors, leading to groundbreaking advancements in technology, research, and
decision-making processes. AI algorithms and models, combined with massive
amounts of data, have enabled organizations to extract valuable insights, make
accurate predictions, and automate complex tasks. However, the increasing volume,
velocity, and variety of data have posed significant challenges in terms of storage,
retrieval, and analysis, creating the need for more efficient and scalable solutions.
Modern NoSQL databases combined with big data architectures are usually sufficient
for this task, but the properties of the data handled by AI leave many inefficiencies,
which has led to the emergence of vector databases.
For the purpose of this project, we will use Milvus, an open-source vector database
designed to efficiently store, index, and search high-dimensional vector data. It is
specifically built to address the challenges of managing large-scale vector data in AI
and Big Data applications. For this reason, we will focus on the vector database
properties that distinguish it from the usual DBMS rather than on the usual DBMS
features.
Contents
1 Vector Databases Overview 6
1.1 Motivation of creation . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Foundations of Vector Databases . . . . . . . . . . . . . . . . . 7
1.2.1 Vector Embedding . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.3 ML Algorithms . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Problems with traditional Databases . . . . . . . . . . . . . . 11
1.4 Vector Search . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Known Vector Databases and systems . . . . . . . . . . . . . . 21
1.6.1 Weaviate . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.6.2 Facebook Faiss . . . . . . . . . . . . . . . . . . . . . . . 21
1.6.3 Elasticsearch . . . . . . . . . . . . . . . . . . . . . . . . 21
2 Milvus overview 22
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Why Milvus ? . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Milvus Architecture 24
3.1 General Design . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Distributed Architecture . . . . . . . . . . . . . . . . . . . . . 25
3.3 main components . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.1 SDK . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Load balancer . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.3 Access layer . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.4 coordinator service . . . . . . . . . . . . . . . . . . . . 27
3.3.5 Meta Data . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.6 Message Storage . . . . . . . . . . . . . . . . . . . . . . 28
3.3.7 Log Broker . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.8 Worker Node . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.9 Object Storage . . . . . . . . . . . . . . . . . . . . . . . 29
5 Data Model and Organization 32
5.1 Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2 shards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3 Partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.4 Segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.4.1 Growing Segment . . . . . . . . . . . . . . . . . . . . . 34
5.4.2 Sealed Segment . . . . . . . . . . . . . . . . . . . . . . 35
5.4.3 Flushed Segment . . . . . . . . . . . . . . . . . . . . . 36
11 Index manipulation 47
11.1 Vector Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
11.2 metric_type . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
11.3 index_type . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
11.4 Scalar Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
13 Security 49
13.1 User Security . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
13.2 database security . . . . . . . . . . . . . . . . . . . . . . . . . 50
1 Vector Databases Overview
1.1 Motivation of creation
With the boom of AI applications in several domains such as IoT, medical
imaging, and large language models, all of which handle large volumes of unstructured
data (according to IDC, 80% of data will be unstructured by 2025 [1]), new kinds of
heavily used data manipulations are emerging. These could be handled conveniently
with NoSQL databases combined with big data architectures, but problems arise from
the details of how this new data is used and queried. Some of the new types of
emerging operations are:
• Recommendation systems: Many modern large-scale companies
like Facebook and YouTube use recommendation systems that try to find
similar high-dimensional items to recommend. This goes beyond the
classic approach of filtering with predefined sets of values because it is
conducted on unstructured data. One problem here is that sometimes
we do not even know these predefined sets, and we need to load the entire
database and feed it to the AI algorithm to process it. Wouldn't it be better
if we had a database that could save the patterns and quickly extract
the similar items without loading the entire database?
• Image Searching: Image search platforms like Google Lens, and image
filtering in general, try to find the closest image to the searched one or the best image
that fits certain constraints, an image of course being treated as a 2D
array of real values. The problem here is that images are often assigned
a string title or ID to identify them, so performing a large recognition
operation would mean extracting the images from the database and then
feeding them to the AI model, which is largely time consuming. Many
traditional databases store unstructured data in a binary representation,
which says nothing about the features of the data or its patterns;
it basically creates a read-only memory useful only for storing the data,
with no other advantage. Wouldn't it be better to store unstructured
data (images) in a format on which operations can actually be performed,
a format that keeps a semantic meaning for the data and exposes its
features, a format that frees us from loading the entire dataset into
the model, so that incoming queries can be handled efficiently?
• Long Term Memory: A general AI problem. Imagine the following
scenario: we have a database storing unstructured data, for example large
sequences of text, and an application wants to classify the texts depending on
a criterion such as "harmful or non-harmful text". We load the data,
feed it to the AI algorithm, and get the results. What if this operation is
redundant? We cannot extract the entire database every time, so what do
we do? We could partition our database into two databases (one for harmful
text, the other for non-harmful text), or simply add a new field (text class) that stores
the class. The second solution is better, but if we have millions of lines
(or documents) of text, adding it is very costly. Worse still, what
if in a single day we get many queries that each have a different criterion, and
the number of possible criteria is unbounded (searching positive tweets, searching
racist tweets, ...)? It is not convenient to add thousands of fields to the
database. The basic problem here is that once we conduct our AI search we
get no long-term benefit from it, as storing its result is costly and of limited
use. What if there were a data format that is convenient for searching patterns,
a data format that can quickly classify the data depending on the criterion
without needing to store the results each time?
meaning of it, as illustrated in Figure 1, where unstructured data is transformed into
high-dimensional vectors in a way that keeps track of the semantic meaning of the
data, which facilitates operations like finding similar data and vector
searching.
- Graph Embeddings: Graph embedding techniques aim to represent
nodes or entities in a graph as vectors. These methods learn to capture
structural and semantic relationships between nodes in a graph. Techniques
like Graph Convolutional Networks (GCNs) and GraphSAGE are widely
used for graph embedding.
- Knowledge Graph Embeddings: Knowledge graph embedding techniques
aim to represent entities and relations in a knowledge graph as vectors.
These embeddings capture the semantic relationships between entities and
enable reasoning and inference. Techniques like TransE, TransR, and
DistMult are commonly used for knowledge graph embedding.
- Sequential Embeddings: Sequential embedding techniques capture the
sequential patterns in data such as time series, sequences of events, or
sequences of actions. Recurrent Neural Networks (RNNs), Long Short-Term
Memory (LSTM), and Gated Recurrent Units (GRUs) are often used to
learn sequential representations.
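To make the idea of vector embeddings concrete, the short hedged sketch below turns a few sentences into dense vectors. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, which are not mentioned in this report; any embedding model producing fixed-length vectors would serve the same purpose.

# Minimal sketch, assuming the sentence-transformers package is installed
# (pip install sentence-transformers); the model name is an assumption.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "a cat sitting on a couch",
    "a kitten resting on a sofa",
    "quarterly financial report of the company",
]

# Each text becomes a fixed-length, high-dimensional vector (here 384 dimensions).
embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 384)

Semantically close sentences (the first two) end up close to each other in this vector space, which is exactly the property a vector database exploits.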
Figure 2: Visual representation of vector database items
1.2.3 ML Algorithms
Having seen the data format and the environment in which these vectors live,
we need to see what types of operations are conducted in this vector space.
Having established that the core need behind vector databases is similarity,
even in cases where we cannot determine the criterion in advance, and that machine
learning often revolves around finding the rules instead of following them, many machine
learning algorithms have dealt with this problem. We state the most used algorithms
in vector databases:
- KNN, short for k-nearest neighbors, is an unsupervised ML algorithm used
to find the nearest neighbors of a data point. This algorithm is heavily used
for clustering and is the backbone of vector search, as it provides the
nearest vectors to the query.
- Dimensionality reduction: most of the time, many of the data dimensions
or features are not useful for querying; this is especially relevant with
high-dimensional data. Dimensionality reduction is an unsupervised
technique to reduce the number of dimensions, resulting in better performance and
faster execution time, which is particularly useful for indexing.
Figure 3: Reducing 3D points to 2D with PCA
- Neural networks have shown exceptional accuracy and performance when dealing with
extremely high-dimensional data; they are useful for many tasks such as searching for similar
images, texts, and voice recordings.
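As a hedged illustration of the first two building blocks, the sketch below runs a k-nearest-neighbor query and a PCA reduction on random vectors with scikit-learn; the library choice and the sample data are assumptions, not part of the report.

# Minimal sketch of KNN search and PCA dimensionality reduction,
# assuming numpy and scikit-learn are available; the data is random.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
vectors = rng.random((1000, 128))   # 1000 database vectors, 128 dimensions
query = rng.random((1, 128))        # one query vector

# k-nearest-neighbor search: the backbone of vector search.
knn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(vectors)
distances, indices = knn.kneighbors(query)
print("nearest ids:", indices[0])

# Dimensionality reduction: keep only the most informative directions.
pca = PCA(n_components=16)
reduced = pca.fit_transform(vectors)
print("reduced shape:", reduced.shape)  # (1000, 16)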
indexing structures and storage mechanisms used in NoSQL databases
might struggle to handle the increased complexity and computational demands
associated with high-dimensional data. Vector databases, specifically designed
for efficient management of vector data, are built to handle the challenges
of high-dimensional scaling.
• Limited Optimization for Vector Search: NoSQL databases often
prioritize general-purpose features and scalability over specific optimizations
for vector search operations. This can result in suboptimal query performance
and longer response times when working with vector data. Vector databases,
on the other hand, are purpose-built for vector-related tasks, offering specialized
optimizations that significantly improve the efficiency of vector search
operations.
This is why having just a library was not enough to handle vectors; the need for
an entire database management system dedicated to vector manipulation became
a must.
1.4 Vector Search
Traditionally, if we store our unstructured data in a relational or document
database, we often have to assign a string title to each image,
frequently by hand, and searching for an image then means searching for
its title. This is unbelievably tedious and useless when trying to insert and search
large volumes of unstructured data. Vector databases tackle this directly by using
a new type of query called vector search, also known as similarity search
or nearest neighbor query: an operation performed on a vector database to
retrieve the vectors that are most similar or nearest to a given query vector. The
goal of a vector query is to find the vectors in the database that have the closest
resemblance or proximity to the query vector based on a specified distance
metric, since recommendation systems often look for approximate answers rather than
exactly bounded ones. This has proven very useful when dealing with textual
input describing products and retrieving them, getting relevant answers for
queries, and proposing syntactically or semantically similar videos to users. This
approach includes several steps:
1. Query Vector: this step revolves around transforming the given query
into a query vector, which is the target; the goal is to construct the vector
that provides the best semantic representation of the searched features. For
this there are various techniques taken from NLP, deep learning, and ML,
such as TF-IDF (Term Frequency-Inverse Document Frequency), Bag-of-Words
(BoW), and word embeddings, the latter being the most common technique for
vector databases. [3]
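As a hedged sketch of this first step, the code below builds a TF-IDF query vector with scikit-learn over a toy corpus; the corpus and the library are assumptions used purely for illustration.

# Minimal sketch: turning a textual query into a query vector with TF-IDF,
# assuming scikit-learn is available; word embeddings would be used the same way.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "red running shoes for men",
    "wireless noise cancelling headphones",
    "waterproof hiking boots",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# The query is projected into the same vector space as the stored documents.
query_vector = vectorizer.transform(["running shoes"]).toarray()[0]
print(query_vector.shape)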
2. Distance Calculation: this is often the heaviest step in terms of execution
time [4], as it calculates the distances between the target vector, which is
our query vector, and the database items, which are vectors, using KNN. One
of the optimizations proposed to accelerate this calculation is to compute the
distance only to near neighbors within a specified range instead of to all
elements, which is approximate nearest neighbor (ANN) search. There are several
distances used for this, such as the Euclidean distance
$\sqrt{\sum_{i=1}^{n}(u_i - v_i)^2}$ and the inner product
$\sum_{i=1}^{n} u_i v_i$.
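A hedged numpy sketch of this step, on made-up data, is shown below: it computes both distance measures between one query vector and a small set of database vectors and keeps the k closest items.

# Minimal sketch of the distance-calculation step, assuming numpy; the data is random.
import numpy as np

rng = np.random.default_rng(1)
db_vectors = rng.random((10000, 64))  # database vectors
q = rng.random(64)                    # query vector

# Euclidean distance between q and every database vector.
euclidean = np.sqrt(((db_vectors - q) ** 2).sum(axis=1))

# Inner product similarity (larger means more similar).
inner_product = db_vectors @ q

k = 5
nearest_ids = np.argsort(euclidean)[:k]  # exhaustive KNN; ANN indexes avoid this full scan
print(nearest_ids)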
Figure 6: Result search
4. Filtering: in queries with filters, the selected neighbors get filtered;
for example, if we apply a filter that only accepts p6 and p3, only those
neighbors remain in the result.
1.5 Indexes
While most vector databases support the normal indexes used in traditional databases,
the nature of vector-database-specific queries calls for new types of
indexes not used in traditional databases. Some databases have their own
variations of the indexes mentioned below, but generally many vector databases
support them in one form or another.
Figure 9: Clustering illustration
into 256 clusters, we end up with 8 × 256 clusters, each having a central
element which we call the centroid. The cluster centroids are grouped into a
table called the codebook, which is the table of centroids; each centroid has a
scalar-valued ID, and this scalar-valued ID gets inserted into a PQ code table. So
what we did essentially is reduce 1000 vectors of 128 elements each to 1000 codes
of 8 centroid IDs each, a very large memory and performance gain. The
new table containing the centroid IDs is called the PQ codes and the one containing
the centroids is called the codebook, and together they represent our index. Now the
question is how do we use them? Given a query vector ⃗q, normally we
would perform KNN and return the k nearest neighbors, which is a bit tedious
when the database is very large. What we do instead is divide our query vector into 8
sub-vectors, just like we did with our database vectors; for each subspace of the query vector we
calculate the distance between it and the corresponding centroids from the
codebook, and we save the results in what is called a distance table, shown
below. This table then gets sorted in ascending order and we take the
top results. This is especially fast because instead of scanning all vectors, we
only look at our codebook and PQ code table and extract the top elements of
the distance table.
Figure 12: query with PQ index
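The following hedged numpy sketch implements the mechanism just described on random data: each 128-dimensional vector is split into 8 sub-vectors, each subspace is clustered into 256 centroids with k-means, vectors are stored as 8 centroid IDs, and a query is answered through a per-subspace distance table. The library choices and sizes are assumptions.

# Minimal product-quantization sketch, assuming numpy and scikit-learn; data is random.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n, dim, m, k = 1000, 128, 8, 256            # m sub-vectors, k centroids per subspace
sub_dim = dim // m
vectors = rng.random((n, dim))

codebooks = []                                # one codebook (k x sub_dim) per subspace
pq_codes = np.empty((n, m), dtype=np.int32)   # each vector becomes m centroid IDs
for j in range(m):
    sub = vectors[:, j * sub_dim:(j + 1) * sub_dim]
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_)
    pq_codes[:, j] = km.labels_

# Query: build a distance table (m x k) between query sub-vectors and centroids,
# then approximate each vector's distance by summing table lookups via its PQ code.
q = rng.random(dim)
dist_table = np.stack([
    ((codebooks[j] - q[j * sub_dim:(j + 1) * sub_dim]) ** 2).sum(axis=1)
    for j in range(m)
])
approx_dist = dist_table[np.arange(m), pq_codes].sum(axis=1)
print("approx nearest ids:", np.argsort(approx_dist)[:5])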
start off by visiting item 5 and then the end item; both of these get marked.
We move down the layers, skip 5, and visit 14, then skip the end
item; we move to layer 1, skip 5, and find item 11. HNSW inherits
this same layered format, with longer edges in the highest layers (for fast
search) and shorter edges in the lower layers (for accurate search). The
second algorithm is Navigable Small World graphs (NSW), developed
around 2011 to 2014 to speed up proximity search on
proximity graphs. A proximity graph, also known as a similarity graph or
neighborhood graph, is a graph-based data structure that represents the
pairwise similarity or proximity relationships between a set of objects.
Figure 15: HNSW example
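As a hedged, practical counterpart to this description, the sketch below builds a small HNSW index with the hnswlib library on random vectors; the library and parameter values are assumptions and are not the index implementation used inside Milvus.

# Minimal HNSW sketch, assuming the hnswlib package (pip install hnswlib); data is random.
import numpy as np
import hnswlib

rng = np.random.default_rng(3)
dim, n = 64, 10000
data = rng.random((n, dim)).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)          # layered proximity graph, L2 distance
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))

index.set_ef(50)                                     # search-time breadth/accuracy trade-off
labels, distances = index.knn_query(data[:1], k=5)   # approximate 5 nearest neighbors
print(labels, distances)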
1.6.3 Elasticsearch
2 Milvus overview
2.1 Introduction
Milvus is a purpose-built open-source database
management system designed to efficiently store and search large-scale vector data for
data science and AI applications. It is a specialized system for high-dimensional
vectors, following the design practice of one-size-does-not-fit-all [8], in contrast to
generalizing relational databases to support vectors. Milvus provides many
application interfaces (including SDKs in Python/Java/Go/C++ and RESTful
APIs) that can be easily used by applications. Milvus is highly tuned for
heterogeneous computing architectures with modern CPUs and GPUs (including multiple
GPU devices) for the best efficiency. It supports versatile query types such
as vector similarity search with various similarity functions, attribute filtering,
and multi-vector query processing. It provides different types of indexes (e.g.,
quantization-based indexes and graph-based indexes) and develops an extensible
interface to easily incorporate new indexes into the system. Milvus manages
dynamic vector data (e.g., insertions and deletions) via an LSM-based structure
while providing consistent real-time searches with snapshot isolation. Milvus is
also a distributed data management system deployed across multiple nodes to
achieve scalability and availability. Milvus has become one of the most widely
used vector database management systems thanks to these advanced features.
2.2 Why Milvus ?
The majority of prior work on software that dealt with vector data consisted
mainly of algorithms in the form of libraries or subsystems belonging to relational
databases, such as Facebook's Faiss and Microsoft's SPTAG. Other works revolved
around creating real vector databases, but some of them had problems when the
amount of data was massive; systems like Alibaba AnalyticDB-V and Alibaba
PASE (PostgreSQL) had a specialized vector column to support vector data, however
these systems, being not fully dedicated to vector data, left many of the
needed operations on vectors very limited. This is where Milvus comes in: it is fully
dedicated to vector data, with many advanced and new techniques to handle
it. Its architecture takes advantage of GPU and CPU parallelization to provide
the best performance, the focus on vectors makes GPU optimization particularly effective
for these types of operations, and Milvus is known to be the only database, alongside
Weaviate, to support multi-vector queries. A table taken from a research paper
shows how Milvus surpasses many vector systems.
3 Milvus Architecture
Milvus was designed for similarity search on dense vector datasets containing
millions, billions, or even trillions of vectors. Before proceeding, familiarize
yourself with the basic principles of embedding retrieval. Milvus also supports
data sharding, data persistence, streaming data ingestion, hybrid search between
vector and scalar data, time travel, and many other advanced functions. The
platform offers performance on demand and can be optimized to suit any embedding
retrieval scenario. We recommend deploying Milvus using Kubernetes for optimal
availability and elasticity. Milvus adopts a shared-storage architecture featuring
storage and computing disaggregation and horizontal scalability for its computing
nodes. Following the principle of data plane and control plane disaggregation,
Milvus comprises four layers: access layer, coordinator service, worker node,
and storage. These layers are mutually independent when it comes to scaling
or disaster recovery.
Figure 18: Overview of the architecture of Milvus
Figure 19: Milvus Distributed Architecture
The query nodes, which we will talk about later, are responsible for executing
and optimizing incoming queries; the load balancer makes sure that the
workload remains balanced for maximum performance and scalability.
into several buckets using a hash algorithm. Then the proxy requests the data coordinator
to assign segments, the smallest unit of data storage in Milvus. Afterwards, the
proxy inserts information about the requested segments into the message store so that
this information will not be lost.
• Root coordinator and node: handles data definition language (DDL)
and data control language (DCL) requests, such as creating or deleting collections,
partitions, or indexes, as well as managing the TSO (timestamp oracle) and time
tick issuing.
• Query coordinator and node: the query node retrieves incremental log data
and turns it into growing segments by subscribing to the log broker, loads
historical data from the object storage, and runs hybrid search between
vector and scalar data.
• Index coordinator and node: the index coordinator manages the topology of the
index nodes and maintains index metadata, while the index nodes build the indexes.
Index nodes do not need to be memory resident and can be implemented
with a serverless framework.
• Data coordinator and node: the data coordinator manages the topology of the
data nodes, maintains metadata, and triggers flush, compaction, and other
background data operations. The data node retrieves incremental log data by
subscribing to the log broker, processes mutation requests, and packs log
data into log snapshots that it stores in the object storage.
Milvus uses etcd for storing metadata. The etcd meta-storage dependency is
configurable, for example when Milvus is installed with the Milvus Operator.
It is responsible for storing and managing messages or events within the Milvus
database. This involves persisting messages generated by applications or systems for
various purposes such as communication, logging, auditing, event sourcing, or
real-time data processing. Milvus uses RocksMQ, Pulsar, or Kafka for managing
logs of recent changes, outputting stream logs, and providing log subscriptions.
notification, and return of query results. It also ensures the integrity of incremental
data when the worker nodes recover from a system breakdown. A Milvus cluster uses
Pulsar as its log broker; Milvus standalone uses RocksDB as its log broker. Besides,
the log broker can be readily replaced with streaming data storage platforms
such as Kafka and Pravega.
Milvus is built around the log broker and follows the "log as data" principle, so
Milvus does not maintain a physical table but guarantees data reliability through
log persistence and snapshot logs.
This represents a single entity of the distributed node system, the backbone
of the system, responsible for loading the data from the data storage into
main memory. The index nodes store and manage the created indexes, the data
node handles the loaded data, and the query node takes care of the computations.
Milvus uses object storage, a type of data storage architecture that
manages and organizes data as discrete units called objects. Unlike traditional
file systems or block storage, which organize data into a hierarchical structure
or fixed-size blocks, object storage stores data as self-contained entities with
unique identifiers and metadata. Milvus uses MinIO or S3 as object storage to
persist large-scale files, such as index files and binary logs.
4 Knowhere system Indexes
Having seen the classic indexes used in regular vector databases and libraries,
Milvus takes a somewhat innovative approach with the Knowhere concept. Knowhere
is an operation interface between the services in the upper layers of the system
and vector similarity search libraries like Faiss, Hnswlib, and Annoy in the lower
layers of the system. In addition, Knowhere is also in charge of heterogeneous
computing. More specifically, Knowhere controls on which hardware (e.g., CPU
or GPU) to execute index building and search requests. This is how Knowhere
gets its name: knowing where to execute the operations. More types of hardware,
including DPUs and TPUs, will be supported in future releases. In a broader
sense, Knowhere also incorporates other third-party index libraries like Faiss.
Therefore, as a whole, Knowhere is recognized as the core vector computation
engine in the Milvus vector database. Computation in Milvus mainly involves
vector and scalar operations; Knowhere only handles the operations on vectors.
In the Knowhere architecture, the bottom-most layer is the system hardware, the
third-party index libraries sit on top of the hardware, and Knowhere interacts
with the index node and query node at the top via CGO.
Knowhere provides indexes such as:
• The Faiss index has two subclasses: FaissBaseIndex for all indexes on floating-point
vectors, and FaissBaseBinaryIndex for all indexes on binary vectors.
• GPUIndex is the base class for all Faiss GPU indexes, used to optimize the search
performance for specific use cases.
• OffsetBaseIndex is the base class for all self-developed indexes. Only the
vector ID is stored in the index file. As a result, the index file size for
128-dimensional vectors can be reduced by two orders of magnitude. We
recommend taking the original vectors into consideration as well when
using this type of index for vector similarity search.
• IDMAP is not exactly an index, but is rather used for brute-force search.
When vectors are inserted into the vector database, no data training or
index building is required. Searches are conducted directly on the
inserted vector data.
• The IVF (inverted file) indexes are the most frequently used. The IVF
class is derived from VecIndex and FaissBaseIndex, and further extends to
IVFSQ and IVFPQ. GPUIVF is derived from GPUIndex and IVF, and
GPUIVF further extends to GPUIVFSQ and GPUIVFPQ.
• IVFSQHybrid is a self-developed hybrid index in which coarse quantization
is executed on the GPU while the search inside the bucket is executed on the CPU.
This type of index can reduce the occurrence of memory copies between CPU
and GPU by leveraging the computing power of the GPU. IVFSQHybrid has
the same recall rate as GPUIVFSQ but comes with better performance.
5 Data Model and Organization
Milvus stores data at four different levels of nesting, which are segments, partitions,
shards, and collections.
5.1 Collection
A collection in Milvus can be seen as the equivalent of a table in a relational
storage system. A collection is the biggest data unit in Milvus; it is composed of
multiple shards.
5.2 shards
Sharding also known as channeling is a technique used in distributed database
systems to horizontally partition data across multiple nodes or servers called
shards. It is employed to improve the scalability, performance, and availability
of a database by distributing the data and workload across multiple machines.
In a sharded database, the dataset is divided into smaller subsets called shards,
and each shard is hosted on a separate server or node. Each shard is responsible
for storing a specific portion of the data. This division can be based on different
criteria, such as ranges of values, hash functions, or predefined rules.To take
full advantage of the parallel computing power of clusters when writing data,
collections in Milvus must spread data writing operations to different nodes.
By default, a single collection contains two shards. Depending on the dataset
volume, we can have more shards in a collection. Milvus uses a master-key
hashing method for sharding.
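As a hedged pymilvus sketch, the snippet below creates a collection with an explicit number of shards; the field names and the shards_num value are illustrative assumptions rather than settings prescribed by this report.

# Minimal sketch: creating a collection with a chosen number of shards,
# assuming a running Milvus instance and the pymilvus package; field names are hypothetical.
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="book_id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="book_intro", dtype=DataType.FLOAT_VECTOR, dim=2),
]
schema = CollectionSchema(fields, description="example collection")

# shards_num controls how many shards the write load is spread across (default is 2).
collection = Collection(name="book", schema=schema, shards_num=2)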
A shard is allocated when a data node starts or shuts down, or when segment
space allocation is requested by the proxy. There are several strategies for
shard allocation; Milvus supports two of them:
• Consistent hashing: the default strategy in Milvus. This strategy leverages
the hashing technique to assign each channel a position on a ring, then
searches in a clockwise direction to find the nearest data node for a channel.
Thus, in the illustration, channel 1 is allocated to data node 2, while
channel 2 is allocated to data node 3. However, one problem with this
strategy is that the increase or decrease in the number of data nodes
Figure 25: Consistent hashing
5.3 Partitions
Milvus divides the data in the collection into multiple parts on physical storage
based on certain rules. Such operation is called partitioning. Each partition can
contain multiple segments. A partition is identified by a tag. When inserting
vector data, you can use the tag to specify which partition to insert the data
into. When querying vector data, you can use the tag to specify the partition
where the query should be executed. Milvus supports both the exact matching
and regular expression matching for partition tags.
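A hedged pymilvus sketch of working with partitions follows; the partition tag and the data are assumptions, reusing the hypothetical "book" collection from the previous sketch.

# Minimal sketch: creating a partition, inserting into it, and restricting a search to it.
# Assumes the hypothetical "book" collection defined earlier, with an index already
# built on the "book_intro" field (see the index manipulation section).
from pymilvus import Collection

collection = Collection("book")
collection.create_partition("novel")               # partition identified by its tag

collection.insert(
    [[1, 2], [[0.1, 0.2], [0.3, 0.4]]],            # book_id values and 2-d vectors
    partition_name="novel",
)

collection.load()
results = collection.search(
    data=[[0.1, 0.2]],
    anns_field="book_intro",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=2,
    partition_names=["novel"],                     # query only this partition
)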
5.4 Segments
The smallest storage entity in Milvus. There are three types of segments with
different statuses in Milvus: growing, sealed, and flushed segments.
A growing segment is a newly created segment that can be allocated to the proxy
for data insertion. The internal space of a segment can be used, allocated,
or free. A growing segment is consumed by the data node, meaning its data is moved
from storage to main memory, and is requested by the proxy and allocated by the data
coordinator, meaning there is space allocated for it in memory. Allocated
space expires after a certain period of time.
Figure 27: Sealed segment
A flushed segment is a segment that has already been written to disk. Flushing
refers to storing segment data in object storage for the sake of data persistence.
A segment can only be flushed when the allocated space in a sealed segment
expires; when flushed, the sealed segment turns into a flushed segment. This
could be seen as the equivalent of the commit operation in relational databases.
Figure 28: Flushed Segment
6 Milvus Installation and connection
6.1 Docker Installation
The installation process with Docker is fairly straightforward; we go
through the following commands:
# download the Milvus docker-compose file
wget https://github.com/milvusio/milvus/releases/download/v2.2.10/milvu
# start the Milvus containers
sudo docker-compose up -d
# check the running containers
sudo docker-compose ps
# check the port mapping of the Milvus server (default port 19530)
sudo docker port milvus-standalone 19530/tcp
Milvus can also be installed inside clusters through the Milvus Operator, a solution
that helps deploy and manage a full Milvus service stack on target Kubernetes (K8s)
clusters. The stack includes all Milvus components and relevant dependencies such
as etcd, Pulsar, and MinIO.
6.2 Zilliz Cloud
Zilliz Cloud is a platform that allows the creation and hosting of Milvus vector
databases; it allows the creation of a single cluster and at most two collections.
The registration process goes as follows: after logging in, we choose our usage plan,
then we create our cluster. The metric type refers to the distance measurement
seen earlier in the vector databases overview.
6.3 SDK Connection
Milvus supports a wide range of SDKs (Python, NodeJS, Java, ...). The method of
connection differs depending on whether we are using the cloud or running Milvus
locally; we will be using Python for the manipulation of the collection. We start
by installing pymilvus, the library that links Milvus with the Python SDK, using
the command
pip3 install pymilvus
Once this is ready we move on to the connection.
The connection to the cloud is done through the API key and the Endpoint, both of
which are found on the first page of the cluster. The connection has the following form:
the token is the API key and the uri is our endpoint. Inside the code we then
connect through the command
from pymilvus import connections

connections.connect("default",
                    uri=milvus_uri,
                    token=token)
print(f"Connecting to DB: {milvus_uri}")
6.3.3 API connection
Milvus provides an endpoint for direct HTTP requests on the cloud platform; this
is done through the corresponding window.
When Milvus is run locally, for example in Docker, the connection to the
server goes through the endpoint, and instead of an API key we provide the
username and password. Since we do not have a user and password yet, we basically
connect using
from pymilvus import connections

connections.connect(
    alias="default",
    user='',
    password='',
    host='localhost',
    port='19530'
)
7 Milvus Database Manipulation
To create a database we must already have a user and password or an API key:
from pymilvus import connections, db
conn = connections.connect(host="127.0.0.1", port=19530)
database = db.create_database("books")
To use, drop, or list databases we write:
db.use_database("books")
db.list_database()
db.drop_database("books")
8.1.1 Renaming, dropping, searching
Using the utility module of pymilvus we can easily do the rest of the operations
regarding deletion, renaming, and existence checks:
from pymilvus import utility

utility.rename_collection("old_collection", "new_collection")
utility.drop_collection("new_collection")
utility.has_collection("new_collection")  # Output: False
collection.load(replica_number=2)
• INT64: numpy.int64
• VARCHAR: VARCHAR
10.1.2 Scalar Valued types
• DYNAMIC: JSON
10.3 Data compaction
This operation reduces the storage volume occupied by the data by merging small segments; it is triggered with:
collection.compact()
11 Index manipulation
index_params = {
"metric_type":"L2",
"index_type":"IVF\_FLAT",
"params":{"nlist":1024}
}
utility.index_building_progress("book")
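The parameter dictionary above is passed to create_index; a hedged sketch of the full call is shown below, with book_intro as an assumed vector field name on the hypothetical "book" collection.

# Minimal sketch: building a vector index with the parameters defined above.
# Assumes the hypothetical "book" collection with a float-vector field "book_intro".
from pymilvus import Collection, utility

collection = Collection("book")
collection.create_index(
    field_name="book_intro",
    index_params=index_params,
)
utility.index_building_progress("book")  # check how far the index build has progressed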
11.2 metric_type
This is the type of metric used to measure the similarity of vectors. For floating-point
vectors the options are L2 (Euclidean distance) and IP (inner product), while for binary
vectors we have: JACCARD (Jaccard distance), TANIMOTO (Tanimoto distance), HAMMING
(Hamming distance), SUPERSTRUCTURE (superstructure), and SUBSTRUCTURE
(substructure).
11.3 index_type
For floating point vectors:
• FLAT (FLAT)
• IVF_FLAT (IVF_FLAT)
• IVF_SQ8 (IVF_SQ8)
• IVF_PQ (IVF_PQ)
• HNSW (HNSW)
For binary_vectors we have BIN_FLAT (BIN_FLAT) and BIN_IVF_FLAT
(BIN_IVF_FLAT)
collection.create_index(
field_name="book_name",
index_name="scalar_index",
)
collection.drop_index()  # drops the index; collection.drop() would drop the whole collection
• data=[[0.1, 0.2]]: Specifies the query vector(s) for the search operation.
In this example, a single query vector [0.1, 0.2] is provided.
• param=search_params: Specifies additional search parameters. It refers
to the search_params object that holds parameters like the distance
metric, number of probes, and offset (as explained in the previous response).
• limit=10: Defines the maximum number of search results to be returned.
In this case, the limit is set to 10, meaning the search operation will
retrieve at most 10 results.
• expr=None: Allows specifying an expression for more complex queries; in this
example it is not used, so it is set to None. In the case of a hybrid search,
the filtering condition would be added here.
• output_fields=[’title’]: Specifies the names of the fields that should be
retrieved from the search results. In this case, only the "title" field will
be returned for each matching result.
• consistency_level="Strong": Determines the consistency level of the
search operation. Setting it to "Strong" ensures that the search results
are up-to-date and consistent with the latest changes made to the collection.
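Putting these parameters together, a hedged sketch of the full search call looks as follows; the anns_field name and the search_params contents are assumptions consistent with the earlier examples, not values prescribed by this report.

# Minimal sketch of a vector search combining the parameters described above.
# Assumes the hypothetical "book" collection with vector field "book_intro" and a built index.
from pymilvus import Collection

collection = Collection("book")
collection.load()

search_params = {"metric_type": "L2", "params": {"nprobe": 10}, "offset": 0}

results = collection.search(
    data=[[0.1, 0.2]],            # query vector(s)
    anns_field="book_intro",      # vector field to search on
    param=search_params,
    limit=10,
    expr=None,                    # a boolean expression here would make this a hybrid search
    output_fields=["title"],
    consistency_level="Strong",
)
for hit in results[0]:
    print(hit.id, hit.distance)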
13 Security
from pymilvus import connections

_HOST = '127.0.0.1'
_PORT = '19530'
_ROOT = "root"
_ROOT_PASSWORD = "Milvus"
_ROLE_NAME = "test_role"
_PRIVILEGE_INSERT = "Insert"

def connect_to_milvus(db_name="default"):
    print(f"connect to milvus\n")
    connections.connect(host=_HOST,
                        port=_PORT, db_name=db_name)
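The constants above suggest role-based access control; a hedged sketch of what the rest of such a script might do (creating a user and a role, then granting the insert privilege) is given below. The exact calls follow the pymilvus RBAC API as an assumption and are not code taken from this report; they also assume authentication has been enabled on the server.

# Hedged Milvus user-security (RBAC) sketch, reusing the constants defined above.
from pymilvus import utility, Role

connect_to_milvus()

# Create a new user and a role, then grant the role the Insert privilege on a collection.
utility.create_user("reader", "SomeP@ssw0rd")          # hypothetical user and password
role = Role(_ROLE_NAME)
role.create()
role.add_user("reader")
role.grant("Collection", "book", _PRIVILEGE_INSERT)    # object type, object name, privilege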
Figure 30: IP permission setting in the cloud
• Time Travel is a feature that allows you to access historical data at any
point within a specified time period, making it possible to query, restore,
and back up past data. With Time Travel, you can search or
query data that has been deleted, restore data that has been deleted or
updated, and back up data before a specific point in time. Unlike traditional
databases that use snapshots or retain data to support the Time Travel
feature, the Milvus vector database maintains a timeline for all data
insert and delete operations and adopts a timestamp mechanism. This
means you can specify the timestamp in a search or query to retrieve
data at a specific point of time in the past, which significantly reduces maintenance
costs.
• Timestamp: in the Milvus vector database, each entity has its own
timestamp attribute. All data manipulation language (DML) operations,
including data insertion and deletion, mark entities with a timestamp.
For instance, if you insert several entities all in one go, this batch
of data is marked with timestamps and shares the same timestamp
value.
• DML operations: when the proxy receives a data insert or delete request,
it also gets a timestamp from the root coordinator. Then the proxy adds
the timestamp as an additional field to the inserted or deleted data.
Timestamp is a data field just like the primary key (pk). The timestamp
field is stored together with the other data fields of a collection. When you
load a collection into memory, all data in the collection, including the
corresponding timestamps, is loaded into memory.
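As a hedged sketch of how these timestamps are used in practice, the snippet below inserts data, keeps the timestamp returned by the insert, and passes it as travel_timestamp in a later search. The field names reuse the hypothetical "book" collection from earlier sketches, and the travel_timestamp parameter reflects the Milvus 2.2 API as an assumption.

# Hedged Time Travel sketch, assuming pymilvus 2.2 and the hypothetical "book" collection.
from pymilvus import Collection

collection = Collection("book")

# The insert result carries the timestamp assigned to this batch of entities.
insert_result = collection.insert([[3, 4], [[0.5, 0.6], [0.7, 0.8]]])
ts = insert_result.timestamp

collection.load()
results = collection.search(
    data=[[0.5, 0.6]],
    anns_field="book_intro",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
    travel_timestamp=ts,   # search the data view as of this point in time
)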
Conclusion
With the emergence of AI, vector databases have become essential tools for
managing and querying high-dimensional vector data efficiently. They address
the challenges associated with working with large-scale vector datasets and
enable the many applications that require similarity search or nearest neighbor
operations.
Milvus in particular has emerged as a powerful tool for this purpose, providing a
feature-rich vector database solution with several notable capabilities.
Resources
- https://www.cs.purdue.edu/homes/csjgwang/pubs/SIGMOD21_Milvus.pdf
- https://www.pinecone.io/learn/hnsw/
References
[1] Timothy King. 2019. 80 Percent of Your Data Will Be Unstructured in Five Years. https://solutionsreview.com/data-management/80-percent-of-your-data-will-be-unstructured-in-five-years/
[6] https://towardsdatascience.com/product-quantization-for-similarity-search-2f1f67c5fddd
[8] Michael Stonebraker and Ugur Çetintemel. 2005. "One Size Fits All":
An Idea Whose Time Has Come and Gone (Abstract). In International
Conference on Data Engineering (ICDE). 2–11.
[9] https://milvus.io/docs/architecture_overview.md