Log System As Backbone – How We Built the World’s Most Advanced Vector Database on Pulsar - Pulsar Summit Asia 2021


The document outlines the architecture and features of Milvus, an open-source vector database designed for efficient similarity searches on dense vectors. It emphasizes the importance of unstructured data processing, the use of Apache Pulsar for log storage, and the database's scalability and ease of use. Real-world use cases include applications in customer service chatbots and face recognition, showcasing Milvus's capacity for real-time data ingestion and system extensibility.


Log System as Backbone
Xiaofan Luan 2022.1
How We Built a Cloud-Native Vector Database on Pulsar

About Zilliz and me
•Open-source company behind Milvus
•Mission: Reinvent Data Science
• Graduated from Cornell University
• Partner and Director of Engineering at Zilliz
• Member of the TAC of the LF AI & Data Foundation
• Architect of Milvus Community
Career History:

CONTENTS
01 What is Vector Database
02 Architecture Overview of Milvus 2.0
03 Real-world cases

Agenda
01 What is Vector Database
02 Design Philosophy behind Milvus 2.0
03 Architecture Overview
04 Real-world cases

Unstructured data process pipeline
Unstructured data → Deep learning models → Embedding vectors → Knowledge, insight, $

Why vector database
Numbers
• Operation: arithmetic operation, number comparison
• Organization: sorted ranges (1–10 → 1–5, 6–10 → 1 2 3 4 5 6 7 8 9 10)
Vectors
• Operation: similarity (e.g. Euclidean distance), similarity comparison
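To make the contrast concrete, here is a minimal sketch of similarity comparison by brute force, using Euclidean distance. The function names and the toy 2-d vectors are illustrative assumptions and are not taken from the talk or from Milvus.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, vectors, k):
    """Brute-force search: ids of the k vectors closest to the query."""
    ranked = sorted(vectors, key=lambda vid: euclidean(query, vectors[vid]))
    return ranked[:k]

# Toy 2-d embeddings (made-up values, for illustration only).
embeddings = {
    "bread": [1.0, 1.0],
    "toast": [1.0, 2.0],
    "car": [9.0, 9.0],
}
nearest = top_k([1.0, 1.2], embeddings, k=2)
```

Unlike numbers, vectors have no total order to sort by, so this version has to scan every vector. That cost is what the index structures of a vector database are meant to avoid.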

Data structure inside vector db
• Hashing-based
• Tree-partitioning-based
• Inverted-index-based
• Graph-based
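As one example of these families, the following is a small sketch of an inverted-index-based layout: vectors are grouped by their nearest centroid, and a query scans only the closest groups. The centroids are assumed to be given (in practice they are usually learned, for example with k-means), and all names and the `nprobe` parameter are hypothetical choices for this sketch, not a description of the Milvus implementation.

```python
import math

def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
        buckets[nearest].append(vid)
    return buckets

def search_ivf(query, vectors, centroids, buckets, k, nprobe=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates = [vid for i in order[:nprobe] for vid in buckets[i]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:k]

vectors = {"a": [0.0, 0.1], "b": [0.2, 0.0], "c": [5.0, 5.1], "d": [5.2, 4.9]}
centroids = [[0.0, 0.0], [5.0, 5.0]]  # assumed given for this sketch
buckets = build_ivf(vectors, centroids)
hits = search_ivf([4.9, 5.0], vectors, centroids, buckets, k=1)
```

Scanning fewer buckets is faster but may miss a true neighbor that sits in an unscanned bucket, which is one instance of the accuracy versus performance tradeoff.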

What is Milvus
From the user’s perspective, we need an easier-to-use and more powerful database, not just a faster library.

➢ It’s a database with CRUD support
➢ Designed for efficient similarity search on dense
vectors
➢ Highly scalable and robust, performance on
demand
➢ Open source, world’s most popular vector
database

Journey of Milvus
2018.10  The Idea
2019.04  Milvus 0.1 Release
2019.10  Open Source
2020.3   Joined LF&AI
2021.3   Milvus 1.0 Release
2021.10  Milvus 2.0 RC1 Release
2022.1   Milvus 2.0 GA Release

Design space in vector database
Tradeoffs

- Consistency, Availability, Partition tolerance

- Data freshness, Query performance, Resources

- Another CAP: Cost, Accuracy, Performance

No silver bullet fits all!

Design choices Milvus takes
- Availability over consistency

- Scalability over single-node performance

- Ease of use over knob tuning

- But everything is tunable, thanks to the log backbone and microservice design

Tunable Consistency
Strong consistency: wait until all data has arrived before searching

Bounded Staleness: search without waiting unless the data is delayed beyond the allowed bound

Session: a read waits until the synced time reaches the session's write timestamp

Consistent Prefix: consume the log in order

Eventual: the persistent log ensures data is eventually consistent
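One way to picture these levels is as a single timestamp that a reader must have consumed the log up to before it may answer a search. The sketch below is an assumption made for illustration: the function names, the level strings and the `staleness` parameter are hypothetical, and Consistent Prefix is not modelled because it is a property of consuming the log in order rather than a wait condition.

```python
def guarantee_ts(level, now_ts, last_write_ts=0, staleness=0):
    """Timestamp a reader must have consumed up to before it may answer."""
    if level == "strong":
        return now_ts  # everything written so far must be visible
    if level == "bounded":
        return max(now_ts - staleness, 0)  # tolerate a bounded lag
    if level == "session":
        return last_write_ts  # the session must see its own writes
    if level == "eventual":
        return 0  # never wait; the log delivers the data eventually
    raise ValueError(f"unknown consistency level: {level}")

def can_serve(synced_ts, level, now_ts, **kwargs):
    """True if a reader synced up to synced_ts may answer the search now."""
    return synced_ts >= guarantee_ts(level, now_ts, **kwargs)

# A reader that has consumed the log up to ts=95 while the clock is at 100.
strong_ok = can_serve(95, "strong", 100)
bounded_ok = can_serve(95, "bounded", 100, staleness=10)
session_ok = can_serve(95, "session", 100, last_write_ts=90)
```

In this model the levels differ only in how far behind the reader is allowed to be, so consistency becomes a per-request choice instead of a system-wide setting.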

Unified Streaming/Batching
Data: Growing + Historical

Storage: Stream-based log storage + batch-based blob storage
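A hedged sketch of what a unified view can mean for a search: hits computed over growing (streamed) data and over historical (batch) data are merged into one top-k result. The function name and the `(distance, id)` hit format are assumptions for this illustration only.

```python
import heapq

def merge_top_k(growing_hits, historical_hits, k):
    """Merge (distance, id) hits from both sources and keep the k closest."""
    return heapq.nsmallest(k, list(growing_hits) + list(historical_hits))

growing = [(0.3, "g1"), (0.9, "g2")]     # hits from recently streamed data
historical = [(0.1, "h1"), (0.5, "h2")]  # hits from data in blob storage
merged = merge_top_k(growing, historical, k=2)
```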

Time tick mechanism
Logs are assigned to a window based on timestamp

Data can be written out of order by proxy

Time ticks trigger message pack consumption
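The three points above can be sketched as a small buffer: writes may arrive out of order, and a time tick releases every buffered message with a timestamp at or below the tick as one ordered pack. The class and method names are hypothetical, and the sketch assumes a tick is sent only once no earlier-stamped write can still arrive.

```python
class TimeTickBuffer:
    """Buffers out-of-order log entries until a time tick closes a window."""

    def __init__(self):
        self._pending = []  # (timestamp, payload) pairs in arrival order

    def write(self, ts, payload):
        """Accept an entry; arrival order need not match timestamp order."""
        self._pending.append((ts, payload))

    def on_time_tick(self, tick_ts):
        """Release all entries with ts <= tick_ts as one timestamp-ordered pack."""
        ready = sorted(
            (m for m in self._pending if m[0] <= tick_ts), key=lambda m: m[0]
        )
        self._pending = [m for m in self._pending if m[0] > tick_ts]
        return ready

buf = TimeTickBuffer()
for ts, payload in [(3, "c"), (1, "a"), (7, "x"), (2, "b")]:
    buf.write(ts, payload)
first_pack = buf.on_time_tick(5)
second_pack = buf.on_time_tick(10)
```

Consumers therefore see a consistent, ordered window per tick even though the writers did not coordinate the order of individual messages.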

Why Apache Pulsar as log storage?
➢ Tiered Storage
➢ Unlimited topic numbers
➢ Geo Replication
➢ Multi tenancy
➢ Pulsar functions
➢ Integrated with K8s and other cloud infrastructure
➢ Timely and kind community support

Users
1000+ enterprise users around the globe

Related product search and recommendations
Elasticsearch for keyword search with ASCII codes.

The code we know about these two arrays of numbers is that bread is not equal to toast.
We assume that similar contexts represent similar things, and try to compare them using mathematical methods. We could even find a way to encode whole sentences by their meaning.

Conclusion
1. Milvus can be stateless and more compatible with K8s, thanks to its use of streaming storage.

2. The system can easily be extended with other features, such as keyword search and analytic queries.

3. Vector search can serve more real-time use cases because streaming ingestion is supported.

4. And more: cross-datacenter/cloud replication, no more need to worry about complicated consensus protocols, UDF implementation…
