Log System as Backbone
Xiaofan Luan 2022.1
How We Build a Cloud-Native Vector Database on Pulsar
About Zilliz and me
• Open-source company behind Milvus
• Mission: Reinvent Data Science
• Graduated from Cornell University
• Partner and Director of Engineering at Zilliz
• Member of the TAC of the LF AI & Data Foundation
• Architect of the Milvus community
Career History:
CONTENTS
01 What is a Vector Database
02 Architecture Overview of Milvus 2.0
03 Real-world cases
Agenda
01 What is a Vector Database
02 Design Philosophy behind Milvus 2.0
03 Architecture Overview
04 Real-world cases
What is a Vector Database?
Unstructured data processing pipeline
Unstructured data → Deep learning models → Embedding vectors → Knowledge, insight, $
Why vector database
Numbers
• Operation: arithmetic operations, number comparison
• Organization: 1–10 → 1–5, 6–10 → 1 2 3 4 5 6 7 8 9 10
Vectors
• Operation: similarity (e.g. Euclidean distance), similarity comparison
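The contrast above can be made concrete with a minimal sketch of similarity comparison: instead of testing numbers for equality or order, a vector search ranks items by distance to a query. The toy 2-d embeddings and the helper names `euclidean` and `top_k` are assumptions made for illustration only, and the brute-force scan is not how a vector database actually organizes data.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, vectors, k):
    """Brute-force search: ids of the k vectors closest to the query."""
    ranked = sorted(vectors, key=lambda vid: euclidean(query, vectors[vid]))
    return ranked[:k]

# Toy 2-d embeddings; ids and values are made up for illustration.
embeddings = {
    "bread": [1.0, 1.0],
    "toast": [1.0, 2.0],
    "car": [9.0, 9.0],
}
nearest = top_k([1.0, 1.2], embeddings, k=2)
```

A full scan like this costs one distance computation per stored vector, which is why vector databases organize data with dedicated index structures.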
Data structures inside a vector database
• Hashing-based
• Tree-partitioning-based
• Inverted-index-based
• Graph-based
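As one illustration of the families listed above, here is a rough sketch of the inverted-index idea: vectors are grouped into posting lists by nearest centroid, and a query scans only the lists of the closest centroids. The function names, the hand-picked centroids, and the `nprobe` parameter name are assumptions for this sketch; it does not reproduce any particular library's implementation.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign every vector id to the posting list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe posting lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in lists[c]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:k]

# Toy data; centroids are fixed by hand here (normally learned, e.g. by k-means).
vectors = {"a": [0.0, 0.1], "b": [0.2, 0.0], "c": [5.0, 5.1], "d": [5.2, 4.9]}
centroids = [[0.0, 0.0], [5.0, 5.0]]
lists = build_ivf(vectors, centroids)
result = ivf_search([0.1, 0.1], vectors, centroids, lists, k=2, nprobe=1)
```

Raising `nprobe` scans more lists, trading query speed for recall.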
What is Milvus
From the user's perspective, we need an easier-to-use and more powerful database, not just a faster library.
➢ It's a database with CRUD support
➢ Designed for efficient similarity search on dense vectors
➢ Highly scalable and robust, with performance on demand
➢ Open source, the world's most popular vector database
Journey of Milvus
2018.10  The idea
2019.04  Milvus 0.1 release
2019.10  Open source
2020.3   Joined LF AI & Data
2021.3   Milvus 1.0 release
2021.10  Milvus 2.0 RC1 release
2022.1   Milvus 2.0 GA release
Architecture overview of Milvus 2.0
Design space in vector databases
Tradeoffs
- Consistency, Availability, Partition tolerance
- Data freshness, Query performance, Resources
- Another CAP: Cost, Accuracy, Performance
No silver bullet fits all!
Design choices Milvus makes
- Availability over consistency
- Scalability over single-node performance
- Ease of use over knob tuning
- But everything is tunable thanks to the log backbone and microservice design
Log sequence as backbone
Tunable Consistency
Strong consistency: wait for all data to arrive before searching
Bounded staleness: search right away unless data is delayed beyond the tolerated bound
Session: wait to read until the sync time reaches the write timestamp
Consistent prefix: consume the log in order
Eventual: the persistent log ensures data is eventually consistent
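One way to picture these levels is as a timestamp that a query node must have consumed from the log before it may answer. The sketch below is a simplified model under that assumption; the names `Consistency`, `guarantee_ts`, `can_search`, and `staleness` are hypothetical and are not taken from the Milvus code base. Consistent prefix is not modeled, since it follows from consuming the log in order.

```python
from enum import Enum

class Consistency(Enum):
    STRONG = "strong"
    BOUNDED = "bounded"
    SESSION = "session"
    EVENTUAL = "eventual"

def guarantee_ts(level, now_ts, last_write_ts, staleness):
    """Smallest log timestamp a node must have consumed before answering (assumed model)."""
    if level is Consistency.STRONG:
        return now_ts                      # everything written up to now
    if level is Consistency.BOUNDED:
        return max(now_ts - staleness, 0)  # tolerate a bounded lag
    if level is Consistency.SESSION:
        return last_write_ts               # read your own writes
    return 0                               # eventual: answer with whatever has arrived

def can_search(sync_ts, level, now_ts, last_write_ts=0, staleness=0):
    """True once the node's consumed position (sync_ts) covers the guarantee."""
    return sync_ts >= guarantee_ts(level, now_ts, last_write_ts, staleness)
```

For example, a node that has consumed the log up to timestamp 90 at time 100 would have to wait under strong consistency, but could answer immediately with a staleness bound of 20.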
Unified Streaming/Batching
Data: Growing + Historical
Storage: Stream-based log storage + Batch-based blob storage
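A query over such a layout has to cover both kinds of data. The following sketch shows one plausible shape: search each historical segment and the growing segment separately, then merge the partial top-k results. The segment representation (plain dicts) and the function names are assumptions for illustration, not the actual Milvus query path.

```python
import heapq

def search_segment(query, segment, k):
    """Return up to k (squared distance, id) pairs from one segment, closest first."""
    scored = [(sum((q - x) ** 2 for q, x in zip(query, vec)), vid)
              for vid, vec in segment.items()]
    return heapq.nsmallest(k, scored)

def unified_search(query, historical_segments, growing_segment, k):
    """Search batch and streaming data separately, then merge the partial results."""
    partial = [search_segment(query, seg, k) for seg in historical_segments]
    partial.append(search_segment(query, growing_segment, k))
    merged = heapq.nsmallest(k, (hit for hits in partial for hit in hits))
    return [vid for _, vid in merged]

# Toy data: one sealed segment (as if read from blob storage) and one
# growing segment (as if recently consumed from the log).
historical = [{"h1": [0.0, 0.0], "h2": [4.0, 4.0]}]
growing = {"g1": [1.0, 1.0]}
hits = unified_search([1.0, 1.0], historical, growing, k=2)
```

Because the growing segment is searched alongside the historical ones, a freshly ingested row such as `g1` is visible without waiting for a batch job.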
Elastic Component
Time tick mechanism
Logs are assigned to a window based on their timestamp
Data can be written out of order by the proxies
A time tick triggers consumption of the message pack
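The mechanism can be sketched as a small buffer on the consumer side: messages may arrive out of order, and a time tick acts as a watermark that releases everything up to its timestamp as one ordered pack. The class and method names below are hypothetical, and the sketch assumes a single producer stream for simplicity.

```python
class TimeTickBuffer:
    """Buffers out-of-order log messages and releases them window by window."""

    def __init__(self):
        self._pending = []   # (timestamp, payload) pairs not yet released
        self.last_tick = 0

    def on_message(self, ts, payload):
        """Store a message; arrival order does not have to match timestamp order."""
        self._pending.append((ts, payload))

    def on_time_tick(self, tick_ts):
        """Release all buffered messages with ts <= tick_ts, sorted by timestamp.

        Assumes the tick promises that no message with ts <= tick_ts is still in flight.
        """
        ready = sorted((m for m in self._pending if m[0] <= tick_ts), key=lambda m: m[0])
        self._pending = [m for m in self._pending if m[0] > tick_ts]
        self.last_tick = tick_ts
        return ready


buf = TimeTickBuffer()
buf.on_message(5, "insert-b")
buf.on_message(3, "insert-a")   # arrives late but carries an earlier timestamp
buf.on_message(12, "insert-c")
first_pack = buf.on_time_tick(10)
second_pack = buf.on_time_tick(20)
```

Consumers only ever see complete, ordered packs, so downstream components do not need to handle reordering themselves.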
Why Apache Pulsar as log storage?
➢ Tiered storage
➢ Unlimited number of topics
➢ Geo-replication
➢ Multi-tenancy
➢ Pulsar Functions
➢ Integrated with K8s and other cloud infrastructure
➢ Timely and friendly community support
Overall architecture of modern DB
Real-world use cases
Users
1000+ enterprise users around the globe
Related product search and recommendations
Elasticsearch for keyword search with ASCII codes.
All the code knows about these two arrays of numbers is that "bread" is not equal to "toast".
We assume that similar contexts represent similar things, and try to compare them using mathematical methods. We could even find a way to encode whole sentences by their meaning.
Chatbots in customer service
Face recognition
Conclusion
1. Milvus can be stateless and more compatible with K8s, thanks to its use of streaming storage.
2. The system can easily be extended with other features, such as keyword search and analytic queries.
3. Vector search can serve more real-time use cases because it supports streaming ingestion.
4. And more: cross-datacenter/cloud replication, no need to worry about complicated consensus protocols, UDF implementation…
THANK YOU FOR WATCHING

Log System As Backbone – How We Built the World’s Most Advanced Vector Database on Pulsar - Pulsar Summit Asia 2021
