SlideShare a Scribd company logo
Scale the Relational Database with
NewSQL
Shen Li @ PingCAP
About me and PingCAP
● Shen Li, VP of Engineering @ PingCAP
● A startup based in Beijing, China
● Round B with $15 million
● TiDB, 400+ PoC, 30+ adoptions
● We are setting up an office in the Bay Area. So we are hiring :)
Agenda
● Motivations
● The goals of TiDB
● The core components of TiDB
● The tools around TiDB
● Spark on TiKV
● Future plans
Why we build a new relational database
● RDBMS is becoming the performance bottleneck of your backend service
● The amount of data stored in the database is overwhelming
● You want to do some complex queries on a sharding cluster
○ e.g. simple JOIN or GROUP BY
● Your application needs ACID transaction on a sharding cluster
TiDB Project - Goal
● SQL is necessary
● Transparent sharding and data movement/balance
● 100% OLTP + 80% OLAP
○ Transaction + Complex query
● 24/7 availability, even in case of datacenter outages
○ Thanks to the Raft consensus algorithm
● Compatible with MySQL, in most cases
● Open source, of course.
Architecture
TiKV TiKV TiKV TiKV
Raft Raft Raft
TiDB TiDB TiDB
... ......
... ...
Placement
Driver (PD)
Control flow:
Balance / Failover
Metadata / Timestamp request
Stateless SQL Layer
Distributed Storage Layer
gRPC
gRPC
gRPCgRPC
Storage stack 1/3
● TiKV is the underlying storage layer
● Physically, data is stored in RocksDB
● We build a Raft layer on top of RocksDB
○ What is Raft?
● Written in Rust!
TiKV
API (gRPC)
Transaction
MVCC
Raft (gRPC)
RocksDB
Raw KV API
(https://github.com/pingc
ap/tidb/blob/master/cmd
/benchraw/main.go)
Transactional KV API
(https://github.com/pingcap
/tidb/blob/master/cmd/ben
chkv/main.go)
Storage Stack 2/3
Logical view of TiKV
● Key-Value storage
● Giant sorted (in byte-order) Key-Value map
● Split into regions
● Metadata: [start_key, end_key)
TiKV Key Space
[ start_key,
end_key)
(-∞, +∞)
Sorted Map
256MB
RocksDB
Instance
Region 1:[a-e]
Region 3:[k-o]
Region 5:[u-z]
...
Region 4:[p-t]
RocksDB
Instance
Region 1:[a-e]
Region 2:[f-j]
Region 4:[p-t]
...
Region 3:[k-o]
RocksDB
Instance
Region 2:[f-j]
Region 5:[u-z]
Region 3:[k-o]
... RocksDB
Instance
Region 1:[a-e]
Region 2:[f-j]
Region 5:[u-z]
...
Region 4:[p-t]
Raft group
Storage stack 3/3
● Data is organized by Regions
● Region: a set of continuous Key-Value pairs
RPC (gRPC)
Transaction
MVCC
Raft
RocksDB
···
Dynamic Multi-Raft
● What’s DynamicMulti-Raft?
○ Dynamic split / merge
● Safe split / merge
Region 1:[a-e]
split Region 1.1:[a-c]
Region 1.2:[d-e]split
Safe Split: 1/4
TiKV1
Region 1:[a-e]
TiKV2
Region 1:[a-e]
TiKV3
Region 1:[a-e]
raft raft
Leader Follower Follower
Raft group
Safe Split: 2/4
TiKV2
Region 1:[a-e]
TiKV3
Region 1:[a-e]
raft raft
Leader
Follower Follower
TiKV1
Region 1.1:[a-c]
Region 1.2:[d-e]
Safe Split: 3/4
TiKV1
Region 1.1:[a-c]
Region 1.2:[d-e]
Leader
Follower Follower
Split log (replicated by Raft)
Split log
TiKV2
Region 1:[a-e]
TiKV3
Region 1:[a-e]
Safe Split: 4/4
TiKV1
Region 1.1:[a-c]
Leader
Region 1.2:[d-e]
TiKV2
Region 1.1:[a-c]
Follower
Region 1.2:[d-e]
TiKV3
Region 1.1:[a-c]
Follower
Region 1.2:[d-e]
raft
raft
raft
raft
Region 1
Region 3
Region 1
Region 2
Scale-out (initial state)
Region 1*
Region 2 Region 2
Region 3Region 3
Node A
Node B
Node C
Node D
Region 1
Region 3
Region 1^
Region 2
Region 1*
Region 2 Region 2
Region 3
Region 3
Node A
Node B
Node E
1) Transfer leadership of region 1 from Node A to Node B
Node C
Node D
Scale-out (add new node)
Region 1
Region 3
Region 1*
Region 2
Region 2 Region 2
Region 3
Region 1
Region 3
Node A
Node B
2) Add Replica to Node E
Node C
Node D
Node E
Region 1
Scale-out (balancing)
Region 1
Region 3
Region 1*
Region 2
Region 2 Region 2
Region 3
Region 1
Region 3
Node A
Node B
3) Remove Replica from Node A
Node C
Node D
Node E
Scale-out (balancing)
ACID Transaction
● Based on Google Percolator
● ‘Almost’ decentralized 2-phase commit
○ Timestamp Allocator
● Optimistic transaction model
● Default isolation level: Snapshot Isolation
● We also support RC Isolation
Something we haven't mentioned
Now, we have a distributed, transactional, auto-scalable
key-value storage.
● Timestamp allocator
● Metadata storage
● Balance decision
Here comes the Placement Driver (PD for short)
Placement Driver
The brain of the TiKV cluster
●Timestamp allocator
●Metadata storage
●Replica scheduling PD PDPD
Raft Raft
etcd
Embedded
Scheduling Strategy
Region A
Region B
Node 1
Node 2
PD
Scheduling
Strategy
Cluster
Info
Admin
HeartBeat
Scheduling
Command
Region C
Config
Movement
The SQL Layer
● Mapping relational model to Key-Value model
● Full-featured SQL layer
● Cost-based optimizer (CBO)
● Distributed execution engine
SQL to Key-Value
● Row
Key: TableID + RowID
Value: Row Value
●Index
Key: TableID + IndexID + Index-Column-Values
Value: RowID
CREATE TABLE `t` (`id` int, `age` int, key
`age_idx` (`age`));
INSERT INTO `t` VALUES (100, 35);
K1
K2
100, 35
K1
TiKV
Encoded Keys:
K1: tid + rowid
K2: tid + idxid + 35
SQL Layer Overview
What happens behind a query
CREATE TABLE t (c1 INT, c2 TEXT, KEY idx_c1(c1));
SELECT COUNT(c1) FROM t WHERE c1 > 10 AND c2 = ‘seattle’;
Query Plan
Partial Aggregate
COUNT(c1)
Filter
c2 = “seattle”
Read Index
idx1: (10, +∞)
Physical Plan on TiKV (index scan)
Read Row Data
by RowID
RowID
Row
Row
Final Aggregate
SUM(COUNT(c1))
DistSQL Scan
Physical Plan on TiDB
COUNT(c1)
COUNT(c1)
TiKV
TiKV
TiKV
COUNT(c1)
COUNT(c1)
SELECT COUNT(c1) FROM t WHERE c1 > 10 AND c2 = ‘seattle’;
What happens behind a query
CREATE TABLE t1(id INT, email TEXT,KEY idx_id(id));
CREATE TABLE t2(id INT, email TEXT, KEY idx_id(id));
SELECT * FROM t1 join t2 WHERE t1.id = t2.id;
Hash Join Operator
Supported Join Operators
● Hash Join
● Sort merge Join
● Index-lookup Join
Cost-Based Optimizer
● Predicate Pushdown
● Column Pruning
● Eager Aggregate
● Convert Subquery to Join
● Statistics framework
● CBO Framework
○ Index Selection
○ Join Operator Selection
○ Stream Operators VS Hash Operators
Tools matter
● Syncer
● TiDB-Binlog
● Mydumper/MyLoader(loader)
Syncer
● Synchronize data from MySQL in real-time
● Hook up as a MySQL replica
MySQL
(master)
Syncer
Save Point
(disk)
Rule Filter
MySQL
TiDB Cluster
TiDB Cluster
TiDB Cluster
Syncer
Syncerbinlog
Fake slave
Syncer
or
TiDB-Binlog
TiDB Server
TiDB Server Sorter
Pumper
Pumper
TiDB Server
Pumper
Protobuf
MySQL Binlog
MySQL
3rd party applicationsCistern
● Subscribe the incremental data from TiDB
● Output Protobuf formatted data or MySQL Binlog format(WIP)
Another TiDB-Cluster
MyDumper / Loader
● Backup/restore in parallel
● Works for TiDB too
● Actually, we don’t have our own data migration tool for now
Spark on TiKV
● TiSpark = Spark SQL on TiKV
o Spark SQL directly on top of a distributed Database Storage engine
o Two extension points for Spark SQL Internal: Extra Optimizer Rules
and Extra Strategies
o Hijack Spark SQL logical plan and inject our own physical executor
● Hybrid Transactional/Analytical Processing(HTAP) rocks
o Provide strong OLAP capacity together with TiDB
Spark on TiKV
TiDB
TiDB
Worker
Spark
Driver
TiKV Cluster (Storage)
Metadata
TiKV TiKV
TiKV
Application
Syncer
Data location
Job
TiSpark
DistSQL API
TiKV
TiDB
TSO/Data location
Worker
Worker
Spark Cluster
TiDB Cluster
TiDB
... ...
...
DistSQL API
P
D
P
D
P
D
PD Cluster
TiKV TiKV
TiDB
Spark on TiKV
● The TiKV Connector is better than the JDBC connector
● Index support
● Complex Calculation Pushdown
● CBO
o Pick up right Access Path
o Join Reorder
● Priority & Isolation Level
Future plans
● Shift from Pre-GA to GA
● Better optimizer (Statistic && CBO)
● Smarter scheduling mechanism
● Document store for TiDB
○ MySQL 5.7.12+ X-Plugin
● Integrate TiDB with Kubernetes
Thanks
https://github.com/pingcap/tidb
https://github.com/pingcap/tikv
Contact me:
shenli@pingcap.com
Ad

Recommended

How to build TiDB
How to build TiDB
PingCAP
 
A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)
PingCAP
 
Rust in TiKV
Rust in TiKV
PingCAP
 
Building a transactional key-value store that scales to 100+ nodes (percona l...
Building a transactional key-value store that scales to 100+ nodes (percona l...
PingCAP
 
TiDB as an HTAP Database
TiDB as an HTAP Database
PingCAP
 
TiDB for Big Data
TiDB for Big Data
PingCAP
 
Golang in TiDB (GopherChina 2017)
Golang in TiDB (GopherChina 2017)
PingCAP
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
PingCAP
 
TiDB DevCon 2020 Opening Keynote
TiDB DevCon 2020 Opening Keynote
PingCAP
 
TiDB Introduction
TiDB Introduction
Morgan Tocker
 
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Kevin Xu
 
TiDB Introduction - San Francisco MySQL Meetup
TiDB Introduction - San Francisco MySQL Meetup
Morgan Tocker
 
Introducing TiDB @ SF DevOps Meetup
Introducing TiDB @ SF DevOps Meetup
Kevin Xu
 
Scylla Summit 2022: Overcoming the Performance Cost of Streaming Transactions
Scylla Summit 2022: Overcoming the Performance Cost of Streaming Transactions
ScyllaDB
 
Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...
Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...
Flink Forward
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
ScyllaDB
 
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Taro L. Saito
 
Introducing TiDB - Percona Live Frankfurt
Introducing TiDB - Percona Live Frankfurt
Morgan Tocker
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Taro L. Saito
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Taiwan User Group
 
Best Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise Cluster
InfluxData
 
Intro to InfluxDB 2.0 and Your First Flux Query by Sonia Gupta
Intro to InfluxDB 2.0 and Your First Flux Query by Sonia Gupta
InfluxData
 
Airframe RPC
Airframe RPC
Taro L. Saito
 
InfluxDB 2.0: Dashboarding 101 by David G. Simmons
InfluxDB 2.0: Dashboarding 101 by David G. Simmons
InfluxData
 
Stream Loops on Flink - Reinventing the wheel for the streaming era
Stream Loops on Flink - Reinventing the wheel for the streaming era
Paris Carbone
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
Ryan Blue
 
Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]
Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]
Kevin Xu
 
TiDB Introduction - Boston MySQL Meetup Group
TiDB Introduction - Boston MySQL Meetup Group
Morgan Tocker
 

More Related Content

What's hot (20)

TiDB DevCon 2020 Opening Keynote
TiDB DevCon 2020 Opening Keynote
PingCAP
 
TiDB Introduction
TiDB Introduction
Morgan Tocker
 
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Kevin Xu
 
TiDB Introduction - San Francisco MySQL Meetup
TiDB Introduction - San Francisco MySQL Meetup
Morgan Tocker
 
Introducing TiDB @ SF DevOps Meetup
Introducing TiDB @ SF DevOps Meetup
Kevin Xu
 
Scylla Summit 2022: Overcoming the Performance Cost of Streaming Transactions
Scylla Summit 2022: Overcoming the Performance Cost of Streaming Transactions
ScyllaDB
 
Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...
Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...
Flink Forward
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
ScyllaDB
 
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Taro L. Saito
 
Introducing TiDB - Percona Live Frankfurt
Introducing TiDB - Percona Live Frankfurt
Morgan Tocker
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Taro L. Saito
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Taiwan User Group
 
Best Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise Cluster
InfluxData
 
Intro to InfluxDB 2.0 and Your First Flux Query by Sonia Gupta
Intro to InfluxDB 2.0 and Your First Flux Query by Sonia Gupta
InfluxData
 
Airframe RPC
Airframe RPC
Taro L. Saito
 
InfluxDB 2.0: Dashboarding 101 by David G. Simmons
InfluxDB 2.0: Dashboarding 101 by David G. Simmons
InfluxData
 
Stream Loops on Flink - Reinventing the wheel for the streaming era
Stream Loops on Flink - Reinventing the wheel for the streaming era
Paris Carbone
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
Ryan Blue
 
TiDB DevCon 2020 Opening Keynote
TiDB DevCon 2020 Opening Keynote
PingCAP
 
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Kevin Xu
 
TiDB Introduction - San Francisco MySQL Meetup
TiDB Introduction - San Francisco MySQL Meetup
Morgan Tocker
 
Introducing TiDB @ SF DevOps Meetup
Introducing TiDB @ SF DevOps Meetup
Kevin Xu
 
Scylla Summit 2022: Overcoming the Performance Cost of Streaming Transactions
Scylla Summit 2022: Overcoming the Performance Cost of Streaming Transactions
ScyllaDB
 
Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...
Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...
Flink Forward
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
ScyllaDB
 
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Taro L. Saito
 
Introducing TiDB - Percona Live Frankfurt
Introducing TiDB - Percona Live Frankfurt
Morgan Tocker
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Scala for Everything: From Frontend to Backend Applications - Scala Matsuri 2020
Taro L. Saito
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Taiwan User Group
 
Best Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise Cluster
InfluxData
 
Intro to InfluxDB 2.0 and Your First Flux Query by Sonia Gupta
Intro to InfluxDB 2.0 and Your First Flux Query by Sonia Gupta
InfluxData
 
InfluxDB 2.0: Dashboarding 101 by David G. Simmons
InfluxDB 2.0: Dashboarding 101 by David G. Simmons
InfluxData
 
Stream Loops on Flink - Reinventing the wheel for the streaming era
Stream Loops on Flink - Reinventing the wheel for the streaming era
Paris Carbone
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
Ryan Blue
 

Similar to Scale Relational Database with NewSQL (20)

Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]
Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]
Kevin Xu
 
TiDB Introduction - Boston MySQL Meetup Group
TiDB Introduction - Boston MySQL Meetup Group
Morgan Tocker
 
Introducing TiDB [Delivered: 09/25/18 at Portland Cloud Native Meetup]
Introducing TiDB [Delivered: 09/25/18 at Portland Cloud Native Meetup]
Kevin Xu
 
TiDB vs Aurora.pdf
TiDB vs Aurora.pdf
ssuser3fb50b
 
FOSDEM MySQL and Friends Devroom
FOSDEM MySQL and Friends Devroom
Morgan Tocker
 
Introducing TiDB Operator [Cologne, Germany]
Introducing TiDB Operator [Cologne, Germany]
Kevin Xu
 
Data-at-scale-with-TIDB Mydbops Co-Founder Kabilesh PR at LSPE Event
Data-at-scale-with-TIDB Mydbops Co-Founder Kabilesh PR at LSPE Event
Mydbops
 
TiDB + Mobike by Kevin Xu (@kevinsxu)
TiDB + Mobike by Kevin Xu (@kevinsxu)
Kevin Xu
 
Keynote -- Percona Live Europe 2018
Keynote -- Percona Live Europe 2018
Kevin Xu
 
TiDB in a Nutshell - Power of Open-Source Distributed SQL Database - Mydbops
TiDB in a Nutshell - Power of Open-Source Distributed SQL Database - Mydbops
Mydbops
 
"Smooth Operator" [Bay Area NewSQL meetup]
"Smooth Operator" [Bay Area NewSQL meetup]
Kevin Xu
 
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Mydbops
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
Patrick McFadin
 
Purpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Purpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Data Con LA
 
SQL, NoSQL, Distributed SQL: Choose your DataStore carefully
SQL, NoSQL, Distributed SQL: Choose your DataStore carefully
Md Kamaruzzaman
 
Renegotiating the boundary between database latency and consistency
Renegotiating the boundary between database latency and consistency
ScyllaDB
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu Ma
Databricks
 
Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUG
Stu Hood
 
Outside The Box With Apache Cassnadra
Outside The Box With Apache Cassnadra
Eric Evans
 
The Future of Fast Databases: Lessons from a Decade of QuestDB
The Future of Fast Databases: Lessons from a Decade of QuestDB
javier ramirez
 
Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]
Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]
Kevin Xu
 
TiDB Introduction - Boston MySQL Meetup Group
TiDB Introduction - Boston MySQL Meetup Group
Morgan Tocker
 
Introducing TiDB [Delivered: 09/25/18 at Portland Cloud Native Meetup]
Introducing TiDB [Delivered: 09/25/18 at Portland Cloud Native Meetup]
Kevin Xu
 
TiDB vs Aurora.pdf
TiDB vs Aurora.pdf
ssuser3fb50b
 
FOSDEM MySQL and Friends Devroom
FOSDEM MySQL and Friends Devroom
Morgan Tocker
 
Introducing TiDB Operator [Cologne, Germany]
Introducing TiDB Operator [Cologne, Germany]
Kevin Xu
 
Data-at-scale-with-TIDB Mydbops Co-Founder Kabilesh PR at LSPE Event
Data-at-scale-with-TIDB Mydbops Co-Founder Kabilesh PR at LSPE Event
Mydbops
 
TiDB + Mobike by Kevin Xu (@kevinsxu)
TiDB + Mobike by Kevin Xu (@kevinsxu)
Kevin Xu
 
Keynote -- Percona Live Europe 2018
Keynote -- Percona Live Europe 2018
Kevin Xu
 
TiDB in a Nutshell - Power of Open-Source Distributed SQL Database - Mydbops
TiDB in a Nutshell - Power of Open-Source Distributed SQL Database - Mydbops
Mydbops
 
"Smooth Operator" [Bay Area NewSQL meetup]
"Smooth Operator" [Bay Area NewSQL meetup]
Kevin Xu
 
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Mydbops
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
Patrick McFadin
 
Purpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Purpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Data Con LA
 
SQL, NoSQL, Distributed SQL: Choose your DataStore carefully
SQL, NoSQL, Distributed SQL: Choose your DataStore carefully
Md Kamaruzzaman
 
Renegotiating the boundary between database latency and consistency
Renegotiating the boundary between database latency and consistency
ScyllaDB
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu Ma
Databricks
 
Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUG
Stu Hood
 
Outside The Box With Apache Cassnadra
Outside The Box With Apache Cassnadra
Eric Evans
 
The Future of Fast Databases: Lessons from a Decade of QuestDB
The Future of Fast Databases: Lessons from a Decade of QuestDB
javier ramirez
 
Ad

More from PingCAP (20)

[Paper Reading] Efficient Query Processing with Optimistically Compressed Has...
[Paper Reading] Efficient Query Processing with Optimistically Compressed Has...
PingCAP
 
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
PingCAP
 
[Paper Reading]KVSSD: Close integration of LSM trees and flash translation la...
[Paper Reading]KVSSD: Close integration of LSM trees and flash translation la...
PingCAP
 
[Paper Reading]Chucky: A Succinct Cuckoo Filter for LSM-Tree
[Paper Reading]Chucky: A Succinct Cuckoo Filter for LSM-Tree
PingCAP
 
[Paper Reading]The Bw-Tree: A B-tree for New Hardware Platforms
[Paper Reading]The Bw-Tree: A B-tree for New Hardware Platforms
PingCAP
 
[Paper Reading] QAGen: Generating query-aware test databases
[Paper Reading] QAGen: Generating query-aware test databases
PingCAP
 
[Paper Reading] Leases: An Efficient Fault-Tolerant Mechanism for Distribute...
[Paper Reading] Leases: An Efficient Fault-Tolerant Mechanism for Distribute...
PingCAP
 
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
PingCAP
 
[Paperreading] Paxos made easy (by sen han)
[Paperreading] Paxos made easy (by sen han)
PingCAP
 
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
PingCAP
 
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
PingCAP
 
Finding Logic Bugs in Database Management Systems
Finding Logic Bugs in Database Management Systems
PingCAP
 
Chaos Practice in PingCAP
Chaos Practice in PingCAP
PingCAP
 
TiDB at PayPay
TiDB at PayPay
PingCAP
 
Paper Reading: FPTree
Paper Reading: FPTree
PingCAP
 
Paper Reading: Smooth Scan
Paper Reading: Smooth Scan
PingCAP
 
Paper Reading: Flexible Paxos
Paper Reading: Flexible Paxos
PingCAP
 
Paper reading: Cost-based Query Transformation in Oracle
Paper reading: Cost-based Query Transformation in Oracle
PingCAP
 
Paper reading: HashKV and beyond
Paper reading: HashKV and beyond
PingCAP
 
Paper Reading: Pessimistic Cardinality Estimation
Paper Reading: Pessimistic Cardinality Estimation
PingCAP
 
[Paper Reading] Efficient Query Processing with Optimistically Compressed Has...
[Paper Reading] Efficient Query Processing with Optimistically Compressed Has...
PingCAP
 
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
PingCAP
 
[Paper Reading]KVSSD: Close integration of LSM trees and flash translation la...
[Paper Reading]KVSSD: Close integration of LSM trees and flash translation la...
PingCAP
 
[Paper Reading]Chucky: A Succinct Cuckoo Filter for LSM-Tree
[Paper Reading]Chucky: A Succinct Cuckoo Filter for LSM-Tree
PingCAP
 
[Paper Reading]The Bw-Tree: A B-tree for New Hardware Platforms
[Paper Reading]The Bw-Tree: A B-tree for New Hardware Platforms
PingCAP
 
[Paper Reading] QAGen: Generating query-aware test databases
[Paper Reading] QAGen: Generating query-aware test databases
PingCAP
 
[Paper Reading] Leases: An Efficient Fault-Tolerant Mechanism for Distribute...
[Paper Reading] Leases: An Efficient Fault-Tolerant Mechanism for Distribute...
PingCAP
 
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
PingCAP
 
[Paperreading] Paxos made easy (by sen han)
[Paperreading] Paxos made easy (by sen han)
PingCAP
 
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
[Paper Reading] Generalized Sub-Query Fusion for Eliminating Redundant I/O fr...
PingCAP
 
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl...
PingCAP
 
Finding Logic Bugs in Database Management Systems
Finding Logic Bugs in Database Management Systems
PingCAP
 
Chaos Practice in PingCAP
Chaos Practice in PingCAP
PingCAP
 
TiDB at PayPay
TiDB at PayPay
PingCAP
 
Paper Reading: FPTree
Paper Reading: FPTree
PingCAP
 
Paper Reading: Smooth Scan
Paper Reading: Smooth Scan
PingCAP
 
Paper Reading: Flexible Paxos
Paper Reading: Flexible Paxos
PingCAP
 
Paper reading: Cost-based Query Transformation in Oracle
Paper reading: Cost-based Query Transformation in Oracle
PingCAP
 
Paper reading: HashKV and beyond
Paper reading: HashKV and beyond
PingCAP
 
Paper Reading: Pessimistic Cardinality Estimation
Paper Reading: Pessimistic Cardinality Estimation
PingCAP
 
Ad

Recently uploaded (20)

Streamlining CI/CD with FME Flow: A Practical Guide
Streamlining CI/CD with FME Flow: A Practical Guide
Safe Software
 
declaration of Variables and constants.pptx
declaration of Variables and constants.pptx
meemee7378
 
Top Time Tracking Solutions for Accountants
Top Time Tracking Solutions for Accountants
oliviareed320
 
AI for PV: Development and Governance for a Regulated Industry
AI for PV: Development and Governance for a Regulated Industry
Biologit
 
Building Geospatial Data Warehouse for GIS by GIS with FME
Building Geospatial Data Warehouse for GIS by GIS with FME
Safe Software
 
Heat Treatment Process Automation in India
Heat Treatment Process Automation in India
Reckers Mechatronics
 
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
arabelatso
 
Folding Cheat Sheet # 9 - List Unfolding 𝑢𝑛𝑓𝑜𝑙𝑑 as the Computational Dual of ...
Folding Cheat Sheet # 9 - List Unfolding 𝑢𝑛𝑓𝑜𝑙𝑑 as the Computational Dual of ...
Philip Schwarz
 
Microsoft-365-Administrator-s-Guide1.pdf
Microsoft-365-Administrator-s-Guide1.pdf
mazharatknl
 
Why Every Growing Business Needs a Staff Augmentation Company IN USA.pdf
Why Every Growing Business Needs a Staff Augmentation Company IN USA.pdf
mary rojas
 
Digital Transformation: Automating the Placement of Medical Interns
Digital Transformation: Automating the Placement of Medical Interns
Safe Software
 
ERP Systems in the UAE: Driving Business Transformation with Smart Solutions
ERP Systems in the UAE: Driving Business Transformation with Smart Solutions
dheeodoo
 
Humans vs AI Call Agents - Qcall.ai's Special Report
Humans vs AI Call Agents - Qcall.ai's Special Report
Udit Goenka
 
OpenChain Webinar - AboutCode - Practical Compliance in One Stack – Licensing...
OpenChain Webinar - AboutCode - Practical Compliance in One Stack – Licensing...
Shane Coughlan
 
Foundations of Marketo Engage - Programs, Campaigns & Beyond - June 2025
Foundations of Marketo Engage - Programs, Campaigns & Beyond - June 2025
BradBedford3
 
How Automation in Claims Handling Streamlined Operations
How Automation in Claims Handling Streamlined Operations
Insurance Tech Services
 
Best AI-Powered Wearable Tech for Remote Health Monitoring in 2025
Best AI-Powered Wearable Tech for Remote Health Monitoring in 2025
SEOLIFT - SEO Company London
 
Download Adobe Illustrator Crack free for Windows 2025?
Download Adobe Illustrator Crack free for Windows 2025?
grete1122g
 
Why Edge Computing Matters in Mobile Application Tech.pdf
Why Edge Computing Matters in Mobile Application Tech.pdf
IMG Global Infotech
 
On-Device AI: Is It Time to Go All-In, or Do We Still Need the Cloud?
On-Device AI: Is It Time to Go All-In, or Do We Still Need the Cloud?
Hassan Abid
 
Streamlining CI/CD with FME Flow: A Practical Guide
Streamlining CI/CD with FME Flow: A Practical Guide
Safe Software
 
declaration of Variables and constants.pptx
declaration of Variables and constants.pptx
meemee7378
 
Top Time Tracking Solutions for Accountants
Top Time Tracking Solutions for Accountants
oliviareed320
 
AI for PV: Development and Governance for a Regulated Industry
AI for PV: Development and Governance for a Regulated Industry
Biologit
 
Building Geospatial Data Warehouse for GIS by GIS with FME
Building Geospatial Data Warehouse for GIS by GIS with FME
Safe Software
 
Heat Treatment Process Automation in India
Heat Treatment Process Automation in India
Reckers Mechatronics
 
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
arabelatso
 
Folding Cheat Sheet # 9 - List Unfolding 𝑢𝑛𝑓𝑜𝑙𝑑 as the Computational Dual of ...
Folding Cheat Sheet # 9 - List Unfolding 𝑢𝑛𝑓𝑜𝑙𝑑 as the Computational Dual of ...
Philip Schwarz
 
Microsoft-365-Administrator-s-Guide1.pdf
Microsoft-365-Administrator-s-Guide1.pdf
mazharatknl
 
Why Every Growing Business Needs a Staff Augmentation Company IN USA.pdf
Why Every Growing Business Needs a Staff Augmentation Company IN USA.pdf
mary rojas
 
Digital Transformation: Automating the Placement of Medical Interns
Digital Transformation: Automating the Placement of Medical Interns
Safe Software
 
ERP Systems in the UAE: Driving Business Transformation with Smart Solutions
ERP Systems in the UAE: Driving Business Transformation with Smart Solutions
dheeodoo
 
Humans vs AI Call Agents - Qcall.ai's Special Report
Humans vs AI Call Agents - Qcall.ai's Special Report
Udit Goenka
 
OpenChain Webinar - AboutCode - Practical Compliance in One Stack – Licensing...
OpenChain Webinar - AboutCode - Practical Compliance in One Stack – Licensing...
Shane Coughlan
 
Foundations of Marketo Engage - Programs, Campaigns & Beyond - June 2025
Foundations of Marketo Engage - Programs, Campaigns & Beyond - June 2025
BradBedford3
 
How Automation in Claims Handling Streamlined Operations
How Automation in Claims Handling Streamlined Operations
Insurance Tech Services
 
Best AI-Powered Wearable Tech for Remote Health Monitoring in 2025
Best AI-Powered Wearable Tech for Remote Health Monitoring in 2025
SEOLIFT - SEO Company London
 
Download Adobe Illustrator Crack free for Windows 2025?
Download Adobe Illustrator Crack free for Windows 2025?
grete1122g
 
Why Edge Computing Matters in Mobile Application Tech.pdf
Why Edge Computing Matters in Mobile Application Tech.pdf
IMG Global Infotech
 
On-Device AI: Is It Time to Go All-In, or Do We Still Need the Cloud?
On-Device AI: Is It Time to Go All-In, or Do We Still Need the Cloud?
Hassan Abid
 

Scale Relational Database with NewSQL

  • 1. Scale the Relational Database with NewSQL Shen Li @ PingCAP
  • 2. About me and PingCAP ● Shen Li, VP of Engineering @ PingCAP ● A startup based in Beijing, China ● Round B with $15 million ● TiDB, 400+ PoC, 30+ adoptions ● We are setting up an office in the Bay Area. So we are hiring :)
  • 3. Agenda ● Motivations ● The goals of TiDB ● The core components of TiDB ● The tools around TiDB ● Spark on TiKV ● Future plans
  • 4. Why we build a new relational database ● RDBMS is becoming the performance bottleneck of your backend service ● The amount of data stored in the database is overwhelming ● You want to do some complex queries on a sharding cluster ○ e.g. simple JOIN or GROUP BY ● Your application needs ACID transaction on a sharding cluster
  • 5. TiDB Project - Goal ● SQL is necessary ● Transparent sharding and data movement/balance ● 100% OLTP + 80% OLAP ○ Transaction + Complex query ● 24/7 availability, even in case of datacenter outages ○ Thanks to the Raft consensus algorithm ● Compatible with MySQL, in most cases ● Open source, of course.
  • 6. Architecture TiKV TiKV TiKV TiKV Raft Raft Raft TiDB TiDB TiDB ... ...... ... ... Placement Driver (PD) Control flow: Balance / Failover Metadata / Timestamp request Stateless SQL Layer Distributed Storage Layer gRPC gRPC gRPCgRPC
  • 7. Storage stack 1/3 ● TiKV is the underlying storage layer ● Physically, data is stored in RocksDB ● We build a Raft layer on top of RocksDB ○ What is Raft? ● Written in Rust! TiKV API (gRPC) Transaction MVCC Raft (gRPC) RocksDB Raw KV API (https://github.com/pingc ap/tidb/blob/master/cmd /benchraw/main.go) Transactional KV API (https://github.com/pingcap /tidb/blob/master/cmd/ben chkv/main.go)
  • 8. Storage Stack 2/3 Logical view of TiKV ● Key-Value storage ● Giant sorted (in byte-order) Key-Value map ● Split into regions ● Metadata: [start_key, end_key) TiKV Key Space [ start_key, end_key) (-∞, +∞) Sorted Map 256MB
  • 9. RocksDB Instance Region 1:[a-e] Region 3:[k-o] Region 5:[u-z] ... Region 4:[p-t] RocksDB Instance Region 1:[a-e] Region 2:[f-j] Region 4:[p-t] ... Region 3:[k-o] RocksDB Instance Region 2:[f-j] Region 5:[u-z] Region 3:[k-o] ... RocksDB Instance Region 1:[a-e] Region 2:[f-j] Region 5:[u-z] ... Region 4:[p-t] Raft group Storage stack 3/3 ● Data is organized by Regions ● Region: a set of continuous Key-Value pairs RPC (gRPC) Transaction MVCC Raft RocksDB ···
  • 10. Dynamic Multi-Raft ● What’s DynamicMulti-Raft? ○ Dynamic split / merge ● Safe split / merge Region 1:[a-e] split Region 1.1:[a-c] Region 1.2:[d-e]split
  • 11. Safe Split: 1/4 TiKV1 Region 1:[a-e] TiKV2 Region 1:[a-e] TiKV3 Region 1:[a-e] raft raft Leader Follower Follower Raft group
  • 12. Safe Split: 2/4 TiKV2 Region 1:[a-e] TiKV3 Region 1:[a-e] raft raft Leader Follower Follower TiKV1 Region 1.1:[a-c] Region 1.2:[d-e]
  • 13. Safe Split: 3/4 TiKV1 Region 1.1:[a-c] Region 1.2:[d-e] Leader Follower Follower Split log (replicated by Raft) Split log TiKV2 Region 1:[a-e] TiKV3 Region 1:[a-e]
  • 14. Safe Split: 4/4 TiKV1 Region 1.1:[a-c] Leader Region 1.2:[d-e] TiKV2 Region 1.1:[a-c] Follower Region 1.2:[d-e] TiKV3 Region 1.1:[a-c] Follower Region 1.2:[d-e] raft raft raft raft
  • 15. Region 1 Region 3 Region 1 Region 2 Scale-out (initial state) Region 1* Region 2 Region 2 Region 3Region 3 Node A Node B Node C Node D
  • 16. Region 1 Region 3 Region 1^ Region 2 Region 1* Region 2 Region 2 Region 3 Region 3 Node A Node B Node E 1) Transfer leadership of region 1 from Node A to Node B Node C Node D Scale-out (add new node)
  • 17. Region 1 Region 3 Region 1* Region 2 Region 2 Region 2 Region 3 Region 1 Region 3 Node A Node B 2) Add Replica to Node E Node C Node D Node E Region 1 Scale-out (balancing)
  • 18. Region 1 Region 3 Region 1* Region 2 Region 2 Region 2 Region 3 Region 1 Region 3 Node A Node B 3) Remove Replica from Node A Node C Node D Node E Scale-out (balancing)
  • 19. ACID Transaction ● Based on Google Percolator ● ‘Almost’ decentralized 2-phase commit ○ Timestamp Allocator ● Optimistic transaction model ● Default isolation level: Snapshot Isolation ● We also support RC Isolation
  • 20. Something we haven't mentioned Now, we have a distributed, transactional, auto-scalable key-value storage. ● Timestamp allocator ● Metadata storage ● Balance decision Here comes the Placement Driver (PD for short)
  • 21. Placement Driver The brain of the TiKV cluster ●Timestamp allocator ●Metadata storage ●Replica scheduling PD PDPD Raft Raft etcd Embedded
  • 22. Scheduling Strategy Region A Region B Node 1 Node 2 PD Scheduling Strategy Cluster Info Admin HeartBeat Scheduling Command Region C Config Movement
  • 23. The SQL Layer ● Mapping relational model to Key-Value model ● Full-featured SQL layer ● Cost-based optimizer (CBO) ● Distributed execution engine
  • 24. SQL to Key-Value ● Row Key: TableID + RowID Value: Row Value ●Index Key: TableID + IndexID + Index-Column-Values Value: RowID CREATE TABLE `t` (`id` int, `age` int, key `age_idx` (`age`)); INSERT INTO `t` VALUES (100, 35); K1 K2 100, 35 K1 TiKV Encoded Keys: K1: tid + rowid K2: tid + idxid + 35
  • 26. What happens behind a query CREATE TABLE t (c1 INT, c2 TEXT, KEY idx_c1(c1)); SELECT COUNT(c1) FROM t WHERE c1 > 10 AND c2 = ‘seattle’;
  • 27. Query Plan Partial Aggregate COUNT(c1) Filter c2 = “seattle” Read Index idx1: (10, +∞) Physical Plan on TiKV (index scan) Read Row Data by RowID RowID Row Row Final Aggregate SUM(COUNT(c1)) DistSQL Scan Physical Plan on TiDB COUNT(c1) COUNT(c1) TiKV TiKV TiKV COUNT(c1) COUNT(c1) SELECT COUNT(c1) FROM t WHERE c1 > 10 AND c2 = ‘seattle’;
  • 28. What happens behind a query CREATE TABLE t1(id INT, email TEXT,KEY idx_id(id)); CREATE TABLE t2(id INT, email TEXT, KEY idx_id(id)); SELECT * FROM t1 join t2 WHERE t1.id = t2.id;
  • 30. Supported Join Operators ● Hash Join ● Sort merge Join ● Index-lookup Join
  • 31. Cost-Based Optimizer ● Predicate Pushdown ● Column Pruning ● Eager Aggregate ● Convert Subquery to Join ● Statistics framework ● CBO Framework ○ Index Selection ○ Join Operator Selection ○ Stream Operators VS Hash Operators
  • 32. Tools matter ● Syncer ● TiDB-Binlog ● Mydumper/MyLoader(loader)
  • 33. Syncer ● Synchronize data from MySQL in real-time ● Hook up as a MySQL replica MySQL (master) Syncer Save Point (disk) Rule Filter MySQL TiDB Cluster TiDB Cluster TiDB Cluster Syncer Syncerbinlog Fake slave Syncer or
  • 34. TiDB-Binlog TiDB Server TiDB Server Sorter Pumper Pumper TiDB Server Pumper Protobuf MySQL Binlog MySQL 3rd party applicationsCistern ● Subscribe the incremental data from TiDB ● Output Protobuf formatted data or MySQL Binlog format(WIP) Another TiDB-Cluster
  • 35. MyDumper / Loader ● Backup/restore in parallel ● Works for TiDB too ● Actually, we don’t have our own data migration tool for now
  • 36. Spark on TiKV ● TiSpark = Spark SQL on TiKV o Spark SQL directly on top of a distributed Database Storage engine o Two extension points for Spark SQL Internal: Extra Optimizer Rules and Extra Strategies o Hijack Spark SQL logical plan and inject our own physical executor ● Hybrid Transactional/Analytical Processing(HTAP) rocks o Provide strong OLAP capacity together with TiDB
  • 37. Spark on TiKV TiDB TiDB Worker Spark Driver TiKV Cluster (Storage) Metadata TiKV TiKV TiKV Application Syncer Data location Job TiSpark DistSQL API TiKV TiDB TSO/Data location Worker Worker Spark Cluster TiDB Cluster TiDB ... ... ... DistSQL API P D P D P D PD Cluster TiKV TiKV TiDB
  • 38. Spark on TiKV ● The TiKV Connector is better than the JDBC connector ● Index support ● Complex Calculation Pushdown ● CBO o Pick up right Access Path o Join Reorder ● Priority & Isolation Level
  • 39. Future plans ● Shift from Pre-GA to GA ● Better optimizer (Statistic && CBO) ● Smarter scheduling mechanism ● Document store for TiDB ○ MySQL 5.7.12+ X-Plugin ● Integrate TiDB with Kubernetes