Impacts of Sharding, Partitioning, Encoding, and
Sorting on Distributed Query Performance
Nga Tran
Staff Engineer, InfluxData
July 14, 2021
● InfluxData - Staff Engineer
● Tableau/Salesforce (2 years)
○ Sr. Manager of Automatic Statistics
● Vertica RDBMS (over a decade)
○ Engineer of Query Optimizer
○ Director of Engineering (R&D)
● ELCA (4 years)
Outline
● Non-distributed vs Distributed Databases
● Splitting Data to Gain Query Performance
○ Sharding, Partitioning, Encoding, and Sorting
● Impacts of different data setups on Query Performance
Distributed Database
Non-Distributed DB: 1-node cluster
● 1 machine
● Data is loaded & then queried on that node
Distributed DB: Cluster of many nodes
● Several machines share the work
● Data is horizontally split between nodes
● Data is queried from all nodes
[Figure: Non-Distributed DB: a single node. Distributed DB: N nodes (Node 1, Node 2, ..., Node n), each playing the same role and talking to the others. The rows are split across the nodes: Row 1 to Row a, Row a+1 to Row b, ..., Row x+1 to Row n.]
→ How to split data to gain query performance?
Splitting Data to Gain Query Performance
● Sharding
○ Horizontally split a table into N non-overlapping shards
■ → each node takes an (equal) 1/N share of the workload:
● Load: 1/N of the data goes to each node
● Query: each node does 1/N of the join & group-by work
● Partitioning
○ Each shard is further split into smaller partitions for better data filtering, deleting, fanning out, and local parallelism
● Encoding
○ Each column is encoded (sorted & compressed) to further help joins, filtering, group-by, and order-by
→ Let us dig into examples
Examples: Two tables Order & Line_Item
Order
o_okey o_date o_pri
1 2021.05.01 2
2 2021.05.01 1
3 2021.05.02 1
4 2021.05.02 3
5 2021.05.02 1
Line_Item
l_okey l_name l_price l_shipdate
1 desk 100 2021.05.07
1 chair 50 2021.05.03
1 monitor 130 2021.05.03
1 mouse 10 2021.05.07
2 pot 20 2021.05.01
2 pan 25 2021.05.04
3 shirt 30 2021.05.10
4 bike 120 2021.05.04
4 helmet 30 2021.05.10
5 kayak 200 2021.05.05
5 lifevest 20 2021.05.02
Examples: 2-node cluster
Sharded: Order: (o_okey % 2) & Line_Item: (l_okey % 2)
Node 1
Order
o_okey o_date o_pri
1 2021.05.01 2
3 2021.05.01 1
5 2021.05.02 1
Line_Item
l_okey l_name l_price l_shipdate
1 desk 100 2021.05.07
1 chair 50 2021.05.03
1 monitor 130 2021.05.03
1 mouse 10 2021.05.07
3 shirt 30 2021.05.02
5 kayak 200 2021.05.07
5 lifevest 20 2021.05.02
Node 2
Order
o_okey o_date o_pri
2 2021.05.01 1
4 2021.05.02 3
Line_Item
l_okey l_name l_price l_shipdate
2 pot 20 2021.05.01
2 pan 25 2021.05.04
4 bike 120 2021.05.04
4 helmet 30 2021.05.10
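The shard assignment on this slide can be sketched in a few lines of Python. This is only an illustration, not how any particular database implements sharding: the `shard` helper and the tuple layout are assumptions made for the example, and the rows are the ones shown in the 2-node cluster.

```python
from collections import defaultdict

# Rows as shown on the 2-node slide: (o_okey, o_date, o_pri)
ORDERS = [
    (1, "2021.05.01", 2), (2, "2021.05.01", 1), (3, "2021.05.01", 1),
    (4, "2021.05.02", 3), (5, "2021.05.02", 1),
]
# (l_okey, l_name, l_price, l_shipdate)
LINE_ITEMS = [
    (1, "desk", 100, "2021.05.07"), (1, "chair", 50, "2021.05.03"),
    (1, "monitor", 130, "2021.05.03"), (1, "mouse", 10, "2021.05.07"),
    (2, "pot", 20, "2021.05.01"), (2, "pan", 25, "2021.05.04"),
    (3, "shirt", 30, "2021.05.02"), (4, "bike", 120, "2021.05.04"),
    (4, "helmet", 30, "2021.05.10"), (5, "kayak", 200, "2021.05.07"),
    (5, "lifevest", 20, "2021.05.02"),
]


def shard(rows, num_shards, key_index=0):
    """Hypothetical helper: bucket rows by key % num_shards."""
    shards = defaultdict(list)
    for row in rows:
        shards[row[key_index] % num_shards].append(row)
    return dict(shards)


order_shards = shard(ORDERS, 2)
item_shards = shard(LINE_ITEMS, 2)

# Remainder 1 corresponds to Node 1 (odd keys), remainder 0 to Node 2 (even keys).
for remainder in (1, 0):
    order_keys = sorted({row[0] for row in order_shards[remainder]})
    item_keys = sorted({row[0] for row in item_shards[remainder]})
    print(remainder, order_keys, item_keys)
```

Because both tables use the same function on the same key, every order ends up on the same node as its line items.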
Examples: 2-node cluster
Sharded: Order: (o_okey % 2) & Line_Item: (l_okey % 2)
Partitioned: Order: (o_date) & Line_Item: (l_shipdate)
Node 1
Order
o_okey o_date o_pri
1 2021.05.01 2
3 2021.05.01 1
5 2021.05.02 1
Line_Item
l_okey l_name l_price l_shipdate
3 shirt 30 2021.05.02
5 lifevest 20 2021.05.02
1 chair 50 2021.05.03
1 monitor 130 2021.05.03
1 desk 100 2021.05.07
1 mouse 10 2021.05.07
5 kayak 200 2021.05.07
Node 2
Order
o_okey o_date o_pri
2 2021.05.01 1
4 2021.05.02 3
Line_Item
l_okey l_name l_price l_shipdate
2 pot 20 2021.05.01
2 pan 25 2021.05.04
4 bike 120 2021.05.04
4 helmet 30 2021.05.10
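As a rough sketch of what partitioning does inside one shard, the snippet below groups Node 1's Line_Item rows by l_shipdate. The `partition_by` helper is hypothetical and the in-memory dictionary only stands in for whatever physical layout a real engine would use.

```python
from collections import defaultdict

# Node 1's Line_Item shard: (l_okey, l_name, l_price, l_shipdate)
NODE1_LINE_ITEMS = [
    (1, "desk", 100, "2021.05.07"), (1, "chair", 50, "2021.05.03"),
    (1, "monitor", 130, "2021.05.03"), (1, "mouse", 10, "2021.05.07"),
    (3, "shirt", 30, "2021.05.02"), (5, "kayak", 200, "2021.05.07"),
    (5, "lifevest", 20, "2021.05.02"),
]


def partition_by(rows, column_index):
    """Hypothetical helper: split a shard into partitions keyed by one column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column_index]].append(row)
    # Sort by partition key so the partitions come out in date order.
    return dict(sorted(partitions.items()))


parts = partition_by(NODE1_LINE_ITEMS, 3)
for shipdate, rows in parts.items():
    print(shipdate, [name for _key, name, _price, _date in rows])
```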
Examples: 2-node cluster
Sharded: Order: (o_okey % 2) & Line_Item: (l_okey % 2)
Partitioned: Order: (o_date) & Line_Item: (l_shipdate)
Encoded & Sorted: Order: sorted on (o_okey) & Line_Item: RLE(l_okey), shown as (value, run length)
Node 1
Order
o_okey o_date o_pri
1 2021.05.01 2
3 2021.05.01 1
5 2021.05.02 1
Line_Item (a row without an l_okey entry belongs to the run above it)
l_okey l_name l_price l_shipdate
(3,1) shirt 30 2021.05.02
(5,1) lifevest 20 2021.05.02
(1,2) chair 50 2021.05.03
monitor 130 2021.05.03
(1,2) desk 100 2021.05.07
mouse 10 2021.05.07
(5,1) kayak 200 2021.05.07
Node 2
Order
o_okey o_date o_pri
2 2021.05.01 1
4 2021.05.02 3
Line_Item
l_okey l_name l_price l_shipdate
(2,1) pot 20 2021.05.01
(2,1) pan 25 2021.05.04
(4,1) bike 120 2021.05.04
(4,1) helmet 30 2021.05.10
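The (value, run length) pairs on this slide can be reproduced with a small run-length encoder. This is a minimal sketch under the assumption that each partition's l_okey column is sorted and encoded on its own; `rle_encode` and `rle_decode` are illustrative names, not an API of any database.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, length in runs for _ in range(length)]


# l_okey column of Node 1's Line_Item shard, one sorted list per l_shipdate partition.
NODE1_L_OKEY = {
    "2021.05.02": [3, 5],
    "2021.05.03": [1, 1],
    "2021.05.07": [1, 1, 5],
}

encoded = {date: rle_encode(keys) for date, keys in NODE1_L_OKEY.items()}
print(encoded)
```

Decoding each partition with `rle_decode` gives back the original key list, so the encoding loses no information.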
Impacts of the setups on query performance
Examples: Query
SELECT
l_okey, sum(l_price) as revenue, o_date, o_pri
FROM
orders, lineitem
WHERE
l_okey = o_okey and o_date < 2021.05.02 and
l_shipdate > 2021.05.03
GROUP BY
l_okey, o_date, o_pri
ORDER BY
revenue desc, o_date;
Examples: Query - Do the shards help?
Yes
● Join: l_okey = o_okey
○ → All odd keys are on node 1 and all even keys are on node 2
○ → Node 1 and node 2 each join their own data locally. No need to shuffle data between nodes before joining.
● Group By: l_okey, o_date, o_pri
○ → Similarly, rows with the same group-by keys are on the same node. Each node can aggregate its data without reshuffling.
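To make the "no shuffle" point concrete, here is a small sketch that runs the example query independently on each node and then only merges and orders the per-node results. The `run_locally` function and the dictionary layout are assumptions for illustration; the dates are compared as zero-padded strings, which works for the YYYY.MM.DD format used on the slides.

```python
from collections import defaultdict

NODES = {
    "node1": {
        "orders": [(1, "2021.05.01", 2), (3, "2021.05.01", 1), (5, "2021.05.02", 1)],
        "line_items": [
            (1, "desk", 100, "2021.05.07"), (1, "chair", 50, "2021.05.03"),
            (1, "monitor", 130, "2021.05.03"), (1, "mouse", 10, "2021.05.07"),
            (3, "shirt", 30, "2021.05.02"), (5, "kayak", 200, "2021.05.07"),
            (5, "lifevest", 20, "2021.05.02"),
        ],
    },
    "node2": {
        "orders": [(2, "2021.05.01", 1), (4, "2021.05.02", 3)],
        "line_items": [
            (2, "pot", 20, "2021.05.01"), (2, "pan", 25, "2021.05.04"),
            (4, "bike", 120, "2021.05.04"), (4, "helmet", 30, "2021.05.10"),
        ],
    },
}


def run_locally(orders, line_items):
    """Filter, join on the order key, and aggregate using one node's data only."""
    kept = {okey: (o_date, o_pri) for okey, o_date, o_pri in orders
            if o_date < "2021.05.02"}
    revenue = defaultdict(int)
    for okey, _name, price, shipdate in line_items:
        if shipdate > "2021.05.03" and okey in kept:
            o_date, o_pri = kept[okey]
            revenue[(okey, o_date, o_pri)] += price
    return [(okey, rev, o_date, o_pri) for (okey, o_date, o_pri), rev in revenue.items()]


# Each node works on its own shard; the only cross-node step is the final ordering.
partials = [row for node in NODES.values() for row in run_locally(**node)]
result = sorted(partials, key=lambda r: (-r[1], r[2]))
print(result)
```

No group appears on both nodes, so the per-node aggregates are already final and never need to be combined with each other.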
What if Order is not sharded on o_okey & Line_Item is not sharded on l_okey?
● Join: l_okey = o_okey
○ → Need to reshuffle data so that the same join keys land on the same node before joining. Several ways, for example:
■ Reshard on the fly both Order on o_okey and Line_Item on l_okey
■ Broadcast the small table (Order) to the other nodes
● Group By: l_okey, o_date, o_pri
○ → If the data is sharded on l_okey after the join, nothing more is needed. Otherwise, either:
■ Reshard the data on l_okey across the 2 nodes
■ Send everything to one node to do the final group-by
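The two reshuffle options can be sketched as follows. This is an illustration only: the round-robin starting layout, the `reshard` and `broadcast` helpers, and the "rows moved" counter are assumptions introduced to show where the extra cost comes from.

```python
def reshard(nodes, key_index, num_nodes):
    """Move rows so that equal keys land on the same node; count the rows moved."""
    new_nodes = [[] for _ in range(num_nodes)]
    moved = 0
    for node_id, rows in enumerate(nodes):
        for row in rows:
            target = row[key_index] % num_nodes
            new_nodes[target].append(row)
            if target != node_id:
                moved += 1
    return new_nodes, moved


def broadcast(nodes):
    """Give every node a full copy of a (small) table."""
    full = [row for rows in nodes for row in rows]
    return [list(full) for _ in nodes]


# (l_okey, l_name), spread round-robin instead of by l_okey.
LINE_ITEMS = [
    (1, "desk"), (1, "chair"), (1, "monitor"), (1, "mouse"), (2, "pot"), (2, "pan"),
    (3, "shirt"), (4, "bike"), (4, "helmet"), (5, "kayak"), (5, "lifevest"),
]
round_robin_items = [LINE_ITEMS[0::2], LINE_ITEMS[1::2]]

# (o_okey,), also spread round-robin.
ORDERS = [(1,), (2,), (3,), (4,), (5,)]
round_robin_orders = [ORDERS[0::2], ORDERS[1::2]]

resharded_items, rows_moved = reshard(round_robin_items, 0, 2)
order_copies = broadcast(round_robin_orders)
print(rows_moved, [len(copy) for copy in order_copies])
```

Resharding moves part of each table once, while broadcasting copies the whole small table to every node, which is why it is usually only attractive when that table is small.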
● → Not being sharded on the join keys leads to extra on-the-fly reshard or broadcast cost
● → Not being (re-)sharded on the group-by keys before the group-by operator leads to either
○ A reshard, or
○ One final node doing all the group-by work
Examples: Query - Do the partitions help?
Yes
● Filter: o_date < 2021.05.02 and l_shipdate > 2021.05.03
○ → Prune the partitions that fall outside the filter ranges before scanning any rows
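Partition pruning can be sketched as a filter over partition keys rather than over rows. The example below uses Node 1's Line_Item partitions from the 2-node cluster; `prune` is a hypothetical helper, and a real engine would consult partition metadata instead of a Python dictionary.

```python
# Node 1's Line_Item shard, partitioned on l_shipdate: (l_okey, l_name, l_price)
NODE1_LINE_ITEM_PARTITIONS = {
    "2021.05.02": [(3, "shirt", 30), (5, "lifevest", 20)],
    "2021.05.03": [(1, "chair", 50), (1, "monitor", 130)],
    "2021.05.07": [(1, "desk", 100), (1, "mouse", 10), (5, "kayak", 200)],
}


def prune(partitions, keep):
    """Hypothetical helper: keep only the partitions whose key passes the filter."""
    return {key: rows for key, rows in partitions.items() if keep(key)}


# l_shipdate > 2021.05.03, evaluated once per partition instead of once per row.
survivors = prune(NODE1_LINE_ITEM_PARTITIONS, lambda shipdate: shipdate > "2021.05.03")

rows_total = sum(len(rows) for rows in NODE1_LINE_ITEM_PARTITIONS.values())
rows_scanned = sum(len(rows) for rows in survivors.values())
print(list(survivors), rows_scanned, "of", rows_total)
```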
What if Order is not partitioned on o_date and Line_Item is not partitioned on l_shipdate?
● → Nothing can be pruned early; we have to scan all the column data and apply the filters to every row
Examples: Query - Do the encoding & sorting help?
Yes
● Join: l_okey = o_okey
○ → Use the faster and more memory-efficient merge join because the data is already sorted on the join keys
○ → l_okey can be kept in RLE form during the join
● Group By: l_okey, o_date, o_pri
○ → The group-by key is already sorted, so no hash group-by is needed: simply aggregate as new batches arrive until a higher key value appears
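A minimal sketch of both operators on sorted input follows, assuming the Order side has unique keys and both inputs are sorted on the order key. The function names are illustrative, the query's date filters are left out to keep the example short, and the RLE-aware variant of the join is not shown.

```python
def merge_join(orders, items):
    """Join two inputs that are already sorted on the key (unique keys in orders)."""
    i = 0
    for item in items:
        # Advance the order cursor; it never moves backwards.
        while i < len(orders) and orders[i][0] < item[0]:
            i += 1
        if i < len(orders) and orders[i][0] == item[0]:
            yield orders[i] + item[1:]


def streaming_group_sum(rows, key_len, value_index):
    """Pipelined group-by: emit a group as soon as a different key arrives."""
    current, total = None, 0
    for row in rows:
        key = row[:key_len]
        if key != current:
            if current is not None:
                yield current + (total,)
            current, total = key, 0
        total += row[value_index]
    if current is not None:
        yield current + (total,)


# Node 1, both sorted on the order key.
ORDERS = [(1, "2021.05.01", 2), (3, "2021.05.01", 1), (5, "2021.05.02", 1)]
ITEMS = [(1, "desk", 100), (1, "chair", 50), (1, "monitor", 130), (1, "mouse", 10),
         (3, "shirt", 30), (5, "kayak", 200), (5, "lifevest", 20)]

# Joined rows look like (o_okey, o_date, o_pri, l_name, l_price).
groups = list(streaming_group_sum(merge_join(ORDERS, ITEMS), 3, 4))
print(groups)
```

Neither function builds a hash table: the join keeps one cursor per input and the group-by keeps a single running total, which is where the memory savings come from.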
What if Order is not sorted on o_okey and Line_Item is not RLE-encoded on l_okey?
● → Use a hash join instead (usually slower and needs more memory than a merge join)
● → Use a hash group-by (similarly, usually slower and needs more memory than a pipelined group-by)
● → If there are only a few line items per order, RLE won’t save much space
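The last point can be checked with a quick count. The sketch below assumes a simplified cost model, introduced here for illustration only, in which a raw column stores one number per row and an RLE column stores two numbers (value and run length) per run.

```python
from itertools import groupby


def rle_cost(sorted_keys):
    """Return (raw_count, rle_count): numbers stored without and with RLE."""
    runs = sum(1 for _ in groupby(sorted_keys))
    return len(sorted_keys), 2 * runs  # each run stores (value, run length)


many_items_per_order = [1] * 4 + [2] * 4   # 4 line items per order
one_item_per_order = [1, 2, 3, 4]          # 1 line item per order

print(rle_cost(many_items_per_order))
print(rle_cost(one_item_per_order))
```

Under this model RLE halves the storage when each order has four line items and doubles it when each order has only one.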
How to design sharding, partitioning, encoding, and sorting for a combination of queries?
Database Designer:
● Topic for another talk
● Startup: Ottertune https://ottertune.com
○ Database Optimization on Autopilot
So what have we demonstrated today?
● Sharding
○ Horizontally split a table into N non-overlapping shards
■ → each node takes an (equal) 1/N share of the workload:
● Load: 1/N of the data goes to each node
● Query: each node does 1/N of the join & group-by work
● Partitioning
○ Each shard is further split into smaller partitions for better data filtering, deleting, fanning out, and local parallelism
● Encoding
○ Each column is encoded (sorted & compressed) to further help joins, filtering, group-by, and order-by
→ Can you think of examples for the cases we have not covered?
Thank you
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
InfluxData
 
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
InfluxData
 
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
InfluxData
 
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
InfluxData
 
Paul Dix [InfluxData] The Journey of InfluxDB | InfluxDays 2022
Paul Dix [InfluxData] The Journey of InfluxDB | InfluxDays 2022Paul Dix [InfluxData] The Journey of InfluxDB | InfluxDays 2022
Paul Dix [InfluxData] The Journey of InfluxDB | InfluxDays 2022
InfluxData
 
Announcing InfluxDB Clustered
Announcing InfluxDB ClusteredAnnouncing InfluxDB Clustered
Announcing InfluxDB Clustered
InfluxData
 
Best Practices for Leveraging the Apache Arrow Ecosystem
Best Practices for Leveraging the Apache Arrow EcosystemBest Practices for Leveraging the Apache Arrow Ecosystem
Best Practices for Leveraging the Apache Arrow Ecosystem
InfluxData
 
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
InfluxData
 
Power Your Predictive Analytics with InfluxDB
Power Your Predictive Analytics with InfluxDBPower Your Predictive Analytics with InfluxDB
Power Your Predictive Analytics with InfluxDB
InfluxData
 
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
InfluxData
 
Build an Edge-to-Cloud Solution with the MING Stack
Build an Edge-to-Cloud Solution with the MING StackBuild an Edge-to-Cloud Solution with the MING Stack
Build an Edge-to-Cloud Solution with the MING Stack
InfluxData
 
Meet the Founders: An Open Discussion About Rewriting Using Rust
Meet the Founders: An Open Discussion About Rewriting Using RustMeet the Founders: An Open Discussion About Rewriting Using Rust
Meet the Founders: An Open Discussion About Rewriting Using Rust
InfluxData
 
Introducing InfluxDB Cloud Dedicated
Introducing InfluxDB Cloud DedicatedIntroducing InfluxDB Cloud Dedicated
Introducing InfluxDB Cloud Dedicated
InfluxData
 
Gain Better Observability with OpenTelemetry and InfluxDB
Gain Better Observability with OpenTelemetry and InfluxDB Gain Better Observability with OpenTelemetry and InfluxDB
Gain Better Observability with OpenTelemetry and InfluxDB
InfluxData
 
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
InfluxData
 
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
InfluxData
 
Introducing InfluxDB’s New Time Series Database Storage Engine
Introducing InfluxDB’s New Time Series Database Storage EngineIntroducing InfluxDB’s New Time Series Database Storage Engine
Introducing InfluxDB’s New Time Series Database Storage Engine
InfluxData
 
Start Automating InfluxDB Deployments at the Edge with balena
Start Automating InfluxDB Deployments at the Edge with balena Start Automating InfluxDB Deployments at the Edge with balena
Start Automating InfluxDB Deployments at the Edge with balena
InfluxData
 
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDBStreamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
InfluxData
 
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
InfluxData
 
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
InfluxData
 
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
InfluxData
 
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
InfluxData
 
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
InfluxData
 
Paul Dix [InfluxData] The Journey of InfluxDB | InfluxDays 2022
Paul Dix [InfluxData] The Journey of InfluxDB | InfluxDays 2022Paul Dix [InfluxData] The Journey of InfluxDB | InfluxDays 2022
Paul Dix [InfluxData] The Journey of InfluxDB | InfluxDays 2022
InfluxData
 
Ad

Recently uploaded (20)

DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 

Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query Performance

  • 1. Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query Performance Nga Tran Staff Engineer, InfluxData July 14, 2021
  • 2. ● InfluxData - Staff Engineer ● Tableau/Salesforce (2 years) ○ Sr. Manager of Automatic Statistics ● Vertica RDBMS (over a decade) ○ Engineer of Query Optimizer ○ Director of Engineering (R&D) ● ELCA (4 years)
  • 3. Outline ● Non-distributed vs Distributed Databases ● Splitting Data to Gain Query Performance ○ Sharding, Partitioning, Encoding, and Sorting ● Impacts of different data setups on Query Performance
  • 4. Distributed Database
      Non-Distributed DB: 1-node cluster
      ● 1 machine
      ● Data is loaded & then queried on that node
      Distributed DB: Cluster of many nodes
      ● Several machines share the work
      ● Data is horizontally split between nodes
      ● Data is queried from all nodes
      [Figure: a non-distributed DB with a single node; a distributed DB with Node 1, Node 2, ..., Node n, where each node plays the same role and talks to the others. Rows 1..a, a+1..b, ..., x+1..n are split across the nodes.]
  • 5. Distributed Database
      Non-Distributed DB: 1-node cluster
      ● 1 machine
      ● Data is loaded & then queried on that node
      Distributed DB: Cluster of many nodes
      ● Several machines share the work
      ● Data is horizontally split between nodes
      ● Data is queried from all nodes
      → How to split data to gain query performance?
      [Figure: same single-node vs. n-node diagram as on slide 4.]
  • 6. Splitting Data to Gain Query Performance
      ● Sharding
        ○ Horizontally split a table into N non-overlapping shards
          ■ → each node takes an (equal) 1/N share of the workload:
            ● Load 1/N of the data to each node
            ● Query: each node does 1/N of the join & group-by work
      ● Partitioning
        ○ Each shard is further split into smaller partitions for better data filtering, deleting, fanning out, local parallelism
      ● Encoding
        ○ Each column is encoded (sorted & compressed) to further help with join, filtering, group-by, order-by
  • 7. Splitting Data to Gain Query Performance
      ● Sharding
        ○ Horizontally split a table into N non-overlapping shards
          ■ → each node takes an (equal) 1/N share of the workload:
            ● Load 1/N of the data to each node
            ● Query: each node does 1/N of the join & group-by work
      ● Partitioning
        ○ Each shard is further split into smaller partitions for better data filtering, deleting, fanning out, local parallelism
      ● Encoding
        ○ Each column is encoded (sorted & compressed) to further help with join, filtering, group-by, order-by
      → Let us dig into examples
  • 8. Examples: Two tables Order & Line_Item
      Order
        o_okey  o_date      o_pri
        1       2021.05.01  2
        2       2021.05.01  1
        3       2021.05.02  1
        4       2021.05.02  3
        5       2021.05.02  1
      Line_Item
        l_okey  l_name    l_price  l_shipdate
        1       desk      100      2021.05.07
        1       chair     50       2021.05.03
        1       monitor   130      2021.05.03
        1       mouse     10       2021.05.07
        2       pot       20       2021.05.01
        2       pan       25       2021.05.04
        3       shirt     30       2021.05.10
        4       bike      120      2021.05.04
        4       helmet    30       2021.05.10
        5       kayak     200      2021.05.05
        5       lifevest  20       2021.05.02
  • 9. Examples: 2-node cluster
      Sharded: Order: (o_okey % 2) & Line_Item: (l_okey % 2)
      Node 1
        Order
          o_okey  o_date      o_pri
          1       2021.05.01  2
          3       2021.05.01  1
          5       2021.05.02  1
        Line_Item
          l_okey  l_name    l_price  l_shipdate
          1       desk      100      2021.05.07
          1       chair     50       2021.05.03
          1       monitor   130      2021.05.03
          1       mouse     10       2021.05.07
          3       shirt     30       2021.05.02
          5       kayak     200      2021.05.07
          5       lifevest  20       2021.05.02
      Node 2
        Order
          o_okey  o_date      o_pri
          2       2021.05.01  1
          4       2021.05.02  3
        Line_Item
          l_okey  l_name    l_price  l_shipdate
          2       pot       20       2021.05.01
          2       pan       25       2021.05.04
          4       bike      120      2021.05.04
          4       helmet    30       2021.05.10
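The `key % 2` placement on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `shard_of` and `shard_rows` are hypothetical helper names, the rows are the Order rows as shown on the 2-node slide, and a real engine would typically hash the key rather than use it directly.

```python
# Order rows (o_okey, o_date, o_pri) as shown on the 2-node slide.
ORDERS = [
    (1, "2021.05.01", 2),
    (2, "2021.05.01", 1),
    (3, "2021.05.01", 1),
    (4, "2021.05.02", 3),
    (5, "2021.05.02", 1),
]

def shard_of(key, num_nodes=2):
    """Hypothetical placement rule: key % num_nodes, mapped to 1-based node ids.

    With 2 nodes, odd keys land on node 1 and even keys on node 2.
    """
    return key % num_nodes or num_nodes

def shard_rows(rows, num_nodes=2):
    """Split rows into non-overlapping shards keyed by node id."""
    shards = {node: [] for node in range(1, num_nodes + 1)}
    for row in rows:
        shards[shard_of(row[0], num_nodes)].append(row)
    return shards

order_shards = shard_rows(ORDERS)
for node, rows in order_shards.items():
    print(f"Node {node}: o_okey = {[r[0] for r in rows]}")
```

Applying the same rule to l_okey is what places every line item on the same node as its order.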
  • 10. Examples: 2-node cluster
      Sharded: Order: (o_okey % 2) & Line_Item: (l_okey % 2)
      Partitioned: Order: (o_date) & Line_Item: (l_shipdate)
      Node 1
        Order
          o_okey  o_date      o_pri
          1       2021.05.01  2
          3       2021.05.01  1
          5       2021.05.02  1
        Line_Item
          l_okey  l_name    l_price  l_shipdate
          3       shirt     30       2021.05.02
          5       lifevest  20       2021.05.02
          1       chair     50       2021.05.03
          1       monitor   130      2021.05.03
          1       desk      100      2021.05.07
          1       mouse     10       2021.05.07
          5       kayak     200      2021.05.07
      Node 2
        Order
          o_okey  o_date      o_pri
          2       2021.05.01  1
          4       2021.05.02  3
        Line_Item
          l_okey  l_name    l_price  l_shipdate
          2       pot       20       2021.05.01
          2       pan       25       2021.05.04
          4       bike      120      2021.05.04
          4       helmet    30       2021.05.10
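As a rough sketch of what partitioning a shard means, the snippet below groups Node 1's Line_Item rows by l_shipdate. The helper name `partition_by` is hypothetical, and treating each distinct date as its own partition is an assumption made to match the slide's layout; real systems often partition by coarser ranges.

```python
from collections import defaultdict

# Node 1's Line_Item shard (l_okey, l_name, l_price, l_shipdate), before partitioning.
NODE1_LINE_ITEMS = [
    (1, "desk", 100, "2021.05.07"),
    (1, "chair", 50, "2021.05.03"),
    (1, "monitor", 130, "2021.05.03"),
    (1, "mouse", 10, "2021.05.07"),
    (3, "shirt", 30, "2021.05.02"),
    (5, "kayak", 200, "2021.05.07"),
    (5, "lifevest", 20, "2021.05.02"),
]

def partition_by(rows, key_index):
    """Group a shard's rows into partitions keyed by one column's value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index]].append(row)
    # Order partitions by key so date ranges are contiguous.
    return dict(sorted(partitions.items()))

partitions = partition_by(NODE1_LINE_ITEMS, key_index=3)
for shipdate, rows in partitions.items():
    print(shipdate, [r[1] for r in rows])
```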
  • 11. Examples: 2-node cluster
      Sharded: Order: (o_okey % 2) & Line_Item: (l_okey % 2)
      Partitioned: Order: (o_date) & Line_Item: (l_shipdate)
      Encoded & Sorted: Order: (o_okey) & Line_Item: RLE(l_okey)
      Node 1
        Order
          o_okey  o_date      o_pri
          1       2021.05.01  2
          3       2021.05.01  1
          5       2021.05.02  1
        Line_Item
          l_okey  l_name    l_price  l_shipdate
          (3,1)   shirt     30       2021.05.02
          (5,1)   lifevest  20       2021.05.02
          (1,2)   chair     50       2021.05.03
                  monitor   130      2021.05.03
          (1,2)   desk      100      2021.05.07
                  mouse     10       2021.05.07
          (5,1)   kayak     200      2021.05.07
      Node 2
        Order
          o_okey  o_date      o_pri
          2       2021.05.01  1
          4       2021.05.02  3
        Line_Item
          l_okey  l_name    l_price  l_shipdate
          (2,1)   pot       20       2021.05.01
          (2,1)   pan       25       2021.05.04
          (4,1)   bike      120      2021.05.04
          (4,1)   helmet    30       2021.05.10
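The (value, run length) pairs in the l_okey column can be reproduced with a small run-length encoder. This is a minimal sketch, assuming each partition is encoded independently (which is what the slide's pairs suggest); `rle_encode` and `rle_decode` are hypothetical helper names.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]

# Node 1's l_okey column, one list per l_shipdate partition (as on the slide).
NODE1_L_OKEY = {
    "2021.05.02": [3, 5],
    "2021.05.03": [1, 1],
    "2021.05.07": [1, 1, 5],
}

encoded = {date: rle_encode(keys) for date, keys in NODE1_L_OKEY.items()}
for date, runs in encoded.items():
    print(date, runs)
```

Because RLE only merges adjacent equal values, it pays off when the column is sorted (or nearly so) and each key repeats many times.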
  • 12. Impacts of the setups on query performance
  • 13. Examples: Query
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
  • 14. Examples: Query - Do the shards help?
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
  • 15. Back to Shard setup: same 2-node layout as slide 9 (Sharded: Order: (o_okey % 2) & Line_Item: (l_okey % 2))
  • 16. Examples: Query - Do the shards help?
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
      YES
      ● Join: l_okey = o_okey
        ○ → all odd keys are on node 1 and all even keys are on node 2
        ○ → Node 1 and node 2 each join their own local data. No need to shuffle data between nodes before joining.
      ● Group By: l_okey, o_date, o_pri
        ○ → Similarly, rows with the same group-by keys are on the same node. Each node can aggregate its data without reshuffling.
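The co-located plan described on this slide can be sketched as follows: each node filters, joins, and aggregates only its own shard, and a coordinator merely concatenates and sorts the partial results. This is an illustration, not the engine's actual executor; `local_join_and_group` is a hypothetical name, the data is the 2-node layout from the earlier slide, and dates are compared as zero-padded strings.

```python
from collections import defaultdict

# Per-node shards: Order rows are (o_okey, o_date, o_pri),
# Line_Item rows are (l_okey, l_name, l_price, l_shipdate).
NODES = {
    1: {
        "orders": [(1, "2021.05.01", 2), (3, "2021.05.01", 1), (5, "2021.05.02", 1)],
        "line_items": [
            (1, "desk", 100, "2021.05.07"), (1, "chair", 50, "2021.05.03"),
            (1, "monitor", 130, "2021.05.03"), (1, "mouse", 10, "2021.05.07"),
            (3, "shirt", 30, "2021.05.02"), (5, "kayak", 200, "2021.05.07"),
            (5, "lifevest", 20, "2021.05.02"),
        ],
    },
    2: {
        "orders": [(2, "2021.05.01", 1), (4, "2021.05.02", 3)],
        "line_items": [
            (2, "pot", 20, "2021.05.01"), (2, "pan", 25, "2021.05.04"),
            (4, "bike", 120, "2021.05.04"), (4, "helmet", 30, "2021.05.10"),
        ],
    },
}

def local_join_and_group(orders, line_items):
    """Filter, join on okey, and sum l_price using only one node's data."""
    kept = {okey: (o_date, o_pri) for okey, o_date, o_pri in orders
            if o_date < "2021.05.02"}
    revenue = defaultdict(int)
    for l_okey, _name, l_price, l_shipdate in line_items:
        if l_shipdate > "2021.05.03" and l_okey in kept:
            revenue[(l_okey, *kept[l_okey])] += l_price
    return [(key[0], total, key[1], key[2]) for key, total in revenue.items()]

# No rows cross nodes before the join or the group-by; only results are merged.
partials = [row for node in NODES.values() for row in local_join_and_group(**node)]
result = sorted(partials, key=lambda r: (-r[1], r[2]))  # revenue desc, o_date
print(result)
```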
  • 17. Examples: Query - Do the shards help?
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
      What if Order is not sharded on o_okey & Line_Item is not sharded on l_okey?
  • 18. Examples: Query - Do the shards help?
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
      What if Order is not sharded on o_okey & Line_Item is not sharded on l_okey?
      ● Join: l_okey = o_okey
        ○ → Need to reshuffle data so that equal join keys land on the same node before joining. Many ways:
          ■ Reshard both tables on the fly: Order on o_okey and Line_Item on l_okey
          ■ Broadcast the smaller table (here Order) to the other nodes
      ● Group By: l_okey, o_date, o_pri
        ○ → If the data is sharded on l_okey after the join, nothing more is needed. Otherwise, either:
          ■ Reshard the data on l_okey across the 2 nodes
          ■ Send everything to one node to do the final group-by
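To make the extra cost concrete, here is a small sketch of an on-the-fly reshard. The starting placement is hypothetical (Order rows loaded without regard to o_okey), `reshard` is an illustrative helper name, and counting moved rows stands in for network traffic.

```python
def reshard(placement, key_index=0, num_nodes=2):
    """Move rows so that each key lives on node (key % num_nodes), 1-based.

    Returns the new placement and the number of rows that had to change nodes.
    """
    target = {node: [] for node in range(1, num_nodes + 1)}
    moved = 0
    for node, rows in placement.items():
        for row in rows:
            dest = row[key_index] % num_nodes or num_nodes
            target[dest].append(row)
            moved += int(dest != node)  # a row crossing nodes costs network I/O
    return target, moved

# Hypothetical placement: Order rows (o_okey, o_date, o_pri) NOT sharded on o_okey.
UNSHARDED_ORDERS = {
    1: [(1, "2021.05.01", 2), (2, "2021.05.01", 1), (3, "2021.05.01", 1)],
    2: [(4, "2021.05.02", 3), (5, "2021.05.02", 1)],
}

resharded, rows_moved = reshard(UNSHARDED_ORDERS)
print({node: [r[0] for r in rows] for node, rows in resharded.items()}, rows_moved)
```

When the tables are already sharded on the join key, `rows_moved` is zero, which is the saving the previous slides describe.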
  • 19. Examples: Query - Do the shards help?
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
      What if Order is not sharded on o_okey & Line_Item is not sharded on l_okey?
      ● → Data not sharded on the join keys incurs an extra on-the-fly reshard or broadcast cost
      ● → Data not already (re-)sharded on the group-by keys before the group-by operator leads to either
        ○ A reshard, or
        ○ One final node doing all the group-by work
  • 20. Examples: Query - Do the partitions help?
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
  • 21. Back to Partition Setup: same 2-node layout as slide 10 (Sharded: Order: (o_okey % 2) & Line_Item: (l_okey % 2); Partitioned: Order: (o_date) & Line_Item: (l_shipdate))
  • 22. Examples: Query - Do the partitions help?
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
      Yes
      ● Filter: o_date < '2021.05.02' and l_shipdate > '2021.05.03'
        ○ → Prune the partitions that fall outside the filter ranges
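Partition pruning can be sketched as a check on partition keys that happens before any rows are read. The snippet below is an illustration only: `prune` is a hypothetical helper, the partitions are Node 1's Line_Item partitions keyed by l_shipdate, and dates are compared as zero-padded strings.

```python
# Node 1's Line_Item rows (l_okey, l_name, l_price), partitioned by l_shipdate.
PARTITIONS = {
    "2021.05.02": [(3, "shirt", 30), (5, "lifevest", 20)],
    "2021.05.03": [(1, "chair", 50), (1, "monitor", 130)],
    "2021.05.07": [(1, "desk", 100), (1, "mouse", 10), (5, "kayak", 200)],
}

def prune(partitions, predicate):
    """Keep only partitions whose key satisfies the filter; the rest are never scanned."""
    return {key: rows for key, rows in partitions.items() if predicate(key)}

# Filter from the query: l_shipdate > '2021.05.03'
kept = prune(PARTITIONS, lambda shipdate: shipdate > "2021.05.03")
rows_scanned = sum(len(rows) for rows in kept.values())
total_rows = sum(len(rows) for rows in PARTITIONS.values())
print(f"scanned {rows_scanned} of {total_rows} rows in partitions {list(kept)}")
```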
  • 23. Examples: Query - Do the partitions help?
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
      What if Order is not partitioned on o_date and Line_Item is not partitioned on l_shipdate?
  • 24. Examples: Query - Do the partitions help?
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
      What if Order is not partitioned on o_date and Line_Item is not partitioned on l_shipdate?
      ● → Nothing can be pruned early; we have to scan all the column data and apply the filter ranges to every row
  • 25. Examples: Query - Do the encoding & sorting help?
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
  • 26. Back to Encoding and Sorting Setup: same 2-node layout as slide 11 (Sharded: Order: (o_okey % 2) & Line_Item: (l_okey % 2); Partitioned: Order: (o_date) & Line_Item: (l_shipdate); Encoded & Sorted: Order: (o_okey) & Line_Item: RLE(l_okey))
  • 27. Examples: Query - Do the encoding & sorting help?
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
      Yes
      ● Join: l_okey = o_okey
        ○ → use the fast and more memory-efficient merge join, because the data is already sorted on the join keys
        ○ → l_okey can be kept in RLE form during the join
      ● Group By: l_okey, o_date, o_pri
        ○ → The group-by key is sorted, so no hash group-by is needed; simply aggregate rows as new batches arrive until a higher key value appears
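The merge join and the sorted (pipelined) group-by named on this slide can be sketched as below. This is a simplified illustration under stated assumptions: both inputs are already filtered and sorted on the join key, Order keys are unique, RLE is ignored, and `merge_join` and `streaming_group_sum` are hypothetical names.

```python
from itertools import groupby

def merge_join(orders, line_items):
    """Join two inputs sorted on their first column in a single forward pass.

    orders: (o_okey, o_date, o_pri) with unique keys; line_items: (l_okey, l_price).
    No hash table is built; only a cursor into `orders` is kept.
    """
    joined, i = [], 0
    for l_okey, l_price in line_items:
        while i < len(orders) and orders[i][0] < l_okey:
            i += 1  # skip orders with no remaining matches
        if i < len(orders) and orders[i][0] == l_okey:
            joined.append((l_okey, l_price, orders[i][1], orders[i][2]))
    return joined

def streaming_group_sum(joined):
    """Aggregate sorted input one run at a time; a group is final once the key changes."""
    for key, run in groupby(joined, key=lambda r: (r[0], r[2], r[3])):
        yield (key[0], sum(r[1] for r in run), key[1], key[2])

# Node 1 after filtering (o_date < '2021.05.02', l_shipdate > '2021.05.03'),
# assumed sorted on the join key.
orders = [(1, "2021.05.01", 2), (3, "2021.05.01", 1)]
line_items = [(1, 100), (1, 10), (5, 200)]

result = list(streaming_group_sum(merge_join(orders, line_items)))
print(result)
```

The group-by here holds only one group's running sum at a time, which is the memory advantage over a hash group-by.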
  • 28. Examples: Query - Do the encoding & sorting help?
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
      What if Order is not sorted on o_okey and Line_Item is not RLE-encoded on l_okey?
  • 29. Examples: Query - Do the encoding & sorting help?
      SELECT l_okey, sum(l_price) as revenue, o_date, o_pri
      FROM orders, lineitem
      WHERE l_okey = o_okey and o_date < '2021.05.02' and l_shipdate > '2021.05.03'
      GROUP BY l_okey, o_date, o_pri
      ORDER BY revenue desc, o_date;
      What if Order is not sorted on o_okey and Line_Item is not RLE-encoded on l_okey?
      ● → use a hash join instead (usually slower and more memory-hungry than a merge join)
      ● → use a hash group-by (similarly, usually slower and more memory-hungry than a pipelined group-by)
      ● → If there are only a few line items per order, RLE won't save much space
  • 30. How to design sharding, partitioning, encoding, and sorting for a combination of queries?
      Database Designer:
      ● Topic for another talk
      ● Startup: OtterTune https://ottertune.com
        ○ Database Optimization on Autopilot
  • 31. So what have we demonstrated today?
      ● Sharding
        ○ Horizontally split a table into N non-overlapping shards
          ■ → each node takes an (equal) 1/N share of the workload:
            ● Load 1/N of the data to each node
            ● Query: each node does 1/N of the join & group-by work
      ● Partitioning
        ○ Each shard is further split into smaller partitions for better data filtering, deleting, fanning out, local parallelism
      ● Encoding
        ○ Each column is encoded (sorted & compressed) to further help with join, filtering, group-by, order-by
      → Can you think of examples for the cases we have not covered?