Scylla Summit 2018
Joining Billions of Rows in Seconds with One Database Instead of Two:
Replacing MongoDB and Hive with Scylla
Alexys Jacob
CTO, Numberly
whoami
@ultrabug
1 Eiffel Tower
2 Soccer World Cups
15 Years in the Data industry
Pythonista
OSS enthusiast & contributor
Gentoo Linux developer
CTO at Numberly - living in Paris, France
Business context of Numberly
Digital Marketing Technologist (MarTech)
Handling the relationship between brands and people (People based)
Dealing with multiple sources and a wide range of data types (Events)
Mixing and correlating a massive amount of different types of events...
...which all have their own identifiers (think primary keys)
Business context of Numberly
Web navigation tracking (browser ID: cookie)
CRM databases (email address, customer ID)
Partners’ digital platforms (cookie ID, hash(email address))
Mobile phone apps (device ID: IDFA, GAID)
Ability to synchronize and translate identifiers between all data sources and
destinations.
➔ For this we use ID matching tables.
ID matching tables
1. SELECT reference population
2. JOIN with the ID matching table
3. MATCHED population is usable by
partner
Queried AND updated all the time!
➔ High read AND write workload
JOIN
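The three steps above can be sketched with plain Python dictionaries (a minimal in-memory stand-in; the real matching tables live in a database):

```python
# Minimal sketch of the SELECT -> JOIN -> MATCH flow.
# The ID matching table maps our identifiers to a partner's identifiers.

# 1. SELECT the reference population (e.g. email addresses of previous donors)
reference_population = ["generous@coconut.fr", "isupportu@lab.com", "wiki4ever@wp.eu"]

# 2. JOIN with the ID matching table (email -> partner cookie ID);
#    some identifiers have no known translation yet.
id_matching_table = {
    "generous@coconut.fr": 123,
    "wiki4ever@wp.eu": 896,
}

# 3. The MATCHED population is what the partner can activate on
matched = {email: id_matching_table[email]
           for email in reference_population
           if email in id_matching_table}
```

Every incoming event may also add or refresh a row in `id_matching_table`, which is why the workload is read AND write heavy.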
Real life example: retargeting
From a database (email) to a web banner (cookie)
Previous
donors
generous@coconut.fr
isupportu@lab.com
wiki4ever@wp.eu
openinternet@free.fr
https://kitty.eu
AppNexus
...
Google
ID
matching
table
Cookie id = 123
Cookie id = 297
?
Cookie id = 896
Ad Exchange User cookie id 123
SELECT MATCH
ACTIVATE
Current implementation(s)
Events
Message
queues
HDFS
Real time
Programs
Batch
Calculation
MongoDB
Hive
Batch pipeline
Real time pipeline
Drawbacks & pitfalls
Events
Message
queues
HDFS
Real time
Programs
Batch
Calculation
MongoDB
Hive
Batch pipeline
Real time pipeline
Scylla?
Future implementation using Scylla?
Events
Message
queues
Real time
Programs
Batch
Calculation
Scylla
Batch pipeline
Real time pipeline
Proof Of Concept hardware
Recycled hardware…
▪ 2x DELL R510
• 19GB RAM, 16 cores, RAID0 SAS spinning disks, 1Gbps NIC
▪ 1x DELL R710
• 19GB RAM, 8 cores, RAID0 SAS spinning disks, 1Gbps NIC
➔ If it can compete with our production, Scylla is in!
Finding the right schema model
Query based AND test-driven data modeling
1. What are all the cookie IDs associated with a given partner ID over the last N
months?
2. What is the last cookie ID/date for the given partner ID?
Gotcha: the reverse questions are also to be answered!
➔ Denormalization
➔ Prototype with your language of choice!
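Such a test-driven prototype could look like this in pure Python (field names and dates are illustrative, standing in for the real schema):

```python
import datetime

# Denormalized rows as they would land in the matching table:
# (partner_id, date, cookie_id), newest first thanks to DESC ordering.
rows = [
    ("partner-42", datetime.date(2018, 11, 2), "cookie-896"),
    ("partner-42", datetime.date(2018, 9, 15), "cookie-123"),
    ("partner-42", datetime.date(2018, 1, 3), "cookie-297"),
]

def cookies_for_partner(rows, partner_id, since):
    """Question 1: all cookie IDs for a partner since a given date."""
    return [c for p, d, c in rows if p == partner_id and d >= since]

def last_cookie_for_partner(rows, partner_id):
    """Question 2: latest cookie ID/date; rows are stored newest first."""
    for p, d, c in rows:
        if p == partner_id:
            return c, d
    return None

# Tests pin the model down before writing any CQL.
assert cookies_for_partner(rows, "partner-42",
                           datetime.date(2018, 9, 1)) == ["cookie-896", "cookie-123"]
assert last_cookie_for_partner(rows, "partner-42") == ("cookie-896",
                                                       datetime.date(2018, 11, 2))
```

The reverse questions (partner ID from a cookie ID) get their own denormalized table and the mirror-image tests.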
Schema tip!
> What is the last cookie ID for the given partner ID?
TIP: CLUSTERING ORDER
▪ Defaults to ASC
➔ Latest value at the end of the
sstable!
▪ Change “date” ordering to DESC
➔ Latest value at the top of the
sstable
➔ Reduced read latency!
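In CQL this tip translates to something like the following DDL (keyspace, table and column names are illustrative, not Numberly's actual schema):

```sql
CREATE TABLE ids.partner_to_cookie (
    partner_id text,
    date       timestamp,
    cookie_id  text,
    PRIMARY KEY ((partner_id), date)
) WITH CLUSTERING ORDER BY (date DESC);

-- With DESC ordering, "latest cookie for a partner" is simply:
-- SELECT cookie_id, date FROM ids.partner_to_cookie
--   WHERE partner_id = ? LIMIT 1;
```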
scylla-grafana-monitoring
Set it up and test it!
▪ Use cassandra-stress
Key graphs:
▪ number of open connections
▪ cache hits / misses
▪ per shard/node distribution
▪ sstable reads
TIP: reduce default scrape interval
▪ scrape_interval: 2s (4s default)
▪ scrape_timeout: 1s (5s default)
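In the monitoring stack's prometheus.yml this amounts to (a sketch; the defaults quoted are the ones shipped at the time):

```yaml
global:
  scrape_interval: 2s   # default: 4s
  scrape_timeout: 1s    # default: 5s
```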
Reference data and metrics
Reference dataset
▪ 10M population
▪ 400M ID matching table
➔ Representative volumes
Measured on our production stack, with real load
NOT a benchmark!
Results:
▪ idle cluster: 2 minutes, 15 seconds
▪ normal cluster: 4 minutes
▪ overloaded cluster: 15 minutes
Spark 2 + Hive: reference metrics
Hive
(population)
Hive
(ID matching)
Partitions
count
+
Let’s use Scylla!
Testing with Scylla
Distinguish between hot and cold cache scenarios
▪ Cold cache: mostly disk I/O bound
▪ Hot cache: mostly memory bound
Push your Scylla cluster to its limits!
Spark 2 + Hive + Scylla
Hive
(population)
Scylla
(ID matching)
Partitions
count
+
Spark 2 / Scala test workload
DataStax’s spark-cassandra-connector joinWithCassandraTable
▪ spark-cassandra-connector-2.0.1-s_2.11.jar
▪ Java 7
Spark 2 tuning (1/2)
Use a fixed number of executors
▪ spark.dynamicAllocation.enabled=false
▪ spark.executor.instances=30
Change Spark split size to match Scylla for read performance
▪ spark.cassandra.input.split.size_in_mb=1
Adjust reads per second
▪ spark.cassandra.input.reads_per_sec=6666
Spark 2 tuning (2/2)
Tune the number of connections opened by each executor
▪ spark.cassandra.connection.connections_per_executor_max=100
Align driver timeouts with server timeouts (check scylla.yaml)
▪ spark.cassandra.connection.timeout_ms=150000
▪ spark.cassandra.read.timeout_ms=150000
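Put together, the tuning from both slides can be passed as a single spark-submit invocation (values straight from the slides; the job jar name is a placeholder):

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=30 \
  --conf spark.cassandra.input.split.size_in_mb=1 \
  --conf spark.cassandra.input.reads_per_sec=6666 \
  --conf spark.cassandra.connection.connections_per_executor_max=100 \
  --conf spark.cassandra.connection.timeout_ms=150000 \
  --conf spark.cassandra.read.timeout_ms=150000 \
  --jars spark-cassandra-connector-2.0.1-s_2.11.jar \
  your-join-job.jar
```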
ScyllaDB blog posts & webinar
▪ https://www.scylladb.com/2018/07/31/spark-scylla/
▪ https://www.scylladb.com/2018/08/21/spark-scylla-2/
▪ https://www.scylladb.com/2018/10/08/hooking-up-spark-and-scylla-part-3/
▪ https://www.scylladb.com/2018/07/17/spark-webinar-questions-answered/
Spark 2 + Scylla results
Cold cache: 12 minutes
Hot cache: 2 minutes
Reference results:
idle cluster: 2 minutes, 15 seconds
normal cluster: 4 minutes
overloaded cluster: 15 minutes
OK for Scala, what about Python?
No joinWithCassandraTable
when using pyspark...
Maybe we don’t need Spark 2 at all!
1. Load the 10M rows from Hive
2. For every row, look up the ID matching table in Scylla
3. Count the resulting number of matches
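Those three steps can be simulated end to end in plain Python (tiny in-memory stand-ins replace Hive and Scylla here; names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins: Hive holds the 10M reference population, Scylla the 400M
# matching table. Both are tiny in-memory samples in this sketch.
population = ["p1", "p2", "p3", "p4"]            # 1. loaded from Hive
matching_table = {"p1": "c-123", "p3": "c-896"}  # 2. looked up in Scylla

def lookup(partner_id):
    # In production this is a concurrent query against Scylla
    # (execute_concurrent on the Python driver).
    return matching_table.get(partner_id)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lookup, population))

# 3. Count the resulting number of matches
matches = sum(1 for r in results if r is not None)
```

Dask plays the role of the thread pool at scale, partitioning the population and dispatching the lookups.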
Dask + Hive + Scylla
Results:
▪ Cold cache: 6min
▪ Hot cache: 2min
Hive
(population)
Scylla
(ID matching)
Partitions
count
Dask + Hive + Scylla time break down
Hive
Scylla
Partitions
count
50 seconds
10 seconds
60 seconds
Dask + Parquet + Scylla
Parquet files
(HDFS)
Scylla
Partitions
count
10 seconds!
Dask + Scylla results
Cold cache: 5 minutes
Hot cache: 1 minute 5 seconds
Spark 2 results:
cold cache: 6 minutes
hot cache: 2 minutes
Python+Scylla with Parquet tips!
▪ Use execute_concurrent()
▪ Increase concurrency parameter (defaults to 100)
▪ Use libev as connection_class instead of asyncore
▪ Use hdfs3 + pyarrow to read and load Parquet files:
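A sketch putting those tips together (keyspace, table and host names are assumptions; the third-party imports are kept inside the functions so the module loads without a live cluster):

```python
def connect(hosts):
    """Scylla session using the libev event loop instead of asyncore."""
    from cassandra.cluster import Cluster
    from cassandra.io.libevreactor import LibevConnection
    cluster = Cluster(hosts, connection_class=LibevConnection)
    return cluster.connect()

def load_parquet(hdfs_host, path):
    """Read a Parquet file straight from HDFS with hdfs3 + pyarrow."""
    from hdfs3 import HDFileSystem
    import pyarrow.parquet as pq
    hdfs = HDFileSystem(host=hdfs_host, port=8020)
    with hdfs.open(path) as f:
        return pq.read_table(f)

def match(session, partner_ids, concurrency=512):
    """Concurrent lookups; raise concurrency well above the default 100."""
    from cassandra.concurrent import execute_concurrent_with_args
    query = session.prepare(
        "SELECT cookie_id FROM ids.partner_to_cookie WHERE partner_id = ?")
    args = [(pid,) for pid in partner_ids]
    return execute_concurrent_with_args(
        session, query, args, concurrency=concurrency)
```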
Scylla!
Production environment
▪ 6x DELL R640
• dual socket 2.6GHz 14C, 512GB RAM, Samsung 17xxx NVMe 3.2TB
Gentoo Linux
Multi-DC setup
Ansible-based provisioning and backups
Monitored by scylla-grafana-monitoring
Housekeeping handled by scylla-manager
Thank You
Questions welcomed!
Stay in touch
alexys@numberly.com
@ultrabug https://ultrabug.fr

Editor's Notes

  • #8: Lambda architecture
  • #9: Keeping two copies of the same data in sync; batch data freshness; operational burden => neither system can sustain both read and write workloads
  • #10: Can Scylla sustain our ID matching tables workloads while maintaining consistently low upsert/write and lookup/read latencies?
  • #11: Simpler data consistency Operational simplicity and efficiency Reduced costs
  • #12: Always try a technology under the best omens :) Running Gentoo Linux
  • #13: ID translations must be done both ways => denormalization. I wrote tests on my dataset so I could concentrate on the model while making sure that all my questions were being answered correctly and consistently.
  • #14: We ended up with three denormalized tables. History-like table (just like a log). Optimize for latest value? This will ensure that the latest values (rows) are stored at the beginning of the sstable file, effectively reducing the read latency when the row is not in cache!
  • #15: Docker-based, easy to install, multi-environment support. Understand the performance of your cluster. Tune your workload for optimal performance.
  • #16: Reference dataset: data cardinality, representative volumes. MongoDB cluster: make sure to shard and index the dataset just like you do on the production collections. Hive: respect the storage file format of your current implementations as well as their partitioning. How many machines in production?! Say it.
  • #19: It’s time to break Scylla, your goal here is to saturate the Scylla cluster, get it to ~90% load
  • #20: Read the 10M population rows from Hive in a partitioned manner; for each partition (slice of 10M), query Scylla to look up the possibly matching partnerid; create a dataframe from the resulting matches; gather back all the dataframes and merge them; count the number of matches. Spark 2 cold is 12min, Spark 2 hot is 2min.
  • #21: I experienced pretty poor performance at first. Grafana monitoring showed that Scylla was not the bottleneck. Repartitioning is used to leverage the driver’s knowledge of how data is sharded, to optimize how it is going to be split between Spark workers.
  • #22: Take your clusters’ utilization into account
  • #23: Take your clusters’ utilization into account
  • #24: With spinning disks, the cold start result can compete with the results of a heavily loaded Hadoop cluster where pending containers and parallelism are knocking down its performance. Those three refurbished machines can compete with our current production machines and implementations. They can even match an idle Hive cluster of medium size. DIGRESSION!
  • #25: I went on the crazy quest of beating Spark 2 performance using a pure Python implementation. The main problem competing with Spark 2 is that it is a distributed framework and Python by itself is not, so you can’t possibly imagine outperforming Spark 2 with a single machine. Spark 2 is shipped and run on executors using YARN, so we are firing up JVMs and dispatching containers all the time. This is a quite expensive process that we have a chance to avoid using Python! joinWithCassandraTable JOINs 10M with 400M...
  • #26: Read the 10M population rows from Hive in a partitioned manner; for each partition (slice of 10M), query Scylla to look up the possibly matching partnerid; create a dataframe from the resulting matches; gather back all the dataframes and merge them; count the number of matches. Spark 2 cold is 12min, Spark 2 hot is 2min.
  • #28: libhdfs3 + pyarrow combo. It is faster to load everything on a single machine than loading from Hive on multiple ones!
  • #29: The Hive loading + partitioning got down from 50s to 10s
  • #31: The conclusion of the evaluation was not driven by the good figures we got out of our test workloads. Those are no benchmarks and never pretended to be, but we could still prove that performance was solid enough not to be a blocker in the adoption of Scylla. Instead we decided on the following points of interest (in no particular order): data consistency, production reliability, datacenter awareness, ease of operation, infrastructure rationalisation, developer friendliness (but it’s not Mongo), costs (training engineers).