Joining Billions of Rows in Seconds
with One Database Instead of Two:
Replacing MongoDB and Hive with Scylla
Alexys Jacob
CTO, Numberly
1 Eiffel Tower
2 Soccer World Cups
15 Years in the Data industry
Pythonista
OSS enthusiast & contributor
Gentoo Linux developer
CTO at Numberly - living in Paris, France
whoami
@ultrabug
Business context of Numberly
Digital Marketing Technologist (MarTech)
Handling the relationship between brands and people (People based)
Dealing with multiple sources and a wide range of data types (Events)
Mixing and correlating a massive amount of different types of events...
...which all have their own identifiers (think primary keys)
Business context of Numberly
Web navigation tracking (browser ID: cookie)
CRM databases (email address, customer ID)
Partners’ digital platforms (cookie ID, hash(email address))
Mobile phone apps (device ID: IDFA, GAID)
Ability to synchronize and translate identifiers between all data sources and destinations.
➔ For this we use ID matching tables.
ID matching tables
1. SELECT reference population
2. JOIN with the ID matching table
3. MATCHED population is usable by the partner
Queried AND updated all the time!
➔ High read AND write workload
JOIN
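The three steps above can be sketched in plain Python. This is only an in-memory illustration: the dictionary standing in for the ID matching table, the `match_population` helper and all sample identifiers are assumptions made for the example, not Numberly's implementation.

```python
from typing import Dict, Iterable, List, Tuple


def match_population(
    population: Iterable[str],
    id_matching: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Join a reference population with an ID matching table.

    Identifiers without a known translation are dropped, like an inner JOIN.
    """
    return [
        (source_id, id_matching[source_id])
        for source_id in population
        if source_id in id_matching
    ]


# 1. SELECT the reference population (hypothetical email addresses).
population = ["alice@example.com", "bob@example.com", "carol@example.com"]

# The ID matching table: email -> partner cookie ID (hypothetical values).
id_matching = {
    "alice@example.com": "cookie-123",
    "carol@example.com": "cookie-896",
}

# 2. JOIN with the ID matching table.
# 3. The MATCHED population is what the partner can use.
matched = match_population(population, id_matching)
print(matched)
```

In production the table is both queried and updated continuously, which is why a plain in-memory structure like this one is only useful for reasoning about the data flow.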
Real life example: retargeting
From a database (email) to a web banner (cookie)
SELECT: previous donors
▪ generous@coconut.fr
▪ isupportu@lab.com
▪ wiki4ever@wp.eu
▪ openinternet@free.fr
MATCH: ID matching table
▪ Cookie id = 123
▪ Cookie id = 297
▪ ?
▪ Cookie id = 896
ACTIVATE: partners (AppNexus, ..., Google)
▪ https://kitty.eu: Ad Exchange user cookie id 123
Current implementation(s)
[Diagram: Events, Message queues and HDFS feed two pipelines: a real-time pipeline (real-time programs, MongoDB) and a batch pipeline (batch calculation, Hive).]
Drawbacks & pitfalls
[Diagram: the same architecture as the current implementation: Events, Message queues, HDFS, a real-time pipeline (real-time programs, MongoDB) and a batch pipeline (batch calculation, Hive).]
Scylla?
Future implementation using Scylla?
[Diagram: Events and Message queues feed both the real-time pipeline (real-time programs) and the batch pipeline (batch calculation), which now share Scylla as their only database.]
Proof Of Concept hardware
Recycled hardware…
▪ 2x DELL R510
• 19GB RAM, 16 cores, RAID0 SAS spinning disks, 1Gbps NIC
▪ 1x DELL R710
• 19GB RAM, 8 cores, RAID0 SAS spinning disks, 1Gbps NIC
➔ Compete with our production? Scylla is in!
Finding the right schema model
Query based AND test-driven data modeling
1. What are all the cookie IDs associated with the given partner ID over the last N months?
2. What is the last cookie ID/date for the given partner ID?
Gotcha: the reverse questions are also to be answered!
➔ Denormalization
➔ Prototype with your language of choice!
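One way to prototype these queries before committing to a schema is a small in-memory model checked by assertions, in the spirit of the test-driven approach described here. The class, its method names and the sample identifiers are hypothetical. The point is only that every write lands in two denormalized structures, so both lookup directions can be answered.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Optional, Tuple


class IdMatchingPrototype:
    """In-memory stand-in for the denormalized ID matching tables."""

    def __init__(self) -> None:
        # One structure per lookup direction (denormalization).
        self.by_partner: Dict[str, List[Tuple[date, str]]] = defaultdict(list)
        self.by_cookie: Dict[str, List[Tuple[date, str]]] = defaultdict(list)

    def upsert(self, partner_id: str, cookie_id: str, seen_on: date) -> None:
        """Record a match in both directions."""
        self.by_partner[partner_id].append((seen_on, cookie_id))
        self.by_cookie[cookie_id].append((seen_on, partner_id))

    def cookies_since(self, partner_id: str, since: date) -> List[str]:
        """Question 1: all cookie IDs seen for a partner ID since a date."""
        rows = self.by_partner.get(partner_id, [])
        return sorted({cookie for seen_on, cookie in rows if seen_on >= since})

    def last_cookie(self, partner_id: str) -> Optional[Tuple[date, str]]:
        """Question 2: the latest (date, cookie ID) for a partner ID."""
        rows = self.by_partner.get(partner_id)
        return max(rows) if rows else None

    def last_partner(self, cookie_id: str) -> Optional[Tuple[date, str]]:
        """Reverse question: the latest (date, partner ID) for a cookie ID."""
        rows = self.by_cookie.get(cookie_id)
        return max(rows) if rows else None


model = IdMatchingPrototype()
model.upsert("partner-1", "cookie-a", date(2018, 1, 10))
model.upsert("partner-1", "cookie-b", date(2018, 9, 2))
print(model.last_cookie("partner-1"))
```

Writing the expected answers as assertions first makes it cheap to change the model while checking that every question is still answered correctly.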
Schema tip!
> What is the last cookie ID for the given partner ID?
TIP: CLUSTERING ORDER
▪ Defaults to ASC
➔ Latest value at the end of the sstable!
▪ Change “date” ordering to DESC
➔ Latest value at the top of the sstable
➔ Reduced read latency!
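As an illustration of the tip, here is what such a table definition could look like in CQL, held in Python strings so they can be passed to a driver session. The keyspace, table and column names are assumptions (`seen_at` plays the role of the “date” column), since the deck does not show the actual schema.

```python
# Hypothetical keyspace, table and column names; only the
# CLUSTERING ORDER BY clause is the point of this sketch.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS id_matching.cookie_by_partner (
    partner_id text,
    seen_at timestamp,
    cookie_id text,
    PRIMARY KEY ((partner_id), seen_at)
) WITH CLUSTERING ORDER BY (seen_at DESC)
"""

# With DESC ordering the newest row of a partition comes first,
# so LIMIT 1 returns the latest cookie ID for the partner.
LAST_COOKIE_CQL = (
    "SELECT cookie_id, seen_at FROM id_matching.cookie_by_partner "
    "WHERE partner_id = %s LIMIT 1"
)


def create_schema(session) -> None:
    """Run the DDL on an already connected driver session."""
    session.execute(CREATE_TABLE_CQL)


def last_cookie(session, partner_id: str):
    """Return the most recent row for a partner ID, or None."""
    rows = list(session.execute(LAST_COOKIE_CQL, (partner_id,)))
    return rows[0] if rows else None
```

Both helpers only need an object with an `execute()` method, so they can be exercised against a test double before pointing them at a real cluster.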
scylla-grafana-monitoring
Set it up and test it!
▪ Use cassandra-stress
Key graphs:
▪ number of open connections
▪ cache hits / misses
▪ per shard/node distribution
▪ sstable reads
TIP: reduce default scrape interval
▪ scrape_interval: 2s (4s default)
▪ scrape_timeout: 1s (5s default)
Reference data and metrics
Reference dataset
▪ 10M population
▪ 400M ID matching table
➔ Representative volumes
Measured on our production stack, with real load
NOT a benchmark!
Spark 2 + Hive: reference metrics
[Diagram: Hive (population) + Hive (ID matching) → Partitions → count]
Results:
▪ idle cluster: 2 minutes, 15 seconds
▪ normal cluster: 4 minutes
▪ overloaded cluster: 15 minutes
Let’s use Scylla!
Testing with Scylla
Distinguish between hot and cold cache scenarios
▪ Cold cache: mostly disk I/O bound
▪ Hot cache: mostly memory bound
Push your Scylla cluster to its limits!
Spark 2 + Hive + Scylla
[Diagram: Hive (population) + Scylla (ID matching) → Partitions → count]
Spark 2 / Scala test workload
DataStax’s spark-cassandra-connector joinWithCassandraTable
▪ spark-cassandra-connector-2.0.1-s_2.11.jar
▪ Java 7
Spark 2 tuning (1/2)
Use a fixed number of executors
▪ spark.dynamicAllocation.enabled=false
▪ spark.executor.instances=30
Change Spark split size to match Scylla for read performance
▪ spark.cassandra.input.split.size_in_mb=1
Adjust reads per second
▪ spark.cassandra.input.reads_per_sec=6666
Spark 2 tuning (2/2)
Tune the number of connections opened by each executor
▪ spark.cassandra.connection.connections_per_executor_max=100
Align driver timeouts with server timeouts (check scylla.yaml)
▪ spark.cassandra.connection.timeout_ms=150000
▪ spark.cassandra.read.timeout_ms=150000
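To keep the tuning values from both slides in one place, they can be collected in a dictionary and turned into `spark-submit --conf` arguments. This is a minimal sketch: the `spark_submit_args` helper and the `job.jar` name are hypothetical, and the values are the ones shown on the slides, which were sized for this proof of concept and should be adjusted to your own cluster.

```python
from typing import Dict, List

# Settings copied from the tuning slides (1/2 and 2/2).
SPARK_SETTINGS: Dict[str, str] = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "30",
    "spark.cassandra.input.split.size_in_mb": "1",
    "spark.cassandra.input.reads_per_sec": "6666",
    "spark.cassandra.connection.connections_per_executor_max": "100",
    "spark.cassandra.connection.timeout_ms": "150000",
    "spark.cassandra.read.timeout_ms": "150000",
}


def spark_submit_args(settings: Dict[str, str]) -> List[str]:
    """Render each setting as a '--conf key=value' pair."""
    args: List[str] = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# 'job.jar' is a placeholder for the actual application.
command = ["spark-submit"] + spark_submit_args(SPARK_SETTINGS) + ["job.jar"]
print(" ".join(command))
```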
ScyllaDB blog posts & webinar
▪ https://www.scylladb.com/2018/07/31/spark-scylla/
▪ https://www.scylladb.com/2018/08/21/spark-scylla-2/
▪ https://www.scylladb.com/2018/10/08/hooking-up-spark-and-scylla-part-3/
▪ https://www.scylladb.com/2018/07/17/spark-webinar-questions-answered/
Spark 2 + Scylla results
Cold cache: 12 minutes
Hot cache: 2 minutes
Reference results:
idle cluster: 2 minutes, 15 seconds
normal cluster: 4 minutes
overloaded cluster: 15 minutes
OK for Scala, what about Python?
No joinWithCassandraTable when using pyspark...
Maybe we don’t need Spark 2 at all!
1. Load the 10M rows from Hive
2. For every row, look up the ID matching table in Scylla
3. Count the resulting number of matches
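The three steps can be sketched with the standard library alone. Note the substitutions: a `ThreadPoolExecutor` stands in for the Dask scheduler used in the talk, a list stands in for the rows loaded from Hive, and a dictionary stands in for the Scylla ID matching table. All function names and sample values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence


def partition(rows: Sequence[str], size: int) -> List[Sequence[str]]:
    """Split the population into fixed-size slices."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def count_in_partition(rows: Sequence[str], id_matching: Dict[str, str]) -> int:
    """Look up every row of one slice and count the matches."""
    return sum(1 for row in rows if row in id_matching)


def count_matches(
    rows: Sequence[str],
    id_matching: Dict[str, str],
    partition_size: int = 3,
    workers: int = 4,
) -> int:
    """Process the slices in parallel and add up the per-slice counts."""
    parts = partition(rows, partition_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda part: count_in_partition(part, id_matching), parts)
        return sum(counts)


# 1. The population (stand-in for the 10M rows loaded from Hive).
population = [f"user-{i}" for i in range(10)]
# 2. The ID matching table (stand-in for Scylla).
id_matching = {"user-1": "cookie-1", "user-4": "cookie-4", "user-42": "cookie-42"}
# 3. Count the resulting number of matches.
total = count_matches(population, id_matching)
print(total)
```

Splitting the population into slices is what lets the lookups run concurrently; the final count is just the sum of the per-slice counts.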
Dask + Hive + Scylla
[Diagram: Hive (population) → Partitions → Scylla (ID matching) → count]
Results:
▪ Cold cache: 6 min
▪ Hot cache: 2 min
Dask + Hive + Scylla time break down
[Diagram: Hive → Partitions → Scylla → count, annotated with timings of 50 seconds, 10 seconds and 60 seconds.]
Dask + Parquet + Scylla
[Diagram: Parquet files (HDFS) → Partitions → Scylla → count, with the loading and partitioning step annotated "10 seconds!"]
Dask + Scylla results
Cold cache: 5 minutes
Hot cache: 1 minute 5 seconds
Spark 2 results:
cold cache: 6 minutes
hot cache: 2 minutes
Python+Scylla with Parquet tips!
▪ Use execute_concurrent()
▪ Increase concurrency parameter (defaults to 100)
▪ Use libev as connection_class instead of asyncore
▪ Use hdfs3 + pyarrow to read and load Parquet files:
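The tips above could be combined roughly as follows. This is a hedged sketch rather than the code shown in the talk: the table and column names, the `concurrency=512` value and the helper functions are assumptions. Third-party imports are kept inside the functions so the module loads even where `hdfs3`, `pyarrow` or `cassandra-driver` are not installed.

```python
from typing import Iterable, List, Sequence, Tuple

# Hypothetical table and column names.
LOOKUP_CQL = (
    "SELECT cookie_id FROM id_matching.cookie_by_partner "
    "WHERE partner_id = ? LIMIT 1"
)


def load_partner_ids(host: str, port: int, path: str, column: str) -> List[str]:
    """Read one column of a Parquet file stored on HDFS."""
    from hdfs3 import HDFileSystem
    import pyarrow.parquet as pq

    hdfs = HDFileSystem(host=host, port=port)
    with hdfs.open(path, "rb") as handle:
        table = pq.read_table(handle, columns=[column])
    return table.to_pydict()[column]


def lookup_all(contact_points: Sequence[str], partner_ids: Iterable[str],
               concurrency: int = 512) -> list:
    """Look up every partner ID concurrently over libev connections."""
    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent
    from cassandra.io.libevreactor import LibevConnection

    cluster = Cluster(contact_points, connection_class=LibevConnection)
    try:
        session = cluster.connect()
        prepared = session.prepare(LOOKUP_CQL)
        statements = [(prepared, (partner_id,)) for partner_id in partner_ids]
        # concurrency defaults to 100; 512 is an arbitrary example value.
        return execute_concurrent(
            session, statements,
            concurrency=concurrency, raise_on_first_error=False,
        )
    finally:
        cluster.shutdown()


def count_matches(results: Iterable[Tuple[bool, object]]) -> int:
    """Count successful lookups that returned at least one row."""
    return sum(1 for success, rows in results if success and list(rows))
```

`execute_concurrent()` returns one `(success, result_or_exception)` pair per statement, which is why `count_matches` checks the success flag before looking at the rows.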
Scylla!
Production environment
▪ 6x DELL R640
• dual socket 2.6 GHz 14C, 512 GB RAM, Samsung 17xxx NVMe 3.2 TB
Gentoo Linux
Multi-DC setup
Ansible based provisioning and backups
Monitored by scylla-grafana-monitoring
Housekeeping handled by scylla-manager
Thank You
Questions welcomed!
Stay in touch
alexys@numberly.com
@ultrabug https://ultrabug.fr

Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Instead of Two - Replacing MongoDB and Hive with Scylla


Editor's Notes

  • #8: Lambda architecture
  • #9: Keeping two copies of the same data in sync. Batch data freshness. Operational burden. => Neither can sustain both the read and the write workload.
  • #10: Can Scylla sustain our ID matching tables workloads while maintaining consistently low upsert/write and lookup/read latencies?
  • #11: Simpler data consistency. Operational simplicity and efficiency. Reduced costs.
  • #12: Always try a technology under the best omens :) Running Gentoo Linux.
  • #13: ID translations must be done both ways => denormalization. I wrote tests on my dataset so I could concentrate on the model while making sure that all my questions were being answered correctly and consistently.
  • #14: We ended up with three denormalized tables. History-like table (just like a log). Optimize for the latest value? This ensures that the latest values (rows) are stored at the beginning of the sstable file, effectively reducing the read latency when the row is not in cache!
  • #15: Docker based, easy to install, multi-environment support. Understand the performance of your cluster. Tune your workload for optimal performance.
  • #16: Reference dataset: data cardinality, representative volumes. MongoDB cluster: make sure to shard and index the dataset just like you do on the production collections. Hive: respect the storage file format of your current implementations as well as their partitioning. How many machines in production?! Say it.
  • #19: It’s time to break Scylla: your goal here is to saturate the Scylla cluster and get it to ~90% load.
  • #20: Read the 10M population rows from Hive in a partitioned manner. For each partition (slice of the 10M), query Scylla to look up the possibly matching partner ID. Create a dataframe from the resulting matches. Gather back all the dataframes and merge them. Count the number of matches. Spark 2 cold is 12 min; Spark 2 hot is 2 min.
  • #21: I experienced pretty crappy performance at first. Grafana monitoring showed that Scylla was not the bottleneck. Repartition is used to leverage the driver’s knowledge of how data is sharded, to optimize how it is split between Spark workers.
  • #22: Take your clusters’ utilization into account.
  • #23: Take your clusters’ utilization into account.
  • #24: With spinning disks, the cold start result can compete with the results of a heavily loaded Hadoop cluster, where pending containers and parallelism are knocking down its performance. Those three refurbished machines can compete with our current production machines and implementations. They can even match an idle Hive cluster of medium size. DIGRESSION!
  • #25: I went on the crazy quest of beating Spark 2 performance using a pure Python implementation. The main problem in competing with Spark 2 is that it is a distributed framework and Python by itself is not, so you can’t possibly imagine outperforming Spark 2 with a single machine. Spark 2 is shipped to and run on executors using YARN, so we are firing up JVMs and dispatching containers all the time. This is quite an expensive process that we have a chance to avoid using Python! joinWithCassandraTable JOINs 10M with 400M...
  • #26: Read the 10M population rows from Hive in a partitioned manner. For each partition (slice of the 10M), query Scylla to look up the possibly matching partner ID. Create a dataframe from the resulting matches. Gather back all the dataframes and merge them. Count the number of matches. Spark 2 cold is 12 min; Spark 2 hot is 2 min.
  • #28: libhdfs3 + pyarrow combo. It is faster to load everything on a single machine than to load from Hive on multiple ones!
  • #29: The Hive loading + partitioning went down from 50s to 10s.
  • #31: The conclusion of the evaluation was not driven by the good figures we got out of our test workloads. Those are not benchmarks and never pretended to be, but we could still prove that performance was solid enough not to be a blocker for the adoption of Scylla. Instead, we decided on the following points of interest (in no particular order): data consistency, production reliability, datacenter awareness, ease of operation, infrastructure rationalisation, developer friendliness (but it’s not Mongo), costs (train engineers).