Scylla Summit 2018
Joining Billions of Rows in Seconds with One Database Instead of Two:
Replacing MongoDB and Hive with Scylla
Alexys Jacob
CTO, Numberly
whoami
@ultrabug
1 Eiffel Tower
2 Soccer World Cups
15 Years in the Data industry
Pythonista
OSS enthusiast & contributor
Gentoo Linux developer
CTO at Numberly - living in Paris, France
Business context of Numberly
Digital Marketing Technologist (MarTech)
Handling the relationship between brands and people (People based)
Dealing with multiple sources and a wide range of data types (Events)
Mixing and correlating a massive amount of different types of events...
...which all have their own identifiers (think primary keys)
Business context of Numberly
Web navigation tracking (browser ID: cookie)
CRM databases (email address, customer ID)
Partners’ digital platforms (cookie ID, hash(email address))
Mobile phone apps (device ID: IDFA, GAID)
Ability to synchronize and translate identifiers between all data sources and
destinations.
➔ For this we use ID matching tables.
ID matching tables
1. SELECT reference population
2. JOIN with the ID matching table
3. MATCHED population is usable by
partner
Queried AND updated all the time!
➔ High read AND write workload
JOIN
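The three steps above can be sketched with plain Python dictionaries (a minimal in-memory stand-in; the real matching tables live in a database):

```python
# Minimal sketch of the SELECT -> JOIN -> MATCH flow.
# The ID matching table maps our identifiers to a partner's identifiers.

# 1. SELECT the reference population (e.g. email addresses of previous donors)
reference_population = ["generous@coconut.fr", "isupportu@lab.com", "wiki4ever@wp.eu"]

# 2. JOIN with the ID matching table (email -> partner cookie ID);
#    some identifiers have no known translation yet.
id_matching_table = {
    "generous@coconut.fr": 123,
    "wiki4ever@wp.eu": 896,
}

# 3. The MATCHED population is what the partner can activate on
matched = {email: id_matching_table[email]
           for email in reference_population
           if email in id_matching_table}
```

Every incoming event may also add or refresh a row in `id_matching_table`, which is why the workload is read AND write heavy.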
Real life example: retargeting
From a database (email) to a web banner (cookie)
Previous
donors
generous@coconut.fr
isupportu@lab.com
wiki4ever@wp.eu
openinternet@free.fr
https://kitty.eu
AppNexus
...
Google
ID
matching
table
Cookie id = 123
Cookie id = 297
?
Cookie id = 896
Ad Exchange User cookie id 123
SELECT MATCH
ACTIVATE
Current implementation(s)
Events
Message
queues
HDFS
Real time
Programs
Batch
Calculation
MongoDB
Hive
Batch pipeline
Real time pipeline
Drawbacks & pitfalls
Events
Message
queues
HDFS
Real time
Programs
Batch
Calculation
MongoDB
Hive
Batch pipeline
Real time pipeline
Scylla?
Future implementation using Scylla?
Events
Message
queues
Real time
Programs
Batch
Calculation
Scylla
Batch pipeline
Real time pipeline
Proof Of Concept hardware
Recycled hardware…
▪ 2x DELL R510
• 19GB RAM, 16 cores, RAID0 SAS spinning disks, 1Gbps NIC
▪ 1x DELL R710
• 19GB RAM, 8 cores, RAID0 SAS spinning disks, 1Gbps NIC
➔ If it can compete with our production, Scylla is in!
Finding the right schema model
Query based AND test-driven data modeling
1. What are all the cookie IDs associated with a given partner ID over the last N
months?
2. What is the last cookie ID/date for the given partner ID?
Gotcha: the reverse questions are also to be answered!
➔ Denormalization
➔ Prototype with your language of choice!
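Such a test-driven prototype could look like this in pure Python (field names and dates are illustrative, standing in for the real schema):

```python
import datetime

# Denormalized rows as they would land in the matching table:
# (partner_id, date, cookie_id), newest first thanks to DESC ordering.
rows = [
    ("partner-42", datetime.date(2018, 11, 2), "cookie-896"),
    ("partner-42", datetime.date(2018, 9, 15), "cookie-123"),
    ("partner-42", datetime.date(2018, 1, 3), "cookie-297"),
]

def cookies_for_partner(rows, partner_id, since):
    """Question 1: all cookie IDs for a partner since a given date."""
    return [c for p, d, c in rows if p == partner_id and d >= since]

def last_cookie_for_partner(rows, partner_id):
    """Question 2: latest cookie ID/date; rows are stored newest first."""
    for p, d, c in rows:
        if p == partner_id:
            return c, d
    return None

# Tests pin the model down before writing any CQL.
assert cookies_for_partner(rows, "partner-42",
                           datetime.date(2018, 9, 1)) == ["cookie-896", "cookie-123"]
assert last_cookie_for_partner(rows, "partner-42") == ("cookie-896",
                                                       datetime.date(2018, 11, 2))
```

The reverse questions (partner ID from a cookie ID) get their own denormalized table and the mirror-image tests.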
Schema tip!
> What is the last cookie ID for the given partner ID?
TIP: CLUSTERING ORDER
▪ Defaults to ASC
➔ Latest value at the end of the
sstable!
▪ Change “date” ordering to DESC
➔ Latest value at the top of the
sstable
➔ Reduced read latency!
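In CQL this tip translates to something like the following DDL (keyspace, table and column names are illustrative, not Numberly's actual schema):

```sql
CREATE TABLE ids.partner_to_cookie (
    partner_id text,
    date       timestamp,
    cookie_id  text,
    PRIMARY KEY ((partner_id), date)
) WITH CLUSTERING ORDER BY (date DESC);

-- With DESC ordering, "latest cookie for a partner" is simply:
-- SELECT cookie_id, date FROM ids.partner_to_cookie
--   WHERE partner_id = ? LIMIT 1;
```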
scylla-grafana-monitoring
Set it up and test it!
▪ Use cassandra-stress
Key graphs:
▪ number of open connections
▪ cache hits / misses
▪ per shard/node distribution
▪ sstable reads
TIP: reduce default scrape interval
▪ scrape_interval: 2s (4s default)
▪ scrape_timeout: 1s (5s default)
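In the monitoring stack's prometheus.yml this amounts to (a sketch; the defaults quoted are the ones shipped at the time):

```yaml
global:
  scrape_interval: 2s   # default: 4s
  scrape_timeout: 1s    # default: 5s
```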
Reference data and metrics
Reference dataset
▪ 10M population
▪ 400M ID matching table
➔ Representative volumes
Measured on our production stack, with real load
NOT a benchmark!
Results:
▪ idle cluster: 2 minutes, 15 seconds
▪ normal cluster: 4 minutes
▪ overloaded cluster: 15 minutes
Spark 2 + Hive: reference metrics
Hive
(population)
Hive
(ID matching)
Partitions
count
+
Let’s use Scylla!
Testing with Scylla
Distinguish between hot and cold cache scenarios
▪ Cold cache: mostly disk I/O bound
▪ Hot cache: mostly memory bound
Push your Scylla cluster to its limits!
Spark 2 + Hive + Scylla
Hive
(population)
Scylla
(ID matching)
Partitions
count
+
Spark 2 / Scala test workload
DataStax’s spark-cassandra-connector joinWithCassandraTable
▪ spark-cassandra-connector-2.0.1-s_2.11.jar
▪ Java 7
Spark 2 tuning (1/2)
Use a fixed number of executors
▪ spark.dynamicAllocation.enabled=false
▪ spark.executor.instances=30
Change Spark split size to match Scylla for read performance
▪ spark.cassandra.input.split.size_in_mb=1
Adjust reads per second
▪ spark.cassandra.input.reads_per_sec=6666
Spark 2 tuning (2/2)
Tune the number of connections opened by each executor
▪ spark.cassandra.connection.connections_per_executor_max=100
Align driver timeouts with server timeouts (check scylla.yaml)
▪ spark.cassandra.connection.timeout_ms=150000
▪ spark.cassandra.read.timeout_ms=150000
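Put together, the tuning from both slides can be passed as a single spark-submit invocation (values straight from the slides; the job jar name is a placeholder):

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=30 \
  --conf spark.cassandra.input.split.size_in_mb=1 \
  --conf spark.cassandra.input.reads_per_sec=6666 \
  --conf spark.cassandra.connection.connections_per_executor_max=100 \
  --conf spark.cassandra.connection.timeout_ms=150000 \
  --conf spark.cassandra.read.timeout_ms=150000 \
  --jars spark-cassandra-connector-2.0.1-s_2.11.jar \
  your-join-job.jar
```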
ScyllaDB blog posts & webinar
▪ https://www.scylladb.com/2018/07/31/spark-scylla/
▪ https://www.scylladb.com/2018/08/21/spark-scylla-2/
▪ https://www.scylladb.com/2018/10/08/hooking-up-spark-and-scylla-part-3/
▪ https://www.scylladb.com/2018/07/17/spark-webinar-questions-answered/
Spark 2 + Scylla results
Cold cache: 12 minutes
Hot cache: 2 minutes
Reference results:
idle cluster: 2 minutes, 15 seconds
normal cluster: 4 minutes
overloaded cluster: 15 minutes
OK for Scala, what about Python?
No joinWithCassandraTable
when using pyspark...
Maybe we don’t need Spark 2 at all!
1. Load the 10M rows from Hive
2. For every row, look up the ID matching table in Scylla
3. Count the resulting number of matches
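Those three steps can be simulated end to end in plain Python (tiny in-memory stand-ins replace Hive and Scylla here; names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins: Hive holds the 10M reference population, Scylla the 400M
# matching table. Both are tiny in-memory samples in this sketch.
population = ["p1", "p2", "p3", "p4"]            # 1. loaded from Hive
matching_table = {"p1": "c-123", "p3": "c-896"}  # 2. looked up in Scylla

def lookup(partner_id):
    # In production this is a concurrent query against Scylla
    # (execute_concurrent on the Python driver).
    return matching_table.get(partner_id)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lookup, population))

# 3. Count the resulting number of matches
matches = sum(1 for r in results if r is not None)
```

Dask plays the role of the thread pool at scale, partitioning the population and dispatching the lookups.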
Dask + Hive + Scylla
Results:
▪ Cold cache: 6min
▪ Hot cache: 2min
Hive
(population)
Scylla
(ID matching)
Partitions
count
Dask + Hive + Scylla time break down
Hive
Scylla
Partitions
count
50 seconds
10 seconds
60 seconds
Dask + Parquet + Scylla
Parquet files
(HDFS)
Scylla
Partitions
count
10 seconds!
Dask + Scylla results
Cold cache: 5 minutes
Hot cache: 1 minute 5 seconds
Spark 2 results:
cold cache: 6 minutes
hot cache: 2 minutes
Python+Scylla with Parquet tips!
▪ Use execute_concurrent()
▪ Increase concurrency parameter (defaults to 100)
▪ Use libev as connection_class instead of asyncore
▪ Use hdfs3 + pyarrow to read and load Parquet files:
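A sketch putting those tips together (keyspace, table and host names are assumptions; the third-party imports are kept inside the functions so the module loads without a live cluster):

```python
def connect(hosts):
    """Scylla session using the libev event loop instead of asyncore."""
    from cassandra.cluster import Cluster
    from cassandra.io.libevreactor import LibevConnection
    cluster = Cluster(hosts, connection_class=LibevConnection)
    return cluster.connect()

def load_parquet(hdfs_host, path):
    """Read a Parquet file straight from HDFS with hdfs3 + pyarrow."""
    from hdfs3 import HDFileSystem
    import pyarrow.parquet as pq
    hdfs = HDFileSystem(host=hdfs_host, port=8020)
    with hdfs.open(path) as f:
        return pq.read_table(f)

def match(session, partner_ids, concurrency=512):
    """Concurrent lookups; raise concurrency well above the default 100."""
    from cassandra.concurrent import execute_concurrent_with_args
    query = session.prepare(
        "SELECT cookie_id FROM ids.partner_to_cookie WHERE partner_id = ?")
    args = [(pid,) for pid in partner_ids]
    return execute_concurrent_with_args(
        session, query, args, concurrency=concurrency)
```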
Scylla!
Production environment
▪ 6x DELL R640
• dual socket 2.6GHz 14C, 512GB RAM, Samsung 17xxx NVMe 3.2TB
Gentoo Linux
Multi-DC setup
Ansible-based provisioning and backups
Monitored by scylla-grafana-monitoring
Housekeeping handled by scylla-manager
Thank You
Questions welcomed!
Stay in touch
alexys@numberly.com
@ultrabug https://ultrabug.fr

Editor's Notes

  • #8: Lambda architecture
  • #9: Keeping two copies of the same data in sync; batch data freshness; operational burden => neither system can sustain both read and write workloads
  • #10: Can Scylla sustain our ID matching tables workloads while maintaining consistently low upsert/write and lookup/read latencies?
  • #11: Simpler data consistency Operational simplicity and efficiency Reduced costs
  • #12: Always try a technology under the best omens :) Running Gentoo Linux
  • #13: ID translations must be done both ways => denormalization. I wrote tests on my dataset so I could concentrate on the model while making sure that all my questions were being answered correctly and consistently.
  • #14: We ended up with three denormalized tables. History-like table (just like a log). Optimize for latest value? This will ensure that the latest values (rows) are stored at the beginning of the sstable file, effectively reducing the read latency when the row is not in cache!
  • #15: Docker-based, easy to install, multi-environment support. Understand the performance of your cluster. Tune your workload for optimal performance.
  • #16: Reference dataset: data cardinality, representative volumes. MongoDB cluster: make sure to shard and index the dataset just like you do on the production collections. Hive: respect the storage file format of your current implementations as well as their partitioning. How many machines in production?! Say it.
  • #19: It’s time to break Scylla, your goal here is to saturate the Scylla cluster, get it to ~90% load
  • #20: Read the 10M population rows from Hive in a partitioned manner; for each partition (slice of 10M), query Scylla to look up the possibly matching partnerid; create a dataframe from the resulting matches; gather back all the dataframes and merge them; count the number of matches. Spark 2 cold is 12min, Spark 2 hot is 2min.
  • #21: I experienced pretty poor performance at first. Grafana monitoring showed that Scylla was not the bottleneck. Repartitioning is used to leverage the driver’s knowledge of how data is sharded, to optimize how it is going to be split between Spark workers.
  • #22: Take your clusters’ utilization into account
  • #23: Take your clusters’ utilization into account
  • #24: With spinning disks, the cold start result can compete with the results of a heavily loaded Hadoop cluster where pending containers and parallelism are knocking down its performance. Those three refurbished machines can compete with our current production machines and implementations. They can even match an idle Hive cluster of medium size. DIGRESSION!
  • #25: I went on the crazy quest of beating Spark 2 performance using a pure Python implementation. The main problem competing with Spark 2 is that it is a distributed framework and Python by itself is not, so you can’t possibly imagine outperforming Spark 2 with a single machine. Spark 2 is shipped and run on executors using YARN, so we are firing up JVMs and dispatching containers all the time. This is a quite expensive process that we have a chance to avoid using Python! joinWithCassandraTable JOINs 10M with 400M...
  • #26: Read the 10M population rows from Hive in a partitioned manner; for each partition (slice of 10M), query Scylla to look up the possibly matching partnerid; create a dataframe from the resulting matches; gather back all the dataframes and merge them; count the number of matches. Spark 2 cold is 12min, Spark 2 hot is 2min.
  • #28: libhdfs3 + pyarrow combo. It is faster to load everything on a single machine than loading from Hive on multiple ones!
  • #29: The Hive loading + partitioning got down from 50s to 10s
  • #31: The conclusion of the evaluation was not driven by the good figures we got out of our test workloads. Those are no benchmarks and never pretended to be, but we could still prove that performance was solid enough not to be a blocker in the adoption of Scylla. Instead we decided on the following points of interest (in no particular order): data consistency, production reliability, datacenter awareness, ease of operation, infrastructure rationalisation, developer friendliness (but it’s not Mongo), costs (training engineers).