ClickHouse for Experimentation
Gleb Kanterov
@kanterov
2018-07-03
Quick Facts
170M Monthly Active Users
75M Subscribers
35M Tracks
65 Markets
[1] https://investors.spotify.com
Organization
1 Company 10 Organizations 30+ Tribes 150+ Squads
Move fast, break things
Ask for forgiveness, not for permission
AUTONOMY
Hadoop@Spotify
● On-Premise
● 2,500 nodes
● 100 PB Disk
● 100 TB RAM
● 100B+ events per day
● 20K+ jobs per day
Hadoop@Spotify
● Migration from On-Premise to GCP
● Moved 100 PB of data
● Our Hadoop cluster is dead
What are experiments, and why ClickHouse?
Randomized Controlled Experiment
An experiment where all subjects involved in the experiment are treated the same except for one deviation. One variable is changed in order to isolate the results.
All Khan Academy content is available for free at www.khanacademy.org
A/B Testing
A/B Testing is a randomized controlled experiment where one variable is tested.
E.g., hypothesis: "Our new recommendation algorithm increases content consumption."
How to verify?
1. Formulate hypothesis
2. Run A/B test
3. See if there is a statistically significant increase in consumption.
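Step 3 can be sketched as a two-sample comparison of per-user consumption. This is a minimal illustration, not the platform's actual test: it assumes large independent samples and uses the normal approximation to Welch's statistic. The function name and the sample data are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def welch_z_test(control, treatment):
    """Return (difference in means, two-sided p-value) for two samples.

    Uses the normal approximation to Welch's statistic, which is
    reasonable only when both samples are large.
    """
    diff = mean(treatment) - mean(control)
    # Standard error of the difference, allowing unequal variances.
    se = math.sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    z = diff / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return diff, p_value


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    # Hypothetical per-user consumption for control (A) and treatment (B).
    control = [rng.gauss(10.0, 3.0) for _ in range(5000)]
    treatment = [rng.gauss(10.5, 3.0) for _ in range(5000)]
    diff, p = welch_z_test(control, treatment)
    print(f"lift={diff:.3f} p={p:.4g} significant={p < 0.05}")
```

A p-value below the chosen threshold (0.05 here, an assumption) would count as a statistically significant change in consumption.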
Metrics for Experimentation Platform v1
[Architecture diagram. Event Delivery: Cloud Pub/Sub, Cloud Storage, Cloud Dataproc. Data Pipelines: Cloud Dataflow. Storage: aggregated data in Cloud Bigtable, granular data in Cloud Storage and BigQuery. Presentation: web application on Compute Engine. Ad-hoc Analytics. Actors: Product Teams, Product Owners, Data Scientists.]
1. Event Delivery
Developers instrument applications and services using an SDK. Events are collected and published to Pub/Sub.
Batch jobs read data from Pub/Sub, deduplicate and anonymize it, and then store it in hourly partitions on GCS.
Exposing users to experiments and configuring A/B variations on clients is done by dedicated services.
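The deduplicate-and-partition step can be sketched as follows. This is an in-memory illustration only: the real jobs run as batch pipelines, and the event fields `event_id` and `timestamp` are hypothetical names. Anonymization is left out.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_hourly(events):
    """Deduplicate events by id and group them into hourly partitions.

    `events` is an iterable of dicts with the hypothetical keys
    `event_id` and `timestamp` (Unix seconds, UTC).
    """
    seen = set()
    partitions = defaultdict(list)
    for event in events:
        if event["event_id"] in seen:
            continue  # drop redelivered duplicates
        seen.add(event["event_id"])
        hour = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        # Partition key such as "2018-07-03T14", one bucket per hour.
        partitions[hour.strftime("%Y-%m-%dT%H")].append(event)
    return dict(partitions)
```

Each key of the result would map to one hourly partition on GCS.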
2. Data Pipelines and Storage
Data is transformed and aggregated by Dataflow batch jobs, and stored in Bigtable, GCS and BigQuery.
Bigtable contains pre-computed aggregated experiment results.
BigQuery has granular data used in ad-hoc analysis.
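As a rough picture of what "pre-computed aggregated experiment results" could look like, the sketch below folds granular per-user rows into one small record per experiment and variant. The row layout and the choice of count, sum, and sum of squares are assumptions for illustration; the deck does not describe the actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Aggregate:
    users: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    @property
    def mean(self):
        return self.total / self.users


def aggregate(rows):
    """Fold rows of (experiment, variant, metric_value), one per user."""
    result = defaultdict(Aggregate)
    for experiment, variant, value in rows:
        agg = result[(experiment, variant)]
        agg.users += 1
        agg.total += value
        agg.total_sq += value * value
    return dict(result)
```

Count, sum, and sum of squares are enough to run a mean comparison later, but every dimension not in the key is lost, which is the trade-off between aggregated and granular storage.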
3. Presentation
Users of the experimentation platform see their experiment results in a web application.
Statistical tests and health checks are performed automatically.
4. Ad-hoc Analytics
Data scientists do ad-hoc exploration in Jupyter notebooks using BigQuery.
Here they answer experiment-specific questions that are not automatically supported by the experimentation system.
What works well
A centralized team owns hundreds of core metrics.
Automatic experiment analysis and planning allows teams to conclude experiments without manual analysis. Autonomous feature teams can move fast and iterate on their product.
Problems
Not every metric is worth centralizing.
The centralized team became a bottleneck for feature teams.
As a result, too much repetitive work goes into notebooks and ad-hoc queries.
[Architecture diagram: the v1 platform with the aggregated data in Cloud Bigtable replaced by granular data in an OLAP database.]
Reasons
1. Experimentation isn’t only about hypothesis testing, but also about learning from experiments. Aggregated data in Bigtable wasn’t granular enough and didn’t have enough dimensions.
2. Can’t add a new metric without involving a central team.
What we want
Provide teams with more granular data out of the box, and give them a way to define a new metric.
Requirements
● Serve hundreds of QPS with sub-second latency
● Queries and data are known in advance
● Maintain 10x as many metrics at the same cost
● Thousands of metrics
● Billions of rows per day in each of hundreds of tables
● Ready to be used out of the box
● Leverage existing infrastructure as much as feasible
● Hide unnecessary complexity from internal users
What about BigQuery?
● Supports Standard SQL
● Don’t have to optimize datasets in advance
● Works great for heavy queries with joins among multiple datasets
● Doesn’t need operations and machines running
● Good for interactive ad-hoc queries (~ minutes)
● Isn’t the best fit for a high volume of low-latency queries that are known in advance
Why ClickHouse?
● Built proofs of concept using various OLAP stores (ClickHouse, Druid, Pinot, ...)
● ClickHouse has the simplest architecture
● Powerful SQL dialect close to Standard SQL
● A comprehensive set of built-in functions and aggregators
● Was ready to be used out of the box
● Superset integration is great
● Easy to query using clickhouse-jdbc and jOOQ
Metrics for Experimentation Platform v2
[Architecture diagram. Event Delivery: Cloud Pub/Sub, Cloud Storage, Cloud Dataproc. Data Pipelines: Cloud Dataflow. Storage: granular data in ClickHouse, Cloud Storage and BigQuery. Presentation: web application. Ad-hoc Analytics. Superset. Actors: Product Teams, Product Owners, Data Scientists.]
5. ClickHouse
Interactive queries on granular data.
Reduce demand for notebooks and BigQuery with dashboards and exploration in Superset.
6. Metrics Catalog
Centralized place for teams to define their own metrics.
20 minutes to define a metric.
[Diagram: the v2 architecture extended with a Metrics Catalog, a Metrics API, and metric definitions.]
What we have built
● Our own DSL to define metrics, and a centralized metrics catalog
● Expressive and simple model that we can efficiently scale to thousands of metrics
● Generalized existing components to work with the Metrics DSL
○ data preparation and ingestion into ClickHouse
○ denormalization with conformed dimensions
○ creating dashboards, tables and charts in Superset
○ running statistical tests, and exposing results through an API
○ defining ownership, tiering, and other attributes
○ integrating with the rest of the infrastructure for alerting, monitoring, data quality, anomaly detection, access control, etc.
● Users don’t work with ClickHouse SQL or need to know how it works
● API to query metrics and metadata
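The deck does not show the Metrics DSL itself. Purely as an illustration of the idea (a declarative metric definition that tooling turns into ClickHouse SQL, so users never write it), here is a hypothetical sketch. The `Metric` fields, the table and column names, and the `to_sql` helper are all invented for this example.

```python
from dataclasses import dataclass

# Aggregate functions that exist in ClickHouse; the whitelist is an assumption.
ALLOWED_AGGREGATIONS = {"sum", "avg", "count", "uniq"}


@dataclass(frozen=True)
class Metric:
    name: str         # metric identifier in the catalog
    table: str        # source table (hypothetical)
    aggregation: str  # one of ALLOWED_AGGREGATIONS
    column: str       # column the aggregation is applied to
    owner: str        # owning team, as in "define ownership"


def to_sql(metric, dimensions=("experiment_id", "variant")):
    """Render a metric definition as a GROUP BY query over the given dimensions."""
    if metric.aggregation not in ALLOWED_AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {metric.aggregation}")
    dims = ", ".join(dimensions)
    return (
        f"SELECT {dims}, {metric.aggregation}({metric.column}) AS {metric.name} "
        f"FROM {metric.table} GROUP BY {dims}"
    )


total_streams = Metric(
    name="total_streams",
    table="listening_events",
    aggregation="sum",
    column="streams",
    owner="hypothetical-squad",
)
print(to_sql(total_streams))
```

Keeping definitions as data is what lets the same catalog entry drive ingestion, dashboards, and statistical tests.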
Ingestion to ClickHouse
● Move data from GCS to ClickHouse
● Use clickhouse-jdbc, custom code and RowBinary format
● Use daily partitioning, and ingest once a day
● 1 hour to ingest 5 TiB on a test cluster of 9 n1-standard-32 machines with 8 NVMe SSDs in RAID 0
● Don’t use materialized views in ClickHouse
● Offload most computation to batch data pipelines because of scalability, experience, and tooling
● TODO try ClickHouse-Native-JDBC
● TODO pre-sort in data pipelines before ingesting
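For a feel of what the RowBinary format involves, the sketch below encodes one row by hand. RowBinary writes fixed-width numbers little-endian and strings as an unsigned LEB128 length followed by the bytes. The three-column layout (UInt64, String, Float64) is a hypothetical table, and the talk's ingestion code used clickhouse-jdbc rather than Python.

```python
import struct


def encode_varint(n):
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_row(user_id, variant, value):
    """Encode one row for hypothetical columns (UInt64, String, Float64)."""
    raw = variant.encode("utf-8")
    return (
        struct.pack("<Q", user_id)      # UInt64, little-endian
        + encode_varint(len(raw)) + raw  # String: varint length + bytes
        + struct.pack("<d", value)       # Float64, little-endian
    )
```

Rows are simply concatenated with no separators, so the column order must match the target table exactly.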
What is next
● Do lambda-style ingestion for a subset of metrics with low-latency requirements
● Add more aggregations to DSL (e.g. 5 statistical moments)
● Add custom chart types to Superset
● Try ClickHouse for similar use cases within Spotify
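One way to read "statistical moments" as an aggregation is through power sums: keep the count and the sums of x, x^2, ..., x^5 per group, merge them by addition across partitions, and derive the raw moments at query time. The sketch below assumes that reading; the deck does not say how the aggregation would be implemented.

```python
def power_sums(values, order=5):
    """Return [n, sum(x), sum(x**2), ..., sum(x**order)] for the values."""
    sums = [0.0] * (order + 1)
    for x in values:
        p = 1.0
        for k in range(order + 1):
            sums[k] += p  # p equals x**k at this point
            p *= x
    return sums


def merge(a, b):
    """Combine the power sums of two disjoint groups of rows."""
    return [x + y for x, y in zip(a, b)]


def raw_moments(sums):
    """Turn power sums into raw moments E[x], E[x**2], ..."""
    n = sums[0]
    return [s / n for s in sums[1:]]
```

Power sums merge trivially, which suits daily partitions, but they can lose precision when values are large relative to their spread.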
Get & Download Wondershare Filmora Crack Latest [2025]
saniaaftab72555
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New VersionPixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
saimabibi60507
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
Top 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docxTop 10 Client Portal Software Solutions for 2025.docx
Top 10 Client Portal Software Solutions for 2025.docx
Portli
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025Adobe Lightroom Classic Crack FREE Latest link 2025
Adobe Lightroom Classic Crack FREE Latest link 2025
kashifyounis067
 
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...
TestMigrationsInPy: A Dataset of Test Migrations from Unittest to Pytest (MSR...
Andre Hora
 
Ad

Using ClickHouse for Experimentation

  • 2. Quick Facts: 170M Monthly Active Users, 75M Subscribers, 35M Tracks, 65 Markets [1] https://investors.spotify.com
  • 6. Organization: 1 Company, 10 Organizations, 30+ Tribes, 150+ Squads. Move fast, break things. Ask for forgiveness, not for permission. AUTONOMY
  • 8. Hadoop@Spotify ● On-Premise ● 2,500 nodes ● 100 PB Disk ● 100 TB RAM ● 100B+ events per day ● 20K+ jobs per day
  • 9. Hadoop@Spotify ● Migration from On-Premise to GCP ● Moved 100 PB of data ● Our Hadoop cluster is dead
  • 15. Randomized Controlled Experiment: An experiment where all subjects involved in the experiment are treated the same except for one deviation. One variable is changed in order to isolate the results. (All Khan Academy content is available for free at www.khanacademy.org)
  • 16. A/B Testing: A/B testing is a randomized controlled experiment in which one variable is tested. Example hypothesis: "Our new recommendation algorithm increases content consumption." How to verify? 1. Formulate the hypothesis. 2. Run an A/B test. 3. Check whether there is a statistically significant increase in consumption.
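Step 3 on the slide above can be sketched as a one-sided two-sample z-test. The talk does not say which statistical test the platform uses, so the normal approximation, the sample data, and the 0.05 threshold below are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def z_test_increase(control, treatment):
    """One-sided two-sample z-test: is the treatment mean greater than the control mean?

    Relies on the normal approximation, which is reasonable only for large samples.
    Returns (z statistic, one-sided p-value).
    """
    # Standard error of the difference between the two sample means.
    se = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    z = (mean(treatment) - mean(control)) / se
    # P(Z >= z) under the null hypothesis of no difference.
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical minutes of content consumed per user in each group.
control = [30, 32, 28, 31, 29, 30, 33, 27, 30, 31] * 50
treatment = [31, 33, 29, 33, 30, 32, 34, 28, 31, 32] * 50

z, p = z_test_increase(control, treatment)
significant = p < 0.05  # conventional threshold, not a value taken from the talk
```

With small samples or heavily skewed metrics, a t-test or a bootstrap would likely be a better fit than this approximation.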
  • 17. Metrics for Experimentation Platform v1 (architecture diagram; labels in slide order): Product Teams; Event Delivery (Cloud Pub/Sub, Cloud Storage, Cloud Dataproc); Data Pipelines (Cloud Dataflow); Storage (Aggregated Data: Cloud Bigtable; Granular Data: Cloud Storage); Ad-hoc Analytics; Presentation (Web Application, Compute Engine); Product Owners; Data Scientists; Granular Data: BigQuery
  • 18. Metrics for Experimentation Platform v1. 1. Event Delivery: Developers instrument applications and services using an SDK. Events are collected and published to Pub/Sub. Batch jobs read data from Pub/Sub, deduplicate and anonymize it, and then store it in hourly partitions on GCS. Exposing users to experiments and configuring A/B variations on clients are handled by dedicated services.
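The batch step on the slide above (deduplicate, anonymize, write hourly partitions) might look roughly like the sketch below. The event fields (`event_id`, `user_id`, `timestamp`), the key, and the use of a keyed hash are assumptions; a keyed hash is strictly pseudonymization, used here as a stand-in because the talk does not describe the actual anonymization method.

```python
import hashlib
import hmac
from collections import defaultdict
from datetime import datetime, timezone

SECRET_KEY = b"hypothetical-secret"  # assumption: a key managed outside the code


def pseudonymize(user_id: str) -> str:
    """Replace a user id with a keyed hash so the raw id is not stored."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def partition_events(events):
    """Deduplicate by event_id, pseudonymize user_id, and group events by UTC hour."""
    seen = set()
    partitions = defaultdict(list)
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate delivery: keep only the first copy
        seen.add(event["event_id"])
        ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        hour_key = ts.strftime("%Y-%m-%dT%H")
        cleaned = dict(event, user_id=pseudonymize(event["user_id"]))
        partitions[hour_key].append(cleaned)
    return dict(partitions)


# Hypothetical events; "e1" is delivered twice.
events = [
    {"event_id": "e1", "user_id": "alice", "timestamp": 1530576000},
    {"event_id": "e1", "user_id": "alice", "timestamp": 1530576000},
    {"event_id": "e2", "user_id": "bob", "timestamp": 1530577800},
    {"event_id": "e3", "user_id": "alice", "timestamp": 1530579660},
]
parts = partition_events(events)
```

In a real pipeline each hour key would map to a separate output location rather than an in-memory list.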
  • 19. Metrics for Experimentation Platform v1. 2. Data Pipelines and Storage: Data is transformed and aggregated by Dataflow batch jobs and stored in Bigtable, GCS, and BigQuery. Bigtable contains pre-computed aggregated experiment results. BigQuery holds granular data used in ad-hoc analysis.
  • 20. Metrics for Experimentation Platform v1. 3. Presentation: Users of the experimentation platform see their experiment results in a web application. Statistical tests and health checks are performed automatically.
  • 21. Metrics for Experimentation Platform v1. 4. Ad-hoc Analytics: Data scientists do ad-hoc exploration in Jupyter notebooks using BigQuery. Here they answer experiment-specific questions that the experimentation system does not support automatically.
  • 22. Metrics for Experimentation Platform v1. What works well: A centralized team owns hundreds of core metrics. Experiment analysis and planning are automatic, so experiments can be concluded without manual analysis. Autonomous feature teams can move fast and iterate on their product.
  • 23. Metrics for Experimentation Platform v1. Problems: Not every metric is worth centralizing. The centralized team became a bottleneck for feature teams. As a result, too much repetitive work goes into notebooks and ad-hoc queries.
  • 24. Metrics for Experimentation Platform v1 (diagram: Storage now shows Granular Data in an OLAP Database in place of Aggregated Data in Cloud Bigtable). Reasons: 1. Experimentation isn't only about hypothesis testing but also about learning from experiments. Aggregated data in Bigtable wasn't granular enough and didn't have enough dimensions. 2. A new metric can't be added without involving the central team. What we want: Give teams more granular data out of the box, and a way to define new metrics.
  • 25. Requirements ● Serve hundreds of QPS with sub-second latency ● Queries and data are known in advance ● Maintain 10x as many metrics at the same cost ● Thousands of metrics ● Billions of rows per day in each of hundreds of tables ● Ready to be used out of the box ● Leverage existing infrastructure as much as feasible ● Hide unnecessary complexity from internal users
  • 26. What about BigQuery? ● Supports Standard SQL ● Datasets don't have to be optimized in advance ● Works great for heavy queries with joins across multiple datasets ● Needs no operations work and no machines kept running ● Good for interactive ad-hoc queries (~ minutes) ● Isn't the best fit for a high volume of low-latency queries that are known in advance
  • 27. Why ClickHouse? ● Built proofs of concept using various OLAP stores (ClickHouse, Druid, Pinot, ...) ● ClickHouse has the simplest architecture ● Powerful SQL dialect close to Standard SQL ● A comprehensive set of built-in functions and aggregators ● Was ready to be used out of the box ● Superset integration is great ● Easy to query using clickhouse-jdbc and jOOQ
  • 28. Metrics for Experimentation Platform v2 (diagram: ClickHouse now holds the granular data in Storage, and Superset is added). 5. ClickHouse: Interactive queries on granular data. Reduce demand for notebooks and BigQuery with dashboards and exploration in Superset.
  • 29. Metrics for Experimentation Platform v2 (diagram adds Metrics Catalog, Metrics API, and metric definitions). 6. Metrics Catalog: A centralized place for teams to define their own metrics. It takes 20 minutes to define a metric.
  • 30. What we have built ● Our own DSL to define metrics, and a centralized metrics catalog ● An expressive and simple model that we can efficiently scale to thousands of metrics ● Existing components generalized to work with the Metrics DSL: ○ data preparation and ingestion into ClickHouse ○ denormalization with conformed dimensions ○ creating dashboards, tables and charts in Superset ○ running statistical tests and exposing results through an API ○ defining ownership, tiering, and other attributes ○ integrating with the rest of the infrastructure for alerting, monitoring, data quality, anomaly detection, access control, etc. ● Users don't work with ClickHouse SQL and don't need to know how it works ● API to query metrics and metadata
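The slides do not show the Metrics DSL itself, so the following is only a hypothetical illustration of the idea: a declarative metric definition that is rendered into SQL so that users never write ClickHouse queries by hand. The `Metric` class, its fields, the `experiment_group` column, and the table and column names are all invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import List

# Aggregate functions allowed in this sketch; all four exist in ClickHouse.
ALLOWED_AGGREGATIONS = {"sum", "avg", "count", "uniq"}


@dataclass(frozen=True)
class Metric:
    """Hypothetical metric definition: one aggregation over one column of one table."""
    name: str
    table: str
    column: str
    aggregation: str
    owner: str
    dimensions: List[str] = field(default_factory=list)

    def to_sql(self) -> str:
        """Render the definition as a SELECT grouped by experiment group and dimensions."""
        if self.aggregation not in ALLOWED_AGGREGATIONS:
            raise ValueError(f"unsupported aggregation: {self.aggregation}")
        group_cols = ", ".join(["experiment_group"] + list(self.dimensions))
        return (
            f"SELECT {group_cols}, {self.aggregation}({self.column}) AS {self.name} "
            f"FROM {self.table} "
            f"GROUP BY {group_cols}"
        )


# Hypothetical metric owned by a hypothetical squad.
minutes_played = Metric(
    name="minutes_played",
    table="playback_daily",
    column="minutes",
    aggregation="sum",
    owner="hypothetical-squad",
    dimensions=["country"],
)
sql = minutes_played.to_sql()
```

Identifiers are interpolated directly here for brevity; a production catalog would need to validate them, and would presumably attach the ownership, tiering, and alerting attributes the slide mentions.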
  • 31. Ingestion to ClickHouse ● Move data from GCS to ClickHouse ● Use clickhouse-jdbc, custom code and the RowBinary format ● Use daily partitioning, and ingest once a day ● 1 hour to ingest 5 TiB on a test cluster using 9 n1-standard-32 instances with 8 NVMe SSDs in RAID0 ● Don't use materialized views in ClickHouse ● Offload most computation to batch data pipelines because of scalability, experience and tooling ● TODO try ClickHouse-Native-JDBC ● TODO pre-sort in data pipelines before ingesting
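To make the RowBinary bullet concrete, here is a minimal sketch of how one row is laid out in that format: values are written back to back with no delimiters, fixed-width numbers are little-endian, and strings carry an unsigned LEB128 length prefix. The table layout (Date, String, String, Float64) and the field names are assumptions, and the talk's actual ingestion code uses clickhouse-jdbc, not Python.

```python
import struct
from datetime import date


def encode_varint(n: int) -> bytes:
    """Unsigned LEB128, which RowBinary uses for string lengths."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_row(day: date, user_hash: str, experiment_group: str, minutes: float) -> bytes:
    """Encode one row of a hypothetical table (Date, String, String, Float64)."""
    days_since_epoch = (day - date(1970, 1, 1)).days
    user_bytes = user_hash.encode("utf-8")
    group_bytes = experiment_group.encode("utf-8")
    return b"".join([
        struct.pack("<H", days_since_epoch),         # Date: UInt16 days since 1970-01-01
        encode_varint(len(user_bytes)), user_bytes,  # String: varint length, then bytes
        encode_varint(len(group_bytes)), group_bytes,
        struct.pack("<d", minutes),                  # Float64: little-endian IEEE 754
    ])


row = encode_row(date(2018, 7, 3), "abc", "treatment", 42.5)
```

Rows encoded this way are simply concatenated and sent as the body of an `INSERT ... FORMAT RowBinary` statement, which avoids text parsing on the server side.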
  • 32. What is next ● Do lambda-style ingestion for the subset of metrics with low-latency requirements ● Add more aggregations to the DSL (e.g. 5 statistical moments) ● Add custom chart types to Superset ● Try ClickHouse for similar use cases within Spotify