Self-serve Analytics Journey at Celtra:
Snowflake, Spark and Databricks
Grega Kespret
Director of Engineering, Analytics @
Matthew J. Glickman
Vice President of Product @
• Where we started (aggregate fact tables)
• Why we needed a data warehouse
• Requirements and evaluations
• Snowflake adoption
• How Celtra handles schema evolution and data rewrites
• Snowflake architecture
• Next steps
Story (Agenda)
Celtra AdCreator: creative platform for brand advertising
• 188,000+ Ads Built
• 15,000+ Campaigns
• 5,000+ Brands
• 2bn+ Analytics Events / Day
• 1TB+ New Data / Day
Key Pain Points
✗ Difficult to analyze the data collected
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
Pre-aggregations—Unable to Respond to
Speed and Complexity of Business
[Diagram: Trackers → Event data (Amazon S3) + Operational data (MySQL) → ETL → OLAP cubes (MySQL) → Applications, Client-facing dashboards, Adhoc queries]
• Point in time facts about what happened
• Bread and butter of our analytics data
• JSON records
• Very sparse
• Complex relationships between events
• Sessionization: Combine discrete events
into sessions
Event Data
• Patterns more interesting than sums and counts of events
• Easier to troubleshoot/debug with context
• Able to check for/enforce causality
(if X happened, Y must also have happened)
• De-duplication possible (no skewed rates because of outliers)
• Later events reveal information about earlier arriving events
(e.g. session duration, attribution, identity, etc.)
Sessionization: Why We Do It
Sessionization
[Diagram: Events → Sessionization → Sessions]
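As a minimal illustration of the idea (field names such as `eventId`, `sessionId`, and `timestamp` are hypothetical, not Celtra's actual event schema), sessionization amounts to de-duplicating events and grouping them by session key in time order:

```python
from collections import defaultdict

def sessionize(events):
    """Group discrete events into sessions: drop duplicate deliveries by
    event id, bucket by session key, and order each bucket by timestamp."""
    seen = set()
    sessions = defaultdict(list)
    for e in events:
        if e["eventId"] in seen:  # de-duplication: same event delivered twice
            continue
        seen.add(e["eventId"])
        sessions[e["sessionId"]].append(e)
    # order events within each session so later events can enrich earlier ones
    return {sid: sorted(evs, key=lambda e: e["timestamp"])
            for sid, evs in sessions.items()}

events = [
    {"eventId": "e1", "sessionId": "s1", "timestamp": 2, "type": "interaction"},
    {"eventId": "e2", "sessionId": "s1", "timestamp": 1, "type": "adRequest"},
    {"eventId": "e2", "sessionId": "s1", "timestamp": 1, "type": "adRequest"},  # duplicate
    {"eventId": "e3", "sessionId": "s2", "timestamp": 5, "type": "adRequest"},
]
result = sessionize(events)
# session s1 keeps two unique events, ordered adRequest -> interaction
```

In Spark this grouping requires a shuffle across the cluster, which is one reason the complex ETL stage exists.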
Spark for Complex ETL on Events
• Complex ETL: de-duplicate, sessionize, clean, validate, emit facts
• Production hourly runs
• Get full expressive power of Scala for ETL
• Shuffle needed for sessionization
• Seamless integration with S3
• Needed flexibility, not provided by precomputed aggregates
(unique counting, order statistics, outliers, etc.)
• Needed answers to questions that existing data model did not support
• Wanted short development cycles and faster experimentation
• Visualizations
✗ Difficult to Analyze the Data Collected
For example:
• Analyzing effects of placement position on
engagement rates
• Troubleshooting 95th percentile of ad loading
time performance
Key Pain Points
✓ Now able to analyze the data collected
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
Better with Databricks + SparkSQL
[Diagram: Trackers → Event data (Amazon S3) + Operational data (MySQL) → ETL → OLAP cubes (MySQL) → Applications, Client-facing dashboards; Adhoc SQL queries now run via Spark SQL directly on S3]
Key Pain Points
✓ Now able to analyze the data collected
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
✗ Complex ETL repeated in adhoc queries (slow, error-prone)
But a New Problem Emerged
[Diagram: Trackers → Event data (Amazon S3) + Operational data (MySQL) → ETL → OLAP cubes (MySQL) → Applications, Client-facing dashboards; Adhoc SQL queries repeat the complex ETL against raw events on S3]
Idea: Split ETL, Materialize Sessions
[Diagram: Trackers → Event data (Amazon S3) + Operational data (MySQL) → ETL Part 1 (complex: deduplication, sessionization, cleaning, validation, external dependencies) → Sessions (storage: ???) → ETL Part 2 (simple: aggregating across different dimensions) → OLAP cubes (MySQL) → Applications, Client-facing dashboards; Adhoc SQL queries against Sessions]
Requirements
• Fully managed service
• Columnar storage format
• Support for complex nested structures
• Schema evolution possible
• Data rewrites possible
• Scale compute resources separately
from storage
Needed Data Warehouse to Store
Intermediate Results
Nice-to-Haves
• Transactions
• Partitioning
• Skipping
• Access control
• Appropriate for OLAP use case
Operational tasks for self-service installation:
• Replace failed node
• Refresh projection
• Restart database with one node down
• Remove dead node from DNS
• Ensure enough (at least 2x) disk space
available for rewrites
• Backup data
• Archive data
Why We Wanted a Managed Service
We did not want to
deal with these tasks
1. Denormalize everything
✓ Speed
(aggregations without joins)
✗ Expensive storage
2. Normalize everything
✗ Speed (joins)
✓ Cheap storage
3. Nested objects: pre-group the data on each grain
✓ Speed (a "join" between parent and child is essentially free)
✓ Cheap storage
3 Choices for How to Model Sessions
[Entity diagram: Campaign 1:N Creative 1:N Session; Session 0:N Unit views; Unit view 1:N Page views; Page view 0:N Interactions]
Flat Data in Relational Tables vs. Nested Data (Flat + Normalized vs. Nested)
Find top 10 pages on creative units with most interactions on average
Flat vs. Nested Queries
SELECT
    creativeId,
    uv.name unitName,
    pv.name pageName,
    AVG(COUNT(*)) avgInteractions
FROM
    sessions s
    JOIN unitViews uv ON uv.sessionId = s.id
    JOIN pageViews pv ON pv.uvid = uv.id
    JOIN interactions i ON i.pvid = pv.id
GROUP BY 1, 2, 3 ORDER BY avgInteractions DESC LIMIT 10
Flat: joins require a unique ID at every grain.
Nested: the distributed join is turned into a local join.
SELECT
creativeId,
unitViews.value:name unitName,
pageViews.value:name pageName,
AVG(ARRAY_SIZE(pageViews.value:interactions)) avgInteractions
FROM
sessions,
LATERAL FLATTEN(json:unitViews) unitViews,
LATERAL FLATTEN(unitViews.value:pageViews) pageViews
GROUP BY 1, 2, 3 ORDER BY avgInteractions DESC LIMIT 10
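To see why the nested "join" is essentially free, here is a plain-Python analogue of the FLATTEN query above (the record shape is illustrative, not the exact Session model): reaching children is just iteration over embedded lists, with no join keys or shuffles.

```python
from collections import defaultdict

def avg_interactions(sessions):
    """Average interaction count per (creativeId, unitName, pageName),
    mirroring the LATERAL FLATTEN query: each child is reached by walking
    the nested structure of its parent record."""
    counts = defaultdict(list)
    for s in sessions:
        for uv in s["unitViews"]:
            for pv in uv["pageViews"]:
                key = (s["creativeId"], uv["name"], pv["name"])
                counts[key].append(len(pv["interactions"]))
    return {k: sum(v) / len(v) for k, v in counts.items()}

sessions = [
    {"creativeId": "c1", "unitViews": [
        {"name": "banner", "pageViews": [
            {"name": "main", "interactions": ["tap", "swipe"]},
        ]},
    ]},
    {"creativeId": "c1", "unitViews": [
        {"name": "banner", "pageViews": [
            {"name": "main", "interactions": []},
        ]},
    ]},
]
result = avg_interactions(sessions)
# {('c1', 'banner', 'main'): 1.0}
```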
Final Contenders for New Data Warehouse
• Evaluated Spark + HCatalog + Parquet + S3 solution
• Too many small files problem => file stitching
• No consistency guarantees over set of files on
S3 => secondary index | convention
• Liked one layer vs. separate Query layer (Spark), Metadata layer (HCatalog),
Storage format layer (Parquet), Data layer (S3)
Work with Data, Not Files
We really wanted a database-like abstraction with
transactions, not a file format!
We Chose Snowflake as Our Managed Data
Warehouse (DWaaS)
Pain Points
✓ Now able to analyze the data collected
✓ Data processed once & consumed many times: ETL'd data acts as a single source of truth
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
[Diagram: Trackers → Event data (Amazon S3) + Operational data (MySQL) → ETL → Sessions (Snowflake) → ETL → OLAP cubes (MySQL) → Applications, Client-facing dashboards; Adhoc SQL queries against Sessions]
Snowflake Adoption
• Backfilling / recomputing sessions from the last 2 years (since January 2014): 28TB of data (compressed)
• "Soft deploy": a period of mirrored writes, soon switching completely in production
• Each developer/analyst has their own database
• Separate roles and data warehouses for production, developers, and analysts
• Analysts & data scientists already use Snowflake through Databricks daily
• Session schema: known, well-defined (by the Session Scala model), and enforced
• Latest Session model: the authoritative source for the sessions schema
• Historical sessions conform to the latest Session model: any historical session can be de-serialized
• Readers should ignore fields not in the Session model: we do not guarantee to preserve this data
• Computing facts (metrics, dimensions) from the Session model is time-invariant: whether computed 2 months ago or today, the numbers must be the same
How Celtra Handles Data Structure Evolution
Schema Evolution
Change in Session model | Top-level / scalar column | Nested / VARIANT column
Rename field | ALTER TABLE tbl RENAME COLUMN col1 TO col2; | data rewrite (!)
Remove field | ALTER TABLE tbl DROP COLUMN col; | batch together in next rewrite
Add field, no historical values | ALTER TABLE tbl ADD COLUMN col type; | no change necessary

Also considered views for VARIANT schema evolution, but complex scenarios require a Javascript UDF, which loses the benefits of columnar access; not practical.
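A rename inside a VARIANT column therefore means rewriting every record. Below is a minimal sketch of such a per-record transform in Python, with hypothetical field names (in Snowflake the equivalent logic runs as a Javascript UDF, or via OBJECT_INSERT/OBJECT_DELETE for a single field):

```python
def rename_field(record, old, new):
    """Rewrite one JSON record: move key `old` to `new`, leaving all
    other fields untouched. In a full rewrite this is applied to every row."""
    out = dict(record)
    if old in out:
        out[new] = out.pop(old)
    return out

# hypothetical records; the real Session model is much larger
rows = [{"adRequestTs": 1, "accountId": "a1"}, {"accountId": "a2"}]
rewritten = [rename_field(r, "adRequestTs", "requestTimestamp") for r in rows]
# [{'accountId': 'a1', 'requestTimestamp': 1}, {'accountId': 'a2'}]
```

Batching several such changes into one pass over the table is what keeps the ~35TB rewrites tolerable.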
• They are sometimes necessary
• We have the ability to do data rewrites
• Rewrites of ~35TB (compressed) are not fun
• Complex and time consuming, so we fully automate them
• Costly, so we batch multiple changes together
• Rewrite must maintain sort order for fast access (note: UPDATE breaks it!)
• Javascript UDFs are our default approach for rewrites of data in VARIANT
Data Rewrites
• Expressive power of Javascript (vs. SQL)
• Run on the whole VARIANT record
• (Almost) constant performance
• More readable and understandable
• For changing a single field,
OBJECT_INSERT/OBJECT_DELETE
are preferred
Inline Rewrites with Javascript UDFs
CREATE OR REPLACE FUNCTION transform("json" variant)
RETURNS VARIANT
LANGUAGE JAVASCRIPT
AS '
// modify json
return json;
';
SELECT transform(json) FROM sessions;
Snowflake Spark Connector
• Implements Spark Data Sources API
• Access data in Snowflake through Spark SQL (via Databricks)
• Currently available in Beta, soon to be open-source
Operational
data
+
Event data
Adhoc queries
MySQL Amazon S3
ETL
Sessions
SQL
Snowflake Data Warehouse as a Service
• Centralized storage: database storage layer
• Instant, automatic scalability & elasticity
• Single service: scalable, resilient cloud services layer coordinates access & management
• Elastically scalable compute: multiple "virtual warehouse" compute clusters scale horsepower & concurrency
• Data Warehouse as a Service:
No infrastructure, knobs or tuning
• Infinite & Independent Scalability:
Scale storage and compute layers
independently
• One Place for All Data:
Native support for structured & semi-structured data
• Instant Cloning: Isolate prod/dev
• Highly Available:
11 9’s durability, 4 9’s availability
Snowflake’s Multi-cluster, Shared Data Service
[Diagram: shared logical databases accessed by multiple virtual warehouses: ETL & Data Loading, Finance, Dashboards, Marketing, Data Science, and a Dev/Test/QA clone]
Native Support for Structured + Semi-structured Data
• Any hierarchical, nested data type (e.g. JSON, Avro)
• Optimized VARIANT data type, no fixed schema or transformation required
• Full benefit of database optimizations (pruning, filtering, etc.)

Structured data (relational rows):
Apple | 101.12 | 250 | FIH-2316
Pear | 56.22 | 202 | IHO-6912
Orange | 98.21 | 600 | WHQ-6090

Semi-structured data, stored natively and queried using SQL:
{
  "firstName": "John",
  "lastName": "Smith",
  "height_cm": 167.64,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  }
}
Next Stage: Snowflake Also for Aggregates
[Diagram: Trackers → Event data (Amazon S3) + Operational data (MySQL) → ETL → Sessions (Snowflake) → ETL → OLAP cubes (moving from MySQL into Snowflake) → Applications, Client-facing dashboards; Adhoc SQL queries]
End Goal
Pain Points
✓ Now able to analyze the data collected
✓ Data processed once & consumed many times: ETL'd data acts as a single source of truth
✓ Fast schema changes in cubes (e.g. adding / removing a metric)
[Diagram: Trackers → Event data (Amazon S3) + Operational data (MySQL) → ETL → Sessions & OLAP cubes (Snowflake) → Applications, Client-facing dashboards, Adhoc queries]
Thank You.
Grega Kespret
Director of Engineering,
Analytics @
@gregakespret
github.com/gregakespret
slideshare.net/gregak
linkedin.com/in/gregakespret
Matthew J. Glickman
Vice President of Product,
@matthewglickman
linkedin.com/in/matthewglickman
Appendix
Snowflake Query Performance
• There are no indexes or projections
• Sort the data on ingest to maintain query performance
• 3 "tiers"
• Query cache
• File cache
• S3 storage
• Save sessions to S3 in parallel
(Spark cluster)
Getting Data INTO Snowflake
[Diagram: Event data (Amazon S3) + Operational data (MySQL) → ETL → Sessions (Snowflake); Adhoc SQL queries]
sessions.map(serializeJson).saveAsTextFile("s3a://...")
• Save sessions to S3 in parallel
(Spark cluster)
• Copy from S3 to temporary table
(Snowflake cluster)
Getting Data INTO Snowflake
[Diagram: Event data (Amazon S3) + Operational data (MySQL) → ETL → Sessions (Snowflake); Adhoc SQL queries]
CREATE TEMPORARY TABLE sessions_import (json VARIANT NOT NULL);

COPY INTO sessions_import FROM s3://...
  FILE_FORMAT = (FORMAT_NAME = 'session_gzip_json')
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
  REGION = 'external-1';
• Save sessions to S3 in parallel
(Spark cluster)
• Copy from S3 to temporary table
(Snowflake cluster)
• Sort and insert into main table
(Snowflake cluster)
Getting Data INTO Snowflake
INSERT INTO sessions
SELECT
  TO_TIMESTAMP_NTZ(json:adRequestServerTimestamp::int, 3)::date AS utcDate,
  json:accountId AS accountId,
  json:campaignId AS campaignId,
  HOUR(TO_TIMESTAMP_NTZ(json:adRequestServerTimestamp::int, 3)) AS utcHour,
  json:creativeId AS creativeId,
  json:placementId AS placementId,
  TO_TIMESTAMP_NTZ(json:adRequestServerTimestamp::int, 3),
  json
FROM sessions_import
ORDER BY
  utcDate ASC,
  accountId ASC,
  campaignId ASC,
  utcHour ASC,
  creativeId ASC,
  placementId ASC;
[Diagram: Event data (Amazon S3) + Operational data (MySQL) → ETL → Sessions (Snowflake); Adhoc SQL queries]
Getting Data OUT of Snowflake
COPY INTO s3://...
FROM (SELECT json FROM sessions WHERE ...)
FILE_FORMAT = (FORMAT_NAME = 'session_gzip_json')
REGION = 'external-1'
CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');
• Copy to S3 (Snowflake cluster)
[Diagram: Sessions (Snowflake) → ETL → OLAP cubes (MySQL) → Applications, Client-facing dashboards]
Getting Data OUT of Snowflake
val sessions: RDD[Session] = sc.textFile(s"s3a://...").map(deserialize)
• Copy to S3 (Snowflake cluster)
• Read from S3 and apply schema (Spark cluster)
[Diagram: Sessions (Snowflake) → ETL → OLAP cubes (MySQL) → Applications, Client-facing dashboards]
Combining Spark and Snowflake
[Diagram: parallel unload from Snowflake to AWS S3 → InputFormat → RDD[Array[String]] → DataFrame → parallel consumption in Spark]
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
ggg032019
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Andhra Pradesh Micro Irrigation Project”
Andhra Pradesh Micro Irrigation Project”Andhra Pradesh Micro Irrigation Project”
Andhra Pradesh Micro Irrigation Project”
vzmcareers
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia
Alexander Romero Arosquipa
 
Shotgun detailed overview my this ppt formate
Shotgun detailed overview my this ppt formateShotgun detailed overview my this ppt formate
Shotgun detailed overview my this ppt formate
freefreefire0998
 
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjksPpt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
panchariyasahil
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptxPRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
JayeshTaneja4
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
shit yudh slideshare power likha point presen
shit yudh slideshare power likha point presenshit yudh slideshare power likha point presen
shit yudh slideshare power likha point presen
vishalgurjar11229
 
brainstorming-techniques-infographics.pptx
brainstorming-techniques-infographics.pptxbrainstorming-techniques-infographics.pptx
brainstorming-techniques-infographics.pptx
maritzacastro321
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
History of Science and Technologyandits source.pptx
History of Science and Technologyandits source.pptxHistory of Science and Technologyandits source.pptx
History of Science and Technologyandits source.pptx
balongcastrojo
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
ggg032019
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Andhra Pradesh Micro Irrigation Project”
Andhra Pradesh Micro Irrigation Project”Andhra Pradesh Micro Irrigation Project”
Andhra Pradesh Micro Irrigation Project”
vzmcareers
 
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Day 1 - Lab 1 Reconnaissance Scanning with NMAP, Vulnerability Assessment wit...
Abodahab
 
Shotgun detailed overview my this ppt formate
Shotgun detailed overview my this ppt formateShotgun detailed overview my this ppt formate
Shotgun detailed overview my this ppt formate
freefreefire0998
 
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjksPpt. Nikhil.pptxnshwuudgcudisisshvehsjks
Ppt. Nikhil.pptxnshwuudgcudisisshvehsjks
panchariyasahil
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptxPRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
JayeshTaneja4
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
shit yudh slideshare power likha point presen
shit yudh slideshare power likha point presenshit yudh slideshare power likha point presen
shit yudh slideshare power likha point presen
vishalgurjar11229
 

Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks

  • 11. ✗ Difficult to Analyze the Data Collected
  • Needed flexibility not provided by precomputed aggregates (unique counting, order statistics, outliers, etc.)
  • Needed answers to questions that the existing data model did not support
  • Wanted short development cycles and faster experimentation
  • Visualizations
  For example:
  • Analyzing the effect of placement position on engagement rates
  • Troubleshooting 95th-percentile ad loading time performance
  • 12. Better with Databricks + SparkSQL
  Key Pain Points:
  ✓ Now able to analyze the data collected
  ✗ Slow to make schema changes in cubes (e.g. adding / removing a metric)
  [Architecture diagram: Trackers → Amazon S3 (event data) + MySQL (operational data) → ETL → OLAP cubes (MySQL) → applications and client-facing dashboards; ad hoc queries via Spark + SQL]
  • 13. But a New Problem Emerged
  Key Pain Points:
  ✓ Now able to analyze the data collected
  ✗ Slow to make schema changes in cubes (e.g. adding / removing a metric)
  ✗ Complex ETL repeated in ad hoc queries (slow, error-prone)
  [Architecture diagram: Trackers → Amazon S3 (event data) + MySQL (operational data) → ETL → OLAP cubes (MySQL) → applications and client-facing dashboards; ad hoc queries via Spark + SQL]
  • 14. Idea: Split ETL, Materialize Sessions
  Part 1 (complex): deduplication, sessionization, cleaning, validation, external dependencies
  Part 2 (simple SQL): aggregating across different dimensions
  [Architecture diagram: Trackers → Amazon S3 (event data) + MySQL (operational data) → complex ETL → Sessions (store: ???) → simple SQL ETL → OLAP cubes (MySQL) → ad hoc queries, applications, client-facing dashboards]
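  The complex half of the split, combining discrete events into sessions, can be sketched roughly as follows. This is a minimal Python sketch with hypothetical field names (eventId, sessionId, timestamp), not Celtra's actual Spark ETL:

```python
from collections import defaultdict

def sessionize(events):
    """De-duplicate events by id, group them by session id, and order each
    session's events by timestamp (hypothetical field names)."""
    seen = set()
    sessions = defaultdict(list)
    for e in events:
        if e["eventId"] in seen:  # de-duplication: drop repeated deliveries
            continue
        seen.add(e["eventId"])
        sessions[e["sessionId"]].append(e)
    # Order events within each session so later events can enrich earlier ones
    return {
        sid: sorted(evs, key=lambda e: e["timestamp"])
        for sid, evs in sessions.items()
    }

events = [
    {"eventId": 1, "sessionId": "s1", "timestamp": 10, "name": "adRequest"},
    {"eventId": 2, "sessionId": "s1", "timestamp": 12, "name": "interaction"},
    {"eventId": 2, "sessionId": "s1", "timestamp": 12, "name": "interaction"},  # duplicate
    {"eventId": 3, "sessionId": "s2", "timestamp": 11, "name": "adRequest"},
]
result = sessionize(events)
```

  In the real pipeline this grouping is what requires a shuffle in Spark; once sessions are materialized, Part 2 is plain aggregation.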
  • 15. Needed a Data Warehouse to Store Intermediate Results
  Requirements:
  • Fully managed service
  • Columnar storage format
  • Support for complex nested structures
  • Schema evolution possible
  • Data rewrites possible
  • Scale compute resources separately from storage
  Nice-to-haves:
  • Transactions
  • Partitioning
  • Skipping
  • Access control
  • Appropriate for the OLAP use case
  • 16. Why We Wanted a Managed Service
  Operational tasks for a self-service installation, which we did not want to deal with:
  • Replace failed nodes
  • Refresh projections
  • Restart the database with one node down
  • Remove dead nodes from DNS
  • Ensure enough (at least 2x) disk space is available for rewrites
  • Back up data
  • Archive data
  • 17. 3 Choices for How to Model Sessions
  1. Denormalize everything
     ✓ Speed (aggregations without joins)
     ✗ Expensive storage
  2. Normalize everything
     ✗ Speed (joins)
     ✓ Cheap storage
  3. Nested objects: pre-group the data at each grain
     ✓ Speed (a "join" between parent and child is essentially free)
     ✓ Cheap storage
  [Entity diagram: a Session contains unit views, which contain page views and interactions; each session links to a creative and a campaign]
  • 18. Complex Nested Structures
  [Diagram contrasting flat data in relational tables (Session, Unit views, Page views, Interactions, Creative, Campaign linked by 1-N / 0-N relationships) with nested data, where unit views, page views, and interactions are embedded inside the Session record]
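  The nested modeling choice can be illustrated with a toy session record (hypothetical field names, not the real schema): each child collection is pre-grouped under its parent, so walking from parent to child needs no join:

```python
# One session with its children pre-grouped at each grain
# (illustrative field names only).
session = {
    "creativeId": 42,
    "campaignId": 7,
    "unitViews": [
        {"name": "banner", "pageViews": [
            {"name": "intro",   "interactions": [{"type": "tap"}, {"type": "swipe"}]},
            {"name": "gallery", "interactions": [{"type": "tap"}]},
        ]},
    ],
}

# A "join" between parent and child is just iteration over the nested lists:
rows = [
    (session["creativeId"], uv["name"], pv["name"], len(pv["interactions"]))
    for uv in session["unitViews"]
    for pv in uv["pageViews"]
]
```

  This mirrors what LATERAL FLATTEN does in the nested SQL query on the next slide: the parent's columns are available locally while the children are expanded.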
  • 19. Flat vs. Nested Queries
  Find the top 10 pages on creative units with the most interactions on average.

  Flat + normalized (joins; requires a unique ID at every grain):

    SELECT creativeId,
           uv.name unitName,
           pv.name pageName,
           AVG(COUNT(*)) avgInteractions
    FROM sessions s
    JOIN unitViews uv ON uv.id = s.id
    JOIN pageViews pv ON pv.uvid = uv.id
    JOIN interactions i ON i.pvid = pv.id
    GROUP BY 1, 2, 3
    ORDER BY avgInteractions DESC
    LIMIT 10

  Nested (the distributed join is turned into a local join):

    SELECT creativeId,
           unitViews.value:name unitName,
           pageViews.value:name pageName,
           AVG(ARRAY_SIZE(pageViews.value:interactions)) avgInteractions
    FROM sessions,
         LATERAL FLATTEN(json:unitViews) unitViews,
         LATERAL FLATTEN(unitViews.value:pageViews) pageViews
    GROUP BY 1, 2, 3
    ORDER BY avgInteractions DESC
    LIMIT 10
  • 21. Work with Data, Not Files
  • Evaluated a Spark + HCatalog + Parquet + S3 solution
  • Too-many-small-files problem => file stitching
  • No consistency guarantees over a set of files on S3 => secondary index | convention
  • Liked one layer vs. separate query (Spark), metadata (HCatalog), storage format (Parquet), and data (S3) layers
  We really wanted a database-like abstraction with transactions, not a file format!
  • 22. We Chose Snowflake as Our Managed Data Warehouse (DWaaS)
  Pain Points:
  ✓ Now able to analyze the data collected
  ✓ Data processed once & consumed many times
  ✓ ETL'd data acts as a single source of truth
  ✗ Slow to make schema changes in cubes (e.g. adding / removing a metric)
  [Architecture diagram: Trackers → Amazon S3 (event data) + MySQL (operational data) → ETL → Sessions → SQL ETL → OLAP cubes (MySQL) → ad hoc queries, applications, client-facing dashboards]
  • 23. Snowflake Adoption
  • Backfilling/recomputing sessions for the last 2 years (from January 2014): 28TB of data (compressed)
  • "Soft deploy": a period of mirrored writes, then a complete switch in production
  • Each developer/analyst has their own database
  • Separate roles and data warehouses for production, developers, and analysts
  • Analysts & data scientists already use Snowflake through Databricks daily
  • 24. How Celtra Handles Data Structure Evolution
  • Session schema: known, well defined (by the Session Scala model), and enforced
  • Latest Session model: the authoritative source for the sessions schema
  • Historical sessions conform to the latest Session model: we can de-serialize any historical session
  • Readers should ignore fields not in the Session model: we do not guarantee to preserve this data
  • Computing facts (metrics, dimensions) from the Session model is time-invariant: whether computed 2 months ago or today, the numbers must be the same
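  The time-invariance rule can be pictured as a pure function from a session record to facts: the output depends only on the record, never on when the computation runs. A minimal Python sketch with invented fields (the real computation lives in the Scala model):

```python
def compute_facts(session):
    """Derive metrics/dimensions purely from the session record, so the
    result is identical whenever it is computed (hypothetical fields)."""
    return {
        "creativeId": session["creativeId"],
        "interactions": sum(
            len(pv["interactions"])
            for uv in session["unitViews"]
            for pv in uv["pageViews"]
        ),
    }

session = {"creativeId": 42, "unitViews": [
    {"pageViews": [{"interactions": [{"type": "tap"}]}]},
]}
facts = compute_facts(session)
```

  Because the function is deterministic and side-effect free, recomputing facts during a backfill must reproduce the historical numbers exactly.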
  • 25. Schema Evolution
  Change in Session model          | Top-level / scalar column                   | Nested / VARIANT column
  Rename field                     | ALTER TABLE tbl RENAME COLUMN col1 TO col2; | data rewrite (!)
  Remove field                     | ALTER TABLE tbl DROP COLUMN col;            | batch together in next rewrite
  Add field (no historical values) | ALTER TABLE tbl ADD COLUMN col type;        | no change necessary

  We also considered views for VARIANT schema evolution, but complex scenarios require a JavaScript UDF, which loses the benefits of columnar access. Not good for practical use.
  • 26. Data Rewrites
  • They are sometimes necessary, and we have the ability to do them
  • Rewrites of ~35TB (compressed) are not fun
  • Complex and time consuming, so we fully automate them
  • Costly, so we batch multiple changes together
  • A rewrite must maintain sort order for fast access (note: UPDATE breaks it!)
  • JavaScript UDFs are our default approach for rewriting data in VARIANT
  • 27. Inline Rewrites with JavaScript UDFs
  • Expressive power of JavaScript (vs. SQL)
  • Runs on the whole VARIANT record
  • (Almost) constant performance
  • More readable and understandable
  • For changing a single field, OBJECT_INSERT/OBJECT_DELETE are preferred

    CREATE OR REPLACE FUNCTION transform("json" variant)
    RETURNS VARIANT
    LANGUAGE JAVASCRIPT
    AS '
      // modify json
      return json;
    ';

    SELECT transform(json) FROM sessions;
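  What such a whole-record rewrite does can be mimicked outside the database. A Python sketch standing in for the UDF body, renaming a hypothetical field inside the record (not Celtra's actual rewrite logic):

```python
import copy

def transform(record):
    """Emulate a rewrite UDF: take the whole record and return a modified
    copy. Here we rename a hypothetical field, leaving the input untouched."""
    out = copy.deepcopy(record)
    if "adRequestTs" in out:
        out["adRequestServerTimestamp"] = out.pop("adRequestTs")
    return out

old = {"adRequestTs": 1200, "creativeId": 42}
new = transform(old)
```

  As the slide notes, for a single-field change the built-in OBJECT_INSERT/OBJECT_DELETE functions are preferable; the UDF approach pays off when several changes are batched into one pass over the record.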
  • 28. Snowflake Spark Connector
  • Implements the Spark Data Sources API
  • Access data in Snowflake through Spark SQL (via Databricks)
  • Currently available in beta, soon to be open source
  [Architecture diagram: ad hoc SQL queries over Sessions in Snowflake, alongside the S3 (event data) + MySQL (operational data) ETL pipeline]
  • 29. Snowflake: Data Warehouse as a Service
  • Centralized storage: instant, automatic scalability & elasticity
  • Single service: a scalable, resilient cloud services layer coordinates access & management
  • Elastically scalable compute: multiple "virtual warehouse" compute clusters scale horsepower & concurrency
  • 30. Snowflake's Multi-cluster, Shared Data Service
  • Data Warehouse as a Service: no infrastructure, knobs, or tuning
  • Infinite & independent scalability: scale storage and compute layers independently
  • One place for all data: native support for structured & semi-structured data
  • Instant cloning: isolate prod/dev
  • Highly available: 11 9's durability, 4 9's availability
  [Diagram: logical databases shared by separate virtual warehouses for ETL & data loading, finance, dev/test/QA, dashboards, and marketing, plus a clone for data science]
  • 31. Native Support for Structured + Semi-structured Data
  • Any hierarchical, nested data type (e.g. JSON, Avro)
  • Optimized VARIANT data type: no fixed schema or transformation required
  • Full benefit of database optimizations (pruning, filtering, etc.)
  [Slide shows a structured table (Apple 101.12 250 FIH-2316, ...) next to a semi-structured JSON record ({"firstName": "John", "lastName": "Smith", "height_cm": 167.64, "address": {...}}) stored natively and queried using SQL]
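  The path-style access SQL gets over VARIANT values (e.g. json:address.city) behaves roughly like walking nested keys and yielding NULL when a path is missing. A Python analogue of that semantics (not Snowflake's implementation):

```python
def extract(doc, path):
    """Rough analogue of VARIANT path access like `json:address.city`:
    walk dot-separated keys, returning None for any missing path."""
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc

person = {"firstName": "John", "address": {"city": "New York", "state": "NY"}}
```

  The difference in the database is that VARIANT fields are stored columnar-style, so such path extraction still benefits from pruning and filtering.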
  • 32. Next Stage: Snowflake Also for Aggregates
  [Architecture diagram: Trackers → Amazon S3 (event data) + MySQL (operational data) → ETL → Sessions → SQL ETL → OLAP cubes → ad hoc queries, applications, client-facing dashboards, with the aggregates moving into Snowflake alongside sessions]
  • 33. End Goal
  Pain Points:
  ✓ Now able to analyze the data collected
  ✓ Data processed once & consumed many times
  ✓ ETL'd data acts as a single source of truth
  ✓ Fast schema changes in cubes (e.g. adding / removing a metric)
  [Architecture diagram: Trackers → Amazon S3 (event data) + MySQL (operational data) → ETL → sessions & OLAP cubes in one store → ad hoc queries, applications, client-facing dashboards]
  • 34. Thank You.
  Grega Kespret, Director of Engineering, Analytics @ Celtra
  @gregakespret | github.com/gregakespret | slideshare.net/gregak | linkedin.com/in/gregakespret
  Matthew J. Glickman, Vice President of Product @ Snowflake
  @matthewglickman | linkedin.com/in/matthewglickman
  • 36. Snowflake Query Performance
  • There are no indexes or projections
  • Sort the data on ingest to maintain query performance
  • 3 "tiers": query cache, file cache, S3 storage
  • 37. Getting Data INTO Snowflake
  • Save sessions to S3 in parallel (Spark cluster):

    sessions.map(serializeJson).saveAsTextFile("s3a://...")
  • 38. Getting Data INTO Snowflake
  • Save sessions to S3 in parallel (Spark cluster)
  • Copy from S3 to a temporary table (Snowflake cluster):

    CREATE TEMPORARY TABLE "sessions-import" (json VARIANT NOT NULL);

    COPY INTO "sessions-import"
    FROM 's3://...'
    FILE_FORMAT = (FORMAT_NAME = 'session_gzip_json')
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
    REGION = 'external-1';
  • 39. Getting Data INTO Snowflake
  • Save sessions to S3 in parallel (Spark cluster)
  • Copy from S3 to a temporary table (Snowflake cluster)
  • Sort and insert into the main table (Snowflake cluster):

    INSERT INTO sessions
    SELECT TO_TIMESTAMP_NTZ(json:adRequestServerTimestamp::int, 3)::date AS utcDate,
           json:accountId AS accountId,
           json:campaignId AS campaignId,
           HOUR(TO_TIMESTAMP_NTZ(json:adRequestServerTimestamp::int, 3)) AS utcHour,
           json:creativeId AS creativeId,
           json:placementId AS placementId,
           TO_TIMESTAMP_NTZ(json:adRequestServerTimestamp::int, 3),
           json
    FROM "sessions-import"
    ORDER BY utcDate ASC, accountId ASC, campaignId ASC, utcHour ASC, creativeId ASC, placementId ASC;
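  The sort applied on ingest can be expressed as an explicit sort key over the same dimensions as the ORDER BY above. A Python sketch with hypothetical field names, showing how the ordering keeps related rows clustered:

```python
from datetime import datetime, timezone

def sort_key(session):
    """Sort key mirroring the ingest ORDER BY: date, account, campaign,
    hour, creative, placement (hypothetical field names, ms timestamps)."""
    ts = datetime.fromtimestamp(session["adRequestServerTimestamp"] / 1000,
                                tz=timezone.utc)
    return (ts.date(), session["accountId"], session["campaignId"],
            ts.hour, session["creativeId"], session["placementId"])

rows = [
    {"adRequestServerTimestamp": 86400000, "accountId": 2, "campaignId": 1,
     "creativeId": 1, "placementId": 1},
    {"adRequestServerTimestamp": 0, "accountId": 1, "campaignId": 1,
     "creativeId": 1, "placementId": 1},
]
rows.sort(key=sort_key)  # earlier date sorts first, then account, campaign, ...
```

  Keeping data physically ordered this way is what lets Snowflake prune files on these columns, which is also why rewrites must preserve the sort order.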
  • 40. Getting Data OUT of Snowflake
  • Copy to S3 (Snowflake cluster):

    COPY INTO 's3://...'
    FROM (SELECT json FROM sessions WHERE ...)
    FILE_FORMAT = (FORMAT_NAME = 'session_gzip_json')
    REGION = 'external-1'
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');
  • 41. Getting Data OUT of Snowflake
  • Copy to S3 (Snowflake cluster)
  • Read from S3 and apply schema (Spark cluster):

    val sessions: RDD[Session] = sc.textFile(s"s3a://...").map(deserialize)
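  The "apply schema" step, combined with the evolution rule that readers ignore fields not in the Session model, might look like the following. A Python sketch with a toy two-field Session standing in for the real Scala model:

```python
import json
from dataclasses import dataclass

@dataclass
class Session:
    # Tiny stand-in for the Session model (hypothetical fields).
    sessionId: str
    creativeId: int

KNOWN_FIELDS = {"sessionId", "creativeId"}

def deserialize(line):
    """Apply the Session schema to one JSON line, silently dropping any
    fields that are not in the model, as the evolution rules require."""
    raw = json.loads(line)
    return Session(**{k: v for k, v in raw.items() if k in KNOWN_FIELDS})

line = '{"sessionId": "s1", "creativeId": 42, "legacyField": true}'
session = deserialize(line)
```

  Tolerating unknown fields is what makes it safe to batch removals into a later rewrite: readers keep working while the stored records still carry the dropped data.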
  • 42. Combining Spark and Snowflake
  Parallel unload from Snowflake to AWS S3, then parallel consumption in Spark: InputFormat → RDD[Array[String]] → DataFrame