Levelling up your
data infrastructure
simon.belak@gmail.com

@sbelak
We all start here …
The Problem
… but eventually
• Want granularity smaller than GA exposes

• Want analysis GA doesn’t support

• Want to combine and analyse data from different sources
Goal: answer 80% of
questions stemming from
data in 20min or less
The analytics chasm
• 2 min: Ideal. Almost real-time. Can be done during brainstorming
without disrupting the flow.

• 20 min: Squeeze in somewhere in the day

• Project: Added to roadmap (fail)
Levelling up
1. Acquire data (directly, or from 3rd party APIs)

2. Store it in a data warehouse

3. Transform it to a usable and unified shape

4. Perform analytics on it
Intermezzo: My perspective
• Core developer at Metabase, an open source BI/analytics
tool. 3rd largest BI tool in the world. 20k+ companies use
us daily, including N26, Revolut, Swisscom

• Built analytics department at GoOpti from the ground up

• Helped 20+ companies become data-driven
Collecting requirements
1. Make a list of all the data sources you currently have, how much data is in
them (number of entities), and at what rate the data grows

2. Collect user stories from all potential users:

As a ______ I’d like to _________, because _________

3. Match each user story with the data sources it needs

4. Rank user stories using PIE (probability, impact, effort)

5. Rank data sources by summing the PIE scores of all user stories that require them

6. Build data infrastructure to enable the high-value cluster

7. Continue doing steps 1-6 as you iterate
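Steps 3-5 can be sketched in a few lines of Python. The slides do not say how the three PIE components combine, so the formula below (probability × impact / effort), the 0..1 and 1..10 scales, and the example stories and source names are all assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UserStory:
    text: str
    probability: float  # chance the story pays off, 0..1 (assumed scale)
    impact: float       # value if it pays off, 1..10 (assumed scale)
    effort: float       # cost to deliver, 1..10 (assumed scale)
    sources: tuple      # names of the data sources the story needs

    @property
    def pie(self) -> float:
        # Assumed scoring rule: reward probability and impact, penalise effort.
        return self.probability * self.impact / self.effort


def rank_sources(stories):
    """Sum the PIE score of every story that needs a source; highest first."""
    totals = defaultdict(float)
    for story in stories:
        for source in story.sources:
            totals[source] += story.pie
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical user stories, matched to hypothetical data sources.
stories = [
    UserStory("As a marketer I'd like to see cost per channel", 0.8, 8, 4, ("ads", "orders")),
    UserStory("As a PM I'd like to see onboarding drop-off", 0.5, 6, 2, ("events",)),
    UserStory("As a CFO I'd like to see gross margin per order", 0.9, 9, 9, ("orders",)),
]
ranked_sources = rank_sources(stories)
```

Here `orders` comes out on top because two stories depend on it, even though neither is the single highest-scoring story.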
A minimal data-collection
plan
• Event stream

• Goal: be able to reconstruct any given session from data

• Timestamp, session, action, payload, context/result
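A minimal sketch of such an event record and of session reconstruction, in Python. The field names follow the bullet above; the types, the `reconstruct_session` helper, and the sample events are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class Event:
    timestamp: datetime   # when it happened (kept in UTC here)
    session: str          # session identifier
    action: str           # what the user or system did
    payload: Dict[str, Any] = field(default_factory=dict)  # arguments of the action
    context: Dict[str, Any] = field(default_factory=dict)  # context/result


def reconstruct_session(events: List[Event], session: str) -> List[Event]:
    """Return one session's events in the order they happened."""
    return sorted(
        (e for e in events if e.session == session),
        key=lambda e: e.timestamp,
    )


def at(minute: int) -> datetime:
    return datetime(2024, 1, 1, 12, minute, tzinfo=timezone.utc)


# Events arrive interleaved and out of order, as they would from a collector.
stream = [
    Event(at(2), "s1", "add_to_cart", {"sku": "A-1"}, {"status": "ok"}),
    Event(at(1), "s2", "page_view", {"path": "/pricing"}),
    Event(at(0), "s1", "page_view", {"path": "/"}),
    Event(at(5), "s1", "checkout", {"total": 42.0}, {"status": "declined"}),
]
replay = [e.action for e in reconstruct_session(stream, "s1")]
```

If the five fields are present on every event, replaying a session is a filter and a sort; anything missing shows up as a gap in the replay.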
Invest into workflow
management from the
start
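At its core, workflow management means declaring which step depends on which and letting a scheduler derive the run order. The sketch below uses Python's standard `graphlib` to show only that idea; the task names are hypothetical, and a real tool such as Airflow adds scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter

log = []

# Hypothetical pipeline steps; each one just records that it ran.
tasks = {
    "extract_events": lambda: log.append("extract_events"),
    "extract_orders": lambda: log.append("extract_orders"),
    "load_raw": lambda: log.append("load_raw"),
    "transform": lambda: log.append("transform"),
}

# task -> the tasks that must finish before it
depends_on = {
    "load_raw": {"extract_events", "extract_orders"},
    "transform": {"load_raw"},
}

# static_order() yields every task after all of its dependencies.
for name in TopologicalSorter(depends_on).static_order():
    tasks[name]()
```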
Extract-Load-Transform
• Dump data somewhere as soon as possible so you don’t
lose it.

• Databases are fast and powerful enough to do most
transforms there. In return you get:

• Observability

• Analysts become more self-sufficient (if they know SQL)

• For small-to-medium data sizes (< 1M data points/day), it is
more performant and requires much less infrastructure
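A minimal sketch of the pattern, using Python's built-in `sqlite3` as a stand-in for the warehouse. The table, view, and column names are hypothetical. Raw rows are loaded untouched, and the transform is a SQL view on top of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Extract + Load: store the rows exactly as received, as text.
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT, ordered_at TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        ("acme", "19.50", "2024-01-03"),
        ("acme", "5.50", "2024-01-04"),
        ("globex", "n/a", "2024-01-04"),  # bad value is kept, not dropped
        ("globex", "30.00", "2024-01-05"),
    ],
)

# Transform: done in SQL, inside the database, as a view over the raw table.
conn.execute(
    """
    CREATE VIEW orders AS
    SELECT customer,
           CAST(amount AS REAL) AS amount,
           DATE(ordered_at)     AS ordered_on
    FROM raw_orders
    WHERE amount GLOB '[0-9]*'
    """
)

revenue = dict(
    conn.execute(
        "SELECT customer, ROUND(SUM(amount), 2) FROM orders GROUP BY customer"
    ).fetchall()
)
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

The malformed row is excluded from the `orders` view but still sits in `raw_orders`, so the transform can be corrected later without re-extracting anything.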
Good ELT is:
• Repeatable

• Observable

• Extensible

• Scalable

• Recoverable (don’t lose data, ever!)
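One common way to get "repeatable" and "recoverable" is to make the load step idempotent, so a failed run can simply be retried. The sketch below assumes every source record carries a unique id; the table name and the `load_batch` helper are hypothetical, and `sqlite3` stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The source's own event id is the primary key, so a replayed batch
# cannot create duplicates.
conn.execute("CREATE TABLE raw_events (event_id TEXT PRIMARY KEY, body TEXT)")


def row_count() -> int:
    return conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]


def load_batch(batch) -> int:
    """Insert a batch, skipping rows already present. Returns rows added."""
    before = row_count()
    with conn:  # one transaction: the batch lands completely or not at all
        conn.executemany(
            "INSERT OR IGNORE INTO raw_events (event_id, body) VALUES (?, ?)",
            batch,
        )
    return row_count() - before


batch = [("e1", '{"action": "login"}'), ("e2", '{"action": "search"}')]
first_run = load_batch(batch)
# Re-running the same batch (plus one new event) only adds the new event.
retry = load_batch(batch + [("e3", '{"action": "logout"}')])
```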
Designing your data
warehouse
Identify the principal axes of
your data
• User, account, transaction, instance, product, event (log)…

• There will (and should) be some overlap

• Different axes will have different granularity

• Some should be ordered in time
Data warehouse topology
• Big fat denormalised tables, one for each principal axis

• Use views to tailor the representation to your tools and
analysis needs
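A small sketch of this topology, again with `sqlite3` as a stand-in warehouse. The `orders_wide` table, its columns, and the view are hypothetical. Customer attributes are copied onto every order row, and a view shapes the wide table for one specific analysis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One wide table for the "order" axis: customer attributes live on every
# row instead of being joined in at query time.
conn.execute(
    """
    CREATE TABLE orders_wide (
        order_id         INTEGER PRIMARY KEY,
        ordered_at       TEXT,
        amount           REAL,
        customer_id      INTEGER,
        customer_country TEXT,
        customer_segment TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO orders_wide VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "2024-01-03", 20.0, 10, "SI", "smb"),
        (2, "2024-01-09", 35.0, 11, "DE", "enterprise"),
        (3, "2024-02-01", 15.0, 10, "SI", "smb"),
    ],
)

# A view tailors the wide table to one analysis: monthly revenue by segment.
conn.execute(
    """
    CREATE VIEW monthly_revenue_by_segment AS
    SELECT strftime('%Y-%m', ordered_at) AS month,
           customer_segment,
           SUM(amount) AS revenue
    FROM orders_wide
    GROUP BY month, customer_segment
    """
)

rows = conn.execute(
    "SELECT * FROM monthly_revenue_by_segment ORDER BY month, customer_segment"
).fetchall()
```

Ad-hoc questions then need no joins, and each tool or team can get its own view without touching the underlying table.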
Which DB?
• Optimize for ease of ad-hoc querying

• Should be decently performant (waiting kills productivity)
but is unlikely to be the bottleneck

• Simple to deploy, connect to, and use

• Strong data validation/schemas, but should also handle
non-structured data (validation on load = data loss)

• Sane handling of timezones, date-time arithmetic, &
numbers
My go-to stack
• Snowplow for event-like data

• Apache Airflow to manage the workflow

• (managed) Postgres for data warehouse (or Druid if only event data and a
lot of it) 

• dbt for data transforms

• Metabase for analytics

• Fully open-source

• Extensible, performant
SaaS alternatives
• Segment, Stitch Data

• Redshift, BigQuery, Snowflake

• Dataform 

• PowerBI, Looker
What to look for when
choosing your stack
• Iteration velocity

• Toil

• Observability

• Vendor lock-in

• Extensibility and repurposability (avoid the multiple tool anti-pattern)

• Don’t lose data

• Self-service

• Friction, friction, friction

• Cost (both setup & running)
Common pitfalls
• You need it before you can afford it

• (no) Ownership of data, processes, dashboards

• Overestimating scale

• Not iterating
Making dashboards
people use
Good dashboards are:
• Actionable

• Clear & simple

• Sharable (and a good teaching tool <3)
Add descriptions and
reasoning
Anticipate followup
questions & flow
Interactivity turns reports
into tools (and begets a
sense of ownership)
It should be easy to slip
into exploration mode
Design your dashboards
with user journey and
process in mind
Metric definitions are
rarely unambiguous,
nor self-explanatory.
Document them!
Why your dashboards fail to
cross the chasm
• Discoverability 

• Legibility

• Trust (in data, in creator, in correctness)
Exploratory analysis
101
Segmentation,
segmentation,
segmentation
Minimal segmentation
checklist
• New vs. Returning

• Time cohorts

• Milestone events

• Usage

• Value

• Customer attributes (company size, industry, …)

• Geography
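The first two items on the checklist can be derived from nothing more than an activity log. A minimal sketch, assuming a hypothetical log of (user, day) pairs and defining a user as "new" in the month they are first seen:

```python
from collections import defaultdict
from datetime import date

# (user, day of activity); a hypothetical activity log
activity = [
    ("ana", date(2024, 1, 5)),
    ("ana", date(2024, 2, 2)),
    ("bor", date(2024, 1, 20)),
    ("cat", date(2024, 2, 11)),
    ("cat", date(2024, 2, 25)),
]


def month_of(day: date) -> str:
    return day.strftime("%Y-%m")


# Time cohort = month of each user's first activity.
first_seen = {}
for user, day in activity:
    first_seen[user] = min(day, first_seen.get(user, day))
cohort = {user: month_of(day) for user, day in first_seen.items()}

# New vs. returning, per month: a user is "new" in their cohort month.
segments = defaultdict(lambda: {"new": set(), "returning": set()})
for user, day in activity:
    month = month_of(day)
    kind = "new" if cohort[user] == month else "returning"
    segments[month][kind].add(user)
```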
Think in distributions
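A quick illustration of why a single average is not enough, using Python's standard `statistics` module. The order values are made up: most are small and one is very large.

```python
from statistics import mean, quantiles

# Hypothetical order values with one extreme outlier.
order_values = [12, 15, 14, 13, 16, 15, 14, 13, 900]

average = mean(order_values)
# quantiles(..., n=4) returns the three quartile cut points.
q1, median, q3 = quantiles(order_values, n=4, method="inclusive")
```

The average lands above 100 although eight of the nine orders are under 20, while the quartiles (13, 14, 15) describe what a typical order looks like.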
Seasonality
Different segments, different behaviour, different
volumes
You can often encode
dynamic processes as
binary outcomes
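For example, "how quickly do users convert?" can be reduced to "did the user convert within N days of signup?". A minimal sketch, assuming a 7-day window and hypothetical signup and first-purchase dates:

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical data: signup day and, if it happened, first purchase day.
users = {
    "ana": (date(2024, 1, 1), date(2024, 1, 3)),
    "bor": (date(2024, 1, 1), date(2024, 2, 15)),
    "cat": (date(2024, 1, 2), None),
    "dan": (date(2024, 1, 5), date(2024, 1, 9)),
}

WINDOW = timedelta(days=7)  # assumed cut-off


def converted_within_window(signup: date, purchase: Optional[date]) -> int:
    """1 if the purchase happened within WINDOW of signup, else 0."""
    return int(purchase is not None and purchase - signup <= WINDOW)


outcomes = {name: converted_within_window(*days) for name, days in users.items()}
conversion_rate = sum(outcomes.values()) / len(outcomes)
```

Once the process is a 0/1 column, it can be averaged, segmented, and compared like any other rate.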
Signal or noise?
• Trend & relative change often tell more than absolute
values

• Percentiles

• Intra- vs. inter-segment variance

• Significance tests

• Sample representativeness (is not just for A/B tests)

• Distribution similarity 

• Have a reference point (and reference it often)
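As one example of a significance test, a two-proportion z-test checks whether two segments' conversion rates differ by more than chance would explain. This is a sketch using only the standard library; the counts are hypothetical, and the normal approximation assumes reasonably large samples.

```python
from math import erfc, sqrt


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    rate_a, rate_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Hypothetical conversion counts for two segments: 12% vs 9% in both cases.
z_small, p_small = two_proportion_z_test(12, 100, 9, 100)
z_large, p_large = two_proportion_z_test(1200, 10000, 900, 10000)
```

The same 3-point gap is indistinguishable from noise with 100 users per segment and clearly significant with 10,000, which is why sample size has to be read alongside any difference.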
Case: MESI
MESI
• Medical devices

• North-star metric: number of measurements/device

• Current data sources: GA, product database, Countly,
Sentry, HubSpot, Odoo
MESI data acquisition
• Collect event stream from devices capturing all the interactions [Snowplow]

• Mirror product database into data warehouse [Airflow]

• Collect event stream from the website [Snowplow]

• Integrate HubSpot and Odoo via API [Airflow]

• Integrate Sentry via API [Airflow]

• (Retire Countly) 

• (Add support data — Jira, Zendesk, …)

• (Add accounting/billing)
MESI data warehouse
• (managed) Postgres

• Principal axes: account, user, device event, user journey
event, device
MESI analytics
• Metabase

• User journey before conversion

• Device usage patterns

• UX friction points

• Onboarding

• Errors & support issues

• Segmentation
Case: SalesGenomics
SalesGenomics
• eCommerce marketing agency focused on scale-up

• Typical customer marketing budget 10k-100k/month

• Current data sources: GA, FB, Shopify 

• 2-sided reporting: for clients, internal
SalesGenomics data
acquisition
• Custom event collector on websites (replacing GA
snippet) [Snowplow]

• Integrate Shopify, AdWords, FB ads [Airflow]

— OR — 

• Use Segment/Stitch Data
SalesGenomics data
warehouse
• (managed) Postgres

• Principal axes: order, order item, user, account, user
journey event, ad, ad campaign
SalesGenomics analytics
• Metabase

• Cross-client learning & benchmarking

• User journey

• Segmentation

• Order patterns and periodicity 

• Gross margin!
• Cost analysis (shipping, marketing, returns …)
Case: starting from 0
Starting from 0
• Set up GA (remember the minimal data-collection plan)

• Connect Metabase to your product DB

• Collect data user stories from day 1

• Focus analytics on user journey, segmentation, costs, & UX
Questions
simon.belak@gmail.com

@sbelak
Resources
metabase.com

airflow.apache.org

github.com/fishtown-analytics

postgresql.org

github.com/snowplow/snowplow

druid.apache.org

segment.com

stitchdata.com

dataliftoff.com/elt-with-amazon-redshift-an-overview
