Building and Scaling
Data Lineage at Netflix
Strata SF 2019
Di Lin, Girish Lingappa, Jitender Aswani (@jaswani)
CTA: Intentionally Purposeful to Improve Diversity in Tech
People, Security, & Infra Data Eng (DEI)
Data Lineage, Asset Inventory, Anomaly Detection
Leverage data to improve cloud infra
Show of hands
● in thinking / planning stages?
● currently building the system?
● done building?
Imagination Trip (User Stories)
#1 Data Scientist
validate source of the metrics?
#2 Software Engineer
who downstream of your service is impacted?
#3 Reliability Engineer
identify and alert on jobs / tables at risk of missing SLA?
Use Cases Inspired Lineage
● Detection and Data Cleansing
● Retention and Data Efficiency
● Cost Attribution
● Platform Reliability
● Data Integrity
intentionality behind complex data landscape
Freedom & Responsibility
distributed decision making =>
● move fast,
● scale complexity,
● judicious about what’s best for Netflix
Data Landscape
Top Challenges
Challenge # 1
Diverse Data Landscape
First time expanding beyond traditional data warehouse realm
Challenge # 2
Platform Evolution
New Platforms, Upgrades
Challenge # 3
Data Conformance
Variance in granularity of the metadata
Design Principles
Design Principle # 1
Ensure data coverage
Design Principle # 2
Enable seamless integration
Design Principle # 3
Envision a flexible data model
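One way to picture a flexible data model is a generic graph of typed entities and relationships, so that a new platform only adds new `kind` values rather than new schema. The sketch below is an illustration under that assumption, not the model described in the talk; `Entity`, `Edge`, `LineageGraph`, and the example names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A node in the lineage graph; `kind` is free-form (hypothetical values)."""
    kind: str   # e.g. "table", "job", "report"
    name: str


@dataclass(frozen=True)
class Edge:
    """A directed relationship between two entities."""
    source: Entity
    target: Entity
    relation: str  # e.g. "read_by", "writes"


@dataclass
class LineageGraph:
    edges: set = field(default_factory=set)

    def add(self, source, target, relation):
        self.edges.add(Edge(source, target, relation))

    def neighbors(self, entity):
        # Names of entities directly downstream of `entity`, sorted for stable output.
        return sorted(e.target.name for e in self.edges if e.source == entity)


graph = LineageGraph()
table = Entity("table", "playback_events")
job = Entity("job", "daily_rollup")
graph.add(table, job, "read_by")
graph.add(job, Entity("table", "playback_daily"), "writes")
```

Keeping `kind` and `relation` as plain strings is a deliberate choice in this sketch: it trades type safety for the ability to absorb metadata of varying granularity.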
Data Coverage
Data Coverage - data gets pushed
● Platform tools publish events (preferred approach)
○ Compute engines
○ Data movement tools
○ Reporting tools
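As a rough illustration of the push approach, a platform tool could emit a small, self-describing event each time it reads or writes data. The sketch below assumes a JSON payload; the `LineageEvent` fields and all example values are hypothetical and are not taken from the talk.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class LineageEvent:
    producer: str        # tool that emitted the event, e.g. a compute engine
    job_id: str
    inputs: List[str]    # datasets read
    outputs: List[str]   # datasets written
    emitted_at: str      # ISO-8601 timestamp supplied by the producer


def serialize(event: LineageEvent) -> str:
    """Render the event as JSON with sorted keys so payloads are easy to compare."""
    return json.dumps(asdict(event), sort_keys=True)


event = LineageEvent(
    producer="compute-engine",
    job_id="job-123",
    inputs=["db.raw_events"],
    outputs=["db.daily_summary"],
    emitted_at="2019-03-27T00:00:00Z",
)
payload = serialize(event)
```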
Data Coverage - we pull & parse data
● Ingestion scripts parse logs and metadata
○ Enumerate jobs and scripts
○ Parse plan info
○ Parse queries
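To make the "parse queries" step concrete, the sketch below pulls source and target tables out of a SQL string with regular expressions. This is a simplification for illustration only: a production parser would likely use the engine's query plan or a full SQL grammar, and the table names and the `extract_lineage` helper are hypothetical.

```python
import re

# Tables written by the statement (INSERT INTO / INSERT OVERWRITE [TABLE] / CREATE TABLE).
TARGET_RE = re.compile(
    r"\b(?:INSERT\s+(?:INTO|OVERWRITE)(?:\s+TABLE)?|CREATE\s+TABLE)\s+([\w.]+)",
    re.IGNORECASE,
)
# Tables read by the statement (FROM / JOIN followed by a plain table name).
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(query):
    """Return sorted source and target table names found in `query`."""
    targets = set(TARGET_RE.findall(query))
    sources = set(SOURCE_RE.findall(query)) - targets
    return {"sources": sorted(sources), "targets": sorted(targets)}


query = """
INSERT OVERWRITE TABLE dw.daily_plays
SELECT p.title_id, COUNT(*)
FROM dw.playback_events p
JOIN dw.titles t ON p.title_id = t.title_id
GROUP BY p.title_id
"""
result = extract_lineage(query)
```

The regex approach misses subqueries, CTEs, and quoted identifiers, which is one reason the slides list published events as the preferred approach.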
Lineage Data Flow
Data lineage
Use Cases
#1: Platform Reliability
● SLA Recommendation
● Predict SLA miss
● Instructive notifications
● Anomaly detection
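One simple way lineage could feed an SLA-miss prediction is to propagate expected finish times along the dependency chain and compare them with each deadline. The sketch below assumes fixed per-job durations in minutes and a known delay; it is not the method from the talk, and the job names, numbers, and `finish_time` helper are hypothetical.

```python
def finish_time(job, upstream, duration, earliest_start, memo=None):
    """Expected finish (minutes) = max(own earliest start, latest upstream finish) + duration."""
    memo = {} if memo is None else memo
    if job not in memo:
        ready = max(
            [finish_time(u, upstream, duration, earliest_start, memo)
             for u in upstream.get(job, [])],
            default=0,
        )
        memo[job] = max(ready, earliest_start.get(job, 0)) + duration[job]
    return memo[job]


upstream = {"report": ["rollup"], "rollup": ["ingest"]}   # job -> jobs it waits on
duration = {"ingest": 30, "rollup": 45, "report": 10}     # assumed historical averages
earliest_start = {"ingest": 20}                            # ingest is running 20 minutes late
sla_deadline = {"report": 90}                              # minutes from the start of the window

at_risk = [
    job for job, deadline in sla_deadline.items()
    if finish_time(job, upstream, duration, earliest_start) > deadline
]
```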
#2: Data visibility
● Reinstate key tables
● Deprecate tables
● Schema changes
● Contract no longer held
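Questions such as "what breaks if this table is deprecated or its schema changes?" reduce to a downstream traversal of the lineage graph. The sketch below uses a breadth-first search over an adjacency map; the entity names and the `downstream` helper are hypothetical.

```python
from collections import deque


def downstream(edges, start):
    """Return every entity reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Adjacency map: entity -> entities that consume it directly.
edges = {
    "raw.playback": ["dw.daily_plays"],
    "dw.daily_plays": ["report.top_titles", "ml.recs_features"],
    "ml.recs_features": ["report.top_titles"],
}
impacted = downstream(edges, "raw.playback")
```

The `seen` set keeps the traversal correct when an entity is reachable by more than one path, as `report.top_titles` is here.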
#3: Efficiency
● Cost evaluation for entities
○ Storage
○ Compute
○ Usage
● Retention
● Disaster recovery
Please visit us at
jobs.netflix.com
Continue your journey...
Lineage blog
Infra DE team blog
Netflix @ Strata
● Wednesday
○ Self-service data platform (4:40 PM @ 2002)
● Thursday
○ App performance measurement (11:50 AM @ 2002)
○ Security detection platform (11:50 AM @ 2006)
○ Personalization enhancement (2:40 PM @ 2024)
???
Thank you.
Lineage Mission
“Provide a complete and accurate data lineage system enabling decision-makers to win moments of truth”
