WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
Mars Lan
The WhereHows Team
Apr 26, 2017
Big Data Meetup @ LinkedIn
GitHub: github.com/linkedin/WhereHows
Gitter: gitter.im/wherehows
Google Groups: WhereHows
Agenda
● LinkedIn’s Data Ecosystem
● The Metadata Problem
● WhereHows: Architecture and Details
● Future Evolution
What is LinkedIn?
Mission
Connect the world’s professionals to make them more productive and successful
LinkedIn Data Ecosystem
[Diagram: data flow between the components below]
● Users: members and customers (LinkedIn.com: desktop, mobile apps); employees (LinkedIn.corp: internal applications, e.g. dashboards)
● Services (Prod + Corp)
● Systems: Kafka, Samza, Hadoop, Teradata, databases (Espresso, MySQL, Oracle), derived data stores and indexes (Pinot, Search, Voldemort, Venice, Graph, MySQL)
● Flow labels: logs, events, messages; streaming; CDC; snapshots, incremental dumps; streaming ingest; batch loads; reads; reads, writes
● Workload labels: data standardization, reporting, ML; data standardization, reporting
LinkedIn’s Data Ecosystem
● Data Platforms: Oracle, MySQL, Espresso, Teradata, Pinot, Kafka, Hadoop, Couchbase, Voldemort, Venice
● Transformation Systems: SQL, Pig, Map-Reduce, Hive, Cascading, Scalding, Spark, Samza, Java, Custom
Challenges Introduced by Diversity
● Cross Platform
○ Siloed and non-interoperable metadata
○ Missing linkage between platforms
● Challenges within Platforms
○ Big data platforms (e.g. Hadoop) encourage sprawl
○ Schema-free systems => inferring structure is hard
○ Multiple processing frameworks => lineage is tough
Some Early Questions
WhereHows
Open source @ github.com/linkedin/wherehows
WhereHows @ 10,000 ft
WhereHows @ LinkedIn
Lineage
WhereHows Concepts
● Dataset: A logical collection of data, e.g. Oracle table, Hive View, HDFS
directory, Kafka topic
● Process / Flow: A processing workflow that contains one or more jobs
● Lineage: A relationship between datasets deduced from operation data
● Metric: A business metric with additional info on source, formula, dimensions,
dashboard, wiki etc.
● Ownership: dev owner, producer, consumer, delegate, stakeholder
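The concepts above can be read as a small data model. The following is a minimal sketch only: the class names, field names, and the `PageViewEvent` example are hypothetical and are not taken from the WhereHows code base.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class OwnerType(Enum):
    # Ownership categories as listed on the slide.
    DEV_OWNER = "dev owner"
    PRODUCER = "producer"
    CONSUMER = "consumer"
    DELEGATE = "delegate"
    STAKEHOLDER = "stakeholder"


@dataclass(frozen=True)
class Dataset:
    platform: str  # e.g. "oracle", "hive", "hdfs", "kafka"
    name: str      # table, view, directory, or topic name (hypothetical field)


@dataclass
class Flow:
    name: str
    jobs: List[str] = field(default_factory=list)  # one or more jobs


@dataclass(frozen=True)
class LineageEdge:
    source: Dataset
    destination: Dataset
    job: str  # the job whose operation data implied this relationship


@dataclass(frozen=True)
class Ownership:
    dataset: Dataset
    owner: str
    owner_type: OwnerType


# Usage: a Kafka topic landed on HDFS by a (hypothetical) ingestion job.
topic = Dataset("kafka", "PageViewEvent")
landed = Dataset("hdfs", "/data/tracking/PageViewEvent")
edge = LineageEdge(topic, landed, job="ingest-page-views")
owner = Ownership(landed, "some-team", OwnerType.PRODUCER)
```

Keeping `Dataset` immutable lets it serve as a dictionary key, which is convenient when lineage edges are later grouped by source or destination.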
WhereHows Architecture
[Diagram. Components:]
● WH App (Play + Ember)
● Rest.li API
● Metadata Store: WH MySQL
● Elasticsearch, with an Index Builder
● Metadata sources
○ Catalog (Schema): HDFS, Teradata, Oracle, Kafka, Voldemort, Hive, ...
○ Lineage: Azkaban, Gobblin
○ Ownership: Git, ownership repository, ...
Catalog - Challenges
● Standardization: Single metadata model that works with all platforms
○ Least-common-denominator vs leaky abstractions
○ What is a dataset? A Table? A Database? A Metric?
● Extraction: Each data platform stores metadata differently
○ HDFS - files/directories plus schema files
○ TD/Oracle - DBC.Table, ALL_TABLES etc
○ Kafka - Topic, Schema registry
● Freshness: Trust erodes with staleness
[Figure: Trust vs. Freshness]
Catalog - Our Approach
● URN-based naming for datasets in all platforms
○ Generalized + specialized metadata models under evolution
● Quick authoring of platform-specific ETL jobs using Jython
● Pull model (extract + transform) and push model (Kafka, REST) both exist
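As an illustration of URN-based naming, here is a minimal sketch that builds and parses dataset names of the form `<platform>:///<path>`. The exact URN syntax, the helper names `make_urn` and `parse_urn`, and the example paths are assumptions made for this sketch, not a specification of WhereHows.

```python
from typing import NamedTuple


class DatasetUrn(NamedTuple):
    platform: str  # e.g. "hdfs", "teradata", "kafka"
    path: str      # platform-specific location, always starting with "/"


def make_urn(platform: str, path: str) -> str:
    """Build a URN string of the assumed form <platform>:///<path>."""
    if not platform or not platform.isalnum():
        raise ValueError(f"invalid platform: {platform!r}")
    return f"{platform.lower()}:///{path.strip('/')}"


def parse_urn(urn: str) -> DatasetUrn:
    """Split a URN back into its platform and path."""
    platform, sep, rest = urn.partition(":///")
    if not sep or not platform or not rest:
        raise ValueError(f"not a dataset URN: {urn!r}")
    return DatasetUrn(platform, "/" + rest)


# Usage: the same naming function covers every platform.
hdfs_urn = make_urn("HDFS", "/data/tracking/PageViewEvent")
td_urn = make_urn("teradata", "dwh/dim_member")
```

A single naming scheme like this is what lets one catalog table, one search index, and one lineage graph refer to datasets from every platform without platform-specific keys.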
Lineage - Challenges
● Diversity in processing frameworks on Hadoop
● Inferring from code is not trivial - think UDF, external parameters etc
● Cross data platform lineage requires mapping all data copies
● Visualization is non-trivial with huge fan-out
[Figure: Pretty vs. Understandable]
Lineage - Our Approach
● Azkaban’s execution logs for intra-Hadoop lineage
○ Hadoop job ID => Job conf from job history node => source + destination pair
● AppWorx execution log
● Gobblin events for into-Hadoop and cross-Hadoop cluster lineage
● Heuristics based on known patterns
● Lineage API, Tabular representation for downstream impact
We also have pretty, unreadable lineage graphs :)
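The job-configuration step and the tabular downstream-impact view can be sketched as follows. This is a simplified illustration, not the WhereHows implementation: it assumes the job configuration has already been fetched as a dictionary, it reads only the standard Hadoop 2 `FileInputFormat` and `FileOutputFormat` directory properties, and it splits input paths on plain commas without handling escaping. The function names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

# Standard Hadoop 2 property names for file-based input and output.
INPUT_KEY = "mapreduce.input.fileinputformat.inputdir"
OUTPUT_KEY = "mapreduce.output.fileoutputformat.outputdir"


def lineage_pairs(job_conf: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (source, destination) pairs implied by one job's configuration."""
    destination = job_conf.get(OUTPUT_KEY, "").strip()
    sources = [s.strip() for s in job_conf.get(INPUT_KEY, "").split(",") if s.strip()]
    if not destination:
        return []
    return [(source, destination) for source in sources]


def downstream_impact(
    pairs: Iterable[Tuple[str, str]], start: str
) -> List[Tuple[str, int]]:
    """Breadth-first walk listing each downstream dataset with its distance."""
    children: Dict[str, Set[str]] = defaultdict(set)
    for source, destination in pairs:
        children[source].add(destination)

    rows: List[Tuple[str, int]] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for child in sorted(children[node]):
            if child not in seen:
                seen.add(child)
                rows.append((child, depth + 1))
                queue.append((child, depth + 1))
    return rows


# Usage with two hypothetical job configurations.
confs = [
    {INPUT_KEY: "/data/a,/data/b", OUTPUT_KEY: "/data/c"},
    {INPUT_KEY: "/data/c", OUTPUT_KEY: "/data/d"},
]
pairs = [pair for conf in confs for pair in lineage_pairs(conf)]
impact = downstream_impact(pairs, "/data/a")
```

Returning rows of (dataset, distance) rather than a drawn graph mirrors the point on the slide: a table stays readable even when fan-out is large.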
Anatomy of Metadata ETL
● Extract
○ Gather metadata from source (direct query, crawling file system, log parsing etc)
○ Build JSON representation of metadata
○ Dump JSON to file
● Transform
○ Convert JSON objects into CSV conforming to the destination table structure
● Load
○ Load CSV files into table, performing diff if necessary
[Diagram: Data Platform → Extract → JSON → Transform → CSV → Load → Metadata DB]
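The extract, transform, and load steps described above can be sketched end to end. This is an illustration under stated assumptions: the `dataset` table, its two columns, and all function names are hypothetical, and an in-memory SQLite database stands in for the metadata database so that the example is self-contained.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path
from typing import Dict, List

COLUMNS = ["urn", "schema_text"]  # hypothetical destination table columns


def extract(records: List[Dict[str, str]], json_path: Path) -> None:
    """Extract: dump the gathered metadata to a JSON file."""
    json_path.write_text(json.dumps(records))


def transform(json_path: Path, csv_path: Path) -> None:
    """Transform: reshape JSON objects into CSV rows matching COLUMNS."""
    records = json.loads(json_path.read_text())
    with csv_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)


def load(csv_path: Path, db: sqlite3.Connection) -> int:
    """Load: write only new or changed rows; return how many were written."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS dataset (urn TEXT PRIMARY KEY, schema_text TEXT)"
    )
    changed = 0
    with csv_path.open(newline="") as handle:
        for row in csv.DictReader(handle):
            current = db.execute(
                "SELECT schema_text FROM dataset WHERE urn = ?", (row["urn"],)
            ).fetchone()
            if current is None or current[0] != row["schema_text"]:
                db.execute(
                    "INSERT OR REPLACE INTO dataset (urn, schema_text) VALUES (?, ?)",
                    (row["urn"], row["schema_text"]),
                )
                changed += 1
    db.commit()
    return changed


def run_etl(records: List[Dict[str, str]], db: sqlite3.Connection) -> int:
    """Run the three stages, passing data between them as files."""
    with tempfile.TemporaryDirectory() as tmp:
        json_path = Path(tmp) / "metadata.json"
        csv_path = Path(tmp) / "metadata.csv"
        extract(records, json_path)
        transform(json_path, csv_path)
        return load(csv_path, db)


# Usage: the second identical run writes nothing because the diff is empty.
db = sqlite3.connect(":memory:")
sample = [{"urn": "hdfs:///data/a", "schema_text": "id:int", "extra": "ignored"}]
first_run = run_etl(sample, db)
second_run = run_etl(sample, db)
```

Passing files between stages keeps each stage independently rerunnable, which is one plausible reason for the JSON and CSV hand-offs shown on the slide.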
Metadata Kafka Event (In Development)
● MetadataChangeEvent - Both delta & current snapshot of a dataset
● MetadataInventoryEvent - Periodic lightweight event for re-synchronization
● MetadataLineageEvent - For operation lineage
[Diagram: Data platform, Data processor → Metadata Events → Kafka → WhereHows]
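To make the roles of the change and inventory events concrete, here is a hedged sketch of a consumer applying them to an in-memory store. Only the event names come from the slide; every field name, the `MetadataStore` class, and the re-synchronization behavior shown are assumptions for illustration, since the events were still in development at the time of the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class MetadataChangeEvent:
    urn: str
    delta: Dict[str, str]     # hypothetical: only the fields that changed
    snapshot: Dict[str, str]  # hypothetical: full current state of the dataset


@dataclass
class MetadataInventoryEvent:
    platform: str
    urns: List[str]           # hypothetical: all datasets the platform reports


@dataclass
class MetadataStore:
    datasets: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def on_change(self, event: MetadataChangeEvent) -> None:
        if event.urn in self.datasets:
            # Known dataset: applying the delta is enough.
            self.datasets[event.urn].update(event.delta)
        else:
            # First sighting or a missed event: fall back to the snapshot.
            self.datasets[event.urn] = dict(event.snapshot)

    def on_inventory(self, event: MetadataInventoryEvent) -> Set[str]:
        """Drop datasets the platform no longer reports; return what was dropped."""
        prefix = event.platform + ":///"
        known = {urn for urn in self.datasets if urn.startswith(prefix)}
        stale = known - set(event.urns)
        for urn in stale:
            del self.datasets[urn]
        return stale


# Usage with a hypothetical Kafka topic.
store = MetadataStore()
urn = "kafka:///PageViewEvent"
store.on_change(MetadataChangeEvent(urn, {"owner": "a"}, {"owner": "a", "schema": "v1"}))
store.on_change(MetadataChangeEvent(urn, {"schema": "v2"}, {"owner": "a", "schema": "v2"}))
```

Carrying both the delta and the snapshot lets a consumer that missed earlier events recover from a single message, while the periodic inventory event catches deletions that never produced a change event.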
Active Work @ LinkedIn
● Product Experience
○ Improve search relevance
● Compliance: GDPR requirements
○ Fine-grained metadata acquisition across all data platforms
○ Purge specifications for datasets (actual deletion driven through Gobblin)
● Better Metadata
○ Column-level lineage using static analysis of Pig, Hive, Spark, Samza SQL, TD scripts
● Big Metadata
○ Support a wide range of storage backends for scale-out, specialized access patterns
■ Ground, Neo4j, LinkedIn’s GraphDB, NoSQL, REST services etc.
● Tech Improvement Items
○ Easier authoring/sharding/monitoring of metadata ETL using Gobblin
Feature Roadmap
● Product Experience
○ Better lineage visualization
○ Richer social collaboration
● Developer Happiness
○ Simplify build system & deployment
○ Admin API for ETL job management
○ Replace VM with Docker image
The Team
Abhishek Agrawal (Eng Mgr)
Tushar Shanbhag (Product)
Nicole Li (Project Mgr)
Wen Cui (Design)
Eric Sun, Mars Lan, Na Zhang, Yi Wang, Seyi Adebajo (Engineering)
Thank You!
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms

  • 1. WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms; Mars Lan, The WhereHows Team; Apr 26, 2017; Big Data Meetup @ LinkedIn; Github: github.com/linkedin/WhereHows; Gitter: gitter.im/wherehows; Google Groups: WhereHows
  • 2. Agenda ● LinkedIn’s Data Ecosystem ● The Metadata problem ● WhereHows: Architecture and Details ● Future Evolution
  • 3. What is LinkedIn? Mission: Connect the world’s professionals to make them more productive and successful
  • 4. LinkedIn Data Ecosystem (diagram labels): LinkedIn.com: Desktop, Mobile apps; Services (Prod + Corp); Logs, Events, Messages; Hadoop; Streaming; CDC; Kafka; Databases (Espresso, MySQL, Oracle); Samza; Teradata; Data standardization, Reporting, ML; Data standardization, Reporting; Derived Data Stores, Indexes (Pinot, Search, Voldemort, Venice, Graph, MySQL); Snapshots, incremental dumps; Reads; Reads, Writes; Streaming Ingest; Batch loads; LinkedIn.corp: Internal applications (e.g. dashboards); Employees; Members, Customers
  • 6. Challenges Introduced by Diversity ● Cross Platform ○ Silo-ed and non-interoperable metadata ○ Missing linkage between platforms ● Challenges within Platforms ○ Big data platforms (e.g. Hadoop) encourage sprawl ○ Schema-free systems => inferring structure is hard ○ Multiple processing frameworks => lineage tough
  • 8. WhereHows Open source @ github.com/linkedin/wherehows
  • 15. WhereHows Concepts ● Dataset: A logical collection of data, e.g. Oracle table, Hive View, HDFS directory, Kafka topic ● Process / Flow: A processing workflow that contains one or more jobs ● Lineage: A relationship between datasets deduced from operation data ● Metric: A business metric with additional info on source, formula, dimensions, dashboard, wiki etc. ● Ownership: dev owner, producer, consumer, delegate, stakeholder
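The concepts on this slide can be pictured as a small data model. The following is only a sketch: the class names, the `urn` and `flow` fields, and the `downstream` helper are hypothetical and are not taken from the WhereHows code base. It shows how lineage edges between datasets make a downstream-impact question answerable with a plain graph walk.

```python
from collections import deque
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Dataset:
    """A logical collection of data (hypothetical minimal form)."""
    urn: str  # e.g. an Oracle table, Hive view, HDFS directory, or Kafka topic


@dataclass(frozen=True)
class LineageEdge:
    """A source -> destination relationship deduced from one flow's execution."""
    source: Dataset
    destination: Dataset
    flow: str  # name of the process/flow whose job wrote the destination


def downstream(start: Dataset, edges: Iterable[LineageEdge]) -> List[Dataset]:
    """Breadth-first list of every dataset derived, directly or not, from start."""
    edge_list = list(edges)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        current = queue.popleft()
        for edge in edge_list:
            if edge.source == current and edge.destination not in seen:
                seen.add(edge.destination)
                order.append(edge.destination)
                queue.append(edge.destination)
    return order
```

The `seen` set keeps the walk finite even when the lineage graph contains cycles.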
  • 16. WhereHows Architecture (diagram labels): WH MySQL; WH App (Play + Ember); Metadata Store; Rest.li API; Catalog (Schema): HDFS, Teradata, Oracle, Kafka, Voldemort, Hive, ...; Lineage: Azkaban, Gobblin; Ownership: Git, ownership repository, ...; Elastic Search; Index Builder
  • 17. Catalog - Challenges ● Standardization: Single metadata model that works with all platforms ○ Least-common-denominator vs. leaky abstractions ○ What is a dataset? A Table? A Database? A Metric? ● Extraction: Each data platform stores metadata differently ○ HDFS - files/directories plus schema files ○ TD/Oracle - DBC.Table, ALL_TABLES, etc. ○ Kafka - Topic, Schema registry ● Freshness: Trust erodes with staleness (chart axes: Trust, Freshness)
  • 18. Catalog - Our Approach ● URN-based naming for datasets in all platforms ○ Generalized + specialized metadata models under evolution ● Quick authoring of platform-specific ETL jobs using Jython ● Pull model (extract + transform) and push model (Kafka, REST) both exist
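As an illustration of URN-based naming, here is a minimal sketch. The `platform:///path` layout, the `DatasetUrn` class, and the helper names are assumptions made for this example; the slide does not spell out the exact URN grammar WhereHows uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetUrn:
    """Hypothetical platform-qualified dataset name (layout is an assumption)."""
    platform: str  # e.g. "hdfs", "hive", "kafka"
    path: str      # platform-specific location, "/"-separated

    def __str__(self) -> str:
        return f"{self.platform}:///{self.path}"


def make_urn(platform: str, *parts: str) -> DatasetUrn:
    """Build a URN from a platform name and its native name components."""
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    if not platform or not cleaned:
        raise ValueError("platform and at least one name part are required")
    return DatasetUrn(platform.lower(), "/".join(cleaned))


def parse_urn(urn: str) -> DatasetUrn:
    """Split a URN string back into platform and path."""
    platform, sep, path = urn.partition(":///")
    if not sep or not platform or not path:
        raise ValueError(f"not a dataset URN: {urn!r}")
    return DatasetUrn(platform, path)
```

With one naming scheme, a Hive table and an HDFS directory can share the same catalog key type, while the path part stays free to follow each platform's own conventions.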
  • 19. Lineage - Challenges ● Diversity in processing frameworks on Hadoop ● Inferring from code is not trivial - think UDF, external parameters, etc. ● Cross data platform lineage requires mapping all data copies ● Visualization is non-trivial with huge fan-out (chart axes: Pretty, Understandable)
  • 20. Lineage - Our Approach ● Azkaban’s execution logs for intra-Hadoop lineage ○ Hadoop job ID => Job conf from job history node => source + destination pair ● AppWorx execution log ● Gobblin events for into-Hadoop and cross-Hadoop cluster lineage ● Heuristics based on known patterns ● Lineage API, Tabular representation for downstream impact We also have pretty, unreadable lineage graphs :)
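The "job conf => source + destination pair" step can be sketched as follows. The two property names are the standard Hadoop 2 MapReduce input and output directory keys; whether WhereHows reads exactly these keys is an assumption, and `lineage_pairs` and the sample job ID are hypothetical. Fetching the conf from the job history node is left out.

```python
from typing import Dict, List, Tuple

# Standard Hadoop 2 MapReduce property names. That WhereHows relies on exactly
# these keys is an assumption made for this sketch.
INPUT_KEY = "mapreduce.input.fileinputformat.inputdir"
OUTPUT_KEY = "mapreduce.output.fileoutputformat.outputdir"


def lineage_pairs(job_id: str, job_conf: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (job_id, source, destination) triples for one Hadoop job."""
    # The input property may hold several comma-separated paths.
    sources = [p.strip() for p in job_conf.get(INPUT_KEY, "").split(",") if p.strip()]
    destination = job_conf.get(OUTPUT_KEY, "").strip()
    if not sources or not destination:
        return []  # the conf alone is not enough; other heuristics would be needed
    return [(job_id, src, destination) for src in sources]
```

A job with two input directories and one output directory yields two triples, which maps naturally onto a tabular lineage representation.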
  • 21. Anatomy of Metadata ETL ● Extract ○ Gather metadata from source (direct query, crawling file system, log parsing, etc.) ○ Build JSON representation of metadata ○ Dump JSON to file ● Transform ○ Convert JSON objects into CSV conforming to the destination table structure ● Load ○ Load CSV files into table, performing diff if necessary (diagram: Data Platform -> Extract -> JSON -> Transform -> CSV -> Load -> Metadata DB)
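The three stages can be sketched end to end as below. This is not the WhereHows implementation: SQLite stands in for the WH MySQL metadata store, and the `dataset` table, the `COLUMNS` layout, and all function names are hypothetical. The "diff" here is simply a compare-before-write on each row.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path
from typing import Dict, List

COLUMNS = ["urn", "schema_text"]  # hypothetical destination table layout


def extract(source: List[Dict[str, str]], json_path: Path) -> None:
    """Dump source metadata records to a JSON-lines file."""
    with json_path.open("w", encoding="utf-8") as f:
        for record in source:
            f.write(json.dumps(record) + "\n")


def transform(json_path: Path, csv_path: Path) -> None:
    """Convert JSON records into CSV rows matching the destination columns."""
    with json_path.open(encoding="utf-8") as src:
        with csv_path.open("w", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst)
            for line in src:
                record = json.loads(line)
                writer.writerow([record.get(col, "") for col in COLUMNS])


def load(csv_path: Path, conn: sqlite3.Connection) -> int:
    """Upsert CSV rows; return how many rows were new or changed (the diff)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dataset (urn TEXT PRIMARY KEY, schema_text TEXT)"
    )
    changed = 0
    with csv_path.open(newline="", encoding="utf-8") as f:
        for urn, schema_text in csv.reader(f):
            row = conn.execute(
                "SELECT schema_text FROM dataset WHERE urn = ?", (urn,)
            ).fetchone()
            if row is None or row[0] != schema_text:
                conn.execute(
                    "INSERT OR REPLACE INTO dataset (urn, schema_text) VALUES (?, ?)",
                    (urn, schema_text),
                )
                changed += 1
    conn.commit()
    return changed


def run_etl(source: List[Dict[str, str]], conn: sqlite3.Connection) -> int:
    """Run extract, transform, and load through intermediate files."""
    with tempfile.TemporaryDirectory() as workdir:
        json_path = Path(workdir) / "metadata.json"
        csv_path = Path(workdir) / "metadata.csv"
        extract(source, json_path)
        transform(json_path, csv_path)
        return load(csv_path, conn)
```

Running the same source twice should report zero changes on the second pass, which is the property the diff step exists to provide.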
  • 22. Metadata Kafka Event (In Development) ● MetadataChangeEvent - Both delta & current snapshot of a dataset ● MetadataInventoryEvent - Periodic lightweight event for re-synchronization ● MetadataLineageEvent - For operation lineage (diagram labels: Data platform, Data processor, Metadata Events, Kafka, WhereHows)
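Since these events were still in development, their schemas are not given on the slide. The sketch below uses hypothetical field names and a hypothetical in-memory `MetadataStore` to show why a change event might carry both a delta and a snapshot, and how a periodic inventory event could support re-synchronization. No Kafka client is involved; events are passed in as plain objects, and the `platform:///` URN prefix is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class MetadataChangeEvent:
    """Hypothetical shape: carries both the delta and the current snapshot."""
    urn: str
    delta: Dict[str, str]     # only the fields that changed
    snapshot: Dict[str, str]  # full current metadata of the dataset


@dataclass
class MetadataInventoryEvent:
    """Hypothetical shape: lightweight list of datasets that currently exist."""
    platform: str
    urns: List[str]


@dataclass
class MetadataStore:
    datasets: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def on_change(self, event: MetadataChangeEvent) -> None:
        """Apply the delta if the dataset is known, else start from the snapshot."""
        if event.urn in self.datasets:
            self.datasets[event.urn].update(event.delta)
        else:
            self.datasets[event.urn] = dict(event.snapshot)

    def on_inventory(self, event: MetadataInventoryEvent) -> Set[str]:
        """Drop datasets the platform no longer reports; return URNs missing here."""
        prefix = f"{event.platform}:///"
        live = set(event.urns)
        stale = [u for u in self.datasets if u.startswith(prefix) and u not in live]
        for urn in stale:
            del self.datasets[urn]
        return live - set(self.datasets)
```

The snapshot lets a consumer that missed earlier events recover the full state, while the inventory pass catches deletions and gaps that no change event reported.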
  • 23. Active Work @ LinkedIn ● Product Experience ○ Improve search relevance ● Compliance: GDPR requirements ○ Fine-grained metadata acquisition across all data platforms ○ Purge specifications for datasets (actual deletion driven through Gobblin) ● Better Metadata ○ Column-level lineage using static analysis of Pig, Hive, Spark, Samza SQL, TD scripts ● Big Metadata ○ Support a wide range of storage backends for scale-out, specialized access patterns ■ Ground, Neo4j, LinkedIn’s GraphDB, NoSQL, REST services etc. ● Tech Improvement Items ○ Easier authoring/sharding/monitoring of metadata ETL using Gobblin
  • 24. Feature Roadmap ● Product Experience ○ Better lineage visualization ○ Richer social collaboration ● Developer Happiness ○ Simplify build system & deployment ○ Admin API for ETL job management ○ Replace VM with Docker image
  • 25. The Team: Abhishek Agrawal (Eng Mgr), Tushar Shanbhag (Product), Nicole Li (Project Mgr), Wen Cui (Design); Engineering: Eric Sun, Mars Lan, Na Zhang, Yi Wang, Seyi Adebajo