April 17th 2019
Jin Hyuk Chang | @jinhyukchang | Engineer, Lyft
Tao Feng | @feng-tao | Engineer, Lyft
Amundsen: A Data Discovery Platform from Lyft
Agenda
• Data at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
• Demo
• Architecture
• Summary
Data platform users
• Data Modelers
• Analysts
• Data Scientists
• General Managers
• Data Platform Engineers
• Experimenters
• Product Managers
Core Infra high level architecture
[Figure: core infrastructure architecture diagram; the only label recovered from the slide is "Custom apps"]
Data Discovery
Hi! I am a n00b Data Scientist!
• My first project is to analyze and predict Data Council attendance
• Where is the data?
• What does it mean?
Status quo
• Option 1: Phone a friend!
• Option 2: GitHub search
Understand the context
• What does this field mean?
‒ Does attendance data include employees?
‒ Does it include revenue?
• Let me dig in and understand
Explore
SELECT
*
FROM
default.my_table
WHERE ds = '2018-01-01'
LIMIT 100;
Exploring with SELECT * is EVIL
1. Lost productivity for data scientists
2. Increased load on the databases
Data Scientists spend up to 1/3rd of their time in Data Discovery...
• Data discovery
‒ Lack of understanding of what data exists, where it lives, who owns it, who uses it, and how to request access.
Audience for data discovery
Data Discovery - User personas
• Data Modelers
• Analysts
• Data Scientists
• General Managers
• Data Platform Engineers
• Experimenters
• Product Managers
3 Data Scientist personas
Power user
● All info in their head
● Get interrupted a lot due to questions
Noob user
● Lost
● Ask “power users” a lot of questions
Manager
● Dependencies landing on time
● Communicating with stakeholders
Data Discovery answers 3 kinds of questions
Search based
● Where is the table/dashboard for X? What does it contain?
● Does this analysis already exist?
Lineage based
● I am changing a data model, who are the owners and most common users?
● This table’s delivery was delayed today, I want to notify everyone downstream.
Network based
● I want to follow a power user in my team.
● I want to bookmark tables of interest and get a feed of data delays, schema changes, and incidents.
Meet Amundsen
First person to reach the South Pole: Norwegian explorer Roald Amundsen
Landing page optimized for search
Search results ranked on relevance and query activity
How does search work?
Relevance - search for “apple” on Google
[Figure: two example results, labeled "Low relevance" and "High relevance"]
Popularity - search for “apple” on Google
[Figure: two example results, labeled "Low popularity" and "High popularity"]
Striking the balance
Relevance
● Names, Descriptions, Tags, [owners, frequent users]
Popularity
● Querying activity
● Dashboarding
● Different weights for automated vs ad hoc querying
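One way to picture this balance is a single score that blends a text-match signal with weighted usage counts. The sketch below is illustrative only: the `TableStats` fields, the two weights, and the log damping are assumptions made for the example, not Amundsen's actual ranking formula.

```python
import math
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    text_match: float       # hypothetical relevance in [0, 1] from name/description/tag matching
    adhoc_queries: int      # queries issued by people
    automated_queries: int  # queries issued by scheduled jobs


# Assumed weights: ad hoc usage counts for more than automated usage.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1


def popularity(stats: TableStats) -> float:
    """Weighted query volume, log-damped so very busy tables do not swamp relevance."""
    weighted = (ADHOC_WEIGHT * stats.adhoc_queries
                + AUTOMATED_WEIGHT * stats.automated_queries)
    return math.log1p(weighted)


def score(stats: TableStats) -> float:
    """Blend relevance and popularity; a zero text match always scores zero."""
    return stats.text_match * (1.0 + popularity(stats))


def rank(candidates):
    """Return candidates ordered from highest to lowest blended score."""
    return sorted(candidates, key=score, reverse=True)
```

Multiplying rather than adding the two signals is one design choice among several: it keeps a popular but irrelevant table from outranking a relevant one.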
Back to mocks...
Search results ranked on relevance and query activity
Detailed description and metadata about data resources
Data Preview within the tool
Computed stats about column metadata
Disclaimer: these stats are arbitrary.
Built-in user feedback
Demo
Amundsen’s architecture
[Figure: Amundsen architecture diagram. Components shown: metadata sources (Postgres, Hive, Redshift, ..., Presto, GitHub source file); the Databuilder crawler; the Neo4j and Elasticsearch stores; the Metadata Service and Search Service; the Frontend Service; and other microservices (ML Feature Service, Security Service).]
1. Frontend Service
Amundsen table detail page
2. Metadata Service
• A thin proxy layer for interacting with the graph database
‒ Neo4j is currently the default graph backend
‒ Working with the community to support Apache Atlas
• Supports a REST API so other services can push / pull metadata directly
Trade Off #1
Why choose a graph database?
Why Graph database?
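The speaker notes for these slides describe a table-detail lookup as finding one anchor node and then traversing its relationships, with no join. The sketch below mimics that access pattern with a plain in-memory adjacency map. The `Graph` class, the node keys, and the `COLUMN` / `READ_BY` relationship names are made up for illustration; they are not Neo4j's or Amundsen's API.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory property graph: nodes keyed by string, typed directed edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)  # node key -> [(relationship, neighbor key)]

    def add_node(self, key, **props):
        self.nodes[key] = props

    def add_edge(self, src, rel, dst):
        self.edges[src].append((rel, dst))

    def neighbors(self, key, rel):
        return [dst for r, dst in self.edges[key] if r == rel]


def table_detail(graph, table_key):
    """One keyed lookup for the anchor node, then follow edges outward."""
    table = graph.nodes[table_key]
    columns = []
    for col_key in graph.neighbors(table_key, "COLUMN"):
        columns.append({
            "name": graph.nodes[col_key]["name"],
            "readers": graph.neighbors(col_key, "READ_BY"),
        })
    return {"table": table["name"], "columns": columns}


# Hypothetical sample data: one table, two columns, one reader.
g = Graph()
g.add_node("hive://default/rides", name="rides")
g.add_node("hive://default/rides/ride_id", name="ride_id")
g.add_node("hive://default/rides/ds", name="ds")
g.add_node("user/alice", name="alice")
g.add_edge("hive://default/rides", "COLUMN", "hive://default/rides/ride_id")
g.add_edge("hive://default/rides", "COLUMN", "hive://default/rides/ds")
g.add_edge("hive://default/rides/ride_id", "READ_BY", "user/alice")

detail = table_detail(g, "hive://default/rides")
```

In a relational layout the same answer would typically need joins across table, column, and usage tables; here every hop is a direct edge lookup from the anchor node.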
Trade Off #2
Why not propagate the metadata back to source?
3. Search Service
• A thin proxy layer for interacting with the search backend
‒ Elasticsearch is currently the supported search backend
• Supports different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then relevancy
‒ Wildcard Search
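The three search patterns can be pictured as a small dispatch step in front of the search backend. The sketch below is an assumption-laden illustration: the `category: term` and `*` query syntax is borrowed from the examples later in the deck (`column: is_line_ride`, `event_*`), while `ParsedQuery`, `parse_query`, and `wildcard_filter` are hypothetical names, not the Search Service's real interface.

```python
import fnmatch
from typing import NamedTuple, Optional


class ParsedQuery(NamedTuple):
    kind: str                # "normal", "category", or "wildcard"
    category: Optional[str]  # e.g. "column" for a category search
    term: str


def parse_query(raw: str) -> ParsedQuery:
    """Classify a raw search string into one of the three patterns."""
    raw = raw.strip()
    category = None
    term = raw
    if ":" in raw:
        # "column: is_line_ride" -> category "column", term "is_line_ride"
        category, term = (part.strip() for part in raw.split(":", 1))
    if "*" in term:
        kind = "wildcard"
    elif category is not None:
        kind = "category"
    else:
        kind = "normal"
    return ParsedQuery(kind, category, term)


def wildcard_filter(names, pattern):
    """Shell-style wildcard match over candidate names (stand-in for a backend wildcard query)."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]
```

For example, `parse_query("event_*")` is classified as a wildcard search, and `wildcard_filter` then keeps only the names that start with `event_`.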
Challenge #1
How to make the search results more relevant?
• Define a search quality metric
‒ Click-Through-Rate (CTR) over top 5 results
• Search behaviour instrumentation is key
• A couple of improvements:
‒ Boost the ranking of exact table name matches
‒ Support wildcard search (e.g. event_*)
‒ Support category search (e.g. column: is_line_ride)
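The search quality metric above, click-through rate over the top 5 results, can be computed from instrumented search events. This is a minimal sketch assuming each logged search records the 1-based rank of the clicked result (or nothing when the user did not click); `SearchEvent` and `ctr_at_k` are hypothetical names for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchEvent:
    query: str
    clicked_rank: Optional[int]  # 1-based position of the clicked result, None if no click


def ctr_at_k(events, k=5):
    """Fraction of searches whose click landed within the top k results."""
    if not events:
        return 0.0
    hits = sum(
        1 for e in events
        if e.clicked_rank is not None and e.clicked_rank <= k
    )
    return hits / len(events)
```

Searches with no click count against the metric, so empty or irrelevant result pages lower the rate.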
4. Databuilder
Challenge #1
Various forms of metadata
Metadata Sources @ Lyft
Metadata - Challenges
• No standardization: no single data model fits all data resources
‒ A data resource could be a table, an Airflow DAG, or a dashboard
• Different extraction: each data set’s metadata is stored and fetched differently
‒ Hive table: stored in the Hive metastore
‒ RDBMS (Postgres etc.): fetched through the DBAPI interface
‒ GitHub source code: fetched through a git hook
‒ Mode dashboard: fetched through the Mode API
‒ …
Challenge #2
Pull model vs. push model
Pull Model
● Periodically update the index by pulling from the system (e.g. database) via crawlers.
● Diagram components: Scheduler, Crawler, Database, Data graph
Push Model
● The system (e.g. database) pushes metadata to a message bus which downstream subscribes to.
● Diagram components: Database, Message queue, Data graph
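The difference between the two models shows up in who initiates the update. The sketch below is a toy, single-process illustration under stated assumptions: `Database`, `pull_once`, and `drain` are hypothetical names, the data graph is a plain dict, and the standard-library `queue.Queue` stands in for a real message bus.

```python
import queue


class Database:
    """Stand-in metadata source that can also notify subscribers on change."""

    def __init__(self):
        self.tables = {}
        self.subscribers = []  # message queues notified on change (push model only)

    def update(self, table, schema):
        self.tables[table] = schema
        for q in self.subscribers:
            q.put((table, schema))


def pull_once(db, data_graph):
    """Pull model: a scheduler calls this periodically and the crawler reads everything."""
    for table, schema in db.tables.items():
        data_graph[table] = schema


def drain(message_queue, data_graph):
    """Push model: the consumer applies whatever the source has published so far."""
    while True:
        try:
            table, schema = message_queue.get_nowait()
        except queue.Empty:
            break
        data_graph[table] = schema
```

With pull, the data graph is only as fresh as the last scheduled crawl but the source needs no changes; with push, updates arrive as they happen but every source has to publish to the bus.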
4. Databuilder
Databuilder in action
How are we building data? Databuilder
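The speaker notes describe Databuilder in terms of four constructs: an Extractor that yields one record at a time, a Transformer that cleans or enriches each record, a Loader that writes to a sink or staging area, and an optional Publisher that moves staged data to the destination. The sketch below illustrates that shape only; every class and function name here is hypothetical and this is not the real Databuilder API.

```python
import re


class ListExtractor:
    """Yields one record at a time from an in-memory source (stand-in for Hive, Postgres, ...)."""

    def __init__(self, rows):
        self._rows = iter(rows)

    def extract(self):
        return next(self._rows, None)  # None signals the source is exhausted


class CleanDescriptionTransformer:
    """Removes special characters from the description field."""

    def transform(self, record):
        cleaned = re.sub(r"[^\w\s]", "", record["description"])
        return {**record, "description": cleaned}


class StagingLoader:
    """Writes records to a staging area rather than the final sink."""

    def __init__(self):
        self.staged = []

    def load(self, record):
        self.staged.append(record)


class DictPublisher:
    """Copies everything from staging into the destination in one step."""

    def __init__(self, destination):
        self.destination = destination

    def publish(self, staged):
        self.destination.update({(r["table"], r["column"]): r for r in staged})


def run_job(extractor, transformer, loader, publisher):
    """Extract, transform, and load record by record, then publish the staged batch."""
    record = extractor.extract()
    while record is not None:
        loader.load(transformer.transform(record))
        record = extractor.extract()
    publisher.publish(loader.staged)
```

Staging first and publishing in one step is what lets a job aim for all-or-nothing updates, to the extent the destination supports it.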
How is databuilder orchestrated?
Amundsen uses Apache Airflow to orchestrate Databuilder jobs
What’s next?
Amundsen seems to be more useful than we thought
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
• Many organizations have similar problems
‒ Collaborating with ING, WeWork and more
‒ We plan to announce open source soon
Impact - Amundsen at Lyft
[Figure: chart annotated with the Alpha release, the Beta release (internal), and the Generally Available (GA) release]
Adding more kinds of data resources
● Resource types shown: Data sets, Dashboards, People, Streams, Schemas, Workflows
● Phases shown: Phase 1 (Complete), Phase 2 (In development), Phase 3 (In scoping)
Summary
• Data Discovery adds 30+% more productivity to Data Scientists
• Metadata is key to the next wave of big data applications
• Amundsen - Lyft’s metadata and data discovery platform
• Blog post with more details: go.lyft.com/datadiscoveryblog
Jin Hyuk Chang | @jinhyukchang
Tao Feng | @feng-tao
Slides at go.lyft.com/amundsen_datacouncil_2019
Blog post at go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/
Backup
Editor's Notes

  • #3: Today’s agenda: Why empowering with data is important… What are we doing in the data team at Lyft (context)... What challenges we are facing and have seen other companies face… How are we solving the problem... At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
  • #4: Who is our audience: everyone who works at Lyft… Power users: Data Scientists, Research Scientists, Product Managers… Next: Engineers, GMs, Ops, etc.
  • #5: What does the architecture for our core infra look like? Mobile application primarily… Raw events can come either from the client… or from back-end events triggered in the server… the data comes to our message bus… Kinesis/Kafka and then with light ELTing written to S3 where it persists… today we keep all the data in archival… then we develop data models and transform raw events to tables in Hive. We use Hive for long-running queries and Presto for interactive queries… People build dashboards on top of Hive and visualize for exploratory analysis in Presto...
  • #7: Mark
  • #8: Mark
  • #9: Mark
  • #11: Mark
  • #12: Data Discovery: How much of a challenge is it? Significant challenge… Data Scientists spend up to 1/3rd of their time in Data Discovery while doing exploratory analysis… We surveyed users at Lyft and a few other companies: You’d want to spend most of the time on analysis… But we have ~10PBs of data, thousands of tables… so it is hard to find what is there and what is the source of truth… We can significantly increase productivity and impact if we can reduce this time...
  • #14: Mark
  • #21: Mark
  • #23: Popularity is measured not by click-through rate but by query access patterns.
  • #32: Amundsen architecture at Lyft: 3 micro-services (FE, metadata, search) and one generic data ingestion framework. Will discuss each of the components in detail. High-level walkthrough… how CCPA compliance works.
  • #38: ML features (one sentence on what the feature service is). Add a Neo4j logo? Mention GitHub, HMS, and the Neo4j backend for descriptions, and the trade-off: the initial version propagated the metadata back, but we found that this doesn’t work for full-rebuild tables. What if a user modifies the GitHub source file?
  • #40: Why is a graph DB the best option? Why not an RDBMS, NoSQL, etc.? Why choose Neo4j over other graph DBs? (It is the most popular graph DB.) Amundsen needs to provide metadata that includes tables, columns, column statistics, usage information, etc. Along with that, Amundsen also needs to provide lineage information, which means representing producer and consumer relationships across the life cycle of the data. Lineage can be simplified as a graph of entities and edges. E.g. in the graph, blah blah. There are other options: NoSQL (no join support), RDBMS (join performance is not good).
  • #41: The graph data model above shows our use case: table and column metadata with usage down to the column level. Querying a table detail from this Neo4j graph is like asking Neo4j to find a table node as the starting node and traverse from it. In other words, only one search operation is needed, to find the anchor node, which makes Neo4j performant: no join operation at all. We model the graph with bi-directional relationships.
  • #43, #44, #45: Mention GitHub, HMS, and the Neo4j backend for descriptions, and the trade-off: the initial version propagated the metadata back, but we found that this doesn’t work for full-rebuild tables. What if a user modifies the GitHub source file?
  • #47: ? challenge How to improve search relevance?
  • #48: Think of search algorithm… Event_ride_table -> event ride table
  • #50: Talk about how we measure it: instrument empty results from user, % of click through rate What we did? Boost the rank of exact table name match
  • #57: Walk through pros and cons
  • #60: The pull approach extracts data from the source periodically. Amundsen Databuilder is responsible for extraction, transformation, and load. This naturally gives us three abstractions, Extractor, Transformer, and Loader, plus an optional Publisher. The design follows Apache Gobblin. The Extractor extracts records from the source one at a time. For example, in Hive we would need a column metadata extractor for a table, where each record represents a column of the table. The Transformer transforms a record: any use case where we need to transform the record (e.g. remove special characters) or decorate it (e.g. make a service call to enrich the data) belongs here. The Loader writes data into either the sink (destination) or a staging area. The Publisher assumes that the Loader wrote into a staging area and publishes it to the destination. Atomicity is desirable, but it is limited by what the sink itself supports.
  • #66: A slide on Amundsen @ Lyft? ? how long it has been in prod How many datasets Users WAU Usage
  • #67: Kafka topic Schema registry ML workflow and Airflow DAGs
  • #69: Mark