Building Satori: Web Data
Extraction On Hadoop
Nikolai Avteniev
Sr. Staff Software Engineer
LinkedIn
Building Opportunity from the Empire State Building
LinkedIn NYC
The Team
Nikita Lytkin
Staff Software Engineer
Pi-Chuan Chang
Sr. Software Engineer
David Astle
Sr. Software Engineer
Nikolai Avteniev
Sr. Staff Software Engineer
Eran Leshem
Sr. Staff Software Engineer
THE ECONOMIC GRAPH
Connecting talent with opportunity
at massive scale
What we thought we needed
The BIG Idea
Inspired by Hsieh, Jonathan M., Steven D. Gribble, and Henry M. Levy.
"The Architecture and Implementation of an Extensible Web Crawler." NSDI. 2010.
Questions we wanted to answer
Focused our Vision
Who would use this tool?
Do we need to crawl the entire web?
Do we need to process the pages near line?
Where would we store this data?
How would we correct mistakes in the flow?
Identity Team
Virtually All Member Value Relies On Identity Data
Susan Kaplan
Sr. Marketing Manager at Weblo
SEARCH
Research & Contact
AD TARGETING
Market Products
& Services
PMYK
Build Your Network
RECRUITER
Recruit & Hire
FEED
Get Daily News
NETWORK
Keep in Touch
RECOMMENDATIONS
Get a Job/Gig
WVMP
Establish Yourself
as Expert
Identity Use Case
A smarter way to build your profile
• Suggest 1-click profile updates to members
• Using this, we can help members easily fill in profile gaps
& get credit for certificates, patents, publications…
Kafka/Samza Team
• Avg. HTML document is 6K;
37% < 10K [1]
• Samza can handle 1.2M
messages per second per node [2]
• Data retention is limited to
between 7 and 30 days
• Most of the data is filtered out
• Need to bootstrap Samza
stores
Not a perfect fit
1. HTML Document Transfer size. http://httparchive.org/interesting.php?a=All&l=Oct%2015%202015#bytesHtmlDoc
2. Feng, Tao. "Benchmarking Apache Samza: 1.2 million messages per second on a single node."
https://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node
Help 400M members fully realize
their professional identity on
LinkedIn.
Find sources of professional
content on the public internet.
Fetch the content, extract
structured data and match it to
member profiles
The Project: Satori
Web Data Extraction HOW TO:
• Enterprise VS Social Web
use cases
• Web Sources
• Wrappers
Web Data Extraction System
3. Ferrara, Emilio, et al. "Web data extraction, applications and techniques: A survey." Knowledge-Based Systems 70
(2014): 301-323.
What is a Wrapper?
Induce wrappers based on data [4]
Build wrappers that are robust. [5]
Cluster similar pages by URL [6]
The web is huge and there are
interesting things in the long tail [7]
Industrial Web Data Extraction
4. Dalvi, Nilesh, Ravi Kumar, and Mohamed Soliman. "Automatic wrappers for large scale web extraction." Proceedings of the VLDB
Endowment 4.4 (2011): 219-230.
5. Dalvi, Nilesh, Philip Bohannon, and Fei Sha. "Robust web extraction: an approach based on a probabilistic tree-edit model." Proceedings of
the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.
6. Blanco, Lorenzo, Nilesh Dalvi, and Ashwin Machanavajjhala. "Highly efficient algorithms for structural clustering of large websites."
Proceedings of the 20th international conference on World wide web. ACM, 2011.
7. Dalvi, Nilesh, Ashwin Machanavajjhala, and Bo Pang. "An analysis of structured data on the web." Proceedings of the VLDB Endowment
5.7 (2012): 680-691.
Picking a Crawler
HERITRIX powers archive.org
NUTCH powers common crawl
BUbiNG part of LAW
Scrapy used within LinkedIn
The Contestants
8. Olston, C., and M. Najork. "Web crawling." Foundations and Trends in Information Retrieval, 2010.
9. Dan, A., and K. Michele. "An Introduction to Heritrix: An Open Source Archival Quality Web Crawler." 2004.
10. Boldi, P., A. Marino, M. Santini, and S. Vigna. "BUbiNG: massive crawling for the masses." 2014.
11. Khare, R., D. Cutting, K. Sitaker, and A. Rifkin. "Nutch: A Flexible and Scalable Open-Source Web Search Engine."
CommerceNet Labs, CN-TR-04-04, November 2004.
And the winner is …
Satori
• Built on Nutch 1.9
• Runs on Hadoop 2.3
• Scheduled to run every 5
hours
• Respects robots.txt
• Default crawl delay of 5
seconds
Crawl Flow
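The politeness rules on this slide (respect robots.txt, default crawl delay of 5 seconds) can be illustrated outside of Nutch. The following is a minimal sketch using only the Python standard library, not the Nutch configuration Satori used; the agent name `example-crawler` and the helper names are assumptions made for illustration.

```python
import urllib.robotparser

DEFAULT_CRAWL_DELAY = 5.0       # seconds; the default stated on the slide
USER_AGENT = "example-crawler"  # hypothetical agent name


def load_rules(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse the text of a robots.txt file that has already been fetched."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: urllib.robotparser.RobotFileParser, url: str) -> bool:
    """True if robots.txt allows our agent to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


def delay_for(parser: urllib.robotparser.RobotFileParser) -> float:
    """Use the site's Crawl-delay if it declares one, else the default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else DEFAULT_CRAWL_DELAY
```

A fetch loop would call `may_fetch` before every request and sleep for `delay_for` seconds between requests to the same host.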
• Output into target schema
• Apply XPATH wrappers
• Wrappers are hierarchical
mapping of Schema field to
XPath expression
• Indexed by data domain and
data source
Extract Flow
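One way to picture a wrapper as described above is a nested mapping from schema field to XPath expression, looked up by data domain and data source. The sketch below is an assumption-laden illustration, not Satori's wrapper format: the registry layout, the field names, the site `example-journal.org` and the sample page are all hypothetical. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup; real pages would need an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper registry, indexed by (data domain, data source).
# A string is a single-valued field, a one-element list is a repeated
# field, and a nested dict is a sub-record of the target schema.
WRAPPERS = {
    ("publication", "example-journal.org"): {
        "title": ".//h1[@class='title']",
        "authors": [".//ul[@class='authors']/li"],
        "venue": {
            "name": ".//span[@class='venue']",
            "year": ".//span[@class='year']",
        },
    }
}


def _text(element):
    """Concatenate all text under an element and trim whitespace."""
    return "".join(element.itertext()).strip()


def apply_wrapper(node, wrapper):
    """Walk the hierarchical wrapper and build a record in the target schema."""
    record = {}
    for field, rule in wrapper.items():
        if isinstance(rule, dict):
            record[field] = apply_wrapper(node, rule)
        elif isinstance(rule, list):
            record[field] = [_text(e) for e in node.findall(rule[0])]
        else:
            found = node.find(rule)
            record[field] = _text(found) if found is not None else None
    return record


def extract(domain, source, page):
    """Select the wrapper for this domain and source and apply it to a page."""
    return apply_wrapper(ET.fromstring(page), WRAPPERS[(domain, source)])


PAGE = """<html><body>
  <h1 class="title">Web Data Extraction</h1>
  <ul class="authors"><li>A. Author</li><li>B. Writer</li></ul>
  <span class="venue">Example Journal</span> <span class="year">2014</span>
</body></html>"""
```

Keeping wrappers as data rather than code means a broken XPath can be corrected per source without redeploying the extraction job.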
Crawl rate is bound by the
number of sites and the site
crawl delay
Common Crawl: a great source
https://commoncrawl.org/
Gobblin: a great ingestion
framework
https://github.com/linkedin/gobblin
Bootstrap From Bulk Sources
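The crawl-rate bound stated on this slide can be worked through with simple arithmetic. The sketch below assumes one request at a time per site and a uniform crawl delay; the function names are hypothetical and the 5-second default is the one given for Satori's crawl flow.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_pages_per_day(num_sites: int, crawl_delay_s: float = 5.0) -> int:
    """Upper bound on pages fetched per day with one request in flight per site."""
    return int(num_sites * SECONDS_PER_DAY // crawl_delay_s)


def days_to_crawl_site(num_pages: int, crawl_delay_s: float = 5.0) -> float:
    """Minimum time, in days, to fetch every page of a single site."""
    return num_pages * crawl_delay_s / SECONDS_PER_DAY


print(max_pages_per_day(1))      # 17280 pages per site per day
print(max_pages_per_day(100))    # 1728000 pages per day across 100 sites
print(round(days_to_crawl_site(1_000_000), 2))  # 57.87 days for one large site
```

Under these assumptions a single site with a million pages takes almost two months to fetch politely, which is why bootstrapping from a bulk source is attractive.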
XPath extractors can be
challenging on sites with rich
data
It is easy to exceed the Hadoop
quota
Match[in]
Matching authors and publications to members
to power profile edit experiences
Overview
Match using global identifiers,
email or full name.
The data might not be clean
after extraction
Start with a small set of data and
get it to the users quickly
Start Simple
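A simple first-pass matcher along these lines might try exact keys in order of reliability and normalize the extracted values first, since the data might not be clean. The sketch below is an illustration under assumptions, not the Match[in] implementation: the record fields (`global_id`, `email`, `name`, `id`) and the rule that an ambiguous name yields no match are hypothetical choices.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Strip accents, punctuation, case and extra spaces from a full name."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return " ".join(re.sub(r"[^a-z ]", " ", ascii_name.lower()).split())


def normalize_email(email: str) -> str:
    return email.strip().lower()


def build_index(members):
    """Index member records by global identifier, email and normalized name."""
    by_id, by_email, by_name = {}, {}, defaultdict(list)
    for m in members:
        if m.get("global_id"):
            by_id[m["global_id"]] = m["id"]
        if m.get("email"):
            by_email[normalize_email(m["email"])] = m["id"]
        by_name[normalize_name(m["name"])].append(m["id"])
    return by_id, by_email, by_name


def match_author(author, index):
    """Return (member id, rule used); the most reliable key wins."""
    by_id, by_email, by_name = index
    if author.get("global_id") in by_id:
        return by_id[author["global_id"]], "global_id"
    email = author.get("email")
    if email and normalize_email(email) in by_email:
        return by_email[normalize_email(email)], "email"
    candidates = by_name.get(normalize_name(author["name"]), [])
    if len(candidates) == 1:  # an ambiguous name is not treated as a match
        return candidates[0], "full_name"
    return None, "no_match"
```

Returning the rule that fired alongside the match makes it easier to track precision per rule and to correct mistakes later in the flow.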
Narrow the candidates with
LSH[1]
Use the simple model to
generate the ground truth
Train using a simple algorithm
and a few hundred features
Keep It Simple
1. https://en.wikipedia.org/wiki/Locality-sensitive_hashing
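To illustrate how LSH narrows the candidate set before a model scores each pair, here is a minimal sketch of one common variant, MinHash signatures with banding over character shingles. The slides do not say which LSH family was used, so the variant, the parameters (`NUM_HASHES`, `BANDS`, shingle size) and all function names are assumptions.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32  # length of each MinHash signature
BANDS = 16       # signature is split into 16 bands of 2 rows each
ROWS = NUM_HASHES // BANDS


def shingles(text, k=3):
    """Set of character k-grams of the lower-cased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token for a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(text):
    """MinHash signature: the minimum hash of the shingle set per seed."""
    tokens = shingles(text)
    return [min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES)]


def _band_keys(signature):
    return [(b, tuple(signature[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)]


def build_lsh(items):
    """Bucket every item id under each band of its signature."""
    buckets = defaultdict(set)
    for item_id, text in items.items():
        for key in _band_keys(minhash(text)):
            buckets[key].add(item_id)
    return buckets


def candidates(text, buckets):
    """Ids sharing at least one band with the query; only these get scored."""
    found = set()
    for key in _band_keys(minhash(text)):
        found |= buckets.get(key, set())
    return found
```

More bands with fewer rows each raise recall at the cost of more candidates; the right balance depends on how expensive the downstream model is.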
Current Status
[Chart: Extractor Objects, series Total and Processed, for Publications and Companies; values shown: 5.3, 2.3, 3.9, 0.6]
[Chart: Crawler Objects, series Unfetched, Fetched and Gone, for Publication and Company; values shown: 56, 2, 5.6, 2.5, 1.2, 0.1]
Target a data source which has
data that will be easy to fetch,
extract and match.
Add tracking to the entire flow
Do it all offline if you can
Get the product to the
customers early to validate the
process and value proposition
Most important of all: write it all
down and share it with everyone

©2014 LinkedIn Corporation. All Rights Reserved.
Ad

More Related Content

What's hot (20)

Schema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & HowSchema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & How
Richard Wallis
 
SIOC: Semantic Web for Social Media Sites
SIOC: Semantic Web for Social Media SitesSIOC: Semantic Web for Social Media Sites
SIOC: Semantic Web for Social Media Sites
Uldis Bojars
 
Oas schwartz OA Summit
Oas schwartz OA SummitOas schwartz OA Summit
Oas schwartz OA Summit
Open Analytics
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With Hadoop
Peter Skomoroch
 
The open semantic enterprise enterprise data meets web data
The open semantic enterprise   enterprise data meets web dataThe open semantic enterprise   enterprise data meets web data
The open semantic enterprise enterprise data meets web data
Georg Guentner
 
Life after MARC: Cataloging Tools of the Future
Life after MARC: Cataloging Tools of the FutureLife after MARC: Cataloging Tools of the Future
Life after MARC: Cataloging Tools of the Future
Emily Nimsakont
 
Linked Data Book: DC Semantic Web Meetup 20130129
Linked Data Book: DC Semantic Web Meetup 20130129Linked Data Book: DC Semantic Web Meetup 20130129
Linked Data Book: DC Semantic Web Meetup 20130129
3 Round Stones
 
1st Birmingham Big Data Science Group meetup
1st Birmingham Big Data Science Group meetup 1st Birmingham Big Data Science Group meetup
1st Birmingham Big Data Science Group meetup
Faizan Javed
 
Rank | Analyse | Lead | Search
Rank | Analyse | Lead | SearchRank | Analyse | Lead | Search
Rank | Analyse | Lead | Search
sopekmir
 
Conclusions - Linked Data
Conclusions - Linked DataConclusions - Linked Data
Conclusions - Linked Data
Juan Sequeda
 
Toogdag 2017
Toogdag 2017Toogdag 2017
Toogdag 2017
Richard Zijdeman
 
IRMS 2018 - Looking to the future to preserver the past
IRMS 2018 - Looking to the future to preserver the pastIRMS 2018 - Looking to the future to preserver the past
IRMS 2018 - Looking to the future to preserver the past
Randy Perkins-Smart
 
Presentation at Google Day on Big Data
Presentation at Google Day on Big DataPresentation at Google Day on Big Data
Presentation at Google Day on Big Data
Rezaur Rahman
 
FIBO & Schema.org
FIBO & Schema.orgFIBO & Schema.org
FIBO & Schema.org
Richard Wallis
 
Enterprise Data World 2016 | FIBO extension to Schema.org | FIBO SEO | Christ...
Enterprise Data World 2016 | FIBO extension to Schema.org | FIBO SEO | Christ...Enterprise Data World 2016 | FIBO extension to Schema.org | FIBO SEO | Christ...
Enterprise Data World 2016 | FIBO extension to Schema.org | FIBO SEO | Christ...
Christopher Regan
 
Schema.org where did that come from?
Schema.org where did that come from?Schema.org where did that come from?
Schema.org where did that come from?
Richard Wallis
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
eXascale Infolab
 
Contextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of EntitiesContextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of Entities
Richard Wallis
 
Structured Data for the Financial Industry
Structured Data for the Financial Industry Structured Data for the Financial Industry
Structured Data for the Financial Industry
sopekmir
 
Knowledge Architecture: Graphing Your Knowledge
Knowledge Architecture: Graphing Your KnowledgeKnowledge Architecture: Graphing Your Knowledge
Knowledge Architecture: Graphing Your Knowledge
Neo4j
 
Schema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & HowSchema.org Structured data the What, Why, & How
Schema.org Structured data the What, Why, & How
Richard Wallis
 
SIOC: Semantic Web for Social Media Sites
SIOC: Semantic Web for Social Media SitesSIOC: Semantic Web for Social Media Sites
SIOC: Semantic Web for Social Media Sites
Uldis Bojars
 
Oas schwartz OA Summit
Oas schwartz OA SummitOas schwartz OA Summit
Oas schwartz OA Summit
Open Analytics
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With Hadoop
Peter Skomoroch
 
The open semantic enterprise enterprise data meets web data
The open semantic enterprise   enterprise data meets web dataThe open semantic enterprise   enterprise data meets web data
The open semantic enterprise enterprise data meets web data
Georg Guentner
 
Life after MARC: Cataloging Tools of the Future
Life after MARC: Cataloging Tools of the FutureLife after MARC: Cataloging Tools of the Future
Life after MARC: Cataloging Tools of the Future
Emily Nimsakont
 
Linked Data Book: DC Semantic Web Meetup 20130129
Linked Data Book: DC Semantic Web Meetup 20130129Linked Data Book: DC Semantic Web Meetup 20130129
Linked Data Book: DC Semantic Web Meetup 20130129
3 Round Stones
 
1st Birmingham Big Data Science Group meetup
1st Birmingham Big Data Science Group meetup 1st Birmingham Big Data Science Group meetup
1st Birmingham Big Data Science Group meetup
Faizan Javed
 
Rank | Analyse | Lead | Search
Rank | Analyse | Lead | SearchRank | Analyse | Lead | Search
Rank | Analyse | Lead | Search
sopekmir
 
Conclusions - Linked Data
Conclusions - Linked DataConclusions - Linked Data
Conclusions - Linked Data
Juan Sequeda
 
IRMS 2018 - Looking to the future to preserver the past
IRMS 2018 - Looking to the future to preserver the pastIRMS 2018 - Looking to the future to preserver the past
IRMS 2018 - Looking to the future to preserver the past
Randy Perkins-Smart
 
Presentation at Google Day on Big Data
Presentation at Google Day on Big DataPresentation at Google Day on Big Data
Presentation at Google Day on Big Data
Rezaur Rahman
 
Enterprise Data World 2016 | FIBO extension to Schema.org | FIBO SEO | Christ...
Enterprise Data World 2016 | FIBO extension to Schema.org | FIBO SEO | Christ...Enterprise Data World 2016 | FIBO extension to Schema.org | FIBO SEO | Christ...
Enterprise Data World 2016 | FIBO extension to Schema.org | FIBO SEO | Christ...
Christopher Regan
 
Schema.org where did that come from?
Schema.org where did that come from?Schema.org where did that come from?
Schema.org where did that come from?
Richard Wallis
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
eXascale Infolab
 
Contextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of EntitiesContextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of Entities
Richard Wallis
 
Structured Data for the Financial Industry
Structured Data for the Financial Industry Structured Data for the Financial Industry
Structured Data for the Financial Industry
sopekmir
 
Knowledge Architecture: Graphing Your Knowledge
Knowledge Architecture: Graphing Your KnowledgeKnowledge Architecture: Graphing Your Knowledge
Knowledge Architecture: Graphing Your Knowledge
Neo4j
 

Viewers also liked (15)

Kemiskinan dan kesenjangan pendapatan
Kemiskinan dan kesenjangan pendapatanKemiskinan dan kesenjangan pendapatan
Kemiskinan dan kesenjangan pendapatan
EnengNs
 
Industrialisasi dan pertembangan
Industrialisasi dan pertembanganIndustrialisasi dan pertembangan
Industrialisasi dan pertembangan
EnengNs
 
에너지자립마을 이야기11. 행복마을 전농 래미안아름숲
에너지자립마을 이야기11. 행복마을 전농 래미안아름숲에너지자립마을 이야기11. 행복마을 전농 래미안아름숲
에너지자립마을 이야기11. 행복마을 전농 래미안아름숲
Seoul Energy Self-sufficient Villages
 
البحث في مصادر المعلومات الالكترونية
البحث في مصادر المعلومات الالكترونيةالبحث في مصادر المعلومات الالكترونية
البحث في مصادر المعلومات الالكترونية
Beni-Suef University
 
i-Go Lite Travel Trailer Features and Benefits
i-Go Lite Travel Trailer Features and Benefitsi-Go Lite Travel Trailer Features and Benefits
i-Go Lite Travel Trailer Features and Benefits
Cean Burgeson
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
Julien Le Dem
 
Karya ilmiah PKN
Karya ilmiah PKNKarya ilmiah PKN
Karya ilmiah PKN
Rahma Kusuma
 
Gambaran umum perekonomian indonesia
Gambaran umum perekonomian indonesiaGambaran umum perekonomian indonesia
Gambaran umum perekonomian indonesia
MUHAMAD ZAKY MUJAHID
 
Dernière évolution du projet de démateialisation des procédures du commerce e...
Dernière évolution du projet de démateialisation des procédures du commerce e...Dernière évolution du projet de démateialisation des procédures du commerce e...
Dernière évolution du projet de démateialisation des procédures du commerce e...
AAEC_AFRICAN
 
Usaha kecil dan menengah
Usaha kecil dan menengahUsaha kecil dan menengah
Usaha kecil dan menengah
EnengNs
 
Nishant_Patnaik
Nishant_PatnaikNishant_Patnaik
Nishant_Patnaik
Nishant Patnaik
 
Fruhling, Sommer
Fruhling, SommerFruhling, Sommer
Fruhling, Sommer
vierah
 
Gobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for HadoopGobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for Hadoop
Yinan Li
 
Jadual berkala unsur
Jadual berkala unsurJadual berkala unsur
Jadual berkala unsur
Cikgu Marzuqi
 
Weihnachtstraditionen in der slowakei
Weihnachtstraditionen in der slowakeiWeihnachtstraditionen in der slowakei
Weihnachtstraditionen in der slowakei
16monika
 
Kemiskinan dan kesenjangan pendapatan
Kemiskinan dan kesenjangan pendapatanKemiskinan dan kesenjangan pendapatan
Kemiskinan dan kesenjangan pendapatan
EnengNs
 
Industrialisasi dan pertembangan
Industrialisasi dan pertembanganIndustrialisasi dan pertembangan
Industrialisasi dan pertembangan
EnengNs
 
에너지자립마을 이야기11. 행복마을 전농 래미안아름숲
에너지자립마을 이야기11. 행복마을 전농 래미안아름숲에너지자립마을 이야기11. 행복마을 전농 래미안아름숲
에너지자립마을 이야기11. 행복마을 전농 래미안아름숲
Seoul Energy Self-sufficient Villages
 
البحث في مصادر المعلومات الالكترونية
البحث في مصادر المعلومات الالكترونيةالبحث في مصادر المعلومات الالكترونية
البحث في مصادر المعلومات الالكترونية
Beni-Suef University
 
i-Go Lite Travel Trailer Features and Benefits
i-Go Lite Travel Trailer Features and Benefitsi-Go Lite Travel Trailer Features and Benefits
i-Go Lite Travel Trailer Features and Benefits
Cean Burgeson
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
Julien Le Dem
 
Gambaran umum perekonomian indonesia
Gambaran umum perekonomian indonesiaGambaran umum perekonomian indonesia
Gambaran umum perekonomian indonesia
MUHAMAD ZAKY MUJAHID
 
Dernière évolution du projet de démateialisation des procédures du commerce e...
Dernière évolution du projet de démateialisation des procédures du commerce e...Dernière évolution du projet de démateialisation des procédures du commerce e...
Dernière évolution du projet de démateialisation des procédures du commerce e...
AAEC_AFRICAN
 
Usaha kecil dan menengah
Usaha kecil dan menengahUsaha kecil dan menengah
Usaha kecil dan menengah
EnengNs
 
Fruhling, Sommer
Fruhling, SommerFruhling, Sommer
Fruhling, Sommer
vierah
 
Gobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for HadoopGobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for Hadoop
Yinan Li
 
Jadual berkala unsur
Jadual berkala unsurJadual berkala unsur
Jadual berkala unsur
Cikgu Marzuqi
 
Weihnachtstraditionen in der slowakei
Weihnachtstraditionen in der slowakeiWeihnachtstraditionen in der slowakei
Weihnachtstraditionen in der slowakei
16monika
 
Ad

Similar to DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn (20)

SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SK...
SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SK...SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SK...
SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SK...
yatakonakiran2
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
Jesse Wang
 
Skb web2.0
Skb web2.0Skb web2.0
Skb web2.0
animove
 
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudBuilding Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Peter Haase
 
Introduction to APIs and Linked Data
Introduction to APIs and Linked DataIntroduction to APIs and Linked Data
Introduction to APIs and Linked Data
Adrian Stevenson
 
Web Mining
Web MiningWeb Mining
Web Mining
Shobha Rani
 
Pf3426712675
Pf3426712675Pf3426712675
Pf3426712675
IJERA Editor
 
Linked Energy Data Generation
Linked Energy Data GenerationLinked Energy Data Generation
Linked Energy Data Generation
Filip Radulovic
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
Tao Feng
 
APIs in Enterprise
APIs in EnterpriseAPIs in Enterprise
APIs in Enterprise
Victor Olex
 
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web CrawlerIRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET Journal
 
Linked Services for the Web of Data
Linked Services for the Web of DataLinked Services for the Web of Data
Linked Services for the Web of Data
Carlos Pedrinaci
 
Alamw15 VIVO
Alamw15 VIVOAlamw15 VIVO
Alamw15 VIVO
Kristi Holmes
 
Semantic Web For Dummies
Semantic Web For DummiesSemantic Web For Dummies
Semantic Web For Dummies
Jeffrey T. Pollock
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
markgrover
 
Towards Semantic APIs for Research Data Services (Invited Talk)
Towards Semantic APIs for Research Data Services (Invited Talk)Towards Semantic APIs for Research Data Services (Invited Talk)
Towards Semantic APIs for Research Data Services (Invited Talk)
Anna Fensel
 
Linked Data for the Masses: The approach and the Software
Linked Data for the Masses: The approach and the SoftwareLinked Data for the Masses: The approach and the Software
Linked Data for the Masses: The approach and the Software
IMC Technologies
 
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bhaskar Ghosh
 
Integrate All The Things WS02Con
Integrate All The Things WS02ConIntegrate All The Things WS02Con
Integrate All The Things WS02Con
James Governor
 
Semantics and Machine Learning
Semantics and Machine LearningSemantics and Machine Learning
Semantics and Machine Learning
Vladimir Alexiev, PhD, PMP
 
SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SK...
SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SK...SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SK...
SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SKB-Web2.0.ppt SK...
yatakonakiran2
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
Jesse Wang
 
Skb web2.0
Skb web2.0Skb web2.0
Skb web2.0
animove
 
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the CloudBuilding Enterprise-Ready Knowledge Graph Applications in the Cloud
Building Enterprise-Ready Knowledge Graph Applications in the Cloud
Peter Haase
 
Introduction to APIs and Linked Data
Introduction to APIs and Linked DataIntroduction to APIs and Linked Data
Introduction to APIs and Linked Data
Adrian Stevenson
 
Linked Energy Data Generation
Linked Energy Data GenerationLinked Energy Data Generation
Linked Energy Data Generation
Filip Radulovic
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
Tao Feng
 
APIs in Enterprise
APIs in EnterpriseAPIs in Enterprise
APIs in Enterprise
Victor Olex
 
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web CrawlerIRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET Journal
 
Linked Services for the Web of Data
Linked Services for the Web of DataLinked Services for the Web of Data
Linked Services for the Web of Data
Carlos Pedrinaci
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
markgrover
 
Towards Semantic APIs for Research Data Services (Invited Talk)
Towards Semantic APIs for Research Data Services (Invited Talk)Towards Semantic APIs for Research Data Services (Invited Talk)
Towards Semantic APIs for Research Data Services (Invited Talk)
Anna Fensel
 
Linked Data for the Masses: The approach and the Software
Linked Data for the Masses: The approach and the SoftwareLinked Data for the Masses: The approach and the Software
Linked Data for the Masses: The approach and the Software
IMC Technologies
 
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bhaskar Ghosh
 
Integrate All The Things WS02Con
Integrate All The Things WS02ConIntegrate All The Things WS02Con
Integrate All The Things WS02Con
James Governor
 
Ad

More from Hakka Labs (20)

Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)
Hakka Labs
 
DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series search
Hakka Labs
 
DataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceDataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data Science
Hakka Labs
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at Instacart
Hakka Labs
 
DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleDataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scale
Hakka Labs
 
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataDataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
Hakka Labs
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale
Hakka Labs
 
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQDataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
Hakka Labs
 
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
Hakka Labs
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...
Hakka Labs
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
Hakka Labs
 
DataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringDataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineering
Hakka Labs
 
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresDataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data Structures
Hakka Labs
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
Hakka Labs
 
DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesDataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with Ourselves
Hakka Labs
 
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityDataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
Hakka Labs
 
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
Hakka Labs
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
Hakka Labs
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
Hakka Labs
 
Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)
Hakka Labs
 
DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series search
Hakka Labs
 
DataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceDataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data Science
Hakka Labs
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at Instacart
Hakka Labs
 
DataEngConf SF16 - Running simulations at scale

DataEngConf: Building Satori, a Hadoop tool for Data Extraction at LinkedIn

  • 1. Building Satori: Web Data Extraction On Hadoop Nikolai Avteniev Sr. Staff Software Engineer LinkedIn
  • 2. LinkedIn NYC: Building Opportunity from the Empire State Building
  • 3. The Team: Nikita Lytkin (Staff Software Engineer), Pi-Chuan Chang (Sr. Software Engineer), David Astle (Sr. Software Engineer), Nikolai Avteniev (Sr. Staff Software Engineer), Eran Leshem (Sr. Staff Software Engineer)
  • 5. Connecting talent with opportunity at massive scale
  • 6. The BIG Idea: What we thought we needed. Inspired by Hsieh, Jonathan M., Steven D. Gribble, and Henry M. Levy. "The Architecture and Implementation of an Extensible Web Crawler." NSDI. 2010.
  • 7. Focused our Vision: Questions we wanted to answer. Who would use this tool? Do we need to crawl the entire web? Do we need to process the pages near line? Where would we store this data? How would we correct mistakes in the flow?
  • 9. Virtually All Member Value Relies On Identity Data (Susan Kaplan, Sr. Marketing Manager at Weblo). SEARCH: Research & Contact. AD TARGETING: Market Products & Services. PMYK: Build Your Network. RECRUITER: Recruit & Hire. FEED: Get Daily News. NETWORK: Keep in Touch. RECOMMENDATIONS: Get a Job/Gig. WVMP: Establish Yourself as Expert.
  • 10. Identity Use Case: A smarter way to build your profile. Suggest 1-click profile updates to members. Using this, we can help members easily fill in profile gaps and get credit for certificates, patents, publications…
  • 12. Not a perfect fit: The average HTML document is 6K, and 37% are under 10K. Samza can handle 1.2M messages per second on a single node [2]. Data is retained for a limited time, between 7 and 30 days. Most of the data is filtered out. Samza stores need to be bootstrapped. 1. HTML Document Transfer size http://httparchive.org/interesting.php?a=All&l=Oct%2015%202015#bytesHtmlDoc 2. Feng, Tao. "Benchmarking Apache Samza: 1.2 million messages per second on a single node." https://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node
  • 13. The Project: Satori. Help 400M members fully realize their professional identity on LinkedIn. Find sources of professional content on the public internet. Fetch the content, extract structured data, and match it to member profiles.
  • 15. Web Data Extraction System: Enterprise vs. Social Web use cases. Web Sources. Wrappers. 3. Ferrara, Emilio, et al. "Web data extraction, applications and techniques: A survey." Knowledge-Based Systems 70 (2014): 301-323.
  • 16. What is a Wrapper?
  • 17. Industrial Web Data Extraction: Induce wrappers based on data [4]. Build wrappers that are robust [5]. Cluster similar pages by URL [6]. The web is huge and there are interesting things in the long tail [7]. 4. Dalvi, Nilesh, Ravi Kumar, and Mohamed Soliman. "Automatic wrappers for large scale web extraction." Proceedings of the VLDB Endowment 4.4 (2011): 219-230. 5. Dalvi, Nilesh, Philip Bohannon, and Fei Sha. "Robust web extraction: an approach based on a probabilistic tree-edit model." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009. 6. Blanco, Lorenzo, Nilesh Dalvi, and Ashwin Machanavajjhala. "Highly efficient algorithms for structural clustering of large websites." Proceedings of the 20th international conference on World wide web. ACM, 2011. 7. Dalvi, Nilesh, Ashwin Machanavajjhala, and Bo Pang. "An analysis of structured data on the web." Proceedings of the VLDB Endowment 5.7 (2012): 680-691.
  • 19. The Contestants: Heritrix powers archive.org. Nutch powers Common Crawl. BUbiNG is part of LAW. Scrapy is used within LinkedIn. 8. Web crawling, C Olston, M Najork - Foundations and Trends in Information Retrieval, 2010. 9. An Introduction to Heritrix: An Open Source Archival Quality Web Crawler, A Dan, K Michele - 2004. 10. BUbiNG: massive crawling for the masses, P Boldi, A Marino, M Santini, S Vigna - 2014. 11. Nutch: A Flexible and Scalable Open-Source Web Search Engine. CommerceNet Labs, R Khare, D Cutting, K Sitaker, A Rifkin - 2004 - CN-TR-04-04, November.
  • 22. Crawl Flow: Built on Nutch 1.9. Runs on Hadoop 2.3. Scheduled to run every 5 hours. Respects robots.txt. Default crawl delay of 5 seconds.
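The politeness rules on the crawl-flow slide (respect robots.txt, fall back to a 5-second delay) can be sketched with Python's standard `urllib.robotparser`. This is only an illustration of the policy, not Nutch's implementation; the user agent `satori-bot`, the helper names, and the sample robots.txt are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Default delay from the slide; used when robots.txt names no Crawl-delay.
DEFAULT_CRAWL_DELAY_S = 5.0


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_policy(parser: RobotFileParser, user_agent: str, url: str):
    """Return (allowed, delay_seconds) for one URL and one user agent."""
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay is set
    return allowed, float(delay) if delay is not None else DEFAULT_CRAWL_DELAY_S


# Hypothetical robots.txt for an example site.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\nCrawl-delay: 10\n"
```

A scheduler would call `crawl_policy` before each fetch and sleep for the returned delay between requests to the same host.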
  • 23. Extract Flow: Output into the target schema. Apply XPath wrappers. Wrappers are a hierarchical mapping of schema fields to XPath expressions. Wrappers are indexed by data domain and data source.
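As a rough sketch of the wrapper idea on the extract-flow slide, the snippet below keeps a hierarchical mapping of schema fields to XPath expressions, indexed by (data domain, data source), and applies it to a page. Everything concrete here is assumed for illustration: the `_nodes` key convention, the `example-journal` source, the field names, and the sample page. It uses the limited XPath support in the standard library's `xml.etree.ElementTree`, so the input must be well-formed; real HTML would need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper registry: (data domain, data source) -> field mapping.
# A string rule is an XPath for a scalar field. A dict rule selects repeated
# nodes with "_nodes" and applies its remaining sub-fields to each node.
WRAPPERS = {
    ("publication", "example-journal"): {
        "title": ".//h1[@class='title']",
        "authors": {
            "_nodes": ".//li[@class='author']",
            "name": "./span[@class='name']",
            "affiliation": "./span[@class='org']",
        },
    },
}


def apply_wrapper(node, wrapper):
    """Recursively evaluate a wrapper against an element tree node."""
    record = {}
    for field, rule in wrapper.items():
        if isinstance(rule, str):
            match = node.find(rule)
            has_text = match is not None and match.text
            record[field] = match.text.strip() if has_text else None
        else:
            sub_rules = {k: v for k, v in rule.items() if k != "_nodes"}
            record[field] = [
                apply_wrapper(child, sub_rules)
                for child in node.findall(rule["_nodes"])
            ]
    return record


def extract(domain, source, page):
    """Look up the wrapper for (domain, source) and apply it to a page."""
    return apply_wrapper(ET.fromstring(page), WRAPPERS[(domain, source)])


SAMPLE_PAGE = """<html><body>
<h1 class="title">Sample Paper Title</h1>
<ul>
<li class="author"><span class="name">A. Author</span><span class="org">Example University</span></li>
<li class="author"><span class="name">B. Writer</span><span class="org">Example Labs</span></li>
</ul>
</body></html>"""
```

Keeping the wrappers as data rather than code means a broken extractor for one source can be corrected by editing its mapping without touching the extraction job.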
  • 24. Crawl rate is bound by the number of sites and the site crawl delay
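The bound on crawl rate can be made concrete with a little arithmetic. The sketch below assumes the crawler fetches from each site sequentially, waits the full crawl delay between requests, and runs for an entire cycle; the function name and the site counts are illustrative, while the 5-second delay and 5-hour schedule come from the crawl-flow slide.

```python
def max_pages_per_cycle(num_sites: int, crawl_delay_s: float, cycle_s: float) -> int:
    """Upper bound on pages fetched in one cycle under per-site politeness."""
    pages_per_site = int(cycle_s // crawl_delay_s)  # one request per delay window
    return num_sites * pages_per_site


# With a 5 s delay and a 5 h cycle, one site yields at most 3600 pages,
# so total throughput grows only by adding more sites.
ONE_SITE = max_pages_per_cycle(1, 5.0, 5 * 3600)
HUNDRED_SITES = max_pages_per_cycle(100, 5.0, 5 * 3600)
```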
  • 25. Bootstrap From Bulk Sources: Common Crawl is a great source https://commoncrawl.org/ and Gobblin is a great ingestion framework https://github.com/linkedin/gobblinn
  • 26. XPath extractors can be challenging on sites with rich data
  • 27. It is easy to exceed the Hadoop quota
  • 29. Matching authors and publications to members to power profile edit experiences
  • 31. Start Simple: Match using global identifiers, email or full name. The data might not be clean after extraction. Start with a small set of data and get it to the users quickly.
  • 32. Keep It Simple: Narrow the candidates with LSH [1]. Use the simple model to generate the ground truth. Train using a simple algorithm and a few hundred features. 1. https://en.wikipedia.org/wiki/Locality-sensitive_hashing
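One way to picture the "start simple" matcher is a rule that tries the strongest key first and accepts a match only when it is unambiguous. The field names (`global_id`, `email`, `full_name`, `member_id`), the normalization, and the policy of skipping ambiguous hits are assumptions for this sketch, not details given in the talk.

```python
def normalize(value: str) -> str:
    """Lowercase and collapse whitespace, since extracted data may be messy."""
    return " ".join(value.lower().split())


def match_record(record: dict, members: list):
    """Return (member_id, key_used) for the first unambiguous match, else None."""
    for key in ("global_id", "email", "full_name"):  # strongest key first
        value = record.get(key)
        if not value:
            continue
        hits = [
            m["member_id"]
            for m in members
            if m.get(key) and normalize(m[key]) == normalize(value)
        ]
        if len(hits) == 1:  # skip keys that match nobody or several members
            return hits[0], key
    return None
```

Matches produced by exact rules like these are the kind of output that can later serve as training labels for a learned model.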
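The slide names LSH without saying which scheme was used. As a minimal sketch of candidate narrowing, the code below uses MinHash signatures with banding, one common LSH construction for set similarity; the shingle size, hash count, band count, and record format are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 3) -> set:
    """Character k-grams of the normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed: int, token: str) -> int:
    """Seeded 64-bit hash; each seed acts as one MinHash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash(tokens: set, num_hashes: int = 32) -> tuple:
    """Signature made of the minimum hash value under each seeded function."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def candidate_pairs(records: dict, num_hashes: int = 32, bands: int = 16) -> set:
    """Pairs of record ids that share at least one identical signature band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for record_id, text in records.items():
        signature = minhash(shingles(text), num_hashes)
        for band in range(bands):
            key = (band, signature[band * rows:(band + 1) * rows])
            buckets[key].add(record_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only the pairs returned by `candidate_pairs` would be scored by the matching model, which avoids comparing every extracted record against every member profile.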
  • 33. Current Status: [Two charts. Extractor Objects: Total vs. Processed for Publications and Companies (plotted values 5.3, 2.3, 3.9, 0.6). Crawler Objects: Unfetched, Fetched, and Gone for Publication and Company (plotted values 56, 2, 5.6, 2.5, 1.2, 0.1).]
  • 34. Target a data source which has data that will be easy to fetch, extract and match.
  • 35. Add tracking to the entire flow
  • 36. Do it all offline if you can
  • 37. Get the product to the customers early to validate the process and value proposition
  • 38. Most important of all write it all down and share it with everyone 
  • 39. ©2014 LinkedIn Corporation. All Rights Reserved.