Data Applications and Infrastructure at LinkedIn
Jay Kreps, LinkedIn

Plan
  • `whoami`
  • Data products
  • Data infrastructure

Data-centric engineering at LinkedIn
  • LinkedIn’s Search, Network & Analytics team
  • Domain: derived data
  • Products: search, People You May Know, social graph services, job matching, collaborative filtering
  • Infrastructure

People You May Know

Other products

People You May Know
  • 120 billion relationships scored...every day
  • 82 Hadoop jobs (not counting ETL)
  • Around 16 TB of intermediate data
  • Machine learning model to predict probability of connection
  • Bloom filters for approximate filtering joins (10x perf improvement)
  • About 5 test algorithms per week
  • 2 engineers
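
The "Bloom filters for approximate filtering joins" bullet can be illustrated with a minimal sketch. It is not the PYMK code: the class, the function names, the filter size and the hash scheme are all assumptions chosen for illustration. The idea is to build a compact filter over the keys of the small side of a join, use it to discard most non-matching rows of the large side cheaply, and let the exact join remove the few false positives that slip through.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes and hashing are illustrative assumptions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two 64-bit slices
        # of a single SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def filtered_join(small, large, num_bits=1 << 16, num_hashes=4):
    """Join `large` (iterable of (key, value)) against `small` (dict)."""
    bloom = BloomFilter(num_bits, num_hashes)
    for key in small:
        bloom.add(key)
    # Cheap pre-filter: drops most rows that cannot match.
    candidates = [(k, v) for k, v in large if bloom.might_contain(k)]
    # Exact join: removes any false positives the filter let through.
    return [(k, small[k], v) for k, v in candidates if k in small]
```

In a map/reduce setting the filter would typically be shipped to every mapper so that non-matching rows never reach the shuffle. That is where a speedup of the kind quoted on the slide would come from, though the slide itself does not describe the mechanism.
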

Relevance Products
  • You must fly entirely by the instruments
  • Scale and relevance are very closely linked
  • More is often better
  • Iteration time is essential
  • UI matters, really
  • We threw out custom non-Hadoop code that was faster
  • Opportunity to work directly on the business

Infrastructure as an Ecosystem
  • An isolated infrastructure team is usually a bad solution
  • Too isolated from the problems
  • Data product team has crushing problems
  • This area is extremely immature
  • People should want to use it
  • Treat it like a product
  • Either make money off it or give it away
  • Open source is a great solution
  • Custom software should be the best

Open Source
  • Zoie – Real-time search indexing
  • Bobo – Faceted search
  • Decomposer – Very large matrix decomposition routines (now in Mahout)
  • Norbert – Partition-aware cluster management & RPC
  • Voldemort – Key/value storage
  • Kamikaze – Compression package
  • Sensei – Distributed search
  • Azkaban – Hadoop workflow

Azkaban
  • workflow = cron + make

Azkaban
  • workflow : Hadoop :: web framework : webapp

Azkaban

Azkaban Examples
  • Example job source
  • Example workflow UI

Workflow

Azkaban
  • 82 jobs running every day just for PYMK
  • ...need to run in the right order
  • ...need to restart from failure
  • ...need to enforce dependencies
  • GUI is important for operations
  • Alerting, resource locking, config management, etc.
  • Deployable zip files of code represent a job flow
  • Everyone works independently, releases/deploys independently
  • Simple text files for config (but can use GUI in a pinch)
  • Aggregate logs, run times
  • Restart from point of failure
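
The "cron + make" idea (run jobs in dependency order, then resume from the point of failure) can be sketched in a few lines. This is not Azkaban's implementation or API. The `run_flow` function, its arguments and the job names in the usage note are hypothetical, and the sketch stops at the first failure rather than continuing independent branches.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_flow(deps, actions, completed=None):
    """Run jobs in dependency order, skipping ones that already succeeded.

    deps:      job name -> set of job names it depends on
    actions:   job name -> zero-argument callable that does the work
    completed: job names that succeeded in an earlier attempt
    Returns (completed, failed_job), where failed_job is None on success.
    """
    completed = set(completed or ())
    # static_order() yields every job after all of its dependencies.
    for job in TopologicalSorter(deps).static_order():
        if job in completed:
            continue  # restart from the point of failure, not from scratch
        try:
            actions[job]()
        except Exception:
            return completed, job
        completed.add(job)
    return completed, None
```

A flow with three hypothetical jobs, `extract`, `score` and `load`, where `score` depends on `extract` and `load` depends on `score`, would be declared as `{"extract": set(), "score": {"extract"}, "load": {"score"}}`. If `score` fails, passing the returned `completed` set back into `run_flow` reruns only `score` and `load`.
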

Data Deployment
  • How do you get your multi-billion-edge probabilistic relationship graph to the live website to serve queries?

Voldemort
  • LinkedIn had many prior passes at this problem, all bad: MySQL, Oracle, etc.
  • Fully distributed, partitioned, decentralized key-value storage
  • Supports pluggable storage engines
  • Online/offline cycle
  • Is this a good fit?

Voldemort Data Deployment

Voldemort Data Deployment
  • Building a multi-TB lookup structure is really, really hard work...it is a batch operation
  • Solution: build this structure in Hadoop
  • Tradeoff: build time vs lookup time
  • Minimal perfect hashing requires only 2.5 bits per key, but is slow to build
  • Sorted indexes are a fast, simple alternative
  • Build is a no-op map/reduce (just sorting)
  • Data load will saturate the network even for a small cluster
  • Voldemort gives failover, load balancing, monitoring, remote access, partitioning
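
The "sorted indexes" bullet can be made concrete with a small sketch of a build step and a lookup step. The file layout here is an assumption for illustration (a truncated SHA-256 digest plus an 8-byte offset per index entry, and length-prefixed values), not Voldemort's actual on-disk format. The build is nothing more than a sort, which is why it maps onto a "no-op" map/reduce job, and the lookup is a binary search over fixed-width entries.

```python
import hashlib
import struct

KEY_BYTES = 16          # truncated digest width (assumed for this sketch)
ENTRY = KEY_BYTES + 8   # digest followed by an 8-byte offset into the data


def _digest(key):
    return hashlib.sha256(key).digest()[:KEY_BYTES]


def build(records):
    """records: dict of bytes -> bytes. Returns (index, data) byte strings."""
    index, data = bytearray(), bytearray()
    # Sorting by digest is the whole "build"; in Hadoop the shuffle does it.
    for digest, value in sorted((_digest(k), v) for k, v in records.items()):
        index += digest + struct.pack(">Q", len(data))
        data += struct.pack(">I", len(value)) + value
    return bytes(index), bytes(data)


def lookup(index, data, key):
    """Binary search over the fixed-width index entries; None if absent."""
    target = _digest(key)
    lo, hi = 0, len(index) // ENTRY
    while lo < hi:
        mid = (lo + hi) // 2
        if index[mid * ENTRY:mid * ENTRY + KEY_BYTES] < target:
            lo = mid + 1
        else:
            hi = mid
    entry = index[lo * ENTRY:(lo + 1) * ENTRY]
    if len(entry) == ENTRY and entry[:KEY_BYTES] == target:
        (offset,) = struct.unpack(">Q", entry[KEY_BYTES:])
        (size,) = struct.unpack_from(">I", data, offset)
        return data[offset + 4:offset + 4 + size]
    return None
```

For a sense of the tradeoff on the slide, take a hypothetical 1 billion keys. At 2.5 bits per key a minimal perfect hash needs about 312 MB, while this 24-byte-per-entry index needs about 24 GB. The sorted index costs far more space but is trivial to build.
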

Voldemort Data Deployment
  • If data takes 24 hours to generate, it may take 24 hours to fix
  • Need a faster rollback strategy
  • Cold disk space is cheap
  • Store the live copy
  • Store the copy currently being updated
  • Store N backup copies
  • “Atomic” swap
  • Cache needs to start warm
  • I/O, network throttling to limit impact of deployment
  • Our prod latency is < 3 ms from the client side
  • 900 GB store takes ~1:30 to build on a 45-node dev cluster
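
The copy-management scheme on this slide (a live copy, a copy being updated, N backups, a swap and a fast rollback) can be modelled in memory as below. This only models the bookkeeping. The `VersionedStore` class and its method names are hypothetical. A real deployment would swap directories or files on disk, and would also need the cache warming and throttling that the slide mentions.

```python
from collections import deque


class VersionedStore:
    """In-memory model of live / staging / N-backup dataset versions."""

    def __init__(self, max_backups):
        self.live = None        # version currently serving reads
        self.staging = None     # version currently being loaded
        self.backups = deque(maxlen=max_backups)  # oldest first, newest last

    def stage(self, dataset):
        # Loading happens off to the side; readers still see `live`.
        self.staging = dataset

    def swap(self):
        # Promote staging to live in one step and keep the old live copy.
        if self.staging is None:
            raise RuntimeError("nothing staged")
        if self.live is not None:
            self.backups.append(self.live)  # the deque drops the oldest backup
        self.live, self.staging = self.staging, None

    def rollback(self):
        # Discard the bad live copy and serve the newest backup again.
        if not self.backups:
            raise RuntimeError("no backup to roll back to")
        self.live = self.backups.pop()
```

Rolling back is then a pointer change rather than a regeneration of the data, which is the point of the "24 hours to generate, 24 hours to fix" bullet.
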
Questions?

Data Applications and Infrastructure at LinkedIn (Hadoop Summit 2010)

Editor's Notes

  • #4: Why LinkedIn cares about derived data; why it is hard
  • #5: Talk about what you can do
  • #7: If you get bad results, I claim you are in an unsuccessful test! Still a small percentage of the quadrillion possible relationships (pairwise is hard)
  • #8: What we learned
  • #11: Azkaban is a workflow scheduler? What is a workflow?
  • #14: Samurai rule; logic is in jobs, not the job descriptor; jobs are independent; work – viz, polish