Hadoop & Voldemort @ LinkedIn
Bhupesh Bansal, 20 January 2010
The plan
- What is Project Voldemort? Motivation, new features, in production
- Hadoop & LinkedIn: the LinkedIn Hadoop ecosystem
- Hadoop & Voldemort
- References, Q&A
Introduction
Project Voldemort is a distributed, scalable, highly available key/value storage system.
- Inspired by the Amazon Dynamo paper and memcached
- An online storage solution that scales horizontally
- High throughput, low latency
What does it do for you?
- Provides a simple key/value API for clients
- Handles data partitioning and replication
- Provides consistency guarantees even in the presence of failures
- Scales well (amount of data, number of clients)
Motivation I: Big Data. Reference: algo2.iti.kit.edu/.../fopraext/index.html
Motivation II: Data-Driven Features
Motivation III
Motivation IV
Why Is This Hard?
- Failures in a distributed system are much more complicated: "A can talk to B" does not imply "B can talk to A", and nodes will fail and come back to life with stale data
- I/O has high request latency variance: is the node down, or just slow?
- Intermittent failures are common: should I try this node now?
- There are fundamental trade-offs between availability and consistency (the CAP theorem)
- The user must be isolated from these problems
Some problems we worked on lately
- Performance improvements: push computation to data (server-side views), data compression
- Failure detection
- Reliable testing: testing is hard, and distributed systems make it harder
- Voldemort rebalancing: dynamically add nodes to a running cluster
- Administration: often ignored, but very important
Server-side views
Motivation: push computation to the data by creating custom views on the server side for specific transformations.
- Avoids network transfer of big blobs
- Avoids serialization CPU and I/O cost
Examples (see the sketch below):
- Range query over a denormalized list stored as a value
- Append() operation for list values
- Filters/aggregates on specific fields, etc.
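As a rough illustration of the idea, a server-side view can be modeled as a transformation that runs next to the storage engine, so only the small result crosses the network. This is a hypothetical sketch; the names (ServerSideView, ListAppendView) are ours, not Voldemort's actual view API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a server-side view: the transformation runs on the
// storage node, so only the (small) result crosses the network instead of
// the full serialized blob. Names here are illustrative, not Voldemort's API.
interface ServerSideView<V, T> {
    // Applied on a GET: turn the stored value into the client-visible view.
    T storeToView(V storedValue);

    // Applied on a PUT through the view: merge the client's input into the
    // stored value (e.g. append to a list without fetching it first).
    V viewToStore(T clientInput, V storedValue);
}

// Example: an append() view over a list-valued store.
class ListAppendView implements ServerSideView<List<String>, String> {
    public String storeToView(List<String> storedValue) {
        // Expose only the most recent element.
        return storedValue.isEmpty() ? null : storedValue.get(storedValue.size() - 1);
    }

    public List<String> viewToStore(String clientInput, List<String> storedValue) {
        List<String> updated = storedValue == null ? new ArrayList<String>()
                                                   : new ArrayList<String>(storedValue);
        updated.add(clientInput); // the append happens server-side
        return updated;
    }
}
```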
Failure detection
We need to maintain up-to-date availability status for each server:
- Detect failures earlier; avoid waiting on failed nodes while serving
- Reduce false positives; maintain proper load balancing
- Allow for intermittent or temporary failures
- Re-establish node availability asynchronously
Contributed by Kirk True
EC2-based testing
Testing "in the cloud":
- Distributed systems have to be tested on multi-node clusters and have complex failure scenarios
- A storage system, above all, must be stable
- Automated testing allows rapid iteration while maintaining confidence in the system's correctness and stability
EC2-based testing framework:
- Tests are invoked programmatically
- Adaptable to other cloud hosting providers
- Will run on a regular basis
Contributed by Kirk True
Coming this January (finally): Rebalancing
Voldemort rebalancing: the capability to add or delete nodes and move data around in an online Voldemort cluster.
Features:
- No downtime
- Transparent to the client
- Maintains data consistency guarantees
- Push-button user interface
Administration
Monitoring:
- View statistics (how many queries are made? how long are they taking?)
- Perform diagnostic operations
Administrative functionality (needed, but shouldn't be performed by regular store clients), for example:
- Update and retrieve cluster/store metadata
- Efficient streaming of keys and values
- Delete entries in bulk; truncate an entire store
- Restore a node's data from replicas
Present day
In production use:
- At LinkedIn: multiple clusters, a variety of customers
- Outside of LinkedIn: Gilt Group, KaChing, others
Active developer community, inside and outside LinkedIn:
- Monthly release cycle
- Continuous testing environment; daily performance tests
Performance
LinkedIn cluster: web event tracking logging and online lookups.
- 6 nodes, 400 GB of data, 12 clients, mixed load (67% GET, 33% PUT)
Throughput: 1433 QPS per node, 4299 QPS across the cluster
Latency:
- GET: 50th percentile 0.05 ms, 95th percentile 36.07 ms, 99th percentile 60.65 ms
- PUT: 50th percentile 0.09 ms, 95th percentile 0.41 ms, 99th percentile 1.22 ms
Hadoop @ LinkedIn
Batch computing at LinkedIn
Some questions we want to answer:
- What do we use Hadoop for?
- How do we store data?
- How do we manage workflows?
- How do we do ETL?
- How do we prototype ideas?
What do we use Hadoop for?
How do we store data?
- Compact, compressed, binary data (something like Avro)
- Types can be any combination of int, double, float, String, Map, List, Date, etc.
Example member definition:
  { 'member_id': 'int32', 'first_name': 'string', 'last_name': 'string', 'age': 'int32', ... }
- Data is stored in Hadoop as sequence files, serialized with this format
- The schema of the data is saved in the sequence files as metadata
- The schema is read dynamically by Java/Pig jobs on the fly (see the sketch below)
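A minimal sketch of what reading that metadata might look like with the stock Hadoop SequenceFile API; the metadata key name "schema" is our assumption, not something stated in the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Minimal sketch: read a self-describing sequence file by pulling the schema
// out of the file header's metadata before touching any records. The
// metadata key "schema" is an assumption for illustration.
public class SchemaReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(args[0]), conf);
        try {
            // SequenceFile carries arbitrary Text->Text metadata in its header.
            Text schema = reader.getMetadata().get(new Text("schema"));
            System.out.println("schema = " + schema);
            // A Java or Pig job would now parse this schema and use it to
            // deserialize each record dynamically.
        } finally {
            reader.close();
        }
    }
}
```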
How do we manage workflows?
We wrote a workflow management tool (code name: Azkaban):
- Dependency management (Hadoop, ETL, Java, and Unix jobs): maintains a dependency directed acyclic graph; all dependencies must complete before the job itself can run, and if a dependency fails, the job fails
- Scheduling: workflows can be scheduled to run on a repeating schedule
- Configuration system (simple properties files; a hypothetical example follows)
- GUI for visualizing and controlling jobs
- Historical logging and job success data retained for each run
- Alerting on failures
Will be open sourced soon (APL)!
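For flavor, a job definition in a simple properties-file system like this might look as follows. This is a hypothetical sketch: the keys mirror the format the open-source Azkaban later used and may not match the internal version described here.

```properties
# process-logs.job -- hypothetical Azkaban-style job definition.
# 'dependencies' names other jobs in the same workflow; this job runs
# only after all of them have completed successfully.
type=command
command=hadoop jar etl.jar com.example.ProcessLogs /data/weblogs /data/clean
dependencies=fetch-weblogs,load-member-data

# Alerting on failure (illustrative key name).
failure.emails=etl-team@example.com
```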
Introducing Azkaban
How do we do ETL? Getting data in
Two kinds of data:
From databases (user data, news, jobs, etc.):
- Need a way to get data reliably and periodically; need tests to verify the data; support for incremental replication
- Our solution: Transmogrify, a driver program which accepts an InputReader and an OutputWriter (InputReaders: JDBCReader, CSVReader; OutputWriters: JDBCWriter, HDFS writers); see the sketch below
From web logs (page views, searches, clicks, etc.):
- Weblog files are rsynced and loaded into HDFS
- Hadoop jobs handle data cleaning and transformation
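Transmogrify is an internal tool, so the following is only a guess at its shape: a driver loop that pipes records from any reader into any writer. Every name here (InputReader, OutputWriter, Transmogrify.run) is hypothetical.

```java
import java.util.Map;

// Hypothetical sketch of a Transmogrify-style ETL driver: any InputReader
// can be paired with any OutputWriter, so JDBC->HDFS, CSV->JDBC, etc. all
// reuse the same loop. None of these names are from the real tool.
interface InputReader {
    Map<String, Object> next() throws Exception; // null when exhausted
    void close() throws Exception;
}

interface OutputWriter {
    void write(Map<String, Object> record) throws Exception;
    void close() throws Exception;
}

class Transmogrify {
    public static void run(InputReader in, OutputWriter out) throws Exception {
        try {
            Map<String, Object> record;
            while ((record = in.next()) != null) {
                out.write(record); // e.g. JDBCReader -> HDFS writer
            }
        } finally {
            in.close();
            out.close();
        }
    }
}
```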
ETL II: Getting data out
Batch jobs generate output in the hundreds of gigabytes. How do we finally serve this data to users?
Some constraints:
- Want to show data to users ASAP
- Should not impact online serving
- Should have quick rollback capabilities
- Should be horizontally scalable
- Should be fault tolerant (replication)
- High throughput, low latency
ETL II: Getting data out: existing solutions
- JDBC upload to Oracle: very slow (3 days to push 20 GB of data); long-running transactions cause issues
- "LOAD DATA" statement in MySQL: tried many, many tuning settings; still very slow, and almost unresponsive for online serving
- Memcached: need to have everything in memory; no support for batch inserts; need to re-push if a server dies
- Oracle SQL*Loader: the whole process took 2-3 days, with 20-24 hours of actual data loading time; needed someone to babysit the process; what if we want to do this daily?
- HBase (didn't try)
ETL II: Getting data out: our solution
We wrote a special read-only storage engine for Voldemort: the data is built on Hadoop and copied to Voldemort.
- The index build runs 100% in Hadoop: a MapReduce job outputs Voldemort stores to HDFS
- Job control initiates a fetch request to the Voldemort nodes
- Voldemort nodes copy data from Hadoop in parallel
- An atomic swap makes the data live
- The storage engine is heavily optimized for read-only data
- I/O throttling on the transfer protects the live servers
Voldemort read-only store: version I
A simple file-based storage engine with two files: a key file and a value file.
- The key file holds sorted MD5 key hashes, each paired with the offset into the value file for the corresponding value
- The value file holds each value as (size, data)
Advantages:
- The index is built on Hadoop, so there is no load on production servers
- Files are copied in parallel to the Voldemort cluster
- Supports rollback by keeping multiple copies
A lookup sketch follows.
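A GET against this layout is a binary search over fixed-width (hash, offset) entries in the key file, followed by one seek into the value file. A minimal sketch, assuming 16-byte MD5 hashes followed by 8-byte offsets; the real on-disk entry widths are not given in the talk.

```java
import java.io.RandomAccessFile;
import java.security.MessageDigest;

// Sketch of a version-I read-only store lookup. Assumes each key-file entry
// is a 16-byte MD5 hash followed by an 8-byte value-file offset (24 bytes
// total); the actual on-disk widths are an assumption.
public class ReadOnlyStoreV1 {
    private static final int ENTRY = 16 + 8;

    public static byte[] get(RandomAccessFile keyFile,
                             RandomAccessFile valueFile,
                             byte[] key) throws Exception {
        byte[] target = MessageDigest.getInstance("MD5").digest(key);
        long lo = 0, hi = keyFile.length() / ENTRY - 1;
        while (lo <= hi) {                        // classic binary search
            long mid = (lo + hi) >>> 1;
            byte[] hash = new byte[16];
            keyFile.seek(mid * ENTRY);
            keyFile.readFully(hash);
            int cmp = compareUnsigned(hash, target);
            if (cmp == 0) {
                long offset = keyFile.readLong(); // offset follows the hash
                valueFile.seek(offset);
                byte[] value = new byte[valueFile.readInt()]; // (size, data)
                valueFile.readFully(value);
                return value;
            } else if (cmp < 0) lo = mid + 1;
            else hi = mid - 1;
        }
        return null; // key not present
    }

    private static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return 0;
    }
}
```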
Voldemort read-only store: version II
The version I read-only stores had a few issues:
- Only one reducer per node
- A binary search can potentially take 32 steps
Version II format:
- Multiple (key file, value file) pairs, allowing multiple reducers per node
- Mmap all key files
- Use interpolation search: keys are MD5 hashes and therefore very uniformly distributed, so the search can predict where a key should be; much faster performance (see the sketch below)
Future work:
- Group values by frequency, so that frequent values stay in the operating system cache
- Transfer only the delta
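Interpolation search exploits that uniformity: instead of always probing the middle, probe where the key "should" sit proportionally within the current range, dropping the expected step count to roughly O(log log n). A minimal sketch over an in-memory sorted array of 8-byte key prefixes (the mmap'd file version is analogous):

```java
// Minimal interpolation-search sketch over sorted, uniformly distributed
// keys, here reduced to 8-byte prefixes. For simplicity the prefixes are
// assumed non-negative; real MD5 prefixes would need unsigned arithmetic.
public class InterpolationSearch {
    public static int search(long[] keys, long target) {
        int lo = 0, hi = keys.length - 1;
        while (lo <= hi && target >= keys[lo] && target <= keys[hi]) {
            if (keys[hi] == keys[lo]) break; // avoid divide-by-zero
            // Predict the position proportionally within [lo, hi].
            double frac = (double) (target - keys[lo]) / (keys[hi] - keys[lo]);
            int probe = lo + (int) (frac * (hi - lo));
            if (keys[probe] == target) return probe;
            else if (keys[probe] < target) lo = probe + 1;
            else hi = probe - 1;
        }
        return (lo <= hi && keys[lo] == target) ? lo : -1; // -1 = not found
    }
}
```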
Performance
Three performance numbers matter here:
- Hadoop: time to build the read-only store indexes; about 10 minutes to build 20 GB of data
- File transfer time: limited only by network and disk throughput
- Online serving performance: the read-only store is fast (1 ms - 5 ms), the operating system page cache helps a lot, and performance is very predictable given data size, disk speeds, RAM, and cache hit ratios
Batch Computing at LinkedIn
Infrastructure at LinkedIn
Last year:
- QA lab: 20 machines, cheap Dell Linux servers
- Production: 20 heavy machines
- The "QA" cluster is for dev, analysis, and reporting: it uses Pig and Hadoop streaming for prototyping ideas; ad hoc jobs compete with scheduled jobs, so we tried different Hadoop schedulers
- Production is for jobs that produce user-facing data
Hired Allen Wittenauer as our Hadoop architect in Sep 2009. 100 Hadoop machines.
References
- Amazon Dynamo paper
- project-voldemort.com
- NoSQL presentations at Last.fm (2009)
- Voldemort presentation by Jay Kreps
The End
Core Concepts
Core Concepts - I
ACID: great for a single centralized server.
CAP theorem: Consistency (strict), Availability, Partition tolerance.
- Impossible to achieve all three at the same time on a distributed platform; you can choose 2 out of 3
- Dynamo chooses high availability and partition tolerance, relaxing strict consistency to eventual consistency
Consistency models:
- Strict consistency: 2-phase commit; PAXOS, a distributed algorithm to ensure a quorum for consistency
- Eventual consistency: different nodes can have different views of a value; in a steady state the system will return the last written value, BUT it can offer much stronger guarantees
Core Concepts - II
Consistent hashing:
- The key space is partitioned into many small partitions
- Partitions never change, but partition ownership can change
Replication: each partition is stored by 'N' nodes.
Node failures:
- Transient (short term)
- Long term, which needs faster bootstrapping
A routing sketch follows.
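A rough sketch of how a key finds its N replicas under this scheme: hash the key onto a fixed partition, then walk the partition ring collecting N distinct nodes. The partition count, MD5 placement, and class names below are illustrative assumptions, not Voldemort's exact code.

```java
import java.security.MessageDigest;
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of consistent-hash routing: the key space is split into a fixed
// number of small partitions that never change; only the partition->node
// ownership table changes when the cluster is rebalanced.
public class ConsistentRouter {
    private final int[] partitionOwner; // partitionOwner[p] = node id owning p
    private final int numPartitions;

    public ConsistentRouter(int[] partitionOwner) {
        this.partitionOwner = partitionOwner;
        this.numPartitions = partitionOwner.length;
    }

    // Return the N distinct nodes responsible for this key, in preference order.
    public Set<Integer> preferenceList(byte[] key, int n) throws Exception {
        byte[] md5 = MessageDigest.getInstance("MD5").digest(key);
        int partition = Math.floorMod(bytesToInt(md5), numPartitions);
        Set<Integer> nodes = new LinkedHashSet<Integer>();
        // Walk the partition ring until we have collected N distinct nodes.
        for (int i = 0; i < numPartitions && nodes.size() < n; i++) {
            nodes.add(partitionOwner[(partition + i) % numPartitions]);
        }
        return nodes;
    }

    private static int bytesToInt(byte[] b) {
        return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
    }
}
```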
Core Concepts - III
- N: the replication factor
- R: the number of blocking reads
- W: the number of blocking writes
If R + W > N then we have a quorum-like algorithm that guarantees we will read the latest writes OR fail. For example, with N = 3, R = 2, W = 2, every read quorum of 2 nodes overlaps every write quorum of 2 nodes (2 + 2 > 3), so a read always touches at least one up-to-date replica.
R, W, and N can be tuned for different use cases:
- W = 1: highly available writes
- R = 1: read-intensive workloads
These are the knobs to tune performance, durability, and availability; a store definition sketch follows.
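In Voldemort these knobs are set per store in the cluster's store definition file. A minimal sketch, where the store name, persistence engine, and serializers are placeholders:

```xml
<!-- Minimal sketch of a Voldemort store definition showing the N/R/W knobs.
     Store name, persistence engine, and serializers are placeholders. -->
<store>
  <name>example-store</name>
  <persistence>bdb</persistence>
  <routing>client</routing>
  <replication-factor>3</replication-factor>  <!-- N -->
  <required-reads>2</required-reads>          <!-- R -->
  <required-writes>2</required-writes>        <!-- W: R + W > N -->
  <key-serializer><type>string</type></key-serializer>
  <value-serializer><type>json</type></value-serializer>
</store>
```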
Core Concepts - IV
Vector clocks [Lamport] provide a way to order events in a distributed system.
- A vector clock is a tuple {t1, t2, ..., tn} of counters
- Each value update has a master node: when data is written with master node i, it increments ti
- All the replicas will receive the same version
- Helps resolve consistency between writes on multiple replicas
If you get network partitions, you can have a case where two vector clocks are not comparable; in that case Voldemort returns both values to clients for conflict resolution. A comparison sketch follows.
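A minimal sketch of the comparison at the heart of this: clock A "happened before" clock B iff no counter in A exceeds the corresponding counter in B and at least one is strictly smaller; if each clock is ahead of the other somewhere, the writes were concurrent and both values go back to the client. The class below is an illustration, not Voldemort's implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal vector-clock comparison sketch. Clocks map node id -> counter;
// a missing entry counts as 0. If neither clock dominates the other, the
// writes were concurrent and conflict resolution goes to the client.
public class VectorClock {
    public enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    public static Order compare(Map<Integer, Long> a, Map<Integer, Long> b) {
        boolean aAheadSomewhere = false, bAheadSomewhere = false;
        Set<Integer> nodes = new HashSet<Integer>(a.keySet());
        nodes.addAll(b.keySet());
        for (int node : nodes) {
            long ta = a.getOrDefault(node, 0L);
            long tb = b.getOrDefault(node, 0L);
            if (ta > tb) aAheadSomewhere = true;
            if (tb > ta) bAheadSomewhere = true;
        }
        if (aAheadSomewhere && bAheadSomewhere) return Order.CONCURRENT;
        if (aAheadSomewhere) return Order.AFTER;   // a strictly dominates b
        if (bAheadSomewhere) return Order.BEFORE;  // b strictly dominates a
        return Order.EQUAL;
    }

    public static void main(String[] args) {
        Map<Integer, Long> a = new HashMap<Integer, Long>();
        Map<Integer, Long> b = new HashMap<Integer, Long>();
        a.put(1, 2L);                 // node 1 wrote twice
        b.put(1, 2L); b.put(2, 1L);   // b saw a's writes plus one from node 2
        System.out.println(compare(a, b)); // BEFORE: a happened before b
    }
}
```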
Implementation
Voldemort Design
Client API
Data is organized into "stores", i.e. tables:
- Key-value only, but values can be arbitrarily rich or complex: maps, lists, nested combinations...
Four operations:
- PUT (Key k, Value v)
- GET (Key k)
- MULTI-GET (Iterator<Key> keys)
- DELETE (Key k) / (Key k, Version ver)
No range scans. A usage sketch follows.
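A minimal client session might look like this; the bootstrap URL and store name are placeholders, and this reflects the Voldemort Java client API of the era as we understand it:

```java
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

// Minimal sketch of a Voldemort client session. Bootstrap URL and store
// name are placeholders; error handling omitted for brevity.
public class ClientExample {
    public static void main(String[] args) {
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("example-store");

        client.put("member:42", "some value");         // PUT
        Versioned<String> v = client.get("member:42"); // GET: value + vector clock
        System.out.println(v.getValue());

        client.delete("member:42");                    // DELETE
    }
}
```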
Versioning & conflict resolution
Eventual consistency allows multiple versions of a value, so we need:
- A way to tell which value is the latest
- A way to say that two values are not comparable
Solutions: timestamps, or vector clocks, which provide an ordering with no locking or blocking necessary.
Serialization
Really important. A few considerations:
- Schema-free?
- Backward/forward compatibility
- Real-life data structures
- Bytes <=> objects <=> strings?
- Size (no XML)
There are many ways to do it, and we allow anything: compressed JSON, Protocol Buffers, Thrift, Voldemort custom serialization
Routing
The routing layer hides a lot of complexity:
- Hashing scheme
- Replication (N, R, W)
- Failures
- Read repair (online repair mechanism)
- Hinted handoff (long-term recovery mechanism)
It is easy to add domain-specific strategies, e.g. only do synchronous operations on nodes in the local data center.
Routing can run client-side, server-side, or hybrid.
Voldemort Physical Deployment
Routing with failures
Failure detection requirements:
- Needs to be very, very fast
- The view of server state may be inconsistent: A can talk to B but C cannot; A can talk to C, and B can talk to A but not to C
Currently done by the routing layer (request timeouts), which periodically retries failed nodes. All requests must have hard SLAs.
Other possible solutions: a central server, or a gossip protocol. We need to look into this more.
Repair mechanisms
Read repair:
- Online repair mechanism: the routing client receives values from multiple nodes and notifies a node if it sees an old value (see the sketch below)
- Only works for keys which are read after failures
Hinted handoff:
- If a write fails, write it to any random node and mark it as a special write
- Each node periodically tries to get rid of all its special entries
Bootstrapping mechanism (we don't have it yet):
- If a node was down for a long time, hinted handoff can generate a ton of traffic
- We need a better way to bootstrap and to clear the hinted-handoff tables
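A rough sketch of the read-repair path: compare the versioned replies from each replica and asynchronously push the winning version back to any node that returned an older one. It reuses the VectorClock comparison sketched under Core Concepts IV; all type names are illustrative.

```java
import java.util.List;
import java.util.Map;

// Sketch of read repair: after a GET fans out to several replicas, any
// replica whose version is strictly older than the winner gets the winning
// value written back asynchronously. Reply/Node types are illustrative, and
// truly concurrent versions (which would really be kept side by side for
// the client to resolve) are simplified away here.
class ReadRepair {
    static class Reply {
        final int nodeId;
        final Map<Integer, Long> clock; // vector clock of the returned value
        final byte[] value;
        Reply(int nodeId, Map<Integer, Long> clock, byte[] value) {
            this.nodeId = nodeId; this.clock = clock; this.value = value;
        }
    }

    interface Node { void putAsync(byte[] key, byte[] value, Map<Integer, Long> clock); }

    static void repair(byte[] key, List<Reply> replies, Map<Integer, Node> nodes) {
        // Pick a winner: a reply not dominated by any other reply.
        Reply winner = replies.get(0);
        for (Reply r : replies)
            if (VectorClock.compare(winner.clock, r.clock) == VectorClock.Order.BEFORE)
                winner = r;
        // Push the winner back to every node that returned an older version.
        for (Reply r : replies)
            if (VectorClock.compare(r.clock, winner.clock) == VectorClock.Order.BEFORE)
                nodes.get(r.nodeId).putAsync(key, winner.value, winner.clock);
    }
}
```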
Network layer
The network is the major bottleneck in many uses. Client performance turns out to be harder than server performance (the client must wait!), and there were lots of issues with socket buffer sizes and socket pools; the server is also a client.
Two implementations:
- HTTP + servlet container
- Simple socket protocol + custom server
The HTTP server is great, but the HTTP client is 5-10x slower, so the socket protocol is what we use in production. We recently added a non-blocking version of the server.
Persistence
Single-machine key-value storage is a commodity, and plugins are better than tying yourself to a single strategy: different use cases optimize for reads or for writes, for large or small values, and SSDs may completely change this layer.
A couple of different options: BDB, MySQL, and mmap'd file implementations; Berkeley DB is the most popular, with an in-memory plugin for testing.
B-trees are still the best all-purpose structure. No flush on write is a huge, huge win.
Ad

More Related Content

What's hot (20)

Mumak
MumakMumak
Mumak
Hadoop User Group
 
Spark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSpark,Hadoop,Presto Comparition
Spark,Hadoop,Presto Comparition
Sandish Kumar H N
 
ImpalaToGo use case
ImpalaToGo use caseImpalaToGo use case
ImpalaToGo use case
David Groozman
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
Kelly Technologies
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hari Shankar Sreekumar
 
Intro To Hadoop
Intro To HadoopIntro To Hadoop
Intro To Hadoop
Bill Graham
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
sudhakara st
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Databricks
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
Sperasoft
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
Yahoo Developer Network
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
Bouquet
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
Zheng Shao
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use case
Davin Abraham
 
Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)
Databricks
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
Yahoo Developer Network
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatterns
grepalex
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
Jay
 
HBase at Mendeley
HBase at MendeleyHBase at Mendeley
HBase at Mendeley
Dan Harvey
 
SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
Treasure Data, Inc.
 
Spark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSpark,Hadoop,Presto Comparition
Spark,Hadoop,Presto Comparition
Sandish Kumar H N
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
Kelly Technologies
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hari Shankar Sreekumar
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
sudhakara st
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
Databricks
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
Sperasoft
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
Yahoo Developer Network
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
Bouquet
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
Zheng Shao
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use case
Davin Abraham
 
Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)
Databricks
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
Yahoo Developer Network
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatterns
grepalex
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
Jay
 
HBase at Mendeley
HBase at MendeleyHBase at Mendeley
HBase at Mendeley
Dan Harvey
 

Viewers also liked (20)

Ordered Record Collection
Ordered Record CollectionOrdered Record Collection
Ordered Record Collection
Hadoop User Group
 
Hadoop Release Plan Feb17
Hadoop Release Plan Feb17Hadoop Release Plan Feb17
Hadoop Release Plan Feb17
Hadoop User Group
 
Twitter Protobufs And Hadoop Hug 021709
Twitter Protobufs And Hadoop   Hug 021709Twitter Protobufs And Hadoop   Hug 021709
Twitter Protobufs And Hadoop Hug 021709
Hadoop User Group
 
Searching At Scale
Searching At ScaleSearching At Scale
Searching At Scale
Hadoop User Group
 
Hadoop Record Reader In Python
Hadoop Record Reader In PythonHadoop Record Reader In Python
Hadoop Record Reader In Python
Hadoop User Group
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009
yhadoop
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
yhadoop
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
LinkedIn
 
Karmasphere Studio for Hadoop
Karmasphere Studio for HadoopKarmasphere Studio for Hadoop
Karmasphere Studio for Hadoop
Hadoop User Group
 
1 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-211 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-21
Hadoop User Group
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21
Hadoop User Group
 
The Bixo Web Mining Toolkit
The Bixo Web Mining ToolkitThe Bixo Web Mining Toolkit
The Bixo Web Mining Toolkit
Tom Croucher
 
Nov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataNov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big Data
Yahoo Developer Network
 
NodeB Application Part
NodeB Application PartNodeB Application Part
NodeB Application Part
Tusharadri Sarkar
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
Tom Croucher
 
HUG Nov 2010: HDFS Raid - Facebook
HUG Nov 2010: HDFS Raid - FacebookHUG Nov 2010: HDFS Raid - Facebook
HUG Nov 2010: HDFS Raid - Facebook
Yahoo Developer Network
 
Cloudera Desktop
Cloudera DesktopCloudera Desktop
Cloudera Desktop
Hadoop User Group
 
3 avro hug-2010-07-21
3 avro hug-2010-07-213 avro hug-2010-07-21
3 avro hug-2010-07-21
Hadoop User Group
 
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReducePublic Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Hadoop User Group
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Hadoop User Group
 
Twitter Protobufs And Hadoop Hug 021709
Twitter Protobufs And Hadoop   Hug 021709Twitter Protobufs And Hadoop   Hug 021709
Twitter Protobufs And Hadoop Hug 021709
Hadoop User Group
 
Hadoop Record Reader In Python
Hadoop Record Reader In PythonHadoop Record Reader In Python
Hadoop Record Reader In Python
Hadoop User Group
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009
yhadoop
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
yhadoop
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
LinkedIn
 
Karmasphere Studio for Hadoop
Karmasphere Studio for HadoopKarmasphere Studio for Hadoop
Karmasphere Studio for Hadoop
Hadoop User Group
 
1 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-211 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-21
Hadoop User Group
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21
Hadoop User Group
 
The Bixo Web Mining Toolkit
The Bixo Web Mining ToolkitThe Bixo Web Mining Toolkit
The Bixo Web Mining Toolkit
Tom Croucher
 
Nov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataNov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big Data
Yahoo Developer Network
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
Tom Croucher
 
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReducePublic Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Hadoop User Group
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Hadoop User Group
 
Ad

Similar to Hadoop and Voldemort @ LinkedIn (20)

Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
Bhupesh Bansal
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Cloudera, Inc.
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
Jon Meredith
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
Antonio Silveira
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
Christopher Curtin
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Alluxio, Inc.
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Precisely
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
sudhakara st
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
Prashant Gupta
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
Amr Awadallah
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
Leandro Totino Pereira
 
Handling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsHandling Data in Mega Scale Systems
Handling Data in Mega Scale Systems
Directi Group
 
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Yahoo Developer Network
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
Wei Ting Chen
 
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Nati Shalom
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
Flavio Vit
 
Building Low Cost Scalable Web Applications Tools & Techniques
Building Low Cost Scalable Web Applications   Tools & TechniquesBuilding Low Cost Scalable Web Applications   Tools & Techniques
Building Low Cost Scalable Web Applications Tools & Techniques
rramesh
 
Module 1- Introduction to Big Data and Hadoop
Module 1- Introduction to Big Data and HadoopModule 1- Introduction to Big Data and Hadoop
Module 1- Introduction to Big Data and Hadoop
SiddheshMhatre27
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
Nicolas Poggi
 
DrupalCampLA 2011: Drupal backend-performance
DrupalCampLA 2011: Drupal backend-performanceDrupalCampLA 2011: Drupal backend-performance
DrupalCampLA 2011: Drupal backend-performance
Ashok Modi
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
Bhupesh Bansal
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Cloudera, Inc.
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
Jon Meredith
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
Christopher Curtin
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Alluxio, Inc.
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Precisely
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
sudhakara st
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
Prashant Gupta
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
Amr Awadallah
 
Handling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsHandling Data in Mega Scale Systems
Handling Data in Mega Scale Systems
Directi Group
 
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Yahoo Developer Network
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
Wei Ting Chen
 
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web ...
Nati Shalom
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
Flavio Vit
 
Building Low Cost Scalable Web Applications Tools & Techniques
Building Low Cost Scalable Web Applications   Tools & TechniquesBuilding Low Cost Scalable Web Applications   Tools & Techniques
Building Low Cost Scalable Web Applications Tools & Techniques
rramesh
 
Module 1- Introduction to Big Data and Hadoop
Module 1- Introduction to Big Data and HadoopModule 1- Introduction to Big Data and Hadoop
Module 1- Introduction to Big Data and Hadoop
SiddheshMhatre27
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
Nicolas Poggi
 
DrupalCampLA 2011: Drupal backend-performance
DrupalCampLA 2011: Drupal backend-performanceDrupalCampLA 2011: Drupal backend-performance
DrupalCampLA 2011: Drupal backend-performance
Ashok Modi
 
Ad

More from Hadoop User Group (15)

Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentation
Hadoop User Group
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
Hadoop User Group
 
Cascalog internal dsl_preso
Cascalog internal dsl_presoCascalog internal dsl_preso
Cascalog internal dsl_preso
Hadoop User Group
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-tools
Hadoop User Group
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
Hadoop User Group
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
Hadoop User Group
 
Pig at Linkedin
Pig at LinkedinPig at Linkedin
Pig at Linkedin
Hadoop User Group
 
1 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit20101 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit2010
Hadoop User Group
 
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Hadoop User Group
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user group
Hadoop User Group
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
Hadoop User Group
 
Flightcaster Presentation Hadoop
Flightcaster  Presentation  HadoopFlightcaster  Presentation  Hadoop
Flightcaster Presentation Hadoop
Hadoop User Group
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
Hadoop User Group
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
Hadoop User Group
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
Hadoop User Group
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-tools
Hadoop User Group
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
Hadoop User Group
 
1 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit20101 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit2010
Hadoop User Group
 
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Hadoop User Group
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user group
Hadoop User Group
 
Flightcaster Presentation Hadoop
Flightcaster  Presentation  HadoopFlightcaster  Presentation  Hadoop
Flightcaster Presentation Hadoop
Hadoop User Group
 

Recently uploaded (20)

Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 

Hadoop and Voldemort @ LinkedIn

  • 1. Hadoop Voldemort @ LinkedIn Bhupesh Bansal 20 January , 2010 01/21/10
  • 2. The plan What is Project Voldemort ? Motivation New features In production Hadoop & LinkedIn LinkedIn Hadoop ecosystem Hadoop & Voldemort References Q&A
  • 3. Introduction Project Voldemort is a distributed, scalable, highly available key/value storage system. Inspired by Amazon Dynamo Paper and memcached Online storage solution which scales horizontally High throughput, low latency What does it do for you ? Provides a simple key/value APIs for client. Data partitioning and replication Provides consistency guarantees even in presence of failures Scales well (amount of data, number of clients)
  • 4. Motivation I : Big Data Proprietary & Confidential 01/21/10 Reference : algo2.iti.kit.edu/.../fopraext/index.html
  • 5. Motivation II: Data Driven Features
  • 6. Motivation III Proprietary & Confidential 01/21/10
  • 7. Motivation IV Proprietary & Confidential 01/21/10
  • 8. Why Is This Hard? Failures in a distributed system are much more complicated A can talk to B does not imply B can talk to A Nodes will fail and come back to life with stale data I/O has high request latency variance Is the node down or the node slow ? Intermittent failures are common Should I try this node now ? There are fundamental trade-offs between availability and consistency CAP theorem User must be isolated from these problems
  • 9. Some Problems we worked on lately ? Performance improvements Push computation to data (Server side views) Data compression Failure detection Reliable Testing Testing is hard, Distributed systems make it harder Voldemort Rebalancing Dynamically add nodes to a running cluster Administration Often ignored But very important
  • 10. Server side views Motivation: push computation to data Create custom views on server side for specific transformations. Avoids network transfer of big blobs Avoids serialization CPU and I/O cost Examples Range query over denormalized list stored as value. Append() operation for list values. Filters/aggregates on specific fields etc.
  • 11. Failure Detection Need to maintain up-to-date status of each server availability. Detect failures earlier Avoid waiting for failed nodes while serving. Reduce false positives Maintains proper load balancing Allowance for intermittent or temporary failures. Re-establish node availability asynchronously. Contributed by Kirk True Proprietary & Confidential 01/21/10
  • 12. EC2 based testing Testing “in the cloud” Distributed systems have to be tested on multi-node clusters Distributed systems have complex failure scenarios A storage system, above all, must be stable Automated testing allows rapid iteration while maintaining confidence in systems’ correctness and stability EC2-based testing framework Tests are invoked programmatically Adaptable to other cloud hosting providers Will run on a regular basis Contributed by Kirk True
  • 13. Coming this Jan (finally): Rebalancing Voldemort Rebalancing capability to add/delete nodes, move data around in an online voldemort cluster. Features No downtime Transparent to the client Maintain data consistency guarantees push button user interface
  • 14. Administration Monitoring View statistics (how many queries are made? How long are they taking?) Perform diagnostic operations Administrative functionalities Functionality which is needed, but shouldn’t be performed by regular store clients, example Ability to update and retrieve cluster/store metadata Efficient streaming of keys and values. Delete entries in bulk Truncate entire store Restore a node data from replicas
  • 15. Present day In Production use At LinkedIn Multiple clusters Variety of customers Outside of LinkedIn Gilt Group, KaChing, others Active developer community, inside and outside LinkedIn Monthly release cycle Continuous testing environment. Daily performance tests.
  • 16. Performance LinkedIn cluster: web event tracking logging and online lookups 6 nodes, 400 GB of data, 12 clients mixed load (67 % Get , 33 % Put) Throughput 1433 QPS (node) 4299 QPS (cluster) Latency GET 50 % percentile 0.05 ms 95 % percentile 36.07 ms 99 % percentile 60.65 ms PUT 50 % percentile 0.09 ms 95 % percentile 0.41 ms 99 % percentile 1.22 ms
  • 18. Batch Computing at Linkedin Some questions we want to answer What do we use Hadoop for ? How do we store data ? How do we manage workflows ? How do we do ETL ? How do we prototype ideas
  • 19. What do we use Hadoop for ? Proprietary & Confidential 01/21/10
  • 20. How do we store Data ? Compact, compressed, binary data (something like Avro) Type can be any combination of int, double, float, String, Map, List, Date, etc. Example member definition: { 'member_id': 'int32', ‘ first_name': 'string', ’ last_name': ’string’, ‘ age’ : ‘int32’ … } Data is stored in Hadoop as sequence files, serialized with this format The schema of data is saved in sequence files as metadata The schema is read dynamically by Java/Pig jobs on the fly.
  • 21. How do we manage workflows ? We wrote a workflow management tool (code name: Azkaban ) Dependency management (Hadoop, ETL, java, Unix jobs) Maintains a dependency directed acyclic graph All dependencies must complete before the job itself can run If a dependency fails the job itself fails Scheduling workflows can be scheduled to run on repeating schedule. Configuration system (simple properties files) GUI for visualizing and controlling job Historical logging and job success data retained for each run Alerting of failures Will be open sourced soon ( APL ) !!
  • 23. How do we do ETL ? : Getting data in Two kind of data From Databases (user data, news, jobs etc.) Need a way to get data reliably periodically Need test to verify data Support for incremental replication Our solution Transmogrify, A driver program which accepts an inputReader and outputWriter InputReader: JDBCReader, CSV Reader Output Writer: JDBCWriter, HDFS writers From web logs (page views, search, clicks etc) Weblogs files are rsynced and loaded up in HDFS Hadoop jobs for date cleaning and transformation.
  • 24. ETL II: Getting data out Batch jobs generate output in 100GBs How do we finally serve this data to user ? Some constraints Wants to show data to users ASAP Should not impact online serving Should have quick rollback capabilities. Should be horizontally scalable Should be fault tolerant (replication) High throughput, low latency
  • 25. ETL II : Getting Data Out : Existing Solutions JDBC upload to Oracle Very slow (3 days to push 20GB of data) Long running transactions cause issues. “ Load Data” statement in MySQL Tried many many tuning settings Was still very slow Almost unresponsive for online serving Memcached Need to have everything in memory No support for batch inserts Need to repush if server dies. Oracle SQL*Loader The whole process took 2-3 days 20-24 hour actual data loading time Needed a guy to baby sit the process What if we want to do this daily ? Hbase (didn’t try) Proprietary & Confidential 01/21/10
  • 26. ETL II : Getting Data Out : Our solution Index build runs 100% in Hadoop MapReduce job outputs Voldemort Stores to HDFS Job control initiates a fetch request to Voldemort nodes. Voldemort nodes copies data from Hadoop in parallel Atomic swap to make the data live Heavily optimized storage engine for read-only data I/O Throttling on the transfer to protect the live servers Our Solution Wrote a special Read-only storage engine for Voldemort Data is built on Hadoop and copied to Voldemort
  • 27. Voldemort Read only store: version I Simple File based storage engine Two files: key file and value file Key file have sorted MD5 key hash and file offset of the value file for corresponding value. Value file have value saved as size,data Advantages Index is built on hadoop, no load on production servers Files are copied in parallel to voldemort cluster Supports rollback by keeping multiple copies Proprietary & Confidential 01/21/10
  • 28. Voldemort read-only store: version II The version I read-only stores have a few issues Only one reducer per node Binary search can potentially take 32 steps Version II format Make multiple key-file/value-file pairs (multiple reducers per node) Mmap all key files Use interpolation search: keys are MD5 hashes and very uniformly distributed, so do a predicted (interpolated) probe instead of always splitting the range in half Much faster performance Future work Group values by frequency, so that frequent values stay in the operating system cache Transfer only the delta
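The interpolation step replaces the midpoint probe in the search sketched above: since MD5 hashes are nearly uniformly distributed, a key's position in the sorted file can be predicted from its numeric value, cutting the expected number of probes from log2(n) to roughly log2(log2(n)). A rough sketch of just the probe calculation, assuming keys have already been mapped to non-negative longs (e.g. the top hash bit cleared):

  public class InterpolationProbe {
      // Predict where a uniformly distributed key should sit between records
      // lo and hi (inclusive), instead of always probing the midpoint.
      static int interpolate(long key, long loKey, long hiKey, int lo, int hi) {
          if (hiKey == loKey) return lo;                // degenerate range
          double fraction = (double) (key - loKey) / (double) (hiKey - loKey);
          int probe = lo + (int) (fraction * (hi - lo));
          return Math.max(lo, Math.min(hi, probe));     // clamp into the range
      }
  }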
  • 29. Performance Three performance numbers matter here Hadoop: time to build the read-only store indexes 10 mins to build 20 GB of data File transfer time Limited only by network and disk throughput Online serving performance The read-only store is fast ( 1ms – 5ms ) The operating system page cache helps a lot Very predictable performance based on data size, disk speeds, RAM, cache hit ratios
  • 30. Batch Computing at LinkedIn
  • 31. Infrastructure at LinkedIn Last year: QA lab: 20 machines, cheap Dell Linux servers Production: 20 machines, heavy machines The "QA" cluster is for dev, analysis, and reporting Uses Pig and Hadoop streaming for prototyping ideas Ad hoc jobs compete with scheduled jobs Tried different Hadoop schedulers Production is for jobs that produce user-facing data Hired Allen Wittenauer as our Hadoop Architect in Sep 2009 100 Hadoop machines
  • 32. References Amazon Dynamo paper (DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store", SOSP 2007) project-voldemort.com NoSQL presentations at Last.fm (2009) Voldemort presentation by Jay Kreps
  • 35. Core Concepts - I ACID Great for a single centralized server CAP Theorem Consistency (strict), Availability, Partition tolerance Impossible to achieve all three at the same time in a distributed system Can choose 2 out of 3 Dynamo chooses high availability and partition tolerance, sacrificing strict consistency for eventual consistency Consistency models Strict consistency 2-phase commit Paxos: a distributed algorithm to ensure a quorum for consistency Eventual consistency Different nodes can have different views of a value In a steady state the system will return the last written value BUT it can offer much stronger guarantees (tunable via R, W, N below)
  • 36. Core Concepts - II Consistent hashing The key space is partitioned into many small partitions Partitions never change Partition ownership can change Replication Each partition is stored by 'N' nodes Node failures Transient (short term) Long term Needs faster bootstrapping
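A minimal sketch of that scheme: keys hash to one of a fixed set of partitions, and each partition maps to a preference list of N nodes. The MD5-based hash and the simple walk-forward preference list below are simplifying assumptions, not Voldemort's exact routing strategy:

  import java.security.MessageDigest;
  import java.util.ArrayList;
  import java.util.List;

  public class PartitionRouter {
      private final int numPartitions;  // fixed for the life of the cluster
      private final int numNodes;
      private final int replication;    // N

      public PartitionRouter(int numPartitions, int numNodes, int n) {
          this.numPartitions = numPartitions;
          this.numNodes = numNodes;
          this.replication = n;
      }

      // Key -> partition never changes; only partition -> node ownership
      // moves during rebalancing, which is what keeps rebalancing cheap.
      public int partitionFor(byte[] key) throws Exception {
          byte[] h = MessageDigest.getInstance("MD5").digest(key);
          int v = ((h[0] & 0xFF) << 24) | ((h[1] & 0xFF) << 16)
                | ((h[2] & 0xFF) << 8) | (h[3] & 0xFF);
          return Math.floorMod(v, numPartitions);
      }

      // Preference list: walk forward from the partition's primary node
      // until N distinct nodes have been collected.
      public List<Integer> nodesFor(byte[] key) throws Exception {
          int p = partitionFor(key);
          List<Integer> nodes = new ArrayList<>();
          for (int i = 0; nodes.size() < replication && i < numNodes; i++) {
              int node = (p + i) % numNodes;
              if (!nodes.contains(node)) nodes.add(node);
          }
          return nodes;
      }
  }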
  • 37. Core Concepts - III N - the replication factor R - the number of blocking reads W - the number of blocking writes If R + W > N then we have a quorum-like algorithm Guarantees that we will read the latest write OR fail R, W, N can be tuned for different use cases W = 1: highly available writes R = 1: read-intensive workloads Knobs to tune performance, durability and availability
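For example, with N = 3, R = 2 and W = 2, every read set of 2 replicas must overlap every write set of 2 replicas (2 + 2 > 3), so at least one replica in any read holds the latest write. The invariant is cheap to validate when a store is configured, as in this small sketch:

  public class QuorumConfig {
      // Quorum invariant: any R-node read set must intersect any W-node
      // write set, which holds exactly when R + W > N.
      static void validateQuorum(int n, int r, int w) {
          if (r + w <= n) {
              throw new IllegalArgumentException(
                  "R + W must exceed N to guarantee reading the latest write");
          }
      }
  }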
  • 38. Core Concepts - IV Vector clocks [Lamport] provide a way to order events in a distributed system A vector clock is a tuple {t1, t2, ..., tn} of counters Each value update has a master node When data is written with master node i, it increments ti All the replicas will receive the same version Helps resolve conflicts between writes on multiple replicas If the network partitions, you can have a case where two vector clocks are not comparable In this case Voldemort returns both values to clients for conflict resolution
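A minimal sketch of that comparison logic, assuming integer node ids; this is an illustration, not Voldemort's actual VectorClock class:

  import java.util.HashMap;
  import java.util.Map;

  public class VectorClock {
      public enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

      // node id -> counter; absent entries count as zero
      private final Map<Integer, Long> counters = new HashMap<>();

      // The master node for a write bumps only its own counter.
      public void increment(int nodeId) {
          counters.merge(nodeId, 1L, Long::sum);
      }

      // Compare component-wise: if each side is ahead on some node,
      // the clocks are concurrent and the conflict goes to the client.
      public Order compare(VectorClock other) {
          boolean ahead = false, behind = false;
          Map<Integer, Long> all = new HashMap<>(counters);
          other.counters.forEach((k, v) -> all.putIfAbsent(k, 0L));
          for (int node : all.keySet()) {
              long a = counters.getOrDefault(node, 0L);
              long b = other.counters.getOrDefault(node, 0L);
              if (a > b) ahead = true;
              if (a < b) behind = true;
          }
          if (ahead && behind) return Order.CONCURRENT;
          if (ahead) return Order.AFTER;
          if (behind) return Order.BEFORE;
          return Order.EQUAL;
      }
  }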
  • 41. Client API Data is organized into "stores", i.e. tables Key-value only But values can be arbitrarily rich or complex Maps, lists, nested combinations … Four operations PUT (Key K, Value V) GET (Key K) MULTI-GET (Iterator<Key> K) DELETE (Key K) / (Key K, Version ver) No range scans
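In Java the API looks roughly like the example below. The bootstrap URL and store name are placeholders; the class names follow Voldemort's published client API of this era, but treat the details as a sketch:

  import voldemort.client.ClientConfig;
  import voldemort.client.SocketStoreClientFactory;
  import voldemort.client.StoreClient;
  import voldemort.client.StoreClientFactory;
  import voldemort.versioning.Versioned;

  public class ClientExample {
      public static void main(String[] args) {
          // Bootstrap from any node; the client fetches cluster metadata itself.
          StoreClientFactory factory = new SocketStoreClientFactory(
              new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
          StoreClient<String, String> client = factory.getStoreClient("test");

          client.put("key", "value");               // PUT
          Versioned<String> v = client.get("key");  // GET returns value + version
          System.out.println(v.getValue());
          client.delete("key");                     // DELETE
      }
  }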
  • 42. Versioning & Conflict Resolution Eventual consistency allows multiple versions of a value Need a way to understand which value is the latest Need a way to say values are not comparable Solutions Timestamps Vector clocks Provide an ordering of events No locking or blocking necessary
  • 43. Serialization Really important A few considerations Schema free? Backward/forward compatible Real-life data structures Bytes <=> objects <=> strings? Size (no XML) Many ways to do it -- we allow anything Compressed JSON, Protocol Buffers, Thrift, Voldemort custom serialization
  • 44. Routing The routing layer hides a lot of complexity Hashing schema Replication (N, R, W) Failures Read repair (online repair mechanism) Hinted handoff (long-term recovery mechanism) Easy to add domain-specific strategies E.g. only do synchronous operations on nodes in the local data center Client side / server side / hybrid
  • 46. Routing With Failures Failure detection requirements Needs to be very fast The view of server state may be inconsistent A can talk to B but C cannot A can talk to C, B can talk to A but not to C Currently done by the routing layer (request timeouts) Periodically retries failed nodes All requests must have hard SLAs Other possible solutions Central server Gossip protocol Need to look more into this.
  • 47. Repair Mechanism Read repair Online repair mechanism The routing client receives values from multiple nodes Notify a node if you see an old value Only works for keys which are read after failures Hinted handoff If a write fails, write it to any random node and mark it as a special write Each node periodically tries to get rid of all special entries Bootstrapping mechanism (we don't have it yet) If a node was down for a long time, hinted handoff can generate a ton of traffic Need a better way to bootstrap and clear hinted-handoff tables
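The hinted-handoff idea can be sketched as: a failed write is parked on a reachable node together with a note naming its intended destination, and a background task later replays it. Every class and method name below is hypothetical:

  import java.util.Queue;
  import java.util.concurrent.ConcurrentLinkedQueue;

  public class HintedHandoff {
      // A "special write": the data plus the node that should have received it.
      record Hint(int destinationNode, byte[] key, byte[] value) {}

      private final Queue<Hint> hints = new ConcurrentLinkedQueue<>();

      // Called when a write to destinationNode fails: park it locally.
      public void store(int destinationNode, byte[] key, byte[] value) {
          hints.add(new Hint(destinationNode, key, value));
      }

      // Run periodically: replay hints whose destination is reachable again.
      public void deliver(Cluster cluster) {
          for (Hint h : hints) {
              if (cluster.isUp(h.destinationNode())
                      && cluster.put(h.destinationNode(), h.key(), h.value())) {
                  hints.remove(h);
              }
          }
      }

      // Hypothetical cluster interface used by the sketch above.
      interface Cluster {
          boolean isUp(int node);
          boolean put(int node, byte[] key, byte[] value);
      }
  }

The traffic problem on the slide follows directly: a node that is down for a long time accumulates hints on every other node, which is why a separate bootstrapping mechanism is wanted.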
  • 48. Network Layer The network is the major bottleneck in many uses Client performance turns out to be harder than server performance (the client must wait!) Lots of issues with socket buffer sizes and socket pooling The server is also a client Two implementations HTTP + servlet container Simple socket protocol + custom server The HTTP server is great, but the HTTP client is 5-10x slower The socket protocol is what we use in production Recently added a non-blocking version of the server
  • 49. Persistence Single-machine key-value storage is a commodity Plugins are better than tying yourself to a single strategy Different use cases Optimize reads Optimize writes Large vs small values SSDs may completely change this layer A couple of different options BDB, MySQL and mmap'd file implementations Berkeley DB is the most popular In-memory plugin for testing B-trees are still the best all-purpose structure No flush on write is a huge, huge win
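The plugin boundary can be as small as a get/put/delete interface that each backend implements. A simplified sketch, not Voldemort's actual StorageEngine interface (which also exposes versioning and iteration):

  import java.util.HashMap;
  import java.util.Map;

  // Simplified storage plugin boundary: BDB, MySQL, mmap'd files and the
  // in-memory test engine would each be one implementation of this.
  interface StorageEngine<K, V> {
      V get(K key);
      void put(K key, V value);
      boolean delete(K key);
  }

  // The in-memory engine used for testing is nearly trivial.
  class InMemoryEngine<K, V> implements StorageEngine<K, V> {
      private final Map<K, V> map = new HashMap<>();
      public V get(K key) { return map.get(key); }
      public void put(K key, V value) { map.put(key, value); }
      public boolean delete(K key) { return map.remove(key) != null; }
  }

Because the rest of the stack only talks to this interface, swapping BDB for an SSD-oriented engine is a configuration change rather than a rewrite.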

Editor's Notes

  • #2: Thanks all, excited to talk to you
  • #3: Core concepts (optional) Implementation (optional)
  • #4: Key and value can be arbitrarily complex, they are serialized and can
  • #6: Statistical learning as the ultimate agile development tool (Peter Norvig), “business logic” through data rather than code
  • #10: Reference to previous presentation and amazon dynamo model
  • #11: Simple configuration to use a compressed store
  • #13: EC2 testing should be ready in next few weeks
  • #14: Main project for Bhupesh and me Minimal and tunable performance for the cluster.
  • #15: Will be in next release, happens November 15th
  • #17: Client bound
  • #19: Give Azkaban demo here Show HelloWorldJob Show hello1.properties Show dependencies with graph job Started out as a week project, was much more complex than we realized
  • #21: Example: member data--does not make sense to repeatedly join positions, emails, groups, etc. Explain about joins How to better model in java? Json like data model
  • #24: Period can be daily, hourly depending on need and database availability.
  • #27: Utilizes Hadoop power for computation intensive index build Provides Voldemort online serving advantages.
  • #28: Voldemort storage engine is very fast Key lookup using binary search Value lookup using single seek in value file. Operation system cache optimizes .
  • #30: Client bound
  • #34: Questions, comments, etc
  • #35: Switch to Bhup
  • #36: - Strong consistency: all clients see the same view, even in the presence of updates - High availability: all clients can find some replica of the data, even in the presence of failures - Partition tolerance: the system properties hold even when the system is partitioned High availability is the mantra for websites: better to deal with inconsistencies, because their primary need is to scale well to allow for a smooth user experience.
  • #37: Hashing: why do we need it? Basic problem: clients need to know which data lives where. Many ways of solving it: central configuration, or hashing. Plain (linear) hashing works, but the issue is a dynamic cluster: the key-hash-to-node mapping changes for a lot of entries when you add new slots. Consistent hashing preserves the key-to-node mapping for most keys and only changes the minimal amount needed. How to do it? Number of partitions: arbitrary; each node is allocated many partitions (better load balancing and fault tolerance), a few hundred to a few thousand. The key-to-partition mapping is fixed and only ownership of partitions can change.
  • #39: Fancy way of doing Optimistic locking
  • #41: Will discuss each layer in more detail Layered design One interface for all layers: put/get/delete Each layer decorates the next Very flexible Easy to test Client API: very basic API, just provides the raw interface to the user Conflict resolution layer: handles all the versioning issues and provides hooks to supply a custom conflict resolution strategy Serialization: Object <=> Data Network layer: depending on configuration can fit either here or below; its main job is to handle the network interface, socket pooling and other performance-related optimizations Routing layer: handles and hides many details from the client/user: hashing schema, failed nodes, replication, required reads/required writes Storage engine: handles disk persistence
  • #42: Very simple APIs No range scans... no iterator on the key set / entry set: very hard to fix performance Have plans to provide such an iterator
  • #43: Give example of read and writes with vector clocks Pros and cons vs paxos and 2pc User can supply strategy for handling cases where v1 and v2 are not comparable.
  • #44: Avro is good, but new and unreleased Storing data is really different from an RPC IDL Data is forever (some data) Inverted index Threaded comment tree Don't hardcode serialization Don't just do byte[] -- checks are good, many features depend on knowing what the data is XML profile
  • #46: Explain about partitions Make things fast by removing slow things, not by tuning HTTP client not performant Separate caching layer
  • #47: Client v. server - client is better - but harder
  • #50: You can write an implementation very easily We support plugins so you can run this backend without being in the main code base