Big Data and Me Bhupesh Bansal Feb 3, 2012
Relational Model Architecture Reference: http://www.slideshare.net/adorepump/voldemort-nosql
LinkedIn 2006 Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
Relational model
  • The relational model is a triumph of computer science:
      • General
      • Concise
      • Well understood
  • But then again:
      • SQL is a pain
      • Hard to build reusable data structures
      • Hides performance issues/details
Specialized Systems Architecture Reference: http://www.slideshare.net/adorepump/voldemort-nosql
LinkedIn 2007 Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
Specialized systems
  • Specialized systems are efficient (10-100x)
      • Search: inverted index
      • Offline: Hadoop, Teradata, Oracle DWH
      • Memcached
      • In-memory systems (social graph)
  • Specialized systems are scalable
  • New data and problems
      • Graphs, sequences, and text
Batch Driven Architecture Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Motivation I: Big Data 02/06/12 Reference: algo2.iti.kit.edu/.../fopraext/index.html
Motivation II: Data Driven Features
Motivation III: Makes Money  02/06/12 Proprietary & Confidential
Motivation IV: Big Data is cool
Reference: http://www.slideshare.net/BenSiscovick/the-business-of-big-data-ia-ventures-8577588
Big Data Challenges
  • Large-scale data processing
      • Use all available signals, e.g. weblogs, social signals (Twitter/Facebook/LinkedIn)
  • Data-driven applications
      • Refine the data and push it back to the user for consumption
  • Near-real-time feedback loop
      • Keep continuously improving
Why is this hard?
  • Large-scale data processing
      • TB/PB of data
      • Traditional storage systems cannot handle the scale
  • Data-driven applications
      • Need to run complex machine learning algorithms at this data scale
      • Near-real-time analysis improves application performance and usage
Some good news!
  • Hadoop
      • Biggest single driver of the large-scale data economy
      • Scales, works, easy to use
  • Memcached
      • Works, scales, and is fast
  • Open source world
      • Lots of awesome people working on awesome systems, e.g. HBase, Memcached, Voldemort, Kafka, Mahout
  • Sharing across companies
      • Common practices and knowledge are shared across companies
What works!
  • Simplicity
      • Go with the simplest design possible
  • Near real time
      • Async/batch processing
      • Push computation to the background as much as possible
  • Duplicate data everywhere
      • Build a customized solution for each problem
      • Duplicate data as needed
  • Data river
      • Publish events and let all systems consume at their own pace
  • Monitoring/alerting
      • Keep a close eye on things and build a strong dev-ops team
What doesn't work!
  • Magic systems
      • Auto-configuration, auto-tuning
      • Very hard to get right; prefer easy configuration and better monitoring
  • Open source
      • If not supported by a strong internal engineering team
      • Be ready to have people spend 30-40% of their time understanding and helping with open source components
  • Silver bullets
      • One system to solve all scaling problems, e.g. HBase
      • Build separate systems for separate problems
  • Central data source
      • Don't lock your data; let it flow
      • Use Kafka, Scribe, or any publish/subscribe system
Open source
  • Very, very important for any company today
  • Do not reinvent the wheel
      • Do not write a line of code if it is not needed
  • 90/10 rule
      • Pick up open source solutions, fix what is broken
  • Big plus for hiring
  • Stand on the shoulders of the crowd
Open source: Storage
  • Problem: you want to store TBs of data for user consumption in real time
      • Latency < 50 ms
      • Scale: 10,000+ QPS
  • Solutions
      • Bigtable design, e.g. HBase
      • Amazon Dynamo design, e.g. Voldemort
      • Cache with persistence, e.g. Membase
      • Document-based storage, e.g. MongoDB
Open source: Publish/Subscribe
  • Problem: a data river from which all other systems get their feed
  • Solutions
      • Strong data guarantees, e.g. ActiveMQ, RabbitMQ, HornetQ
      • Log feeds, e.g. Scribe, Flume
      • Kafka: a great mix of both worlds
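The "data river" idea, where producers publish once and every system consumes at its own pace, can be sketched with an append-only log and per-consumer offsets. This is a toy in-memory illustration, not the API of Kafka or any other system named above; `EventLog`, `Consumer`, and the event fields are hypothetical names.

```python
class EventLog:
    """Append-only list of events; consumers read it by offset."""

    def __init__(self):
        self._events = []

    def publish(self, event):
        """Append an event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, offset, max_events):
        """Return up to max_events events starting at offset."""
        return self._events[offset:offset + max_events]


class Consumer:
    """Tracks its own offset, so a slow consumer never blocks a fast one."""

    def __init__(self, log, batch_size):
        self.log = log
        self.batch_size = batch_size
        self.offset = 0
        self.seen = []

    def poll(self):
        """Fetch the next batch and advance this consumer's offset."""
        batch = self.log.read(self.offset, self.batch_size)
        self.offset += len(batch)
        self.seen.extend(batch)
        return batch


log = EventLog()
for i in range(5):
    log.publish({"type": "page_view", "member": i})

search_indexer = Consumer(log, batch_size=5)  # keeps up in a single poll
batch_loader = Consumer(log, batch_size=2)    # falls behind, catches up later

search_indexer.poll()
batch_loader.poll()
print(search_indexer.offset, batch_loader.offset)  # 5 2
```

Because the log keeps the events and each consumer only remembers an offset, the slower consumer can keep polling until it has seen the same events as the faster one.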
Open source: Real-time analysis
  • Problem: analyze a stream of data and do simple analysis/reporting
  • Solutions
      • Splunk: general-purpose but high-maintenance, expensive analysis tool
      • OpenTSDB: simple but scalable metrics reporting
      • Yahoo S4 / Twitter Storm: online and MapReduce-ish; new systems that will need lots of love and care
Open source: Search
  • Problem: unstructured queries on data
  • Solutions
      • Lucene: the most tested and most common search option (but just a library)
      • Solr: old system with lots of users but bad design
      • Elasticsearch: very well designed but new system
      • LinkedIn's open source search systems: SenseiDB, Zoie
Open source: Batch computation
  • Problem: you want to process TBs of data
  • The solution is simple: use Hadoop
  • Hadoop workflow managers: Azkaban, Oozie
  • Query: native Java code, Cascading, Hive, Pig
Open source: Other
  • Serialization: Avro, Thrift, Protocol Buffers
  • Compression: Snappy, LZO
  • Monitoring: Ganglia
My personal picks!
  • Storage
      • Pure key-value lookup: Voldemort
      • Range queries, Hadoop job support: HBase
      • Batch-generated read-only data serving: Voldemort
  • Publish/Subscribe: HornetQ or Kafka
  • Search: Elasticsearch
  • Hadoop: Azkaban, Hive, and native Java code
Jeff Dean's Thoughts
  • Very practical advice on building good, reliable distributed systems
  • Highlights
      • Back-of-the-envelope calculations
      • Understand your base numbers well
      • Scale for 10x, not 100x
      • Embrace chaos/failure and design around it
      • Monitoring/status hooks at all levels
      • Important not to try to be all things to everybody
Reference: http://www.slideshare.net/xlight/google-designs-lessons-and-advice-from-building-large-distributed-systems
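As one illustration of a back-of-the-envelope calculation, the sketch below takes the targets quoted on the storage slide (10,000 QPS, latency under 50 ms) and derives two rough numbers from them. The variable names are illustrative, and applying Little's law here is an assumption about how one might reason, not something the talk spells out.

```python
# Targets taken from the storage slide; everything derived is a rough estimate.
target_qps = 10_000       # requests per second
latency_budget_ms = 50    # upper bound on latency per request

# Little's law: requests in flight = arrival rate x time spent in the system.
max_in_flight = target_qps * latency_budget_ms // 1000

# If requests were served strictly one at a time, this is how long each
# one could take while still keeping up with the arrival rate.
serial_budget_us = 1_000_000 // target_qps

print(f"up to {max_in_flight} requests in flight at the latency bound")
print(f"{serial_budget_us} microseconds per request if served serially")
```

Numbers like these show early on that the target depends on concurrency, which is the kind of insight a quick estimate is meant to give before any system is built.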
How was Voldemort born? References: 1) http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010 2) http://www.slideshare.net/adorepump/voldemort-nosql
Why NoSQL?
  • TBs of data
      • Sharding is the only way to scale
      • No joins possible (data is split across machines)
  • Specialized systems, e.g. search and network feed, break the relational model
      • Constraints, triggers, etc. disappear
      • Lots of denormalization
  • Latency is key
      • Relational DBs depend on a caching layer to achieve high throughput and low latency
Inspired by Amazon Dynamo & Memcached
  • Amazon's Dynamo storage system
      • Works across data centers
      • Eventual consistency
      • Commodity hardware
  • Memcached
      • Actually works
      • Really fast
      • Really simple
ACID vs. CAP
  • ACID
      • Great for a single centralized server
  • CAP theorem
      • Consistency (strict), availability, partition tolerance
      • Impossible to achieve all three at the same time in a distributed platform
      • Can choose 2 out of 3
  • Dynamo chooses high availability and partition tolerance, relaxing strict consistency to eventual consistency
Consistent Hashing
  • Key space is partitioned
      • Many small partitions
  • Partitions never change
      • Partition ownership can change
  • Replication
      • Each partition is stored by N nodes
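A minimal sketch of the partitioning scheme described above. The MD5-based hash, the 16 partitions, and the node names are all assumptions made for illustration; none of them come from the talk, and Voldemort's actual routing code may differ.

```python
import hashlib

NUM_PARTITIONS = 16  # assumed value; fixed for the life of the cluster


def partition_for(key):
    """Map a key (bytes) to one of the fixed partitions."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def preference_list(partition, owners, n):
    """Walk the ring from `partition` and collect the first n distinct nodes."""
    nodes = []
    for step in range(len(owners)):
        node = owners[(partition + step) % len(owners)]
        if node not in nodes:
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes


# Ownership table: index = partition id, value = hypothetical owning node.
owners = [f"node{p % 4}" for p in range(NUM_PARTITIONS)]

p = partition_for(b"member:42")
replicas = preference_list(p, owners, n=3)
print(p, replicas)
```

Rebalancing in this sketch means editing entries of `owners`; `partition_for` is untouched, so a key always maps to the same partition even when that partition moves to another node.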
R + W > N
  • N: the replication factor
  • R: the number of blocking reads
  • W: the number of blocking writes
  • If R + W > N, then we have a quorum-like algorithm
      • Guarantees that we will read the latest write or fail
  • R, W, N can be tuned for different use cases
      • W = 1: highly available writes
      • R = 1: read-intensive workloads
  • Knobs to tune performance, durability, and availability
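The quorum condition can be checked by brute force for small clusters: R + W > N holds exactly when every possible set of R replicas shares at least one member with every possible set of W replicas. The function name below is illustrative, and the sketch ignores failures and timing; it only shows why the inequality guarantees overlap.

```python
from itertools import combinations


def reads_always_see_latest_write(n, r, w):
    """True if every set of r replicas overlaps every set of w replicas."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


for n, r, w in [(3, 2, 2), (3, 1, 1), (3, 1, 3), (3, 3, 1)]:
    print(n, r, w, r + w > n, reads_always_see_latest_write(n, r, w))
```

With N = 3, the setting R = 1, W = 1 fails the check, which is the price of the fastest reads and writes, while R = 1, W = 3 and R = 3, W = 1 both pass.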
Versioning & Conflict Resolution
  • Eventual consistency allows multiple versions of a value
      • Need a way to understand which value is latest
      • Need a way to say values are not comparable
  • Solutions
      • Timestamps
      • Vector clocks
          • Provide a causal (partial) ordering of versions
          • No locking or blocking necessary
Vector Clock
  • A vector clock (an extension of Lamport's logical clocks) provides a way to order events in a distributed system
  • A vector clock is a tuple {t1, t2, ..., tn} of counters
  • Each value update has a master node
      • When data is written with master node i, it increments ti
      • All the replicas will receive the same version
      • Helps resolve consistency between writes on multiple replicas
  • If you get network partitions
      • You can have a case where two vector clocks are not comparable
      • In this case Voldemort returns both values to clients for conflict resolution
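The comparison rules above can be sketched as follows. Clocks are represented as dictionaries from node name to counter, which is a choice made here for brevity; the function names and node names are hypothetical and this is not Voldemort's implementation.

```python
def increment(clock, node):
    """Return a copy of `clock` with the counter of the master `node` bumped."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for clocks a and b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # not comparable: both values go back to the client


v1 = increment({}, "A")        # first write, mastered by node A
v2 = increment(v1, "A")        # second write through the same master
left = increment(v1, "B")      # during a partition, B accepts a write...
right = increment(v1, "C")     # ...and so does C

print(compare(v1, v2))       # before
print(compare(left, right))  # concurrent
```

The `concurrent` result corresponds to the case on the slide where neither version supersedes the other and the client has to resolve the conflict.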
Client API
  • Data is organized into "stores", i.e. tables
  • Key-value only
      • But values can be arbitrarily rich or complex
      • Maps, lists, nested combinations, ...
  • Four operations
      • PUT (Key K, Value V)
      • GET (Key K)
      • MULTI-GET (Iterator<Key> K)
      • DELETE (Key K) / (Key K, Version ver)
  • No range scans
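To make the shape of the four operations concrete, here is a toy in-memory store. The class and method names are hypothetical and are not Voldemort's actual client API; versions are plain integers here rather than vector clocks, and nothing is distributed or persisted.

```python
class InMemoryStore:
    """Toy single-process stand-in for one named store."""

    def __init__(self, name):
        self.name = name
        self._data = {}  # key -> (version, value)

    def put(self, key, value):
        """Store value under key and return the new integer version."""
        version = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (version, value)
        return version

    def get(self, key):
        """Return (version, value), or None if the key is absent."""
        return self._data.get(key)

    def multi_get(self, keys):
        """Return a dict of key -> (version, value) for the keys that exist."""
        return {k: self._data[k] for k in keys if k in self._data}

    def delete(self, key, version=None):
        """Delete key; if version is given, only delete when it matches."""
        current = self._data.get(key)
        if current is None:
            return False
        if version is not None and current[0] != version:
            return False
        del self._data[key]
        return True


members = InMemoryStore("members")
members.put("m1", {"name": "Ada", "skills": ["search", "hadoop"]})
members.put("m2", {"name": "Lin", "skills": []})
latest = members.put("m1", {"name": "Ada", "skills": ["search"]})

print(members.get("m1"))
print(sorted(members.multi_get(["m1", "m2", "m3"])))
print(members.delete("m1", version=latest - 1))  # stale version, no delete
print(members.delete("m1", version=latest))
```

Values are ordinary nested maps and lists, matching the "arbitrarily rich" values on the slide, and there is deliberately no method for scanning a range of keys.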
Voldemort Physical Deployment
 
Read-only storage engine
  • Throughput vs. latency
  • Index building done in Hadoop
  • Fully parallel transfer
  • Very efficient on-disk structure
  • Heavy reliance on OS page cache
  • Rollback!
Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
What do we use Hadoop/Voldemort for?
Batch Driven Architecture Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Data Flow Driven Architecture Reference: http://sna-projects.com/blog/2011/08/kafka/
Questions
Bhupeshbansal bigdata

  • 1. Big Data and Me Bhupesh Bansal Feb 3, 2012
  • 2. Relational Model Architecture. Reference: http://www.slideshare.net/adorepump/voldemort-nosql
  • 3. LinkedIn 2006. Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
  • 4. Relational model. The relational model is a triumph of computer science: general, concise, well understood. But then again: SQL is a pain; it is hard to build reusable data structures; it hides performance issues/details.
  • 5. Specialized Systems Architecture. Reference: http://www.slideshare.net/adorepump/voldemort-nosql
  • 6. LinkedIn 2007. Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
  • 7. Specialized systems. Specialized systems are efficient (10-100x): search (inverted index); offline (Hadoop, Teradata, Oracle DWH); Memcached; in-memory systems (social graph). Specialized systems are scalable. New data and problems: graphs, sequences, and text.
  • 8. Batch Driven Architecture. Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
  • 9. Motivation I : Big Data 02/06/12 Reference : algo2.iti.kit.edu/.../fopraext/index.html
  • 10. Motivation II: Data Driven Features
  • 11. Motivation III: Makes Money 02/06/12 Proprietary & Confidential
  • 12. Motivation IV: Big Data is cool
  • 13. Reference: http://www.slideshare.net/BenSiscovick/the-business-of-big-data-ia-ventures-8577588
  • 14. Big Data Challenges. Large-scale data processing: use all available signals, e.g. weblogs and social signals (Twitter/Facebook/LinkedIn). Data-driven applications: refine data and push it back to the user for consumption. Near-real-time feedback loop: keep continuously improving.
  • 15. Why is this hard? Large-scale data processing: TB/PB of data; traditional storage systems cannot handle the scale. Data-driven applications: need to run complex machine learning algorithms at this data scale; near-real-time analysis improves application performance and usage.
  • 16. Some good news! Hadoop: the biggest single driver of the large-scale data economy; scales, works, easy to use. Memcached: works, scales, and is fast. Open source world: lots of awesome people working on awesome systems, e.g. HBase, memcached, Voldemort, Kafka, Mahout, etc. Sharing across companies: common practices and knowledge are shared across companies.
  • 17. What works! Simplicity: go with the simplest design possible. Near-real-time, async/batch processing: push computation to the background as much as possible. Duplicate data everywhere: build a customized solution for each problem; duplicate data as needed. Data river: publish events and let all systems consume at their own pace. Monitoring/alerting: keep a close eye on things and build a strong dev-ops team.
  • 18. What doesn't work! Magic systems (auto-configure, auto-tuning): very hard to get right; instead have easy configuration and better monitoring. Open source, if not supported by a strong internal engineering team: be ready to have folks spend 30-40% of their time understanding and helping open source components. Silver bullets: one system to solve all scaling problems, e.g. HBase; build separate systems for separate problems. Central data source: don't lock your data, let it flow; use Kafka, Scribe, or any publish/subscribe system.
  • 19. Open source. Very, very important for any company today. Do not reinvent the wheel: do not write a line of code if not needed. 90/10 rule: pick up open source solutions, fix what is broken. Big plus for hiring. Stand on the shoulders of the crowd.
  • 20. Open source: Storage. Problem: you want to store TBs of data for user consumption in real time; latency < 50 ms; scale 10,000+ QPS. Solutions: Bigtable design, e.g. HBase; Amazon Dynamo design, e.g. Voldemort; cache with persistence, e.g. Membase; document-based storage, e.g. MongoDB.
  • 21. Open source: Publish/Subscribe. Problem: a data river from which all other systems get their feed. Solutions: strong data guarantees, e.g. ActiveMQ, RabbitMQ, HornetQ; log feeds, e.g. Scribe, Flume; Kafka, a great mix of both worlds.
  • 22. Open source: Real-time analysis. Problem: analyze a stream of data and do simple analysis/reporting. Solutions: Splunk, a general-purpose but high-maintenance, expensive analysis tool; OpenTSDB, simple but scalable metrics reporting; Yahoo S4/Twitter Storm, online map-reduce-ish. New systems will need lots of love and care.
  • 23. Open source: Search. Problem: unstructured queries on data. Solutions: Lucene, the most tested and common search library (but just a library); Solr, an old system with lots of users but bad design; Elasticsearch, very well designed but a new system; LinkedIn's open source search systems, SenseiDB and Zoie.
  • 24. Open source: Batch computation. Problem: you want to process TBs of data. The solution is simple: use Hadoop. Hadoop workflow managers: Azkaban, Oozie. Query: native Java code, Cascading, Hive, Pig.
  • 25. Open source: Other. Serialization: Avro, Thrift, Protocol Buffers. Compression: Snappy, LZO. Monitoring: Ganglia.
  • 26. My personal picks! Storage: pure key-value lookup: Voldemort; range queries and Hadoop job support: HBase; batch-generated read-only data serving: Voldemort. Publish/Subscribe: HornetQ or Kafka. Search: Elasticsearch. Hadoop: Azkaban, Hive, and native Java code.
  • 27. Jeff Dean's Thoughts. Very practical advice on building good, reliable distributed systems. Highlights: back-of-the-envelope calculations; understand your base numbers well; scale for 10x, not 100x; embrace chaos/failure and design around it; monitor/status hooks at all levels; important not to try to be all things for everybody. Reference: http://www.slideshare.net/xlight/google-designs-lessons-and-advice-from-building-large-distributed-systems
  • 28. How was Voldemort born? References: 1) http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010 2) http://www.slideshare.net/adorepump/voldemort-nosql
  • 29. Why NoSQL? TBs of data: sharding is the only way to scale; no joins are possible (data is split across machines). Specialized systems, e.g. search and network feed, break the relational model. Constraints, triggers, etc. disappear. Lots of denormalization. Latency is key: relational DBs depend on a caching layer to achieve high throughput and low latency.
  • 30. Inspired by Amazon Dynamo & Memcached. Amazon's Dynamo storage system: works across data centers; eventual consistency; commodity hardware. Memcached: actually works; really fast; really simple.
  • 31. ACID vs CAP. ACID: great for a single centralized server. CAP theorem: consistency (strict), availability, partition tolerance; it is impossible to achieve all three at the same time in a distributed platform, so you can choose 2 out of 3. Dynamo chooses high availability and partition tolerance, relaxing strict consistency to eventual consistency.
  • 32. Consistent Hashing. Key space is partitioned: many small partitions; partitions never change; partition ownership can change. Replication: each partition is stored by N nodes.
  • 33. R+W > N. N: the replication factor. R: the number of blocking reads. W: the number of blocking writes. If R+W > N, then we have a quorum-like algorithm, which guarantees that we will read the latest writes or fail. R, W, and N can be tuned for different use cases: W = 1 for highly available writes; R = 1 for read-intensive workloads. These are knobs to tune performance, durability, and availability.
  • 34. Versioning & Conflict Resolution. Eventual consistency allows multiple versions of a value: need a way to understand which value is latest; need a way to say values are not comparable. Solutions: timestamps; vector clocks, which order causally related versions and flag concurrent ones as not comparable. No locking or blocking necessary.
  • 35. Vector Clock. A vector clock [Lamport] provides a way to order events in a distributed system. A vector clock is a tuple {t1, t2, ..., tn} of counters. Each value update has a master node: when data is written with master node i, it increments ti. All the replicas will receive the same version. This helps resolve consistency between writes on multiple replicas. If you get network partitions, you can have a case where two vector clocks are not comparable; in this case Voldemort returns both values to clients for conflict resolution.
  • 36. Client API. Data is organized into "stores", i.e. tables. Key-value only, but values can be arbitrarily rich or complex: maps, lists, nested combinations, … Four operations: PUT (Key K, Value V); GET (Key K); MULTI-GET (Iterator<Key> K); DELETE (Key K) / (Key K, Version ver). No range scans.
  • 38.  
  • 39. Read-only storage engine. Throughput vs. latency. Index building done in Hadoop. Fully parallel transfer. Very efficient on-disk structure. Heavy reliance on OS page cache. Rollback! Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
  • 40. What do we use Hadoop/Voldemort for?
  • 41. Batch Driven Architecture. Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
  • 42. Data Flow Driven Architecture. Reference: http://sna-projects.com/blog/2011/08/kafka/
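
The partitioning, replication, and quorum ideas from the Consistent Hashing and R+W > N slides can be sketched roughly as follows. This is an illustrative toy, not Voldemort's actual code: the partition count, the MD5-based hash, the round-robin assignment, and every function name here are assumptions made for the example.

```python
import hashlib
from typing import Dict, List

# Assumption: a tiny fixed partition count for readability. The slides only say
# "many small partitions" that never change once chosen.
NUM_PARTITIONS = 16


def partition_for(key: str) -> int:
    """Map a key to a partition. This mapping never changes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def assign_round_robin(nodes: List[str]) -> Dict[int, str]:
    """Initial partition ownership: partition p belongs to nodes[p % len(nodes)]."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}


def add_node(owners: Dict[int, str], new_node: str) -> Dict[int, str]:
    """Hand a share of partitions to new_node; every other partition keeps its owner."""
    updated = dict(owners)
    node_count = len(set(owners.values())) + 1
    for p in range(0, NUM_PARTITIONS, node_count):
        updated[p] = new_node
    return updated


def replicas_for(key: str, owners: Dict[int, str], n: int) -> List[str]:
    """Walk the partition ring from the key's partition, collecting n distinct nodes."""
    start = partition_for(key)
    chosen: List[str] = []
    for step in range(NUM_PARTITIONS):
        node = owners[(start + step) % NUM_PARTITIONS]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """R + W > N means every read set overlaps every write set in at least one replica."""
    return r + w > n
```

Only partition ownership moves when a node joins; `partition_for` is untouched, so clients never need to rehash keys. With N = 3, choosing R = 2 and W = 2 satisfies the overlap condition, while R = 1 and W = 1 trades it away for lower latency.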

Editor's Notes

  • #4: Example: member data: it does not make sense to repeatedly join positions, emails, groups, etc. Explain about joins. How to better model in Java? JSON-like data model.
  • #7: Example: member data: it does not make sense to repeatedly join positions, emails, groups, etc. Explain about joins. How to better model in Java? JSON-like data model.
  • #11: Statistical learning as the ultimate agile development tool (Peter Norvig), “business logic” through data rather than code
  • #30: No joins: across data domains due to APIs; within data domains due to performance. Natural operation: getAll(id…). Latency: if you want to call 30 services on your main pages, they had better be quick (30 * 20 ms = 600 ms).
  • #32: Strong consistency: all clients see the same view, even in the presence of updates. High availability: all clients can find some replica of the data, even in the presence of failures. Partition tolerance: the system properties hold even when the system is partitioned. High availability is the mantra for websites: better to deal with inconsistencies, because their primary need is to scale well to allow for a smooth user experience.
  • #33: Hashing: why do we need it? Basic problem: clients need to know which data is where. There are many ways of solving it: central configuration, or hashing. Linear hashing works; the issue is when the cluster is dynamic: the key-hash-to-node-ID mapping changes for a lot of entries when you add new slots. Consistent hashing preserves the key-to-node mapping for most of the keys and changes only the minimal amount needed. How to do it? Number of partitions: arbitrary; each node is allocated many partitions (better load balancing and fault tolerance), a few hundred to a few thousand. The key-to-partition mapping is fixed, and only ownership of partitions can change.
  • #35: Give an example of reads and writes with vector clocks. Pros and cons vs. Paxos and 2PC. The user can supply a strategy for handling cases where v1 and v2 are not comparable.
  • #36: Fancy way of doing Optimistic locking
  • #37: Very simple APIs. No range scans. No iterator on key set / entry set: very hard to fix performance. Have plans to provide such an iterator.
  • #38: Explain about partitions. Make things fast by removing slow things, not by tuning. HTTP client not performant. Separate caching layer.
  • #40: Transfer time: 30 minutes. Can max out a Gb network, so be careful.
  • #43: Example: member data: it does not make sense to repeatedly join positions, emails, groups, etc. Explain about joins. How to better model in Java? JSON-like data model.
  • #44: Questions, comments, etc
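
Note #35 asks for an example of reads and writes with vector clocks. A minimal sketch, assuming a clock is a plain mapping from node ID to counter, might look like the following. The function names and the "keep every non-dominated version" strategy are hypothetical choices for illustration, not Voldemort's API.

```python
from typing import Dict, List, Tuple

VectorClock = Dict[str, int]  # node id -> counter; a missing node counts as 0


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of clock with the master node's counter bumped by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for a relative to b."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # not comparable: the client must resolve the conflict
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


def reconcile(versions: List[Tuple[VectorClock, str]]) -> List[Tuple[VectorClock, str]]:
    """Drop every version that some other version supersedes; keep the rest."""
    survivors: List[Tuple[VectorClock, str]] = []
    for clock, value in versions:
        dominated = any(compare(clock, other) == "before" for other, _ in versions)
        if not dominated and (clock, value) not in survivors:
            survivors.append((clock, value))
    return survivors


# Two writes through node A, then a write through node B that only saw the first.
v1 = increment({}, "A")
v2 = increment(v1, "A")
v3 = increment(v1, "B")
conflicting = reconcile([(v1, "x"), (v2, "y"), (v3, "z")])
```

Here `v1` is superseded by both later writes, while `v2` and `v3` are concurrent, so both survive and would be handed back to the client, which applies its own resolution strategy.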