Introduction to
Big Data
AHMED SHOUMAN
Our agenda
 Demystify the term "Big Data"
 Find out what Hadoop is
 Explore the realms of batch and real-time big data processing
 Explore challenges of size, speed and scale in databases
 Skim the surface of big-data technologies
 Provide ways into the big-data world
Big Data
Demystified
What is big data?
 Big data is a collective term for a set of technologies designed
for storage, querying and analysis of extremely large data sets,
sources and volumes.
 Big data technologies come in where traditional off-the-shelf
databases, data warehousing systems and analysis tools fall
short.
How did we end up with so much data?
 Data Generation: Human (Internal) ↦ Human (Social) ↦ Machine
 Data Processing: Single Core ↦ Multi-Core ↦ Cluster / Cloud
 An Important Side Note
Big Data technologies are based on the concept of clustering: many computers
working in sync, each processing a chunk of our data.
Not just size
 Big data isn't just about data size, but also about data volume
(how much is read and written), diversity and inter-connectedness.
Big data is
 Any attribute of our data that challenges either technological capabilities
or business needs, like:
 Scaling, moving, storage and retrieval of ever-growing generated data
 Processing many small data points in real-time
 Analysing diverse semi-structured data from multiple sources
 Querying multiple, diverse data sources in real-time
Breathe... Let's recap
 Lots of data due to technological capabilities and social paradigms
 Not just size! Diversity, volume and inter-connectedness also count
 Scale, speed, processing, querying and analysis
 Challenges technological capabilities or business needs
Hadoop The Elephant in the Room
Everyone talks about Hadoop
 Hadoop is a powerful platform for batch analysis of large
volumes of both structured and unstructured data.
From: Conquering Hadoop with Haskell
Hadoop explained
 Hadoop is a horizontally scalable, fault-tolerant, open-source file system
and batch-analysis platform capable of processing large amounts of data.
 HDFS - Hadoop Distributed File System
 M/R - Hadoop Map-Reduce platform
Hadoop explained
 HDFS is an ever-growing file system. We can store lots and
lots of data on it for later use.
 HDFS is used as the underlying platform for other
technologies like Hadoop M/R, Apache Mahout or HBase.
Hadoop explained
 Imagine we want to look at 30 days' worth of access logs to identify site
usage patterns, at a volume of 30M log entries per day.
 Hadoop M/R is a platform that allows us to query HDFS data in parallel for
the purpose of batch (offline) data processing and analysis.
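The map/reduce idea behind the access-log example can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop's API: the log format, the sample lines and the names map_entry, shuffle and reduce_counts are assumptions made for the example. At the volume above (30 days x 30M entries, roughly 900M entries) the same phases would run in parallel across the cluster.

```python
from collections import defaultdict

# Hypothetical access-log lines in an assumed "<ip> <method> <path> <status>" format.
LOG_LINES = [
    "10.0.0.1 GET /home 200",
    "10.0.0.2 GET /products 200",
    "10.0.0.1 GET /home 200",
    "10.0.0.3 GET /checkout 500",
    "10.0.0.2 GET /home 200",
]


def map_entry(line):
    """Map phase: emit a (path, 1) pair for one log line."""
    _ip, _method, path, _status = line.split()
    return (path, 1)


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the values emitted for one key."""
    return (key, sum(values))


def page_hits(lines):
    """Run map, shuffle and reduce in sequence and return hits per path."""
    mapped = [map_entry(line) for line in lines]
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(page_hits(LOG_LINES))
```

Because each line is mapped independently and each key is reduced independently, both phases can be split across many machines, which is what makes the batch job parallel.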
Why is Hadoop so important?
 Scalable and fault-tolerant
 Handles massive amounts of data
 Truly parallel processing
 Data can be semi-structured or unstructured (schemaless)
 Serves as a basis for other technologies (HBase, Mahout, Impala, Shark)
Hadoop - Words of caution
 Complex
 Not for real-time
 Choose a distribution (Cloudera, Hortonworks, MapR) for better interoperability
 Requires trained DevOps for day-to-day operations
Breathe...
 We demystified the term Big Data and took a glimpse at Hadoop. Now what?
 How do I really get into the Big Data world?
The world of big data
 Batch & Data Science
 DBs
 Real-Time
Batch Processing
Hadoop M/R
Batch processing of large data sets
 We collect data for the purpose of providing end-users with a better
experience in our business domain. This means we have to constantly
query our data and divine new insights and relevant information.
 The problem is that doing this at very large scale is painful and slow.
How do we do this on Hadoop data?
Source: https://cwiki.apache.org/confluence/display/Hive/Tutorial
Batch processing of large data sets
 Hadoop gives us the basic tools for large data processing in
the form of M/R.
However, Hadoop M/R is pretty annoying to work with
directly as it lacks a lot of relevant tools for the job (statistical
analysis, machine learning etc.)
Source: http://xiaochongzhang.me/blog/?p=338
Hadoop querying and data science tools
 Tool - Purpose
 Hive - Write SQL-like M/R queries on top of Hadoop
 Shark - Hive-compatible, distributed SQL query engine for Hadoop
 Pig - Write scripted M/R queries on top of Hadoop
 Impala - Real-time SQL-like queries of Hadoop
 Mahout - Scalable machine-learning on top of Hadoop M/R
The gentle way in
 Hive and Shark are great places to start due to their SQL-like nature
 Shark is faster than Hive, which means less frustration
 You need some Hadoop data to work with (consider Avro)
 Remember - it's SQL-like, not SQL
 Start small, locally and grow to production later
 Check out Apache Sqoop for moving processed Hadoop data to your DB
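To get a feel for the kind of SQL-like aggregate one would write when starting small and locally, here is a sketch that uses Python's built-in sqlite3 module as a stand-in. SQLite is not Hive or Shark, and the access_log table, its columns and the sample rows are assumptions for illustration only; as noted above, the Hadoop tools are SQL-like, not SQL, so syntax and behaviour differ in places.

```python
import sqlite3

# Hypothetical sample rows: (day, path).
ROWS = [
    ("2014-01-01", "/home"),
    ("2014-01-01", "/products"),
    ("2014-01-01", "/home"),
    ("2014-01-02", "/home"),
]


def hits_per_page(rows):
    """Load rows into an in-memory table and count hits per path."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE access_log (day TEXT, path TEXT)")
        conn.executemany("INSERT INTO access_log VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT path, COUNT(*) AS hits "
            "FROM access_log GROUP BY path ORDER BY hits DESC, path"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for path, hits in hits_per_page(ROWS):
        print(path, hits)
```

The point of the sketch is the shape of the query (group, count, order); on Hadoop the engine would turn a similar statement into distributed work over the stored data.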
Databases In the big data world
Databases in the big data world
 The Problem: Traditional RDBMS were not designed for storing, indexing
and querying growing amounts and volumes of data.
 The 3S Challenge:
 Size - How much data is written and read
 Speed - How fast can we write and read data
 Scale - How easily can our DB scale to accommodate more data
The 3S Challenge
 There's no single, simple solution to the 3S challenge. Instead,
solutions focus on making an informed sacrifice in one area in
order to gain in another area.
NoSQL and C.A.P.
 NoSQL is a term referring to a family of DBMS that attempt to resolve the
3S challenge by sacrificing one of three areas:
 Consistency - All clients have the same view of data
 Availability - Each client can always read and write
 Partition Tolerance - System works despite physical network failures
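One way to picture the trade-off: when the network splits, a system must either refuse some requests (giving up availability) or let replicas diverge (giving up consistency). The toy model below sketches that choice with two in-memory replicas. The TinyCluster class, its "CP" and "AP" modes and the demo function are hypothetical names for this illustration and do not describe how any particular database works.

```python
class Replica:
    """One copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class TinyCluster:
    """Two replicas; mode is 'CP' or 'AP' (hypothetical toy model)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True while the replicas cannot talk

    def write(self, replica, key, value):
        """Write via one replica; return True if the write was accepted."""
        other = self.b if replica is self.a else self.a
        if not self.partitioned:
            # Healthy network: update both copies.
            replica.data[key] = value
            other.data[key] = value
            return True
        if self.mode == "CP":
            # Consistency first: refuse the write rather than diverge.
            return False
        # Availability first: accept locally; the replicas now disagree.
        replica.data[key] = value
        return True


def demo(mode):
    """Write once, cut the network, write again; report what happened."""
    cluster = TinyCluster(mode)
    cluster.write(cluster.a, "x", 1)
    cluster.partitioned = True
    accepted = cluster.write(cluster.a, "x", 2)
    consistent = cluster.a.data == cluster.b.data
    return accepted, consistent


if __name__ == "__main__":
    print("CP:", demo("CP"))
    print("AP:", demo("AP"))
```

In "CP" mode the second write is rejected and both replicas still agree; in "AP" mode it is accepted and the replicas hold different values until the partition heals.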
NoSQL and C.A.P.
 C.A.P. means you have to make an informed choice (and sacrifice)
 No single perfect solution
 Opt for mixed solutions per use-case
 Remember we're talking about read/write volume, not just size
Confused? Let's take a breath and focus
Big Data Concepts
OK, so where do I go from here?
 Identify your needs and limitations
 Choose a few candidates
 Research & Prototype
 Read about NewSQL - VoltDB, InfiniDB, MariaDB, HyperDex, FoundationDB
(omitted due to time constraints).
Real-Time Big Data Now!
Real-Time big data processing
 Processing big data in real-time is about data volumes rather than just size.
For example, given a rate of 100K ops/sec, how do I do the following in
real-time?
 Find anomalies in a data stream (spam)
 Group check-ins by geo
 Identify trending pages / topics
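As a rough illustration of the "trending pages / topics" task, the sketch below keeps counts over a sliding time window, so that each incoming event is handled as it arrives instead of in a later batch. It is a single-process toy, not Storm or Spark code; the TrendingWindow class, its method names and the sample events are assumptions made for the example.

```python
from collections import Counter, deque


class TrendingWindow:
    """Counts topics seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, topic), oldest first
        self.counts = Counter()  # topic -> hits inside the window

    def add(self, timestamp, topic):
        """Record one event; timestamps are assumed to be non-decreasing."""
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        """Drop events that have fallen out of the window."""
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n):
        """Return the n most frequent topics currently in the window."""
        return self.counts.most_common(n)


if __name__ == "__main__":
    window = TrendingWindow(window_seconds=10)
    for ts, topic in [(0, "a"), (1, "b"), (2, "b"), (3, "b"), (11, "c")]:
        window.add(ts, topic)
    print(window.top(1))
```

Each event costs a small, bounded amount of work, which is the property a real-time system needs at 100K ops/sec; a distributed framework would additionally partition the stream across many workers.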
Hadoop isn't for real-time processing
 When it comes to data processing and analysis, Hadoop's M/R framework
is wonderful for batch (offline) processing.
 However, processing, analysing and querying Hadoop data in real-time is
quite difficult.
Apache Storm and Apache Spark
 Apache Storm and Apache Spark are two frameworks for large-scale,
distributed data processing in real-time.
 One could say that Storm and Spark are to real-time data processing
what Hadoop M/R is to batch data processing.
Apache Storm - Highlights
 Runs on the JVM (Clojure / Java mix)
 Fully distributed and fault-tolerant
 Highly-scalable and extremely fast
 Interoperability with popular languages (Scala, Python etc.)
 Mature and production ready
 Hadoop interoperability via Storm-YARN
 Stateless / Non-Persistent (Data brought to processors)
Apache Spark - Highlights
 Fully distributed and extremely fast
 Write applications in Java, Scala and Python
 Perfect for both batch and real-time
 Combine Hadoop SQL (Shark), Machine Learning and Data streaming
 Native Hadoop interoperability
 HDFS, HBase, Cassandra, Flume as data sources
 Stateful / Persistent (Processors brought to data)
Storm & Spark - Use Cases
 Continuous/Cyclic Computation
 Real-time analytics
 Machine Learning (e.g. recommendations, personalisation)
 Graph Processing (e.g. social networks) - Only Spark
 Data Warehouse ETL (Extract, Transform, Load)
Recap
 Term - Purpose
 Big Data - Collective term for data-processing solutions at scale
 Hadoop - Scalable file-system and batch processing platform
 Batch Processing - Sifting and analysing data offline / in background
 M/R - Parallel, batch data-processing algorithm
 3S Challenge - Size, Speed, Scale of DBs
 C.A.P. - Consistency, Availability, Partition Tolerance
 NoSQL - Family of DBMS that grew due to the 3S Challenge
 NewSQL - Family of DBMS that provide ACID at scale
Questions?!
Feel free to drop me a line:
Email: ahmed.sayed.shouman@gmail.com
