Introduction To Big Data
By - Sameer Wadkar
Co-Founder & Big Data Architect / Data Scientist at Axiomine
Agenda
What is Big Data
Big Data Characteristics
Big Data and Business Intelligence Applications
Big Data and Transactional Applications
Demo
What is Big Data?
Big Data is commonly characterized along three dimensions, the "3 Vs": Volume, Velocity, and Variety.
12 terabytes of tweets are monitored each day to improve product sentiment analysis (source: IBM)
Amazon and PayPal use Big Data for real-time fraud detection (source: McKinsey)
In 15 of the US economy's 17 sectors, companies with upward of 1,000 employees store, on average, more information than the Library of Congress (source: McKinsey)
Most Big Data applications are based around the Volume dimension
Visualizing Big Data
1 Petabyte is roughly 54,000 movies in digital format.
Reading 1 Terabyte of data sequentially from a single disk drive takes about 3.5 hours at a typical hard-disk read speed of 80 MB/sec.
Traversing 1 Terabyte of data randomly over one disk (a typical database access scenario) takes orders of magnitude longer, because disk transfer rates are significantly higher than disk seek rates.
A single node's processing capacity will drown in the face of Big Data.
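A quick back-of-the-envelope check of these numbers, as a Python sketch (the 80 MB/sec transfer rate is the figure quoted above; the ~10 ms seek time and 4 KB random-read size are assumed typical values):

    # Back-of-the-envelope: sequential vs. random reads of 1 TB on one disk.
    TRANSFER_RATE = 80 * 10**6        # bytes/sec, the sequential rate quoted above
    ONE_TB = 10**12                   # bytes

    seconds = ONE_TB / TRANSFER_RATE
    print(f"Sequential read of 1 TB: {seconds / 3600:.1f} hours")     # ~3.5 hours

    SEEK_TIME = 0.010                 # seconds per seek, assumed for a mechanical disk
    BLOCK = 4096                      # bytes fetched per random read, assumed
    seeks = ONE_TB / BLOCK
    print(f"Random read of 1 TB in 4 KB blocks: {seeks * SEEK_TIME / 86400:.0f} days")

Seek-dominated random access comes out near a month versus a few hours sequentially, which is the "orders of magnitude" gap claimed above.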
Traditional vs. Big Data Architecture
In Big Data architectures the application moves to the data. Why?
Three-Tier Architecture (separate Application Tier and Data Tier):
1. User requests a report
2. App tier requests data from the data tier
3. Data tier sends the data to the app tier
4. App tier processes the data
5. App tier sends the report
Big Data Architecture (combined Application & Data Tier):
1. User launches a batch job
2. Master node distributes the application
3. Master launches the app on the nodes
4. All nodes process the data resident on their own node
5. User downloads the results
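A minimal sketch of "moving the application to the data", using Python's multiprocessing as a stand-in for cluster nodes (the partitions and the summing job are invented for illustration):

    # Sketch: ship a small function to each "node" instead of shipping data to one app tier.
    from multiprocessing import Pool

    # Pretend each inner list is the data already resident on one node.
    NODE_PARTITIONS = [list(range(i, 1_000_000, 4)) for i in range(4)]

    def local_job(partition):
        """Runs where the data lives; only a tiny result crosses the 'network'."""
        return sum(partition)

    if __name__ == "__main__":
        with Pool(len(NODE_PARTITIONS)) as pool:
            partials = pool.map(local_job, NODE_PARTITIONS)  # master distributes the application
        print(sum(partials))                                 # master merges small partial results

The expensive step (scanning the data) stays local to each node; only the small partial sums move, which is the whole point of the Big Data layout.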
Why is Big Data hard?
Divide the data and conquer in place is the core Big Data strategy.
The goal is to divide the data across multiple nodes and conquer by processing the data in place on each node.
Real-world processing cannot always be divided into smaller sub-problems (divide and conquer is not always feasible), as illustrated below:
Data has dependencies: normalization vs. denormalization
Processing has dependencies: a later phase of the process may require results of an earlier phase (single pass vs. multi-pass)
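A concrete illustration of the divide-and-conquer limit (a sketch; SUM and MEDIAN are chosen here as stand-ins for divisible and non-divisible computations):

    # SUM divides cleanly: partial sums combine into the exact answer.
    parts = [[3, 1, 4], [1, 5, 9], [2, 6]]
    assert sum(sum(p) for p in parts) == sum(x for p in parts for x in p)

    # MEDIAN does not: the median of per-partition medians is generally wrong.
    import statistics
    per_part = [statistics.median(p) for p in parts]          # [3, 5, 4]
    print(statistics.median(per_part))                        # 4   -- naive combine
    print(statistics.median([x for p in parts for x in p]))   # 3.5 -- the true median

Any computation whose per-node results cannot be merged locally forces extra passes or data movement, which is exactly what makes some problems hard at Big Data scale.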
Big Data Characteristics
Scale-out, Fault Tolerance & Graceful Recovery are essential features
Big Data Systems must scale out
Adding more nodes should lead to greater parallelization
Big Data Systems must be resilient to partial failure
If one part of the system fails, other parts should continue to function
Big Data Systems must be able to self-recover from partial failure
If any part of the system fails, another part of the system will attempt to recover from the failure
Data must be replicated on separate nodes
Loss of any node loses neither data nor processing (see the sketch below)
Recovery should be transparent to the end-user
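A minimal sketch of the replication idea (the node names and placement rule are assumptions for illustration; a replication factor of 3 is the common default, e.g. in Hadoop's HDFS):

    # Sketch: place each data block on 3 distinct nodes so losing any one node loses nothing.
    NODES = ["node1", "node2", "node3", "node4", "node5"]
    REPLICATION = 3

    def placements(block_id):
        """Pick 3 distinct nodes for a block by rotating through the cluster."""
        return [NODES[(block_id + k) % len(NODES)] for k in range(REPLICATION)]

    blocks = {b: placements(b) for b in range(10)}

    # Simulate losing node3: every block still has at least 2 live replicas.
    survivors = {b: [n for n in ns if n != "node3"] for b, ns in blocks.items()}
    assert all(len(ns) >= REPLICATION - 1 for ns in survivors.values())

Because every block survives on other nodes, recovery can re-replicate in the background without the end-user noticing.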
Big Data Applications
Big Data design is dictated by the nature of the applications
Business Intelligence applications
Read-only systems
ETL systems
Query massive data for the purpose of generating reports, or for large-scale transformations and import into a destination data source
Transactional Applications
One part of the system updates data while another part reads it
Example: imagine running an online store at Amazon.com scale
BI - Sample Use-Case
A very simple query but size makes all the difference
SELECT YEAR, SUM(SALES_AMT) FROM SALES WHERE STATE = 'MD' GROUP BY YEAR ORDER BY YEAR
In words: find the total sales revenue by year for Maryland, ordered by year.
What if the SALES table has billions of rows over 20 years?
Input: a Sales Transactions table with billions of rows, fed through Big Data processing.
Output: a small report.
Year: Sales Revenue
1980: 11 Million
1981: 13 Million
...
2010: 10 Billion
BI Big Data Flavors
We discuss three flavors in increasing order of scale-out capability
Big Data Flavor: Products
In-Memory Databases: Oracle Exalytics, SAP HANA
Massively Parallel Processing (MPP): Greenplum, Netezza
MapReduce: Hadoop
In Memory Databases
Simplified view: data is partitioned randomly across all data nodes, which feed a processing node backed by an in-memory cache.
Selection Phase
1. Each data node contains fast memory (SSD) and a mechanism to apply the WHERE clause
2. Only the necessary data (the MD records) is passed over the expensive network I/O to the processing node
Processing Phase
1. The processing node computes SUM(sales_amt) by year
2. Orders the results
3. Places them in the in-memory cache (a multi-terabyte cache with a SQL interface)
Fetch Phase
The user is served the results from the cache through the familiar SQL interface.
The first execution of the query is slow; subsequent executions are very fast (almost real-time) as the cache is hot. Because the cache has a SQL interface, the user experiences real time!
Caveat: if STATE = 'VA' is the next query and the cache is only big enough to hold one state's results at a time, a cache miss occurs and there are no performance gains (see the sketch below).
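A toy version of the hot-cache behaviour described above (the query strings and the 2-second "computation" are illustrative only):

    # Sketch: first execution computes and caches; repeats are served from memory.
    import time

    CACHE = {}

    def run_query(sql, compute):
        if sql in CACHE:                 # hot cache: near real-time
            return CACHE[sql]
        result = compute()               # cold cache: full selection + processing phases
        CACHE[sql] = result
        return result

    slow_md = lambda: time.sleep(2) or "sales by year for MD"
    print(run_query("... WHERE STATE='MD' ...", slow_md))   # slow the first time
    print(run_query("... WHERE STATE='MD' ...", slow_md))   # instant: cache is hot

    # The VA query misses the cache either way; a real cache with room for only
    # one state's results would also have evicted the MD entry by serving it.
    slow_va = lambda: time.sleep(2) or "sales by year for VA"
    print(run_query("... WHERE STATE='VA' ...", slow_va))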
In Memory Databases (cont.)
In-Memory DBs provide real-time querying on moderately sized data.
Characteristics
Specialized hardware
Specialized I/O and flash memory for faster I/O
Massive in-memory cache (multi-terabyte) with a SQL interface
Pros
Familiar model (SQL interface)
Can integrate with standard toolkits and BI solutions
Unified software/hardware solution
Cons
Vendor lock-in
Expensive hardware as well as licensing costs
Typically cannot scale beyond 1-2 TB of data
Works best when the same data is read often (cache remains hot)
MPP (Typical Architecture)
Data is partitioned horizontally across all slave nodes. Assume SALE_YEAR is the distribution key; secondary indexes by other keys can be added on each slave node. For example: one slave node holds the 1980 & 1990 data, another holds 1981 & 1991, ..., another holds 2000 & 2010, with a master node coordinating.
Distributed Query Phase
1. Each slave node computes the query for the data contained in its own node
2. Each year's data is held completely within one node
3. This phase produces partial query results which are complete for each year
Accumulation Phase
1. The master node aggregates and sorts all slave results
Scale Out: more nodes means fewer years of data per node.
Redundancy & Failover: each node has a backup node.
Compatibility between the data distribution strategy and the access patterns determines performance; expect enormous network overhead if access patterns do not respect the distribution strategy. If the query is GROUP BY STATE instead, it no longer runs as fast. Why? (See the sketch below.)
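A sketch of why the distribution key matters; hash partitioning by YEAR is an assumption standing in for whatever strategy a real MPP product uses:

    # Sketch: rows are distributed by YEAR, so GROUP BY YEAR is node-local,
    # while GROUP BY STATE forces rows for one state onto many nodes.
    from collections import defaultdict

    rows = [(1980, "MD", 5), (1980, "VA", 7), (1981, "MD", 2), (2010, "VA", 9)]
    N_NODES = 3

    nodes = defaultdict(list)
    for year, state, amt in rows:
        nodes[hash(year) % N_NODES].append((year, state, amt))  # distribution key = YEAR

    # GROUP BY YEAR: each node's partial result is already complete for its years.
    for node_id, local in nodes.items():
        by_year = defaultdict(int)
        for year, _, amt in local:
            by_year[year] += amt
        print(node_id, dict(by_year))    # no cross-node merge needed per year

    # GROUP BY STATE: the same state's rows live on many nodes, so every node's
    # partial results must be shipped to the master and merged -- network overhead.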
MPP (cont.)
MPP supports the familiar RDBMS paradigm at medium scale.
Characteristics
Balances throughput with responsiveness
Some implementations use specialized hardware (e.g., Netezza uses FPGAs)
Familiar RDBMS (SQL) paradigm
Can scale to tens of terabytes in most cases
Pros
Familiar model (SQL interface)
Can integrate with standard toolkits and BI solutions
Cons
Vendor lock-in
Cannot scale for ad-hoc queries
Queries must respect the data distribution strategy for acceptable performance
MapReduce
If the query is GROUP BY STATE, it still works!
Data is partitioned randomly and redundantly across all data nodes. Every data node contains sales data for every state and every year.
Map Phase (on each data node)
1. Each data node reads all of its records sequentially
2. It filters out all non-MD records
3. It computes SUM(sales_amt) for each year
Reduce Phase (on the reduce node)
1. The reduce node receives SUM(sales_amt) for state MD by year from each data node
2. It adds the map results by year to compute the final SUM(sales_amt) by year for MD sales
3. It orders the results by year
Data blocks (on the order of 128 MB) are stored and accessed contiguously.
Scales out efficiently and degrades gracefully: if a task fails, the framework (coordinated by a master node) restarts it automatically, on another node if necessary. Redundancy and graceful recovery are built in.
MapReduce (cont.)
MapReduce: how it works
Map Process 1 output:
Year: Sales
1990: $1M
1982: $2M
...
1999: $20M
Map Process 20 output:
Year: Sales
1998: $6M
1982: $5M
...
2010: $30M
The reduce node adds up all the map results and sorts by year to give the final result:
Year: Sales
1980: $100M
1981: $102M
...
2010: $250M
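The same flow as a minimal Python sketch (the in-memory lists stand in for data-node blocks; this shows the shape of the computation, not Hadoop's actual API):

    # Sketch of the MD sales query as map + reduce.
    from collections import defaultdict

    # Each inner list stands in for one data node's block of (year, state, amount) records.
    blocks = [
        [(1990, "MD", 1_000_000), (1982, "VA", 3_000_000), (1982, "MD", 2_000_000)],
        [(1998, "MD", 6_000_000), (1982, "MD", 5_000_000), (2010, "MD", 30_000_000)],
    ]

    def map_phase(block):
        """Runs on each data node: filter to MD, partial SUM(sales_amt) by year."""
        partial = defaultdict(int)
        for year, state, amt in block:
            if state == "MD":
                partial[year] += amt
        return partial

    def reduce_phase(partials):
        """Runs on the reduce node: merge the partial sums, then order by year."""
        total = defaultdict(int)
        for partial in partials:
            for year, amt in partial.items():
                total[year] += amt
        return sorted(total.items())

    print(reduce_phase([map_phase(b) for b in blocks]))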
MapReduce (cont.)
MapReduce is general purpose but requires complex skills.
Characteristics
Batch oriented: maximizes throughput, not responsiveness
Pros
Simple programming model
Scales out efficiently
Failure handling and redundancy built in
Adapts well to a wide variety of problems
Cons
Requires custom programming
Higher-level languages (SQL-like) exist, but programming skills are often critical
Requires a complex array of skills to manage and maintain a MapReduce system
Summary of BI Apps
Each option has tradeoffs. Choose based on requirements
Big Data Flavor: How much data can it typically handle?
In-Memory Databases: order of 1 TB
Massively Parallel Databases: order of 10 TB
MapReduce: order of 100s of TB into the petabyte range
Transactional System - Use-Case
How many items in stock do users A and B see on their second access?
Scenario (a web-based online store backed by a database):
1. User A looks up item X
2. User B looks up item X
3. User C buys item X, which updates the inventory
4. User A looks up item X again
5. User B looks up item X again
Context: The CAP Theorem
You can get any two, but not all three, of these characteristics in any system:
Consistency: all nodes (and users) see the same data at the same time.
Availability: a guarantee that every request receives a valid response; the site does not go down or appear down under heavy load.
Partition Tolerance: the system continues to function regardless of the loss of one of its components.
CA: Single RDBMS
A single RDBMS instance is both consistent and available.
Running the scenario above against a web-based online store backed by one RDBMS:
When set up in READ COMMITTED isolation, every user sees the same inventory count.
The system responds with the last committed inventory count even during updates.
Result: Consistent, Available.
CP: Distributed RDBMS
A distributed RDBMS is consistent and resilient to the failure of nodes.
Running the scenario above against two RDBMS instances (East region and West region) kept in sync via two-phase commit:
Under READ COMMITTED isolation, all users see consistent counts.
If one DB fails, the other one serves all users (Partition Tolerance).
During a two-phase commit, the system is unavailable (see the sketch below).
Result: Consistent, Partition Tolerant.
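A toy two-phase commit (the coordinator logic and Replica class are invented for illustration), showing why the system is briefly unavailable while a commit is in flight:

    # Sketch: 2PC -- every replica must vote yes before any replica commits.
    class Replica:
        def __init__(self):
            self.value, self.locked = 100, False
        def prepare(self, new_value):
            self.locked, self.pending = True, new_value  # reads block while locked
            return True                                  # vote yes
        def commit(self):
            self.value, self.locked = self.pending, False

    def two_phase_commit(replicas, new_value):
        if all(r.prepare(new_value) for r in replicas):  # phase 1: prepare / vote
            for r in replicas:
                r.commit()                               # phase 2: commit everywhere
        # Between prepare and commit, both regions hold locks: a read arriving
        # here must wait, so the system is momentarily unavailable.

    east, west = Replica(), Replica()
    two_phase_commit([east, west], 99)
    assert east.value == west.value == 99                # consistent in both regions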
AP: Distributed Data Store
Eventual Consistency is the key to Big Data transactional systems.
Running the scenario above against a web-based online store backed by replicated data stores: Amazon Dynamo and Apache Cassandra work on this principle.
If one DB fails, the other one serves all users (Partition Tolerance).
Users will always be able to browse all products, but occasionally some users will see a stale count of inventory (Eventual Consistency), as the sketch below shows.
Result: Available, Partition Tolerant, Eventually Consistent.
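A toy illustration of eventual consistency (the Replica class and the replication backlog are invented for illustration):

    # Sketch: writes land on one replica and propagate later, so a reader
    # hitting the other replica briefly sees a stale inventory count.
    class Replica:
        def __init__(self, count):
            self.count = count

    east, west = Replica(10), Replica(10)
    backlog = []

    def buy(replica, other):
        replica.count -= 1
        backlog.append(other)        # replicate asynchronously, later

    buy(east, west)                  # user C buys item X via the east region
    print(east.count)                # 9  -- user A sees the new count
    print(west.count)                # 10 -- user B sees a stale count (but stays served!)

    while backlog:                   # background replication catches up...
        backlog.pop().count -= 1
    print(west.count)                # 9  -- eventually consistent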
Hybrid Solution
Big Data techniques are not an either/or choice!
A typical hybrid pipeline:
A large structured DB and a large unstructured DB (100 TB to 1 PB) feed a MapReduce-based ETL tier.
The ETL output loads an MPP DB (5-10 TB), which in turn feeds an In-Memory DB (about 1 TB).
Business users can use familiar SQL-based BI tools in real time; the In-Memory DB allows that.
The ETL tier also loads a NoSQL DB (a few 100 GB). Programmers and system admins with no real-time requirements can use all three techniques via programs and scripts; NoSQL DBs allow technical users to gain real-time benefits in ways which suit their complex needs.
Demo: US Patent Explorer
Exploring millions of US patent pages at the speed of thought
www.axiomine.com/patents/
Patent Explorer Goals
Seamlessly navigate structured and unstructured data in real time.
Navigate 3 million US patents' data (text and metadata) from 1963 to 1999 at the speed of thought.
Data Sources
Patent metadata: National Bureau of Economic Research
Patent text: bulk download from Google's site
Each week, granted patents are published to the Google site as an archive.
Size of the uncompressed data:
Structured metadata: approximately 2 GB
Patent text data: approximately 300 GB
Patent Metadata
The metadata alone cannot answer "What is the title of Patent No. 8086905?"
Source: National Bureau of Economic Research
http://data.nber.org/patents/
Schema: a Patent Master table with one-to-many relationships to Pairwise Citations and Inventors, plus other master data: Company Master, Country Master, and Classification Master.
Contains only metadata; no text data, such as the patent title, is available.
Ex.: Pairwise Citations contains millions of patent-id pairs.
Patent Text
We need to merge both metadata and text.
Source: Google
http://www.google.com/googlebooks/uspto.html
(A sample patent text file was shown here.)
High Level Architecture
We need to merge both metadata and text.
Raw Data Tier: patent metadata and patent text.
ETL & Text Analytics Tier: Hadoop merges and processes the raw data.
Search & Visualization Tier:
Apache Solr serves navigation, search & text analytics, plus the text-enhanced citation data.
MongoDB serves the patent details.
The user can navigate, search & visualize, then drill down to patent details.
Big Data Flavors Summary
Choose a Big Data tool and product based on requirements
Flavor: Characteristics
Map-Reduce: massive (100 TB to 1 PB) scale ETL; complex analytics on massive data; large-scale unstructured data analysis
Massively Parallel Processing (MPP): batch-oriented aggregations; analytics on moderately large structured data with predictable access patterns
In-Memory DB: similar to MPP but where real-time access patterns are required; rich and interactive Business Intelligence apps
NoSQL databases: similar to In-Memory DB but with simpler (non-SQL) access patterns; provide fast access to detail data where other techniques are used to serve summary data
GPGPU: real-time Value at Risk (financial risk management); compute-intensive analytics, e.g., simulation of a hospital waiting room over 1 year