Architecting a Next Generation Data Platform (hadooparchbook)
This document summarizes a presentation on Hadoop application architectures for a next generation data platform. It provides an overview of the presentation topics, which include a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high-level architecture, including ingestion of streaming and batch data using Kafka and Flume, stream processing with Kafka Streams, and storage in Hadoop.
Strata San Jose 2017 - Ben Sharma Presentation (Zaloni)
The document discusses creating a modern data architecture using a data lake. It describes Zaloni as a provider of data lake management solutions, including a data lake management and governance platform and self-service data platform. It outlines key features of a data lake such as storing different types of data, creating standardized datasets, and providing shorter time to insights. The document also discusses Zaloni's data lake maturity model and reference architecture.
Solr + Hadoop: Interactive Search for Hadoop (gregchanan)
This document discusses Cloudera Search, which integrates Apache Solr with Cloudera's distribution of Apache Hadoop (CDH) to provide interactive search capabilities. It describes the architecture of Cloudera Search, including components like Solr, SolrCloud, and Morphlines for extraction and transformation. Methods for indexing data in real-time using Flume or batch using MapReduce are presented. The document also covers querying, security features like Kerberos authentication and collection-level authorization using Sentry, and concludes by describing how to obtain Cloudera Search.
Innovation in the Data Warehouse - StampedeCon 2016 (StampedeCon)
Enterprise Holdings first started with Hadoop as a POC in 2013. Today, we have clusters on premises and in the cloud. This talk will explore our experience with Big Data and outline three common big data architectures (batch, lambda, and kappa). Then, we'll dive into the decision points necessary for your own cluster, for example: cloud vs. on premises, physical vs. virtual, workload, and security. These decisions will help you understand what direction to take. Finally, we'll share some lessons learned about which pieces of our architecture worked well and rant about those which didn't. No deep Hadoop knowledge is necessary; the talk suits both architect and executive levels.
Building a fraud detection application using the tools in the Hadoop ecosystem. Presentation given by authors of O'Reilly's Hadoop Application Architectures book at Strata + Hadoop World in San Jose, CA 2016.
Spark in the Enterprise - 2 Years Later by Alan Saldich (Spark Summit)
Over the past 2 years, Cloudera has focused on improving and supporting Apache Spark. They have integrated Spark with Hadoop components like YARN, HBase, and Kafka. Cloudera engineers have also contributed security, monitoring, and governance features to Spark. More than 200 customers now use Spark for tasks like ETL, machine learning, and streaming analytics. Customers want Spark to have security comparable to databases, high performance, and simplicity. Cloudera is developing technologies like Sentry and Kudu to meet these needs and make Spark more powerful and useful for enterprises.
GeoWave: Open Source Geospatial/Temporal/N-dimensional Indexing for Accumulo,... (DataWorks Summit)
GeoWave is an open-source library that connects geospatial software with distributed computing frameworks. GeoWave leverages the scalability of a distributed key-value store for effective storage, retrieval, and analysis of massive geospatial datasets. It uses a space filling curve to preserve locality between multi-dimensional objects and the single dimensional sort order imposed by key-value stores. What this means to a user is that distributed spatial and spatial-temporal retrieval and analysis can be effectively accomplished at a massive scale.
At its core, GeoWave solves the problem of multi-dimensional indexing, and particularly extends this capability to spatial/temporal use cases. GeoWave supports raster, vector, and point cloud data, and provides common spatial algorithms that can be extended to create deep analytic capabilities. It also performs fast subsampling via distributed rendering that integrates with GeoServer, so that a user can interactively visualize data at map scale regardless of density.
Our goal in presenting GeoWave to the Hadoop Summit is to introduce it to the big data community. We will present GeoWave at a moderate level of detail, to include a short demonstration, and hopefully answer any questions regarding maturity, suitability and implementation details.
"Who Moved my Data? - Why tracking changes and sources of data is critical to...Cask Data
Speaker: Russ Savage, from Cask
Big Data Applications Meetup, 09/14/2016
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
Link to talk: https://youtu.be/4j78g3WvC4Y
About the talk:
As data lake sizes grow, and more users begin exploring and including that data in their everyday analysis, keeping track of the sources for data becomes critical. Understanding how a dataset was generated and who is using it allows users and companies to ensure their analysis is leveraging the most accurate and up to date information. In this talk, we will explore the different techniques available to keep track of your data in your data lake and demonstrate how we at Cask approached and attempted to mitigate this issue.
The document describes Big Data Ready Enterprise (BDRE), an open source product that addresses common challenges in implementing and operating big data solutions at large scale. It provides out-of-the-box features to accelerate implementations using pluggable architecture, community support, and distribution compatibility. The document outlines BDRE's key benefits and capabilities for data ingestion, workflow automation, operational metadata management, and more. It also provides examples of BDRE implementations and screenshots of the product's interface.
Apache Druid ingests and enables instant query on many billions of events in real-time. But how? In this talk, each of the components of an Apache Druid cluster is described – along with the data and query optimisations at its core – that unlock fresh, fast data for all.
Bio: Peter Marshall (https://linkedin.com/in/amillionbytes/) leads outreach and engineering across Europe for Imply (http://imply.io/), a company founded by the original developers of Apache Druid. He has 20 years architecture experience in CRM, EDRM, ERP, EIP, Digital Services, Security, BI, Analytics, and MDM. He is TOGAF certified and has a BA (hons) degree in Theology and Computer Studies from the University of Birmingham in the United Kingdom.
Hadoop application architectures - using Customer 360 as an example (hadooparchbook)
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
This document discusses a case study on fraud detection using Hadoop. It begins with an overview of fraud detection requirements, including the need for real-time and near real-time processing of large volumes and varieties of data. It then covers considerations for the system architecture, including using HDFS and HBase for storage, Kafka for ingestion, and Spark and Storm for stream and batch processing. Data modeling with HBase and caching options are also discussed.
The Future of Analytics, Data Integration and BI on Big Data Platforms (Mark Rittman)
The document discusses the future of analytics, data integration, and business intelligence (BI) on big data platforms like Hadoop. It covers how BI has evolved from old-school data warehousing to enterprise BI tools to utilizing big data platforms. New technologies like Impala, Kudu, and dataflow pipelines have made Hadoop fast and suitable for analytics. Machine learning can be used for automatic schema discovery. Emerging open-source BI tools and platforms, along with notebooks, bring new approaches to BI. Hadoop has become the default platform and future for analytics.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc... (Agile Testing Alliance)
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing by "Sampat Kumar" from "Harman". The presentation was done at #doppa17 DevOps++ Global Summit 2017. All the copyrights are reserved with the author
Build Big Data Enterprise solutions faster on Azure HDInsight (DataWorks Summit)
Hadoop and Spark are big data frameworks used to extract useful insights from data, spanning a variety of scenarios from ingestion, data prep, and data management to processing, analyzing and visualizing data. Each step requires specialized toolsets to be productive. In this talk I will share solution examples in the Big Data ecosystem such as Cask, StreamSets, Datameer, AtScale, and Dataiku on Microsoft’s Azure HDInsight that simplify your Big Data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise; take advantage of all the benefits of HDInsight and get the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
Lambda-less Stream Processing @Scale in LinkedIn
The document discusses challenges with stream processing including data accuracy and reprocessing. It proposes a "lambda-less" approach using windowed computations and handling late and out-of-order events to produce eventually correct results. Samza is used in LinkedIn's implementation to store streaming data locally using RocksDB for processing within configurable windows. The approach avoids code duplication compared to traditional lambda architectures while still supporting reprocessing through resetting offsets. Challenges remain in merging online and reprocessed results at large scale.
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi... (Data Con LA)
This document discusses how Redis can be used for analytics at high speeds. It provides examples of how Redis data structures and operations allow for real-time bidding, recommendations, and time-series analytics. Redis on flash is presented as a cost-effective way to achieve high performance by using flash as an extension of RAM. Redis modules are introduced as a way to extend Redis capabilities with features like full text search, graphs, and SQL.
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre... (Data Con LA)
This talk will present how to build data pipelines with no code using the open-source, Apache 2.0, Cask Hydrator. The talk will continue with a live demonstration of creating data pipelines for two use cases.
The document discusses rolling out a Hadoop-based data lake for self-service analytics within a corporate environment. It describes the background and motivation for implementing the data lake. Key challenges addressed include security, governance, and change management. Lessons learned include the importance of guidelines, reusable components, integration testing, and understanding users' diverse needs.
Architecting a Next Generation Data Platform – Strata Singapore 2017 (Jonathan Seidman)
This document discusses the high-level architecture for a data platform to support a customer 360 view using data from connected vehicles (taxis). The architecture includes data sources, streaming data ingestion using Kafka, schema validation, stream processing for transformations and routing, and storage for analytics, search and long-term retention. The presentation covers design considerations for reliability, scalability and processing of both streaming and batch data to meet requirements like querying, visualization, and batch processing of historical data.
Turn Data Into Actionable Insights - StampedeCon 2016 (StampedeCon)
At Monsanto, emerging technologies such as IoT, advanced imaging and geospatial platforms, and molecular breeding, ancestry and genomics data sets have made us rethink how we approach developing, deploying, scaling and distributing our software to accelerate predictive and prescriptive decisions. We created a cloud-based Data Science platform for the enterprise to address this need. Our primary goals were to perform analytics@scale and integrate analytics with our core product platforms.
As part of this talk, we will share our journey of transformation, showing how we enabled a collaborative discovery analytics environment for data science teams to perform model development; provisioned data through APIs and streams; deployed models to production through our auto-scaling big data compute in the cloud to perform streaming, cognitive, predictive, prescriptive, historical and batch analytics@scale; and integrated analytics with our core product platforms to turn data into actionable insights.
The Hadoop Guarantee: Keeping Analytics Running On Time (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and Pepperdata
Live Webcast September 15, 2015
Watch the Archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=32f198185d9d0c4cf32c27bdd1498b2a
Industry researchers agree: the importance of Hadoop will continue to grow as more companies recognize the range of benefits they can reap, from lower-cost storage to better business insights. At the same time, advances in the Hadoop ecosystem are addressing many of the key concerns that have hampered adoption, including performance and reliability. As a result, Hadoop is fast becoming a first-class citizen in the world of enterprise computing.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor explain how the Hadoop ecosystem is evolving into a mature foundation for managing enterprise data. He’ll be briefed by Sean Suchter of Pepperdata, who will explain how his company’s software brings predictability and reliability to Hadoop through dynamic, policy-based controls and monitoring. He’ll show how to guarantee service-level agreements by slowing down low-priority tasks as needed. He’ll also discuss the holy grail of Hadoop: how to enable mixed workloads.
Visit InsideAnalysis.com for more information.
Architecting next generation big data platform (hadooparchbook)
A tutorial on architecting next generation big data platform by the authors of O'Reilly's Hadoop Application Architectures book. This tutorial discusses how to build a customer 360 (or entity 360) big data application.
Audience: Technical.
The document discusses Hortonworks DataFlow (HDF), which is a platform for data in motion. HDF allows users to collect data at the edge, route and process streaming data with Apache NiFi and Kafka, and analyze, visualize, predict and prescribe outcomes from the data using HDF platform services. The HDF platform provides scalable stream processing, security, data provenance, and management capabilities for data in motion applications across the enterprise.
This document provides an overview of Riak TS, Basho's new purpose-built time series database. It describes Riak TS's key features like high write throughput, efficient range query support, and horizontal scalability. It also outlines Riak TS's data modeling approach of co-locating and partitioning time-series data, its SQL-like query language, and provides examples of its performance and roadmap. Finally, it demonstrates a potential use case application called UNCORKD for tracking wine check-ins and reviews.
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios (kcmallu)
What's the origin of Big Data? What are the real life usage scenarios where Hadoop has been successfully adopted? How do you get started within your organizations?
Hadoop and the Data Warehouse: Point/Counter Point (Inside Analysis)
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
How Oracle has managed to separate the SQL engine of its flagship database to process queries, along with the access drivers that make it possible to read data both from files on the Hadoop Distributed File System and from the data warehousing tool, Hive.
This document discusses data management trends and Oracle's unified data management solution. It provides a high-level comparison of HDFS, NoSQL, and RDBMS databases. It then describes Oracle's Big Data SQL which allows SQL queries to be run across data stored in Hadoop. Oracle Big Data SQL aims to provide easy access to data across sources using SQL, unified security, and fast performance through smart scans.
Enterprise Hadoop is Here to Stay: Plan Your Evolution Strategy (Inside Analysis)
The Briefing Room with Neil Raden and Teradata
Live Webcast on August 19, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=1acd0b7ace309f765dc3196001d26a5e
Modern enterprises have been able to solve information management woes with the data warehouse, now a staple across the IT landscape that has evolved to a high level of sophistication and maturity with thousands of global implementations. Today’s modern enterprise has a similar challenge; big data and the fast evolution of the Hadoop ecosystem create plenty of new opportunities but also a significant number of operational pains as new solutions emerge.
Register for this episode of The Briefing Room to hear veteran Analyst Neil Raden as he explores the details and nature of Hadoop’s evolution. He’ll be briefed by Cesar Rojas of Teradata, who will share how Teradata solves some of the Hadoop operational challenges. He will also explain how the integration between Hadoop and the data warehouse can help organizations develop a more responsive and robust data management environment.
Visit InsideAnalysis.com for more information.
Hadoop and NoSQL joining forces by Dale Kim of MapR (Data Con LA)
More and more organizations are turning to Hadoop and NoSQL to manage big data. In fact, many IT professionals consider each of those terms to be synonymous with big data. At the same time, these two technologies are seen as different beasts that handle different challenges. That means they are often deployed in a rather disjointed way, even when intended to solve the same overarching business problem. The emerging trend of “in-Hadoop databases” promises to narrow the deployment gap between them and enable new enterprise applications. In this talk, Dale will describe that integrated architecture and how customers have deployed it to benefit both the technical and the business teams.
Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data, from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload through clusters of servers, gives customers a new option to tackle data growth and deploy big data analysis to better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with the Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera and Hitachi Consulting will present together and explain how to get you there. Attend this WebTech and learn how to: solve big data problems with Hadoop; deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data; and implement Hadoop using the HDS Hadoop reference architecture. For more information on the Hitachi Data Systems Hadoop Solution please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html
Hadoop and the Data Warehouse: When to Use Which (DataWorks Summit)
In recent years, Apache™ Hadoop® has emerged from humble beginnings to disrupt the traditional disciplines of information management. As with all technology innovation, hype is rampant, and data professionals are easily overwhelmed by diverse opinions and confusing messages.
Even seasoned practitioners sometimes miss the point, claiming for example that Hadoop replaces relational databases and is becoming the new data warehouse. It is easy to see where these claims originate since both Hadoop and Teradata® systems run in parallel, scale up to enormous data volumes and have shared-nothing architectures. At a conceptual level, it is easy to think they are interchangeable, but the differences overwhelm the similarities. This session will shed light on the differences and help architects, engineering executives, and data scientists identify when to deploy Hadoop and when it is best to use MPP relational database in a data warehouse, discovery platform, or other workload-specific applications.
Two of the most trusted experts in their fields, Steve Wooledge, VP of Product Marketing from Teradata and Jim Walker of Hortonworks will examine how big data technologies are being used today by practical big data practitioners.
How Hewlett Packard Enterprise Gets Real with IoT Analytics (Arcadia Data)
Learn how HPE uses visual analytics within a data lake to create an “Industrial Internet of Things” model that solves their data analytics problem at scale.
Big Data Analytics with Hadoop, MongoDB and SQL Server (Mark Kromer)
This document discusses SQL Server and big data analytics projects in the real world. It covers the big data technology landscape, big data analytics, and three big data analytics scenarios using different technologies like Hadoop, MongoDB, and SQL Server. It also discusses SQL Server's role in the big data world and how to get data into Hadoop for analysis.
A Tale of 2 BI Standards: One for Data Warehouses and One for Data Lakes (Arcadia Data)
The use of data lakes continue to grow, and a recent survey by Eckerson Group shows that organizations are getting real value from their deployments. However, there’s still a lot of room for improvement when it comes to giving business users access to the wealth of potential insights in the data lake.
While the data management aspect has been fairly well understood over the years, the success of business intelligence (BI) and analytics on data lakes lags behind. In fact, organizations often struggle with data lakes because they are only accessible by highly-skilled data scientists and not by business users. But BI tools have been able to access data warehouses for years, so what gives?
In this talk, we’ll discuss:
- Why traditional BI tools are architected well for data warehouses, but not data lakes.
- Why every organization should have two BI standards: one for data warehouses and one for data lakes.
- Innovative capabilities provided by BI for data lakes
Vmware Serengeti - Based on Infochimps Ironfan (Jim Kaskade)
This document discusses virtualizing Hadoop for the enterprise. It begins with discussing trends driving changes in enterprise IT like cloud, mobile apps, and big data. It then discusses how Hadoop can address big, fast, and flexible data needs. The rest of the document discusses how virtualizing Hadoop through solutions like Project Serengeti can provide enterprises with elasticity, high availability, and operational simplicity for their Hadoop implementations. It also discusses how virtualization allows enterprises to integrate Hadoop with other workloads and data platforms.
Hadoop-based data lakes have become increasingly popular within today’s modern data architectures for their ability to scale, handle data variety, and keep costs low. Many organizations start slow with their data lake initiatives, but as the lakes grow bigger they suffer challenges with data consistency, quality and security, and end up losing confidence in those initiatives.
This talk will discuss the need for good data governance mechanisms for Hadoop data lakes, their relationship with productivity, and how they help organizations meet regulatory and compliance requirements. The talk advocates bringing a different mindset to designing and implementing flexible governance mechanisms on Hadoop data lakes.
This document discusses Hortonworks and its mission to enable modern data architectures through Apache Hadoop. It provides details on Hortonworks' commitment to open source development through Apache, engineering Hadoop for enterprise use, and integrating Hadoop with existing technologies. The document outlines Hortonworks' services and the Hortonworks Data Platform (HDP) for storage, processing, and management of data in Hadoop. It also discusses Hortonworks' contributions to Apache Hadoop and related projects as well as enhancing SQL capabilities and performance in Apache Hive.
The document summarizes Oracle's Big Data Appliance and solutions. It discusses the Big Data Appliance hardware which includes 18 servers with 48GB memory, 12 Intel cores, and 24TB storage per node. The software includes Oracle Linux, Apache Hadoop, Oracle NoSQL Database, Oracle Data Integrator, and Oracle Loader for Hadoop. Oracle Loader for Hadoop can be used to load data from Hadoop into Oracle Database in online or offline mode. The Big Data Appliance provides an optimized platform for storing and analyzing large amounts of data and is integrated with Oracle Exadata.
Big Data for Oracle Devs - Towards Spark, Real-Time and Predictive Analytics (Mark Rittman)
This is a session for Oracle DBAs and devs that looks at cutting-edge big data technologies like Spark, Kafka, etc., and through demos shows how Hadoop is now a real-time platform for fast analytics, data integration and predictive modeling.
Big Data LDN 2018: A TALE OF TWO BI STANDARDS: DATA WAREHOUSES AND DATA LAKES (Matt Stubbs)
Date: 13th November 2018
Location: Self-Service Analytics Theatre
Time: 14:30 - 15:00
Speaker: Zaf Khan
Organisation: Arcadia Data
About: The use of data lakes continue to grow, and a recent survey by Eckerson Group shows that organizations are getting real value from their deployments. However, there’s still a lot of room for improvement when it comes to giving business users access to the wealth of potential insights in the data lake.
While the data management aspect has been fairly well understood over the years, the success of business intelligence (BI) and analytics on data lakes lags behind. In fact, organizations often struggle with data lakes because they are only accessible by highly-skilled data scientists and not by business users. But BI tools have been able to access data warehouses for years, so what gives?
In this talk, we’ll discuss:
• Why traditional BI tools are architected well for data warehouses, but not data lakes.
• Why every organization should have two BI standards: one for data warehouses and one for data lakes.
• Innovative capabilities provided by BI for data lakes
1. beyond mission critical virtualizing big data and hadoop (Chiou-Nan Chen)
Virtualizing big data platforms like Hadoop provides organizations with agility, elasticity, and operational simplicity. It allows clusters to be quickly provisioned on demand, workloads to be independently scaled, and mixed workloads to be consolidated on shared infrastructure. This reduces costs while improving resource utilization for emerging big data use cases across many industries.
Data lakes often fail because they are only accessible by highly-skilled data scientists and not by business users. But BI tools have been able to access data warehouses for years, so what gives?
In this talk, we’ll discuss:
- Why existing BI tools are architected well for data warehouses, but not data lakes.
- The pros and cons of each architecture.
- Why every organization should have two BI standards: one for data warehouses and one for data lakes.
The practice of big data - making big data approachable
1. Big Data in Practice: A Pragmatic approach to Adoption and Value creation
Raj Nair
Data Management Practitioner and Consultant
Content NOT FOR DISTRIBUTION: Property of Raj Nair
2. 1 Mainstream Big Data
2 Real World Use Cases and Applications
3 Practical Adoption: Opportunity Identification
4 Big Data 2.0 – What’s on the Horizon?
5 Conclusion
3. Every Day Big Data
Reaching scale-up limits on your server
Represents tools, technologies, frameworks for storage and processing at scale
Represents Opportunity
6. Big Data 1.0 – The Hadoop Ecosystem
Software library
Framework for large scale distributed processing
Ability to scale to thousands of computers
7. Design Principles
- Large data sets
- Classic Hadoop MapReduce: batch processing
- Moving computation is cheaper than moving data
- Hardware failure, redundancy
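To make the classic batch MapReduce model above concrete, here is a minimal word count sketch using the standard Hadoop Java MapReduce API. The input and output paths are placeholders passed on the command line; the map tasks run next to the data blocks and only the small aggregated results move across the network, which is the "moving computation is cheaper than moving data" principle in practice.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts per word; also reused as a combiner
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```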
8. What is this you call data?
Unlearn current notion of “Data”
Native Data Source
9. HDFS: storage and archival
MapReduce: programming library
Crunch: data pipeline processing
HBase: real-time access (low latency)
Pig: M/R abstraction
Hive: data warehouse (high latency)
Sqoop: data transfer
Flume: data streaming
Layers: data processing, workload management, data movement
10. Component | Purpose | Use it for
HDFS | Distributed Storage | Raw data storage and archival
Flume | Data Movement | Continuous streaming into HDFS
Sqoop | Data Movement | Data transfer from RDBMS to HDFS/HBase
HBase | Workload Mgmt | Near real-time read/write access to large data sets
Hive | Workload Mgmt | Analytical queries; data warehouse
MapReduce | Data Processing | Low-level custom code for data processing
Crunch | Data Processing (Java) | Coding M/R pipelines, aggregations
Pig | Data Processing | Scripting language; similar to Crunch
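As an illustration of the "near real-time read/write access" row above, the following sketch uses the HBase 1.x Java client to write one cell and read it back. The table name, column family, qualifier and row key are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EventStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("events"))) {

      // Write one cell (row key, column family "d", qualifier "temp" are made up)
      Put put = new Put(Bytes.toBytes("device42-20140301"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
      table.put(put);

      // Read the same cell back with low latency
      Get get = new Get(Bytes.toBytes("device42-20140301"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"));
      System.out.println("temp = " + Bytes.toString(value));
    }
  }
}
```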
11. A Powerful Paradigm
Hadoop keeps the layers separate: storage layer, processing engine, query engine, and metadata. Multiple query engines can run over data kept in its native format.
Oracle, SQL Server and DB2, by contrast, bundle storage and query into tightly integrated proprietary stacks that cannot free your data.
12. 1 Mainstream Big Data
2 Real World Use Cases and Applications
3 Practical Adoption: Opportunity Identification
4 Big Data 2.0 – What’s on the Horizon?
5 Conclusion
13. Opportunity…
Transform Data Processing
Exploration
Information Enrichment
Data Archival
14. Data Processing Pipeline
Several sources
Varying Frequencies
Varying Formats
Quality check
Validations, Scrubbing
Transformations/Rules
Prune app data sources
Discard/Archive
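The quality check, validation and scrubbing steps of such a pipeline can be written directly against Hadoop. Below is a minimal sketch using the Apache Crunch Java API (listed earlier as the option for coding M/R pipelines) that keeps only well-formed comma-separated records; the expected field count, the validation rule, and the input/output paths are assumptions for illustration.

```java
import org.apache.crunch.FilterFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;

public class ScrubPipeline {
  public static void main(String[] args) {
    // Runs as a MapReduce pipeline on the cluster
    Pipeline pipeline = new MRPipeline(ScrubPipeline.class);
    PCollection<String> raw = pipeline.readTextFile(args[0]);   // e.g. /data/raw/orders

    // Validation/scrubbing step: drop records that do not have the expected shape
    PCollection<String> clean = raw.filter(new FilterFn<String>() {
      @Override
      public boolean accept(String line) {
        String[] fields = line.split(",", -1);
        return fields.length == 5 && !fields[0].trim().isEmpty();
      }
    });

    pipeline.writeTextFile(clean, args[1]);                     // e.g. /data/clean/orders
    pipeline.done();
  }
}
```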
15. ETL Engine
Data Warehouse
Data Storage
16. From Source to Business Value
The traditional pipeline: sourcing, staging, validations, scrubbing, business rules, mapping, transforms, loading (shoe-horning data into a relational fit), distribution, prep and tuning of data stores, and archiving/purging.
The cost: minutes to hours per step and hours overall, only a subset of the data retained, reliability concerns, and missed SLAs = business frustration.
17. From Source to Business Value
Significantly more data sources
Highly scalable, significantly performant data processing
New business value,
Faster time to value
18. Data Exploration
Large reservoir of data
Descriptive Statistics
Central Tendencies
Dispersion
Visualization
Surprise Me!
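As a small, local illustration of the descriptive statistics listed above (central tendency and dispersion), the sketch below computes count, min/max, mean and standard deviation over a sampled numeric column in plain Java. On a real reservoir of data these summaries would normally be pushed down to the cluster (Hive, Impala or Spark) rather than computed on one machine; the sample values here are made up.

```java
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

public class ExploreColumn {
  public static void main(String[] args) {
    // Hypothetical sample drawn from one numeric column of a much larger data set
    double[] sample = {12.0, 15.5, 9.8, 22.1, 14.3, 30.7, 11.2};

    // Central tendency and range
    DoubleSummaryStatistics stats = DoubleStream.of(sample).summaryStatistics();
    double mean = stats.getAverage();

    // Dispersion: population variance and standard deviation
    double variance = DoubleStream.of(sample)
        .map(v -> (v - mean) * (v - mean))
        .average()
        .orElse(0.0);
    double stdDev = Math.sqrt(variance);

    System.out.printf("count=%d min=%.2f max=%.2f mean=%.2f stddev=%.2f%n",
        stats.getCount(), stats.getMin(), stats.getMax(), mean, stdDev);
  }
}
```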
19. Data Exploration
Courtesy: Data Science Central
http://www.datasciencecentral.com/profiles/blogs/r-hadoop-data-analytics-heaven
23. Data Archival
Storage in Native Format
Redundancy , Replication
Easily accessible, inexpensive
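A minimal sketch of archiving a file into HDFS in its native format using the Hadoop FileSystem Java API. The local source path, target directory layout and replication factor of 3 are assumptions for illustration; redundancy comes from HDFS block replication.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ArchiveToHdfs {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path local = new Path("/var/exports/orders-2014-03-01.csv");   // hypothetical source file
    Path archive = new Path("/archive/orders/2014/03/01/orders.csv");

    // Copy the raw file as-is (native format) and rely on HDFS replication for redundancy
    fs.mkdirs(archive.getParent());
    fs.copyFromLocalFile(local, archive);
    fs.setReplication(archive, (short) 3);

    fs.close();
  }
}
```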
24. 1 Mainstream Big Data
2 Real World Use Cases and Applications
3 Practical Adoption: Opportunity Identification
4 Big Data 2.0 – What’s on the Horizon?
5 Conclusion
25. Practical Adoption
Big Data Technologies don’t solve all problems
Leveraging existing investments
Complexities of existing systems
26. Proof of Concept
Use your own data – realistic results
Focus on very specific pain points
Know what you are going to measure
27. Data Processing Engine
Data Warehouse
Data Storage
28. Data Processing Engine
Data Warehouse
Data Storage
Keep all your raw data
Cheaper Hardware
Low cost per byte $$
High value per byte
Offload from RDBMS
Improve scale, performance
Leverage existing tools
29. Hardware on a budget
Master ($5,000): 12 cores, 32 GB RAM, 2 TB SATA drives, 7.2K RPM
Workers ($5,000 each): 4 nodes, each with 12 cores, 16 GB RAM, 4 TB SATA drives, 7.2K RPM
4-port 10 Gig switch: $1,500
Grand total: $5,000 + 4 x $5,000 + $1,500 = $26,500, comfortably under $30,000
Software costs? $0
30. Exploratory BI / Analysis
Data Storage
Makes Data exploration practically cheaper and faster
Use existing visualization tools (Tableau or other)
Check for integration with R
31. Data Architecture
•Single Important factor
•Don’t miss technology trends
But ….
It’s more about the battle plan
32. 1 Mainstream Big Data
2 Real World Use Cases and Applications
3 Practical Adoption: Opportunity Identification
4 Big Data 2.0 – What’s on the Horizon?
5 Conclusion
33. SQL on Hadoop
Impala (Cloudera): MPP engine
Tez (Hortonworks): SQL on Hive
Phoenix (Apache): SQL on HBase
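These SQL-on-Hadoop engines can be reached from existing Java tooling over plain JDBC. The sketch below targets HiveServer2 (Impala exposes a similar JDBC endpoint on a different port); the host, port, credentials, table and query are assumptions, and the Hive JDBC driver needs to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlOnHadoopQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver and URL; host/port/database are placeholders
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "etl_user", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT product_id, COUNT(*) AS orders " +
             "FROM orders GROUP BY product_id ORDER BY orders DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("product_id") + "\t" + rs.getLong("orders"));
      }
    }
  }
}
```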
34. In memory and Real Time
Spark: 100x faster than M/R
Storm: event processing
Apache Drill: low-latency ad hoc queries; interactive queries at scale
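For comparison with the MapReduce word count earlier, here is the same aggregation as a minimal sketch in Spark's Java API (Spark 2.x signatures), which keeps intermediate data in memory between stages. The local[*] master setting and the paths are placeholders; on a cluster the job would typically run under YARN.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // "local[*]" is only for trying the job out on one machine
    SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile(args[0]);             // e.g. an HDFS directory
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);
      counts.saveAsTextFile(args[1]);
    }
  }
}
```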
35. 1 Mainstream Big Data
2 Real World Use Cases and Applications
3 Practical Adoption: Opportunity Identification
4 Big Data 2.0 – What’s on the Horizon?
5 Conclusion
36. Where can I get Hadoop?
Distributors
Open Source Apache Project
And these guys…
Cloud
37. Conclusion
The Power & Paradigm of Distributed Computing
“Nativity” of Data – Unlearn old notions
Identify, understand your data processing pipeline
POC with a measurable, specific use case
Data Architecture – key to sustainable scalability
Stay informed