Big dataarchitecturesandecosystem+nosql

Overview of Big Data Architecture, Hadoop Ecosystem & NoSQL Databases
Khanderao Kand
CTO GloMantra Inc.
Entrepreneur and Technologist
Twitter @khanderao

Big Data Use Cases
Predictive Analytics, Recommendations, Brand/Product Management
Social CRM: Brand Analytics, Consumer Sentiment Analysis, Competition Analysis
Risk and Fraud Reduction: Financial, Intrusion, Anti-Money Laundering
Text Analytics: Patent Search
Network and Log Analysis: Intrusion Analysis
Health Analysis: Epidemics, Communicable Diseases
Intelligence Analysis: CIA, Homeland Security
Societal: Social Movements Analysis, Political Campaign Analysis

Big Data Characteristics
3 Vs (Volume, Velocity and Variety)
Variety: Text, Images, Videos, Social Web, Web Logs, ERPs, CRM
Volume: Petabytes, millions of people, billions / trillions of records
Velocity: Speed of data coming in (likes, mobile, RFID, …)
Loosely structured and distributed data
Often involves time stamped events
Incomplete / non-perfect data

Big Data is not just Hadoop
…Processing Algorithms
Log processing for frauds / intrusions / anomalies
Behavioral analysis of consumers: for Ads / targeting
Pattern recognition e.g. stock trades / weather
Machine learning / correlating events
Text Processing / Text mining / Sentiment analysis
Search
Predictive Analytics

Typical Tools
Statistical Processing (e.g. R)
Machine Learning (Apache Mahout, UIMA)
Text Processing (WEKA, Mallet)
Complex Event Processing (S4, Esper)
Data Mining / Warehousing (JDM)

Big Data vs Traditional Architecture
Three Tier Architecture (separate Application Tier and Data Tier):
User requests a report
App requests data from the Data Tier
Data Tier sends data to the App Tier
App Tier sends the report
Big Data Architecture (combined Application & Data Tier):
User launches a batch job
Master distributes the application
Master launches the app on nodes
User downloads results

Examples
Select top (10 [cor (SMP500)) from STOCKS from US
Select SUM(mentions) from twitter where hashtag = 'coke'
when ad = "coke" in period 1 day over 5 year
If buyer age=45, gender=male,
past: Nike, sports: 49ers, drinks: Budweiser
currently searching: Harley Davidson
what would he buy?

Ads Correlation
Goal: Cluster users based on their past responses to ads (not on any known or
learned attributes) and use that knowledge to serve new ads to users in the
clusters
Approach:
AdClicked events are processed by a CF (collaborative filtering) engine
UserId, AdId, click -> logged
The CF engine runs batch processing to cluster users with
similar responses to past ads
A CF-based optimization algorithm computes each user's predicted score for a given ad
Issues:
User click data is very sparse
Ads may be short-lived, so frequent CF batch runs (like indexing) are needed
Mitigation:
Is there a way to correlate user demographics with click response (currently the
correlation is low)? Can we infer a user's cluster from a demographic-based
cluster?

Collaborative Filtering
Basic Concept:
Leverage information provided by interactions of users to predict
items of interest for a user
Motivation:
What to recommend to user
Based on:
user past actions / feedback (clicks )
and
users who acted similarly to 'this' user
Advantage:
Very good results
Content / language agnostic
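
The concept above can be sketched in a few lines of Python. This is a minimal user-based variant that assumes binary click data and cosine similarity; the `clicks` data, the function names and the scoring rule are illustrative assumptions, not the engine described in these slides.

```python
from math import sqrt

# Hypothetical click log: user -> set of ad ids the user clicked.
clicks = {
    "u1": {"ad1", "ad2", "ad3"},
    "u2": {"ad1", "ad2", "ad4"},
    "u3": {"ad5"},
}

def similarity(a, b):
    """Cosine similarity between two sets of clicked ads (binary vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def recommend(user, clicks, top_n=3):
    """Score ads the user has not clicked by the similarity of users who did."""
    seen = clicks[user]
    scores = {}
    for other, ads in clicks.items():
        if other == user:
            continue
        sim = similarity(seen, ads)
        if sim == 0.0:
            continue  # users with no overlap contribute nothing
        for ad in ads - seen:
            scores[ad] = scores.get(ad, 0.0) + sim
    # Highest score first; ties broken by ad id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [ad for ad, _ in ranked[:top_n]]

print(recommend("u1", clicks))
```

Note that a user with no overlapping clicks (here `u3`) gets no recommendation at all, which is the sparsity problem in miniature.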

CF Recommendation
Figure: User x Ad matrix (users U1 … Uj, ads Ad1 … Adn) -> CF Algorithm -> Recommendation: top ads for the user

Serving the Best Ad that a User May Click or View
Figure components:
Inputs: Site Content, User Clicks, Ad Data
Stores: Cassandra / MongoDB, MySQL / Apache Jena
MapReduce jobs: MR1, MR2, MR3
Stages: Site Content Analysis and Classifier, Content Analysis, User Behavior, User-Interest, User Cluster Based AdReco
Algorithms: Text Analytics + Classifier (SVM/Bayes), Classifiers + Statistical, CF
DMZ: Freebase, OpenCalais

Types of Big Data Platforms
Type | Concepts | Size | Vendors
In-Memory Databases | Specialized I/O and Flash Memory for faster I/O; Specialized HW; Locked in | Order of TBs | Oracle Exalytics, SAP HANA, Scaleout, Kognitio
Massively Parallel Computing (MPP) | Massive Nodes; Organized data; Distributed Query; Special HW | Order of 10s of TB | Greenplum, Netezza, Teradata Aster, Sybase IQ
Map Reduce | Map and Reduce; Horizontally scalable; Commodity hardware | 100s of TB to Petabytes | Hadoop

The image was taken from the Atacama desert in western South America by Yuri
Beletsky (Las Campanas Observatory, Carnegie Institution for Science) on July 11,
2012. Copyright Yuri Beletsky

Alignment…
Explosion of data from site logs, search engines, social media…
Google published papers on Map Reduce and the Google File System, which inspired Doug Cutting, then working on Apache Lucene and Nutch; Hadoop was born
Yahoo took it further, with 1000 nodes in 2007-2008
It became possible to process very large data sets on commodity hardware
Apache open source

Main Stars
Availability: Explosion of Data
Technology:
Hadoop
Cheaper storage and hardware
Scalability with Cloud
Requirement: Business need for intelligence from the data

Hadoop
Apache Java Open Source
Google Idea, Yahoo original implementation, open sourced
Two Components:
HDFS distributed File System and
Map-Reduce Engine
Commodity Hardware
Very High Scalability

HDFS
Large Data Set
Write Once – Read Many
Fault Tolerant
Distributed File System
Name Node – Data Node
Fixed Size Data Blocks
Checksum
Files – Sequence of blocks
Replicated over Balanced Cluster
Heartbeat Report from Nodes
Figure labels: NameNode; Client 1 and Client 2 (Read, Write); Replication across Rack 1 … Rack N
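
A toy sketch of three of the ideas listed for HDFS: fixed-size blocks, per-block checksums and replication. The block size, the round-robin placement and all names are assumptions made for illustration; real HDFS uses far larger blocks and a rack-aware placement policy that is not reproduced here.

```python
import zlib

BLOCK_SIZE = 8   # bytes; tiny on purpose (real HDFS blocks are tens of MB or more)
REPLICATION = 3  # copies kept of each block

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Simplified round-robin placement: block i goes to `replication` distinct nodes."""
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the replication factor")
    placement = {}
    for i in range(num_blocks):
        placement[i] = [data_nodes[(i + r) % len(data_nodes)] for r in range(replication)]
    return placement

def checksums(blocks):
    """CRC32 per block, so a reader can detect a corrupted replica."""
    return [zlib.crc32(b) for b in blocks]

data = b"write once, read many times"
blocks = split_into_blocks(data)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
sums = checksums(blocks)
```

A file is then just the ordered sequence of its blocks: joining `blocks` gives back `data`, and any single node can be lost without losing a block.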

Hadoop Jobs-Tasks
• Move the processing (code) to the data instead of the data to the code
• The JobTracker distributes and tracks tasks
• TaskTrackers on processing nodes communicate task status to the JobTracker
• If a task does not respond, it is marked as failed and relaunched on another node
Figure labels: Job Tracker, Client 1, Client 2, Task Tracker, Task Tracker2, Rack N
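
The relaunch-on-failure behaviour can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: tasks run sequentially, a failure is modelled as a `RuntimeError`, and `run_job` and `flaky_worker` are hypothetical names, not Hadoop APIs.

```python
def run_job(tasks, nodes, run_task, max_attempts=3):
    """Assign each task to a node; on failure, relaunch it on another node."""
    results = {}
    for i, task in enumerate(tasks):
        for attempt in range(max_attempts):
            node = nodes[(i + attempt) % len(nodes)]
            try:
                results[task] = (node, run_task(node, task))
                break
            except RuntimeError:
                continue  # task marked as failed; try the next node
        else:
            raise RuntimeError(f"task {task!r} failed on {max_attempts} attempts")
    return results

# Hypothetical worker: node "n2" is down and never responds.
def flaky_worker(node, task):
    if node == "n2":
        raise RuntimeError("no response")
    return task.upper()

outcome = run_job(["a", "b", "c"], ["n1", "n2", "n3"], flaky_worker)
```

Here task `b` is first assigned to the dead node `n2` and ends up completing on `n3`.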

Map Reduce
• Two-step approach, Map and Reduce, to solving a problem
• Move the code to the data
• The Map step processes data on the nodes
• The Reduce step aggregates results from all Map nodes with a reduce algorithm
Figure: Input -> Map -> Sort / Shuffle -> Reduce -> Output
Big Data Process: MR Job
Figure: train of MR jobs; Map -> Reduce -> Output repeated five times

Big Data Stack
Figure: stack layers plotted against Speed and Scale
Infrastructure Technology Layer: Hadoop, Esper, S4, kdb, HBase, MongoDB, MySQL
Processing Algorithms: Mahout, Matlab, R, SciPy, SAS, SPSS
Applications: Patents

Big Data Logical Architecture
Figure components:
Sources: Unstructured Data, Structured Data (RDBMS), Data logs, Streams
Ingestion: ETL, Data Integration, Lucene, Nutch
Processing: Hadoop Map Reduce
Stores: No-SQL, Hadoop Based, RDBMS, SOLR
Operations: Workflow & Scheduler, System Admin, Monitoring
Consumers: Apps, BI, Visualization, Analytics Products, BI Tools - Dev

Hadoop Ecosystem (Basic)
Storage: HDFS, HBase
Processing: Map Reduce
Data Access: Sqoop, Hive, Pig, Avro / Thrift
Workflow: Oozie
Orchestration: Zookeeper, HCatalog
Also shown: Knox, Chukwa / Flume, Ambari, Nagios, Ganglia, Network
Consumers: BI, Analytics, Apps, RDBMS


Apache AVRO
RPC and serialization framework
Programming language independent
Schemas defined in JSON
Primarily used in Hadoop
Communication between nodes and data in/out of Hadoop

Apache Thrift
Interface Definition Language for RPC
Language Independent
Binary Communication format
Layered Stack enabling debugging and monitoring
No config / No centralization
Developed by Facebook
IDL needs code generation for schema change
Stack layers: Code, Service Client, Read/Write, TProtocol, TTransport

Apache Hive
SQL-like HiveQL
Warehousing Apps
Compiles to MapReduce Tasks
Facebook, Netflix, etc.
hive> CREATE TABLE ADLOG (adtime timestamp, id int, action string);
hive> SHOW TABLES;
hive> DESCRIBE ADLOG;
hive> ALTER TABLE …
hive> FROM rawlog r INSERT OVERWRITE TABLE ADLOG
SELECT TRANSFORM(r.time, r.id, r.input)
AS (adtime, id, action) USING '/bin/log' WHERE r.time > '2008-08-09';

Apache Pig Latin
Higher Level scripting above Map Reduce
Procedural (unlike SQL) but as easy as SQL
Constructs like FOREACH, GROUP
Supports User Defined Functions
From Yahoo
Good for Integrating and writing Hadoop Jobs
A = LOAD 'WordcountInput.txt';
B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir'
LOAD 'outputDir'
AS (word:chararray, count: int) `my.outputDir`;

Sqoop
Data Bulk Load
Data Import Export
RDBMS and NoSQL
HDFS, Hbase
Data is sliced
Slices are transferred via map-only jobs
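
The slicing idea can be sketched as splitting a numeric key range into contiguous sub-ranges, one per map task. This is an illustration only: the function names, the table `ADLOG` and the key `id` are assumptions, and Sqoop's actual splitting logic is not reproduced here.

```python
def split_key_range(lo, hi, num_slices):
    """Split the inclusive key range [lo, hi] into up to num_slices contiguous ranges."""
    if hi < lo or num_slices < 1:
        raise ValueError("need lo <= hi and num_slices >= 1")
    total = hi - lo + 1
    num_slices = min(num_slices, total)  # never create empty slices
    base, extra = divmod(total, num_slices)
    slices, start = [], lo
    for i in range(num_slices):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first slices
        slices.append((start, start + size - 1))
        start += size
    return slices

def slice_queries(table, key, lo, hi, num_slices):
    """One bounded SELECT per slice; each would be run by its own map task."""
    return [
        f"SELECT * FROM {table} WHERE {key} >= {a} AND {key} <= {b}"
        for a, b in split_key_range(lo, hi, num_slices)
    ]

for query in slice_queries("ADLOG", "id", 1, 10, 2):
    print(query)
```

Because the slices do not overlap and no reduce step is needed, each map task can copy its rows independently of the others.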

Chukwa
Hadoop Subproject
Large scale log processing
In/Out HDFS
Collection and analysis
Batch Oriented
Components:
Agents
Collectors
MR Jobs for Parsing & Archiving
HICC : Hadoop Infra Care Center Web App

Flume
Apache project
Large scale log processing
Supported by Cloudera
Log Stream
Components:
Agents
Channel
Clients Log4JAppender, HTTP ..
Compared with Chukwa:
Near Real time (seconds) vs Minutes
No Central Config
Figure labels: Client, Flume Agent (Source, Channel, Sink1, Sink2), downstream Agents, Sink
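
A toy model of the source, channel and sink roles, assuming an in-memory bounded channel. The `Agent` class and its methods are hypothetical and are not the Flume API; the sketch only shows how chaining agents lets log events hop from one agent to the next.

```python
from collections import deque

class Agent:
    """Toy agent: a source puts events on a channel, a sink drains them."""

    def __init__(self, sink, capacity=100):
        self.channel = deque()
        self.capacity = capacity
        self.sink = sink  # any callable that accepts one event

    def source(self, event):
        """Accept an event from a client; refuse it if the channel is full."""
        if len(self.channel) >= self.capacity:
            return False
        self.channel.append(event)
        return True

    def drain(self):
        """Deliver buffered events to the sink in arrival order."""
        delivered = 0
        while self.channel:
            self.sink(self.channel.popleft())
            delivered += 1
        return delivered

# Chain two agents: the first agent's sink is the second agent's source.
stored = []
collector = Agent(sink=stored.append)
edge = Agent(sink=collector.source, capacity=2)

accepted = [edge.source(line) for line in ["GET /a", "GET /b", "GET /c"]]
edge.drain()
collector.drain()
```

The third event is refused because the edge agent's channel holds only two events; a real client would retry or buffer it.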

Big 'Fast' Data
Real-time ad hoc query:
Inspired by Google Percolator and Dremel
Cloudera: Impala
SQL-like query on HDFS
Lower latency
Bypasses Map Reduce
Apache Drill

Apache Storm
High Volume Stream Processing
Twitter (acquired BackType)
Uses ZeroMQ
Concepts:
Spout
Bolt (like Map or Reduce)
Topology
Figure: Spouts -> Bolt (Transform) -> Bolt (Reduce) -> Bolt
Storm + Fusion Convergence – Twitter Model

NoSQL & Map Reduce
NoSQL databases provide:
Schema flexibility
Aligned programming models
High volume and scalability on commodity hardware
Eventual consistency
Interaction with real-time applications and high-velocity data
Hadoop / HDFS caters more to batch processing; its gap with operational apps can be bridged by using NoSQL, avoiding duplication and latency of data
Such integration gives NoSQL high-performance Map Reduce functionality
HBase is natively Hadoop-based
Cassandra has been augmented to work with Hadoop
MongoDB had MapReduce functionality, but it was not HDFS-based; MongoDB added HadoopBridge
