Architecture styles and deployment on Hadoop

Architectural Patterns and Best Practices: #BigData #Hadoop
Srividhya Balasubramaniam @ Data and Information Management Consultant
Srividhya.logic@gmail.com

Agenda
• Why are enterprises rethinking their data strategy?
• Modernizing Enterprise Data Warehouses
• Architectural Patterns and Design Considerations
• Best Practices
Analytics Architecture
Application Architecture
Platform Architecture

“Because we have been doing stuff this way for ages!” is not the norm.
Re-think!

Drivers of Change / What Has Not Changed
DATA QUALITY AND GOVERNANCE
INFORMATION SECURITY
METADATA MANAGEMENT
DATA SOURCES
DATA STORE
DATA ACCESS
ORCHESTRATION AND SCHEDULING

Challenges?
Velocity, Variety and Volume

• What is the right tool? How should I use the tool?
• Reference architecture?
• What language and tool should I learn?
• Why? Why? Why? Why?
• What is data modelling like in Hadoop?
• Buy or build?

Core Design Principles
• What business problem is being solved?
• Define tool selection criteria
• Decouple processing, store and systems
• Hybrid architecture: leverage batch and stream
• Scalable, reliable, fit for purpose, secure
• Available, very low admin cost
• Supportable, with operations monitoring
• Best design is cheap

Typical Data Pipeline
Data Source Ingest
•RDBMS
•SEARCH
•FILES/API
•MESSAGING
•IOT/STREAM
Store Raw
•DATABASE
•SEARCH DOCUMENTS
•DIST FILE STORAGE
•QUEUE
•STREAM STORE
Process for Analysis
•BATCH
•INTERACTIVE
•STREAMING
•MESSAGING
•MACHINE LEARNING
Store
•Key Value
•Graph
•Document
•Queue
•MPP
Insights
•Analytical Models
•Visualization
•Self Service BI

Storage of Messaging and Streaming
Criteria
1. How Distributed Services are managed
2. Guaranteed Ordering
3. Data Delivery
4. Data Retention Period
5. Availability
6. Scalability
7. Throughput
8. Parallel Clients
9. Object Size
10. Stream Map Reduce
11. Cost
E.g. Apache Kafka
• Guaranteed ordering (per partition), parallel clients and stream MapReduce
• Configurable data retention, availability and object size
• Low cost, but more administration
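
The ordering guarantee in a system such as Kafka holds within a partition, and records that share a key are routed to the same partition. The following is a minimal sketch of that idea in plain Python, with no Kafka client involved. The names `partition_for`, `publish`, `NUM_PARTITIONS` and the sample events are hypothetical, and the CRC32 hash is only a stand-in, not Kafka's actual partitioner.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed partition count, for illustration only


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition with a stable hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(events):
    """Append each (key, value) event to the log of its partition."""
    partitions = defaultdict(list)
    for key, value in events:
        partitions[partition_for(key)].append((key, value))
    return partitions


# Hypothetical events: (key, sequence number)
events = [("sensor-a", 1), ("sensor-b", 1), ("sensor-a", 2),
          ("sensor-b", 2), ("sensor-a", 3)]
logs = publish(events)

# Order is preserved per key inside a partition; there is no global order
# across partitions, which is what allows parallel clients.
for pid, log in sorted(logs.items()):
    print(pid, log)
```

Each partition can then be read by a separate client in parallel, at the price of giving up a total order across keys.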

Databases: Which DB Export to Choose
1. File Size
2. Network Bandwidth
3. Partitioning
4. Bulk Loading
5. CDC and Delta Data Transfers
6. Native connectors and distribution-specific connectors
Adaptors, GoldenGate etc.
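
As an illustration of the delta transfer criterion, here is a minimal sketch of a watermark-based incremental extract using Python's built-in `sqlite3` module. The `orders` table, its `updated_at` column and the function `extract_delta` are assumptions made up for this example; log-based CDC tools work differently and also capture deletes, which this approach does not.

```python
import sqlite3

# In-memory source database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, 100), (2, 20.0, 110), (3, 30.0, 120)],
)


def extract_delta(conn, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = max((r[2] for r in rows), default=last_watermark)
    return rows, new_watermark


# First run: watermark 0 means a full load.
full_load, watermark = extract_delta(conn, 0)

# Source changes: one update and one insert.
conn.execute("UPDATE orders SET amount = 25.0, updated_at = 130 WHERE id = 2")
conn.execute("INSERT INTO orders VALUES (4, 40.0, 140)")

# Second run: only the changed rows are transferred.
delta, watermark = extract_delta(conn, watermark)
print(delta)
```

Only the rows touched since the previous run cross the network, which is why file size and bandwidth appear in the same list of criteria.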

Data Storage – Distributed Files Criteria
1. Average Latency
2. Typical Data Stored
3. Typical Item Size
4. Request Rate
5. Storage Cost per GB / timeframe
6. Durability
7. Availability
8. Native support for toolsets
9. Active community and open source
Enterprise Distribution Selection: Cloudera, Hortonworks, MapR

Data Storage Selection Criteria
Data Structure: Fixed, Key Value, JSON
Access Patterns: Hierarchical, Structured, Search, Publish etc.
Data Temperature: Hot, Warm, Cold
TCO: Low
Elastic
Cache

Data Storage Selection Criteria
Store types: Cache, NoSQL, SQL, Search
1. Average Latency (ms, sec, min, hours)
2. Typical Volume Stored (GB, TB, PB)
3. Typical Item Size (B, KB, TB, PB)
4. Query Request Rate (High to Very Low)
5. Storage and Maintenance Cost (High – Low)
6. Durability (Low – Very High)
7. Availability (High – Very High)
Data Structure: Fixed, Key Value, JSON
Access Patterns: Hierarchical, Structured, Search, Publish etc.
Data Temperature: Hot, Warm, Cold
TCO: Low
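
Data temperature can be made concrete as a rule on how recently data was accessed. The sketch below is one possible way to express it. The thresholds `HOT_WINDOW` and `WARM_WINDOW` and the function `temperature` are hypothetical; real cut-offs depend on the workload and the cost targets.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration.
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)


def temperature(last_access: datetime, now: datetime) -> str:
    """Classify data as hot, warm or cold by time since last access."""
    age = now - last_access
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"


now = datetime(2016, 6, 1)
for last in (datetime(2016, 5, 30), datetime(2016, 4, 1), datetime(2015, 1, 1)):
    print(last.date(), temperature(last, now))
```

A rule like this can drive placement: hot data in a low-latency store, cold data in the cheapest durable store.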

BATCH | INTERACTIVE | STREAMING | MESSAGING
Machine Learning: Spark ML, EMR etc.
Criteria
1. Programming Language Support
2. Availability
3. Speed
4. Scale
5. Query Latency
6. Data Volume
7. Storage Support
8. SQL?
Temperature of Data

Buy vs. Build: ETL Decision?

Create Analytical Application
Make Insights Available via API
Analysis and Visualization: Zeppelin, Hue etc.
Publish to Queue

Data Modelling in Hadoop & Architectural Patterns

Not only ER and Dimension Models (NoERDM)
Data Storage Format
• Text
• Sequence (SequenceFile)
• Avro
• Parquet
• RC/ORC
Know the strengths and weaknesses of each format in terms of:
• Supporting distributions
• Processing requirements: write, partial read, full read
• Schema evolution
• Extract requirements
• Storage requirements: how big are your files?
• How important is file splittability?
• Does block compression matter?
• Does the file format support indexing?
• How easy is it to parse?
• Does it support column statistics?
• Failure behavior of the various file formats
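
The "partial read" requirement is where row-oriented formats (text, SequenceFile, Avro) and columnar formats (Parquet, RC/ORC) differ most. The sketch below mimics the two layouts with plain Python structures. It does not use any real file format library, and the sample records and function names are made up for illustration.

```python
import json

# Hypothetical sample records.
records = [
    {"id": 1, "country": "IN", "amount": 10.5},
    {"id": 2, "country": "US", "amount": 20.0},
    {"id": 3, "country": "IN", "amount": 7.25},
]

# Row-oriented layout: one serialized record after another.
row_store = [json.dumps(r) for r in records]

# Column-oriented layout: all values of a field stored together.
column_store = {field: [r[field] for r in records] for field in records[0]}


def total_amount_rows(rows):
    """A partial read still has to parse every full record."""
    return sum(json.loads(line)["amount"] for line in rows)


def total_amount_columns(columns):
    """A partial read touches only the column it needs."""
    return sum(columns["amount"])


print(total_amount_rows(row_store), total_amount_columns(column_store))
```

Both functions return the same total, but the columnar version never looks at `id` or `country`. The row layout, in turn, is the simpler one to append to when whole records arrive one at a time.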

Not only ER and Dimension Models (NoERDM)
Compression Codecs
• ZLIB
• LZO
• LZF
• Snappy
• Gzip
• Bzip2
Considerations
• How much does the size reduce?
• How fast can it compress and decompress?
• Can I split my compressed files? File splittability is needed to make use of parallelism.
Compression types
• Uncompressed
• Record compressed
• Block compressed
We trade I/O load for CPU load.
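
The I/O versus CPU trade-off can be measured directly. The sketch below compares size reduction and compression time using only codecs that ship with the Python standard library (`zlib` and `bz2`); Snappy, LZO and LZF need third-party packages and are left out. The sample data and the helper `measure` are assumptions for illustration, and the numbers will vary by machine and by data.

```python
import bz2
import time
import zlib

# Hypothetical, highly repetitive CSV-like payload.
data = b"2016-01-01,sensor-a,23.5,OK\n" * 50_000


def measure(name, compress, decompress):
    """Compress once, verify the round trip, report ratio and time."""
    start = time.perf_counter()
    packed = compress(data)
    seconds = time.perf_counter() - start
    assert decompress(packed) == data  # round trip must be lossless
    return {"codec": name, "ratio": len(data) / len(packed), "seconds": seconds}


results = [
    measure("zlib level 1", lambda d: zlib.compress(d, 1), zlib.decompress),
    measure("zlib level 9", lambda d: zlib.compress(d, 9), zlib.decompress),
    measure("bz2", bz2.compress, bz2.decompress),
]

for r in results:
    print(f"{r['codec']:<13} ratio={r['ratio']:.1f} time={r['seconds']:.4f}s")
```

Running the same measurement on a sample of real data is a cheap way to choose a codec before committing to one for a whole repository.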

Other Practices
1. Structure and Organize your repository
a. Standard directory structure
b. Access quota controls
c. Stage area conventions
2. Location of HDFS files
a. Directory structure should simplify the assignment of permissions to be granted.
b. E.g. /user, /etl, /tmp, /data, /app, /metadata
3. Partitioning, bucketing and denormalization.
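
A standard directory structure and a partitioning scheme can be captured in a small helper so that every job writes to predictable locations. The sketch below assumes a date-partitioned layout with `key=value` directory names, the convention Hive uses for partitions. The dataset name, the bucket count and both function names are hypothetical, and the CRC32 hash is a stand-in rather than Hive's own bucketing function.

```python
import zlib
from datetime import date


def partition_path(base: str, dataset: str, day: date) -> str:
    """Build a date-partitioned directory path under a base directory."""
    return (
        f"{base}/{dataset}"
        f"/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"
    )


def bucket_for(key: str, num_buckets: int) -> int:
    """Assign a key to one of num_buckets buckets with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


path = partition_path("/data", "orders", date(2016, 3, 5))
bucket = bucket_for("customer-42", 8)
print(path, bucket)
```

Partitioning lets a query skip whole directories, while bucketing spreads the rows inside a partition across a fixed number of files.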

Data Lake / Reservoir / Refinery
Exploratory Data Analysis
Application Level Analytics
Batch and Stream Analytics – Lambda Architecture
Enterprise Data Pipeline
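
For the batch and stream pattern listed above, the Lambda architecture keeps a batch view computed over the historical data, a real-time view over whatever arrived since the last batch run, and merges the two at query time. A minimal sketch in plain Python follows. The page-view events, the cutoff value and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical master dataset: (page, timestamp) events.
events = [("home", 1), ("cart", 2), ("home", 3), ("home", 4), ("cart", 5)]

BATCH_CUTOFF = 3  # assumed: last timestamp covered by the latest batch run


def batch_layer(events, cutoff):
    """Recompute the view from all events up to the cutoff."""
    return Counter(page for page, ts in events if ts <= cutoff)


def speed_layer(events, cutoff):
    """Cover only the events the batch run has not seen yet."""
    return Counter(page for page, ts in events if ts > cutoff)


def serve(batch_view, realtime_view, page):
    """Serving layer: merge both views at query time."""
    return batch_view[page] + realtime_view[page]


batch_view = batch_layer(events, BATCH_CUTOFF)
realtime_view = speed_layer(events, BATCH_CUTOFF)
print(serve(batch_view, realtime_view, "home"))
```

When the next batch run finishes, the cutoff moves forward and the real-time view for the covered period can be discarded.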
