Architectural Patterns and Best
Practices : #BigData #Hadoop
Srividhya Balasubramaniam @ Data and Information Management Consultant
Srividhya.logic@gmail.com
Ice Breaker
120 Sec
Shhhhh!
Agenda
• Why are enterprises rethinking their data strategy?
• Modernizing Enterprise Data Warehouses
• Architectural Patterns and Design Considerations
• Best Practices
Analytics Architecture
Application Architecture
Platform Architecture
“Because we have been doing
stuff this way for ages!…… ”
is not the norm
Re-Think!
Drivers of Change / What Has Not Changed
DATA QUALITY AND GOVERNANCE
INFORMATION SECURITY
METADATA MANAGEMENT
DATA SOURCES
DATA STORE
DATA ACCESS
ORCHESTRATION AND SCHEDULING
Challenges?
Velocity, Variety and Volume
What is the right tool? How should I use the tool?
Reference architecture?
What language and tool should I learn?
Why? Why? Why? Why?
What is data modelling like in Hadoop?
Buy or build?
Core Design Principles
• What business problem is being solved?
• Define tool selection criteria
• Decouple processing, store and systems
• Hybrid architecture: leverage batch and stream
• Scalable, reliable, fit for purpose, secure
• Available, very low admin cost
• Supportable, with operations monitoring
• Best design is cheap
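One way to make "define tool selection criteria" concrete is a weighted scoring matrix. The sketch below is only an illustration: the criteria names are taken from the principles above, while the weights, the 1 to 5 ratings and the tool names `tool_a` and `tool_b` are hypothetical.

```python
from typing import Dict


def score_tool(weights: Dict[str, float], ratings: Dict[str, float]) -> float:
    """Weighted average of ratings; criteria missing from ratings count as 0."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * ratings.get(name, 0.0) for name, w in weights.items()) / total_weight


# Hypothetical weights and ratings, for illustration only.
weights = {"scalable": 3, "reliable": 3, "secure": 2, "low_admin_cost": 2}
candidates = {
    "tool_a": {"scalable": 5, "reliable": 4, "secure": 3, "low_admin_cost": 2},
    "tool_b": {"scalable": 3, "reliable": 4, "secure": 4, "low_admin_cost": 5},
}

# Rank candidates from best to worst weighted score.
ranked = sorted(candidates, key=lambda name: score_tool(weights, candidates[name]), reverse=True)
```

Agreeing on the weights before rating any tool keeps the comparison tied to the business problem rather than to a favourite product.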
Typical Data Pipeline
Data Source Ingest
•RDBMS
•SEARCH
•FILES/API
•MESSAGING
•IOT/STREAM
Store Raw
•DATABASE
•SEARCH DOCUMENTS
•DIST FILE STORAGE
•QUEUE
•STREAM STORE
Process for Analysis
•BATCH
•INTERACTIVE
•STREAMING
•MESSAGING
•MACHINE LEARNING
Store
•Key Value
•Graph
•Document
•Queue
•MPP
Insights
•Analytical Models
•Visualization
•Self Service BI
Storage of Messaging and Streaming
Criteria
1. How Distributed Services are managed
2. Guaranteed Ordering
3. Data Delivery
4. Data Retention Period
5. Availability
6. Scalability
7. Throughput
8. Parallel Clients
9. Object Size
10. Stream Map Reduce
11. Cost
E.g. Apache Kafka
• Guaranteed ordering, parallel clients and stream MR
• Configurable data retention, availability, object size
• Low cost but more admin
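In Kafka the ordering guarantee applies within a partition, not across a whole topic. The toy in-memory log below sketches that idea only; it is not Kafka's API, and the class name `PartitionedLog` and the use of `crc32` as the partitioner are assumptions made for illustration.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


class PartitionedLog:
    """Toy in-memory log: records with the same key land in the same partition."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self.partitions: Dict[int, List[Tuple[str, str]]] = defaultdict(list)

    def partition_for(self, key: str) -> int:
        # crc32 is a stand-in for whatever partitioner a real client uses.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def append(self, key: str, value: str) -> int:
        """Append a record and return its offset within the partition."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return len(self.partitions[p]) - 1

    def read(self, partition: int, offset: int = 0) -> List[Tuple[str, str]]:
        """Read a partition from an offset; each consumer tracks its own offset."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=4)
for i in range(3):
    log.append("sensor-1", f"reading-{i}")
    log.append("sensor-2", f"reading-{i}")

# All records for one key sit in one partition, in the order they were appended.
p = log.partition_for("sensor-1")
sensor_1_values = [v for k, v in log.read(p) if k == "sensor-1"]
```

Because consumers read partitions independently, the number of partitions also bounds how many parallel clients can share the work.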
Databases: What DB Export to Choose
1. File Size
2. Network Bandwidth
3. Partitioning
4. Bulk Loading
5. CDC and Delta Data Transfers
6. Native connectors and specific connectors for distribution
Adaptors and GoldenGate etc.
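A delta transfer can be sketched with a simple high-watermark query. The example below uses an in-memory SQLite database; the `customers` table and its `updated_at` column are hypothetical. Polling a timestamp like this does not see deleted rows, which is one reason log-based CDC tools exist.

```python
import sqlite3
from typing import List, Tuple


def extract_delta(conn: sqlite3.Connection, last_watermark: int) -> Tuple[List[tuple], int]:
    """Return rows changed since last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "alice", 100), (2, "bob", 105), (3, "carol", 110)],
)

# First run pulls everything; the watermark moves to the latest change seen.
first_batch, watermark = extract_delta(conn, last_watermark=0)

# A later change in the source is the only row picked up by the next run.
conn.execute("UPDATE customers SET name = 'bobby', updated_at = 120 WHERE id = 2")
second_batch, watermark = extract_delta(conn, last_watermark=watermark)
```

Persisting the watermark between runs is what turns a bulk load into an incremental one.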
Data Storage – Distributed Files Criteria
1. Average Latency
2. Typical Data Stored
3. Typical Item Size
4. Request Rate
5. Storage Cost per GB / timeframe
6. Durability
7. Availability
8. Native support for toolsets
9. Active community and open source
Enterprise Distributions Selection
Cloudera, Hortonworks, MapR
Data Storage Selection Criteria
Data Structure: Fixed, Key Value, JSON
Access Patterns: Hierarchical, Structured, Search, Publish etc.
Data Temperature: Hot, Warm, Cold
TCO: Low
Elastic
Cache
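Data temperature can be turned into a routing rule. The sketch below classifies a dataset by how recently it was accessed; the 7-day and 90-day thresholds and the mapping from temperature to store type are hypothetical and would need tuning per workload.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; real cut-offs depend on workload and storage cost.
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)

# Hypothetical mapping from temperature to a class of store.
TIER_FOR = {
    "hot": "cache / in-memory",
    "warm": "NoSQL or SQL store",
    "cold": "distributed file storage",
}


def temperature(last_accessed: datetime, now: datetime) -> str:
    """Classify data as hot, warm or cold from the time since last access."""
    age = now - last_accessed
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"


now = datetime(2016, 1, 31)
last_access = {
    "clickstream_today": datetime(2016, 1, 30),
    "orders_last_month": datetime(2015, 12, 20),
    "logs_2014": datetime(2014, 6, 1),
}
tiers = {name: TIER_FOR[temperature(ts, now)] for name, ts in last_access.items()}
```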
Data Storage Selection Criteria
Cache, NoSQL, SQL, Search
1. Average Latency (ms, sec, min, hours)
2. Typical Volume Stored (GB, TB, PB)
3. Typical Item Size (B, KB, TB, PB)
4. Query Request Rate (High to Very Low)
5. Storage and Maintenance Cost (High – Low)
6. Durability (Low – Very High)
7. Availability (High – Very High)
BATCH INTERACTIVE STREAMING MESSAGING
Machine Learning
Spark ML
EMR etc
Criteria
1. Programming Language Support
2. Availability
3. Speed
4. Scale
5. Query Latency
6. Data Volume
7. Storage Support
8. SQL?
Temperature of Data
Buy Vs Build ETL Decision?
Create Analytical Application
Make Insights Available Via API
Analysis and Visualization
Zeppelin, Hue etc.
Publish to Queue
Data Modelling in Hadoop &
Architectural Patterns
Not only ER and Dimension Models (NoERDM)
Data Storage Format
Text
Sequence
Avro
Parquet
RC/ORC
Know the strengths and weaknesses of each format in terms of:
Supporting Distributions
Processing requirements – Write, partial read, full read
Schema Evolution
Extract Requirements
Storage Requirements – How big are your files?
How important is file splittability?
Does block compression matter?
Does the file format support indexing?
How easy is it to parse?
Does it support column stats?
Failure behavior for various file formats
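The partial read and column stats questions are easier to see with a small example. The sketch below contrasts a row-oriented layout (the style of Avro and SequenceFile) with a column-oriented one (the style of Parquet and ORC) using plain Python structures; it does not use any of those formats' libraries, and the sample records are made up.

```python
from typing import Dict, List

rows = [
    {"id": 1, "country": "IN", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 25.5},
    {"id": 3, "country": "IN", "amount": 4.5},
]


def to_columns(records: List[dict]) -> Dict[str, list]:
    """Pivot row-oriented records into one list per column."""
    columns: Dict[str, list] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is touched even though one field is needed.
total_from_rows = sum(record["amount"] for record in rows)

# Columnar layout: only the "amount" column is read (a partial read).
total_from_columns = sum(columns["amount"])

# Per-column statistics of the kind columnar formats keep so readers can skip data.
amount_stats = {"min": min(columns["amount"]), "max": max(columns["amount"])}
```

Write-heavy ingest tends to favour row layouts, while analytical queries that read a few columns out of many tend to favour columnar ones.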
Compression Codecs
ZLIB
LZO
LZF
Snappy
Gzip
Bzip2
Considerations
How much does the size reduce?
How fast can it compress and decompress?
How can I split my compressed files? File splittability makes use of parallelism
Compression types
Uncompressed
Record compressed
Block compressed
We trade I/O Loads for CPU Loads
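The size, speed and splittability considerations can be measured directly. The sketch below uses only the codecs that ship with Python's standard library (`zlib` and `bz2`); Snappy, LZO and LZF need third-party packages and are left out. The sample data and the 64 KB block size are arbitrary choices for illustration.

```python
import bz2
import time
import zlib
from typing import Callable, List

sample = b"2016-01-01,store-42,sku-1001,3,19.99\n" * 20000


def measure(name: str, compress: Callable[[bytes], bytes]) -> dict:
    """Record compression ratio (smaller is better) and wall-clock time."""
    start = time.perf_counter()
    packed = compress(sample)
    elapsed = time.perf_counter() - start
    return {"codec": name, "ratio": len(packed) / len(sample), "seconds": elapsed}


results = [measure("zlib", zlib.compress), measure("bz2", bz2.compress)]


def block_compress(data: bytes, block_size: int) -> List[bytes]:
    """Compress fixed-size blocks independently so each can be read on its own."""
    return [zlib.compress(data[i:i + block_size]) for i in range(0, len(data), block_size)]


blocks = block_compress(sample, block_size=64 * 1024)

# Any single block decompresses without the others, so workers could process
# blocks in parallel. Records may straddle block edges in this toy version.
third_block = zlib.decompress(blocks[2])
restored = b"".join(zlib.decompress(b) for b in blocks)
```

Compressing the file as one stream usually gives a slightly better ratio, but then a reader has to start from the beginning, which is the splittability cost the considerations above refer to.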
Other Practices
1. Structure and Organize your repository
a. Standard directory structure
b. Access quota controls
c. Stage area conventions
2. Location of HDFS files
a. Directory structure should simplify the assignment of permissions to be granted.
b. E.g. /user, /etl, /tmp, /data, /app, /metadata
3. Partitioning, bucketing and denormalization
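Partitioning and bucketing both come down to deriving a storage location from the data. The sketch below builds a path under the `/data` directory mentioned above: the `dt=` directory follows the key=value naming used by Hive-style partitioning, while the `bucket=` suffix, the `orders` table and the use of `crc32` as the hash are purely illustrative assumptions.

```python
import zlib
from datetime import date


def bucket_for(key: str, num_buckets: int) -> int:
    """Deterministic bucket from a hash of the key (crc32 used for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def target_path(base: str, table: str, event_date: date, customer_id: str, num_buckets: int) -> str:
    """Build a partition-by-date, bucket-by-customer location under a base directory."""
    bucket = bucket_for(customer_id, num_buckets)
    return f"{base}/{table}/dt={event_date.isoformat()}/bucket={bucket:03d}"


path = target_path("/data", "orders", date(2016, 1, 31), "customer-17", num_buckets=8)
```

Partitioning by date lets a query skip whole directories, and a stable bucket per key keeps all rows for one customer together, which helps joins and sampling.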
Data Lake / Reservoir / Refinery
Exploratory Data Analysis
Application Level Analytics
Batch and Stream Analytics – Lambda Architecture
Enterprise Data Pipeline
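Of the patterns listed above, the Lambda Architecture is the easiest to sketch in a few lines: a batch layer recomputes views from the full dataset, a speed layer covers recent events, and queries merge the two. The page-view counting use case and all names below are hypothetical.

```python
from collections import Counter
from typing import Iterable


def build_batch_view(events: Iterable[str]) -> Counter:
    """Batch layer: recompute counts from the full, immutable master dataset."""
    return Counter(events)


class SpeedLayer:
    """Speed layer: incremental counts for events not yet covered by a batch run."""

    def __init__(self) -> None:
        self.realtime_view: Counter = Counter()

    def on_event(self, page: str) -> None:
        self.realtime_view[page] += 1

    def reset(self) -> None:
        # Called once a batch run has absorbed the recent events.
        self.realtime_view.clear()


def query(batch_view: Counter, speed: SpeedLayer, page: str) -> int:
    """Serving layer: merge the batch view with the real-time view."""
    return batch_view[page] + speed.realtime_view[page]


master_dataset = ["home", "cart", "home", "checkout"]
batch_view = build_batch_view(master_dataset)

speed = SpeedLayer()
for page in ["home", "cart"]:  # events that arrived after the last batch run
    speed.on_event(page)

home_views = query(batch_view, speed, "home")
```

The price of this pattern is maintaining the same logic in two code paths, which is worth weighing against the low admin cost principle stated earlier.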
Thank You!
Questions?
remakingyourselfpresentation-250430095415-6476ade1.pptx
lakhmanpindariya9176
 
CHAPTER 7 - Foreign Direct Investment.pptx
CHAPTER 7 - Foreign Direct Investment.pptxCHAPTER 7 - Foreign Direct Investment.pptx
CHAPTER 7 - Foreign Direct Investment.pptx
72200337
 
Lecture 4.pptx which is need for microeconomic
Lecture 4.pptx which is need for microeconomicLecture 4.pptx which is need for microeconomic
Lecture 4.pptx which is need for microeconomic
mdrakibhasan1427
 
Top Business Schools in Delhi For Quality Education
Top Business Schools in Delhi For Quality EducationTop Business Schools in Delhi For Quality Education
Top Business Schools in Delhi For Quality Education
top10privatecolleges
 
sorcesofdrugs-160228074 56 4246643544 (3).ppt
sorcesofdrugs-160228074 56 4246643544 (3).pptsorcesofdrugs-160228074 56 4246643544 (3).ppt
sorcesofdrugs-160228074 56 4246643544 (3).ppt
IndalSatnami
 
History of Entomology and current updates of entomology.pptx
History of Entomology and current updates of entomology.pptxHistory of Entomology and current updates of entomology.pptx
History of Entomology and current updates of entomology.pptx
Neelesh Raipuria
 
Placement cell of college - why choose me
Placement cell of college - why choose mePlacement cell of college - why choose me
Placement cell of college - why choose me
mmanvi024
 
Best Fashion Designing Colleges in Delhi
Best Fashion Designing Colleges in DelhiBest Fashion Designing Colleges in Delhi
Best Fashion Designing Colleges in Delhi
top10privatecolleges
 

Architecture styles and deployment on Hadoop

  • 1. Architectural Patterns and Best Practices: #BigData #Hadoop. Srividhya Balasubramaniam, Data and Information Management Consultant (Srividhya.logic@gmail.com)
  • 3. Agenda
      • Why are enterprises rethinking their data strategy?
      • Modernizing enterprise data warehouses
      • Architectural patterns and design considerations
      • Best practices
      • Analytics Architecture, Application Architecture, Platform Architecture
  • 4. “Because we have been doing stuff this way for ages!” is not the norm. Re-think!
  • 5. Drivers of Change / What Has Not Changed: Data Quality and Governance; Information Security; Metadata Management; Data Sources; Data Store; Data Access; Orchestration and Scheduling
  • 7. What is the right tool? How should I use the tool? Reference architecture? What language and tool should I learn? Why? Why? Why? Why? What is data modelling like in Hadoop? Buy or build?
  • 8. Core Design Principles
      • What business problem is being solved?
      • Define tool selection criteria
      • Decouple processing, store and systems
      • Hybrid architecture: leverage batch and stream
      • Scalable, reliable, fit for purpose, secure
      • Available, very low admin cost
      • Supportable, with operations monitoring
      • The best design is cheap
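
One way to make "define tool selection criteria" concrete is a weighted scoring matrix. The sketch below is illustrative only: the criterion names echo the principles above, but the weights, the candidate names (`tool_a`, `tool_b`) and the 1 to 5 scores are hypothetical and would have to come from your own evaluation.

```python
from typing import Dict, List

# Hypothetical weights for criteria named on the slide (they sum to 1.0).
WEIGHTS: Dict[str, float] = {
    "scalable": 0.25,
    "reliable": 0.25,
    "secure": 0.20,
    "low_admin_cost": 0.15,
    "supportable": 0.15,
}


def weighted_score(scores: Dict[str, int], weights: Dict[str, float] = WEIGHTS) -> float:
    """Return the weighted sum of 1-5 scores; every criterion must be scored."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"unscored criteria: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


def rank_tools(candidates: Dict[str, Dict[str, int]]) -> List[str]:
    """Return candidate names ordered from best to worst weighted score."""
    return sorted(candidates, key=lambda name: weighted_score(candidates[name]), reverse=True)


# Hypothetical candidates and scores, for illustration only.
CANDIDATES = {
    "tool_a": {"scalable": 5, "reliable": 4, "secure": 3, "low_admin_cost": 2, "supportable": 3},
    "tool_b": {"scalable": 3, "reliable": 4, "secure": 4, "low_admin_cost": 5, "supportable": 4},
}

if __name__ == "__main__":
    for name in rank_tools(CANDIDATES):
        print(name, round(weighted_score(CANDIDATES[name]), 2))
```

Writing the weights down before scoring any tool keeps the choice tied to the business problem instead of to whichever tool is already familiar.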
  • 9. Typical Data Pipeline
      • Data Source Ingest: RDBMS, Search, Files/API, Messaging, IoT/Stream
      • Store Raw: Database, Search Documents, Distributed File Storage, Queue, Stream Store
      • Process for Analysis: Batch, Interactive, Streaming, Messaging, Machine Learning
      • Store: Key Value, Graph, Document, Queue, MPP
      • Insights: Analytical Models, Visualization, Self Service BI
    Storage of Messaging and Streaming
      • Criteria: 1. How distributed services are managed 2. Guaranteed ordering 3. Data delivery 4. Data retention period 5. Availability 6. Scalability 7. Throughput 8. Parallel clients 9. Object size 10. Stream MapReduce 11. Cost
      • E.g. Apache Kafka: guaranteed ordering, parallel clients and stream MapReduce; configurable data retention, availability and object size; low cost but more admin
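
The "guaranteed ordering" and "parallel clients" criteria interact: Kafka orders records within a partition, not across a whole topic, so records that must stay in order need to share a key. The sketch below is a plain-Python simulation of that idea, not the Kafka client API; the `crc32` hash is a stand-in for Kafka's own partitioner, and the sensor events are made up.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (key, value)


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. crc32 is a deterministic stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(events: List[Record], num_partitions: int) -> Dict[int, List[Record]]:
    """Append each record to the log of the partition chosen by its key."""
    partitions: Dict[int, List[Record]] = defaultdict(list)
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


def per_key_order(partitions: Dict[int, List[Record]]) -> Dict[str, List[str]]:
    """Read every partition independently and collect the values seen per key."""
    seen: Dict[str, List[str]] = defaultdict(list)
    for log in partitions.values():
        for key, value in log:
            seen[key].append(value)
    return dict(seen)


# Hypothetical events: three sensors emitting readings t1, t2, t3 interleaved.
EVENTS: List[Record] = [
    ("sensor-1", "t1"), ("sensor-2", "t1"), ("sensor-1", "t2"),
    ("sensor-3", "t1"), ("sensor-2", "t2"), ("sensor-1", "t3"),
]

if __name__ == "__main__":
    print(per_key_order(publish(EVENTS, num_partitions=3)))
```

Each partition can be read by a different client in parallel, and every key still sees its own events in publish order.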
  • 10. Typical Data Pipeline: Databases
      • What DB export to choose: 1. File size 2. Network bandwidth 3. Partitioning 4. Bulk loading 5. CDC and delta data transfers 6. Native connectors and specific connectors for distribution adaptors, GoldenGate, etc.
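
For "CDC and delta data transfers", the simplest approach is a high-watermark query that pulls only rows changed since the last run. The sketch below assumes a hypothetical `orders` table with an `updated_at` column and uses an in-memory SQLite database as the source. It is not log-based CDC of the kind GoldenGate provides, and it does not capture deletes.

```python
import sqlite3
from typing import List, Tuple


def setup_source() -> sqlite3.Connection:
    """Create a hypothetical source table with a change-timestamp column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 10.0, 100), (2, 20.0, 105), (3, 30.0, 110)],
    )
    conn.commit()
    return conn


def extract_delta(conn: sqlite3.Connection, last_watermark: int) -> Tuple[List[tuple], int]:
    """Return rows changed after last_watermark and the new watermark to store."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Keep the old watermark when nothing changed, so the next run starts there.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


if __name__ == "__main__":
    source = setup_source()
    full_load, watermark = extract_delta(source, last_watermark=0)
    print(len(full_load), "rows in initial load; watermark =", watermark)
```

The watermark must be persisted between runs; a job that loses it will either reload everything or miss changes.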
  • 11. Typical Data Pipeline: Data Storage, Distributed Files
      • Criteria: 1. Average latency 2. Typical data stored 3. Typical item size 4. Request rate 5. Storage cost per GB / timeframe 6. Durability 7. Availability 8. Native support for toolsets 9. Active community and open source
      • Enterprise distribution selection: Cloudera, Hortonworks, MapR
  • 12. Typical Data Pipeline: Data Storage Selection Criteria
      • Data structure: fixed, key value, JSON
      • Access patterns: hierarchical, structured, search, publish, etc.
      • Data temperature: hot, warm, cold
      • TCO: low
      • Elastic Cache
  • 13. Typical Data Pipeline: Data Storage Selection Criteria
      • Store types compared: Cache, NoSQL, SQL, Search
      • Criteria: 1. Average latency (ms, sec, min, hours) 2. Typical volume stored (GB, TB, PB) 3. Typical item size (B, KB, TB, PB) 4. Query request rate (high to very low) 5. Storage and maintenance cost (high to low) 6. Durability (low to very high) 7. Availability (high to very high)
      • Data structure: fixed, key value, JSON
      • Access patterns: hierarchical, structured, search, publish, etc.
      • Data temperature: hot, warm, cold
      • TCO: low
  • 14. Typical Data Pipeline: Process for Analysis
      • Processing types: Batch, Interactive, Streaming, Messaging, Machine Learning (Spark ML, EMR, etc.)
      • Criteria: 1. Programming language support 2. Availability 3. Speed 4. Scale 5. Query latency 6. Data volume 7. Storage support 8. SQL?
      • Temperature of data
  • 15. Typical Data Pipeline: Buy vs Build ETL Decision?
  • 16. Typical Data Pipeline: Insights
      • Create analytical applications
      • Make insights available via API
      • Analysis and visualization: Zeppelin, Hue, etc.
      • Publish to queue
  • 17. Data Modelling in Hadoop & Architectural Patterns
  • 18. Not only ER and Dimension Models (NoERDM): Data Storage Format
      • Formats: Text, Sequence, Avro, Parquet, RC/ORC
      • Know the strengths and weaknesses of each format in terms of:
          • Supporting distributions
          • Processing requirements: write, partial read, full read
          • Schema evolution
          • Extract requirements
          • Storage requirements: how big are your files?
          • How important is file splittability?
          • Does block compression matter?
          • Does the file format support indexing?
          • How easy is it to parse?
          • Does it support column stats?
          • Failure behavior of the various file formats
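
The "write, partial read, full read" distinction is the main reason to choose between row-oriented formats such as Avro and columnar formats such as Parquet or ORC. The sketch below is a plain-Python model of the two layouts, not a reader for any real file format; the three sample records are hypothetical. It counts how many field values a single-column aggregate has to touch in each layout.

```python
from typing import Dict, List, Tuple

# Hypothetical records; a row-oriented file stores them one record at a time.
ROWS: List[Dict[str, object]] = [
    {"id": 1, "country": "IN", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 20.0},
    {"id": 3, "country": "IN", "amount": 30.0},
]


def to_columnar(rows: List[Dict[str, object]]) -> Dict[str, List[object]]:
    """Pivot records into one list per column, as a columnar file would."""
    columns: Dict[str, List[object]] = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def sum_from_rows(rows: List[Dict[str, object]], column: str) -> Tuple[float, int]:
    """Sum one column from a row layout; every field of every record is read."""
    total, touched = 0.0, 0
    for row in rows:
        touched += len(row)  # the whole record is deserialized
        total += float(row[column])
    return total, touched


def sum_from_columns(columns: Dict[str, List[object]], column: str) -> Tuple[float, int]:
    """Sum one column from a columnar layout; only that column is read."""
    values = columns[column]
    return float(sum(float(v) for v in values)), len(values)


if __name__ == "__main__":
    print("row layout:", sum_from_rows(ROWS, "amount"))
    print("column layout:", sum_from_columns(to_columnar(ROWS), "amount"))
```

The trade-off runs the other way for writes and full reads: appending or fetching a whole record is a single contiguous operation in a row layout and touches every column in a columnar one.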
  • 19. Not only ER and Dimension Models (NoERDM): Compression
      • Codecs: ZLIB, LZO, LZF, Snappy, Gzip, Bzip
      • Considerations: how much the size is reduced; how fast the codec compresses and decompresses; whether compressed files can be split (file splittability is what lets processing run in parallel)
      • Compression types: uncompressed, record compressed, block compressed
      • We trade I/O load for CPU load
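
The size-versus-speed trade-off is easy to measure before committing to a codec. The sketch below covers only the codecs from the slide that ship with the Python standard library (zlib, gzip, bzip2); LZO, LZF and Snappy need third-party packages and are left out. The sample payload is made up, and timings will vary by machine, so run it against a sample of your own data.

```python
import bz2
import gzip
import time
import zlib
from typing import Callable, Dict, Tuple

Codec = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]

# Standard-library codecs only: (compress, decompress) pairs.
CODECS: Dict[str, Codec] = {
    "zlib": (zlib.compress, zlib.decompress),
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}


def benchmark(data: bytes) -> Dict[str, Dict[str, float]]:
    """Measure compressed/original size ratio and wall-clock time per codec."""
    results: Dict[str, Dict[str, float]] = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        compress_s = time.perf_counter() - start

        start = time.perf_counter()
        restored = decompress(packed)
        decompress_s = time.perf_counter() - start

        if restored != data:
            raise ValueError(f"{name} round trip changed the data")
        results[name] = {
            "ratio": len(packed) / len(data),  # smaller is better
            "compress_s": compress_s,
            "decompress_s": decompress_s,
        }
    return results


# Hypothetical, highly repetitive payload standing in for a log file.
SAMPLE = b"2024-01-01,IN,OK,42\n" * 5000

if __name__ == "__main__":
    for codec, stats in benchmark(SAMPLE).items():
        print(codec, {key: round(value, 5) for key, value in stats.items()})
```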
  • 20. Other Practices
      • 1. Structure and organize your repository: a. Standard directory structure b. Access quota controls c. Stage area conventions
      • 2. Location of HDFS files: a. The directory structure should simplify the assignment of permissions to be granted b. E.g. /user, /etl, /tmp, /data, /app, /metadata
      • 3. Partitioning, bucketing and denormalization
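
A standard directory structure and partitioning can be enforced by a small helper that builds every path. The sketch below assumes Hive-style `key=value` partition directories under the `/data` root from the list above; the dataset name and the zero-padded year/month/day scheme are illustrative choices, not something the slides prescribe.

```python
from datetime import date
from pathlib import PurePosixPath

# Top-level directories taken from the slide's example layout.
TOP_LEVEL = ("/user", "/etl", "/tmp", "/data", "/app", "/metadata")


def partition_path(dataset: str, day: date, root: str = "/data") -> PurePosixPath:
    """Build a date-partitioned directory path for a dataset under an allowed root."""
    if root not in TOP_LEVEL:
        raise ValueError(f"root must be one of {TOP_LEVEL}")
    if not dataset or "/" in dataset:
        raise ValueError("dataset must be a single, non-empty path component")
    return (
        PurePosixPath(root)
        / dataset
        / f"year={day.year:04d}"
        / f"month={day.month:02d}"
        / f"day={day.day:02d}"
    )


if __name__ == "__main__":
    print(partition_path("orders", date(2024, 3, 7)))
```

Because permissions and quotas are set per directory, keeping all writers on one path-building function is what makes the layout enforceable.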
  • 21. Data Lake / Reservoir / Refinery; Exploratory Data Analysis; Application Level Analytics; Batch and Stream Analytics (Lambda Architecture); Enterprise Data Pipeline
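
The Lambda Architecture named above combines a batch layer that periodically recomputes views over all data with a speed layer that covers events the last batch run has not yet seen; queries merge the two. The sketch below shows that merge for a page-view count. The events, the integer timestamps and the cutoff are hypothetical, and a real deployment would back each view with a batch job and a stream processor.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Event = Tuple[int, str]  # (timestamp, page)


def batch_view(master: Iterable[Event], cutoff: int) -> Counter:
    """Batch layer: recompute counts from all events up to the last batch run."""
    return Counter(page for ts, page in master if ts <= cutoff)


def speed_view(recent: Iterable[Event], cutoff: int) -> Counter:
    """Speed layer: count only the events that arrived after the batch cutoff."""
    return Counter(page for ts, page in recent if ts > cutoff)


def query(batch: Counter, speed: Counter, page: str) -> int:
    """Serving layer: merge both views to answer with up-to-date counts."""
    return batch[page] + speed[page]


# Hypothetical click events.
EVENTS: List[Event] = [(1, "home"), (2, "cart"), (3, "home"), (4, "home"), (5, "cart")]

if __name__ == "__main__":
    cutoff = 3  # the last batch run covered timestamps 1 to 3
    batch = batch_view(EVENTS, cutoff)
    speed = speed_view(EVENTS, cutoff)
    print({page: query(batch, speed, page) for page in ("home", "cart")})
```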

Editor's Notes

  • #6: Broken promise: no single version of truth. At least we can persist raw data.
  • #9 to #17: Obstacles with Big Data
  • #19: 1.1.1 Data storage format considerations. File format: know the strengths and weaknesses of each format in terms of supporting distributions; processing requirements (write, partial read, full read); schema evolution; extract requirements; and storage requirements (how big are your files?). How important is file splittability? Does block compression matter? Does the file format support indexing? How easy is it to parse? Does it support column stats? What is the failure behavior of the various file formats?
  • #20: Compression considerations. Any codec can be made splittable by using a container format. Enable compression for MapReduce intermediate steps to improve performance. Pay attention to how data is ordered: compression happens in chunks, so the entropy of each chunk matters, and ordering the data before it is stored enables better compression. Use a compact file format with support for splittability, e.g. SequenceFile and Avro.
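
The note on ordering can be checked directly: the same records compress better once similar values sit next to each other. The sketch below generates hypothetical low-cardinality records with a fixed random seed and compares the zlib-compressed size of the shuffled and the sorted data. Exact byte counts depend on the data, so none are quoted here.

```python
import random
import zlib
from typing import List


def make_records(n: int, seed: int = 7) -> List[bytes]:
    """Generate n hypothetical 'country,status' records in random order."""
    rng = random.Random(seed)
    countries = [b"IN", b"US", b"DE", b"BR", b"JP"]
    statuses = [b"OK", b"FAILED", b"RETRY"]
    return [rng.choice(countries) + b"," + rng.choice(statuses) + b"\n" for _ in range(n)]


def compressed_size(records: List[bytes]) -> int:
    """Return the zlib-compressed size in bytes of the concatenated records."""
    return len(zlib.compress(b"".join(records)))


RECORDS = make_records(20000)
UNSORTED_SIZE = compressed_size(RECORDS)
SORTED_SIZE = compressed_size(sorted(RECORDS))

if __name__ == "__main__":
    print("unsorted:", UNSORTED_SIZE, "bytes")
    print("sorted:  ", SORTED_SIZE, "bytes")
```

Sorting turns the input into long runs of identical records, which lowers the entropy of every chunk the codec sees.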