REMINDER
Check in on the
COLLABORATE mobile app
Architectural Considerations for
Data Warehousing with Hadoop
Prepared by:
Mark Grover, Software Engineer
Jonathan Seidman, Solutions Architect
Cloudera, Inc.
github.com/hadooparchitecturebook/hadoop-arch-book/tree/master/ch11-data-warehousing
Session ID#: 10251
@mark_grover
@jseidman
About Us
■ Mark
▪ Software Engineer at
Cloudera
▪ Committer on Apache
Bigtop, PMC member on
Apache Sentry
(incubating)
▪ Contributor to Apache
Hadoop, Spark, Hive,
Sqoop, Pig and Flume
■ Jonathan
▪ Senior Solutions
Architect/Partner
Engineering at Cloudera
▪ Previously, Technical Lead
on the big data team at
Orbitz Worldwide
▪ Co-founder of the Chicago
Hadoop User Group and
Chicago Big Data
About the Book
■ @hadooparchbook
■ hadooparchitecturebook.com
■ github.com/hadooparchitecturebook
■ slideshare.com/hadooparchbook
Agenda
■ Typical data warehouse architecture.
■ Challenges with the existing data warehouse architecture.
■ How Hadoop complements an existing data warehouse
architecture.
■ (Very) Brief intro to Hadoop.
■ Example use case.
■ Walkthrough of example use case implementation.
Typical Data Warehouse
Architecture
Example High Level Data Warehouse
Architecture
Extract
Data
Staging
Area
Operational
Source
Systems
Load
Data
Warehouse
Data
Analysis/Visualization Tools
Transformations
Challenges with the Data
Warehouse Architecture
Challenge – ETL/ELT Processing
OLTP
Enterprise
Applications
ODS
Data Warehouse
Query
Extract
Transform
Load
Business
Intelligence
Transform
Challenges – ETL/ELT Processing
1. Slow data transformations = missed ETL SLAs.
2. Slow queries = frustrated business users.
Challenges – Data Archiving
Data
Warehouse
Tape
Archive
■ Full-fidelity data only kept for a short duration
■ Expensive or sometimes impossible to look at historical raw
data
Challenge – Disparate Data Sources
Data
Warehouse
■ How do you join data from disparate sources with EDW?
Business
Intelligence
???
Challenge – Lack of Agility
■ Responding to changing requirements, mistakes, etc.
requires lengthy processes.
Challenge – Exploratory Analysis in the
EDW
■ Difficult for users to do exploratory analysis of data in the data
warehouse.
Business
Users
Developers Analysts
Data
Warehouse
Complementing the EDW with
Hadoop
Data Warehouse Architecture with Hadoop
Extract
Hadoop
Operational
Source
Systems
EDW
BI/Analytics Tools
Logs,
machine
data, etc.
Extract
Transformation/Analysis
Load
Hadoop
ETL/ELT Optimization with Hadoop
OLTP
Enterprise
Applications
ODS
Business
Intelligence
Transform
Query
Store
ETL
Data Warehouse
Query
(High $/Byte)
Active Archiving with Hadoop
Data
Warehouse
Hadoop
Joining Disparate Data Sources with
Hadoop
Data
Warehouse
Business
Intelligence
Hadoop
Agile Data Access with Hadoop
Schema-on-Write (RDBMS):
• Prescriptive Data Modeling:
• Create static DB schema
• Transform data into RDBMS
• Query data in RDBMS format
• New columns must be added
explicitly before new data can
propagate into the system.
• Good for Known Unknowns
(Repetition)
Schema-on-Read (Hadoop):
• Descriptive Data Modeling:
• Copy data in its native format
• Create schema + parser
• Query Data in its native format
• New data can start flowing any time
and will appear retroactively once the
schema/parser properly describes it.
• Good for Unknown Unknowns
(Exploration)
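The schema-on-read idea above can be sketched in a few lines of Python. This is a minimal illustration, not how any Hadoop component is implemented: the raw lines reuse the u.user sample format, and `read_with_schema`, `schema_v1` and `schema_v2` are hypothetical names introduced here.

```python
# Raw records are stored exactly as they arrived (pipe-delimited, as in u.user).
RAW_LINES = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
]

def read_with_schema(lines, schema):
    """Apply a schema (list of (name, converter) pairs) at read time.

    The stored lines are never rewritten; fields the schema does not
    describe are simply ignored.
    """
    for line in lines:
        fields = line.split("|")
        yield {name: conv(value) for (name, conv), value in zip(schema, fields)}

# The first schema only describes the first two fields.
schema_v1 = [("user_id", int), ("age", int)]
# A later schema describes more fields; existing data gains them retroactively.
schema_v2 = schema_v1 + [("gender", str), ("occupation", str), ("zip_code", str)]

rows_v1 = list(read_with_schema(RAW_LINES, schema_v1))
rows_v2 = list(read_with_schema(RAW_LINES, schema_v2))
```

Changing the schema required no reload of the data, which is the agility argument the slide makes.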
Exploratory Analysis with Hadoop
Hadoop
Business
Users
Developers Analysts
Data
Warehouse
A Very Brief Intro to Hadoop
What is Apache Hadoop?
Has the Flexibility to Store
and Mine Any Type of Data
▪ Ask questions across structured and
unstructured data that were previously
impossible to ask or solve
▪ Not bound by a single schema
Excels at
Processing Complex Data
▪ Scale-out architecture divides
workloads across multiple nodes
▪ Flexible file system eliminates ETL
bottlenecks
Scales
Economically
▪ Can be deployed on commodity
hardware
▪ Open source platform guards
against vendor lock-in
Hadoop
Distributed File
System (HDFS)
Self-Healing, High
Bandwidth Clustered
Storage
Parallel
Processing
(MapReduce,
Spark, Impala,
etc.)
Distributed Computing
Frameworks
Apache Hadoop is an open source
platform for data storage and
processing that is…
▪ Scalable
▪ Fault tolerant
▪ Distributed
CORE HADOOP SYSTEM COMPONENTS
Oracle Big Data Appliance
■ All of the capabilities we’re talking about here are available as
part of the Oracle BDA.
Challenges of Hadoop Implementation
Other Challenges – Architectural
Considerations
Data
Sources
Ingestion
Raw Data
Storage
(Formats,
Schema)
Processed
Data
Storage
(Formats,
Schema)
Processing
Data
Consumption
Orchestration
(Scheduling,
Managing,
Monitoring)
Metadata
Management
Hadoop Third Party Ecosystem
Data
Systems
Applications
Infrastructure
Operational
Tools
Walkthrough of Example Use
Case
Use-case
■ Movielens dataset
■ Users register by entering some demographic information
▪ Users can update demographic information later on
■ Rate movies
▪ Ratings can be updated later on
■ Auxiliary information about movies is available
▪ e.g. release date, IMDB URL, etc.
Movielens data set
u.user
user id | age | gender | occupation | zip code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
Movielens data set
u.item
movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
Movielens data set
u.data
user id | item id | rating | timestamp
196|242|3|881250949
186|302|3|891717742
22|377|1|878887116
244|51|2|880606923
166|346|1|886397596
OLTP schema
Movielens data set - OLTP
Data Modeling
Data Modeling Considerations
■ We need to consider the following in our architecture:
▪ Storage layer – HDFS? HBase? Etc.
▪ File system schemas – how will we lay out the data?
▪ File formats – what storage formats to use for our data, both raw
and processed data?
▪ Data compression formats?
■ Hadoop is not a database, so these considerations will be
different from an RDBMS.
Denormalization
■ Why denormalize?
■ When to denormalize?
■ How much to denormalize?
Why Denormalize?
■ Regular joins are expensive in Hadoop
■ When you have 2 data sets, there is no guarantee that
corresponding records will be present on the same node
■ Such a guarantee exists when storing such data in a single
data set
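A minimal sketch of the idea in Python: the join is paid once at write time, so each reader finds the user attributes in the same record as the rating. The user rows come from the u.user sample; the ratings and the names `denormalize` and `user_rating_wide` are made up for illustration.

```python
# Users taken from the u.user sample; the ratings below are hypothetical.
users = {
    1: {"age": 24, "gender": "M", "occupation": "technician", "zip_code": "85711"},
    2: {"age": 53, "gender": "F", "occupation": "other", "zip_code": "94043"},
}
ratings = [
    {"user_id": 1, "movie_id": 242, "rating": 3},
    {"user_id": 2, "movie_id": 302, "rating": 4},
]

def denormalize(ratings, users):
    """Copy selected user attributes into each rating record (done once, at write time)."""
    wide = []
    for r in ratings:
        u = users[r["user_id"]]
        wide.append({**r, "user_age": u["age"], "user_occupation": u["occupation"]})
    return wide

# Each record now carries everything a query needs; no join at read time.
user_rating_wide = denormalize(ratings, users)
```

The trade-off is duplicated user data and extra work whenever a user attribute changes, which is why "how much to denormalize" is a separate question.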
When to Denormalize?
■ Well, it’s difficult to say
■ It depends
Movielens Data Set - Denormalization
Denormalize Denormalize
Data Set in Hadoop
Tracking Updates (CDC)
■ Can’t update data in place in HDFS
■ HDFS is an append-only filesystem
■ We have to track all updates
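One way to picture this, as a rough Python sketch: every change is appended to a history, and the current state is derived by keeping the newest record per key. The first two records come from the u.data sample; the third (an updated rating) and the function name `latest_records` are hypothetical.

```python
def latest_records(history, key_fields, ts_field):
    """Return the most recent record per key from an append-only history."""
    latest = {}
    for rec in history:
        key = tuple(rec[f] for f in key_fields)
        # Keep the record with the highest timestamp for each key.
        if key not in latest or rec[ts_field] >= latest[key][ts_field]:
            latest[key] = rec
    return latest

history = [
    {"user_id": 196, "movie_id": 242, "rating": 3, "ts": 881250949},
    {"user_id": 186, "movie_id": 302, "rating": 3, "ts": 891717742},
    # Hypothetical update: user 196 later changes the rating for movie 242.
    {"user_id": 196, "movie_id": 242, "rating": 5, "ts": 881999999},
]

current = latest_records(history, ("user_id", "movie_id"), "ts")
```

The history itself is never modified, so earlier values stay available for analysis.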
Tracking Updates in Hadoop
Hadoop File Types
■ Formats designed specifically to store and process data on
Hadoop:
▪ File based – SequenceFile
▪ Serialization formats – Thrift, Protocol Buffers, Avro
▪ Columnar formats – RCFile, ORC, Parquet
Final Schema in Hadoop
Our Storage Format Recommendation
■ Columnar format (Parquet) for merged/compacted data sets
▪ user, user_rating, movie
■ Row format (Avro) for history/append-only data sets
▪ user_history, user_rating_fact
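The difference between the two layouts can be illustrated with plain Python structures. This is only a conceptual sketch (it does not use Avro or Parquet); the rows come from the u.data sample and `to_columnar` is a hypothetical helper.

```python
# Row layout: each record is stored whole, which suits appending new events.
rows = [
    {"user_id": 196, "movie_id": 242, "rating": 3},
    {"user_id": 186, "movie_id": 302, "rating": 3},
    {"user_id": 22, "movie_id": 377, "rating": 1},
]

def to_columnar(rows):
    """Regroup row records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columnar(rows)

# An analytic query over one column only touches that column's values.
average_rating = sum(columns["rating"]) / len(columns["rating"])
```

This is why a columnar format fits the merged data sets that are queried analytically, while a row format fits the history data sets that are mostly appended to.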
Ingestion
Ingestion – Apache Flume
Flume Agent
▪ Sources: Twitter, logs, JMS, webserver, Kafka
▪ Interceptors: Mask, re-format, validate…
▪ Selectors: DR, critical
▪ Channels: Memory, file, Kafka
▪ Sinks: HDFS, HBase, Solr
Ingestion – Apache Kafka
Source System Source System Source System Source System
Hadoop
Security
Systems
Real-time
monitoring
Data
Warehouse
Kafka
Ingestion – Apache Sqoop
■ Apache project designed to ease import and export of data
between Hadoop and external data stores such as an
RDBMS.
■ Provides functionality to do bulk imports and exports of data.
■ Leverages MapReduce to transfer data in parallel.
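To give a feel for the parallel transfer, here is a simplified Python sketch of dividing a numeric key range among map tasks so that each task pulls its own slice. It is an illustration only; Sqoop's actual split logic may differ, and `split_ranges` is a hypothetical name.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into contiguous chunks, one per mapper."""
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    ranges, start = [], lo
    for i in range(num_mappers):
        # The first `extra` mappers take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than keys: leave the surplus mappers idle
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# Each (low, high) pair would become a WHERE clause on the split column.
chunks = split_ranges(1, 10, 4)
```

A skewed split column gives uneven chunks of real rows, so the choice of split column matters for balanced transfers.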
Client Sqoop
MapReduce Map Map Map
Hadoop
Run import Collect metadata
Generate code,
Execute MR job
Pull data
Write to Hadoop
Sqoop Import Example – Movie
sqoop import --connect \
  jdbc:mysql://mysql_server:3306/movielens \
  --username myuser --password mypass --query \
  'SELECT movie.*, group_concat(genre.name)
   FROM movie
   JOIN movie_genre ON (movie.id = movie_genre.movie_id)
   JOIN genre ON (movie_genre.genre_id = genre.id)
   WHERE $CONDITIONS
   GROUP BY movie.id' \
  --split-by movie.id --as-avrodatafile \
  --target-dir /data/movielens/movie
Data Processing
Popular Processing Engines
■ MapReduce
▪ Programming paradigm
■ Pig
▪ Workflow language based
■ Hive
▪ Batch SQL-engine
■ Impala
▪ Near real-time concurrent SQL engine
■ Spark
▪ DAG engine
Final Schema in Hadoop
Merge Updates
hive> INSERT OVERWRITE TABLE user_tmp
SELECT user.*
FROM user
LEFT OUTER JOIN user_upserts
  ON (user.id = user_upserts.id)
WHERE user_upserts.id IS NULL
UNION ALL
SELECT
  id, age, occupation, zipcode,
  TIMESTAMP(last_modified)
FROM user_upserts;
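The logic of the query above can be restated in Python as a rough sketch: keep every existing row that has no matching upsert, then append all upserted rows. The sample rows and the name `merge_upserts` are hypothetical.

```python
def merge_upserts(base_rows, upsert_rows, key="id"):
    """Mirror the LEFT OUTER JOIN ... IS NULL / UNION ALL pattern."""
    upsert_keys = {row[key] for row in upsert_rows}
    # Rows with no pending upsert are carried over unchanged.
    kept = [row for row in base_rows if row[key] not in upsert_keys]
    # Upserted rows replace their old versions or arrive as new rows.
    return kept + list(upsert_rows)

user = [
    {"id": 1, "age": 24, "occupation": "technician"},
    {"id": 2, "age": 53, "occupation": "other"},
]
user_upserts = [
    {"id": 2, "age": 54, "occupation": "other"},   # updated user
    {"id": 3, "age": 23, "occupation": "writer"},  # new user
]

user_tmp = merge_upserts(user, user_upserts)
```

As in the Hive version, the result is written to a separate table rather than modifying the original in place.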
Aggregations
user_rating_fact
user_rating
user_history
movie
user
Merge updates
One
record/user/movie
Merge updates
One record/user
avg_movie_rating
latest_trending_movies
Aggregations
hive> CREATE TABLE avg_movie_rating AS
SELECT
movie_id,
ROUND(AVG(rating), 1) AS rating
FROM
user_rating
GROUP BY
movie_id;
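For readers less familiar with SQL, the same aggregation can be sketched in Python. The ratings are made-up sample values and `avg_movie_rating` here is just a dictionary, not the Hive table. Note that Python's `round` may break ties differently from Hive's `ROUND`, so the sample avoids tie cases.

```python
from collections import defaultdict

# Hypothetical (movie_id, rating) pairs standing in for the user_rating table.
user_rating = [(242, 3), (242, 4), (302, 3), (302, 3), (302, 4)]

def average_ratings(pairs):
    """Group ratings by movie and return the mean, rounded to one decimal place."""
    by_movie = defaultdict(list)
    for movie_id, rating in pairs:
        by_movie[movie_id].append(rating)
    return {m: round(sum(rs) / len(rs), 1) for m, rs in by_movie.items()}

avg_movie_rating = average_ratings(user_rating)
```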
Export to Data Warehouse
user_rating_fact
user_rating
user_history
movie
user
Merge updates
One
record/user/movie
Merge updates
One record/user
Data
Warehouse
avg_movie_rating
latest_trending_movies
Export
Sqoop Export
sqoop export --connect \
  jdbc:mysql://mysql_server:3306/movie_dwh \
  --username myuser --password mypass \
  --table avg_movie_rating --export-dir \
  /user/hive/warehouse/avg_movie_rating \
  -m 16 --update-key movie_id --update-mode \
  allowinsert --input-fields-terminated-by \
  '\001' --lines-terminated-by '\n'
Final Architecture
Please complete the session
evaluation
Thank you!
@hadooparchbook
You may complete the session evaluation either
on paper or online via the mobile app
Data warehousing with Hadoop
What is Hadoop?
Hadoop is an open-source system designed
to store and process petabyte-scale data.
That’s pretty much what you need to know.
Well almost…
Compression Codecs
snappy
Well, maybe.
Not splittable.
X
Splittable.
Getting
better…
Very good
choice
Splittable,
but no...
Our Compression Codec Recommendation
■ Snappy for all data sets (columnar as well as row based)
File Format Choices
Data set | Storage format | Compression codec
movie | Parquet | Snappy
user_history | Avro | Snappy
user | Parquet | Snappy
user_rating_fact | Avro | Snappy
user_rating | Parquet | Snappy
hadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
hadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
hadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
hadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
hadooparchbook
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
hadooparchbook
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
hadooparchbook
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
hadooparchbook
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata London
hadooparchbook
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
hadooparchbook
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
hadooparchbook
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
hadooparchbook
 

Data warehousing with Hadoop

  • 1. REMINDER Check in on the COLLABORATE mobile app Architectural Considerations for Data Warehousing with Hadoop Prepared by: Mark Grover, Software Engineer Jonathan Seidman, Solutions Architect Cloudera, Inc. github.com/hadooparchitecturebook/hadoop-arch-book/tree/master/ch11-data-warehousing Session ID#: 10251 @mark_grover @jseidman
  • 2. About Us ■ Mark ▪ Software Engineer at Cloudera ▪ Committer on Apache Bigtop, PMC member on Apache Sentry (incubating) ▪ Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume ■ Jonathan ▪ Senior Solutions Architect/Partner Engineering at Cloudera ▪ Previously, Technical Lead on the big data team at Orbitz Worldwide ▪ Co-founder of the Chicago Hadoop User Group and Chicago Big Data
  • 3. About the Book ■ @hadooparchbook ■ hadooparchitecturebook.com ■ github.com/hadooparchitecturebook ■ slideshare.com/hadooparchbook
  • 4. Agenda ■ Typical data warehouse architecture. ■ Challenges with the existing data warehouse architecture. ■ How Hadoop complements an existing data warehouse architecture. ■ (Very) Brief intro to Hadoop. ■ Example use case. ■ Walkthrough of example use case implementation.
  • 6. Example High Level Data Warehouse Architecture Extract Data Staging Area Operational Source Systems Load Data Warehouse Data Analysis/Visualization Tools Transformations
  • 7. Challenges with the Data Warehouse Architecture
  • 8. Challenge – ETL/ELT Processing OLTP Enterprise Applications ODS Data Warehouse Query Extract Transform Load Business Intelligence Transform
  • 9. Challenges – ETL/ELT Processing OLTP Enterprise Applications ODS Data Warehouse Query Extract Transform Load Business Intelligence Transform (1) Slow Data Transformations = Missed ETL SLAs. (2) Slow Queries = Frustrated Business Users.
  • 10. Challenges – Data Archiving Data Warehouse Tape Archive ■ Full-fidelity data only kept for a short duration ■ Expensive or sometimes impossible to look at historical raw data
  • 11. Challenge – Disparate Data Sources Data Warehouse ■ How do you join data from disparate sources with EDW? Business Intelligence ???
  • 12. Challenge – Lack of Agility ■ Responding to changing requirements, mistakes, etc. requires lengthy processes.
  • 13. Challenge – Exploratory Analysis in the EDW ■ Difficult for users to do exploratory analysis of data in the data warehouse. Business Users Developers Analysts Data Warehouse
  • 14. Complementing the EDW with Hadoop
  • 15. Data Warehouse Architecture with Hadoop Extract Hadoop Operational Source Systems EDW BI/Analytics Tools Logs, machine data, etc. Extract Transformation/Analysis Load
  • 16. Hadoop ETL/ELT Optimization with Hadoop OLTP Enterprise Applications ODS Business Intelligence Transform Query Store ETL Data Warehouse Query (High $/Byte)
  • 17. Active Archiving with Hadoop Data Warehouse Hadoop
  • 18. Joining Disparate Data Sources with Hadoop Data Warehouse Business Intelligence Hadoop
  • 19. Agile Data Access with Hadoop
      Schema-on-Write (RDBMS):
        • Prescriptive Data Modeling:
          • Create static DB schema
          • Transform data into RDBMS
          • Query data in RDBMS format
        • New columns must be added explicitly before new data can propagate into the system.
        • Good for Known Unknowns (Repetition)
      Schema-on-Read (Hadoop):
        • Descriptive Data Modeling:
          • Copy data in its native format
          • Create schema + parser
          • Query data in its native format
        • New data can start flowing any time and will appear retroactively once the schema/parser properly describes it.
        • Good for Unknown Unknowns (Exploration)
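The schema-on-read idea on slide 19 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the sample lines, the field names, and the helper `read_with_schema` are hypothetical and do not come from the deck. The raw lines are stored untouched, and a schema is applied only when the data is read, so a richer schema written later also describes data that was ingested earlier.

```python
# Raw records are kept exactly as they arrived (hypothetical sample lines).
RAW_LINES = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
]


def read_with_schema(lines, schema):
    """Apply a schema (ordered field names) at read time.

    Fields that the schema does not describe yet are simply ignored,
    because zip() stops at the shorter of the two sequences.
    """
    return [dict(zip(schema, line.split("|"))) for line in lines]


# An early schema that only describes the first two fields.
v1 = read_with_schema(RAW_LINES, ["user_id", "age"])

# A later schema describes more fields. The older raw data "gains" them
# retroactively because nothing was transformed or discarded at write time.
v2 = read_with_schema(
    RAW_LINES, ["user_id", "age", "gender", "occupation", "zip_code"]
)

print(v1[0])
print(v2[0])
```

With schema-on-write, by contrast, the extra fields would have been dropped or rejected at load time unless the table had been altered first.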
  • 20. Exploratory Analysis with Hadoop Hadoop Business Users Developers Analysts Data Warehouse
  • 21. A Very Brief Intro to Hadoop
  • 22. What is Apache Hadoop? Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed.
      Core Hadoop system components:
        ▪ Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
        ▪ Parallel processing (MapReduce, Spark, Impala, etc.): distributed computing frameworks
      Has the flexibility to store and mine any type of data:
        ▪ Ask questions across structured and unstructured data that were previously impossible to ask or solve
        ▪ Not bound by a single schema
      Excels at processing complex data:
        ▪ Scale-out architecture divides workloads across multiple nodes
        ▪ Flexible file system eliminates ETL bottlenecks
      Scales economically:
        ▪ Can be deployed on commodity hardware
        ▪ Open source platform guards against vendor lock-in
  • 23. Oracle Big Data Appliance ■ All of the capabilities we’re talking about here are available as part of the Oracle BDA.
  • 24. Challenges of Hadoop Implementation
  • 25. Challenges of Hadoop Implementation
  • 26. Other Challenges – Architectural Considerations Data Sources Ingestion Raw Data Storage (Formats, Schema) Processed Data Storage (Formats, Schema) Processing Data Consumption Orchestration (Scheduling, Managing, Monitoring) Metadata Management
  • 27. Hadoop Third Party Ecosystem Data Systems Applications Infrastructure Operational Tools
  • 29. Use case ■ Movielens dataset ■ Users register by entering some demographic information ▪ Users can update demographic information later on ■ Rate movies ▪ Ratings can be updated later on ■ Auxiliary information about movies available ▪ e.g. release date, IMDb URL, etc.
  • 30. Movielens data set: u.user
      user id | age | gender | occupation | zip code
      1|24|M|technician|85711
      2|53|F|other|94043
      3|23|M|writer|32067
      4|24|M|technician|43537
      5|33|F|other|15213
      6|42|M|executive|98101
      7|57|M|administrator|91344
  • 31. Movielens data set: u.item
      movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western
      1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
      2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
      3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
  • 32. Movielens data set: u.data
      user id | item id | rating | timestamp
      196|242|3|881250949
      186|302|3|891717742
      22|377|1|878887116
      244|51|2|880606923
      166|346|1|886397596
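As a rough illustration of how the record layouts on slides 31 and 32 can be read, the Python sketch below parses one `u.item` line and one `u.data` line. The function names `parse_item` and `parse_rating` are hypothetical, and the pipe delimiter for ratings is assumed from the slide rather than from the distributed data files. The 19 trailing 0/1 flags of a movie record are decoded into genre names using the column order given on the slide, and the rating timestamp is treated as Unix epoch seconds.

```python
from datetime import datetime, timezone

# Genre columns in the order listed on the u.item slide.
GENRES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]


def parse_item(line):
    """Parse one pipe-delimited u.item record into a dict."""
    fields = line.split("|")
    # Fields 0-4 are id, title, release date, video release date, URL;
    # the genre flags follow.
    flags = fields[5:5 + len(GENRES)]
    return {
        "movie_id": int(fields[0]),
        "title": fields[1],
        "release_date": fields[2],
        "genres": [g for g, flag in zip(GENRES, flags) if flag == "1"],
    }


def parse_rating(line, delimiter="|"):
    """Parse one rating record: user id, item id, rating, epoch seconds."""
    user_id, item_id, rating, ts = line.split(delimiter)
    return {
        "user_id": int(user_id),
        "movie_id": int(item_id),
        "rating": int(rating),
        "rated_at": datetime.fromtimestamp(int(ts), tz=timezone.utc),
    }


toy_story = parse_item(
    "1|Toy Story (1995)|01-Jan-1995||"
    "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)"
    "|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0"
)
first_rating = parse_rating("196|242|3|881250949")
print(toy_story["genres"])
print(first_rating["rated_at"].isoformat())
```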
  • 36. Data Modeling Considerations ■ We need to consider the following in our architecture: ▪ Storage layer – HDFS? HBase? Etc. ▪ File system schemas – how will we lay out the data? ▪ File formats – what storage formats to use for our data, both raw and processed data? ▪ Data compression formats? ■ Hadoop is not a database, so these considerations will be different from an RDBMS.
  • 37. Denormalization ■ Why denormalize? ■ When to denormalize? ■ How much to denormalize?
  • 38. Why Denormalize? ■ Regular joins are expensive in Hadoop ■ When you have 2 data sets, there is no guarantee that corresponding records will be present on the same node ■ Such a guarantee exists when storing such data in a single data set
  • 39. When to Denormalize? ■ Well, it’s difficult to say ■ It depends
  • 40. Movielens Data Set - Denormalization Denormalize Denormalize
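The denormalization step discussed on slides 37 to 40 can be sketched in a few lines of Python. This is an assumption-laden illustration, not the deck's implementation: the sample rows and the helper `denormalize` are hypothetical. The join is paid once, when the data set is prepared, so that later queries read a single data set in which every rating already carries the user attributes it needs.

```python
# Hypothetical normalized inputs: a user dimension and a ratings fact list.
users = {
    1: {"age": 24, "occupation": "technician"},
    2: {"age": 53, "occupation": "other"},
}
ratings = [
    {"user_id": 1, "movie_id": 242, "rating": 3},
    {"user_id": 2, "movie_id": 302, "rating": 4},
    {"user_id": 1, "movie_id": 51, "rating": 5},
]


def denormalize(users, ratings):
    """Copy each user's attributes onto every one of that user's ratings."""
    return [{**rating, **users[rating["user_id"]]} for rating in ratings]


user_rating = denormalize(users, ratings)
for row in user_rating:
    print(row)
```

The trade-off is duplication: a user's attributes are repeated on every rating, which is one reason "how much to denormalize" is a separate question.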
  • 41. Data Set in Hadoop
  • 42. Tracking Updates (CDC) ■ Can’t update data in-place in HDFS ■ HDFS is an append-only filesystem ■ We have to track all updates
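One way to picture the update tracking described on slide 42 is an append-only history from which the current state is derived. The Python sketch below is a minimal illustration; `append_change`, `current_view`, and the integer timestamps are hypothetical names and values chosen for the example. Nothing is ever modified in place: each change is appended, and the latest record per user is computed when needed.

```python
def append_change(history, user_id, attrs, last_modified):
    """Record a change by appending; existing records are never edited."""
    history.append({"id": user_id, "last_modified": last_modified, **attrs})


def current_view(history):
    """Derive the current state: the most recent record for each id."""
    latest = {}
    for rec in history:
        seen = latest.get(rec["id"])
        if seen is None or rec["last_modified"] >= seen["last_modified"]:
            latest[rec["id"]] = rec
    return latest


user_history = []
append_change(user_history, 1, {"age": 24, "occupation": "technician"}, 100)
append_change(user_history, 2, {"age": 53, "occupation": "other"}, 110)
# User 1 later updates their demographic information.
append_change(user_history, 1, {"age": 25, "occupation": "engineer"}, 200)

users = current_view(user_history)
print(len(user_history), "history records,", len(users), "current users")
```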
  • 44. Hadoop File Types ■ Formats designed specifically to store and process data on Hadoop: ▪ File based – SequenceFile ▪ Serialization formats – Thrift, Protocol Buffers, Avro ▪ Columnar formats – RCFile, ORC, Parquet
  • 45. Final Schema in Hadoop
  • 46. Our Storage Format Recommendation ■ Columnar format (Parquet) for merged/compacted data sets ▪ user, user_rating, movie ■ Row format (Avro) for history/append-only data sets ▪ user_history, user_rating_fact
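The slide does not spell out the reasoning behind the recommendation, but one common argument is that analytic queries over merged data sets usually touch a few columns, while append-only history is written one whole record at a time. The Python sketch below illustrates the difference in layout only; it does not use Parquet or Avro, and the sample rows and helper names are hypothetical.

```python
# Row-oriented layout: each record is stored together (convenient for appends).
ROWS = [
    {"user_id": 196, "movie_id": 242, "rating": 3},
    {"user_id": 186, "movie_id": 302, "rating": 3},
    {"user_id": 22, "movie_id": 377, "rating": 1},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: each column is stored together.
COLUMNS = to_columns(ROWS)


def avg_rating_from_rows(rows):
    """Every full record is visited even though only one field is needed."""
    return sum(row["rating"] for row in rows) / len(rows)


def avg_rating_from_columns(columns):
    """Only the 'rating' column is read; other columns are never touched."""
    ratings = columns["rating"]
    return sum(ratings) / len(ratings)


print(avg_rating_from_rows(ROWS), avg_rating_from_columns(COLUMNS))
```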
  • 48. Ingestion – Apache Flume
      Flume Agent components:
        ▪ Sources: Twitter, logs, JMS, webserver, Kafka
        ▪ Interceptors: Mask, re-format, validate…
        ▪ Selectors: DR, critical
        ▪ Channels: Memory, file, Kafka
        ▪ Sinks: HDFS, HBase, Solr
  • 49. Ingestion – Apache Kafka Source System Source System Source System Source System Hadoop Security Systems Real-time monitoring Data Warehouse Kafka
  • 50. Ingestion – Apache Sqoop ■ Apache project designed to ease import and export of data between Hadoop and external data stores such as an RDBMS. ■ Provides functionality to do bulk imports and exports of data. ■ Leverages MapReduce to transfer data in parallel.
      Import flow: Client runs import → Sqoop collects metadata → Sqoop generates code and executes MR job → Map tasks pull data → Write to Hadoop
  • 51. Sqoop Import Example – Movie
      sqoop import \
        --connect jdbc:mysql://mysql_server:3306/movielens \
        --username myuser --password mypass \
        --query 'SELECT movie.*, group_concat(genre.name) FROM movie JOIN movie_genre ON (movie.id = movie_genre.movie_id) JOIN genre ON (movie_genre.genre_id = genre.id) WHERE ${CONDITIONS} GROUP BY movie.id' \
        --split-by movie.id \
        --as-avrodatafile \
        --target-dir /data/movielens/movie
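The `--split-by movie.id` option is what lets the import run in parallel: the range of the split column is divided among the map tasks, and each task pulls only its own slice. The Python sketch below is a simplified illustration of that idea and is not Sqoop's actual splitting code; the function `split_ranges` and the id range 1 to 1682 are assumptions made for the example.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive id range [lo, hi] into contiguous slices.

    Returns at most num_mappers (start, end) pairs that together cover
    every id exactly once, with sizes differing by at most one.
    """
    total = hi - lo + 1
    n = min(num_mappers, total)
    base, extra = divmod(total, n)
    ranges = []
    start = lo
    for i in range(n):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Hypothetical: movie ids run from 1 to 1682 and four map tasks are used.
slices = split_ranges(1, 1682, 4)
for start, end in slices:
    # Each map task would add a condition like this to its copy of the query.
    print(f"movie.id >= {start} AND movie.id <= {end}")
```

A skewed split column leads to uneven slices of work, which is why an evenly distributed key is usually preferred for splitting.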
  • 53. Popular Processing Engines ■ MapReduce ▪ Programming paradigm ■ Pig ▪ Workflow language based ■ Hive ▪ Batch SQL-engine ■ Impala ▪ Near real-time concurrent SQL engine ■ Spark ▪ DAG engine
  • 54. Final Schema in Hadoop
  • 55. Merge Updates
      hive> INSERT OVERWRITE TABLE user_tmp
            SELECT user.*
            FROM user LEFT OUTER JOIN user_upserts ON (user.id = user_upserts.id)
            WHERE user_upserts.id IS NULL
            UNION ALL
            SELECT id, age, occupation, zipcode, TIMESTAMP(last_modified)
            FROM user_upserts;
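The logic of the merge query on slide 55 can be restated as a small Python sketch, with hypothetical sample rows and a hypothetical helper `merge_upserts`. The left outer join filtered on `IS NULL` keeps the existing users that have no pending change, and the `UNION ALL` then adds every row from the upserts, which covers both updated and brand-new users.

```python
def merge_upserts(base, upserts):
    """Keep base rows without a pending upsert, then add all upsert rows."""
    upsert_ids = {row["id"] for row in upserts}
    # Equivalent of: LEFT OUTER JOIN ... WHERE user_upserts.id IS NULL
    unchanged = [row for row in base if row["id"] not in upsert_ids]
    # Equivalent of: UNION ALL SELECT ... FROM user_upserts
    return unchanged + list(upserts)


user = [
    {"id": 1, "age": 24, "occupation": "technician"},
    {"id": 2, "age": 53, "occupation": "other"},
]
user_upserts = [
    {"id": 2, "age": 54, "occupation": "other"},   # update to an existing user
    {"id": 3, "age": 23, "occupation": "writer"},  # newly registered user
]

user_tmp = merge_upserts(user, user_upserts)
for row in user_tmp:
    print(row)
```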
  • 57. Aggregations
      hive> CREATE TABLE avg_movie_rating AS
            SELECT movie_id, ROUND(AVG(rating), 1) AS rating
            FROM user_rating
            GROUP BY movie_id;
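For readers less familiar with SQL, the aggregation on slide 57 amounts to grouping ratings by movie and averaging them to one decimal place. The Python sketch below shows the same computation on hypothetical sample rows; note that Python's `round` and Hive's `ROUND` may break exact ties differently, so this is an approximation of the query rather than a drop-in equivalent.

```python
from collections import defaultdict


def avg_movie_rating(user_rating):
    """Group ratings by movie_id and return the average to one decimal."""
    totals = defaultdict(lambda: [0, 0])  # movie_id -> [sum, count]
    for row in user_rating:
        entry = totals[row["movie_id"]]
        entry[0] += row["rating"]
        entry[1] += 1
    return {
        movie_id: round(total / count, 1)
        for movie_id, (total, count) in totals.items()
    }


# Hypothetical sample of the user_rating data set.
user_rating = [
    {"user_id": 196, "movie_id": 242, "rating": 3},
    {"user_id": 186, "movie_id": 242, "rating": 4},
    {"user_id": 22, "movie_id": 302, "rating": 3},
    {"user_id": 244, "movie_id": 302, "rating": 3},
    {"user_id": 166, "movie_id": 302, "rating": 5},
]

averages = avg_movie_rating(user_rating)
print(averages)
```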
  • 58. Export to Data Warehouse
  • 59. user_rating_fact user_rating user_history movie user Merge updates One record/user/movie Merge updates One record/user Data Warehouse avg_movie_rating latest_trending_movies Export
  • 60. Sqoop Export
      sqoop export \
        --connect jdbc:mysql://mysql_server:3306/movie_dwh \
        --username myuser --password mypass \
        --table avg_movie_rating \
        --export-dir /user/hive/warehouse/avg_movie_rating \
        -m 16 \
        --update-key movie_id --update-mode allowinsert \
        --input-fields-terminated-by '\001' \
        --lines-terminated-by '\n'
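The combination of `--update-key movie_id` and `--update-mode allowinsert` gives the export upsert behaviour: rows whose key already exists in the target table are updated, and the rest are inserted. The Python sketch below models the target table as a dictionary to show that behaviour; the function `export_allowinsert` and the sample values are hypothetical and say nothing about how Sqoop issues the underlying SQL.

```python
def export_allowinsert(target, rows, update_key):
    """Upsert rows into target (a dict keyed by update_key).

    Returns (updated, inserted) counts for the exported rows.
    """
    updated = inserted = 0
    for row in rows:
        key = row[update_key]
        if key in target:
            updated += 1
        else:
            inserted += 1
        target[key] = dict(row)
    return updated, inserted


# Hypothetical current contents of the warehouse table avg_movie_rating.
warehouse = {
    242: {"movie_id": 242, "rating": 3.4},
    302: {"movie_id": 302, "rating": 3.7},
}
# Hypothetical freshly computed averages exported from Hadoop.
exported = [
    {"movie_id": 242, "rating": 3.5},  # existing movie, new average
    {"movie_id": 377, "rating": 1.0},  # movie not yet in the warehouse
]

counts = export_allowinsert(warehouse, exported, "movie_id")
print(counts, sorted(warehouse))
```

Because rows that are not exported are left untouched, rerunning the export with the same data leaves the warehouse table in the same state.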
  • 63. Please complete the session evaluation Thank you! @hadooparchbook You may complete the session evaluation either on paper or online via the mobile app
  • 69. What is Hadoop? Hadoop is an open-source system designed to store and process petabyte-scale data. That’s pretty much what you need to know. Well, almost…
  • 70. Compression Codecs snappy Well, maybe. Not splittable. X Splittable. Getting better… Very good choice Splittable, but no...
  • 71. Our Compression Codec Recommendation ■ Snappy for all data sets (columnar as well as row based)
  • 72. File Format Choices
      Data set | Storage format | Compression codec
      movie | Parquet | Snappy
      user_history | Avro | Snappy
      user | Parquet | Snappy
      user_rating_fact | Avro | Snappy
      user_rating | Parquet | Snappy