Application Architectures with Hadoop
Northern Colorado Big Data meetup
October 8, 2015
tiny.cloudera.com/app-arch-ft-collins
Mark Grover | @mark_grover
2
About the book
•  @hadooparchbook
•  hadooparchitecturebook.com
•  github.com/hadooparchitecturebook
•  slideshare.com/hadooparchbook
©2014 Cloudera, Inc. All Rights Reserved.
3
About Me
•  Mark
–  Software Engineer
–  Engineer on Apache Spark
–  Committer on Apache Bigtop, committer and PPMC member on Apache
Sentry (incubating).
–  Contributor to Hadoop, Hive, Spark, Sqoop, Flume
4
Case Study
Clickstream Analysis
5
Analytics
7
Web Logs – Combined Log Format
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
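The Combined Log Format lines above can be split into named fields with a regular expression. The following is a minimal sketch, not code from the talk: the field names, the `parse_line` helper, and the choice to ignore any timezone offset (the sample lines show none) are assumptions made for illustration.

```python
import re
from datetime import datetime

# Combined Log Format:
# host ident authuser [timestamp] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line is incomplete or malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Only the date and time (first 20 characters) are parsed; any timezone
    # offset that follows is ignored in this sketch.
    record["ts"] = datetime.strptime(record["ts"].strip()[:20], "%d/%b/%Y:%H:%M:%S")
    record["status"] = int(record["status"])
    return record

SAMPLE = ('244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" '
          '200 4463 "http://bestcyclingreviews.com/top_online_shops" '
          '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"')

record = parse_line(SAMPLE)
print(record["ip"], record["ts"], record["request"])
```

A line that stops partway through the user-agent field has no closing quote, so `parse_line` returns `None` for it.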
8
Clickstream Analytics
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
9
Challenges of Hadoop Implementation
11
Hadoop Architectural Considerations
•  Storage managers?
–  HDFS? HBase?
•  Data storage and modeling:
–  File formats? Compression? Schema design?
•  Data movement
–  How do we actually get the data into Hadoop? How do we get it out?
•  Metadata
–  How do we manage data about the data?
•  Data access and processing
–  How will the data be accessed once in Hadoop? How can we transform it? How do
we query it?
•  Orchestration
–  How do we manage the workflow for all of this?
12
Architectural
Considerations
Data Storage and Modeling
13
Data Modeling Considerations
•  We need to consider the following in our architecture:
–  Storage layer – HDFS? HBase? Etc.
–  File system schemas – how will we lay out the data?
–  File formats – what storage formats to use for our data, both raw and
processed data?
–  Data compression formats?
14
Architectural
Considerations
Data Modeling – Storage Layer
15
Data Storage Layer Choices
•  Two likely choices for raw data: HDFS and HBase
16
Data Storage Layer Choices
HDFS
•  Stores data directly as files
•  Fast scans
•  Poor random reads/writes
HBase
•  Stores data as HFiles on HDFS
•  Slow scans
•  Fast random reads/writes
17
Data Storage – Storage Manager Considerations
•  Incoming raw data:
–  Processing requirements call for batch transformations across multiple
records – for example sessionization.
•  Processed data:
–  Access to processed data will be via things like analytical queries – again
requiring access to multiple records.
•  We choose HDFS
–  Processing needs in this case served better by fast scans.
18
Architectural
Considerations
Data Modeling – Data Storage Format
19
Our Format Choices…
•  Raw data
–  Avro with Snappy
•  Processed data
–  Parquet
20
Architectural
Considerations
Data Modeling – HDFS Schema Design
21
Recommended HDFS Schema Design
•  How to lay out data on HDFS?
22
Recommended HDFS Schema Design
/etl – Data in various stages of ETL workflow
/data – shared data for the entire organization
/tmp – temp data from tools or shared between users
/user/<username> - User specific data, jars, conf files
/app – Everything but data: UDF jars, HQL files, Oozie workflows
23
Architectural
Considerations
Data Modeling – Advanced HDFS Schema
Design
24
Partitioning
Un-partitioned HDFS directory structure
dataset
  file1.txt
  file2.txt
  …
  filen.txt
Partitioned HDFS directory structure
dataset
  col=val1/file.txt
  col=val2/file.txt
  …
  col=valn/file.txt
25
Partitioning considerations
•  What column to partition by?
–  Don’t have too many partitions (<10,000)
–  Don’t have too many small files in the partitions
–  Good to have partition sizes at least ~1 GB
•  We’ll partition by timestamp. This applies to both our raw and
processed data.
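As a rough illustration of partitioning by timestamp, the sketch below builds a `col=val` style directory for a click. The base directory, the `partition_path` helper, and the year/month/day granularity are assumptions; the slides only say the data is partitioned by timestamp.

```python
from datetime import datetime

def partition_path(base_dir, ts):
    """Build a col=val style partition directory from a timestamp.

    Daily granularity is an assumption; pick whatever keeps the number of
    partitions modest and each partition reasonably large.
    """
    return "{}/year={:04d}/month={:02d}/day={:02d}".format(
        base_dir, ts.year, ts.month, ts.day)

click_time = datetime(2014, 10, 17, 21, 8, 30)
# "/etl/clickstream/raw" is a hypothetical location used only for this example.
print(partition_path("/etl/clickstream/raw", click_time))
```

Coarser or finer granularity changes only the format string; the trade-off is the one listed above, fewer and larger partitions versus many small files.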
26
Architectural
Considerations
Data Ingestion
27
File Transfers
•  “hadoop fs -put <file>”
•  Reliable, but not
resilient to failure.
•  Other options are
mountable HDFS, for
example NFSv3.
28
Streaming Ingestion
•  Flume
–  Reliable, distributed, and available system for efficient collection, aggregation
and movement of streaming data, e.g. logs.
•  Kafka
–  Reliable and distributed publish-subscribe messaging system.
29
Flume vs. Kafka
Flume
•  Purpose built for Hadoop data ingest.
•  Pre-built sinks for HDFS, HBase, etc.
•  Supports transformation of data in-flight.
Kafka
•  General pub-sub messaging framework.
•  Just a message transport.
•  Have to use third party tool to ingest.
30
Flume and Kafka
•  Kafka Source
•  Kafka Channel
31
Short Intro to Flume
Flume Agent: Sources → Interceptors → Selectors → Channels → Sinks
•  Sources: Twitter, logs, JMS, webserver
•  Interceptors: Mask, re-format, validate…
•  Selectors: DR, critical
•  Channels: Memory, file, Kafka
•  Sinks: HDFS, HBase, Solr
32
A Brief Discussion of Flume Patterns – Fan-in
•  Flume agent runs on
each of our servers.
•  These agents send
data to multiple agents
to provide reliability.
•  Flume provides support
for load balancing.
33
Ingestion Decisions
•  Historical Data
–  File transfer
•  Incoming Data
–  Flume with the spooling directory source.
•  Relational Data Sources – ODS, CRM, etc.
–  Sqoop
34
Architectural
Considerations
Data Processing – Engines
35
Processing Engines
•  MapReduce
•  Abstractions – Pig, Hive, Cascading, Crunch
•  Spark
•  Impala
36
MapReduce
•  Oldie but goody
•  Restrictive Framework / Innovated Work Around
•  Extreme Batch
37
MapReduce Basic High Level
[Diagram: a Mapper reads a Block of Data from HDFS (Replicated) and writes Temp Spill Data and Partitioned Sorted Data to the Native File System; each Reducer makes a Local Copy and writes an Output File.]
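The map, partition/sort, and reduce stages can be imitated in a few lines of plain Python. This is an in-process sketch for intuition only, not Hadoop's API: the function names are hypothetical, and the job (counting hits per IP address) is chosen just for illustration.

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit (ip, 1) for each log line; the IP is the first field."""
    for line in lines:
        fields = line.split()
        if fields:
            yield fields[0], 1

def shuffle_phase(pairs):
    """Shuffle/sort: group values by key and return groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reducer: sum the counts for each key."""
    return {key: sum(values) for key, values in grouped}

lines = [
    '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463',
    '209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp HTTP/1.0" 200 3757',
    '244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp HTTP/1.0" 200 3757',
]
hits = reduce_phase(shuffle_phase(map_phase(lines)))
print(hits)
```

In a real job the shuffle moves data between machines and spills to local disk, which is where the temp spill and partitioned sorted data in the diagram come from.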
38
Abstractions
•  SQL
–  Hive
•  Script/Code
–  Pig: Pig Latin
–  Crunch: Java/Scala
–  Cascading: Java/Scala
39
Spark
•  The New Kid that isn’t that New Anymore
•  Easily 10x less code
•  Extremely Easy and Powerful API
•  Very good for machine learning
•  Scala, Java, and Python
•  RDDs
•  DAG Engine
40
Impala
• Real-time open source MPP style engine for Hadoop
• Doesn’t build on MapReduce
• Written in C++, uses LLVM for run-time code generation
• Can create tables over HDFS or HBase data
• Accesses Hive metastore for metadata
• Access available via JDBC/ODBC
41
Architectural
Considerations
Data Processing – What processing needs to
happen?
42
What processing needs to happen?
•  Sessionization
•  Filtering
•  Deduplication
•  BI / Discovery
43
Sessionization
[Diagram: website visits on a timeline. Visitor 1 has Session 1 and, after a gap of > 30 minutes, Session 2; Visitor 2 has Session 1.]
44
Why sessionize?
Helps answer questions like:
•  What is my website’s bounce rate?
–  i.e. what percentage of visitors don’t go past the landing page?
•  Which marketing channels (e.g. organic search, display ad, etc.) are
leading to the most sessions?
–  Which of those lead to the most conversions (e.g. people buying things,
signing up, etc.)?
•  Do attribution analysis – which channels are responsible for most
conversions?
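Once clicks are grouped into sessions, a metric such as bounce rate becomes a simple count. The sketch below assumes a bounce is a session with exactly one page view; the session ids, the page paths other than those in the sample logs, and the `bounce_rate` helper are hypothetical.

```python
def bounce_rate(sessions):
    """Fraction of sessions that contain exactly one page view.

    `sessions` maps a session id to the list of pages viewed in that session.
    """
    if not sessions:
        return 0.0
    bounces = sum(1 for pages in sessions.values() if len(pages) == 1)
    return bounces / len(sessions)

sessions = {
    "visitor1-s1": ["/seatposts"],
    "visitor1-s2": ["/Store/cart.jsp?productID=1023", "/checkout"],
    "visitor2-s1": ["/seatposts", "/saddles"],
    "visitor3-s1": ["/"],
}
print("Bounce rate: {:.0%}".format(bounce_rate(sessions)))
```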
45
How to Sessionize?
1.  Given a list of clicks, determine which clicks
came from the same user
2.  Given a particular user's clicks, determine if a
given click is a part of a new session or a
continuation of the previous session
46
#1 – Which clicks are from same user?
•  We can use:
–  IP address (244.157.45.12)
–  Cookies (A9A3BECE0563982D)
–  IP address (244.157.45.12) and user agent string ((KHTML, like
Gecko) Chrome/36.0.1944.0 Safari/537.36")
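One way to turn these options into a grouping key is sketched below. Preferring the cookie when it is present and falling back to a hash of IP address plus user agent is an assumption made for this example, not a recommendation from the slides; `user_key` and the key prefixes are hypothetical names.

```python
import hashlib

def user_key(ip, user_agent, cookie=None):
    """Return a string that identifies the visitor behind a click.

    Uses the cookie if one was captured; otherwise combines IP address and
    user agent, which separates different browsers behind the same IP.
    """
    if cookie:
        return "cookie:" + cookie
    digest = hashlib.sha1((ip + "|" + user_agent).encode("utf-8")).hexdigest()
    return "ipua:" + digest

mac = user_key("244.157.45.12", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)")
phone = user_key("244.157.45.12", "Mozilla/5.0 (Linux; U; Android 2.3.5)")
print(mac != phone)  # same IP, different user agents, so different keys
```

With an IP-only key the two sample log lines would be treated as one user, even though they come from a Mac and an Android phone.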
48
#1 – Which clicks are from same user?
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
49
#2 – Which clicks are part of the same session?
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
> 30 mins apart = different sessions
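The 30-minute rule can be sketched as a single pass over each user's clicks in time order. This is an illustrative in-memory version, not the MR or Spark code from the book's repository; `sessionize`, the tuple layout, and the session id format are assumptions.

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(clicks):
    """Assign a session id to each (user, timestamp) click.

    Returns (user, timestamp, session_id) tuples sorted by user and time.
    """
    result = []
    ordered = sorted(clicks)  # sort by user, then by timestamp
    for user, user_clicks in groupby(ordered, key=lambda click: click[0]):
        session_no = 0
        previous = None
        for _, ts in user_clicks:
            # A gap of more than 30 minutes starts a new session.
            if previous is None or ts - previous > SESSION_TIMEOUT:
                session_no += 1
            result.append((user, ts, "{}-{}".format(user, session_no)))
            previous = ts
    return result

clicks = [
    ("244.157.45.12", datetime(2014, 10, 17, 21, 8, 30)),
    ("244.157.45.12", datetime(2014, 10, 17, 21, 59, 59)),
    ("244.157.45.12", datetime(2014, 10, 17, 22, 5, 0)),
]
for user, ts, session_id in sessionize(clicks):
    print(ts, session_id)
```

The first two clicks are more than 30 minutes apart, so they land in different sessions; the third click follows the second by about five minutes and stays in the second session.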
50
Sessionization engine recommendation
•  We have sessionization code in both MR and Spark on GitHub. The
complexity of the code varies, and the right choice depends on the
expertise in the organization.
•  We choose MR, since it’s fairly simple and maintainable code.
51
Filtering – filter out incomplete records
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
52
Filtering – filter out records from bots/spiders
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
Google spider IP address
53
Filtering recommendation
•  Bot/Spider filtering can be done easily in any of the engines
•  Incomplete records are harder to filter in schema systems like
Hive, Impala, Pig, etc.
•  Pretty close choice between MR, Hive and Spark
•  Can be done in Flume interceptors as well
•  We can simply embed this in our sessionization job
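Both filters can be expressed as one predicate that a sessionization job (or a Flume interceptor) applies to each raw line. The sketch below is an assumption-laden illustration: `BOT_IPS` is a hypothetical blocklist seeded with the spider address from the slide, and the completeness check simply requires all Combined Log Format fields, ending with a closed user-agent quote.

```python
import re

# Hypothetical blocklist; a real deployment would load a maintained list.
BOT_IPS = {"209.85.238.11"}

# A complete record has every field and ends with the closing quote of the
# user-agent string.
COMPLETE_RECORD = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "[^"]*"$'
)

def keep_record(line):
    """Return True for complete records that do not come from a known bot IP."""
    line = line.strip()
    if not COMPLETE_RECORD.match(line):
        return False
    ip = line.split(" ", 1)[0]
    return ip not in BOT_IPS

good = ('244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" '
        '200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh)"')
truncated = ('244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp HTTP/1.0" '
             '200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U')
bot = ('209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp HTTP/1.0" '
       '200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux)"')

kept = [line for line in (good, truncated, bot) if keep_record(line)]
print(len(kept))
```

Working on the raw line avoids the problem noted above for schema-based engines, where a truncated record may fail to load into a table before it can be filtered.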
54
Deduplication – remove duplicate records
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
55
Deduplication recommendation
•  Can be done in all engines.
•  We already have a Hive table with all the columns, so a simple
DISTINCT query will perform deduplication
•  We use Pig
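The effect of a DISTINCT over whole records can be shown with a small in-memory sketch. This is not the Pig or Hive job itself, only an illustration of the semantics; `deduplicate` is a hypothetical helper and the shortened log lines are made up for the example.

```python
def deduplicate(records):
    """Drop exact duplicate records, keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique

records = [
    '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463',
    '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463',
    '244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp HTTP/1.0" 200 3757',
]
print(len(deduplicate(records)))
```

On a cluster the same result needs a shuffle so that identical records meet on one node, which is why any of the engines can do it.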
56
BI/Discovery engine recommendation
•  Main requirements for this are:
–  Low latency
–  SQL interface (e.g. JDBC/ODBC)
–  Users don’t know how to code
•  We chose Impala
–  It’s a SQL engine
–  Much faster than other engines
–  Provides standard JDBC/ODBC interfaces
57
Architectural
Considerations
Orchestration
58
Choosing…
•  Workflow is fairly simple
•  Need to trigger workflow based on data
•  Be able to recover from errors
•  Perhaps notify on the status
•  And collect metrics for reporting
Easier in Oozie
59
Choosing the right Orchestration Tool
•  Workflow is fairly simple
•  Need to trigger workflow based on data
•  Be able to recover from errors
•  Perhaps notify on the status
•  And collect metrics for reporting
Better in Azkaban
60
Important Decision Consideration!
•  The best orchestration tool is the one you are an expert on
–  Oozie
–  Spark Streaming, etc. don’t require an orchestration tool
61
Putting It All
Together
Final Architecture
62
Final architecture
1. Ingestion
•  Website users → Web servers → Web logs, via Flume → Hadoop Cluster
•  Operational Data Store, CRM System → via Sqoop → Hadoop Cluster
2. Processing
•  Hadoop Cluster
3. Accessing
•  BI/Visualization tool (e.g. MicroStrategy) for BI Analysts
•  Spark for machine learning and graph processing
•  R/Python for statistical analysis
•  Custom Apps
4. Orchestration
Thank you
Application Architectures with Hadoop - UK Hadoop User Group
hadooparchbook
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
hadooparchbook
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
hadooparchbook
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014
hadooparchbook
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
Jonathan Seidman
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Cloudera, Inc.
 
大数据数据治理及数据安全
大数据数据治理及数据安全大数据数据治理及数据安全
大数据数据治理及数据安全
Jianwei Li
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
Michael Hiskey
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
sudhakara st
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorial
markgrover
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
 
Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
Chris Nauroth
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
Cloudera, Inc.
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming Architectures
Cloudera, Inc.
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Avere Systems
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Mike Percy
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
Joey Echeverria
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
ssuserd3a367
 

Architecting application with Hadoop - using clickstream analytics as an example

  • 1. Application Architectures with Hadoop Northern Colorado Big Data meetup October 8, 2015 tiny.cloudera.com/app-arch-ft-collins Mark Grover | @mark_grover
  • 2. About the book • @hadooparchbook • hadooparchitecturebook.com • github.com/hadooparchitecturebook • slideshare.com/hadooparchbook ©2014 Cloudera, Inc. All Rights Reserved.
  • 3. About Me • Mark – Software Engineer – Engineer on Apache Spark – Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating). – Contributor to Hadoop, Hive, Spark, Sqoop, Flume
  • 5. Analytics
  • 6. Analytics
  • 7. Web Logs – Combined Log Format 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
  • 8. Clickstream Analytics 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
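A minimal sketch of how one Combined Log Format line, like the samples on this slide, could be split into fields. The names `LOG_PATTERN`, `parse_line`, and `SAMPLE` are hypothetical and not part of the deck; the timestamp handling assumes the timezone offset may be missing, as it is in the sample lines.

```python
import re
from datetime import datetime

# Combined Log Format:
# host ident user [time] "request" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Only the first 20 characters (dd/Mon/yyyy:HH:MM:SS) are parsed,
    # so a present or absent timezone offset is ignored (an assumption).
    record["time"] = datetime.strptime(record["time"].strip()[:20],
                                       "%d/%b/%Y:%H:%M:%S")
    record["status"] = int(record["status"])
    return record

SAMPLE = ('244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" '
          '200 4463 "http://bestcyclingreviews.com/top_online_shops" '
          '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"')

record = parse_line(SAMPLE)
```

Returning `None` for non-matching lines keeps the decision about incomplete records with the caller, which fits the filtering step discussed later in the deck.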
  • 9. Challenges of Hadoop Implementation
  • 10. Challenges of Hadoop Implementation
  • 11. Hadoop Architectural Considerations • Storage managers? – HDFS? HBase? • Data storage and modeling: – File formats? Compression? Schema design? • Data movement – How do we actually get the data into Hadoop? How do we get it out? • Metadata – How do we manage data about the data? • Data access and processing – How will the data be accessed once in Hadoop? How can we transform it? How do we query it? • Orchestration – How do we manage the workflow for all of this?
  • 13. Data Modeling Considerations • We need to consider the following in our architecture: – Storage layer – HDFS? HBase? Etc. – File system schemas – how will we lay out the data? – File formats – what storage formats to use for our data, both raw and processed data? – Data compression formats?
  • 15. Data Storage Layer Choices • Two likely choices for raw data: HDFS and HBase
  • 16. Data Storage Layer Choices • HDFS: – Stores data directly as files – Fast scans – Poor random reads/writes • HBase: – Stores data as HFiles on HDFS – Slow scans – Fast random reads/writes
  • 17. Data Storage – Storage Manager Considerations • Incoming raw data: – Processing requirements call for batch transformations across multiple records – for example sessionization. • Processed data: – Access to processed data will be via things like analytical queries – again requiring access to multiple records. • We choose HDFS – Processing needs in this case served better by fast scans.
  • 19. Our Format Choices… • Raw data – Avro with Snappy • Processed data – Parquet
  • 21. Recommended HDFS Schema Design • How to lay out data on HDFS?
  • 22. Recommended HDFS Schema Design • /etl – Data in various stages of ETL workflow • /data – Shared data for the entire organization • /tmp – Temp data from tools or shared between users • /user/<username> – User-specific data, jars, conf files • /app – Everything but data: UDF jars, HQL files, Oozie workflows
  • 24. Partitioning • Un-partitioned HDFS directory structure: dataset/file1.txt, dataset/file2.txt, …, dataset/filen.txt • Partitioned HDFS directory structure: dataset/col=val1/file.txt, dataset/col=val2/file.txt, …, dataset/col=valn/file.txt
  • 25. Partitioning considerations • What column to partition by? – Don’t have too many partitions (<10,000) – Don’t have too many small files in the partitions – Good to have partition sizes at least ~1 GB • We’ll partition by timestamp. This applies to both our raw and processed data.
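As an illustration of partitioning by timestamp, the sketch below derives a `col=val` style directory for a click from its event time. The daily granularity, the `year=/month=/day=` column names, the base directory, and the function name `partition_path` are all assumptions made for the example; the deck only states that the data is partitioned by timestamp.

```python
from datetime import datetime

def partition_path(base_dir, event_time):
    """Build a partition directory such as <base>/year=2014/month=10/day=17.

    Daily partitions are an assumption; pick the granularity that keeps
    the partition count and partition sizes within the guidelines above.
    """
    return "{}/year={:04d}/month={:02d}/day={:02d}".format(
        base_dir, event_time.year, event_time.month, event_time.day)

# Hypothetical base directory, not taken from the slides.
path = partition_path("/data/clickstream", datetime(2014, 10, 17, 21, 8, 30))
```

Coarser partitions (for example monthly) reduce the partition count, while finer ones risk many small files, so the right granularity depends on daily data volume.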
  • 27. File Transfers • "hadoop fs -put <file>" • Reliable, but not resilient to failure. • Other options include mountable HDFS, for example via NFSv3.
  • 28. Streaming Ingestion • Flume – Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs. • Kafka – Reliable and distributed publish-subscribe messaging system.
  • 29. Flume vs. Kafka • Flume: – Purpose built for Hadoop data ingest. – Pre-built sinks for HDFS, HBase, etc. – Supports transformation of data in-flight. • Kafka: – General pub-sub messaging framework. – Just a message transport. – Have to use a third-party tool to ingest.
  • 30. Flume and Kafka • Kafka Source • Kafka Channel
  • 31. Short Intro to Flume • Flume agent components: – Sources: Twitter, logs, JMS, webserver – Interceptors: mask, re-format, validate… – Selectors: DR, critical – Channels: memory, file, Kafka – Sinks: HDFS, HBase, Solr
  • 32. A Brief Discussion of Flume Patterns – Fan-in • Flume agent runs on each of our servers. • These agents send data to multiple agents to provide reliability. • Flume provides support for load balancing.
  • 33. Ingestion Decisions • Historical Data – File transfer • Incoming Data – Flume with the spooling directory source. • Relational Data Sources – ODS, CRM, etc. – Sqoop
  • 35. Processing Engines • MapReduce • Abstractions – Pig, Hive, Cascading, Crunch • Spark • Impala
  • 36. MapReduce • Oldie but goody • Restrictive Framework / Innovated Work Around • Extreme Batch
  • 37. MapReduce Basic High Level (diagram; labels: HDFS (Replicated), Block of Data, Mapper, Native File System, Temp Spill Data, Partitioned Sorted Data, Local Copy, Reducer, Reducer, Output File)
  • 38. Abstractions • SQL – Hive • Script/Code – Pig: Pig Latin – Crunch: Java/Scala – Cascading: Java/Scala
  • 39. Spark • The New Kid that isn’t that New Anymore • Easily 10x less code • Extremely Easy and Powerful API • Very good for machine learning • Scala, Java, and Python • RDDs • DAG Engine
  • 40. Impala • Real-time open source MPP-style engine for Hadoop • Doesn’t build on MapReduce • Written in C++, uses LLVM for run-time code generation • Can create tables over HDFS or HBase data • Accesses Hive metastore for metadata • Access available via JDBC/ODBC
  • 41. Architectural Considerations: Data Processing – What processing needs to happen?
  • 42. What processing needs to happen? • Sessionization • Filtering • Deduplication • BI / Discovery
  • 43. Sessionization (diagram: website visits grouped into Visitor 1 Session 1, Visitor 1 Session 2, and Visitor 2 Session 1; a gap of > 30 minutes separates sessions)
  • 44. Why sessionize? Helps answer questions like: • What is my website’s bounce rate? – i.e. what percentage of visitors don’t go past the landing page? • Which marketing channels (e.g. organic search, display ad, etc.) are leading to the most sessions? – Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)? • Do attribution analysis – which channels are responsible for the most conversions?
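Once clicks are grouped into sessions, a bounce rate can be computed in a few lines. This is a minimal sketch that assumes a session is represented as a list of requested pages and that a "bounce" is a session with exactly one page view; both the representation and the function name `bounce_rate` are illustrative assumptions, not the deck's definition.

```python
def bounce_rate(sessions):
    """Percentage of sessions that never go past the landing page.

    `sessions` is a list of sessions, each a list of page views.
    A session with exactly one page view is counted as a bounce
    (an assumption for this sketch).
    """
    if not sessions:
        return 0.0
    bounces = sum(1 for pages in sessions if len(pages) == 1)
    return 100.0 * bounces / len(sessions)

# Hypothetical sessions: two of the four stop at the landing page.
example_sessions = [
    ["/seatposts"],
    ["/seatposts", "/Store/cart.jsp?productID=1023"],
    ["/helmets"],
    ["/helmets", "/checkout"],
]
rate = bounce_rate(example_sessions)
```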
  • 45. How to Sessionize? 1. Given a list of clicks, determine which clicks came from the same user. 2. Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session.
  • 46. #1 – Which clicks are from same user? • We can use: – IP address (244.157.45.12) – Cookies (A9A3BECE0563982D) – IP address (244.157.45.12) and user agent string ("(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
  • 47. #1 – Which clicks are from same user? • We can use: – IP address (244.157.45.12) – Cookies (A9A3BECE0563982D) – IP address (244.157.45.12) and user agent string ("(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
  • 48. #1 – Which clicks are from same user? 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
  • 49. #2 – Which clicks are part of the same session? 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" (> 30 mins apart = different sessions)
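The two sessionization steps (group clicks by user, then split each user's clicks wherever the gap exceeds 30 minutes) can be sketched in plain Python. This is not the MapReduce or Spark code the deck refers to; it is a single-machine illustration that assumes each click is a dict with `ip`, `agent`, and `time` keys and that IP address plus user agent identifies a user. The names `sessionize` and `SESSION_TIMEOUT` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(clicks):
    """Group clicks into sessions; returns a list of lists of clicks."""
    # Step 1: clicks from the same user share an (IP, user agent) key.
    by_user = defaultdict(list)
    for click in clicks:
        by_user[(click["ip"], click["agent"])].append(click)

    # Step 2: within one user's time-ordered clicks, a gap of more
    # than 30 minutes starts a new session.
    sessions = []
    for user_clicks in by_user.values():
        user_clicks.sort(key=lambda c: c["time"])
        current = [user_clicks[0]]
        for prev, cur in zip(user_clicks, user_clicks[1:]):
            if cur["time"] - prev["time"] > SESSION_TIMEOUT:
                sessions.append(current)
                current = []
            current.append(cur)
        sessions.append(current)
    return sessions

# Hypothetical clicks: one user with a 39-minute gap, plus a second user.
clicks = [
    {"ip": "244.157.45.12", "agent": "Chrome", "time": datetime(2014, 10, 17, 21, 8, 30)},
    {"ip": "244.157.45.12", "agent": "Chrome", "time": datetime(2014, 10, 17, 21, 20, 0)},
    {"ip": "244.157.45.12", "agent": "Chrome", "time": datetime(2014, 10, 17, 21, 59, 59)},
    {"ip": "10.0.0.7", "agent": "Firefox", "time": datetime(2014, 10, 17, 21, 10, 0)},
]
sessions = sessionize(clicks)
```

In a distributed engine the same idea maps onto a group-by on the user key followed by a per-user sort, which is why the deck describes this step as a batch transformation across multiple records.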
  • 50. Sessionization engine recommendation • We have sessionization code in MR and Spark on GitHub. The complexity of the code varies, and the choice depends on the expertise in the organization. • We choose MR, since it’s fairly simple and maintainable code.
  • 51. Filtering – filter out incomplete records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
  • 52. Filtering – filter out records from bots/spiders 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" (209.85.238.11 is a Google spider IP address)
  • 53. Filtering recommendation • Bot/Spider filtering can be done easily in any of the engines • Incomplete records are harder to filter in schema systems like Hive, Impala, Pig, etc. • Pretty close choice between MR, Hive and Spark • Can be done in Flume interceptors as well • We can simply embed this in our sessionization job
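Both filters can be expressed as simple predicates over parsed records, which is what makes them easy to embed in the sessionization job. The sketch below assumes records are dicts with the field names shown; the bot IP set holds only the address from the slide example, and the user-agent markers are an illustrative guess rather than a real bot list. All names here are hypothetical.

```python
# The single IP from the slide example; a real deployment would load
# a maintained list instead (assumption).
BOT_IPS = {"209.85.238.11"}
# Illustrative substrings only, not an authoritative bot signature list.
BOT_AGENT_MARKERS = ("bot", "spider", "crawler")
REQUIRED_FIELDS = ("ip", "time", "request", "status", "agent")

def is_complete(record):
    """A record is complete when every required field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)

def is_bot(record):
    """Flag records from known bot IPs or with bot-like user agents."""
    agent = (record.get("agent") or "").lower()
    return record.get("ip") in BOT_IPS or any(m in agent for m in BOT_AGENT_MARKERS)

def keep(record):
    return is_complete(record) and not is_bot(record)

human = {"ip": "244.157.45.12", "time": "17/Oct/2014:21:08:30",
         "request": "GET /seatposts HTTP/1.0", "status": 200,
         "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)"}
spider = dict(human, ip="209.85.238.11")
truncated = dict(human, agent="")
kept = [r for r in (human, spider, truncated) if keep(r)]
```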
  • 54. Deduplication – remove duplicate records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
  • 55. Deduplication recommendation • Can be done in all engines. • We already have a Hive table with all the columns, so a simple DISTINCT query will perform deduplication. • We use Pig
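The effect of a DISTINCT over whole records can be illustrated in plain Python. This is a sketch of the idea only, not the Hive or Pig job the deck uses; it assumes records are raw log lines compared as exact strings, and the function name `deduplicate` is hypothetical.

```python
def deduplicate(lines):
    """Drop exact duplicate records while keeping first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique

# Hypothetical shortened log lines; the first two are identical.
raw = [
    '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463',
    '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463',
    '244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757',
]
unique_lines = deduplicate(raw)
```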
  • 56. BI/Discovery engine recommendation • Main requirements for this are: – Low latency – SQL interface (e.g. JDBC/ODBC) – Users don’t know how to code • We chose Impala – It’s a SQL engine – Much faster than other engines – Provides standard JDBC/ODBC interfaces
  • 58. Choosing… • Workflow is fairly simple • Need to trigger workflow based on data • Be able to recover from errors • Perhaps notify on the status • And collect metrics for reporting (callout: Easier in Oozie)
  • 59. Choosing the right Orchestration Tool • Workflow is fairly simple • Need to trigger workflow based on data • Be able to recover from errors • Perhaps notify on the status • And collect metrics for reporting (callout: Better in Azkaban)
  • 60. Important Decision Consideration! • The best orchestration tool is the one you are an expert on – Oozie – Spark Streaming, etc. don’t require an orchestration tool
  • 62. Final architecture (diagram) • 1. Ingestion: website users → web servers → web logs, via Flume; operational data store and CRM system, via Sqoop • 2. Processing: Hadoop cluster • 3. Accessing: BI/visualization tool (e.g. MicroStrategy) for BI analysts; Spark for machine learning and graph processing; R/Python for statistical analysis; custom apps • 4. Orchestration
  • 63. Thank you