Delivered by Mark Grover at Northern CO Hadoop User Group:
http://www.meetup.com/Northern-Colorado-Big-Data-Meetup/events/224717963/
The document discusses best practices for streaming applications. It covers common streaming use cases such as ingestion, transformations, and counting, as well as more advanced use cases involving machine learning. It provides an overview of streaming architectures, compares streaming engines such as Spark Streaming, Flink, Storm, and Kafka Streams, and discusses when to use different storage systems and message brokers such as Kafka in ingestion pipelines. The goal is to understand common streaming use cases and their architectures.
Real-time Analytics with Presto and Apache Pinot, by Xiang Fu
PrestoCon 2021
Today, most analytics products focus either on ad-hoc analytics, which requires query flexibility but offers no latency guarantees, or on low-latency analytics with limited query capability.
In this talk, we will explore how to get the best of both worlds using Apache Pinot and Presto:
1. How analytics is done today, trading off latency against flexibility: a comparison of analytics on raw data versus pre-joined/pre-cubed datasets.
2. Introducing Apache Pinot as a column store for fast real-time data analytics, and the Presto Pinot Connector to cover the entire landscape.
3. A deep dive into the Presto Pinot Connector to see how it pushes predicates and aggregations down to Pinot.
4. Benchmark results for the Presto Pinot Connector.
The document discusses several key topics in Apache HBase:
1. Procedure version 2 introduces a new framework for running operations like create/drop table and region assignment as procedures with distinct phases.
2. Assignment Manager version 2 uses procedures and improves region assignment and load balancing.
3. Backup/restore now supports HDFS, S3, ADLS and WASB. Snapshots can also be used for backup.
4. Compacting memstore allows in-memory flushing and compaction to improve performance through pipelining.
Apache Hive is a data warehouse software that allows querying and managing large datasets stored in Hadoop's HDFS. It provides tools for easy extract, transform, and load of data. Hive supports a SQL-like language called HiveQL and big data analytics using MapReduce. Data in Hive is organized into databases, tables, partitions, and buckets. Hive supports various data types, operators, and functions for data analysis. Some advantages of Hive include its ability to handle large datasets using Hadoop's reliability and performance. However, Hive does not support all SQL features and transactions.
Exploring Scenarios of Flink CDC in Streaming Data Integration, by Leonard Xu
Description
The freshness of data significantly impacts the value of data insights, especially for business data stored in databases. The rapid development of real-time computing and real-time analytics technologies has increased the demand for low-latency data pipelines. Establishing a real-time synchronization pipeline can make the entire business decision-making process more efficient.
Flink CDC is an end-to-end streaming ETL tool built on Apache Flink, allowing users to easily construct streaming data integration pipelines using YAML language. In this session, I will analyze the mainstream business scenarios and challenges of building real-time data synchronization pipelines, delve into the key design and implementation of Flink CDC, and share how Flink CDC elegantly addresses these challenges, including schema evolution, full database synchronization, dynamic table addition, automatic merging of sharded tables, column projection, and filtering.
These are the slides of my Visug (Visual Studio User Group) session about how you can leverage the power of Git with TFS 2013/Visual Studio Online and Visual Studio.
This document provides an overview and introduction to AWS cloud services, including Amazon Web Services (AWS), cloud computing concepts, AWS compute options like EC2, AWS networking components, AWS storage services, and AWS database services. It discusses key AWS concepts such as availability zones, regions, security groups, VPCs, S3, EBS, RDS, DynamoDB and others. The document aims to explain how AWS infrastructure works and the benefits it provides over traditional on-premises infrastructure for compute, storage, databases and other services.
This document provides an overview of Apache Sentry, an open source authorization module for Hadoop. It discusses how Sentry provides fine-grained, role-based authorization across Hadoop components like Hive, Impala and Solr to address the fragmented authorization in Hadoop. Sentry stores authorization policies that map users and groups to roles with privileges for resources like databases, tables and collections. It evaluates rules to determine access for a user based on their group memberships and role privileges.
This document discusses GitHub Actions for continuous integration and continuous delivery (CI/CD). It provides an overview of GitHub Actions, why they are useful, core concepts, and pricing. The key points are: GitHub Actions allow automating workflows from development to production using Linux, Windows, and macOS runners. They offer built-in secrets management, matrix builds, multi-container testing, and live logs. Pricing is free for public repositories and includes a generous monthly allowance for private repositories. The presenter then demonstrates GitHub Actions in a live demo.
Domain Driven Design provides not only the strategic guidelines for decomposing a large system into microservices, but also offers the main tactical pattern that helps in decoupling microservices. The presentation will focus on the way domain events could be implemented using Kafka and the trade-offs between consistency and availability that are supported by Kafka.
https://youtu.be/P6IaxNcn-Ag?t=1466
Presented By: Agnibhas Chattopadhyay and Saurabh Suresh Dhotre. The presentation introduces Google Cloud Pub/Sub, a messaging service that provides durable message storage and scalable delivery. It also discusses Spring Cloud GCP, an open source project that integrates Spring applications with Google Cloud Platform services like Pub/Sub to reduce boilerplate code. The presentation demonstrates how to get started with Spring Cloud GCP Pub/Sub modules and includes a demo of a sample application.
Kafka and Avro with Confluent Schema Registry, by Jean-Paul Azar
The document discusses Confluent Schema Registry, which stores and manages Avro schemas for Kafka clients. It allows producers and consumers to serialize and deserialize Kafka records to and from Avro format. The Schema Registry performs compatibility checks between the schema used by producers and consumers, and handles schema evolution if needed to allow schemas to change over time in a backwards compatible manner. It provides APIs for registering, retrieving, and checking compatibility of schemas.
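To make the producer side concrete, here is a minimal Java sketch of publishing an Avro record through the Schema Registry; the broker address, registry URL, topic name, and the PageView schema are illustrative assumptions, while the KafkaAvroSerializer class and the schema.registry.url property come from Confluent's standard client configuration.

    import java.util.Properties;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AvroProducerSketch {
        public static void main(String[] args) {
            // Hypothetical Avro schema for the example; real schemas usually live in .avsc files.
            String schemaJson = "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
                    + "{\"name\":\"userId\",\"type\":\"string\"},"
                    + "{\"name\":\"url\",\"type\":\"string\"}]}";
            Schema schema = new Schema.Parser().parse(schemaJson);

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");          // assumption: local broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            // Confluent's Avro serializer registers/looks up schemas in the Schema Registry.
            props.put("value.serializer",
                    "io.confluent.kafka.serializers.KafkaAvroSerializer");
            props.put("schema.registry.url", "http://localhost:8081"); // assumption: local registry

            GenericRecord record = new GenericData.Record(schema);
            record.put("userId", "u-123");
            record.put("url", "/landing");

            try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
                // The serializer checks the schema against the registered subject before writing.
                producer.send(new ProducerRecord<>("page-views", "u-123", record));
            }
        }
    }

A consumer would typically mirror this with KafkaAvroDeserializer, and the send fails if the schema is incompatible with the subject's configured compatibility level.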
- Microservices advocate creating a system from small, isolated services that each own their data and are independently scalable and resilient. They are inspired by biological cells that are small, single-purpose, and work together through messaging.
- The system is divided using a divide and conquer approach, decomposing it into discrete subsystems that communicate over well-defined protocols. Each microservice focuses on a single business capability and owns its own data and behavior.
- Microservices communicate asynchronously through APIs and events to maintain independence and isolation, which enables continuous delivery, failure resilience, and independent scaling of each service.
The document discusses Git and GitHub workflows. It begins by describing Git as a distributed version control system designed for speed, integrity and distributed workflows. It then explains Git's branching model including features, releases, hotfixes and how GitHub is used to collaborate through forking repositories and pull requests.
The document discusses use cases for IBM DataPower Gateways. It provides an overview of DataPower Gateways and their capabilities including security, integration, control, and optimization for mobile, API, web, SOA, B2B, and cloud workloads. Specific use cases covered include security and optimization gateway, mobile connectivity, API management, integration, mainframe integration and enablement, and B2B.
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
Configuration of Spring Boot applications using Spring Cloud Config and Spring Cloud Vault.
Presentation given at the meeting of the Java User Group Freiburg on October 24, 2017
3 Things to Learn About:
-How Kudu is able to fill the analytic gap between HDFS and Apache HBase
-The trade-offs between real-time transactional access and fast analytic performance
-How Kudu provides an option to achieve fast scans and random access from a single API
Two popular tools for doing machine learning on top of the JVM ecosystem are H2O and SparkML. This presentation compares the two as machine learning libraries (it does not consider Spark's data-munging capabilities). This work was done in June 2018.
Running OpenShift Clusters in a Cloudstack Environment, by ShapeBlue
This document provides an overview of EWERK, an IT services company based in Germany that has over 600 customers across Europe, with a focus on their experience running OpenShift clusters on Cloudstack. It discusses the challenges of performance, VLAN separation for different customers, and using containers without SDN in Cloudstack, and outlines their hardware, network, storage, and Cloudstack installation configuration.
The document discusses using Hazelcast distributed locks to synchronize access to critical sections of code across multiple JVMs and application instances. It describes how Hazelcast implements distributed versions of common Java data structures, including distributed locks via its ILock interface. It provides examples of configuring a Hazelcast cluster programmatically by specifying cluster properties like IP addresses and ports, and shows how to obtain and use a distributed lock within a try-finally block to ensure it is released.
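A minimal sketch of that pattern, assuming a Hazelcast 3.x cluster (the ILock interface described above was removed in the 4.x line); the port, member address, and lock name are placeholder assumptions.

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.ILock;

    public class DistributedLockSketch {
        public static void main(String[] args) {
            // Programmatic cluster configuration: disable multicast and join via TCP/IP.
            Config config = new Config();
            config.getNetworkConfig().setPort(5701);
            config.getNetworkConfig().getJoin().getMulticastConfig().setEnabled(false);
            config.getNetworkConfig().getJoin().getTcpIpConfig()
                  .setEnabled(true)
                  .addMember("192.168.1.10"); // assumption: address of another cluster member

            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

            // Distributed lock shared by every JVM that joins the same cluster.
            ILock lock = hz.getLock("order-processing-lock");
            lock.lock();
            try {
                // Critical section: only one instance across the cluster executes this at a time.
                System.out.println("Processing order inside the critical section");
            } finally {
                lock.unlock(); // always release, even if the critical section throws
            }

            hz.shutdown();
        }
    }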
The presentation covers 7 design patterns and 5 antipatterns. It also discusses why you should use design patterns for better architecture, and covers principle-based design. To see the video, visit https://youtu.be/h-_Ns6nmWKw
Big Data means big hardware, and the less of it we can use to do the job properly, the better the bottom line. Apache Kafka makes up the core of our data pipelines at many organizations, including LinkedIn, and we are on a perpetual quest to squeeze as much as we can out of our systems, from Zookeeper, to the brokers, to the various client applications. This means we need to know how well the system is running, and only then can we start turning the knobs to optimize it. In this talk, we will explore how best to monitor Kafka and its clients to assure they are working well. Then we will dive into how to get the best performance from Kafka, including how to pick hardware and the effect of a variety of configurations in both the broker and clients. We’ll also talk about setting up Kafka for no data loss.
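As a concrete illustration of the "no data loss" configuration the talk alludes to, here is a hedged sketch of producer-side durability settings; the broker list is a placeholder, and the commented broker/topic settings (replication factor, min.insync.replicas, unclean leader election) must be applied on the cluster side rather than in this properties object.

    import java.util.Properties;

    public class DurableProducerConfig {
        // Producer settings commonly combined to avoid losing acknowledged messages.
        public static Properties durableProducerProps() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // assumption: broker list
            props.put("acks", "all");            // wait for all in-sync replicas to acknowledge
            props.put("retries", Integer.toString(Integer.MAX_VALUE));   // retry transient failures
            props.put("enable.idempotence", "true");                     // avoid duplicates from retries
            props.put("max.in.flight.requests.per.connection", "1");     // preserve ordering on retry
            // Broker/topic side (set outside the producer): replication factor >= 3,
            // min.insync.replicas >= 2, unclean.leader.election.enable = false.
            return props;
        }
    }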
Discover the essential principles and features of git. Be ready to work with it in 3 hours.
The latest version is available for direct download at this address: http://giant-teapot.org/uploads/tutorials/git_tutorial.pdf
Slide deck for the git training given for the Atilla association, September 2012.
Kudu is a storage engine for Hadoop designed to address gaps in Hadoop's ability to handle workloads that require both high-throughput data ingestion and low-latency random access. It is a columnar storage engine that uses a log-structured merge tree to store data and provides APIs for NoSQL and SQL access. Kudu aims to provide high performance for both scans and random access through its columnar design and tablet architecture that partitions data across servers.
This document discusses InfluxDB, an open-source time series database. It stores time stamped numeric data in structures called time series. The document provides an overview of time series data, describes how to install and use InfluxDB, and discusses features like its HTTP API, client libraries, Grafana integration for visualization, and benchmark results showing it has better performance for time series data than other databases.
This document provides an overview of Oracle GoldenGate and discusses performance tuning considerations. It begins with an introduction to GoldenGate's architecture and use cases. It then discusses the importance of baselining a GoldenGate implementation to understand existing performance. The document outlines how to gather baseline metrics on GoldenGate lag times, checkpoint information, and operating system CPU, memory, and disk I/O. It also provides GoldenGate tuning recommendations, such as using multiple process groups and parallel replication groups. The goal of performance tuning is to reduce lag times and optimize resource utilization.
Hadoop application architectures - using Customer 360 as an example, by hadooparchbook
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
This document discusses application architectures using Hadoop. It provides an example case study of clickstream analysis. It covers challenges of Hadoop implementation and various architectural considerations for data storage and modeling, data ingestion, and data processing. For data processing, it discusses different processing engines like MapReduce, Pig, Hive, Spark and Impala. It also discusses what specific processing needs to be done for the clickstream data like sessionization and filtering.
This document discusses a case study on fraud detection using Hadoop. It begins with an overview of fraud detection requirements, including the need for real-time and near real-time processing of large volumes and varieties of data. It then covers considerations for the system architecture, including using HDFS and HBase for storage, Kafka for ingestion, and Spark and Storm for stream and batch processing. Data modeling with HBase and caching options are also discussed.
The document discusses architectural considerations for Hadoop applications based on a case study of clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it recommends storing raw clickstream data in HDFS using the Avro file format with Snappy compression. For processed data, it recommends using the Parquet columnar storage format to enable efficient analytical queries. The document also discusses partitioning strategies and HDFS directory layout design.
The document discusses application architectures using Hadoop. It provides an example case study of clickstream analysis of web logs. It discusses challenges of Hadoop implementation and various architectural considerations for data storage, modeling, ingestion, processing and what specific processing needs to happen for the case study. These include sessionization, filtering, and business intelligence/discovery. Storage options, file formats, schema design, and processing engines like MapReduce, Spark and Impala are also covered.
This document discusses a presentation on fraud detection application architectures using Hadoop. It provides an overview of different fraud use cases and challenges in implementing Hadoop-based solutions. Requirements for the applications include handling high volumes, velocities and varieties of data, generating real-time alerts with low latency, and performing both stream and batch processing. A high-level architecture is proposed using Hadoop, HBase, HDFS, Kafka and Spark to meet the requirements. Storage layer choices and considerations are also discussed.
Building a fraud detection application using the tools in the Hadoop ecosystem. Presentation given by authors of O'Reilly's Hadoop Application Architectures book at Strata + Hadoop World in San Jose, CA 2016.
Application architectures with Hadoop – Big Data TechCon 2014, by hadooparchbook
Building applications using Apache Hadoop with a use-case of clickstream analysis. Presented by Mark Grover and Jonathan Seidman at Big Data TechCon, Boston in April 2014
Users leave thousands of traces per second on a successful e-commerce site. It is very practical and valuable to analyze and react to this trace event stream in real time; this is called clickstream analysis. In the talk I'll present a software architecture based on Apache Spark which is able to process thousands of clickstream events per second. A product based on this architecture has been in production since mid-2015 and is still performing well. The building blocks of the architecture, besides Spark, are Kafka to handle the inbound event stream, Spark Streaming for initial stream processing, and Parquet as the serialization format. I argue why we've chosen these technologies and what experiences we had in developing, launching and operating the product.
Architecting next generation big data platform, by hadooparchbook
A tutorial on architecting next generation big data platform by the authors of O'Reilly's Hadoop Application Architectures book. This tutorial discusses how to build a customer 360 (or entity 360) big data application.
Audience: Technical.
Architectural considerations for Hadoop Applications, by hadooparchbook
The document discusses architectural considerations for Hadoop applications using a case study on clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it considers HDFS vs HBase, file formats, and compression formats. SequenceFiles are identified as a good choice for raw data storage as they allow for splittable compression.
Hadoop Application Architectures tutorial at Big DataService 2015, by hadooparchbook
This document outlines a presentation on architectural considerations for Hadoop applications. It introduces the presenters who are experts from Cloudera and contributors to Apache Hadoop projects. It then discusses a case study on clickstream analysis, how this was challenging before Hadoop due to data storage limitations, and how Hadoop provides a better solution by enabling active archiving of large volumes and varieties of data at scale. Finally, it covers some of the challenges in implementing Hadoop, such as choices around storage managers, data modeling and file formats, data movement workflows, metadata management, and data access and processing frameworks.
What no one tells you about writing a streaming app, by hadooparchbook
This document discusses 5 things that are often not addressed when writing streaming applications:
1. Managing and monitoring long-running streaming jobs can be challenging as frameworks were not originally designed for streaming workloads. Options include using cluster mode to ensure jobs continue if clients disconnect and leveraging monitoring tools to track metrics.
2. Preventing data loss requires different approaches depending on the data source. File and receiver-based sources benefit from checkpointing, while Kafka's commit log ensures data is not lost (a minimal checkpointing sketch follows this list).
3. Spark Streaming is well-suited for tasks involving windowing, aggregations, and machine learning but may not be needed for all streaming use cases.
4. Achieving exactly-once semantics requires additional techniques, such as idempotent updates or transactional sinks.
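A minimal sketch of the checkpoint-based recovery mentioned in point 2, using Spark Streaming's Java API; the checkpoint directory and 10-second batch interval are assumptions, and the actual sources and transformations are left as comments.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class CheckpointedStreamSketch {
        private static final String CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint"; // assumption

        public static void main(String[] args) throws InterruptedException {
            // getOrCreate either recovers the context (and in-flight state) from the
            // checkpoint directory or builds a fresh one via the factory function.
            JavaStreamingContext jssc =
                    JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, CheckpointedStreamSketch::createContext);
            jssc.start();
            jssc.awaitTermination();
        }

        private static JavaStreamingContext createContext() {
            SparkConf conf = new SparkConf().setAppName("checkpointed-stream");
            // For receiver-based sources, also enable the write-ahead log on the SparkConf:
            // conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
            jssc.checkpoint(CHECKPOINT_DIR); // enable metadata (and state) checkpointing

            // Define sources and transformations here, e.g. a receiver or a Kafka direct stream;
            // they must be defined inside this factory for recovery to work.

            return jssc;
        }
    }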
Top 5 mistakes when writing Spark applications, by hadooparchbook
This document discusses common mistakes made when writing Spark applications and provides recommendations to address them. It covers issues like having executors that are too small or large, shuffle blocks exceeding size limits, data skew slowing jobs, and excessive stages. The key recommendations are to optimize executor and partition sizes, increase partitions to reduce skew, use techniques like salting to address skew, and favor transformations like ReduceByKey over GroupByKey to minimize shuffles and memory usage.
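To illustrate the ReduceByKey-over-GroupByKey recommendation, here is a small self-contained Java word count done both ways on made-up data; reduceByKey combines values map-side before the shuffle, while groupByKey ships every value across the network and buffers them per key.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ReduceVsGroupByKey {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("reduce-vs-group").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> words =
                        sc.parallelize(Arrays.asList("kafka", "spark", "kafka", "hdfs", "spark"));

                JavaPairRDD<String, Integer> pairs =
                        words.mapToPair(w -> new Tuple2<>(w, 1));

                // Preferred: values are combined on the map side before the shuffle.
                JavaPairRDD<String, Integer> countsReduce =
                        pairs.reduceByKey(Integer::sum);

                // Works, but every single 1 is shuffled and held in memory per key.
                JavaPairRDD<String, Integer> countsGroup =
                        pairs.groupByKey()
                             .mapValues(values -> {
                                 int sum = 0;
                                 for (int v : values) sum += v;
                                 return sum;
                             });

                System.out.println(countsReduce.collect());
                System.out.println(countsGroup.collect());
            }
        }
    }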
Top 5 mistakes when writing Spark applications, by hadooparchbook
This document discusses common mistakes people make when writing Spark applications and provides recommendations to address them. It covers issues related to executor configuration, application failures due to shuffle block sizes exceeding limits, slow jobs caused by data skew, and managing the DAG to avoid excessive shuffles and stages. Recommendations include using smaller executors, increasing the number of partitions, addressing skew through techniques like salting, and preferring ReduceByKey over GroupByKey and TreeReduce over Reduce to improve performance and resource usage.
Clickstream Data Warehouse - Turning clicks into customers, by Albert Hui
As the web becomes a main channel for reaching customers and prospects, clickstream data generated by websites has become another important enterprise data source, alongside traditional business data sources such as store transactions, CRM data, and call-center logs. Although it simply records every click a customer makes, clickstream data offers a wide range of opportunities for modelling user behaviour and gaining valuable customer insights. It is a data source that has been under-utilized. However, the benefits come with a problem: Amazon records 5 billion clicks a day, and the whole US generates 400 billion clicks, equivalent to 3.4 petabytes a day. This immense volume gives enterprises and their IT professionals a big data problem to solve before they can fully utilize this insight-rich data source.
This presentation uses big data technology to help solve that problem; the presenter will cover clickstream data end to end: its benefits, its challenges, and a solution. The end-to-end solution includes a proposed data architecture, ETL, and various machine learning algorithms. A real-world success story will also be presented to help the audience grasp the concept and its applications, along with sample code and a demo the audience can apply in their respective areas.
Organizations across diverse industries are in pursuit of Customer 360, by integrating customer information across multiple channels, systems, devices and products. Having a 360-degree view of the customer enables enterprises to improve the interaction experience, drive customer loyalty and improve retention. However delivering a true Customer 360 can be very challenging.
The document provides an agenda and slides for a presentation on architectural considerations for data warehousing with Hadoop. The presentation discusses typical data warehouse architectures and challenges, how Hadoop can complement existing architectures, and provides an example use case of implementing a data warehouse with Hadoop using the Movielens dataset. Key aspects covered include ingestion of data from various sources using tools like Flume and Sqoop, data modeling and storage formats in Hadoop, processing the data using tools like Hive and Spark, and exporting results to a data warehouse.
Application Architectures with Hadoop | Data Day Texas 2015, by Cloudera, Inc.
This document discusses application architectures using Hadoop. It begins with an introduction to the speaker and his book on Hadoop architectures. It then presents a case study on clickstream analysis, describing how web logs could be analyzed in Hadoop. The document discusses challenges of Hadoop implementation and various architectural considerations for data storage, modeling, ingestion, processing and more. It focuses on choices for storage layers, file formats, schema design and processing engines like MapReduce, Spark and Impala.
The document discusses architectural considerations for implementing clickstream analytics using Hadoop. It covers choices for data storage layers like HDFS vs HBase, data modeling including file formats and partitioning, data ingestion methods like Flume and Sqoop, available processing engines like MapReduce, Hive, Spark and Impala, and the need to sessionize clickstream data to analyze metrics like bounce rates and attribution.
Application Architectures with Hadoop - UK Hadoop User Group, by hadooparchbook
This document discusses architectural considerations for analyzing clickstream data using Hadoop. It covers choices for data storage layers like HDFS vs HBase, data formats like Avro and Parquet, partitioning strategies, and data ingestion using tools like Flume and Kafka. It also discusses processing engines like MapReduce, Spark and Impala and how they can be used to sessionize data and perform other analytics.
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives, by Cloudera, Inc.
This session will provide an executive overview of the Apache Hadoop ecosystem, its basic concepts, and its real-world applications. Attendees will learn how organizations worldwide are using the latest tools and strategies to harness their enterprise information to solve business problems and the types of data analysis commonly powered by Hadoop. Learn how various projects make up the Apache Hadoop ecosystem and the role each plays to improve data storage, management, interaction, and analysis. This is a valuable opportunity to gain insights into Hadoop functionality and how it can be applied to address compelling business challenges in your agency.
Cloudera Navigator provides integrated data governance and security for Hadoop. It includes features for metadata management, auditing, data lineage, encryption, and policy-based data governance. KeyTrustee is Cloudera's key management server that integrates with hardware security modules to securely manage encryption keys. Together, Navigator and KeyTrustee allow users to classify data, audit usage, and encrypt data at rest and in transit to meet security and compliance needs.
This webinar discusses tools for making big data easy to work with. It covers MetaScale Expertise, which provides Hadoop expertise and case studies. Kognitio Analytics is discussed as a way to accelerate Hadoop for organizations. The webinar agenda includes an introduction, presentations on MetaScale and Kognitio, and a question and answer session. Rethinking data strategies with Hadoop and using in-memory analytics are presented as ways to gain insights from large, diverse datasets.
Fundamentals of big data, Hadoop project design, and a case study / use case.
General planning considerations and the essentials of the Hadoop ecosystem and Hadoop projects.
This provides the basis for choosing the right Hadoop implementation, integrating and adopting Hadoop technologies, and creating an infrastructure.
Building applications using Apache Hadoop, illustrated with a real-life use case of Wi-Fi log analysis.
The document introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It provides background on why Hadoop was created, how it originated from Google's papers on distributed systems, and how organizations commonly use Hadoop for applications like log analysis, customer analytics and more. The presentation then covers fundamental Hadoop concepts like HDFS, MapReduce, and the overall Hadoop ecosystem.
You’ve successfully deployed Hadoop, but are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? In the first part of the talk, we will cover issues that have been seen over the last two years on hundreds of production clusters with detailed breakdown covering the number of occurrences, severity, and root cause. We will cover best practices and many new tools and features in Hadoop added over the last year to help system administrators monitor, diagnose and address such incidents.
The second part of our talk discusses new features for making daily operations easier. This includes features such as ACLs for simplified permission control, snapshots for data protection and more. We will also cover tuning configuration and features that improve cluster utilization, such as short-circuit reads and datanode caching.
Simplifying Real-Time Architectures for IoT with Apache Kudu, by Cloudera, Inc.
3 Things to Learn About:
*Building scalable real time architectures for managing data from IoT
*Processing data in real time with components such as Kudu & Spark
*Customer case studies highlighting real-time IoT use cases
Data is being generated at a feverish pace and many businesses want all of it at their disposal to solve complex strategic problems. As decision making moves to real time, enterprises need data ready for analysis immediately. Sean Anderson and Amandeep Khurana will discuss common pipeline trends in modern streaming architectures, Hadoop components that enable streaming capabilities, and popular use cases that are enabling the world of IoT and real-time data science.
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind, by Avere Systems
While cloud computing offers virtually unlimited capacity, harnessing that capacity in an efficient, cost effective fashion can be cumbersome and difficult at the workload level. At the organizational level, it can quickly become chaos.
You must make choices around cloud deployment, and these choices could have a long-lasting impact on your organization. It is important to understand your options and avoid incomplete, complicated, locked-in scenarios. Data management and placement challenges make having the ability to automate workflows and processes across multiple clouds a requirement.
In this webinar, you will:
• Learn how to leverage cloud services as part of an overall computation approach
• Understand data management in a cloud-based world
• Hear what options you have to orchestrate HPC in the cloud
• Learn how cloud orchestration works to automate and align computing with specific goals and objectives
• See an example of an orchestrated HPC workload using on-premises data
From computational research to financial back testing, and research simulations to IoT processing frameworks, decisions made now will not only impact future manageability, but also your sanity.
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data, by Mike Percy
The document discusses using Kafka and Kudu for low-latency SQL analytics on streaming data. It describes the challenges of supporting both streaming and batch workloads simultaneously using traditional solutions. The authors propose using Kafka to ingest data and Kudu for structured storage and querying. They demonstrate how this allows for stream processing, batch processing, and querying of up-to-second data with low complexity. Case studies from Xiaomi and TPC-H benchmarks show the advantages of this approach over alternatives.
Data ingest is a deceptively hard problem. In the world of big data processing, it becomes exponentially more difficult. It's not sufficient to simply land data on a system; that data must be ready for processing and analysis. The Kite SDK is a data API designed for solving the issues related to data ingest and preparation. In this talk you'll see how Kite can be used for everything from simple tasks to production-ready data pipelines in minutes.
Building Scalable Big Data Infrastructure Using Open Source Software Presenta..., by ssuserd3a367
1) StumbleUpon uses open source tools like Kafka, HBase, Hive and Pig to build a scalable big data infrastructure to process large amounts of data from its services in real-time and batch.
2) Data is collected from various services using Kafka and stored in HBase for real-time analytics. Batch processing is done using Pig and data is loaded into Hive for ad-hoc querying.
3) The infrastructure powers various applications like recommendations, ads and business intelligence dashboards.
Architecting a next-generation data platform, by hadooparchbook
This document discusses a high-level architecture for analyzing taxi trip data in real-time and batch using Apache Hadoop and streaming technologies. The architecture includes ingesting data from multiple sources using Kafka, processing streaming data using stream processing engines, storing data in data stores like HDFS, and enabling real-time and batch querying and analytics. Key considerations discussed are choosing data transport and stream processing technologies, scaling and reliability, and processing both streaming and batch data.
Top 5 mistakes when writing Streaming applications, by hadooparchbook
This document discusses 5 common mistakes when writing streaming applications and provides solutions. It covers: 1) Not shutting down apps gracefully; use shutdown hooks or external markers to stop processing after the current batch finishes. 2) Assuming exactly-once semantics when failures can happen at multiple points; tracking offsets and making operations idempotent helps. 3) Using streaming for everything, when batch processing is a better fit for some goals. 4) Not preventing data loss; enable checkpointing and write-ahead logs. 5) Not monitoring jobs; use tools like the Spark Streaming UI and Graphite, and YARN cluster mode for automatic restarts. A minimal sketch of the graceful-shutdown idea follows.
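This sketch shows the external-marker shutdown idea from point 1, assuming a Spark Streaming job; the HDFS marker path is a made-up convention, and setting spark.streaming.stopGracefullyOnShutdown=true is an alternative that lets a plain SIGTERM finish the in-flight batch.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class GracefulShutdownSketch {
        // External "marker file" convention (an assumption for this sketch): an operator
        // creates this HDFS path when the job should stop after the current batch.
        private static final Path STOP_MARKER = new Path("hdfs:///tmp/stop-streaming-job");

        public static void awaitStopMarker(JavaStreamingContext jssc)
                throws IOException, InterruptedException {
            FileSystem fs = FileSystem.get(new Configuration());
            boolean stopped = false;
            while (!stopped) {
                // Wait up to 10s per loop iteration; returns true if the context terminated.
                stopped = jssc.awaitTerminationOrTimeout(10_000);
                if (!stopped && fs.exists(STOP_MARKER)) {
                    // Stop the underlying SparkContext too, and let in-flight batches finish.
                    jssc.stop(true, true);
                    stopped = true;
                }
            }
        }
    }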
Architecting a Next Generation Data Platform, by hadooparchbook
This document is a presentation on architecting a next-generation data platform with Hadoop. It provides an overview of the presentation topics, which include a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high-level architecture, including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka Streams, and storage in Hadoop.
Architecting applications with Hadoop - Fraud Detection, by hadooparchbook
This document discusses architectures for fraud detection applications using Hadoop. It provides an overview of requirements for such an application, including the need for real-time alerts and batch processing. It proposes using Kafka for ingestion due to its high throughput and partitioning. HBase and HDFS would be used for storage, with HBase better supporting random access for profiles. The document outlines using Flume, Spark Streaming, and HBase for near real-time processing and alerting on incoming events. Batch processing would use HDFS, Impala, and Spark. Caching profiles in memory is also suggested to improve performance.
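To make the "HBase for profiles" choice concrete, here is a small sketch of a random read and write of a customer profile with the standard HBase Java client; the table name, column family, qualifier, and row-key scheme are placeholder assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ProfileStoreSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table profiles = connection.getTable(TableName.valueOf("customer_profiles"))) {

                byte[] rowKey = Bytes.toBytes("customer-42"); // assumption: customer id as row key
                byte[] cf = Bytes.toBytes("p");               // assumption: single "profile" family

                // Random read: fetch the current profile for this customer.
                Result current = profiles.get(new Get(rowKey));
                byte[] lastSeen = current.getValue(cf, Bytes.toBytes("last_seen"));
                System.out.println("last_seen=" + (lastSeen == null ? "n/a" : Bytes.toLong(lastSeen)));

                // Random write: update the profile with the latest event's timestamp.
                Put update = new Put(rowKey);
                update.addColumn(cf, Bytes.toBytes("last_seen"), Bytes.toBytes(System.currentTimeMillis()));
                profiles.put(update);
            }
        }
    }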
The document discusses real-time fraud detection patterns and architectures. It provides an overview of key technologies like Kafka, Flume, and Spark Streaming used for real-time event processing. It then describes a high-level architecture involving ingesting events through Flume and Kafka into Spark Streaming for real-time processing, with results stored in HBase, HDFS, and Solr. The document also covers partitioning strategies, micro-batching, complex topologies, and ingestion of real-time and batch data.
Impala Architecture Presentation at Toronto Hadoop User Group, in January 2014 by Mark Grover.
Event details:
http://www.meetup.com/TorontoHUG/events/150328602/
Slide 36: MapReduce
• Oldie but goody
• Restrictive Framework / Innovated Work Around
• Extreme Batch
Slide 37: MapReduce Basic High Level
(Diagram: a block of data is read from HDFS (replicated) by the Mapper, which writes temp spill data and partitioned, sorted data to the native file system; Reducers make a local copy of that data and write the output file.)
Slide 38: Abstractions
• SQL
  – Hive
• Script/Code
  – Pig: Pig Latin
  – Crunch: Java/Scala
  – Cascading: Java/Scala
Slide 39: Spark
• The New Kid that isn’t that New Anymore
• Easily 10x less code
• Extremely Easy and Powerful API
• Very good for machine learning
• Scala, Java, and Python
• RDDs
• DAG Engine
Slide 44: Why sessionize?
Helps answer questions like:
• What is my website’s bounce rate?
  – i.e. what % of visitors don’t go past the landing page?
• Which marketing channels (e.g. organic search, display ads, etc.) are leading to the most sessions?
  – Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
• Do attribution analysis – which channels are responsible for the most conversions?
Slide 45: How to Sessionize?
1. Given a list of clicks, determine which clicks came from the same user
2. Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session
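As a minimal sketch of step 2, the function below assigns session numbers to one user's clicks once they have been grouped (step 1, e.g. by user id or IP) and sorted by timestamp; the 30-minute inactivity gap is a common convention assumed here, not something the slides prescribe.

    import java.util.ArrayList;
    import java.util.List;

    public class SessionizerSketch {
        private static final long SESSION_GAP_MS = 30 * 60 * 1000L; // assumption: 30-minute inactivity gap

        /**
         * Assigns a session number to each click of a single user.
         * Input: click timestamps (millis) for one user, sorted ascending.
         * Output: sessionIds.get(i) is the session number of the i-th click.
         */
        public static List<Integer> sessionize(List<Long> sortedClickTimes) {
            List<Integer> sessionIds = new ArrayList<>();
            int session = 0;
            long previous = Long.MIN_VALUE;
            for (long ts : sortedClickTimes) {
                // A click starts a new session if the gap since the previous click
                // exceeds the inactivity threshold.
                if (previous != Long.MIN_VALUE && ts - previous > SESSION_GAP_MS) {
                    session++;
                }
                sessionIds.add(session);
                previous = ts;
            }
            return sessionIds;
        }
    }

In a distributed job this per-user function would typically run inside a grouped transformation, e.g. Spark's groupByKey followed by mapValues over (user, clicks) pairs.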
Slide 63: Thank you