Building 1000 Node Spark Cluster On EMR
What is EMR?
Amazon Elastic MapReduce
Hadoop-as-a-service
Map-Reduce engine
Massively parallel
What is EMR?
[Diagram: Amazon EMR and HDFS, surrounded by Amazon S3, Amazon DynamoDB, Amazon Kinesis, Amazon RDS and AWS Data Pipeline; labels: Data Sources, Data management]
Core Nodes
Run the DataNode (HDFS) in the Amazon EMR cluster
Can add core nodes: more CPU, more memory, more HDFS space
Can't remove core nodes: HDFS corruption
Task Nodes
No HDFS
Provide compute resources: CPU and memory
Can add and remove task nodes
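The add/remove rules for core and task nodes can be sketched as a small guard around a resize request. This is a minimal sketch, assuming the request is later applied with boto3's EMR `modify_instance_groups` call; the cluster ID, group IDs and function names are hypothetical placeholders, and the "never shrink core groups" check simply mirrors the rule stated on the slides.

```python
from typing import Dict


def plan_resize(cluster_id: str, group_id: str, role: str,
                current: int, target: int) -> Dict[str, object]:
    """Build a resize request for one instance group.

    role is the EMR instance role of the group: "CORE" or "TASK".
    """
    if target < 0:
        raise ValueError("target instance count must be non-negative")
    if role == "CORE" and target < current:
        # Core nodes hold HDFS blocks, so this sketch refuses to remove them.
        raise ValueError("refusing to shrink a CORE instance group")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": group_id, "InstanceCount": target},
        ],
    }


def apply_resize(request: Dict[str, object]) -> None:
    """Send the request to EMR (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so planning works without AWS access

    boto3.client("emr").modify_instance_groups(**request)


if __name__ == "__main__":
    # Hypothetical IDs: drop all task nodes once the job is done.
    print(plan_resize("j-EXAMPLE", "ig-EXAMPLE", "TASK", current=10, target=0))
```

Keeping the planning step separate from the API call makes the rule easy to check without touching a live cluster.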
Bootstrap Actions
Ability to run or install additional packages/software on EMR nodes
Simple bash script stored on S3
Script gets executed during node/instance boot time
Script gets executed on every node that gets added to the cluster
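As an illustration of how a bootstrap action is attached to a cluster, the sketch below builds the arguments for boto3's EMR `run_job_flow` call. Everything specific here is an assumption rather than something taken from the slides: the script path, the release label, the instance type, the cluster name and the use of the default EMR role names.

```python
from typing import Dict


def build_cluster_request(name: str, bootstrap_script: str, core_count: int,
                          release_label: str = "emr-5.36.0",   # assumed release
                          instance_type: str = "m5.xlarge"     # assumed type
                          ) -> Dict[str, object]:
    """Build run_job_flow arguments for a Spark cluster with one bootstrap action."""
    if not bootstrap_script.startswith("s3://"):
        raise ValueError("the bootstrap script is expected to live on S3")
    if core_count < 1:
        raise ValueError("at least one core node is needed for HDFS")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_count},
            ],
            # Keep the cluster alive so task nodes can be added later.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # The script runs at boot time on every node, including nodes added later.
        "BootstrapActions": [
            {"Name": "install-extra-packages",
             "ScriptBootstrapAction": {"Path": bootstrap_script, "Args": []}},
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request: Dict[str, object]) -> str:
    """Start the cluster and return its ID (needs boto3 and AWS credentials)."""
    import boto3

    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    # Hypothetical bucket and script name.
    print(build_cluster_request("spark-demo", "s3://my-bucket/bootstrap/install.sh", 4))
```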
Launch initial Spark cluster with core nodes
HDFS to store and checkpoint RDDs
[Diagram: master node, HDFS nodes with 32GB memory, additional nodes with 256GB memory, saveAsObjectFile to Amazon S3]
Shut down task nodes when your job is done
[Diagram: HDFS nodes with 32GB memory]
Autoscaling Spark
[Diagram: Amazon EMR cluster with master node, HDFS nodes with 32GB memory, and nodes with 256GB memory]
Elastic Spark
When to scale? Depends on your job
Amazon EMR
Comes up in 15-20 minutes
Is cluster ready?
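One way to answer "is the cluster ready?" from a script is to poll the cluster state until it leaves the startup phases. The sketch below is an assumption about how this could be done, not the method used in the talk: the state names are the ones reported by EMR's `describe_cluster` API, while the function names, timeout and polling interval are illustrative.

```python
import time
from typing import Callable

READY_STATES = {"RUNNING", "WAITING"}
FAILED_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_ready(get_state: Callable[[], str],
                     timeout_s: float = 30 * 60,   # a bit above the 15-20 minute startup
                     poll_s: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_state() until the cluster is usable, has failed, or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in READY_STATES:
            return state
        if state in FAILED_STATES:
            raise RuntimeError(f"cluster did not come up, state: {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"cluster still in state {state} after {timeout_s}s")
        sleep(poll_s)


def emr_state_getter(cluster_id: str) -> Callable[[], str]:
    """Return a function that reads the live cluster state (needs boto3 and credentials)."""
    import boto3

    emr = boto3.client("emr")
    return lambda: emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]


if __name__ == "__main__":
    # Simulated state sequence, so the example runs without AWS access.
    states = iter(["STARTING", "BOOTSTRAPPING", "WAITING"])
    print(wait_until_ready(lambda: next(states), sleep=lambda _s: None))
```

Passing the state reader in as a function keeps the loop testable; against a real cluster it would be called as `wait_until_ready(emr_state_getter(cluster_id))`, where `cluster_id` is the ID returned at launch.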
Lynx Interface
lynx http://localhost:9101
Web Interface
Spark UI
Dataset
Wikipedia article traffic statistics
4.5 TB
104 billion records
Stored in Amazon S3
s3://bigdata-spark-demo/wikistats/
Dataset
File structure (pagecount-DATE-HOUR.gz)
Period: Dec-2007 to Feb-2014
Format of file (tsv)
Fields: projectcode, pagename, pageviews, and bytes
Sample dataset
projectcode    pagename    pageviews    bytes
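The four-field record layout can be parsed with a few lines of Python. This is a minimal sketch: the record used in the example is made up for illustration, and splitting on any whitespace is an assumption chosen so the parser accepts both tab- and space-separated lines.

```python
from typing import NamedTuple


class PageCount(NamedTuple):
    projectcode: str
    pagename: str
    pageviews: int
    size_bytes: int   # the "bytes" field of the dataset


def parse_line(line: str) -> PageCount:
    """Parse one record: projectcode, pagename, pageviews, bytes."""
    parts = line.split()   # splits on tabs or spaces
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    projectcode, pagename, pageviews, size_bytes = parts
    return PageCount(projectcode, pagename, int(pageviews), int(size_bytes))


if __name__ == "__main__":
    # Made-up record, not taken from the real dataset.
    print(parse_line("en\tMain_Page\t42\t1024"))
```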
Loading data
[Diagram: Amazon S3, Amazon EMR, HDFS]
Analyze options
Table structure
create external table wikistats
(
  projectcode string,
  pagename string,
  pageviews int,
  pagesize int
)
PARTITIONED BY (dt string)  -- required before partitions can be added with ALTER TABLE
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3n://bigdata-spark-demo/wikistats/';
ALTER TABLE wikistats ADD PARTITION (dt='2007-12')
LOCATION 's3n://bigdata-spark-demo/wikistats/2007/2007-12';
Adding partitions for every month through 2014-04
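Writing one statement per month by hand is tedious, so the monthly partitions can be generated. The sketch below assumes the partition value is a quoted string and that every month follows the `wikistats/YYYY/YYYY-MM` path pattern of the 2007-12 example; the function names are illustrative.

```python
from typing import Iterator, Tuple


def months(start: Tuple[int, int], end: Tuple[int, int]) -> Iterator[Tuple[int, int]]:
    """Yield (year, month) pairs from start to end, both inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def add_partition_statements(start: Tuple[int, int] = (2007, 12),
                             end: Tuple[int, int] = (2014, 4),
                             base: str = "s3n://bigdata-spark-demo/wikistats"
                             ) -> Iterator[str]:
    """Yield one ALTER TABLE statement per monthly partition."""
    for year, month in months(start, end):
        dt = f"{year:04d}-{month:02d}"
        yield (f"ALTER TABLE wikistats ADD PARTITION (dt='{dt}') "
               f"LOCATION '{base}/{year:04d}/{dt}';")


if __name__ == "__main__":
    for statement in add_partition_statements():
        print(statement)
```

The output can be saved to a file and run with the Hive CLI, or pasted into a Hive session.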
Amazon Kinesis
CreateStream: creates a new data stream within the Kinesis service
PutRecord: adds new records to a Kinesis stream
DescribeStream: provides metadata about the stream, including name, status, shards, etc.
GetRecords: fetches the next records for processing by user business logic
MergeShards / SplitShard: scales the stream up/down
DeleteStream: deletes the stream
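To make the PutRecord call concrete, the sketch below builds the arguments that boto3's Kinesis `put_record` expects (`StreamName`, `Data`, `PartitionKey`). The stream name, the tab-separated payload and the choice of projectcode as partition key are assumptions made for illustration; the slides do not say how records were encoded.

```python
from typing import Dict


def to_put_record(stream_name: str, projectcode: str, pagename: str,
                  pageviews: int, size_bytes: int) -> Dict[str, object]:
    """Build put_record arguments for one page-count record."""
    payload = "\t".join([projectcode, pagename, str(pageviews), str(size_bytes)])
    return {
        "StreamName": stream_name,
        "Data": payload.encode("utf-8"),
        # Records with the same partition key are routed to the same shard.
        "PartitionKey": projectcode,
    }


def send(request: Dict[str, object]) -> dict:
    """Send one record to Kinesis (needs boto3 and AWS credentials)."""
    import boto3

    return boto3.client("kinesis").put_record(**request)


if __name__ == "__main__":
    # Hypothetical stream name and made-up record.
    print(to_put_record("wikistats-stream", "en", "Main_Page", 42, 1024))
```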