Building 1000 Node Spark Cluster On EMR

The document discusses building a 1000 node cluster on Amazon EMR (Elastic MapReduce). It provides an overview of EMR concepts including master nodes, core nodes, and task nodes. It also covers using Spark on EMR including installing Spark via bootstrap actions, scaling the cluster by adding task nodes for more memory and removing them after jobs complete, and autoscaling based on memory usage. It provides an example command to easily launch a 1000 node Spark/Shark cluster on EMR.

Building 1000 node cluster on EMR
Manjeet Chayel

What is EMR?
Amazon Elastic MapReduce

Hadoop-as-a-service

Map-Reduce engine

Massively parallel

What is EMR?

Integrated with tools

Integrated with AWS services

Cost-effective AWS wrapper

[Diagram, built up over several slides: Amazon EMR with HDFS and Hadoop ecosystem analytical tools. Data sources: Amazon S3, Amazon DynamoDB, Amazon Kinesis, Amazon RDS, Amazon Redshift. Data management: AWS Data Pipeline.]

Amazon EMR Concepts

Master Node
Core Nodes
Task Nodes

Core Nodes
DataNode (HDFS)

[Diagram, repeated on each of the following slides: Amazon EMR cluster with a master instance group (master node) and a core instance group (HDFS)]

Core Nodes
Can add core nodes:
More CPU
More memory
More HDFS space

Core Nodes
Can't remove core nodes:
HDFS corruption

Task Nodes
No HDFS
Provide compute resources:
CPU
Memory

Task Nodes
Can add and remove task nodes
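
The resize rules on the node-type slides can be sketched as a small check. This is an illustration only, not EMR's API: the InstanceGroup class and the can_resize function are hypothetical names, and treating the master group as fixed-size is an assumption the slides do not state.

```python
from dataclasses import dataclass


@dataclass
class InstanceGroup:
    """Hypothetical model of an EMR instance group (not an AWS type)."""
    role: str   # "MASTER", "CORE", or "TASK"
    count: int  # current number of nodes in the group


def can_resize(group: InstanceGroup, new_count: int) -> bool:
    """Return True if the resize follows the rules stated on the slides."""
    if new_count < 0:
        return False
    if group.role == "TASK":
        # Task nodes hold no HDFS data, so they can be added and removed.
        return True
    if group.role == "CORE":
        # Core nodes hold HDFS blocks: growing is fine, shrinking risks
        # HDFS corruption.
        return new_count >= group.count
    # Assumption: the master group keeps its size.
    return new_count == group.count


print(can_resize(InstanceGroup("TASK", 100), 0))   # task group can shrink
print(can_resize(InstanceGroup("CORE", 10), 5))    # core group cannot
```

Keeping the rule in one function makes it easy to call before any scale-in request is sent.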

Spark On Amazon EMR

Bootstrap Actions
Ability to run or install additional packages/software on EMR nodes
Simple bash script stored on S3
Script gets executed during node/instance boot time
Script gets executed on every node that gets added to the cluster

Spark on Amazon EMR

Bootstrap action installs Spark on EMR nodes
Currently on Spark 0.8.1 & upgrading to 1.0 very soon

Why Spark on Amazon EMR?

Deploy small and large Spark clusters in minutes
EMR manages your cluster and handles node recovery in case of failures
Integration with EC2 Spot Market, Amazon Redshift, AWS Data Pipeline, Amazon CloudWatch, etc.
Tight integration with Amazon S3

Why Spark on Amazon EMR?

Shipping Spark logs to S3 for debugging
Define S3 bucket at cluster deploy time

Scaling Spark on EMR
Launch initial Spark cluster with core nodes
HDFS to store and checkpoint RDDs

[Diagram: Amazon EMR cluster, master instance group (master node), HDFS, 32GB memory]

Scaling Spark on EMR
Add task nodes in the spot market to increase memory capacity

[Diagram: Amazon EMR cluster, master instance group (master node), HDFS, 32GB memory, 256GB memory]

Scaling Spark on EMR
Create RDDs from HDFS or Amazon S3 with:
sc.textFile
OR
sc.sequenceFile
Run computation on RDDs

[Diagram: Amazon EMR cluster, master instance group (master node), HDFS, 32GB memory, 256GB memory, Amazon S3]

Scaling Spark on EMR
Save the resulting RDDs to HDFS or S3 with:
rdd.saveAsSequenceFile
OR
rdd.saveAsObjectFile

[Diagram: Amazon EMR cluster, master instance group (master node), HDFS, 32GB memory, 256GB memory, saveAsObjectFile, Amazon S3]

Scaling Spark on EMR
Shut down task nodes when your job is done

[Diagram: Amazon EMR cluster, master instance group (master node), HDFS, 32GB memory]

Elastic Spark With Amazon EMR

Autoscaling Spark

[Diagram, two builds: Amazon EMR cluster with master node, HDFS, and 32GB memory; the second build adds 256GB memory]

Elastic Spark
When to Scale?
Depends on your job

CPU bound or memory intensive?

Probably both for Spark jobs

Use CPU/memory utilization metrics to decide when to scale

Spark Autoscaling Based on Memory

Spark needs memory
Lots of it!!

How to scale based on the memory usage?
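
One possible answer to the question above is a threshold rule on memory utilization. The sketch below is an assumption, not the speaker's method: the function name, the 80% and 40% thresholds, the step of 10 nodes, and the cap of 1000 task nodes are all hypothetical values chosen for illustration.

```python
def desired_task_nodes(used_memory_gb, total_memory_gb, current_task_nodes,
                       scale_out_above=0.80, scale_in_below=0.40,
                       step=10, max_task_nodes=1000):
    """Return the task node count suggested by a simple threshold rule.

    All thresholds and step sizes are hypothetical defaults.
    """
    if total_memory_gb <= 0:
        raise ValueError("total_memory_gb must be positive")
    utilization = used_memory_gb / total_memory_gb
    if utilization > scale_out_above:
        # Memory is tight: add task nodes, up to the cap.
        return min(current_task_nodes + step, max_task_nodes)
    if utilization < scale_in_below:
        # Memory is mostly idle: remove task nodes, never below zero.
        return max(current_task_nodes - step, 0)
    # Utilization is inside the band: leave the cluster alone.
    return current_task_nodes


print(desired_task_nodes(90, 100, 20))  # above the upper threshold
print(desired_task_nodes(30, 100, 20))  # below the lower threshold
```

Only task nodes are resized here, since they carry no HDFS data and can be removed safely.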

Launching a 1000 node Cluster

Amazon EMR

Launching a 1000 node Cluster

What do I need to launch a cluster?
AWS Account
Amazon EMR CLI

Launching a 1000 node Cluster

Easy to launch: 1 command
./elastic-mapreduce --create --alive \
--name "Spark/Shark Cluster" \
--bootstrap-action s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh \
--bootstrap-name "Spark/Shark" \
--instance-type m1.xlarge \
--instance-count 1000

Comes up in 15-20 mins
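
For scripted launches, the same command can be assembled as an argument list. This is a minimal sketch that assumes the slide's flags read --create --alive and --bootstrap-action; the function name build_launch_command is hypothetical, and the code only builds and prints the command, it does not run the EMR CLI.

```python
import shlex


def build_launch_command(instance_count=1000, instance_type="m1.xlarge"):
    """Build the argument list for the launch command shown on the slide."""
    return [
        "./elastic-mapreduce", "--create", "--alive",
        "--name", "Spark/Shark Cluster",
        "--bootstrap-action",
        "s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh",
        "--bootstrap-name", "Spark/Shark",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
    ]


# Print a shell-quoted version for review before running it by hand.
print(shlex.join(build_launch_command()))
```

Passing the list to subprocess.run would execute it, provided the CLI is installed in the working directory and AWS credentials are configured.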

Launching a 1000 node Cluster

Adding Task nodes

--add-instance-group TASK
--instance-count INSTANCE_COUNT
--instance-type INSTANCE_TYPE

Is cluster ready?

Cluster will be in Waiting state

Lynx Interface
lynx http://localhost:9101

Web Interface

Spark UI

Dataset
Wikipedia article traffic statistics
4.5 TB
104 billion records

Stored in Amazon S3
s3://bigdata-spark-demo/wikistats/

Dataset
File structure (pagecount-DATE-HOUR.gz)
Period: Dec-2007 to Feb-2014
Format of File (tsv)
Fields
projectcode, pagename, pageviews, and bytes

Sample dataset
projectcode  pagename                           pageviews  bytes
en           Barack_Obama                       997        123091092
en           Barack_Obama%27s_first_100_days    8          850127
en           Barack_Obama,_Jr                   1          144103
en           Barack_Obama,_Sr.                  37         938821
en           Barack_Obama_%22HOPE%22_poster     4          81005
en           Barack_Obama_%22Hope%22_poster     5          102081
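
A record in this format can be parsed with a few lines of Python. The sketch assumes the four fields are separated by whitespace, as in the sample rows; the PageCount type and the parse_line and display_name functions are hypothetical names introduced for illustration.

```python
from typing import NamedTuple
from urllib.parse import unquote


class PageCount(NamedTuple):
    projectcode: str
    pagename: str    # kept percent-encoded, as stored in the file
    pageviews: int
    size_bytes: int  # the "bytes" column


def parse_line(line: str) -> PageCount:
    """Parse one 'projectcode pagename pageviews bytes' record."""
    parts = line.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    projectcode, pagename, pageviews, size_bytes = parts
    return PageCount(projectcode, pagename, int(pageviews), int(size_bytes))


def display_name(record: PageCount) -> str:
    """Decode percent-escapes such as %22 for human-readable output."""
    return unquote(record.pagename)


record = parse_line("en Barack_Obama_%22HOPE%22_poster 4 81005")
print(record.pageviews, display_name(record))
```

The page name is stored encoded so that differently cased or escaped titles remain distinct keys, and it is decoded only for display.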

Loading data

[Diagram: Amazon S3, Amazon EMR, HDFS]

Analyze options

Table structure
create external table wikistats
(
projectcode string,
pagename string,
pageviews int,
pagesize int
)
-- partition column assumed from the ALTER TABLE statement below
PARTITIONED BY (dt string)
ROW FORMAT
DELIMITED FIELDS
TERMINATED BY ' '
LOCATION 's3n://bigdata-spark-demo/wikistats/';

ALTER TABLE wikistats add partition(dt='2007-12') location 's3n://bigdata-spark-demo//wikistats/2007/2007-12';
.
Adding partitions for every month till 2014-04

Analyze using Shark

Query 1:
Top 10 Page Views in Jan 2014

Exec time: 26 secs
Scanning 250GB of data
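
The query text itself does not appear on the slide. As an illustration of what a "top N page views" aggregation computes, here is a small in-memory sketch in Python. It is not the Shark query that was run, and the function name top_pages is hypothetical; the sample rows are taken from the dataset slide.

```python
from collections import Counter


def top_pages(lines, n=10):
    """Sum pageviews per pagename and return the n largest totals."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed records
        _projectcode, pagename, pageviews, _size_bytes = parts
        totals[pagename] += int(pageviews)
    return totals.most_common(n)


sample = [
    "en Barack_Obama 997 123091092",
    "en Barack_Obama,_Jr 1 144103",
    "en Barack_Obama,_Sr. 37 938821",
    "en Barack_Obama_%22HOPE%22_poster 4 81005",
]
print(top_pages(sample, n=2))
```

On the cluster the same group-by, sum, and sort would run in parallel over the S3 data rather than over a Python list.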

Analyze using Shark

Query 2:
Top 10 Page Views Overall

Exec time: 45 secs
Scanning 4.5TB of data

Analyze using Shark

Query 3:
Number of pages in each projectcode

Exec time: 48 secs
Scanning 4.5TB of data / 104 Billion records

Spark Streaming and Amazon Kinesis

Amazon Kinesis
CreateStream
Creates a new data stream within the Kinesis service
PutRecord
Adds new records to a Kinesis stream
DescribeStream
Provides metadata about the stream, including name, status, shards, etc.
GetRecords
Fetches the next records from a shard for processing by user business logic
MergeShards / SplitShard
Scales the stream up or down
DeleteStream
Deletes the stream

Amazon Kinesis
Kinesis

What Would You Like To See?

Send Feedback To:

Manjeet Chayel
[email protected]
http://bit.ly/sparkemr
