DW on AWS
Gaurav Agrawal
Data Platforms, AOL Inc.
AOL Data Platforms Architecture 2014
Data Stats & Insights
Cluster Size: 2 PB
In-House Cluster: 100 Nodes
Raw Data/Day: 2-3 TB
Data Retention: 13-24 Months
Challenges with In-House Infrastructure
Fixed Cost
Slow Deployment Cycle
Always On
Self Serve
Static: Not Scalable
Outages Impact Production Upgrade
Storage, Compute
AOL Data Platforms Architecture 2015
[Architecture diagram with numbered callouts 1 to 6]
Migration
• Web Console vs. CLI
Web Console and CLI
Web Console for Training
Setup IAM for users
AWS Services Options
S3 Data upload
EMR Creation & Steps
Try & Test multiple approaches
CLI is your friend..!!!
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
Copy Existing Data to S3
Environment Level Buckets
Dev, QA, Production, Analyst
bucket-dev bucket-qa bucket-prod bucket-analyst
Project Level Buckets
Code, Data, Log, Extract and Control
bucket-prod-code bucket-prod-data bucket-prod-log bucket-prod-extract bucket-prod-control
Compressed Snappy Data to GZIP
Multi Platforms Support
Best Compression
Lowest storage cost
Low cost for Data OUT
76% Less Storage
70K Saving/Year
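The bucket names listed on this slide follow a simple pattern. A minimal Python sketch of that pattern, assuming the form `bucket-<env>` for environment-level buckets and `bucket-<env>-<kind>` for project-level ones; the helper name `bucket_name` is hypothetical, and the deck only shows project-level buckets for prod, so applying the same scheme to the other environments is an assumption:

```python
# Environment and project-level names as they appear on the slide.
ENVIRONMENTS = ("dev", "qa", "prod", "analyst")
PROJECT_KINDS = ("code", "data", "log", "extract", "control")


def bucket_name(env, kind=None):
    """Return 'bucket-<env>' or 'bucket-<env>-<kind>' (hypothetical helper)."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    if kind is None:
        # Environment-level bucket, e.g. bucket-qa
        return f"bucket-{env}"
    if kind not in PROJECT_KINDS:
        raise ValueError(f"unknown project-level kind: {kind}")
    # Project-level bucket, e.g. bucket-prod-log
    return f"bucket-{env}-{kind}"


if __name__ == "__main__":
    for kind in PROJECT_KINDS:
        print(bucket_name("prod", kind))
```

Keeping the rule in one function means scripts that push or pull data never hard-code a bucket name.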
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
EMR Design Options
Amazon EMR
Transient vs. Persistent Cluster
Amazon S3 vs. local HDFS
Elastic Cluster vs. Static Cluster
On-Demand vs. Reserved vs. Spot
Core Nodes vs. Task Nodes
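One possible combination of these options is on-demand master and core nodes plus Spot task nodes. The Python sketch below writes that combination in the JSON shape that `aws emr create-cluster --instance-groups` accepts. The deck does not say which combination AOL chose, so the split, the instance type `m3.xlarge`, the counts, and the bid price are all illustrative assumptions:

```python
import json


def instance_groups(core_count, task_count, task_bid_price,
                    instance_type="m3.xlarge"):
    """Build an instance-group list for --instance-groups (illustrative values)."""
    groups = [
        {"Name": "master", "InstanceGroupType": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceGroupType": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Setting BidPrice requests Spot capacity for this group.
        groups.append(
            {"Name": "task", "InstanceGroupType": "TASK",
             "InstanceType": instance_type, "InstanceCount": task_count,
             "BidPrice": task_bid_price})
    return groups


if __name__ == "__main__":
    # Print JSON that could be saved as an instance_groups.json file.
    print(json.dumps(instance_groups(4, 10, "0.10"), indent=2))
```

With data on S3 rather than local HDFS, task nodes hold no HDFS blocks, which is why Spot capacity is a common fit for them.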
AOL Data Platforms Architecture 2015
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission - CLI
EMR Jobs Submission - CLI
In-house scheduler
Common Utilities
Provision EMR
Push/Pull Data to S3
Job submission to Scheduler
Database Load
JSON Files
Applications, Steps, Bootstrap, EC2 attributes, Instance Groups
Future: Event-Driven Design (Lambda, SQS)
EMR Jobs Submission - CLI
aws emr create-cluster --name "prod_dataset_subdataset_2015-10-08" \
  --tags "Env=prod" "Project=Omniture" "Dataset=DATASET" "Owner=gaurav" \
    "Subdataset=SUBDATASET" "Date=2015-10-08" "Region=us-east-1" \
  --visible-to-all-users \
  --ec2-attributes file://omni_awssot.generic.ec2_attributes.json \
  --ami-version "3.7.0" \
  --log-uri s3://bucket-prod-log/DATASET_NAME/SUBDATASET_NAME/ \
  --enable-debugging \
  --instance-groups file://omni_awssot.generic.instance_groups.json \
  --auto-terminate \
  --applications file://omni_awssot.generic.applications.json \
  --bootstrap-actions file://omni_awssot.generic.bootstrap_actions.json \
  --steps file://omni_awssot.generic.steps.json
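A common utility could assemble this command from a few parameters instead of repeating it per dataset. The following Python sketch does that under some assumptions: the function names and the `config_prefix` parameter are hypothetical, the cluster-name pattern `<env>_<dataset>_<subdataset>_<date>` is inferred from the example name, and the log bucket is assumed to follow `bucket-<env>-log`. It only builds the argument list and does not call AWS:

```python
import shlex


def cluster_name(env, dataset, subdataset, run_date):
    """Name pattern inferred from the slide's example cluster name."""
    return f"{env}_{dataset}_{subdataset}_{run_date}"


def create_cluster_args(env, project, dataset, subdataset, run_date,
                        owner, region, config_prefix):
    """Return the argv list for 'aws emr create-cluster' (not executed here)."""
    tags = [f"Env={env}", f"Project={project}", f"Dataset={dataset}",
            f"Owner={owner}", f"Subdataset={subdataset}",
            f"Date={run_date}", f"Region={region}"]
    return [
        "aws", "emr", "create-cluster",
        "--name", cluster_name(env, dataset, subdataset, run_date),
        "--tags", *tags,
        "--visible-to-all-users",
        "--ec2-attributes", f"file://{config_prefix}.ec2_attributes.json",
        "--ami-version", "3.7.0",
        "--log-uri", f"s3://bucket-{env}-log/{dataset}/{subdataset}/",
        "--enable-debugging",
        "--instance-groups", f"file://{config_prefix}.instance_groups.json",
        "--auto-terminate",
        "--applications", f"file://{config_prefix}.applications.json",
        "--bootstrap-actions", f"file://{config_prefix}.bootstrap_actions.json",
        "--steps", f"file://{config_prefix}.steps.json",
    ]


if __name__ == "__main__":
    args = create_cluster_args("prod", "Omniture", "dataset", "subdataset",
                               "2015-10-08", "gaurav", "us-east-1",
                               "omni_awssot.generic")
    print(shlex.join(args))  # shell-quoted command line for review
```

The returned list can be passed to `subprocess.run` once the JSON files exist; printing it first gives a cheap dry run.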
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
Monitoring
EMR WatchDog : Node.js
Duplicate Clusters
Failed Clusters
Long-running Clusters
Long-provisioning Clusters
CloudWatch Alarms
Monthly Billing
S3 Bucket Size
SNS Email Notifications
Amazon CloudWatch
Amazon SNS
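The four watchdog checks named on this slide can be sketched as a single pass over a list of cluster records. This is a Python illustration, not the Node.js EMR WatchDog itself. The record fields (`id`, `name`, `state`, `created`) and both thresholds are assumptions, "long-running" is measured from creation time for simplicity, and the state names are the standard EMR cluster states:

```python
from collections import Counter
from datetime import datetime, timedelta

PROVISIONING = {"STARTING", "BOOTSTRAPPING"}
WORKING = {"RUNNING", "WAITING"}
ACTIVE = PROVISIONING | WORKING


def check_clusters(clusters, now,
                   max_running=timedelta(hours=6),
                   max_provisioning=timedelta(minutes=30)):
    """Return (cluster_id, finding) pairs; thresholds are hypothetical."""
    # Count active clusters per name to spot duplicates.
    name_counts = Counter(c["name"] for c in clusters if c["state"] in ACTIVE)
    findings = []
    for c in clusters:
        age = now - c["created"]
        if c["state"] == "TERMINATED_WITH_ERRORS":
            findings.append((c["id"], "failed"))
        elif c["state"] in PROVISIONING and age > max_provisioning:
            findings.append((c["id"], "long-provisioning"))
        elif c["state"] in WORKING and age > max_running:
            findings.append((c["id"], "long-running"))
        if c["state"] in ACTIVE and name_counts[c["name"]] > 1:
            findings.append((c["id"], "duplicate"))
    return findings


if __name__ == "__main__":
    now = datetime(2015, 10, 8, 12, 0)
    sample = [
        {"id": "j-1", "name": "a", "state": "RUNNING",
         "created": datetime(2015, 10, 8, 2, 0)},
        {"id": "j-2", "name": "a", "state": "STARTING",
         "created": datetime(2015, 10, 8, 11, 55)},
    ]
    for finding in check_clusters(sample, now):
        print(finding)
```

Each finding could then be published to an SNS topic for email notification, as the slide describes.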
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
Elasticity
Why be Elastic?
[Three charts of Core Nodes Demand, x-axis 1 to 24]
Core Nodes Demand - 09/05/2015: Daily Processes
Core Nodes Demand - 09/20/2015: No Clusters, Spike in Demand
Core Nodes Demand - 06/01/2015: Major Restatement, Demand > 10K EC2
Elasticity
Why be Elastic?
True Cloud Architecture
Spot is an Open Market
Scale Horizontally
Our Limit : 3,000 EC2/Region
Multiple Regions
Multiple Instance Types
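The per-region limit explains the need for multiple regions: a demand above 10K instances cannot fit in one region capped at 3,000. A small Python sketch of the arithmetic, assuming a simple greedy fill in the order the regions are given; the function name and every region other than us-east-1 are illustrative, and a real allocation would also weigh Spot prices and instance types:

```python
def spread_demand(total_instances, regions, limit_per_region=3000):
    """Greedily assign instances to regions without exceeding the limit."""
    capacity = limit_per_region * len(regions)
    if total_instances > capacity:
        raise ValueError(
            f"demand {total_instances} exceeds capacity {capacity}")
    plan = {}
    remaining = total_instances
    for region in regions:
        take = min(remaining, limit_per_region)
        if take > 0:
            plan[region] = take
        remaining -= take
    return plan


if __name__ == "__main__":
    regions = ["us-east-1", "us-west-2", "eu-west-1", "us-west-1"]
    for region, count in spread_demand(10000, regions).items():
        print(region, count)
```

With a limit of 3,000 per region, 10,000 instances need at least four regions: three full ones and one holding the remaining 1,000.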
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
• Optimization
Optimization
Data Management
Partition Data on S3
S3 Versioning/Lifecycle
Hadoop Run-time Params
Memory Tuning
Compress Map & Reduce Output
Combine Splits Input Format
Security
Roles
Security Groups
AOL VPC
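The memory-tuning and output-compression items map onto standard Hadoop 2 MapReduce properties. The Python sketch below renders them as `-D key=value` arguments for a job step. The deck does not give the values AOL used, so the memory sizes, the 0.8 heap fraction, and the choice of GzipCodec are assumptions. It does not cover the combined-splits input format, which depends on the job's input class:

```python
def hadoop_d_args(map_memory_mb, reduce_memory_mb, heap_fraction=0.8):
    """Return ['-D', 'key=value', ...] for memory and compression settings."""
    props = {
        # Container sizes and JVM heaps; heap is kept below the container size.
        "mapreduce.map.memory.mb": map_memory_mb,
        "mapreduce.map.java.opts": f"-Xmx{int(map_memory_mb * heap_fraction)}m",
        "mapreduce.reduce.memory.mb": reduce_memory_mb,
        "mapreduce.reduce.java.opts": f"-Xmx{int(reduce_memory_mb * heap_fraction)}m",
        # Compress intermediate map output and final job output.
        "mapreduce.map.output.compress": "true",
        "mapreduce.output.fileoutputformat.compress": "true",
        "mapreduce.output.fileoutputformat.compress.codec":
            "org.apache.hadoop.io.compress.GzipCodec",
    }
    args = []
    for key, value in props.items():
        args += ["-D", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical sizes; tune per instance type.
    print(" ".join(hadoop_d_args(1536, 3072)))
```

Generating the arguments from one function keeps the heap and container sizes consistent when the instance type changes.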
Score Card
Feature AWS
Pay for what you use ✔
Decouple Storage and Compute ✔
True Cloud Architecture ✔
Self Service Model ✔
Elastic & Scalable ✔
Global Infrastructure. BCDR. ✔
Quick & Easy Deployments ✔
Redshift External Tables on S3 ?
More languages for Lambda ?
AWS vs. In-House Cost
[Chart: Service Cost Comparison, AWS vs. In-House]
In-House: 4x / Month
AWS: 1x / Month
Source: AOL & AWS Billing Tool
** In-House cluster includes Storage, Power and Network cost.
AWS vs. In-House Cost
10/8/2015
Amazon Web Services
1/4th Cost of In-House Hadoop Infrastructure
1/4th Cost
Data Platforms. AOL Inc.
Restatement Use Case
• Restate historical data going back 6 months
10 Availability Zones
550 EMR Clusters
24,000 Spot EC2 Instances
[Chart: Core Nodes Demand - 06/01/2015, x-axis 1 to 24]
[Chart: Timing Comparison, In-House vs. AWS]
Best Practices & Suggestions
Tag All Resources
Infrastructure as Code
Command Line Interface
JSON as configuration files
IAM Roles and Policies
Use of Application ID
Enable CloudTrail
S3 Lifecycle Management
S3 Versioning
Separate Code/Data/Logs buckets
Keyless EMR Clusters
Hybrid Model
Enable Debugging
Create Multiple CLI Profiles
Multi-Factor Authentication
CloudWatch Billing Alarms
Spot EC2 Instances
SNS notifications for failures
Loosely coupled Apps
Scale Horizontally
Thank you


