What is Cloud Computing?
Using the Cloud to Crunch Your Data
Adrian Cockcroft – acockcroft@netflix.com
What is Capacity Planning?
- We care about CPU, memory, network and disk resources, and application response times
- We need to know how much of each resource we are using now, and will use in the future
- We need to know how much headroom we have to handle higher loads
- We want to understand how headroom varies, and how it relates to application response times and throughput
Capacity Planning Norms
- Capacity is expensive
- Capacity takes time to buy and provision
- Capacity only increases, can’t be shrunk easily
- Capacity comes in big chunks, paid up front
- Planning errors can cause big problems
- Systems are clearly defined assets
- Systems can be instrumented in detail
Capacity Planning in Clouds
- Capacity is expensive
- Capacity takes time to buy and provision
- Capacity only increases, can’t be shrunk easily
- Capacity comes in big chunks, paid up front
- Planning errors can cause big problems
- Systems are clearly defined assets
- Systems can be instrumented in detail
Capacity is expensive?
http://aws.amazon.com/s3/ & http://aws.amazon.com/ec2/
Storage (Amazon S3)
- $0.150 per GB – first 50 TB / month of storage used
- $0.120 per GB – storage used / month over 500 TB
Data Transfer (Amazon S3)
- $0.100 per GB – all data transfer in
- $0.170 per GB – first 10 TB / month data transfer out
- $0.100 per GB – data transfer out / month over 150 TB
Requests (Amazon S3 storage access is via HTTP)
- $0.01 per 1,000 PUT, COPY, POST, or LIST requests
- $0.01 per 10,000 GET and all other requests
- $0 per DELETE
CPU (Amazon EC2)
- Small (default) $0.085 per hour, Extra Large $0.68 per hour
Network (Amazon EC2)
- Inbound/outbound around $0.10 per GB
Capacity comes in big chunks, paid up front?
- No minimum price, monthly billing
Capacity takes time to buy and provision?
- “Amazon EC2 enables you to increase or decrease capacity within minutes, not hours or days. You can commission one, hundreds or even thousands of server instances simultaneously”
Capacity only increases, can’t be shrunk easily?
- Pay for what is actually used
Planning errors can cause big problems?
- Size only for what you need now
Systems are clearly defined assets?
- You are running in a “stateless” multi-tenanted virtual image that can die or be taken away and replaced at any time
- You don’t know exactly where it is
- You can choose to locate in “USA” or “Europe”
- You can specify zones that will not share components, to avoid common mode failures
Systems can be instrumented in detail?
- Need to use stateless monitoring tools
- Monitored nodes come and go by the hour
- Need to write the role name into the hostname
  - e.g. wwwprod002, not the EC2 default
  - all monitoring by role name
- Ganglia – automatic configuration
  - Multicast replicated monitoring state
  - No need to pre-define metrics and nodes
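A role-plus-index hostname like the wwwprod002 example above is trivial to generate when an instance boots; a minimal sketch (only wwwprod002 itself comes from the slide – the three-digit zero padding and the helper name are assumptions):

```java
public class RoleName {
    // Build a role-based hostname such as "wwwprod002" from a role name
    // and an instance index, so monitoring can aggregate by role.
    static String hostname(String role, int index) {
        return String.format("%s%03d", role, index); // assumed 3-digit zero padding
    }

    public static void main(String[] args) {
        System.out.println(hostname("wwwprod", 2)); // prints wwwprod002
    }
}
```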
Acquisition requires management buy-in?
- Anyone with a credit card and $10 is in business
- Data governance issues…
- Remember the 1980s when PCs first turned up?
  - Departmental budgets funded PCs
  - No standards, management or backups
  - Central IT departments could lose control
- Decentralized use of clouds will be driven by teams seeking competitive advantage and business agility – ultimately unstoppable…
December 9, 2009
OK, so what should we do?
The Umbrella Strategy
- Monitor network traffic to cloud vendor APIs
  - Catch unauthorized clouds as they form
  - Measure, trend and predict cloud activity
- Pick two cloud standards, set up corporate accounts
  - Better sharing of lessons as they are learned
  - Create a path of least resistance for users
  - Avoid vendor lock-in
  - Aggregate traffic to get bulk discounts
- Pressure the vendors to develop common standards, but don’t wait for them to arrive; the best APIs will be cloned eventually
- Sponsor a pathfinder project for each vendor
- Navigate a route through the clouds
- Don’t get caught unawares in the rain
Predicting the Weather
- Billing is variable, monthly in arrears
  - Try to predict how much you will be billed…
  - Not an issue to start with, but can’t be ignored
  - Amazon has a cost estimator tool that may help
- Central analysis and collection of cloud metrics
  - Cloud vendor usage metrics via API
  - Your system and application metrics via Ganglia
- Intra-cloud bandwidth is free, so analyze in the cloud!
  - Based on the standard analysis framework Hadoop
- Validation of vendor metrics and billing
- Charge-back for individual user applications
Use it to learn it…
- Focus on how you can use the cloud yourself to do large-scale log processing
- You can upload huge datasets to a cloud and crunch them with a large cluster of computers using Amazon Elastic MapReduce (EMR)
- Do it all from your web browser for a handful of dollars charged to a credit card
- Here’s how
Cloud Crunch – The Recipe
You will need:
- A computer connected to the Internet
- The Firefox browser
- A Firefox-specific browser extension
- A credit card and less than $1 to spend
- Big log files as ingredients
- Some very processor-intensive queries
Recipe
First we warm up our computer by setting up Firefox and connecting to the cloud. Then we upload our ingredients to be crunched, at about 20 minutes per gigabyte. Pick between one and twenty processors to crunch with; they are charged by the hour, and the cloud takes about 10 minutes to warm up. The query itself starts by mapping the ingredients so that the mixture separates, then the excess is boiled off to make a nice informative reduction.
Costs
- Firefox and extension – free
- Upload ingredients – 10 cents/gigabyte
- Small processors – 11.5 cents/hour each
- Download results – 17 cents/gigabyte
- Storage – 15 cents/gigabyte/month
- Service updates – 1 cent/1,000 calls
- Service requests – 1 cent/10,000 calls
Actual cost to run the two example programs as described in this presentation was 26 cents
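The rate card above makes bill estimation simple arithmetic. A hedged sketch of that estimate (the 0.2 GB upload, single instance-hour and 0.01 GB download are made-up example quantities, not the actual 26-cent run from the presentation):

```java
public class CrunchCost {
    // Rates in cents, taken from the slide above.
    static final double UPLOAD_PER_GB = 10.0;
    static final double INSTANCE_PER_HOUR = 11.5;
    static final double DOWNLOAD_PER_GB = 17.0;

    // Estimated cost in cents for one crunch run.
    static double costCents(double uploadGb, int instanceHours, double downloadGb) {
        return uploadGb * UPLOAD_PER_GB
             + instanceHours * INSTANCE_PER_HOUR
             + downloadGb * DOWNLOAD_PER_GB;
    }

    public static void main(String[] args) {
        // Hypothetical run: 0.2 GB uploaded, one small instance for one hour,
        // 0.01 GB of results downloaded.
        System.out.printf("%.2f cents%n", costCents(0.2, 1, 0.01)); // prints 13.67 cents
    }
}
```

Storage and per-request charges are omitted here; at the 1-cent-per-thousands rates above they only matter at much larger scale.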
Faster Results at the Same Cost!
- You may have trouble finding enough data and a complex enough query to keep the processors busy for an hour; in that case, use fewer processors
- Conversely, if you want quicker results you can use more and/or larger processors
- Up to 20 systems with 8 CPU cores = 160 cores is immediately available; oversize on request
Step by Step
- Walkthrough to get you started
- Run the Amazon Elastic MapReduce examples
- Get up and running before this presentation is over…
Step 1 – Get Firefox
http://www.mozilla.com/en-US/firefox/firefox.html
Step 2 – Get the S3Fox Extension
Next, select the Add-ons option from the Tools menu, select “Get Add-ons” and search for S3Fox.
Step 3 – Learn About AWS
Bring up http://aws.amazon.com/ to read about the services. Amazon S3 is short for Amazon Simple Storage Service, which is part of the Amazon Web Services product. We will use Amazon S3 to store data, and Amazon Elastic MapReduce (EMR) to process it. Underneath EMR there is an Amazon Elastic Compute Cloud (EC2) cluster, which is created automatically for you each time you use EMR.
What are S3, EC2 and EMR?
- Amazon Simple Storage Service lets you put data into the cloud, addressed using a URL. Access to it can be private or public.
- Amazon Elastic Compute Cloud lets you pick the size and number of computers in your cloud.
- Amazon Elastic MapReduce automatically builds a Hadoop cluster on top of EC2, feeds it data from S3, saves the results in S3, then removes the cluster and frees up the EC2 systems.
Step 4 – Sign Up For AWS
Go to the top right of the page and sign up at http://aws.amazon.com/
You can log in using the same account you use to buy books!
Step 5 – Sign Up For Amazon S3
Follow the link to http://aws.amazon.com/s3/
Check The S3 Rate Card
Step 6 – Sign Up For Amazon EMR
Go to http://aws.amazon.com/elasticmapreduce/
EC2 Signup
The EMR signup bundles the other needed services, such as EC2, into the sign-up process, and the rates for all of them are displayed. The EMR costs are in addition to the EC2 costs, so 1.5 cents/hour for EMR is added to the 10 cents/hour for EC2, making 11.5 cents/hour for each small instance running Linux with Hadoop.
EMR and EC2 Rates
(old data – it is cheaper now, with even bigger nodes)
Step 7 – How Big are EC2 Instances?
See http://aws.amazon.com/ec2/instance-types/
The compute power is specified in a standard unit called an EC2 Compute Unit (ECU). Amazon states: “One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon processor.”
Standard Instances
- Small Instance (default): 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform
- Large Instance: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform
- Extra Large Instance: 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform
Compute Intensive Instances
- High-CPU Medium Instance: 1.7 GB of memory, 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of instance storage, 32-bit platform
- High-CPU Extra Large Instance: 7 GB of memory, 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform
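Combining the standard instance sizes with the EC2 rates quoted earlier (Small $0.085/hour, Extra Large $0.68/hour) shows that the per-ECU price is flat across the standard sizes, so you pay for capacity rather than for box size. A small sketch of that arithmetic:

```java
public class EcuPrice {
    // cents per ECU-hour = hourly rate in cents / ECUs per instance
    static double centsPerEcuHour(double centsPerHour, int ecus) {
        return centsPerHour / ecus;
    }

    public static void main(String[] args) {
        System.out.println(centsPerEcuHour(8.5, 1));  // Small: 1 ECU at 8.5 cents/hour
        System.out.println(centsPerEcuHour(68.0, 8)); // Extra Large: 8 ECUs at 68 cents/hour
        // Both work out to 8.5 cents per ECU-hour.
    }
}
```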
Step 8 – Getting Started
http://docs.amazonwebservices.com/ElasticMapReduce/2009-03-31/GettingStartedGuide/
There is a Getting Started guide and developer documentation, including sample applications; we will work through two of those applications. There is also a very helpful FAQ page that is worth reading through.
Step 9 – What is EMR?
Step 10 – How To MapReduce?
- Code directly in Java
- Submit “streaming” command-line scripts
  - EMR bundles Perl, Python, Ruby and R
- Code MR sequences in Java using Cascading
- Process log files using the Cascading Multitool
- Write dataflow scripts with Pig
- Write SQL queries using Hive
Step 11 – Set Up Access Keys
Getting Started Guide:
http://docs.amazonwebservices.com/ElasticMapReduce/2009-03-31/GettingStartedGuide/gsFirstSteps.html
The URL to visit to get your key is:
http://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key
Enter Access Keys in S3Fox
Open S3 Firefox Organizer, select the Manage Accounts button at the top left, and enter your access keys into the popup.
Step 12 – Create An Output Folder
Setup Folder Permissions
Step 13 – Run Your First Job
See the Getting Started Guide at:
http://docs.amazonwebservices.com/ElasticMapReduce/2009-03-31/GettingStartedGuide/gsConsoleRunJob.html
Log in to the EMR console at https://console.aws.amazon.com/elasticmapreduce/home, create a new job flow called “Word crunch”, and select the sample application word count.
Set the Output S3 Bucket
Pick the Instance Count
One small instance (i.e. one computer) will do…
Cost will be 11.5 cents for 1 hour (the minimum)
Start the Job Flow
The job flow takes a few minutes to get started, then completes in about 5 minutes of run time.
Step 14 – View The Results
In S3Fox, click on the refresh icon, then double-click on the crunchie folder. Keep clicking until the output file is visible.
Save To Your PC
- In S3Fox, click on the left arrow
- Save to your PC
- Open in TextEdit
- See that the word “a” occurred 14716 times
- Boring… so try a more interesting demo!
Step 15 – Crunch Some Log Files
- Create a new output bucket with a new name
- Start a new job flow using the CloudFront demo
- This uses the Cascading Multitool
Step 16 – How Much Did That Cost?
Wrap Up
- Easy, cheap, anyone can use it
- Now let’s look at how to write code…
Log & Data Analysis using Hadoop
Based on slides by Shashi Madappa – smadappa@netflix.com
Agenda
1. Hadoop
   - Background
   - Ecosystem
   - HDFS & Map/Reduce
   - Example
2. Log & Data Analysis @ Netflix
   - Problem / Data Volume
   - Current & Future Projects
3. Hadoop on Amazon EC2
   - Storage options
   - Deployment options
Hadoop
- Apache open-source software for reliable, scalable, distributed computing
- Originally a sub-project of the Lucene search engine
- Used to analyze and transform large data sets
- Supports partial failure, recoverability, consistency & scalability
Hadoop Sub-Projects
- Core
  - HDFS: a distributed file system that provides high-throughput access to data
  - Map/Reduce: a framework for processing large data sets
- HBase: a distributed database that supports structured data storage for large tables
- Hive: an infrastructure for ad hoc querying (SQL-like)
- Pig: a data-flow language and execution framework
- Avro: a data serialization system that provides dynamic integration with scripting languages; similar to Thrift & Protocol Buffers
- Cascading: executes data processing workflows
- Chukwa, Mahout, ZooKeeper, and many more
Hadoop Distributed File System (HDFS)
Features
- Cannot be mounted as a “file system”; access is via the command line or the Java API
- Prefers large files (multiple terabytes) to many small files
- Files are write once, read many (append coming soon)
- Users, groups and permissions
- Name and space quotas
- Block size and replication factor are configurable per file
Commands: hadoop dfs -ls, -du, -cp, -mv, -rm, -rmr
Uploading files
  hadoop dfs -put foo mydata/foo
  cat ReallyBigFile | hadoop dfs -put - mydata/ReallyBigFile
Downloading files
  hadoop dfs -get mydata/foo foo
  hadoop dfs -tail [-f] mydata/foo

Map/Reduce
- Map/Reduce is a programming model for efficient distributed computing
- Data processing of large datasets
- Massively parallel (hundreds or thousands of CPUs)
- Easy to use: programmers don’t worry about socket(), etc.
- It works like a Unix pipeline:
    cat *  |  grep  |  sort            |  uniq -c  |  cat > output
    Input  |  Map   |  Shuffle & Sort  |  Reduce   |  Output
- Efficiency comes from streaming through data, reducing seeks
- A good fit for a lot of applications: log processing, index building, data mining and machine learning
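The pipeline analogy above can be played out in a few lines of plain Java with no Hadoop at all: map each line to (word, 1) pairs, shuffle by grouping on the key, then reduce by summing. This is only a single-process sketch of the programming model, not Hadoop itself:

```java
import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    // Map: split each input line into (word, 1) pairs.
    // Shuffle & sort: group the pairs by word (groupingBy into a TreeMap).
    // Reduce: sum the 1s for each word.
    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))    // map
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w,                 // shuffle & sort
                        TreeMap::new,
                        Collectors.summingInt(w -> 1)));               // reduce
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Welcome to Netflix",
                "This is a great place to work");
        System.out.println(wordCount(input)); // "to" appears twice, every other word once
    }
}
```

The sample input here is the same two lines used in the “Running the Example” slide later, so the counts can be checked against that slide.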
Map/Reduce – Simple Dataflow
Detailed Dataflow
Input & Output Formats
The application also chooses input and output formats, which define how the persistent data is read and written. These are interfaces and can be defined by the application.
- InputFormat
  - Splits the input to determine the input to each map task
  - Defines a RecordReader that reads key, value pairs that are passed to the map task
- OutputFormat
  - Given the key, value pairs and a filename, writes the reduce task output to persistent store
Map/Reduce Processes
- Launching application
  - User application code
  - Submits a specific kind of Map/Reduce job
- JobTracker
  - Handles all jobs
  - Makes all scheduling decisions
- TaskTracker
  - Manager for all tasks on a given node
- Task
  - Runs an individual map or reduce fragment for a given job
  - Forks from the TaskTracker
Word Count Example
- Mapper
  - Input: value: lines of text of input
  - Output: key: word, value: 1
- Reducer
  - Input: key: word, value: set of counts
  - Output: key: word, value: sum
- Launching program
  - Defines the job
  - Submits the job to the cluster
Map Phase

public static class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        // Tokenize on whitespace (the default) so each word is counted separately
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
Reduce Phase

public static class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Running the Job

public class WordCount {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        // the keys are words (strings)
        conf.setOutputKeyClass(Text.class);
        // the values are counts (ints)
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);
        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(new Path(args[1]));
        JobClient.runJob(conf);
    }
}
Running the Example
Input:
  Welcome to Netflix
  This is a great place to work
Output:
  Netflix   1
  This      1
  Welcome   1
  a         1
  great     1
  is        1
  place     1
  to        2
  work      1
WWW Access Log Structure
Analysis using the access log
- Top URLs, users, IPs, sizes
- Time-based analysis
  - Session duration
  - Visits per session
- Identify attacks (DoS, invalid plug-in, etc.)
- Study visit patterns for resource planning
  - Preload relevant data
- Impact of a WWW call on middle-tier services
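The “top URLs” analysis above is structurally the same as word count: the map phase extracts a URL from each log line instead of splitting into words. A minimal single-process sketch, assuming a simplified common-log-style line format (the sample lines and field layout are illustrative, not Netflix’s actual log format):

```java
import java.util.*;
import java.util.stream.*;

public class TopUrls {
    // Tally request counts per URL. Assumes the request
    // ("GET /path HTTP/1.1") is the first quoted field in each line.
    static Map<String, Long> countUrls(List<String> logLines) {
        return logLines.stream()
                .map(line -> line.split("\"")[1].split(" ")[1]) // extract the URL from the quoted request
                .collect(Collectors.groupingBy(url -> url, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "10.0.0.1 - - [09/Dec/2009:10:00:00] \"GET /movies HTTP/1.1\" 200 1234",
                "10.0.0.2 - - [09/Dec/2009:10:00:01] \"GET /queue HTTP/1.1\" 200 2345",
                "10.0.0.1 - - [09/Dec/2009:10:00:02] \"GET /movies HTTP/1.1\" 200 1234");
        System.out.println(countUrls(log)); // /movies seen twice, /queue once
    }
}
```

In a real Hadoop job the extraction would live in the mapper and the counting in the reducer, exactly as in the word count example; only the map function changes.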
Why run Hadoop in the cloud?
- “Infinite” resources
- Hadoop scales linearly
- Elasticity
  - Run a large cluster for a short time
  - Grow or shrink a cluster on demand
Common sense and safety
- Amazon security is good
  - But you are a few clicks away from making a data set public by mistake
- Common-sense precautions before you try it yourself…
  - Get permission to move data to the cloud
  - Scrub data to remove sensitive information
  - System performance monitoring logs are a good choice for analysis
 
Ad

Recently uploaded (20)

fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Connect and Protect: Networks and Network Security
Connect and Protect: Networks and Network SecurityConnect and Protect: Networks and Network Security
Connect and Protect: Networks and Network Security
VICTOR MAESTRE RAMIREZ
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Vibe Coding_ Develop a web application using AI (1).pdf
Vibe Coding_ Develop a web application using AI (1).pdfVibe Coding_ Develop a web application using AI (1).pdf
Vibe Coding_ Develop a web application using AI (1).pdf
Baiju Muthukadan
 
Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...
BookNet Canada
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptxWebinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
MSP360
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 
TrsLabs Consultants - DeFi, WEb3, Token Listing
TrsLabs Consultants - DeFi, WEb3, Token ListingTrsLabs Consultants - DeFi, WEb3, Token Listing
TrsLabs Consultants - DeFi, WEb3, Token Listing
Trs Labs
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
AI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of DocumentsAI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of Documents
UiPathCommunity
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
AI You Can Trust: The Critical Role of Governance and Quality.pdf
AI You Can Trust: The Critical Role of Governance and Quality.pdfAI You Can Trust: The Critical Role of Governance and Quality.pdf
AI You Can Trust: The Critical Role of Governance and Quality.pdf
Precisely
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
TrsLabs - AI Agents for All - Chatbots to Multi-Agents Systems
TrsLabs - AI Agents for All - Chatbots to Multi-Agents SystemsTrsLabs - AI Agents for All - Chatbots to Multi-Agents Systems
TrsLabs - AI Agents for All - Chatbots to Multi-Agents Systems
Trs Labs
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
BookNet Canada
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Connect and Protect: Networks and Network Security
Connect and Protect: Networks and Network SecurityConnect and Protect: Networks and Network Security
Connect and Protect: Networks and Network Security
VICTOR MAESTRE RAMIREZ
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Vibe Coding_ Develop a web application using AI (1).pdf
Vibe Coding_ Develop a web application using AI (1).pdfVibe Coding_ Develop a web application using AI (1).pdf
Vibe Coding_ Develop a web application using AI (1).pdf
Baiju Muthukadan
 
Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...
BookNet Canada
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptxWebinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
MSP360
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 
TrsLabs Consultants - DeFi, WEb3, Token Listing
TrsLabs Consultants - DeFi, WEb3, Token ListingTrsLabs Consultants - DeFi, WEb3, Token Listing
TrsLabs Consultants - DeFi, WEb3, Token Listing
Trs Labs
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
AI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of DocumentsAI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of Documents
UiPathCommunity
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
AI You Can Trust: The Critical Role of Governance and Quality.pdf
AI You Can Trust: The Critical Role of Governance and Quality.pdfAI You Can Trust: The Critical Role of Governance and Quality.pdf
AI You Can Trust: The Critical Role of Governance and Quality.pdf
Precisely
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
TrsLabs - AI Agents for All - Chatbots to Multi-Agents Systems
TrsLabs - AI Agents for All - Chatbots to Multi-Agents SystemsTrsLabs - AI Agents for All - Chatbots to Multi-Agents Systems
TrsLabs - AI Agents for All - Chatbots to Multi-Agents Systems
Trs Labs
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
BookNet Canada
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 

Crunch Your Data in the Cloud with Elastic Map Reduce - Amazon EMR Hadoop

  • 1. What is Cloud Computing?
  • 2. Using the Cloud to Crunch Your Data. Adrian Cockcroft – [email protected]
  • 3. What is Capacity Planning? We care about CPU, memory, network and disk resources, and application response times. We need to know how much of each resource we are using now, and will use in the future. We need to know how much headroom we have to handle higher loads. We want to understand how headroom varies, and how it relates to application response times and throughput.
  • 4. Capacity Planning Norms. Capacity is expensive. Capacity takes time to buy and provision. Capacity only increases; it can't be shrunk easily. Capacity comes in big chunks, paid up front. Planning errors can cause big problems. Systems are clearly defined assets. Systems can be instrumented in detail.
  • 5. Capacity Planning in Clouds. The same norms, revisited: capacity is expensive; capacity takes time to buy and provision; capacity only increases and can't be shrunk easily; capacity comes in big chunks, paid up front; planning errors can cause big problems; systems are clearly defined assets; systems can be instrumented in detail. The following slides revisit each of these assumptions in the cloud.
  • 6. Capacity is expensive? See http://aws.amazon.com/s3/ and http://aws.amazon.com/ec2/. Storage (Amazon S3): $0.150 per GB – first 50 TB/month of storage used; $0.120 per GB – storage used/month over 500 TB. Data Transfer (Amazon S3): $0.100 per GB – all data transfer in; $0.170 per GB – first 10 TB/month data transfer out; $0.100 per GB – data transfer out/month over 150 TB. Requests (Amazon S3; storage access is via HTTP): $0.01 per 1,000 PUT, COPY, POST, or LIST requests; $0.01 per 10,000 GET and all other requests; $0 per DELETE. CPU (Amazon EC2): Small (default) $0.085 per hour, Extra Large $0.68 per hour. Network (Amazon EC2): inbound/outbound around $0.10 per GB.
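To put the rate card in perspective, here is a back-of-the-envelope sketch using only the first-tier (historical, 2009-era) rates quoted above; the tier breakpoints are deliberately ignored, so treat it as illustrative arithmetic, not a billing calculator:

```python
# Back-of-the-envelope cost sketch using the first-tier 2009 rates above.
# Tier breakpoints are ignored for simplicity; illustrative only.

S3_STORAGE_PER_GB_MONTH = 0.150   # first 50 TB/month
S3_TRANSFER_IN_PER_GB = 0.100     # all data transfer in
EC2_SMALL_PER_HOUR = 0.085        # Small instance, on demand

def monthly_estimate(stored_gb, uploaded_gb, small_instance_hours):
    """Estimated monthly bill for storage, inbound transfer and compute."""
    return (stored_gb * S3_STORAGE_PER_GB_MONTH
            + uploaded_gb * S3_TRANSFER_IN_PER_GB
            + small_instance_hours * EC2_SMALL_PER_HOUR)

# e.g. 100 GB stored, 100 GB uploaded, one Small instance for 24 hours:
# 100*0.150 + 100*0.100 + 24*0.085 = $27.04
```

Even a fairly substantial workload by 2009 standards comes out to tens of dollars a month, which is the point of this slide.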
  • 7. Capacity comes in big chunks, paid up front? Capacity takes time to buy and provision? No minimum price, monthly billing: "Amazon EC2 enables you to increase or decrease capacity within minutes, not hours or days. You can commission one, hundreds or even thousands of server instances simultaneously." Capacity only increases and can't be shrunk easily? Pay for what is actually used. Planning errors can cause big problems? Size only for what you need now.
  • 8. Systems are clearly defined assets? You are running in a "stateless" multi-tenanted virtual image that can die or be taken away and replaced at any time. You don't know exactly where it is: you can choose to locate in "USA" or "Europe", and you can specify zones that will not share components to avoid common-mode failures.
  • 9. Systems can be instrumented in detail? You need stateless monitoring tools, since monitored nodes come and go by the hour. Write the role name into the hostname (e.g. wwwprod002, not the EC2 default) and do all monitoring by role name. Ganglia works well: automatic configuration, multicast replicated monitoring state, and no need to pre-define metrics and nodes.
  • 10. Acquisition requires management buy-in? Anyone with a credit card and $10 is in business, which raises data governance issues… Remember the 1980s when PCs first turned up? Departmental budgets funded PCs; there were no standards, management or backups; central IT departments could lose control. Decentralized use of clouds will be driven by teams seeking competitive advantage and business agility – ultimately unstoppable…
  • 11. December 9, 2009. OK, so what should we do?
  • 12. The Umbrella Strategy. Monitor network traffic to cloud vendor APIs: catch unauthorized clouds as they form; measure, trend and predict cloud activity. Pick two cloud standards and set up corporate accounts: better sharing of lessons as they are learned, a path of least resistance for users, no vendor lock-in, and aggregated traffic to get bulk discounts. Pressure the vendors to develop common standards, but don't wait for them to arrive; the best APIs will be cloned eventually. Sponsor a pathfinder project for each vendor. Navigate a route through the clouds – don't get caught unawares in the rain.
  • 13. Predicting the Weather. Billing is variable and monthly in arrears, so try to predict how much you will be billed… Not an issue to start with, but it can't be ignored; Amazon has a cost estimator tool that may help. Centralize analysis and collection of cloud metrics: cloud vendor usage metrics via API, and your own system and application metrics via Ganglia. Intra-cloud bandwidth is free, so analyze in the cloud, based on the standard analysis framework, Hadoop. This supports validation of vendor metrics and billing, and charge-back for individual user applications.
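One simple way to act on "try to predict how much you will be billed" is a trend line over past monthly bills. This is an illustrative sketch, not a method from the talk; real cloud bills are burstier than a straight line, so treat it only as a starting point:

```python
# Hypothetical billing-trend sketch: least-squares line through past
# monthly bills, evaluated one month ahead. Needs at least two months.

def predict_next_bill(monthly_bills):
    """Fit bill = a + b*month by least squares; return next month's value."""
    n = len(monthly_bills)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(monthly_bills) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(xs, monthly_bills)) / var
    # Evaluate the fitted line at month index n (one month out).
    return mean_y + slope * (n - mean_x)
```

For example, bills of $10, $20 and $30 over three months extrapolate to $40 next month.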
  • 14. Use it to learn it… Focus on how you can use the cloud yourself to do large-scale log processing. You can upload huge datasets to a cloud and crunch them with a large cluster of computers using Amazon Elastic Map Reduce (EMR), all from your web browser, for a handful of dollars charged to a credit card. Here's how.
  • 15. Cloud Crunch – The Recipe. You will need: a computer connected to the Internet; the Firefox browser; a Firefox-specific browser extension; a credit card and less than $1 to spend; big log files as ingredients; and some very processor-intensive queries.
  • 16. Recipe. First we will warm up our computer by setting up Firefox and connecting to the cloud. Then we will upload our ingredients to be crunched, at about 20 minutes per gigabyte. You should pick between one and twenty processors to crunch with; they are charged by the hour, and the cloud takes about 10 minutes to warm up. The query itself starts by mapping the ingredients so that the mixture separates, then the excess is boiled off to make a nice informative reduction.
  • 17. Costs. Firefox and extension: free. Upload ingredients: 10 cents/gigabyte. Small processors: 11.5 cents/hour each. Download results: 17 cents/gigabyte. Storage: 15 cents/gigabyte/month. Service updates: 1 cent per 1,000 calls. Service requests: 1 cent per 10,000 calls. The actual cost to run the two example programs as described in this presentation was 26 cents.
  • 18. Faster Results at the Same Cost! You may have trouble finding enough data and a complex enough query to keep the processors busy for an hour; in that case you can use fewer processors. Conversely, if you want quicker results you can use more and/or larger processors. Up to 20 systems with 8 CPU cores each (160 cores) are immediately available; larger allocations on request.
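The economics behind this slide follow from strictly per-instance-hour billing: N instances for H hours cost the same as one instance for N×H hours (ignoring the one-hour minimum rounding). A sketch using the 11.5 cents/hour small-instance rate from the previous slide:

```python
# Per-hour billing means cost depends only on total instance-hours.
EMR_SMALL_RATE = 0.115  # $/hour per small instance (EC2 + EMR, slide 17)

def job_cost(instances, hours):
    """Bill for a cluster of `instances` machines running for `hours`."""
    return instances * hours * EMR_SMALL_RATE

# 20 instances x 1 hour and 1 instance x 20 hours both cost $2.30,
# but the wide cluster finishes roughly 20x sooner.
```

This is why the slide says to scale out for speed: the bill is (to first order) invariant, so the only real question is how parallel your job can usefully be.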
  • 19. Step by Step. A walkthrough to get you started: run the Amazon Elastic MapReduce examples and get up and running before this presentation is over…
  • 20. Step 1 – Get Firefox: http://www.mozilla.com/en-US/firefox/firefox.html
  • 21. Step 2 – Get the S3Fox Extension. Select the Add-ons option from the Tools menu, select "Get Add-ons" and search for S3Fox.
  • 22. Step 3 – Learn About AWS. Bring up http://aws.amazon.com/ to read about the services. Amazon S3 is short for Amazon Simple Storage Service, part of the Amazon Web Services product. We will be using Amazon S3 to store data, and Amazon Elastic Map Reduce (EMR) to process it. Underneath EMR there is an Amazon Elastic Compute Cloud (EC2), which is created automatically for you each time you use EMR.
  • 23. What are S3, EC2 and EMR? Amazon Simple Storage Service lets you put data into the cloud, addressed using a URL; access to it can be private or public. Amazon Elastic Compute Cloud lets you pick the size and number of computers in your cloud. Amazon Elastic Map Reduce automatically builds a Hadoop cluster on top of EC2, feeds it data from S3, saves the results in S3, then removes the cluster and frees up the EC2 systems.
  • 24. Step 4 – Sign Up For AWS. Go to the top right of the page at http://aws.amazon.com/ and sign up. You can log in using the same account you use to buy books!
  • 25. Step 5 – Sign Up For Amazon S3. Follow the link to http://aws.amazon.com/s3/
  • 26. Check The S3 Rate Card
  • 27. Step 6 – Sign Up For Amazon EMR. Go to http://aws.amazon.com/elasticmapreduce/
  • 28. EC2 Signup. The EMR signup combines the other needed services, such as EC2, in the sign-up process, and the rates for all are displayed. The EMR costs are in addition to the EC2 costs, so 1.5 cents/hour for EMR is added to the 10 cents/hour for EC2, making 11.5 cents/hour for each small instance running Linux with Hadoop.
  • 29. EMR and EC2 Rates (old data – it is cheaper now, with even bigger nodes)
  • 30. Step 7 – How Big are EC2 Instances? See http://aws.amazon.com/ec2/instance-types/. The compute power is specified in a standard unit called an EC2 Compute Unit (ECU). Amazon states: "One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor."
  • 31. Standard Instances. Small Instance (default): 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 ECU), 160 GB of instance storage, 32-bit platform. Large Instance: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 ECUs each), 850 GB of instance storage, 64-bit platform. Extra Large Instance: 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 ECUs each), 1690 GB of instance storage, 64-bit platform.
  • 32. Compute-Intensive Instances. High-CPU Medium Instance: 1.7 GB of memory, 5 EC2 Compute Units (2 virtual cores with 2.5 ECUs each), 350 GB of instance storage, 32-bit platform. High-CPU Extra Large Instance: 7 GB of memory, 20 EC2 Compute Units (8 virtual cores with 2.5 ECUs each), 1690 GB of instance storage, 64-bit platform.
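A notable property of the standard-instance rate card: price scales linearly with ECUs, so compute-per-dollar is the same for Small and Extra Large. A quick check, using only the Small and Extra Large rates quoted on the earlier pricing slide (illustrative arithmetic, not an AWS API):

```python
# ECUs per dollar for two standard instance types, using the rates
# quoted earlier in the deck ($0.085/hr Small, $0.68/hr Extra Large).
INSTANCES = {
    "small": {"ecus": 1, "rate": 0.085},
    "xlarge": {"ecus": 8, "rate": 0.68},
}

def ecu_per_dollar(name):
    spec = INSTANCES[name]
    return spec["ecus"] / spec["rate"]

# Both work out to about 11.76 ECU per dollar: you pay for compute,
# and pick an instance shape (memory, cores, storage) around it.
```

So for the standard sizes the choice is about shape and parallelism, not price efficiency; the High-CPU types trade memory for a better compute-per-dollar ratio.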
  • 33. Step 8 – Getting Started: http://docs.amazonwebservices.com/ElasticMapReduce/2009-03-31/GettingStartedGuide/. There is a Getting Started guide and developer documentation including sample applications; we will be working through two of those applications. There is also a very helpful FAQ page that is worth reading through.
  • 34. Step 9 – What is EMR?
  • 35. Step 10 – How To MapReduce? Code directly in Java. Submit "streaming" command-line scripts (EMR bundles Perl, Python, Ruby and R). Code MR sequences in Java using Cascading. Process log files using the Cascading Multitool. Write dataflow scripts with Pig. Write SQL queries using Hive.
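For the "streaming" option, the mapper and reducer are ordinary scripts that read stdin and write tab-separated key/value lines. A minimal word-count pair in Python (the function names and local-pipeline usage are illustrative; on EMR these would be two separate scripts passed to the streaming job):

```python
from itertools import groupby

def mapper(lines):
    # Emit one "word<TAB>1" line per word, as Hadoop Streaming expects.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Hadoop sorts mapper output by key before the reduce phase;
    # here we just sum the counts within each key group.
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation of: cat input | mapper | sort | reducer
# list(reducer(sorted(mapper(["to be or not to be"]))))
```

The `sorted()` call stands in for Hadoop's shuffle-and-sort phase, which is exactly the Unix-pipeline analogy used later in the deck.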
  • 36. Step 11 – Set Up Access Keys. See the Getting Started Guide: http://docs.amazonwebservices.com/ElasticMapReduce/2009-03-31/GettingStartedGuide/gsFirstSteps.html. The URL to visit to get your key is: http://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key
  • 37. Enter Access Keys in S3Fox. Open S3 Firefox Organizer, select the Manage Accounts button at the top left, and enter your access keys into the popup.
  • 38. Step 12 – Create an Output Folder
  • 40. Step 13 – Run Your First Job. See the Getting Started Guide at http://docs.amazonwebservices.com/ElasticMapReduce/2009-03-31/GettingStartedGuide/gsConsoleRunJob.html. Log in to the EMR console at https://console.aws.amazon.com/elasticmapreduce/home, create a new job flow called "Word crunch", and select the sample application word count.
  • 41. Set the Output S3 Bucket
  • 42. Pick the Instance Count. One small instance (i.e. one computer) will do… The cost will be 11.5 cents for 1 hour (the minimum).
  • 43. Start the Job Flow. The job flow takes a few minutes to get started, then completes in about 5 minutes of run time.
  • 44. Step 14 – View The Results. In S3Fox, click on the refresh icon, then double-click on the crunchie folder. Keep clicking until the output file is visible.
  • 45. Save To Your PC. In S3Fox, click on the left arrow and save to your PC, then open the file in TextEdit. See that the word "a" occurred 14716 times. Boring… so try a more interesting demo!
  • 46. Step 15 – Crunch Some Log Files. Create a new output bucket with a new name, then start a new job flow using the CloudFront demo. This uses the Cascading Multitool.
  • 47. Step 16 – How Much Did That Cost?
  • 48. Wrap Up. Easy, cheap, anyone can use it. Now let's look at how to write code…
  • 49. Log & Data Analysis using Hadoop. Based on slides by [email protected]
  • 50. Agenda. 1. Hadoop: background, ecosystem, HDFS & Map/Reduce, example. 2. Log & data analysis at Netflix: problem and data volume, current and future projects. 3. Hadoop on Amazon EC2: storage options, deployment options.
  • 51. Hadoop. Apache open-source software for reliable, scalable, distributed computing. Originally a sub-project of the Lucene search engine. Used to analyze and transform large data sets. Supports partial failure, recoverability, consistency and scalability.
  • 52. Hadoop Sub-Projects. Core: HDFS, a distributed file system that provides high-throughput access to data, and Map/Reduce, a framework for processing large data sets. HBase: a distributed database that supports structured data storage for large tables. Hive: an infrastructure for ad hoc (SQL-like) querying. Pig: a data-flow language and execution framework. Avro: a data serialization system that provides dynamic integration with scripting languages, similar to Thrift and Protocol Buffers. Cascading: executing data-processing workflows. Chukwa, Mahout, ZooKeeper, and many more.
  • 53. Hadoop Distributed File System (HDFS). Features: cannot be mounted as a "file system"; access is via the command line or a Java API; prefers large files (multiple terabytes) to many small files; files are write once, read many (append coming soon); users, groups and permissions; name and space quotas; blocksize and replication factor are configurable per file. Commands: hadoop dfs -ls, -du, -cp, -mv, -rm, -rmr. Uploading files: hadoop dfs -put foo mydata/foo
  • 54. cat ReallyBigFile | hadoop dfs -put - mydata/ReallyBigFile. Downloading files: hadoop dfs -get mydata/foo foo
  • 55. hadoop dfs -tail [-f] mydata/foo. Map/Reduce. Map/Reduce is a programming model for efficient distributed computing: data processing of large datasets, massively parallel (hundreds or thousands of CPUs), and easy to use – programmers don't worry about socket(), etc. It works like a Unix pipeline: cat * | grep | sort | uniq -c | cat > output, i.e. Input | Map | Shuffle & Sort | Reduce | Output. Efficiency comes from streaming through data, reducing seeks. A good fit for a lot of applications: log processing, index building, data mining and machine learning.
  • 58. Input and Output Formats. The application also chooses input and output formats, which define how the persistent data is read and written; these are interfaces and can be defined by the application. InputFormat: splits the input to determine the input to each map task, and defines a RecordReader that reads the key, value pairs that are passed to the map task. OutputFormat: given the key, value pairs and a filename, writes the reduce task output to persistent store.
  • 59. Map/Reduce Processes. Launching application: user application code that submits a specific kind of Map/Reduce job. JobTracker: handles all jobs and makes all scheduling decisions. TaskTracker: manager for all tasks on a given node. Task: runs an individual map or reduce fragment for a given job; forked from the TaskTracker.
  • 60. Word Count Example. Mapper input: value = lines of text; output: key = word, value = 1. Reducer input: key = word, value = set of counts; output: key = word, value = sum. The launching program defines the job and submits it to the cluster.
  • 61. Map Phase
    public static class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line, ",");
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }
  • 62. Reduce Phase
    public static class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }
  • 63. Running the Job
    public class WordCount {
      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        // the keys are words (strings)
        conf.setOutputKeyClass(Text.class);
        // the values are counts (ints)
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);
        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(new Path(args[1]));
        JobClient.runJob(conf);
      }
    }
  • 64. Running the Example. Input: "Welcome to Netflix" / "This is a great place to work". Output: Netflix 1, This 1, Welcome 1, a 1, great 1, is 1, place 1, to 2, work 1.
  • 65. WWW Access Log Structure
  • 66. Analysis Using the Access Log. Top URLs, users, IPs and sizes. Time-based analysis: session duration, visits per session. Identify attacks (DoS, invalid plug-in, etc). Study visit patterns for resource planning, and preload relevant data. Impact of WWW calls on middle-tier services.
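The "Top URLs" item above is a natural first log-crunching job. As a sketch, this is what the core extraction step looks like over common-format access log lines; the regex and the sample line layout are assumptions about a generic web log, not Netflix's actual schema:

```python
import re
from collections import Counter

# Pull the request path out of a common/combined-format log line,
# e.g.: 10.0.0.1 - - [09/Dec/2009:10:00:00] "GET /home HTTP/1.1" 200 512
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+)')

def top_urls(log_lines, n=10):
    """Count request paths and return the n most frequent (url, count) pairs."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(n)
```

In a real Hadoop job this counting would be split into a map phase (emit path, 1) and a reduce phase (sum per path), exactly as in the word-count example; the single-machine version just shows the extraction logic.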
  • 67. Why run Hadoop in the cloud? "Infinite" resources, and Hadoop scales linearly. Elasticity: run a large cluster for a short time, and grow or shrink a cluster on demand.
  • 68. Common Sense and Safety. Amazon security is good, but you are a few clicks away from making a data set public by mistake. Common-sense precautions before you try it yourself: get permission to move data to the cloud, and scrub data to remove sensitive information. System performance monitoring logs are a good choice for analysis.