CLOUDFAIL
SCALING TO INFINITY – BUT NOT BEYOND
Kunal Johar
MARCH 14, 2013
π Day
What would you do?
• You take your senior design project to the next level
• You have some traction – 10-15 people a week using it
• A game-changing opportunity hits you in the face
• You need to scale to tens of thousands of users per week
Act as If
• Scaling is no big deal, right?
• Amazon’s Elastic Cloud; Rackspace’s Infinite Capacity
• 50,000 is a small number even in O(N^2)
• I’m sure I can figure it out
“We are counting on you”
• Our organization depends on this software for our annual operating budget
• This year was a total disaster. Multi-week outages.
• We need you to tell us that this will work, that the system won’t go down, no matter how much traffic we send to it
No Problem
• “The old vendor was amateur hour”
• We’ll distribute the load across multiple servers
• We’ll load test
• We’ll scale up
• DON’T WORRY
MAY 20, 2013
Paperwork Signed – Now the Challenge Begins
Our Software Does it all (soon)
• It was a brutal summer
• We had 12 weeks to learn, architect, and build what ended up being 1,800 hours’ worth of features
• The margin for error was zero
• We also had to make sure our system would scale to meet the super-surge of traffic in January
Full Team Buy-In
• The stakes were known to everyone.
• If we succeeded, we’d pivot ourselves to the top of the market.
• If we failed, half the team would be out of work
• Our client called failure “Mutually Assured Destruction”
SEPTEMBER 2, 2013
Lots of overtime, heat, stress, anxiety. But we did it.
Memo to Developers
Load Test or Beta Test?
• From the September 1 launch date until today, we have been hit with new feature requests
• “Oh! I forgot about that – but it’s really important”
• How do you balance engineering priorities vs. feature priorities?
How to Construct a Load Test
• Write custom scripts that simulate real users using your app
  • Selenium WebDriver + Sauce Labs
  • BrowserMob (Neustar)
  • Load Impact
• Write a custom handler that simulates the user payload
  • Loader.io
Our Loader.io Script Payload
• POST 100 KB of data
• Simulate Save to Database
• GET 100 KB of data from Database
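The payload above can be approximated locally with nothing but the Python standard library. This is not the Loader.io script from the talk, only a minimal sketch under stated assumptions: `PayloadHandler`, the in-memory `STORE` dict standing in for the database, and `run_load_test` are hypothetical names, and the server runs on localhost rather than on real hosting.

```python
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

PAYLOAD = b"x" * 100 * 1024   # 100 KB, matching the slide
STORE = {}                    # hypothetical stand-in for the database
STORE_LOCK = threading.Lock()
# Ignore any proxy settings in the environment so localhost calls stay local.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class PayloadHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        with STORE_LOCK:
            STORE[self.path] = body   # "simulate save to database"
        self.send_response(204)
        self.end_headers()

    def do_GET(self):
        with STORE_LOCK:
            body = STORE.get(self.path)
        if body is None:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def one_user(base_url, user_id):
    """One simulated user: POST 100 KB, then GET it back."""
    url = f"{base_url}/user/{user_id}"
    request = urllib.request.Request(url, data=PAYLOAD, method="POST")
    with OPENER.open(request, timeout=10) as response:
        response.read()
    with OPENER.open(url, timeout=10) as response:
        return len(response.read())


def run_load_test(users=20, concurrency=5):
    """Return (bytes fetched per user, simulated users per second)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), PayloadHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base_url = f"http://127.0.0.1:{server.server_address[1]}"
    started = time.perf_counter()
    try:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            sizes = list(pool.map(lambda i: one_user(base_url, i), range(users)))
    finally:
        server.shutdown()
        server.server_close()
    return sizes, users / (time.perf_counter() - started)
```

Calling `run_load_test(users=200, concurrency=50)` gives a users-per-second figure for this one scenario only. Every simulated user performs the same single save-and-fetch cycle, so the number says nothing about heavier operations.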
The Actual Load Test
300+ Users Per Second!
• Whoo hoo!
• 300 users per second must mean what? Thousands of users per minute!
• I report a very successful load test to the client and leave the rest to wishful thinking
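The arithmetic behind that optimism is worth writing down. The sketch below contrasts the naive conversion with one that weights users by how expensive their actions are. The function names and the example mix are assumptions for illustration, not measurements from this project.

```python
def naive_users_per_minute(requests_per_second):
    """Treat every load-test request as one complete user."""
    return requests_per_second * 60


def weighted_users_per_minute(requests_per_second, mix):
    """Estimate users per minute when users differ in cost.

    `mix` is a list of (share_of_users, cost) pairs, where cost is expressed
    in multiples of the cheap request the load test measured. The shares are
    expected to sum to 1. All values here are hypothetical.
    """
    average_cost = sum(share * cost for share, cost in mix)
    return requests_per_second * 60 / average_cost


# 300 requests per second really is 18,000 per minute...
print(naive_users_per_minute(300))

# ...but if a quarter of users trigger an operation that costs 37 times
# as much as the tested request (a made-up ratio), capacity shrinks tenfold.
print(weighted_users_per_minute(300, [(0.75, 1.0), (0.25, 37.0)]))
```

The point of the second function is only that the headline number depends on which operation was measured; the real cost ratios would have to come from profiling.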
SURVIVORSHIP BIAS
http://youarenotsosmart.com/2013/05/23/survivorship-bias/
Survivorship Bias
The misconception
You should focus on the successful if you wish to be successful
The truth
When failure becomes invisible, the difference between failure and success may also become invisible
Survivorship Bias
• “A Cabal of Geniuses” assembled at the request of the White House
• Top women mathematicians (human computers), Nobel Prize winners, and researchers formed the Statistical Research Group
Keeping Airplanes in the Sky
• At its lowest, the survivability of a WWII bomber was 50% on a mission
• “Ghosts already” is how airmen were known
• “How, the Army Air Force asked, could they improve the odds of a bomber making it home”
Armor
• Military commanders inspected the planes that made it back
• Ideally they could put armor on the whole plane, but then it wouldn’t fly
• Tons of bullet holes in key areas of the fuselage, wings, near the gunners
• The army was about to add plating to these parts of the bombers
Armor
• The scientists successfully argued “Survivorship Bias”
• Stop looking at the survivors – it is the other parts of the plane that need more armor!
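The scientists' argument can be reproduced with a toy simulation. Everything numeric below is an assumption chosen for illustration: the four regions, the per-hit lethality values, and the idea that hits land uniformly at random. The "engine" region is a hypothetical stand-in for "the other parts of the plane".

```python
import random

REGIONS = ["fuselage", "wings", "tail", "engine"]
# Hypothetical chance that a single hit in a region brings the plane down.
LETHALITY = {"fuselage": 0.05, "wings": 0.05, "tail": 0.10, "engine": 0.80}


def fly_missions(planes=10_000, hits_per_plane=3, seed=42):
    """Return (hits on every plane, hits visible on planes that came home)."""
    rng = random.Random(seed)
    all_hits = dict.fromkeys(REGIONS, 0)
    survivor_hits = dict.fromkeys(REGIONS, 0)
    for _ in range(planes):
        hits = [rng.choice(REGIONS) for _ in range(hits_per_plane)]
        # A plane returns only if it survives every individual hit.
        survived = all(rng.random() >= LETHALITY[region] for region in hits)
        for region in hits:
            all_hits[region] += 1
            if survived:
                survivor_hits[region] += 1
    return all_hits, survivor_hits


if __name__ == "__main__":
    all_hits, survivor_hits = fly_missions()
    for region in REGIONS:
        print(f"{region:9s} all={all_hits[region]:5d} seen={survivor_hits[region]:5d}")
```

Hits are spread evenly across regions, yet the planes available for inspection show few engine hits, because planes hit there rarely return. The load-test parallel: a test that only exercises requests that succeed says little about the ones that would take a server down.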
WHAT IS “CLOUDSCALE”
LOL
WE DON’T DO THAT
Zack’s first comment as I concluded that presentation
Our Architecture
PaaS / IaaS
WEEK OF JANUARY 6
Every Day Is a Record Traffic Day
Scale up on IaaS
• Someone trying to generate a 150-page PDF
• The norm is 10-15 pages…
• “OutOfMemoryException”
Thursday, January 9, 2014
Whoo Hoo!
• No issues on our highest traffic day ever!
• “Can’t wait till that number hits 250 per minute!”
• “Tomorrow will be our biggest day yet!”
Friday, January 10, 2014
• Approximately 12:00 noon
  • Site traffic is around 185 people per minute, 50 fewer than the previous day’s high
  • 1 out of every 12 hits times out
  • According to Rackspace, a node is failing on CloudSites and will be taken out of rotation
  • About 10 complaints so far, but I email “Everything is under control”
• Approximately 12:30 PM
  • Traffic falls to about 150 people per minute
  • Things are fine
  • Phew
Friday, January 10, 2014
• At 1:00 PM we have a job interview for a new support person
• I have live chat open with Rackspace and am hopping back and forth between the chat and the interview. Not the best way to hire someone
• 1:45 PM: the interview is over, and I learn traffic is at 220+ people per minute
• The site is pretty much dead
• While I work on the issue, my phone is ringing with a frightened customer on the line. Our help desk is filling up with complaints non-stop
• With a stone-cold face, I walk to my teammates. “This is bad. I need help”
Backup Plan
• I knew CloudSites had some limit, but I had a plan to shift traffic at a moment’s notice in a worst-case situation
Backup Plan Now in Play
• Using CloudFlare, a service that lets us rapidly change DNS records, traffic was redirected to the super server
• 1 second later
Backup Plan Part II (Scale Up)
• OK – I’ll spin up the most powerful server I can buy.
• 64 GB RAM
• 32 vCPU
Backup Plan Part II
• 19 seconds later
3:25 PM
• Rackspace gives me a one-time “boost” to capacity
• Lets me know about “HTE” for the future…
  • “If you are having a high traffic event, let us know in advance”
• I kiss the floor. My company is saved by the whim of my hosting company
9:00 PM
• Zack and I finish responding to customer complaints
• It would be weeks before I could sleep normally again
What the heck happened?
• The initial load test only covered people submitting one application at a time
• The PDF issue was actually a harbinger of things to come
• Thursday had record traffic, but Friday had people doing “Finalization” (commits)
• Our commit code was very slow and used a lot of RAM. When a server got overloaded, its app pool restarted, which added load to the other servers
• Demand > supply caused a chain reaction: servers kept failing until more supply was added
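The chain reaction described above can be sketched as a tiny simulation. This is a toy model under loud assumptions, not a reconstruction of the real hosting setup: demand is split evenly across healthy servers, one overloaded server restarts per tick, and all numbers (capacity of 60, a 3-tick restart) are made up.

```python
def simulate_cascade(demand, servers=4, capacity=60, restart_ticks=3,
                     ticks=12, add_servers_at=None, extra_servers=0):
    """Return the number of healthy servers at the start of each tick."""
    down_until = [0] * servers      # tick at which each server is healthy again
    history = []
    for tick in range(ticks):
        if add_servers_at is not None and tick == add_servers_at:
            # "More supply": fresh servers join the pool, all healthy at once.
            down_until.extend([0] * extra_servers)
        healthy = [i for i, ready in enumerate(down_until) if ready <= tick]
        history.append(len(healthy))
        # Load is split evenly; if each share exceeds capacity, one app pool
        # restarts and its share lands on the remaining servers next tick.
        if healthy and demand / len(healthy) > capacity:
            down_until[healthy[0]] = tick + 1 + restart_ticks
    return history


if __name__ == "__main__":
    print(simulate_cascade(demand=220))   # under total capacity: stays healthy
    print(simulate_cascade(demand=250))   # slightly over: pool collapses
    print(simulate_cascade(demand=250, add_servers_at=6, extra_servers=4))
```

With these made-up numbers, a demand of 220 never trips a restart, while 250 shrinks the pool until only one server is healthy at a time, and each server that comes back is immediately overloaded again. Adding four servers at tick 6 lets the pool recover, which matches the shape of what the capacity boost did, though not its actual mechanics.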
Our Future Plans
• I’m too scared of PaaS for a complex use case!
• Not enough data to know when things fail.
Thanks!
Kunal Johar
kjohar@alumni.gwu.edu