
Disaster Recovery Course

The document discusses disaster recovery strategies, emphasizing the importance of resiliency, fault tolerance, and high availability in IT systems. It outlines various disaster recovery options such as Backup and Restore, Pilot Light, Warm Standby, and Multi-Site Active-Active, detailing their recovery time objectives (RTO) and recovery point objectives (RPO). Additionally, it highlights the significance of continuous data replication and the need for effective traffic routing during disaster events.

Disaster Recovery

Copyright © ChandraMohan Lingam. All Rights Reserved.


Fault Tolerance, High Availability, Resiliency, Disaster recovery

RTO, RPO, Failover, Failback

Backup/Restore, Pilot Light, Warm Standby, Multi-region active/active

Resiliency

1. the capacity to recover quickly from difficulties; toughness


2. the ability of a substance or object to spring back into shape; elasticity

Google, Oxford Languages dictionary



"Everything fails, all the time"
Werner Vogels, Amazon CTO

EC2 instance crash
Physical server crash
Disk failure
Network router crash
Network link slowness
Power outage
Network outage
AZ down
Region down
Capacity issues
Abrupt changes in traffic
Attacks
Natural disasters



Failure Scenarios – Resiliency Strategy

Common Events

• Easy to recover
• Availability

One Time Events

• Large scale disruption


• Disaster Recovery



Availability

Photo By Santeri Viinamäki, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=75481494


Request-based Availability Metric

Scenario 1
• ATM usable only 8 out of 10 times
• Availability = 8/10 or 80%

Scenario 2
• No one is using the machine
• Availability = 0%
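
The two scenarios above can be sketched as a small calculation. This is only an illustration: the function name is hypothetical, and returning None when there are no requests is an assumed convention (the slide reports 0% for that case, while the ratio itself is undefined when nothing was requested).

```python
from typing import Optional


def request_availability(successful: int, total: int) -> Optional[float]:
    """Fraction of requests served successfully.

    Returns None when there were no requests, because the ratio is
    undefined in that case (an assumed convention for this sketch).
    """
    if total == 0:
        return None
    return successful / total


print(request_availability(8, 10))  # Scenario 1: 0.8
print(request_availability(0, 0))   # Scenario 2: None
```

Scenario 2 is the reason a request-based metric can be misleading on its own: with no usage, it says nothing about whether the machine was actually working.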



Time-based Availability Metric

Scenario
• ATM broke down twice in the past 100 hours
• Average down time: 5 hours per breakdown (10 hours in total)
• Availability = 90/100 or 90%
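
The arithmetic in this scenario can be written out as follows. A minimal sketch; the function and parameter names are hypothetical and only restate the numbers on the slide.

```python
def time_availability(total_hours: float, outages: int, avg_outage_hours: float) -> float:
    """Fraction of the observation window during which the system was up."""
    downtime = outages * avg_outage_hours
    return (total_hours - downtime) / total_hours


# Two breakdowns of 5 hours each in a 100-hour window
print(time_availability(100, 2, 5))  # 0.9
```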



Availability

Request-based or time-based

Target to meet when designing a system

Assess deployed system performance

Handle common disruptions


Component failures, transient network issues, changes in traffic



Fault Tolerance – Improve Availability

Automatic recovery

Handles common disruptions

Zero downtime
S3 – Server and Storage Redundancy

[Diagram: S3 spanning AZ 1 (Here), AZ 2 (There), and AZ 3 (Everywhere)]


Many AWS services are fault tolerant

• SNS
• SQS
• DynamoDB
• And more

Fault tolerant systems are complex and expensive to build



High Availability

EASIER TO DESIGN AND DEPLOY

HANDLES COMMON DISRUPTIONS

SMALL DOWNTIME



Primary - Standby Servers

Relational Database Service

[Diagram: Primary Server in AZ 1 and Standby Server in AZ 2; the standby is relabeled Primary after failover]

Database not accessible during failover
Multiple Web Servers

Elastic Load Balancing

HTTP 5xx Errors

[Diagram: one Web Server in each of AZ 1, AZ 2, and AZ 3]



Increase in traffic

App slow or errors

[Diagram: Elastic Load Balancer in front of six EC2 instances managed by Auto Scaling]



High Availability is not Disaster Recovery!

High availability is an essential first step

But may not handle disasters


• App deployed in a single region [disaster: region outage]
• Data corruption or deletion

[Diagram: Primary and Standby]

Standby data is also deleted. To recover data, we need to restore from backup!
Disaster Recovery

Natural disasters
Earthquakes, floods, hurricanes, snowstorms
Technical failures
Power failures, Network Outage
Human actions
Misconfiguration, unauthorized access or modification

Reference:
https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/what-is-a-disaster.html



Flood Impact

[Diagram: AZ 1, AZ 2, AZ 3]

• Redundant power, networking, connectivity
• Located within 60 miles (100 km) of each other
• Interconnected via redundant, ultra low latency network

Spread application infrastructure across two or more AZs


Data loss

[Diagram: Primary and Standby in Region A, with backups kept in Region A and in Region B]

Standby data is also deleted

To recover data, we need to restore from backup!

More complex issue!



Region Outage

https://www.datacenterdynamics.com/en/news/aws-us-east-1-region-suffers-errors-and-outages-impacting-its-status-page/
Disaster Events

"Disaster recovery (DR) is an important part of your resiliency strategy and concerns how your workload responds when a disaster strikes."

https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/introduction.html



Disaster Recovery - Metrics

Focus is on business continuity

Objectives are specified as


• Recovery Time Objective (RTO)
• Recovery Point Objective (RPO)
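
To make the two objectives concrete, the sketch below measures what a single incident actually cost and compares it with the targets. It assumes the usual reading of the terms: RPO bounds how much recent data may be lost (time since the last recovery point), and RTO bounds how long the service may be down. All names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_recovery_point, disaster_at, service_restored_at):
    """Return (data loss window, downtime) for one incident."""
    data_loss_window = disaster_at - last_recovery_point
    downtime = service_restored_at - disaster_at
    return data_loss_window, downtime


def meets_objectives(data_loss_window, downtime, rpo, rto):
    """True when the incident stayed within both objectives."""
    return data_loss_window <= rpo and downtime <= rto


# Hypothetical incident: last backup at 11:45, disaster at 12:00, service back at 12:40
loss, down = recovery_metrics(
    last_recovery_point=datetime(2024, 1, 1, 11, 45),
    disaster_at=datetime(2024, 1, 1, 12, 0),
    service_restored_at=datetime(2024, 1, 1, 12, 40),
)
print(loss, down)
print(meets_objectives(loss, down, rpo=timedelta(minutes=15), rto=timedelta(hours=1)))  # True
```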

Failover – switching traffic from primary site to DR site

[Diagram: Traffic directed to the Primary Site or the DR Site]



Failback – switching traffic from DR site back to primary site after the disaster is resolved

[Diagram: Traffic directed to the Primary Site or the DR Site]

Backup and Restore



Backup and Restore

[Diagram: Application Data in Region A is backed up in Region A, with a cross-region copy of the backups in Region B]

• Enable Continuous Backup (Point In Time Recovery)
• Periodic Full Backup of Your Data (Snapshot)
• Maintain Copy in a Second Region
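
One way to picture how a periodic snapshot and a continuous backup work together is the conceptual sketch below: restore the latest snapshot taken at or before the target time, then replay the recorded changes up to that time. This is a simplified model with hypothetical names and data, not a description of how any AWS service implements point-in-time recovery.

```python
from bisect import bisect_right


def restore_to(target_time, snapshots, change_log):
    """Rebuild the data as of target_time.

    snapshots:  list of (time, state_dict), sorted by time (full backups)
    change_log: list of (time, key, value), sorted by time (continuous backup)
    """
    snapshot_times = [t for t, _ in snapshots]
    index = bisect_right(snapshot_times, target_time) - 1
    if index < 0:
        raise ValueError("no snapshot at or before the target time")
    snapshot_time, state = snapshots[index]
    restored = dict(state)  # start from the full backup
    for t, key, value in change_log:
        # replay only the changes made after the snapshot, up to the target
        if snapshot_time < t <= target_time:
            restored[key] = value
    return restored


snapshots = [(0, {"sku-1": 10}), (100, {"sku-1": 12, "sku-2": 5})]
change_log = [(50, "sku-1", 12), (90, "sku-2", 5), (120, "sku-2", 7), (130, "sku-1", 0)]

# The unwanted write at time 130 is excluded by restoring to time 125
print(restore_to(125, snapshots, change_log))  # {'sku-1': 12, 'sku-2': 7}
```

Without the change log, the best available recovery point would be the snapshot at time 100, so the write at time 120 would be lost.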
Backup and Restore

After disaster
• Restore data
• Deploy servers and other resources
Low-cost

RTO/RPO in hours
Backup and Point-in-time Recovery

• EBS snapshot
• RDS snapshot
• Aurora snapshot
• DynamoDB
• S3 Cross region replication
• Redshift snapshot
• EFS backup
• And many more

Use AWS Backup Service:
• To centrally manage backup policies and automate the backup process
• Maintain copy in a disaster recovery region



Pilot Light



Pilot Light

Region A: Web Resources, App Resources, Application Data
Region B: Web Resources and App Resources (pre-configured and turned OFF), Application Data

Continuous replication of Application Data from Region A to Region B
Resources to support continuous replication are always ON



Pilot Light

After disaster
• Quickly start your web and app servers and
scale them to handle traffic

RTO/RPO in 10s of minutes

More expensive than Backup and Restore
Continuous Replication (RDS)

[Diagram: Primary in Region A, continuously replicated to a Read Replica in Region B; the replica becomes the Primary after promotion]

• RDS read-replica
• Aurora Global Database
• Both services maintain read-replica(s) in another region
• After a disaster event, promote one of the read replicas to be the new primary to allow read-write traffic
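
For a plain RDS cross-region read replica, the promotion step can be sketched with the boto3 RDS client's promote_read_replica call. The helper name and the instance identifier are hypothetical, and the sketch covers only RDS read replicas; an Aurora Global Database is failed over through a different operation that is not shown here.

```python
def promote_replica(rds_client, replica_id: str) -> str:
    """Promote a read replica to a standalone read-write instance.

    rds_client is expected to be a boto3 RDS client created in the DR region.
    Returns the instance status reported by the API call.
    """
    response = rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    return response["DBInstance"]["DBInstanceStatus"]


# Usage (needs boto3 and AWS credentials; the identifier is hypothetical):
#   import boto3
#   rds = boto3.client("rds", region_name="us-east-1")
#   print(promote_replica(rds, "catalog-db-replica"))
```

Promotion is one-way: the promoted instance stops replicating from the old primary, so failback needs a fresh replica in the original region.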
DynamoDB Global Table

• Automatic replication across specified regions
• All copies are read-write
• Changes are automatically propagated to other regions

Image: https://aws.amazon.com/dynamodb/global-tables/



S3 Cross Region Replication

[Diagram: Source Bucket in Region A, continuously replicated to Destination Bucket in Region B]



Elastic Disaster Recovery (CloudEndure)

Use AWS as Disaster Recovery site

[Diagram: block-level continuous replication of on-premises Servers to AWS, and of EC2 in AWS Region A to AWS Region B]



Traffic Routing

• Route 53
• Global Accelerator

Using health checks, these services can detect primary failure and direct traffic to secondary



Pilot Light

Continuous replication of data

After disaster
• Quickly start your web and app servers and
scale them to handle traffic



Warm Standby



Warm Standby

Region A: Web Resources, App Resources, Application Data
Region B: Web Resources and App Resources (fully functional, scaled down, can process requests), Application Data

Continuous replication of Application Data from Region A to Region B
Resources to support continuous replication are always ON



Failover Routing

• 100% traffic to Primary (Region A)
• Primary Not Available: traffic goes to Secondary (Region B)

Routing               Domain          Value                            Health Check
Failover - Primary    demolearn.com   myelb.us-west-2.amazonaws.com    Primary Health Check Endpoint
Failover - Secondary  demolearn.com   myelb.us-east-1.amazonaws.com
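
The routing decision described on this slide can be modeled in a few lines. This is a simplified sketch, not the Route 53 API: the class and function names are hypothetical, only the primary's health check is consulted (as on the slide), and the endpoint values are the ones shown in the record table.

```python
from dataclasses import dataclass


@dataclass
class FailoverRecord:
    role: str      # "PRIMARY" or "SECONDARY"
    value: str     # endpoint returned to clients
    healthy: bool  # latest health-check result (only used for the primary here)


def resolve(records):
    """Return the primary endpoint while it is healthy, otherwise the secondary."""
    by_role = {record.role: record for record in records}
    primary = by_role["PRIMARY"]
    secondary = by_role["SECONDARY"]
    return primary.value if primary.healthy else secondary.value


records = [
    FailoverRecord("PRIMARY", "myelb.us-west-2.amazonaws.com", healthy=True),
    FailoverRecord("SECONDARY", "myelb.us-east-1.amazonaws.com", healthy=True),
]
print(resolve(records))  # myelb.us-west-2.amazonaws.com

records[0].healthy = False  # primary health check starts failing
print(resolve(records))  # myelb.us-east-1.amazonaws.com
```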
Warm standby

RTO/RPO in minutes

After disaster
• Scale your web and app servers to handle
traffic (Auto scaling)

More expensive than pilot light


Multi-region active-active



Multi-region active/active

All regions handle app traffic
Route 53/Global Accelerator

Region A: Web Resources, App Resources, Application Data
Region B: Web Resources, App Resources, Application Data

Continuous replication of Application Data between regions
DynamoDB Global Table

• Automatic replication across specified regions
• All copies are read-write
• Changes are automatically propagated to other regions

Image: https://aws.amazon.com/dynamodb/global-tables/



Relational Database (RDS)

[Diagram: Primary in Region A, continuously replicated to a standby Read Replica in Region B]

• RDS read-replica
• Aurora Global Database
• Both services maintain read-replica(s) in another region
• After a disaster event, promote one of the read replicas to be the new primary to allow read-write traffic
Multi-region active/active

After disaster
• Zero downtime
• Traffic automatically routed to other regions
Data loss near zero

Most expensive
Cloud DR

• Recover quickly with reduced complexity


• Testability
• Automate, reduce errors and improve recovery time



Disaster Recovery Lab



Product Catalog



App for Product Catalog

[Diagram components: Route 53 (demolearn.com), ELB, Web Server, API Gateway, Lambda Function, DynamoDB Table]

1. Multiple Components
2. In the event of a disaster – we need to ensure all these components are available in the DR region



Primary: Oregon (us-west-2)
DR: N Virginia (us-east-1)

DR Options
1. Backup and Restore
2. Pilot Light
3. Warm Standby
4. Multi-Site Active-Active

Objective: Compare Data Loss (Recovery Point) and Recovery Time for each approach
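
As a rough aid for the comparison, the sketch below encodes the recovery tiers stated earlier in the deck (hours, 10s of minutes, minutes, near zero) and picks the least expensive option that meets a required tier. The enum, the list, and the function are hypothetical; the cost ordering follows the slides (Backup and Restore is low-cost, Multi-Site Active-Active is most expensive).

```python
from enum import IntEnum


class Recovery(IntEnum):
    """Coarse RTO/RPO tiers; a lower value means faster recovery and less data loss."""
    NEAR_ZERO = 0
    MINUTES = 1
    TENS_OF_MINUTES = 2
    HOURS = 3


# (strategy, tier it typically achieves), listed from least to most expensive
STRATEGIES = [
    ("Backup and Restore", Recovery.HOURS),
    ("Pilot Light", Recovery.TENS_OF_MINUTES),
    ("Warm Standby", Recovery.MINUTES),
    ("Multi-Site Active-Active", Recovery.NEAR_ZERO),
]


def cheapest_strategy(required: Recovery) -> str:
    """Return the least expensive strategy whose tier meets the requirement."""
    for name, tier in STRATEGIES:
        if tier <= required:
            return name
    raise ValueError("no strategy meets the requirement")


print(cheapest_strategy(Recovery.TENS_OF_MINUTES))  # Pilot Light
print(cheapest_strategy(Recovery.NEAR_ZERO))        # Multi-Site Active-Active
```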
DR: Oregon (us-west-2)
Primary: N Virginia (us-east-1)

DR Options
1. Backup and Restore
2. Pilot Light
3. Warm Standby
4. Multi-Site Active-Active

Objective: Compare Data Loss (Recovery Point) and Recovery Time for each approach
Primary: Oregon (us-west-2)
DR: N Virginia (us-east-1)

DR Options
1. Backup and Restore
2. Pilot Light
3. Warm Standby
4. Multi-Site Active-Active

Objective: Compare Data Loss (Recovery Point) and Recovery Time for each approach
Primary: Oregon (us-west-2)
Primary: N Virginia (us-east-1)

DR Options
1. Backup and Restore
2. Pilot Light
3. Warm Standby
4. Multi-Site Active-Active

Objective: Compare Data Loss (Recovery Point) and Recovery Time for each approach
Chandra Lingam
Instructor, Course Developer

7X AWS Certified
100K+ Students

For a list of courses, visit
https://www.cloudwavetraining.com/

Connect with me on LinkedIn
https://www.linkedin.com/in/chandralingam/
