Chaos Engineering with Kubernetes

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
Arun Gupta, @arungupta
Principal Open Source Technologist,
Amazon Web Services
Using Chaos to Bring Resiliency
to Your Applications in
Kubernetes

Failures are a given and
everything will eventually
fail over time.
https://www.allthingsdistributed.com/2016/03/10-lessons-from-10-years-of-aws.html

https://www.youtube.com/watch?v=zoz0ZjfrQ9s
Amazon 2006
GameDay: Creating
Resiliency Through
Destruction
Jesse Robbins

Chaos Monkeys
https://github.com/Netflix/SimianArmy

Chaos Engineering

Resilience
Ability of a system to adapt
to changes, failures, and disturbances

Chaos Engineering is the discipline of
experimenting on a distributed system in
order to build confidence in the system’s
capability to withstand turbulent
conditions in production
Credit: https://www.flickr.com/photos/loseryouthcrew/8775130600/
https://principlesofchaos.org/

Bad things will happen to your system,
no matter how well designed it is
You cannot afford to ignore it

Break your systems on purpose
Find out their weaknesses and
fix them before they break when least expected

Chaos doesn’t cause problems.
It reveals them.

• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks!

Where do you inject Chaos?

Phases of chaos engineering

https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
”Normal” behavior of your system

Business metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
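
One way to make "normal" behavior concrete is to compare a live business metric against a baseline band. The following is a minimal sketch, assuming a hypothetical list of recent samples and a tolerance of three standard deviations; the function name, the numbers, and the threshold are illustrative and not taken from the talk.

```python
from statistics import mean, stdev

def is_steady(baseline, current, max_sigma=3.0):
    """Return True when `current` lies within `max_sigma` standard
    deviations of the baseline samples (hypothetical steady-state check)."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return current == mu
    return abs(current - mu) <= max_sigma * sigma

# Illustrative numbers only: samples of a business metric under normal load.
baseline = [980, 1010, 995, 1005, 1000, 990, 1020]
print(is_steady(baseline, 1003))   # close to the baseline mean
print(is_steady(baseline, 400))    # far outside the band
```

A fixed band is the simplest choice; metrics with daily or weekly cycles would need a baseline taken from the same time window.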

• a service gives 404 or 503?
• latency increases by 300ms?
• the port is not accessible?
• security group rules changed?
• the database stops?
• excessive number of requests come?
• iptables are wiped out?

Pick hypothesis
Scope the experiment
Identify metrics
Notify the organization

Start with very small
As close as possible to production
Minimize the blast radius.
Have an emergency STOP!
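
The emergency stop can be sketched as a guard that is checked after every step of an experiment. This is a minimal illustration, assuming a hypothetical `read_error_rate` callable and a 5% limit; neither comes from the talk or from any specific tool.

```python
def run_experiment(steps, read_error_rate, max_error_rate=0.05):
    """Run chaos steps one by one and stop as soon as the observed error
    rate exceeds the limit. Returns (completed_step_names, aborted)."""
    completed = []
    for step in steps:
        step()
        completed.append(step.__name__)
        if read_error_rate() > max_error_rate:
            return completed, True   # emergency stop: skip remaining steps
    return completed, False

# Hypothetical demo: the error rate climbs after the second step.
rates = iter([0.01, 0.20, 0.30])

def kill_one_pod(): pass
def add_latency(): pass
def kill_node(): pass

result = run_experiment([kill_one_pod, add_latency, kill_node],
                        lambda: next(rates))
print(result)
```

Ordering the steps from smallest to largest impact keeps the blast radius small when the stop triggers early.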

Start with...
[Diagram: canary deployment splitting users into a 99% group and a 1% group]
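
A 1% user split can be made deterministic by hashing a user identifier into buckets. This is a minimal sketch, assuming string user IDs and a hypothetical `in_canary` helper; the deck does not say how the split is implemented.

```python
import hashlib

def in_canary(user_id: str, percent: float = 1.0) -> bool:
    """Deterministically place about `percent`% of users in the canary group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # buckets 0..9999
    return bucket < percent * 100

# Illustrative population of made-up user IDs.
users = [f"user-{i}" for i in range(100_000)]
share = sum(in_canary(u) for u in users) / len(users)
print(f"{share:.2%} of users in the canary group")
```

Hashing keeps each user in the same group across requests, so an experiment affects a stable set of users.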

Time to detect?
Time for notification? And escalation?
Time to public notification?
Time for graceful degradation to kick in?
Time for self-healing to happen?
Time to recovery—partial and full?
Time to all-clear and stable?
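
These questions become measurable once each milestone is timestamped. A minimal sketch, assuming a hypothetical dictionary of event names and times; the event names and the timeline are made up for illustration.

```python
from datetime import datetime

def incident_durations(events):
    """Given {event_name: datetime}, return seconds elapsed from the
    injected failure to each later milestone."""
    start = events["failure_injected"]
    return {name: (ts - start).total_seconds()
            for name, ts in events.items() if name != "failure_injected"}

# Hypothetical timeline of one experiment.
events = {
    "failure_injected": datetime(2018, 3, 10, 14, 42, 39),
    "detected":         datetime(2018, 3, 10, 14, 42, 45),
    "notified":         datetime(2018, 3, 10, 14, 43, 10),
    "recovered":        datetime(2018, 3, 10, 14, 47, 39),
}
durations = incident_durations(events)
print(durations)
```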

DON’T blame that one person…

PostMortems—COE (Correction of Errors)
The 5 WHYs

Fix

Failure free operations require
experience with failure.
http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf

Kubernetes cluster

Reconciles desired and actual state for pods
Distributes pods across AZs
Automatic health-check based restarts
Rolling deployment of a service
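
The idea of reconciling desired and actual state can be illustrated with a toy loop. This is a simplified sketch and not Kubernetes controller code; `start_pod` and `stop_pod` are hypothetical callbacks.

```python
import itertools

def reconcile(desired_replicas, running_pods, start_pod, stop_pod):
    """One reconciliation pass: converge the running pods toward the
    desired replica count by starting or stopping pods."""
    pods = list(running_pods)
    while len(pods) < desired_replicas:
        pods.append(start_pod())
    while len(pods) > desired_replicas:
        stop_pod(pods.pop())
    return pods

counter = itertools.count(1)
start = lambda: f"greeting-{next(counter)}"
stopped = []

pods = reconcile(3, [], start, stopped.append)
pods.remove("greeting-2")                       # chaos: one pod dies
pods = reconcile(3, pods, start, stopped.append)
print(pods)
```

Killing a pod only changes the actual state; the next pass restores the desired count, which is why pod deletion is a common first chaos experiment.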

Kubernetes cluster with Amazon EKS
AWS managed
Customer account

Kubernetes cluster with Amazon EKS
[Diagram: kubectl talks to mycluster.eks.amazonaws.com; the cluster spans Availability Zones 1, 2, and 3]

Region and Availability Zones
Control Plane is highly available
Master and Workers are configured in ASG
Master instance type auto-scaling
Etcd is HA and backed up every hour

Chaos in a Kubernetes cluster
[Diagram: kubectl talks to mycluster.eks.amazonaws.com; the cluster spans Availability Zones 1, 2, and 3, with three failures marked x. Health check? Dead node?]

Istio
Chaos Toolkit
Kube Monkey
PowerfulSeal
Gremlin
Simian Army

Istio
Intelligent routing
and load balancing
Resilience across
languages and
platforms
Fleet-wide policy
enforcement
In-depth
telemetry

Timeouts
Bounded retries with timeout budget
Concurrent connections limit and request load
Active health checks (periodic)
Passive health checks (circuit breakers)
AZ-aware load balancing with automatic failover
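
Bounded retries with a timeout budget can be sketched in application code as follows. Istio applies this kind of policy in the sidecar proxy; the version below is only an in-process illustration, and the function name, parameters, and defaults are assumptions.

```python
import time

def call_with_budget(operation, budget_s=1.0, max_attempts=3, backoff_s=0.1,
                     clock=time.monotonic, sleep=time.sleep):
    """Retry `operation(timeout)` at most `max_attempts` times without
    exceeding one overall time budget across all attempts."""
    deadline = clock() + budget_s
    last_error = None
    for _ in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break
        try:
            return operation(remaining)   # per-attempt timeout = time left
        except Exception as error:
            last_error = error
            # Back off, but never sleep past the deadline.
            sleep(max(0.0, min(backoff_s, deadline - clock())))
    raise TimeoutError("retry budget exhausted") from last_error

# Hypothetical upstream that fails twice and then answers.
attempts = []

def flaky(timeout):
    attempts.append(timeout)
    if len(attempts) < 3:
        raise ConnectionError("upstream unavailable")
    return "hello"

result = call_with_budget(flaky, budget_s=1.0, backoff_s=0.01)
print(result, len(attempts))
```

Capping both the attempt count and the total time prevents retries from amplifying load on an already struggling upstream.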

• Timing failures
• Increased network latency
• Overloaded upstream service
• Crashes
• HTTP error codes
• TCP connection failures

Fault injection using Istio—timeout
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: greeting
spec:
  hosts:
  - greeting
  http:
  - fault:
      delay:
        fixedDelay: 10s
        percent: 100
    route:
    - destination:
        host: greeting
        subset: greeting-hello
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: greeting-destination-rule
spec:
  host: greeting
  subsets:
  - name: greeting-hello
    labels:
      greeting: hello

Fault injection using Istio—HTTP abort
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: greeting
spec:
  hosts:
  - greeting
  http:
  - fault:
      abort:
        httpStatus: 500
        percent: 100
    route:
    - destination:
        host: greeting
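
What a percentage-based delay or abort fault means for callers can be illustrated in plain Python. Istio injects these faults in the sidecar proxy, not in application code; the decorator below is a hypothetical in-process stand-in, and its name and parameters are assumptions.

```python
import random
import time

def inject_fault(percent=100.0, delay_s=0.0, abort_status=None,
                 rng=random.random):
    """Wrap a handler so that `percent`% of calls are delayed and/or
    answered with `abort_status` instead of reaching the handler."""
    def decorator(handler):
        def wrapped(*args, **kwargs):
            if rng() * 100 < percent:
                if delay_s:
                    time.sleep(delay_s)
                if abort_status is not None:
                    return abort_status, "fault injected"
            return handler(*args, **kwargs)
        return wrapped
    return decorator

@inject_fault(percent=100, abort_status=500)
def greeting():
    return 200, "hello"

print(greeting())
```

With `percent=100` every call is affected, matching the manifests above; a lower percentage limits the blast radius of the experiment.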

Istio traffic management
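apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: greeting-virtual-service
spec:
  hosts:
  - greeting
  http:
  - route:
    - destination:
        host: greeting
        subset: greeting-hello
      weight: 75
    - destination:
        host: greeting
        subset: greeting-howdy
      weight: 25
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: greeting-destination-rule
spec:
  host: greeting
  subsets:
  - name: greeting-hello
    labels:
      greeting: hello
  - name: greeting-howdy
    labels:
      greeting: howdy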
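
The effect of a 75/25 weighted route can be sketched with a weighted random choice. This is only an illustration of the idea, not how the Istio proxy selects destinations; `pick_subset` is a hypothetical helper.

```python
import random
from collections import Counter

def pick_subset(routes, rng=random):
    """Pick a destination subset with probability proportional to its weight."""
    subsets = [name for name, _ in routes]
    weights = [weight for _, weight in routes]
    return rng.choices(subsets, weights=weights, k=1)[0]

routes = [("greeting-hello", 75), ("greeting-howdy", 25)]
rng = random.Random(42)   # seeded so repeated runs give the same split
counts = Counter(pick_subset(routes, rng) for _ in range(10_000))
print(counts)
```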

Istio circuit breaker
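apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: greeting-destination-rule
spec:
  host: greeting
  subsets:
  - name: greeting-hello
    labels:
      greeting: hello
    trafficPolicy:
      connectionPool:
        tcp:
          maxConnections: 100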
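
A connection limit such as `maxConnections` can be pictured as a fixed number of slots. The sketch below simply rejects callers once the slots are taken; how Istio handles overflow depends on further settings that are not shown on the slide, so treat this as a rough analogy with a hypothetical class name.

```python
import threading

class ConnectionLimiter:
    """Allow at most `max_connections` concurrent holders; reject the rest."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self):
        # Non-blocking: returns False immediately when no slot is free.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()

limiter = ConnectionLimiter(max_connections=2)
results = [limiter.try_acquire() for _ in range(3)]
print(results)            # third caller is rejected

limiter.release()         # one connection finishes
after_release = limiter.try_acquire()
print(after_release)      # a slot is free again
```

Failing fast when the limit is reached keeps a slow upstream from tying up every caller.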

https://istio.io/docs/

Chaos Toolkit
Open API for Chaos Engineering

CLI-driven
Experiments declared in JSON/YAML files
Open specification
Extensible: Kubernetes, AWS, Spring, others

Chaos Toolkit follows the principles of chaos

Probes: query a system to observe a behavior
• Check state of a pod with a specific label
• Multiple probes to define steady state
Actions: simulate real-world events
• Terminate a deployment
• Multiple actions simulate events
Types of probe and method
• Process: Run a binary
• HTTP: Invoke an HTTP endpoint
• Python: Call a Python function to perform richer operations
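
A Python probe is an ordinary function in an importable module, called with the keyword arguments listed under `arguments`. The following is a minimal sketch of a custom probe; the function name `http_status` and the `opener` parameter (added here only so the function can be exercised without a network) are assumptions and not part of Chaos Toolkit.

```python
import urllib.error
import urllib.request

def http_status(url: str, timeout: float = 3.0,
                opener=urllib.request.urlopen) -> int:
    """Return the HTTP status code for `url`, or 0 if the request fails
    before any response is received."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code     # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return 0              # connection refused, DNS failure, timeout
```

An experiment would reference it through a provider of type `python`, with `module` set to wherever the function lives and `func` set to `http_status`; a `tolerance` of 200 would then express the expected steady state.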

Chaos Toolkit metadata
{
  "version": "1.0.0",
  "title": "Terminating the greeting service should not impact users",
  "description": "How does the greeting service unavailability impact our users? Do they see an error or does the webapp get slower?",
  "tags": [
    "kubernetes",
    "aws"
  ],
  "configuration": {
    "web_app_url": {
      "type": "env",
      "key": "WEBAPP_URL"
    }
  },

Chaos Toolkit steady state & hypothesis
  "steady-state-hypothesis": {
    "title": "Services are all available and healthy",
    "probes": [
      {
        "type": "probe",
        "name": "alive-and-healthy",
        "tolerance": true,
        "provider": {
          "type": "python",
          "module": "chaosk8s.pod.probes",
          "func": "pods_in_phase",
          "arguments": {
            "label_selector": "app=webapp-pod",
            "phase": "Running",
            "ns": "default"
          }
        }
      },
      {
        "type": "probe",
        "name": "application-must-respond-normally",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "${web_app_url}",
          "timeout": 3
        }
      }
    ]
  },

Chaos Toolkit experiment & verify
  "method": [
    {
      "type": "action",
      "name": "terminate-greeting-service",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=greeter-pod",
          "ns": "default"
        }
      }
    },
    {
      "type": "probe",
      "name": "fetch-application-logs",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.probes",
        "func": "read_pod_logs",
        "arguments": {
          "label_selector": "app=webapp-pod",
          "last": "20s",
          "ns": "default"
        }
      }
    }
  ],

Chaos Toolkit run
$ chaos run experiments/experiment.json
[2018-03-10 14:42:38 INFO] Validating the experiment's syntax
[2018-03-10 14:42:38 INFO] Experiment looks valid
[2018-03-10 14:42:38 INFO] Running experiment: Terminate the greeting service should not impact users
[2018-03-10 14:42:38 INFO] Steady state hypothesis: Services are all available and healthy
[2018-03-10 14:42:38 INFO] Probe: application-should-be-alive-and-healthy
[2018-03-10 14:42:38 INFO] Probe: application-must-respond-normally
[2018-03-10 14:42:39 INFO] Steady state hypothesis is met!
[2018-03-10 14:42:39 INFO] Action: terminate-greeting-service
[2018-03-10 14:42:40 INFO] Probe: fetch-application-logs
[2018-03-10 14:42:41 INFO] Steady state hypothesis: Services are all available and healthy
[2018-03-10 14:42:41 INFO] Probe: application-should-be-alive-and-healthy
[2018-03-10 14:42:42 INFO] Probe: application-must-respond-normally
[2018-03-10 14:42:45 ERROR] => failed: activity took too long to complete
[2018-03-10 14:42:45 CRITICAL] Steady state probe 'application-must-respond-normally' is not in the given tolerance so failing this experiment
[2018-03-10 14:42:45 INFO] Let's rollback...
[2018-03-10 14:42:45 INFO] No declared rollbacks, let's move on.
[2018-03-10 14:42:45 INFO] Experiment ended with status: failed
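
The order of events in the log (steady state, method, steady state again, rollbacks) can be summarized in a few lines. This is a simplified sketch and not Chaos Toolkit's implementation; the dictionary layout and the status strings are this example's own.

```python
def run(experiment):
    """Check the steady state, run the method, check again, then roll back."""
    def steady():
        return all(probe() == expected
                   for probe, expected in experiment["probes"])

    if not steady():
        return "skipped"     # do not experiment on an unhealthy system
    try:
        for activity in experiment["method"]:
            activity()
        return "completed" if steady() else "failed"
    finally:
        for rollback in experiment.get("rollbacks", []):
            rollback()

# Hypothetical system whose greeting service does not recover by itself.
state = {"greeting": "up", "rolled_back": False}
experiment = {
    "probes": [(lambda: state["greeting"], "up")],
    "method": [lambda: state.update(greeting="down")],
    "rollbacks": [lambda: state.update(greeting="up", rolled_back=True)],
}
status = run(experiment)
print(status)
```

A failed result is the useful outcome here: it shows that the hypothesis did not hold and points at a weakness to fix.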

https://github.com/chaostoolkit/chaostoolkit/

Kube-Monkey
Implementation of Netflix’s Chaos Monkey for Kubernetes
Randomly deletes pods in the cluster
Applications opt-in using annotations

Run Kube-Monkey—create configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-monkey-config-map
  namespace: kube-system
data:
  config.toml: |
    [kubemonkey]
    run_hour = 8
    start_hour = 10
    end_hour = 16
    blacklisted_namespaces = ["kube-system"]
    whitelisted_namespaces = [""]

Kube-Monkey application opt-in
apiVersion: apps/v1
kind: Deployment
. . .
  template:
    metadata:
      labels:
        app: greeting
        kube-monkey/enabled: enabled
        kube-monkey/identifier: monkey-victim-pods
        kube-monkey/mtbf: '2'
        kube-monkey/kill-mode: random-max-percent
        kube-monkey/kill-value: '40'
    spec:
      containers:
      - name: greeting
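
The `random-max-percent` kill mode with a kill value of 40 can be pictured as choosing a random number of victims, capped at 40% of the pods. The sketch below is not kube-monkey's code; the rounding and the uniform choice of the kill count are assumptions made for illustration.

```python
import random

def pick_victims(pods, max_percent, rng=random):
    """Pick a random number of pods to kill, at most `max_percent`% of them."""
    max_kills = len(pods) * max_percent // 100   # rounding down is an assumption
    kills = rng.randint(0, max_kills)            # inclusive on both ends
    return rng.sample(pods, kills)

# Hypothetical pod names for a deployment with ten replicas.
pods = [f"greeting-{i}" for i in range(10)]
victims = pick_victims(pods, max_percent=40, rng=random.Random(7))
print(victims)
```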

https://github.com/asobti/kube-monkey

Chaos Engineering working group @ CNCF
https://github.com/chaoseng/wg-chaoseng

Chaos Engineering mind map
https://bit.ly/2uKOJMQ

You don’t choose the moment,
the moment chooses you.
You only choose how prepared
you are, when it does.

Thank you!
