Introduction
Me
@NU.nl
NU.nl
About
• First Dutch digital news platform.
• Unique visitors:
• 7 mln. / month
• 2.1 mln. / day
• Page hits: ~12 mln / day
• API: ~150k rpm / 2500rps
NU.nl
Sanoma
• Part of Sanoma
• NL: NU.nl, Viva, Libelle, Scoopy
• FI: Helsingin Sanomat
• Reaching ~9.8 mln Dutch people / month
IT organization
Teams
• NU.nl teams
• Web 1 (application / front-end-ish)
• Web 2 (application / back-end-ish / infra)
• Feature 1 & 2 (cross-discipline)
• iOS
• Android
• Sanoma teams
• DevSupport, Mediatool, Content Aggregation
NU.nl
Growing number of teams
• Increased number of parallel workflows
• Testing
• Releasing
• Roadmaps
• Knowing about everything no longer possible
• Aligning ‘procedures by agreement’ increasingly hard
Why Kubernetes?
Current infrastructure
AWS accounts & VPCs
• VPC sanoma: RDS, ElastiCache, ALBs, EC2, CloudFront; applications API, CMS, WWW, XYZ
• VPC nu-test: FOO, K8S
• VPC nu-prod: BAR, K8S
Infrastructure provisioning
Terrible (Terraform + Ansible)
terrible plan
terrible apply
terrible ansible
Development workflow
From code to release
• Code
• Automated tests
• Code review
• Manually initiated deploy to test
• Feature test
• Manually initiated deploy to staging
• Exploratory test
• Manually initiated deploy to production
DevOps practices
Solid foundation
• All infra in code
• Terraform
• Terrible providing mechanisms:
• Authorization
• Managing TF state files
DevOps practices
But…
• Setting up additional test environments slow
• Slow feedback loop
• Terraform plan vs apply (surprise surprise, it didn’t work)
• Ansible (~20 minutes)
• Vagrant? (but not fully representative of EC2)
• Config drift
• Hard to nail down every system package version
• EC2 instances having different lifecycle
DevOps practices
But… (part 2)
• No scaling infra*
• Heavily invested in Ansible
• Config & secrets management problematic
• GUIs time consuming
• No change history
• Or highly detached from code history
• No context
• Not overly secret
*Yes, we know it’s 2019
DevOps practices
But… (part 3)
• Current deployment system assumes fixed set of servers
• Possible alternatives include:
• ASG rolling updates (can get slow)
• Pull current application code on start-up (even slower)
• Bake AMI
• Periodically poll for application version to be deployed
• Works quite well
• …as long as new code combined with config doesn’t break.
• So a certain level of orchestration would be needed.
Where to start?
Everything’s connected
Timing
What direction to move?
• DevOps challenges
• Desire to improve delivery process, having true artifacts
• Early 2018
• Containers are a well-established way of ‘packaging’ an application
• Kubernetes getting out of early-adopters phase
• NU.nl (re-)launching a new product: NUjij
Improvement layers
A journey or a destination?
1: Containers as artifacts
• Versatile
• Forces us to do certain things right
• 12factor
• Centralized logging
• Easily moved through a pipeline
• Lots of tooling
Improvement layers
A journey or a destination?
2: A flexible platform to deploy and run containerized applications on
• Tackling challenges at platform level instead of per-application:
• Scaling
• Security updates
• Observability
• Deployment & configuration process
Improvement layers
A journey or a destination?
2: A flexible platform to deploy and run containerized applications on
• Kubernetes
• Rapidly increasing adoption
• Short feedback loop
• Ability to run locally (unlike, say, ECS)
• Easily stamp out deployments for:
• feature testing/demo-ing
• e2e tests
Narrowing the scope
Let's not get carried away
The goal is not:
• To change all of our applications into micro-services
• They’re not that monolithic anyway
• To put everything in Kubernetes
• Managed AWS services where possible
• Redis, RDS
Focus on agility and efficiency of what we change most frequently: Code
Initial cluster setup
The journey begins
Multiple clusters
By criticality
3 AWS accounts, 3 clusters:
• osc-nu-prod
• production
• osc-nu-test
• test
• staging
• osc-nu-dev
• proofing infra changes
Kops
Why Kops?
• Manages cluster upgrades
• Rolling upgrade
• Draining nodes
• EKS not yet available
• Let alone in eu-west-1
Kops
Glueing together cluster setup and kube-system setup
Kops
Upgrading a cluster
Kops
Templating Terraform and custom vars
Components
kube-system
• Networking
• Calico
• EFS
• previousnext/k8s-aws-efs
• No AZ-restrictions when re-scheduling pods
• Creates new EFS filesystem for each PersistentVolumeClaim
• Security & reliability (isolated IOPS budgets)
• Slow on initial deploy
Components
kube-system
• AWS IAM Authenticator
• The ‘Zalando suite’
• Skipper
• Skipper Daemonset
• kube-ingress-aws-controller Deployment
• ExternalDNS
• Configures PowerDNS (& others) based on ingress host
Components
Zalando skipper
• Skipper Daemonset
• Feature rich (metrics, shadow traffic, blue/green)
• kube-ingress-aws-controller Deployment
• https://github.com/zalando-incubator/kube-ingress-aws-controller
• Sets up & manages ALB
• Finds appropriate ACM certificate
• Supports multiple ACM certificates per ALB
Components
Autoscaling
• Horizontal Pod Autoscaler
• Scales number of pods based on (CPU) utilization
• Cluster autoscaler
• Running on master nodes
• Scales ASG out when pods pending
• Scales ASG in when nodes underutilized
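The Horizontal Pod Autoscaler behaviour listed above can be sketched in a few lines. This is a simplified model of the documented scaling rule, desired = ceil(current replicas × current metric / target metric), not the controller itself. The function name, the replica bounds and the example numbers are illustrative assumptions; the 10% tolerance mirrors the upstream default.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     tolerance: float = 0.1,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Simplified model of the HPA scaling rule (not the real controller)."""
    ratio = current_utilization / target_utilization
    # Inside the tolerance band the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the proposal to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical numbers: 4 pods at 90% CPU with a 60% target.
    print(desired_replicas(4, 90, 60))  # prints 6
```

Because the rule only looks at the chosen metric, a workload that is limited by something else (memory, for instance) will not scale out on CPU alone.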
Components
Logging & metrics
• ELK
• Prometheus / Grafana
Jenkins
Build & Deploy pipeline
Jenkins
Temporary deployment for running tests
• Deploy to temp. namespace
• Jenkins-SU
• Run tests in deployment
• Deploy to test/staging/production
• By bumping image version
• Production: Jenkins-SU
• Clean up temp. namespace
• Jenkins-SU
Jenkins
Jenkins-SU
• Sets up namespace
• Adding RBAC for Jenkins
• Only if ns name matches pattern ‘jenkins-*’
• Deletes namespace
• Only if ns name matches pattern ‘jenkins-*’
• Avoids need for Jenkins to be able to delete every namespace
curl -X POST --user ${JENKINS_SU_AUTH} --data '{"name": "${K8S_BUILD_NS}"}' http://su.jenkins-su/ns/
curl -X DELETE --user ${JENKINS_SU_AUTH} --data '{"name": "${K8S_BUILD_NS}"}' http://su.jenkins-su/ns/
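The guard that Jenkins-SU applies before touching a namespace might look roughly like the sketch below. It is not the actual Jenkins-SU code: the function names and status codes are hypothetical, and the pattern is written in lower case here because Kubernetes namespace names must be lower case.

```python
from fnmatch import fnmatchcase

# Assumed pattern, based on the rule described on the slide.
ALLOWED_PATTERN = "jenkins-*"


def may_manage_namespace(name: str, pattern: str = ALLOWED_PATTERN) -> bool:
    """Return True only for namespaces the CI service is allowed to touch."""
    return fnmatchcase(name, pattern)


def handle_delete(name: str) -> int:
    """Hypothetical request handler returning an HTTP-style status code."""
    if not may_manage_namespace(name):
        return 403  # refuse anything outside the CI namespaces
    # A real service would call the Kubernetes API here.
    return 200


if __name__ == "__main__":
    print(handle_delete("jenkins-build-42"))  # prints 200
    print(handle_delete("kube-system"))       # prints 403
```

Keeping the check in a small separate service means Jenkins itself never needs cluster-wide rights to create or delete namespaces.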
Kubernetes in action
Kubernetes in action
Questions
• Will it be stable?
• Will we be able to operate?
• Should we wait for EKS?
• Do we actually want EKS? What will EKS be like?
Learning from failure
1
No memory limits
Incident 1
Accidentally trying to load an Elasticsearch index of 90 GB
• Misconfigured ElastAlert (trying to read entire index)
• No memory limit configured
Incident 1
Accidentally trying to load an Elasticsearch index of 90 GB
• Required manual intervention: Yes
• Stopping the bleeding:
• Remove ElastAlert
• Permanent fixes:
• Don’t load entire index
• Apply limits
2
No CPU limits
Incident 2
Rapid traffic increase affecting core components
• 2019-03-18 Utrecht shooting
• 11:11 First article published
• 11:56 breaking push
• CPU burstable pods causing node 100% CPU
• Core components (kubelet, ingress) suffering
Incident 2
Rapid traffic increase affecting core components
(Diagram: a node running pods, kubelet and skipper. Pods: 0.4 CPU request, 0.8 CPU limit.)
• 80% CPU utilization
• 120% CPU utilization: problems
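The arithmetic behind this picture can be sketched as follows. The node size and pod count are hypothetical (the slides do not state them); only the 0.4 CPU request and 0.8 CPU limit come from the deck. The point is that scheduling is based on requests, while actual usage may grow up to the limits.

```python
from typing import List, Tuple


def cpu_commitment(node_millicores: int,
                   pods: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (requested %, worst-case burst %) of a node's CPU.

    Each pod is a (request, limit) pair in millicores.
    """
    requested = sum(request for request, _ in pods)
    burst = sum(limit for _, limit in pods)
    return (requested * 100 // node_millicores,
            burst * 100 // node_millicores)


if __name__ == "__main__":
    # Hypothetical 4-core node with 8 pods of 0.4 CPU request / 0.8 CPU limit.
    pods = [(400, 800)] * 8
    print(cpu_commitment(4000, pods))  # prints (80, 160)
```

With requests at 80% the scheduler considers the node healthy, yet the same pods are allowed to demand far more CPU than the node has, leaving nothing for kubelet and the ingress.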
Incident 2
Rapid traffic increase affecting core components
• Required manual intervention: No
• Fixes:
• Reduce the amount of CPU pods are allowed to burst
• Increase resource requests of skipper
• Mind QoS: Guaranteed, Burstable, BestEffort
• Reserve cpu & memory for kubelet
• --kube-reserved
• --system-reserved
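How a pod ends up in one of the three QoS classes can be sketched like this. It is a simplified model of the documented rules, and the data layout (plain dicts per container) is an assumption made for the example. Quantities are compared as strings, so "500m" and "0.5" would count as different here.

```python
from typing import Dict, List

# Assumed shape: {"requests": {"cpu": ..., "memory": ...}, "limits": {...}}
Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Classify a pod as Guaranteed, Burstable or BestEffort."""
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    pod = [{"requests": {"cpu": "400m", "memory": "256Mi"},
            "limits": {"cpu": "800m", "memory": "256Mi"}}]
    print(qos_class(pod))  # prints Burstable
```

The pods from Incident 2 (0.4 CPU request, 0.8 CPU limit) fall in the Burstable class, which is why they could crowd out other workloads on the node.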
3
Memory limits
OOMkiller
Incident 3
Application update increasing memory footprint
• Upgrade including moving from MongoDB 3 to MongoDB 4
• HorizontalPodAutoscaler based on CPU
• Scaling based on CPU not kicking in
• New increased memory footprint causing pods to be OOMKilled
Incident 3
Application update increasing memory footprint
• Required manual intervention: Yes
• Stopping the bleeding:
• Increase memory limit of Talk pods
• Permanent fixes:
• Adjust CPU request/limit & HPA thresholds
• Scale on both CPU and memory
• Note: Not all applications ‘give back’ memory
• Set memory limit higher than request to prevent ‘snowball effect’
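Scaling on both CPU and memory can be illustrated with a small model. When several metrics are configured, the autoscaler evaluates each one separately and takes the largest proposed replica count. The function name and the numbers below are hypothetical; tolerance and replica bounds are left out for brevity.

```python
import math
from typing import Dict


def replicas_for_metrics(current_replicas: int,
                         current: Dict[str, float],
                         target: Dict[str, float]) -> int:
    """Evaluate every metric separately and keep the largest proposal."""
    proposals = [
        math.ceil(current_replicas * current[name] / target[name])
        for name in target
    ]
    return max(proposals)


if __name__ == "__main__":
    # Hypothetical situation: CPU is quiet, memory is under pressure.
    current = {"cpu": 30, "memory": 90}
    print(replicas_for_metrics(5, current, {"cpu": 60}))                # prints 3
    print(replicas_for_metrics(5, current, {"cpu": 60, "memory": 75}))  # prints 6
```

With CPU as the only metric the deployment would even scale in, while adding the memory metric makes it scale out before pods hit their limit.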
Incident 3
OOMKilled snowball effect
(Diagram in four steps: pods get OOMKilled while their replacements are still starting, so the remaining pods take the load and are killed in turn.)
3
Memory limits
!?
(obligatory this-is-fine meme)
That’s not fine
Is it?
• On the positive side:
• All are the result of (a lack of) resource limit configuration
• This can be learned
• On the negative side:
• This needs to be learned
• Note: ‘Availability bias’
Improving
Automation
Improving the pipeline
• Automating setting the image version is not enough
• Rolling out Kubernetes manifests still manual task
• Updating configuration & secrets still manual task
• Duplication in manifests between stages
• Not easily seen what parts are different
• Differences intentional or accidental?
• This actually slows us down
• Does git represent the current state?
kubectl -n talk get secrets env -o json | jq -r '.data | map_values(@base64d) | to_entries | .[] | .key + "=\"" + .value + "\""'
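What the one-liner above does to the Secret can be sketched in Python: take the `.data` map, base64-decode every value and print `KEY="value"` lines. The function name and the sample secret are made up for the example, and keys are sorted only to keep the output deterministic.

```python
import base64
import json
from typing import List


def secret_to_env_lines(secret_json: str) -> List[str]:
    """Decode the base64 values in a Secret's .data into KEY="value" lines."""
    data = json.loads(secret_json).get("data") or {}
    lines = []
    for key in sorted(data):
        value = base64.b64decode(data[key]).decode("utf-8")
        lines.append(f'{key}="{value}"')
    return lines


if __name__ == "__main__":
    # Hypothetical Secret, shaped like `kubectl get secret -o json` output.
    sample = json.dumps({
        "kind": "Secret",
        "data": {"DB_USER": base64.b64encode(b"talk").decode("ascii")},
    })
    print("\n".join(secret_to_env_lines(sample)))  # prints DB_USER="talk"
```

Having to run something like this to find out what is deployed is exactly the symptom the slide points at: git does not show the current state.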
Helm
The package manager for Kubernetes
• Charts
• Configured via values
• It’s like Terraform modules
• Or Ansible group_vars
• Leveraging community knowledge and efforts
• E.g. prometheus-operator
• No need to copy charts, able to reference.
• Helm v3
SOPS: Secrets OPerationS
Secrets management stinks, use some sops!
• By Mozilla
• Manage AWS API access, not keys
• Versatile
• YAML, JSON, ENV, INI, binary (plain text)
• Not limited to Kubernetes
• Meaningful diffs
• Alternatives considered:
• Kamus
• Bitnami SealedSecrets
Helmfile
Wiring it together
• Charts
• Referenced from online chart sources or local
• Environments
• Test, staging, production
• Referencing values and secrets
• Releases
• Release name
• Reference to chart
• Values (can be a templated file, using vars and secrets from environment)
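The layering of values can be pictured as a recursive merge in which later layers win. The sketch below models that idea only; it is not Helmfile's or Helm's implementation, and the chart defaults, environment values and image tag are invented for the example.

```python
from typing import Any, Dict


def merge_values(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge two values trees; `override` wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, not combined.
            merged[key] = value
    return merged


if __name__ == "__main__":
    chart_defaults = {"replicas": 1,
                      "image": {"repository": "example/talk", "tag": "latest"}}
    production = {"replicas": 6}              # hypothetical environment values
    from_pipeline = {"image": {"tag": "1.4.2"}}  # hypothetical build result
    values = merge_values(merge_values(chart_defaults, production), from_pipeline)
    print(values["replicas"], values["image"]["tag"])  # prints 6 1.4.2
```

Only the differences between stages need to be written down, which is what makes the resulting git diffs meaningful.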
Helmfile
Wiring it together
(Diagram: the environment supplies values and secrets (SOPS); together with ENV vars these feed the values of release X, release Y and release Z.)
Helmfile
Wiring it together
• Advantages:
• Meaningful git diffs
• Easily manage multiple releases in single pipeline, e.g.:
• Everything related to monitoring and logging
• Kube-system
• Declarative definition
• Of what would otherwise be numerous helm args and steps in CI/CD pipeline
Helmfile
Wiring it together
• Advantages (continued):
• Ability to pass in ENV vars
• E.g. build result image tags
• Ability to reference complex charts created by community
• Charts as a building block allows re-use. Example:
• Instead of plain yaml you write a chart
• If fitting workflow, the chart can be a published artifact
• Chart can be re-used e.g. in e2e tests
Helmfile
Wiring it together
• Disadvantages:
• 2 levels of templating
• Chart itself
• Only if writing own charts
• Environment & release values into Helm values
• Template error message not overly clear
• Or even misleading
• At least it breaks
Helmfile
Example
Helmfile
Jenkins
Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)
Helmfile
Final words
But tiller?
• Helm as a templating engine
• Option: Using Helm 2 ‘Tillerless’
• Tiller outside of cluster, not by-passing RBAC
• Start using Helm as package manager when Helm 3 settles down
• Easy removal of temp. per-feature deploys
• Diffs
Challenge
Auto-scaling
scale fast… scale far…
Auto-scaling
Breaking news push
Auto-scaling
Types of scaling
• Reactive
• Breaking news
• K8S cluster-autoscaler
• Can’t schedule pod? Add nodes.
• Predictive
• Ticket sale start
• Black Friday
Auto-scaling
Types of scaling
• From within cluster
• K8S cluster-autoscaler
• From outside of cluster
• ASG scaling policies
Auto-scaling
Scaling speed
(Graph: node count over time at 70% utilization, showing the node spin-up duration.)
Auto-scaling
Times 5 within 5 minutes?
Cluster auto-scaler
Bag of tricks
• Mix predictive and reactive
• Add ASG instances without telling cluster-autoscaler
• Traffic expected to arrive by the time cluster-autoscaler starts to scale in, leaving plenty of resources as needed.
• Pause pods
• Lower priority pods that can safely be evicted
• Effectively ‘creating headroom’ in cluster
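The headroom trick with pause pods can be sketched as a toy scheduler for a single node. This is a deliberately simplified model, not the Kubernetes scheduler: pod names, priorities and sizes are hypothetical, and only CPU is considered.

```python
from typing import List, Tuple

Pod = Tuple[str, int, int]  # (name, priority, cpu in millicores)


def schedule_with_preemption(capacity: int,
                             running: List[Pod],
                             incoming: Pod) -> Tuple[bool, List[str]]:
    """Return (scheduled, evicted pod names) for one incoming pod."""
    _, priority, cpu = incoming
    free = capacity - sum(pod[2] for pod in running)
    evicted = []
    # Evict the lowest-priority pods first, never pods of equal or higher priority.
    for name, pod_priority, pod_cpu in sorted(running, key=lambda pod: pod[1]):
        if free >= cpu or pod_priority >= priority:
            break
        evicted.append(name)
        free += pod_cpu
    if free >= cpu:
        return True, evicted
    return False, []  # the pod stays Pending until nodes are added


if __name__ == "__main__":
    running = [("app-1", 0, 1500), ("app-2", 0, 1500),
               ("pause-1", -1, 500), ("pause-2", -1, 500)]
    print(schedule_with_preemption(4000, running, ("app-3", 0, 800)))
    # prints (True, ['pause-1', 'pause-2'])
```

The new application pod starts immediately, and the evicted pause pods become the pending pods that make cluster-autoscaler add nodes in the background.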
Considerations
When engaging ‘ludicrous mode’™
Can control-plane handle scale?
• Kops
• Size master nodes for max. cluster size
• Overhead cost
• EKS
• What’s behind the abstraction?
• ELB 503s exist after all
• Plan: proofs of concept
Pending
Not the pods…
Consider EKS
Managed control plane
EKS:
• Managed control plane
• Easier: EKS IAM roles for pods
• Launched 2019-09-04 (yesterday)*
• Probably cheaper (2/3 of 3x m4.large)
Kops:
• Total control over setup
• Smooth rolling upgrade process
• No VPC CNI pod density limitations
* https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/
EKS IAM roles for pods
Also possible on DIY clusters, officially launched yesterday
• OIDC federation access (OpenID Connect identity provider)
• Assume role via Security Token Service (STS)
• Projected service account tokens (JWT) in pod
• STS can validate JWT tokens against OIDC provider
• Boils down to:
• Enable/set-up prerequisites in cluster
• Add ServiceAccount having IAM role annotation to pod
• Use recent AWS SDK
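To get a feel for what the projected service account token carries, the sketch below builds a fake, unsigned JWT and decodes its payload. The issuer URL, namespace and service account name are invented, and the decoder deliberately does not verify the signature, so it is for inspection only; in the real flow STS validates the token against the OIDC provider.

```python
import base64
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def fake_token(claims: dict) -> str:
    """Build an unsigned token for demonstration purposes only."""
    header = _b64url(json.dumps({"alg": "none"}).encode("utf-8"))
    payload = _b64url(json.dumps(claims).encode("utf-8"))
    return f"{header}.{payload}."


def jwt_claims(token: str) -> dict:
    """Decode the payload of a JWT WITHOUT verifying its signature."""
    payload = token.split(".")[1]
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


if __name__ == "__main__":
    token = fake_token({
        "iss": "https://oidc.example.invalid/cluster",  # hypothetical issuer
        "sub": "system:serviceaccount:talk:talk-api",   # hypothetical account
        "aud": ["sts.amazonaws.com"],
    })
    print(jwt_claims(token)["sub"])  # prints system:serviceaccount:talk:talk-api
```

The subject claim identifies the namespace and ServiceAccount, which is what an IAM role's trust policy can be conditioned on.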
Multiple clusters per AWS account
Don’t lock ourselves in a corner.
api.<aws-account-name>.<k8s-sanoma-domain>
api.<cluster-name>.<aws-account-name>.<k8s-sanoma-domain>
(Diagram: Route53 zone 1 delegates to Route53 zone 2 via NS records.)
CI/CD to separate cluster
Similar flows
• No more taints and tolerations
• Similar authorization mechanism to all deploy targets
• Possibly IAM
• No need for Jenkins-SU
• Clusters should be cattle anyway
Pipelines
GitOps
• Manage namespaces via pipeline:
• kube-system
• monitor
• Creation of application namespaces including RBAC
• Helmfile
System applications
Small improvements
• Prometheus-operator
• PrometheusRule resource type
• Default dashboards
• EFS
• https://github.com/previousnext/k8s-aws-efs
• Current. Works well but not a lot of active development.
• 2 contributors. 46 stars.
• https://github.com/kubernetes-incubator/external-storage
• De facto EFS provisioner. 146 contributors. 1630 stars.
• Bonus: No more time-consuming initial volume set-up
Expand
Increase Return on Investment
• Add more applications
• Facilitate parallel testing & development workflows
• Feature testing
• Mobile app development
• E2e tests
Links
Further reading
Scaling & spot instances:
• https://itnext.io/the-definitive-guide-to-running-ec2-spot-instances-as-kubernetes-worker-nodes-68ef2095e767
EKS:
• https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c
QoS:
• https://www.replex.io/blog/everything-you-need-to-know-about-kubernetes-quality-of-service-qos-classes
Failure stories:
• https://k8s.af/
Summary
Know your limits
Automate all the things
Everything code
Kubernetes is a journey, not a destination
All should be cattle. No pets allowed!
?
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Ad

Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)

  • 1. @
  • 4. NU.nl About • First Dutch digital news platform. • Unique visitors: • 7 mln. / month • 2.1 mln. / day • Page hits: ~12 mln / day • API: ~150k rpm / 2500rps
  • 5. NU.nl Sanoma • Part of Sanoma • NL: NU.nl, Viva, Libelle, Scoopy • FI: Helsingin Sanomat • Reaching ~9.8 mln Dutch people / month
  • 6. IT organization Teams • NU.nl teams • Web 1 (application / front-end-ish) • Web 2 (application / back-end-ish / infra) • Feature 1 & 2 (cross-discipline) • iOS • Android • Sanoma teams • DevSupport, Mediatool, Content Aggregation
  • 7. NU.nl Growing number of teams • Increased number of parallel workflows • Testing • Releasing • Roadmaps • Knowing about everything no longer possible • Aligning ‘procedures by agreement’ increasingly hard
  • 9. Current infrastructure AWS accounts & VPCs VPC sanoma RDS Elasticache ALBs EC2 Cloudfront API CMS WWW XYZ VPC nu-test FOO K8S VPC nu-prod BAR K8S
  • 10. Infrastructure provisioning Terrible (Terraform + Ansible) terrible plan terrible apply terrible ansible
  • 11. Development workflow From code to release • Code • Automated tests • Code review • Manually initiated deploy to test • Feature test • Manually initiated deploy to staging • Exploratory test • Manually initiated deploy to production
  • 12. DevOps practices Solid foundation • All infra in code • Terraform • Terrible providing mechanisms: • Authorization • Managing TF state files
  • 13. DevOps practices But… • Setting up additional test environments slow • Slow feedback loop • Terraform plan vs apply (surprise surprise, it didn’t work) • Ansible (~20 minutes) • Vagrant? (but not fully representative of EC2) • Config drift • Hard to nail down every system package version • EC2 instances having different lifecycle
  • 14. DevOps practices But… (part 2) • No scaling infra* • Heavily invested in Ansible • Config & secrets management problematic • GUIs time consuming • No change history • Or highly detached from code history • No context • Not overly secret *Yes, we know it’s 2019
  • 15. DevOps practices But… (part 3) • Current deployment system assumes fixed set of servers • Possible alternatives include: • ASG rolling updates (can get slow) • Pull current application code on start-up (even slower) • Bake AMI • Periodically poll for application version to be deployed • Works quite well • …as long as new code combined with config doesn’t break. • So a certain level of orchestration would be needed.
  • 17. Timing What direction to move? • DevOps challenges • Desire to improve delivery process, having true artifacts • Early 2018 • Containers are a well-established way of ‘packaging’ an application • Kubernetes getting out of early-adopters phase • NU.nl (re-)launching a new product: NUjij
  • 18. Improvement layers A journey or a destination? 1: Containers as artifacts • Versatile • Forces us to do certain things right • 12factor • Centralized logging • Easily moved through a pipeline • Lots of tooling
  • 19. Improvement layers A journey or a destination? 2: A flexible platform to deploy and run containerized applications on • Tackling challenges at platform level instead of per-application: • Scaling • Security updates • Observability • Deployment & configuration process
  • 20. Improvement layers A journey or a destination? 2: A flexible platform to deploy and run containerized applications on • Kubernetes • Rapidly increasing adoption • Short feedback loop • Ability to run locally (unlike, say, ECS) • Easily stamp out deployments for: • feature testing/demo-ing • e2e tests
  • 21. Narrowing the scope Let’s not get carried away The goal is not: • To chop up / change all of our applications into nano- / micro-services • They’re not that monolithic anyway • To put everything in Kubernetes • Managed AWS services where possible • Redis, RDS Focus on agility and efficiency of what we change most frequently: Code
  • 22. Initial cluster setup The journey begins
  • 23. Multiple clusters By criticality 3 AWS accounts, 3 clusters: • osc-nu-prod • production • osc-nu-test • test • staging • osc-nu-dev • proofing infra changes
  • 24. Kops Why Kops? • Manages cluster upgrades • Rolling upgrade • Draining nodes • EKS not yet available • Let alone in eu-west-1
  • 25. Kops Glueing together cluster setup and kube-system setup
  • 29. Components kube-system • Networking • Calico • EFS • previousnext/k8s-aws-efs • No AZ-restrictions when re-scheduling pods • Creates new EFS filesystem for each PersistentVolumeClaim • Security & reliability (isolated IOPs budgets) • Slow on initial deploy
  • 30. Components kube-system • AWS IAM Authenticator • The ‘Zalando suite’ • Skipper • Skipper Daemonset • kube-ingress-aws-controller Deployment • ExternalDNS • Configures PowerDNS (& others) based on ingress host
  • 31. Components Zalando skipper • Skipper Daemonset • Feature rich (metrics, shadow traffic, blue/green) • kube-ingress-aws-controller Deployment • https://github.com/zalando-incubator/kube-ingress-aws-controller • Sets up & manages ALB • Finds appropriate ACM certificate • Supports multiple ACM certificates per ALB
  • 32. Components Autoscaling • Horizontal Pod Autoscaler • Scales number of pods based on (CPU) utilization • Cluster autoscaler • Running on master nodes • Scales ASG out when pods are pending • Scales ASG in when nodes are underutilized
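The Horizontal Pod Autoscaler's scaling decision can be sketched in a few lines. This is a minimal illustration of the formula given in the Kubernetes documentation (desired = ceil(current replicas × current metric / target metric), with a default tolerance of 0.1), not the controller's actual code. The function names and the sample numbers are made up for illustration.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     tolerance=0.1):
    """Replica count proposed for a single metric (hypothetical helper)."""
    ratio = current_utilization / target_utilization
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def desired_for_metrics(current_replicas, metrics):
    """With several metrics (e.g. CPU and memory), the largest proposal wins.

    metrics: list of (current_utilization, target_utilization) pairs.
    """
    return max(desired_replicas(current_replicas, cur, tgt) for cur, tgt in metrics)


if __name__ == "__main__":
    # 4 pods at 90% CPU utilization against a 60% target.
    print(desired_replicas(4, 90, 60))
    # CPU below target, memory above target: memory drives the scale-out.
    print(desired_for_metrics(4, [(50, 60), (95, 70)]))
```

Utilization here is measured relative to the pods' resource requests, so the request values chosen for a workload directly influence when scaling kicks in.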
  • 33. Components Logging & metrics • ELK • Prometheus / Grafana
  • 35. Jenkins Temporary deployment for running tests • Deploy to temp. namespace • Jenkins-SU • Run tests in deployment • Deploy to test/staging/production • By bumping image version • Production: Jenkins-SU • Clean up temp. namespace • Jenkins-SU
  • 36. Jenkins Jenkins-SU • Sets up namespace • Adding RBAC for Jenkins • Only if ns name matches pattern ‘Jenkins-*’ • Deletes namespace • Only if ns name matches pattern ‘Jenkins-*’ • Avoids need for Jenkins to be able to delete every namespace curl -X POST --user ${JENKINS_SU_AUTH} --data '{"name": "${K8S_BUILD_NS}"}' http://su.jenkins-su/ns/ curl -X DELETE --user ${JENKINS_SU_AUTH} --data '{"name": "${K8S_BUILD_NS}"}' http://su.jenkins-su/ns/
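The guard that Jenkins-SU applies could look roughly like the sketch below. The talk does not show the service's code, so everything here is hypothetical: the function names, the status codes, and the lowercase pattern `jenkins-*` (assumed because Kubernetes namespace names are lowercase DNS labels, while the slide writes the pattern as ‘Jenkins-*’).

```python
from fnmatch import fnmatchcase

# Assumption: lowercase variant of the pattern shown on the slide.
ALLOWED_PATTERN = "jenkins-*"


def may_manage(namespace, pattern=ALLOWED_PATTERN):
    """True only for namespaces the privileged helper is allowed to touch."""
    return fnmatchcase(namespace, pattern)


def handle(method, namespace):
    """Return an HTTP-like status code for a create (POST) or delete (DELETE) request."""
    if method not in ("POST", "DELETE"):
        return 405
    if not may_manage(namespace):
        # Refuse anything outside the pattern, e.g. kube-system.
        return 403
    # A real service would now call the Kubernetes API to create the
    # namespace plus RBAC for Jenkins, or to delete the namespace.
    return 200


if __name__ == "__main__":
    print(handle("POST", "jenkins-build-42"))
    print(handle("DELETE", "kube-system"))
```

The design point is that only this small service holds cluster-wide namespace permissions, while Jenkins itself only gets rights inside the namespaces created for it.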
  • 38. Kubernetes in action Questions • Will it be stable? • Will we be able to operate? • Should we wait for EKS? • Do we actually want EKS? What will EKS be like?
  • 41. Incident 1 Accidentally trying to load an Elasticsearch index of 90 GB • Misconfigured ElastAlert (trying to read entire index) • No memory limit configured
  • 42. Incident 1 Accidentally trying to load an Elasticsearch index of 90 GB • Required manual intervention: Yes • Stopping the bleeding: • Remove ElastAlert • Permanent fixes: • Don’t load entire index • Apply limits
  • 44. Incident 2 Rapid traffic increase affecting core components • 2019-03-18 Utrecht shooting • 11:11 First article published • 11:56 breaking push • CPU burstable pods causing node 100% CPU • Core components (kubelet, ingress) suffering
  • 45. Incident 2 Rapid traffic increase affecting core components
  • 46. Incident 2 Rapid traffic increase affecting core components
  • 47. Incident 2 Rapid traffic increase affecting core components
  • 48. Incident 2 Rapid traffic increase affecting core components
  • 49. Incident 2 Rapid traffic increase affecting core components pod pod kubelet skipper node Pods: 0.4 CPU req. 0.8 CPU limit 80% CPU utilization pod kubelet skipper node pod Pods: 0.4 CPU req. 0.8 CPU limit 120% CPU utilization problems
  • 50. Incident 2 Rapid traffic increase affecting core components • Required manual intervention: No • Fixes: • Reduce CPU burstable amount of pods • Increase resource requests of skipper • Mind QoS: Guaranteed, Burstable, Best effort • Reserve cpu & memory for kubelet • --kube-reserved • --system-reserved
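To make the QoS classes concrete, the sketch below classifies a pod from its containers' requests and limits, following the rules in the Kubernetes documentation. It is a simplified illustration: the dictionary layout and function name are assumptions, and values are compared as written, so requests and limits should use the same notation.

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable or BestEffort.

    containers: list of dicts with optional 'requests' and 'limits' dicts,
    each keyed by 'cpu' and/or 'memory' (hypothetical input layout).
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A request that is not set defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    # The pods from the incident: 0.4 CPU requested, 0.8 CPU limit.
    print(qos_class([{"requests": {"cpu": 0.4}, "limits": {"cpu": 0.8}}]))
    print(qos_class([{"requests": {"cpu": 0.5, "memory": "256Mi"},
                      "limits": {"cpu": 0.5, "memory": "256Mi"}}]))
    print(qos_class([{}]))
```

Burstable pods may use CPU beyond what the scheduler accounted for, which is why node-level components need their own reservation via `--kube-reserved` and `--system-reserved`.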
  • 53. Incident 3 Application update increasing memory footprint • Upgrade including moving from MongoDB 3 to MongoDB 4 • HorizontalPodAutoscaler based on CPU • Scaling based on CPU not kicking in • New increased memory footprint causing OOMkilled
  • 54. Incident 3 Application update increasing memory footprint
  • 55. Incident 3 Application update increasing memory footprint • Required manual intervention: Yes • Stopping the bleeding: • Increase memory limit of Talk pods • Permanent fixes: • Adjust CPU request/limit & HPA thresholds • Scale on both CPU and memory • Note: Not all applications ‘give back’ memory • Set memory limit higher than request to prevent ‘snowball effect’
  • 56. Incident 3 OOMKilled snowball effect pod pod pod pod pod pod pod pod pod pod starting … 1 2 3 4
  • 58. That’s not fine Is it? • On the positive side: • All are result of (lack of) resource limit configuration • This can be learned • On the negative side: • This needs to be learned • Note: ‘Availability bias’
  • 60. Automation Improving the pipeline • Automating setting the image version is not enough • Rolling out Kubernetes manifests still manual task • Updating configuration & secrets still manual task • Duplication in manifests between stages • Not easily seen what parts are different • Differences intentional or accidental? • This actually slows us down • Does git represent the current state? kubectl -n talk get secrets env -o json | jq -r '.data | map_values(@base64d) | to_entries | .[] | .key + "=\"" + .value + "\""'
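What the kubectl/jq one-liner on this slide does can be spelled out in Python. This sketch takes the JSON of a Secret as returned by `kubectl get secrets ... -o json`, base64-decodes each entry under `data`, and prints `KEY="value"` lines. The function name and the sample keys and values are invented for illustration.

```python
import base64
import json


def secret_to_env_lines(secret_json):
    """Turn a Secret's JSON into KEY="value" lines (hypothetical helper)."""
    data = json.loads(secret_json).get("data") or {}
    lines = []
    for key in sorted(data):
        # Secret values are stored base64-encoded under .data.
        value = base64.b64decode(data[key]).decode("utf-8")
        lines.append(f'{key}="{value}"')
    return lines


if __name__ == "__main__":
    # Made-up Secret content; a real one would come from kubectl.
    sample = json.dumps({
        "kind": "Secret",
        "data": {
            "DB_USER": base64.b64encode(b"talk").decode("ascii"),
            "DB_HOST": base64.b64encode(b"db.internal").decode("ascii"),
        },
    })
    for line in secret_to_env_lines(sample):
        print(line)
```

Having to decode live cluster state like this to see what is deployed is the symptom the slide points at: git does not reliably represent the current state.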
  • 61. Helm The package manager for Kubernetes • Charts • Configured via values • It’s like Terraform modules • Or Ansible group_vars • Leveraging community knowledge and efforts • E.g. prometheus-operator • No need to copy charts, able to reference. • Helm v3
  • 62. SOPS: Secrets OPerationS Secrets management stinks, use some sops! • By Mozilla • Manage AWS API access, not keys • Versatile • YAML, JSON, ENV, INI, binary (plain text) • Not limited to Kubernetes • Meaningful diffs • Alternatives considered: • Kamus • Bitnami SealedSecrets
  • 63. Helmfile Wiring it together • Charts • Referenced from online chart sources or local • Environments • Test, staging, production • Referencing values and secrets • Releases • Release name • Reference to chart • Values (can be a templated file, using vars and secrets from environment)
  • 64. Helmfile Wiring it together environment values secrets (SOPS) release X release Y release Z ENV values values values Helmfile
  • 65. Helmfile Wiring it together • Advantages: • Meaningful git diffs • Easily manage multiple releases in single pipeline, e.g.: • Everything related to monitoring and logging • Kube-system • Declarative definition • Of what would otherwise be numerous helm args and steps in CI/CD pipeline
  • 66. Helmfile Wiring it together • Advantages (continued): • Ability to pass in ENV vars • E.g. build result image tags • Ability to reference complex charts created by community • Charts as a building block allows re-use. Example: • Instead of plain yaml you write a chart • If fitting workflow, the chart can be a published artifact • Chart can be re-used e.g. in e2e tests
  • 67. Helmfile Wiring it together • Disadvantages: • 2 levels of templating • Chart itself • Only if writing own charts • Environment & release values into Helm values • Template error message not overly clear • Or even misleading • At least it breaks
  • 73. Helmfile Final words But tiller? • Helm as a templating engine • Option: Using Helm 2 ‘Tillerless’ • Tiller outside of cluster, not by-passing RBAC • Start using Helm as package manager when Helm 3 settles down • Easy removal of temp. per-feature deploys • Diffs
  • 77. Auto-scaling Types of scaling • Reactive • Breaking news • K8S cluster-autoscaler • Can’t schedule pod? Add nodes. • Predictive • Ticket sale start • Black Friday
  • 78. Auto-scaling Types of scaling • From within cluster • K8S cluster-autoscaler • From outside of cluster • ASG scaling policies
  • 79. Auto-scaling Scaling speed node spin-up duration node count 70% utilization
  • 81. Cluster auto-scaler Bag of tricks • Mix predictive and reactive • Add asg instances without telling cluster-autoscaler • Traffic expected to arrive by the time cluster-autoscaler starts to scale in, leaving plenty of resources as needed. • Pause pods • Lower priority pods that can safely be evicted • Effectively ‘creating headroom’ in cluster
  • 82. Considerations When engaging ‘ludicrous mode’™ Can control-plane handle scale? • Kops • Size master nodes for max. cluster size • Overhead cost • EKS • What’s behind the abstraction? • ELB 503s exist after all • Plan: Proofs of concept
  • 84. Consider EKS Managed control plane EKS Kops Managed control plane Total control over setup Easier: EKS IAM roles for pods • Launched 2019-09-04 (yesterday)* Smooth rolling upgrade process Probably cheaper (2/3 of 3x m4.large) No VPC CNI Pod density limitations * https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/
  • 85. EKS IAM roles for pods Also possible on DIY clusters, officially launched yesterday • OIDC federation access (OpenID Connect identity provider) • Assume role via Secure Token Service (STS) • Projected service account tokens (JWT) in pod • STS can validate JWT tokens against OIDC provider • Boils down to: • Enable/set-up prerequisites in cluster • Add ServiceAccount having IAM role annotation to pod • Use recent AWS SDK
  • 86. Multiple clusters per AWS account Don’t lock ourselves in a corner. api.<aws-account-name>.<k8s-sanoma-domain> api.<cluster-name>.<aws-account-name>.<k8s-sanoma-domain> Route53 zone 1 Route53 zone 1Route53 zone 2 NS records
  • 87. CI/CD to separate cluster Similar flows • No more taints and tolerations • Similar authorization mechanism to all deploy targets • Possibly IAM • No need for Jenkins-SU • Clusters should be cattle anyway
  • 88. Pipelines GitOps • Manage namespaces via pipeline: • kube-system • monitor • Creation of application namespaces including RBAC • Helmfile
  • 89. System applications Small improvements • Prometheus-operator • PrometheusRule resource type • Default dashboards • EFS • https://github.com/previousnext/k8s-aws-efs • Current. Works well but not a lot of active development. • 2 contributors. 46 stars. • https://github.com/kubernetes-incubator/external-storage • De facto EFS provisioner. 146 contributors. 1630 stars. • Bonus: No more time-consuming initial volume set-up
  • 90. Expand Increase Return on Investment • Add more applications • Facilitate parallel testing & development workflows • Feature testing • Mobile app development • E2e tests
  • 91. Links Further reading Scaling & spot instances: • https://itnext.io/the-definitive-guide-to-running-ec2-spot-instances-as-kubernetes-worker-nodes-68ef2095e767 EKS: • https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c QoS: • https://www.replex.io/blog/everything-you-need-to-know-about-kubernetes-quality-of-service-qos-classes Failure stories: • https://k8s.af/
  • 93. Know your limits Automate all the things Everything code Kubernetes is a journey, not a destination All should be cattle. No pets allowed!
  • 94. ?