Kubernetes Troubleshooting
Kubernetes troubleshooting can be very complex. This article focuses on three pillars: understanding, managing, and preventing problems.
Understanding
In a Kubernetes environment, it can be very difficult to understand what happened and
determine the root cause of the problem. This typically involves:
• Reviewing recent changes to the affected cluster, pod, or node, to see what caused the failure.
• Analyzing YAML configurations, GitHub repositories, and logs for VMs or bare metal machines running the malfunctioning components.
• Looking at Kubernetes events and metrics such as disk pressure, memory pressure, and utilization. In a mature environment, you should have access to dashboards that show important metrics for clusters, nodes, pods, and containers over time.
• Checking whether similar components are behaving the same way, and analyzing dependencies between components, to see if they are related to the failure.
Management
In a microservices architecture, it is common for each component to be developed and
managed by a separate team. Because production incidents often involve multiple
components, collaboration is essential to remediate problems fast.
Once the issue is understood, there are three approaches to remediating it:
Prevention
Successful teams make prevention their top priority. Over time, this will reduce the time
invested in identifying and troubleshooting new issues. Preventing production issues in
Kubernetes involves:
• Creating policies, rules, and playbooks after every incident to ensure effective remediation
• Investigating if a response to the issue can be automated, and how
• Defining how to identify the issue quickly next time around and make the relevant data available—for example by instrumenting the relevant components
• Ensuring the issue is escalated to the appropriate teams, and that those teams can communicate effectively to resolve it
Even in a small, local Kubernetes cluster, it can be difficult to diagnose and resolve
issues, because an issue can represent a problem in an individual container, in one or
more pods, in a controller, a control plane component, or more than one of these.
In a large-scale production environment, these issues are exacerbated, due to the low
level of visibility and a large number of moving parts. Teams must use multiple tools to gather the data required for troubleshooting, and may need additional tools to diagnose and resolve the issues they detect.
In short – Kubernetes troubleshooting can quickly become a mess, waste major resources
and impact users and application functionality – unless teams closely coordinate and
have the right tools available.
The following sections cover some of the most common errors you are likely to encounter:
• CreateContainerConfigError
• ImagePullBackOff or ErrImagePull
• CrashLoopBackOff
• Kubernetes Node Not Ready
CreateContainerConfigError
This error is usually the result of a missing Secret or ConfigMap. Secrets are Kubernetes
objects used to store sensitive information like database credentials. ConfigMaps store
data as key-value pairs, and are typically used to hold configuration information used by
multiple pods.
Run the following command to see if the ConfigMap referenced by the pod exists in the cluster:
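A minimal sketch, assuming you have identified the ConfigMap name from the pod spec (the name and namespace below are placeholders):

    kubectl get configmap [configmap-name] -n [namespace]
    # If the ConfigMap is missing, the command returns a "not found" error.
    # The same check works for a missing Secret:
    # kubectl get secret [secret-name] -n [namespace]

If the object does not exist, create it (or fix the reference in the pod manifest) and recreate the pod.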
ImagePullBackOff or ErrImagePull
This status means that a pod could not run because it attempted to pull a container image
from a registry, and failed. The pod refuses to start because it cannot create one or more
containers defined in its manifest.
• Wrong image name or tag—this typically happens because the image name or tag was typed incorrectly in the pod manifest. Verify the correct image name using docker pull, and correct it in the pod manifest.
• Authentication issue in the container registry—the pod could not authenticate with the registry to retrieve the image. This could happen because of an issue in the Secret holding credentials, or because the pod does not have an RBAC role that allows it to perform the operation. Ensure the pod and node have the appropriate permissions and Secrets, then try the operation manually using docker pull (see the example after this list).
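A quick way to tell these two causes apart is to inspect the pod's events and then try pulling the image by hand. A minimal sketch, assuming Docker is installed on the machine you are testing from (all names are placeholders):

    kubectl describe pod [pod-name]
    # Look at the Events section at the bottom for the exact pull error message.

    docker pull [registry]/[image-name]:[tag]
    # If this fails with "not found", the image name or tag is wrong; if it fails
    # with an authentication error, check the imagePullSecrets referenced by the pod.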
CrashLoopBackOff
This issue indicates a pod cannot be scheduled on a node. This could happen because the
node does not have sufficient resources to run the pod, or because the pod did not
succeed in mounting the requested volumes.
• Insufficient resources—if there are insufficient resources on the node, you can manually evict pods from the node or scale up your cluster to ensure more nodes are available for your pods (see the example commands after this list).
• Volume mounting—if you see the issue is mounting a storage volume, check which volume the pod is trying to mount, ensure it is defined correctly in the pod manifest, and see that a storage volume with those definitions is available.
• Use of hostPort—if you are binding pods to a hostPort, you may only be able to schedule one pod per node. In most cases you can avoid using hostPort and use a Service object to enable communication with your pod.
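To check whether resources are the problem, compare what the node has allocated with what pods are requesting. A minimal sketch (the node name is a placeholder):

    kubectl describe node [node-name]
    # Review the "Allocated resources" section to see how much CPU and memory
    # is already requested on the node.

    kubectl get events --field-selector reason=FailedScheduling
    # Lists scheduling failures, including messages such as insufficient cpu or memory.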
Kubernetes Node Not Ready
To check whether pods scheduled on your node are being moved to other nodes, run kubectl get pods with the -o wide flag, which adds a NODE column to the output. Check the output to see if a pod appears twice on two different nodes, as in the sketch below.
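A minimal sketch of the check (the interpretation in the comments assumes a node failure like the one described next):

    kubectl get pods -o wide
    # Look for the same pod name listed twice: typically once with status Unknown
    # on the failed node, and once with status ContainerCreating or Running on a
    # healthy node.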
If the failed node is able to recover or is rebooted by the user, the issue will resolve
itself. Once the failed node recovers and joins the cluster, the following process takes
place:
1. The pod with Unknown status is deleted, and volumes are detached from the
failed node.
2. The pod is rescheduled on the new node, its status changes
from Unknown to ContainerCreating and required volumes are attached.
3. Kubernetes uses a five-minute timeout (by default), after which the pod will run
on the node, and its status changes from ContainerCreating to Running.
If you have no time to wait, or the node does not recover, you’ll need to help Kubernetes
reschedule the stateful pods on another, working node. There are two ways to achieve
this:
• Remove the failed node from the cluster—using the command kubectl delete node [name]
• Delete stateful pods with status Unknown—using the command kubectl delete pods [pod_name] --grace-period=0 --force -n [namespace]
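For example, with hypothetical names (a failed node called node-2 and a stuck pod called mysql-0 in the default namespace):

    kubectl delete node node-2
    # Removes the failed node from the cluster so its pods can be rescheduled.

    kubectl delete pods mysql-0 --grace-period=0 --force -n default
    # Force-deletes the stuck pod so its controller (for example a StatefulSet) can
    # recreate it on a healthy node. Use force deletion with care for stateful workloads.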
To diagnose problems at the pod level, run kubectl describe pod [pod-name]. The output will look something like this:
Name: nginx-deployment-1006230814-6winp
Namespace: default
Node: kubernetes-node-wul5/10.240.0.9
Labels: app=nginx,pod-template-hash=1006230814
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nginx-deployment-1956810328","uid":"14e607e7-8ba1-11e7-b5cb-fa16"
...
Status: Running
IP: 10.244.0.6
Controllers: ReplicaSet/nginx-deployment-1006230814
Containers:
nginx:
Container ID: docker://90315cc9f513c724e9957a4788d3e625a078de84750f244a40f97ae355eb1149
Image: nginx
Image ID: docker://6f62f48c4e55d700cf3eb1b5e33fa051802986b77b874cc351cce539e5163707
Port: 80/TCP
QoS Tier:
cpu: Guaranteed
memory: Guaranteed
Limits:
cpu: 500m
memory: 128Mi
Requests:
memory: 128Mi
cpu: 500m
State: Running
Ready: True
Restart Count: 0
Environment: [none]
Mounts:
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
default-token-4bcbi:
SecretName: default-token-4bcbi
Optional: false
Node-Selectors: [none]
Tolerations: [none]
Events:
The key sections of this output are:
• Name—below this line are basic data about the pod, such as the node it is running on, its labels, and its current status.
• Status—this is the current state of the pod, which can be:
o Pending
o Running
o Succeeded
o Failed
o Unknown
• Containers—below this line is data about containers running on the pod (only one in this example, called nginx).
• Containers:State—this indicates the status of the container, which can be:
o Waiting
o Running
o Terminated
• Volumes—storage volumes, Secrets, or ConfigMaps mounted by containers in the pod.
• Events—recent events occurring on the pod, such as images pulled, containers created, and containers started.
If a pod’s status is Waiting, this means it is scheduled on a node, but unable to run. Look
at the describe pod output, in the ‘Events’ section, and try to identify reasons the pod is
not able to run.
Most often, this will be due to an error when fetching the image. If so, check that the image name and tag in the pod manifest are correct, that the image actually exists in the registry, and that the node has the credentials needed to pull it (you can verify this manually with docker pull, as described above).
Try deleting the pod and recreating it with kubectl apply --validate -f mypod1.yaml
This command will give you an error like the following if you misspelled a field in the pod manifest, for example if you wrote continers instead of containers:
pods/mypod1
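For reference, a hypothetical mypod1.yaml with this kind of typo might look like the following (the names and image are placeholders; note the misspelled continers key):

    apiVersion: v1
    kind: Pod
    metadata:
      name: mypod1
    spec:
      continers:        # typo: should be "containers"
      - name: app
        image: nginx:1.25

Running kubectl apply --validate against a manifest like this flags the unknown field rather than accepting the manifest.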
Checking for a mismatch between local pod manifest and API Server
It can happen that the pod manifest, as recorded by the Kubernetes API Server, is not the
same as your local manifest—hence the unexpected behavior.
Run this command to retrieve the pod manifest from the API server and save it as a local
YAML file:
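A minimal sketch (the pod name and namespace are placeholders; the output file name matches the convention used below):

    kubectl get pod [pod-name] -n [namespace] -o yaml > apiserver-[pod-name].yaml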
You will now have a local file called apiserver-[pod-name].yaml. Open it and compare it with your local YAML. There are three possible cases:
• Local YAML has the same lines as API Server YAML, and more—this indicates a mismatch. Delete the pod and rerun it with the local pod manifest (assuming it is the correct one).
• API Server YAML has the same lines as local YAML, and more—this is normal, because the API Server can add more lines to the pod manifest over time. The problem lies elsewhere.
• Both YAML files are identical—again, this is normal, and means the problem lies elsewhere.
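A simple way to do the comparison is with diff; a sketch, assuming your local manifest is called mypod1.yaml:

    diff mypod1.yaml apiserver-mypod1.yaml
    # Lines present only in the local file (the first case above) point to a mismatch;
    # lines present only in the API server copy are usually defaults added by Kubernetes.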
If the container has crashed, you can use the --previous flag to retrieve its crash log, like
so:
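A minimal sketch (pod and container names are placeholders):

    kubectl logs [pod-name] -c [container-name] --previous
    # Prints the log of the previous, crashed instance of the container.

If the running container does not include a shell or debugging tools, you can attach an ephemeral debug container to the pod first; a sketch, assuming the busybox image is acceptable in your environment:

    kubectl debug -it [pod-name] --image=busybox --target=[container-name]
    # Adds an ephemeral container to the pod that targets the named container's
    # process namespace, so you can inspect it interactively.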
You can now run kubectl exec on your new ephemeral container, and use it to debug
your production container.
Running a Debug Pod on the Node
If none of these approaches work, you can create a special pod on the node, running in
the host namespace with host privileges. This method is not recommended in production
environments for security reasons.
Run a special debug pod on your node using kubectl debug node/[node-name] -it --image=[image-name].
After running the debug command, kubectl will show a message with the name of your new debugging pod—take note of this name so you can work with it.
Note that the new pod runs a container in the host IPC, Network, and PID namespaces.
The root filesystem is mounted at /host.
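Once inside the debug pod, you can work with the node through the /host mount. A sketch, assuming the debug image provides a shell and the node runs systemd:

    chroot /host            # treat the node's root filesystem as your own
    journalctl -u kubelet   # for example, inspect kubelet logs on the node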
When finished with the debugging pod, delete it using kubectl delete pod [debug-pod-name].
To see a list of worker nodes and their status, run kubectl get nodes --show-labels and check the STATUS column for nodes that are not in the Ready state.
To retrieve basic information about the cluster's control plane and core services, run kubectl cluster-info.
API Server VM Shuts Down or apiserver Crashing
• Impact: If the API server is down, you will not be able to start, stop, or update pods and services.
• Resolution: Restart the API server VM.
• Prevention: Set the API server VM to automatically restart, and set up high availability for the API server.
Worker Node Shuts Down
• Impact: Pods on the node stop running, and the Scheduler will attempt to run them on other available nodes. The cluster will now have less overall capacity to run pods.
• Resolution: Identify the issue on the node, bring it back up, and register it with the cluster.
• Prevention: Use a replication controller (or Deployment) and a Service in front of pods, to ensure users are not impacted by node failures. Design applications to be fault tolerant (see the example manifest below).
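A minimal sketch of this prevention pattern, assuming a stateless nginx workload (names, image, and replica count are placeholders): a Deployment keeps several replicas running across nodes, and a Service gives clients a stable endpoint regardless of which node the pods land on.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      replicas: 3                 # enough replicas to survive the loss of a single node
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
          - name: nginx
            image: nginx:1.25
            ports:
            - containerPort: 80
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: web
    spec:
      selector:
        app: web
      ports:
      - port: 80
        targetPort: 80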
Kubelet Malfunction
• Impact: If the kubelet crashes on a node, you will not be able to start new pods on that node. Existing pods may or may not be deleted, and the node will be marked unhealthy.
• Resolution: Same as Worker Node Shuts Down.
• Prevention: Same as Worker Node Shuts Down.
Network Partition
• Impact: The master nodes think that nodes in the other network partition are down, and those nodes cannot communicate with the API Server.
• Resolution: Reconfigure the network to enable communication between all nodes and the API Server.
• Prevention: Use a networking solution that can automatically reconfigure cluster network parameters.