
Pods stuck in Terminating state in OpenShift

Pods hanging in the Terminating state.

There might be situations where you have already deleted pods (or already removed the deployment configuration) but the pods are stuck in the Terminating state.

In that case, force delete the pod:

kubectl delete pod --grace-period=0 --force --namespace <NAMESPACE> <PODNAME>
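
Since oc is built on kubectl, the equivalent oc command works the same way; a minimal sketch with placeholder namespace and pod names:

oc get pods -n <NAMESPACE> | grep Terminating                      # list pods stuck in Terminating
oc delete pod <PODNAME> -n <NAMESPACE> --grace-period=0 --force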

Network File System (NFS): An nfs volume allows an existing NFS share to be mounted into a Pod. Unlike emptyDir, which is erased when a Pod is removed, the contents of an nfs volume are preserved and the volume is merely unmounted. This means that an NFS volume can be pre-populated with data, and that data can be shared between pods. NFS can be mounted by multiple writers simultaneously.
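
For illustration, a minimal Pod spec that mounts an NFS share could look like the sketch below; the server address, export path, and image are placeholders, not values from this document.

apiVersion: v1
kind: Pod
metadata:
  name: nfs-example
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi    # example image, replace with your own
    volumeMounts:
    - name: nfs-data
      mountPath: /usr/share/data                  # where the share appears inside the container
  volumes:
  - name: nfs-data
    nfs:
      server: nfs.example.com                     # placeholder NFS server
      path: /exports/shared                       # placeholder exported path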

Trident is an open-source storage provisioner.

The simplicity you can obtain by using Trident to dynamically create PVCs, coupled with its production-grade CSI drivers and data management capabilities, makes it a key option for stateful storage requirements on OpenShift. Applications generate data, and access to storage should be painless and on-demand.

1. It is the first out-of-tree, out-of-process storage provisioner that works by watching events at the Kubernetes API server, affording it levels of visibility and flexibility that cannot otherwise be achieved.

2. It is capable of orchestrating across multiple platforms at the same time through a unified
interface.
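
As an illustration of the dynamic provisioning described above, a StorageClass backed by the Trident CSI driver and a PVC that uses it might look like the sketch below; the class name and backendType are assumptions, not values taken from this document.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: trident-nas                        # hypothetical class name
provisioner: csi.trident.netapp.io         # Trident CSI provisioner
parameters:
  backendType: ontap-nas                   # assumed backend type, adjust to your Trident backend
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data                           # hypothetical claim name
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: trident-nas            # Trident provisions the backing volume on demand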

Reboot node without causing application outages?

To reboot a node without causing an outage for applications running on the platform, it is important
to first evacuate the pods.

For pods that are made highly available by the routing tier, nothing else needs to be done. For other
pods needing storage, typically databases, it is critical to ensure that they can remain in operation
with one pod temporarily going offline.
Currently, the easiest way to manage node reboots is to ensure that there are at least three nodes available to run the infrastructure. The nodes that run this infrastructure (such as the registry and router) are referred to as infrastructure nodes.

The scenario below demonstrates a common mistake that can lead to service interruptions for the
applications running on OpenShift Container Platform when only two nodes are available.

Node A is marked unschedulable and all pods are evacuated.

The registry pod running on that node is now redeployed on node B. This means node B is now
running both registry pods.

Node B is now marked unschedulable and is evacuated.

The service exposing the two pod endpoints on node B, for a brief period of time, loses all endpoints
until they are redeployed to node A.

The same process using three infrastructure nodes does not result in a service disruption. However, due to pod scheduling, the last node that is evacuated and brought back into rotation is left running zero registries. The other two nodes run two and one registries respectively. The best solution is to rely on pod anti-affinity.

Pod anti-affinity: with this in place, if only two infrastructure nodes are available and one is rebooted, the container image registry pod is prevented from running on the other node. oc get pods reports the pod as unready until a suitable node is available. Once a node is available and all pods are back in the Ready state, the next node can be restarted.
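
A hedged sketch of such a rule, added to the registry deployment's pod template, is shown below; the docker-registry=default label is what OpenShift 3.x registry pods typically carry, but verify it against your own cluster.

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            docker-registry: default           # label on the registry pods (verify in your cluster)
        topologyKey: kubernetes.io/hostname    # never place two matching pods on the same node

With this rule, two registry replicas always land on separate nodes, so evacuating one node never leaves both replicas on a single host.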

Scenarios

1. Pod is in Pending state:

We need to check whether the node is working fine or not.

To check for problems with the node:

- Check whether the docker service and the atomic-openshift-node.service are running, along with a few other things, using the commands below

systemctl status docker

systemctl status atomic-openshift-node.service

If either of these is not running, there is a problem with the node.

If we find that the node is not healthy, we have to mark it as unschedulable:

oc adm manage-node <node-name> --schedulable=false


or

oc adm cordon <node-name>

This ensures that new pods get scheduled onto other nodes instead.

Investigate the problem, apply the fix, and then mark the node schedulable again.
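
A short sketch of that flow (the node name is a placeholder, and restarting the services is only one possible remedy):

oc get nodes                                      # confirm which node is NotReady / SchedulingDisabled
oc describe node <node-name>                      # check Conditions and recent Events
systemctl restart docker atomic-openshift-node    # on the node itself, restart the failed services
oc adm uncordon <node-name>                       # mark the node schedulable again once it is healthy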

2. Evacuate a node?

Evacuating pods performs graceful termination by restarting them on another node prior to terminating them on the existing node.

a. Set the desired node to unschedulable, preventing new workloads from arriving on the node
b. Timeout periods apply before a pod is forcefully terminated
c. Some pods are never terminated gracefully based on their schedule type, such as DaemonSet pods

oc adm manage-node <node-name> --evacuate

oc adm drain <node-name>

Step 1: mark the node as unschedulable

Step 2: drain the node

(we drain the node when the docker service is not running, a pod is in Pending state, or there is a disk space issue)

Step 3: once the issue is fixed, mark the node schedulable again

oc adm manage-node <node-name> --schedulable=true

oc adm uncordon <node-name>
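
Putting the three steps together, a typical sequence looks like the sketch below; the --ignore-daemonsets flag is needed because DaemonSet pods are never evicted, as noted above.

oc adm cordon <node-name>                         # Step 1: mark the node unschedulable
oc adm drain <node-name> --ignore-daemonsets      # Step 2: evict pods gracefully onto other nodes
# ... fix the docker service, disk space, or other issue on the node ...
oc adm uncordon <node-name>                       # Step 3: mark the node schedulable again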

3. Pod is in Pending state

If a Pod is stuck in Pending it means that it cannot be scheduled onto a node. Generally
this is because there are insufficient resources of one type or another that prevent scheduling.

Step 1: oc describe pod <pod-name>

Reason 1: if the issue is that the cluster is full or no nodes are available to schedule the pod, we need to add more nodes to the cluster.

Reason 2: if the issue is a resource quota limit, increase the quota of the project.

Reason 3: if the issue is a PVC stuck in Pending, fix the PVC so it can be bound and mounted (it is referenced in the deployment config file).

Reason 4: check whether the pod has been assigned to a node. If it is assigned to a node and still Pending, there is an issue with the atomic-openshift-node service (not running / in error); if it is not assigned to any node, the issue is with the scheduler.
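
A few commands help identify which of these reasons applies (a sketch; pod and project names are placeholders):

oc describe pod <pod-name>       # the Events section shows FailedScheduling reasons
oc get pod <pod-name> -o wide    # the NODE column shows whether the pod was assigned to a node
oc describe quota -n <project>   # check whether a resource quota is exhausted
oc get pvc                       # check for PVCs stuck in Pending
oc get nodes                     # check overall node readiness and capacity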

4. Pod is not in Running state

If a Pod is stuck in the Waiting state, then it has been scheduled to a worker node, but it can't
run on that machine.
Step 1: check the logs of the pod

oc logs -f <pod-name>

If we can see the error in the log we can fix the issue in the code.

Reason 1: pod state is ImagePullBackOff

This arises when there is a problem pulling the image.

Step 1: check whether the image name is correct

Step 2: check whether the image tag is correct

Step 3: check whether the image is being pulled from a private or public registry. If we are pulling images from a private registry, we need to configure that option explicitly (see the sketch below); if we are pulling from a public registry and still getting ImagePullBackOff, there is a problem with (a) the internet connection or (b) the atomic-openshift-node.service
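
Configuring the private-registry option typically means creating an image pull secret and linking it to the service account that runs the pod; a sketch with placeholder registry and credential values:

oc create secret docker-registry my-registry-secret \
    --docker-server=registry.example.com \
    --docker-username=<user> --docker-password=<password>    # hypothetical secret name and registry
oc secrets link default my-registry-secret --for=pull        # let the default service account pull with it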

Reason 2: pod status is CrashLoopBackOff

Step 1: check the logs and fix the issue in the code

Step 2: check whether the CMD instruction was forgotten in the Dockerfile

Step 3: check whether the pod is restarting frequently (fix the issue in the code or tune the liveness probe, as sketched below)
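
If the restarts come from a liveness probe that is too aggressive, the probe in the deployment config can be tuned; the sketch below assumes an HTTP health endpoint at /healthz on port 8080, which is an assumption about the application.

livenessProbe:
  httpGet:
    path: /healthz            # assumed health endpoint exposed by the application
    port: 8080
  initialDelaySeconds: 30     # give the application time to start before probing
  periodSeconds: 10
  failureThreshold: 3         # restart only after three consecutive failures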

Reason 3: RunContainerError -> restart the pods

The issue is likely with mounting the PVC; otherwise, look up the specific error (for example, on Stack Overflow).

5. Pods are in Not Ready state

Step 1: describe the pod

oc describe pod <pod-name>

Step 2: check whether the readiness probe is failing

Step 3: check whether a readiness probe is configured or not, and test its endpoint with a curl -kv .. command
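
For reference, a readiness probe in the deployment config looks like the sketch below (the /ready path and port are assumptions about the application); the curl -kv command above can be run against the same endpoint to confirm what the kubelet sees.

readinessProbe:
  httpGet:
    path: /ready              # assumed readiness endpoint of the application
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5            # the pod is marked NotReady while this check fails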
