Pod Terminating Stuck OpenShift
There might be situations where you have already deleted pods (or already removed the deployment
configuration) but the pods are stuck in the Terminating state.
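To clear a pod that is stuck like this, we can force the deletion and, if a finalizer is holding it back, remove the finalizer. A minimal sketch (<pod-name> and <namespace> are placeholders):

# Confirm the pod is stuck in Terminating
oc get pod <pod-name> -n <namespace>

# Force the deletion, skipping the grace period
oc delete pod <pod-name> -n <namespace> --grace-period=0 --force

# If it still hangs, clear any finalizers blocking the deletion
oc patch pod <pod-name> -n <namespace> -p '{"metadata":{"finalizers":null}}'

Use the force options with care: the API object is removed immediately, but the kubelet may still be cleaning up the container on the node.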
Network File System (NFS): An nfs volume allows an existing NFS share to
be mounted into a Pod. Unlike emptyDir, which is erased when a Pod is removed, the
contents of an nfs volume are preserved and the volume is merely unmounted. This means
that an NFS volume can be pre-populated with data, and that data can be shared between
pods. NFS can be mounted by multiple writers simultaneously.
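As a minimal sketch of how such a volume is declared, the pod below mounts a share from a hypothetical NFS server (the server address, export path, and image are assumptions):

cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nfs-example
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: nfs-vol
      mountPath: /data        # contents survive pod deletion
  volumes:
  - name: nfs-vol
    nfs:
      server: nfs.example.com # hypothetical NFS server
      path: /exports/shared   # hypothetical export path
EOF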
The simplicity you can obtain by using Trident for dynamically creating PVCs, coupled with
its production-grade CSI drivers and data management capabilities, makes it a key option for
stateful storage requirements on OpenShift. Applications generate data, and access to
storage should be painless and on-demand.
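With Trident in place, dynamic provisioning comes down to requesting a PVC against a Trident-backed StorageClass. A minimal sketch, assuming a StorageClass named trident-nas already exists (the class name and size are assumptions):

cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: trident-nas  # assumed Trident-backed class
EOF

Trident sees the claim and provisions a matching PersistentVolume on the backend automatically.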
Trident is also capable of orchestrating across multiple platforms at the same time through a unified
interface.
To reboot a node without causing an outage for applications running on the platform, it is important
to first evacuate the pods.
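A typical evacuate-and-reboot sequence looks like this sketch (<node-name> is a placeholder; on older clusters the second flag is spelled --delete-local-data):

# Mark the node unschedulable and evict its pods
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Reboot the node, then bring it back into rotation
oc adm uncordon <node-name>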
For pods that are made highly available by the routing tier, nothing else needs to be done. For other
pods needing storage, typically databases, it is critical to ensure that they can remain in operation
with one pod temporarily going offline.
Currently, the easiest way to manage node reboots is to ensure that there are at least three nodes
available to run infrastructure. The nodes that run this infrastructure (for example, the registry and
the router) are called infrastructure nodes.
The scenario below demonstrates a common mistake that can lead to service interruptions for the
applications running on OpenShift Container Platform when only two nodes are available.
Node A is marked unschedulable and all of its pods are evacuated. The registry pod running on that
node is now redeployed on node B, which means node B is running both registry pods.
When node B is evacuated in turn, the service exposing the two pod endpoints on node B loses all
endpoints for a brief period of time, until they are redeployed to node A.
The same process using three infrastructure nodes does not result in a service disruption.
However, due to pod scheduling, the last node that is evacuated and brought back into rotation is
left running zero registries; the other two nodes run two and one registries respectively. The
best solution is to rely on pod anti-affinity.
Pod anti-affinity: with this in place, if only two infrastructure nodes are available and one is rebooted,
the container image registry pod is prevented from running on the other node. oc get pods reports
the pod as unready until a suitable node is available. Once a node is available and all pods are back
in a ready state, the next node can be restarted.
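A minimal sketch of such an anti-affinity rule, assuming the registry pods carry the label docker-registry=default (the label and image are assumptions; in practice the rule goes into the registry's deployment configuration):

cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: registry-anti-affinity-example
  labels:
    docker-registry: default
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            docker-registry: default   # keep registry replicas apart
        topologyKey: kubernetes.io/hostname
  containers:
  - name: registry
    image: openshift/origin-docker-registry
EOF

With this rule, two pods carrying the docker-registry=default label can never land on the same node, which is exactly the behaviour described above.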
Scenarios
1. Check whether the docker service and atomic-openshift-node.service are running, along with a
few other basic node checks; if either of these is not running, there is a problem with the node
(see the commands after this list).
2. Evacuate a node? (We drain the node when the docker service is not running, when a pod is stuck
in Pending, or when there is a disk-space issue on the node; the commands are shown after this list.)
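The checks and the evacuation can be done along these lines on an OpenShift 3.x node (<node-name> is a placeholder; on newer clusters oc adm drain alone is the supported path):

# On the node: verify the container runtime and the node agent are up
systemctl status docker
systemctl status atomic-openshift-node

# Mark the node unschedulable and evacuate its pods
oc adm manage-node <node-name> --schedulable=false
oc adm drain <node-name> --ignore-daemonsets

# When the node is healthy again, bring it back
oc adm manage-node <node-name> --schedulable=true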
If a Pod is stuck in Pending it means that it cannot be scheduled onto a node. Generally
this is because there are insufficient resources of one type or another that prevent scheduling.
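The scheduler's reasoning shows up in the pod's events, so that is the first place to look (<pod-name> and <namespace> are placeholders):

# The Events section at the bottom explains why scheduling failed
oc describe pod <pod-name> -n <namespace>

# Or scan recent events across the whole namespace
oc get events -n <namespace> --sort-by=.metadata.creationTimestamp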
Reason 1: If the cluster is full and no nodes are available to schedule the pod, we need to add more
nodes to the cluster.
Reason 2: If the issue is a resource quota limit, increase the quota of the project.
Reason 3: If the issue is a PVC stuck in Pending, fix the underlying PV binding so the claim can be
mounted (and reference the claim in the deployment config file).
Reason 4: Check whether the pod has been assigned to a node. If it is assigned to a node and still
Pending, the atomic-openshift-node service on that node is likely not running or in an error state; if
the pod is not assigned to any node, the issue is with the scheduler. The checks after this list help
narrow each of these down.
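A few quick checks that map to the reasons above (names are placeholders):

# Reason 2: inspect the project's quota and current usage
oc describe quota -n <namespace>

# Reason 3: look for claims stuck in Pending
oc get pvc -n <namespace>

# Reason 4: the NODE column shows whether the pod was assigned
oc get pod <pod-name> -n <namespace> -o wide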
If a Pod is stuck in the Waiting state, then it has been scheduled to a worker node, but it can't
run on that machine.
Step 1: Check the logs of the pod:
oc logs -f <pod-name>
If we can see the error in the log, we can fix the issue in the code.
Step 2: Check whether the image is pulled from a private or a public registry. If we are pulling
images from a private registry, we need to configure the pull secret explicitly (see the sketch after
these steps); if we are pulling from a public registry and still getting ImagePullBackOff, there is a
problem with either (a) internet connectivity or (b) the atomic-openshift-node service.
Step 3: Check whether we have forgotten the CMD instruction in the Dockerfile.
Step 4: Check whether the pod is restarting frequently (fix the issue in the code or in the liveness
probe; see the sketch after these steps).
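For step 2, a pull secret for a private registry can be created and linked roughly like this (the secret name, registry address, and credentials are placeholders); for step 4, the RESTARTS column points at crash loops:

# Step 2: create a pull secret and link it to the default service account
oc create secret docker-registry my-pull-secret \
  --docker-server=registry.example.com \
  --docker-username=<user> \
  --docker-password=<password>
oc secrets link default my-pull-secret --for=pull

# Step 4: a high RESTARTS count suggests a crash loop or an
# over-aggressive liveness probe
oc get pod <pod-name>
oc describe pod <pod-name>   # shows probe config and last state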