There are a number of reasons why pods fail to reach a running state. Missing required resources was covered in Part 1. Another scenario happens after the pod is successfully scheduled on a node. When the pod tries to start on its node and its container crashes or exits unexpectedly, the restartPolicy field of the pod's PodSpec determines what happens next. If set to Never, the pod may just fail and stop. If set to Always, which is the default, the pod can go into a never-ending loop of exiting, trying again, and exiting again… In this case its status will show as CrashLoopBackOff.
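A quick way to see this in action - the commands below are a generic sketch, with a placeholder pod name to substitute - is to read the restartPolicy straight off the pod object and then watch the status cycle as the restart count climbs:

# Print the pod's restartPolicy (Always, OnFailure or Never)
kubectl get pod <pod-name> -o jsonpath='{.spec.restartPolicy}'

# Watch the STATUS column, which typically flips between Error, Running and CrashLoopBackOff
kubectl get pod <pod-name> --watch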
I ran into an example of this when installing Rancher Labs' very interesting distributed storage project called Longhorn - that's a good topic for a future post on its own. After installing Longhorn, I noticed the pods stuck in CrashLoopBackOff:
[centos@wrk1 ~]$ kubectl get pods -n longhorn-system
NAME                                        READY   STATUS             RESTARTS   AGE
longhorn-driver-deployer-69f889d6c7-5zcsf   0/1     Init:0/1           0          12h
longhorn-manager-5mc6p                      0/1     CrashLoopBackOff   149        12h
longhorn-manager-cp7xp                      0/1     CrashLoopBackOff   148        12h
longhorn-manager-jrw8x                      0/1     CrashLoopBackOff   149        12h
longhorn-manager-kwdjk                      0/1     CrashLoopBackOff   149        12h
longhorn-manager-r8qzz                      0/1     CrashLoopBackOff   148        12h
longhorn-manager-wq7z8                      0/1     CrashLoopBackOff   149        12h
longhorn-ui-789db56875-qqk7w                1/1     Running            0          12h
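Besides kubectl get, kubectl describe is a useful first check for a pod in this state - its Events section typically contains kubelet's "Back-off restarting failed container" messages along with the last exit code. For example:

[centos@wrk1 ~]$ kubectl -n longhorn-system describe pod longhorn-manager-5mc6p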
Investigating deeper, I found the cause in the pod logs. Longhorn requires the
open-iscsi package to handle mounting volumes after they are provisioned. Makes sense.
[centos@wrk1 ~]$ kubectl -n longhorn-system logs longhorn-manager-5mc6p
time="2019-01-03T13:19:29Z" level=error msg="Failed environment check, please make sure you have iscsiadm/open-iscsi installed on the host"
time="2019-01-03T13:19:29Z" level=fatal msg="Error starting manager: Environment check failed: Failed to execute: nsenter [--mount=/host/proc/1004/ns/mnt --net=/host/proc/1004/ns/net iscsiadm --version], output nsenter: failed to execute iscsiadm: No such file or directory\n, error exit status 1"
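One tip here: when a container crashes too quickly to capture anything from its current run, kubectl's --previous flag pulls the logs of the last terminated instance instead, which is often where the real error is:

[centos@wrk1 ~]$ kubectl -n longhorn-system logs longhorn-manager-5mc6p --previous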
I installed the missing package with sudo yum -y install iscsi-initiator-utils on each node, then deleted the pods. These pods were started by a DaemonSet, which means they get recreated automatically. The new instances started, made it to Running status, and the storage system was working.
[centos@mgr1 ~]$ kubectl get pods -n longhorn-system
NAME                                        READY   STATUS    RESTARTS   AGE
engine-image-ei-3bda103d-4wm5v              1/1     Running   0          15m
engine-image-ei-3bda103d-72jh6              1/1     Running   0          15m
engine-image-ei-3bda103d-8c5qx              1/1     Running   0          15m
engine-image-ei-3bda103d-8l5ql              1/1     Running   0          15m
engine-image-ei-3bda103d-b48jx              1/1     Running   0          15m
engine-image-ei-3bda103d-fmkkr              1/1     Running   0          15m
longhorn-driver-deployer-69f889d6c7-gm28b   1/1     Running   0          15m
longhorn-flexvolume-driver-64ldd            1/1     Running   0          8m14s
longhorn-flexvolume-driver-8s2zf            1/1     Running   0          8m14s
longhorn-flexvolume-driver-9sz25            1/1     Running   0          8m14s
longhorn-flexvolume-driver-dqdnx            1/1     Running   0          8m14s
longhorn-flexvolume-driver-p7xv9            1/1     Running   0          8m14s
longhorn-flexvolume-driver-vjbz4            1/1     Running   0          8m14s
longhorn-manager-97fbv                      1/1     Running   0          6m27s
longhorn-manager-9ljbq                      1/1     Running   0          6m39s
longhorn-manager-bvdgc                      1/1     Running   0          6m53s
longhorn-manager-jrw8x                      1/1     Running   151        12h
longhorn-manager-mmwml                      1/1     Running   0          6m52s
longhorn-manager-zprn7                      1/1     Running   0          6m48s
longhorn-ui-789db56875-qqk7w                1/1     Running   0          12h
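For reference, the delete step can be done in one shot with a label selector - this sketch assumes the DaemonSet's pods carry an app=longhorn-manager label, which may vary by Longhorn version:

[centos@mgr1 ~]$ kubectl -n longhorn-system delete pods -l app=longhorn-manager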
Many things can cause crash loops. One happens any time the container's process consistently exits and Kubernetes is configured to restart it, causing a loop of start-stop crashes. Other examples include failing liveness probes, misconfigured commands or entrypoints, missing configuration files or environment variables, and containers killed for exceeding their memory limits.
Most of the time it's easy to fix once you track down the cause of the crash. Look at the pod logs; the error usually shows up there. Then fix the cause and restart the pods.
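As a generic sketch of that workflow (namespace and pod names are placeholders):

# Find the crashing pod
kubectl -n <namespace> get pods

# Read the error, using --previous if the current container dies too fast
kubectl -n <namespace> logs <pod-name> --previous

# After fixing the cause, delete the pod so its controller recreates it
kubectl -n <namespace> delete pod <pod-name>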