I have the following scenario:
Node 1:
- Pod-A
Node 2:
- No Pods
Pod-A is using a PV in RWO accessMode.
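For reference, a minimal sketch of the kind of PVC involved (names and sizes are hypothetical, not from my actual cluster):

```yaml
# Hypothetical PVC; ReadWriteOnce (RWO) means the volume can be
# mounted read-write by a single node at a time.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pod-a-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```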
When Node 1 fails (the VM is powered off from the hypervisor), the node appears as "NotReady" and Pod-A still shows as "Running" until the pod-eviction-timeout expires.
Then Pod-A enters the "Terminating" state and Kubernetes tries to start a Pod-B on Node 2 (because Node 1 is tainted and the desired state demands one running pod). However, Pod-B cannot start because the PV it needs to claim is RWO and is still attached on behalf of Pod-A (kubectl describe pod Pod-B shows "Multi-attach error for volume").
Pod-A gets stuck in "Terminating" (I suppose because the control plane cannot communicate with the kubelet on Node 1), even after the grace period (supposedly 30 seconds) ends.
If I force the deletion of Pod-A (kubectl delete pods Pod-A --grace-period=0 --force), the pod is successfully deleted. After 6 minutes, Kubernetes realizes that the volume is no longer being used by Pod-A, attaches it successfully to Pod-B, and Pod-B finally starts running normally.
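The force deletion and the checks I do afterwards look roughly like this (a sketch; the pod names are from the scenario above, and the commands obviously need a live cluster):

```shell
# Force-delete the pod stuck in Terminating (bypasses kubelet confirmation)
kubectl delete pod Pod-A --grace-period=0 --force

# The attach/detach controller's view of the volume: each VolumeAttachment
# object shows which node the PV is currently attached to
kubectl get volumeattachment

# Events on the replacement pod show the multi-attach error
# until the detach completes
kubectl describe pod Pod-B
```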
I already found a way to shorten the pod-eviction-timeout (by modifying /etc/kubernetes/manifests/kube-controller-manager.yaml), but I cannot find a way to make Pod-A terminate automatically, nor a way to lower the time it takes Kubernetes to detach the volume from Pod-A once the pod is deleted. Is there a way to make this happen faster? Do I have to install some kind of monitoring service? Which one?
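For completeness, the kind of change I made to the static pod manifest is sketched below (a fragment only; the flag values here are examples, not recommendations — the defaults are --pod-eviction-timeout=5m0s and --node-monitor-grace-period=40s):

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (fragment)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --pod-eviction-timeout=30s         # how long pods stay on a NotReady node
    - --node-monitor-grace-period=20s    # how long until the node is marked NotReady
```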
Thanks in advance.