I have the following scenario:
Node 1:
- Pod-A
Node 2:
- No Pods
Pod-A is using a PV in RWO accessMode.
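For reference, a minimal sketch of the kind of PVC involved (names and sizes are hypothetical, not from my actual cluster):

```yaml
# Hypothetical PVC; ReadWriteOnce (RWO) means the volume can be
# mounted read-write by a single node at a time.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pod-a-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```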
When Node 1 fails (the VM is powered off from the hypervisor), the node appears as "NotReady" and Pod-A still shows as "Running" until the pod-eviction-timeout expires.
Then Pod-A enters the "Terminating" state and Kubernetes tries to start a Pod-B on Node 2 (because Node 1 is tainted and the desired state demands one running pod). However, Pod-B cannot start because the PV it needs to claim is RWO and is still attached on behalf of Pod-A (kubectl describe pod Pod-B shows "Multi-attach error for volume").
Pod-A gets stuck in "Terminating" (I suppose because the control plane cannot communicate with the kubelet on Node 1), even after the grace period (supposedly 30 seconds) ends.
If I force the deletion of Pod-A (kubectl delete pods Pod-A --grace-period=0 --force), the pod is successfully deleted. After 6 minutes, Kubernetes realizes that the volume is no longer being used by Pod-A, attaches it successfully to Pod-B, and Pod-B finally starts running normally.
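The force deletion and the checks I do afterwards look roughly like this (a sketch; the pod names are from the scenario above, and the commands obviously need a live cluster):

```shell
# Force-delete the pod stuck in Terminating (bypasses kubelet confirmation)
kubectl delete pod Pod-A --grace-period=0 --force

# The attach/detach controller's view of the volume: each VolumeAttachment
# object shows which node the PV is currently attached to
kubectl get volumeattachment

# Events on the replacement pod show the multi-attach error
# until the detach completes
kubectl describe pod Pod-B
```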
I already found a way to shorten the pod-eviction-timeout (by modifying /etc/kubernetes/manifests/kube-controller-manager.yaml), but I cannot find a way to make Pod-A terminate automatically, nor a way to lower the time it takes Kubernetes to detach the volume from Pod-A once the pod is deleted. Is there a way to make this happen faster? Do I have to install some kind of monitoring service? Which one?
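For completeness, the kind of change I made to the static pod manifest is sketched below (a fragment only; the flag values here are examples, not recommendations — the defaults are --pod-eviction-timeout=5m0s and --node-monitor-grace-period=40s):

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (fragment)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --pod-eviction-timeout=30s         # how long pods stay on a NotReady node
    - --node-monitor-grace-period=20s    # how long until the node is marked NotReady
```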
Thanks in advance.