VOLUME MOUNTING TROUBLESHOOTING (stuck Ceph RBD volumes)
If the StorageClass is rook-cephfs-block
Get image name of PVC:
kubectl get pv {PV Name} -o yaml | grep imageName
If there is only a volumeHandle field, the image name is csi-vol- followed by the 2nd part of the volumeHandle.
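The image-name rule above can be sketched as a small shell helper. This is a minimal sketch, not a definitive implementation: it assumes the part of the volumeHandle that is needed is the 36-character UUID at its end (matching the csi-vol-<UUID> image name shown in the rbd status example), and the function name image_from_handle and the sample handle are hypothetical.

```shell
# Hypothetical helper: derive the RBD image name from a CSI volumeHandle.
# Assumption: the volumeHandle ends in a 36-character UUID and the image
# is named csi-vol-<that UUID>.
image_from_handle() {
  handle="$1"
  # Keep only the last 36 bytes of the handle (the trailing UUID).
  uuid=$(printf '%s' "$handle" | tail -c 36)
  printf 'csi-vol-%s\n' "$uuid"
}

# Usage (the PV name is a placeholder):
#   handle=$(kubectl get pv {PV Name} -o jsonpath='{.spec.csi.volumeHandle}')
#   image_from_handle "$handle"
```

If the handle on your cluster does not end in a UUID, check the PV yaml by hand rather than trusting this helper.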
Find the node an image is mounted on:
Log in to the tools container for the region the volume is in.
rbd status rbd/{image name}
Returns the node's IP as a watcher:
[rook@rook-ceph-tools-7cd79f6fbf-jrvqv /]$ rbd status rbd/csi-vol-bdbb4e58-e06f-11ed-a0fc-8e51dd4b77c0
Watchers:
watcher=10.244.231.56:0/2673997804 client.434796007 cookie=18446462598732840961
The IP here is 10.244.231.56.
Search for the IP in the Lens node view to get the node, or look up the node in the corresponding ipamblock.
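Picking the IP out of the watcher line can be scripted. The following is a minimal sketch that assumes IPv4 watcher addresses in the watcher=IP:port/nonce form shown above; the function name watcher_ips is hypothetical.

```shell
# Hypothetical helper: print the IP of every watcher found in
# `rbd status` output read from stdin.
# Assumption: watcher lines look like "watcher=<IPv4>:<port>/<nonce> ...".
watcher_ips() {
  sed -n 's/.*watcher=\([0-9.]*\):.*/\1/p'
}

# Usage inside the tools container (the image name is a placeholder):
#   rbd status rbd/{image name} | watcher_ips
```

An image with no watchers produces no output, which usually means no node currently has it mapped.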
Rebooting a node with volume mounting issues:
Drain node: kubectl drain {node name} --ignore-daemonsets --delete-emptydir-data --force
SSH into node and reboot: ssh {user}@{node name} reboot
If it is a GPU node, check that the NVIDIA driver is up once the node is back: nvidia-smi
Uncordon Node: kubectl uncordon {node name}
Or use the ansible playbook: ansible-playbook reboot.yaml -l {node name}
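The manual steps above can also be wrapped in two small shell functions. This is a sketch only, not the contents of reboot.yaml: the function names, the DRY_RUN switch, and the "gpu" argument are hypothetical. It prints the commands by default so they can be reviewed first, and it is split in two because the node has to come back up before the second half makes sense.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Part 1: drain the node, then reboot it over SSH.
drain_and_reboot() {
  node="$1"; user="$2"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --force || return 1
  # ssh often exits non-zero when the reboot drops the connection,
  # so its exit status is deliberately ignored.
  run ssh "$user@$node" reboot || true
}

# Part 2: run once the node is reachable again.
# Pass "gpu" as the third argument to check nvidia-smi before uncordoning.
check_and_uncordon() {
  node="$1"; user="$2"; kind="${3:-}"
  if [ "$kind" = "gpu" ]; then
    run ssh "$user@$node" nvidia-smi || return 1
  fi
  run kubectl uncordon "$node"
}

# Usage (node and user names are placeholders):
#   drain_and_reboot {node name} {user}
#   ...wait for the node to come back...
#   check_and_uncordon {node name} {user} gpu
```

With DRY_RUN=0, a failed drain or a failed nvidia-smi check stops the function before the next step, so a broken GPU node is not uncordoned by accident.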