VOLUME MOUNTING TROUBLESHOOTING (stuck ceph RBD volumes)

The steps below apply when the StorageClass is rook-cephfs-block.

Get the image name for the PVC:

kubectl get pv {PV Name} -o yaml | grep imageName

If there is only a volumeHandle field, the image name is csi-vol- followed by the 2nd part of the volumeHandle.
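A minimal sketch of building the image name from a volumeHandle. It assumes the handle ends with the 36-character UUID that appears in the image name. The helper name image_from_handle and the sample handle are made up for illustration (the UUID is the one from the rbd status example in this document), so check the result against a real PV before relying on it.

```shell
# Hypothetical helper: build the RBD image name from a CSI volumeHandle.
# Assumption: the handle ends with a 36-character UUID.
image_from_handle() {
  handle="$1"
  # Keep only the last 36 bytes of the handle (the UUID).
  uuid=$(printf '%s' "$handle" | tail -c 36)
  printf 'csi-vol-%s\n' "$uuid"
}

# On a live cluster the handle could be read with:
#   kubectl get pv {PV Name} -o jsonpath='{.spec.csi.volumeHandle}'
# The handle below is a made-up example.
image_from_handle "0001-0009-rook-ceph-0000000000000001-bdbb4e58-e06f-11ed-a0fc-8e51dd4b77c0"
```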

Find the node an image is mounted on:

Log in to the tools container for the region the volume is in.

rbd status rbd/{image name}

Returns the node's IP as a watcher:

[rook@rook-ceph-tools-7cd79f6fbf-jrvqv /]$ rbd status rbd/csi-vol-bdbb4e58-e06f-11ed-a0fc-8e51dd4b77c0
Watchers:
        watcher=10.244.231.56:0/2673997804 client.434796007 cookie=18446462598732840961

The IP here is 10.244.231.56.
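To avoid copying the IP by hand, the watcher line can be parsed with sed. This is a small sketch: the helper name watcher_ips is hypothetical, and it only handles IPv4 watcher addresses in the format shown above.

```shell
# Hypothetical helper: print the IP of every watcher found in
# `rbd status` output read from stdin (IPv4 only).
watcher_ips() {
  sed -n 's/.*watcher=\([0-9.]*\):.*/\1/p'
}

# Sample input copied from the output above. On the tools container:
#   rbd status rbd/{image name} | watcher_ips
printf '%s\n' 'Watchers:' \
  '        watcher=10.244.231.56:0/2673997804 client.434796007 cookie=18446462598732840961' \
  | watcher_ips
```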

Search for the IP in the Lens node view to find the node, or look up which node owns the corresponding ipamblock.
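The ipamblock lookup can also be done from the command line. The sketch below assumes the cluster uses Calico IPAM with IPAMBlock resources exposed as ipamblocks.crd.projectcalico.org, where spec.cidr holds the block and spec.affinity holds host:{node name}. The helper names and the sample node names are hypothetical.

```shell
# Return success if IPv4 address $1 lies inside CIDR $2.
ip_in_cidr() {
  awk -v ip="$1" -v cidr="$2" '
    function to_int(addr,   p) {
      split(addr, p, ".")
      return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4]
    }
    BEGIN {
      split(cidr, c, "/")
      size = 2 ^ (32 - c[2])      # number of addresses in the block
      base = to_int(c[1])
      base -= base % size         # align to the start of the block
      v = to_int(ip)
      exit !(v >= base && v < base + size)
    }' </dev/null
}

# Read "CIDR AFFINITY" lines on stdin and print the affinity of
# every block that contains IP $1.
block_for_ip() {
  while read -r cidr affinity; do
    if ip_in_cidr "$1" "$cidr"; then
      printf '%s\n' "$affinity"
    fi
  done
}

# Assumed usage on a Calico cluster:
#   kubectl get ipamblocks.crd.projectcalico.org --no-headers \
#     -o custom-columns=CIDR:.spec.cidr,AFFINITY:.spec.affinity \
#     | block_for_ip 10.244.231.56
# Made-up sample data:
printf '%s\n' '10.244.10.0/26 host:node-b' '10.244.231.0/26 host:node-a' \
  | block_for_ip 10.244.231.56
```

The printed affinity (host:node-a in the sample) names the node that owns the block containing the watcher IP.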

Rebooting a node with volume mounting issues:

Drain node: kubectl drain {node name} --ignore-daemonsets --delete-emptydir-data --force

SSH into node and reboot: ssh {user}@{node name} reboot

If it is a GPU node, check on the node that the NVIDIA driver is up after the reboot: nvidia-smi

Uncordon Node: kubectl uncordon {node name}

Or use the ansible playbook: ansible-playbook reboot.yaml -l {node name}