VOLUME MOUNTING TROUBLESHOOTING (stuck ceph RBD volumes)
Get the volume's StorageClass
You'll need to get the StorageClass to determine how to debug this. Block storage (RBD) can only be attached to one node at a time, so if the volume is stuck on a node it will prevent other nodes from using it. The non-block class (rook-cephfs) allows multiple nodes to mount the volume at once.
kubectl describe -n mizzou pvc/claim-hikf3-40mail-2emissouri-2eedu | grep StorageClass
StorageClass: rook-ceph-block
Here we're getting the storage class via a pvc name. The namespace here is mizzou, and the pvc name is claim-hikf3-40mail-2emissouri-2eedu
If you have the pv name, that can also be used the same way:
kubectl describe -n mizzou pv/pvc-ae8904a6-23f2-46d0-ac5c-4e9e9271c6f7 | grep StorageClass
StorageClass: rook-ceph-block
Here the storage class is rook-ceph-block, but there are regional variations depending on which Ceph cluster the volume lives in, e.g. rook-ceph-block-central or rook-ceph-block-east. What matters is whether the name contains "block" or not.
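As a minimal sketch, the block-vs-shared decision can be scripted. The sample line below is the grep output shown above; in practice you would pipe the real kubectl describe output into this logic instead.

```shell
# Sample line standing in for: kubectl describe pvc/... | grep StorageClass
sc_line='StorageClass:    rook-ceph-block'
# The class name is the second whitespace-separated field
sc_name=$(echo "$sc_line" | awk '{print $2}')

# Classify: "block" classes attach to one node at a time; others are shared
case "$sc_name" in
  *block*) echo "block: volume can only attach to one node at a time" ;;
  *)       echo "shared: volume can mount on multiple nodes" ;;
esac
```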
If the StorageClass is rook-ceph-block
Get the name of PV:
If you only have the pvc name, you can find the pv name by running:
kubectl get pv | grep claim-hikf3-40mail-2emissouri-2eedu
pvc-ae8904a6-23f2-46d0-ac5c-4e9e9271c6f7 5Gi RWO Delete Bound mizzou/claim-hikf3-40mail-2emissouri-2eedu rook-ceph-block 185d
Here in this example the pvc name is claim-hikf3-40mail-2emissouri-2eedu and the pv name is pvc-ae8904a6-23f2-46d0-ac5c-4e9e9271c6f7
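The PV name is the first column of that output, so you can extract it directly with awk. This sketch uses the sample line from above as input; the real pipeline would be kubectl get pv | grep {pvc name} | awk '{print $1}'.

```shell
# Sample line standing in for: kubectl get pv | grep claim-hikf3-40mail-2emissouri-2eedu
pv_line='pvc-ae8904a6-23f2-46d0-ac5c-4e9e9271c6f7   5Gi   RWO   Delete   Bound   mizzou/claim-hikf3-40mail-2emissouri-2eedu   rook-ceph-block   185d'
# First column is the PV name
pv_name=$(echo "$pv_line" | awk '{print $1}')
echo "$pv_name"
```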
Find the node a PV is currently mounted on:
We can list the volume attachments and grep for the PV name to find which node it's attached to:
kubectl get volumeattachments | grep pvc-ae8904a6-23f2-46d0-ac5c-4e9e9271c6f7
csi-68ba7951a0c04649cd3e80156c355e899e753e50b9030a424dbe8dd872061067 rook-system.rbd.csi.ceph.com pvc-ae8904a6-23f2-46d0-ac5c-4e9e9271c6f7 k8s-bharadia-02.sdsc.optiputer.net true 2d19h
Here the pv name is pvc-ae8904a6-23f2-46d0-ac5c-4e9e9271c6f7 and the node it's attached to is k8s-bharadia-02.sdsc.optiputer.net
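Similarly, the node name is the fourth column of the volumeattachments output, so a quick one-liner is kubectl get volumeattachments | grep {pv name} | awk '{print $4}'. A self-contained sketch using the sample line above:

```shell
# Sample line standing in for: kubectl get volumeattachments | grep pvc-ae8904a6-...
va_line='csi-68ba7951a0c04649cd3e80156c355e899e753e50b9030a424dbe8dd872061067   rook-system.rbd.csi.ceph.com   pvc-ae8904a6-23f2-46d0-ac5c-4e9e9271c6f7   k8s-bharadia-02.sdsc.optiputer.net   true   2d19h'
# Fourth column is the node the volume is attached to
node=$(echo "$va_line" | awk '{print $4}')
echo "$node"
```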
Rebooting the node:
Reach out to an admin to reboot the node the volume is attached to.
If you have permissions to reboot the node yourself, you can do the following:
Drain node: kubectl drain {node name} --ignore-daemonsets --delete-emptydir-data --force
SSH into node and reboot: ssh {user}@{node name} reboot
If it's a GPU node, verify the driver came back up after the reboot: nvidia-smi
Uncordon Node: kubectl uncordon {node name}
Or use the ansible playbook: ansible-playbook reboot.yaml -l {node name}
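The manual steps above can be sketched as a small script. This is a dry-run sketch, not a tested automation: the node name and SSH user are placeholders, and DRY_RUN=1 only prints each command rather than running it.

```shell
node="k8s-example-01.example.net"   # hypothetical node name
user="ubuntu"                       # hypothetical SSH user
DRY_RUN=1                           # unset to actually run the commands

# Print the command when DRY_RUN is set, otherwise execute it
run() { if [ -n "$DRY_RUN" ]; then echo "+ $*"; else "$@"; fi }

run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --force
run ssh "$user@$node" reboot
run ssh "$user@$node" nvidia-smi    # only needed on GPU nodes
run kubectl uncordon "$node"
```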
If the StorageClass is rook-cephfs
Since this type of storage class allows the ceph volume to be mounted on multiple nodes, a stuck node is likely not the issue. The most common issue is that the user has configured their volume incorrectly, and their access mode is set to ReadWriteOnce when it should be set to ReadWriteMany for this type of storageclass.
You can check the access mode by outputting the config for a pvc or pv and grepping for accessModes, like so:
kubectl get -n mizzou pvc/claim-hikf3-40mail-2emissouri-2eedu -o yaml | grep accessModes: -A 1
Here the pvc name is claim-hikf3-40mail-2emissouri-2eedu, the namespace is mizzou, and the -A 1 flag displays the line below accessModes
accessModes:
- ReadWriteOnce
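A minimal sketch of that check as a script: the here-doc stands in for the real kubectl get ... -o yaml output, which you would pipe in instead.

```shell
# Sample YAML fragment standing in for: kubectl get pvc/... -o yaml
modes=$(cat <<'EOF' | grep -A 1 'accessModes:'
  accessModes:
  - ReadWriteOnce
EOF
)

# On a cephfs-class volume, ReadWriteOnce is the misconfiguration to look for
if echo "$modes" | grep -q ReadWriteOnce; then
  echo "misconfigured: rook-cephfs volumes should use ReadWriteMany"
fi
```

Note that accessModes on an existing PVC are immutable, so fixing this means the user has to recreate the PVC with ReadWriteMany.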