Fixing Rook Ceph OSD paths when they change
Rook paths changing
I rebooted monstar and ambellina and the paths for all the OSDs changed. This broke Rook in half with no clear way to fix it.
Here’s the fix.
Now, ideally you don't break enough OSDs to actually break your cluster, but if you're desperate, here you go...
- Identify the broken OSD, in this case 26.
- Run ceph-bluestore-tool show-label against your raw block devices (don’t worry, it’s read-only for this operation) to find your missing OSD. The tool ships in the ceph-osd package (feel free to uninstall it afterwards). Look at osd_uuid to find the UUID that matches your actual OSD. Ignore the "magic" field here: "v026" doesn't mean "OSD 26", it means version 26.
root@emerald-k8s-w10:~# ceph-bluestore-tool show-label --dev /dev/sdc
{
"/dev/sdc": {
"osd_uuid": "1d303492-bde6-4afb-b35d-9ea4668e4176",
"size": 1920383410176,
"btime": "2024-02-19T23:36:55.559570+0000",
"description": "main",
"bfm_blocks": "468843648",
"bfm_blocks_per_key": "128",
"bfm_bytes_per_block": "4096",
"bfm_size": "1920383410176",
"bluefs": "1",
"ceph_fsid": "3025d3f7-5572-4075-9b54-ee8237b6fa06",
"ceph_version_when_created": "ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)",
"created_at": "2024-02-19T23:37:00.295664Z",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"osd_key": "AQAW5tNlwUcGHhAA5zXFjlsTgwhbuIQyaAM4Mg==",
"ready": "ready",
"require_osd_release": "18",
"whoami": "29"
}
}
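If you have a lot of disks, checking each label by eye gets old fast. Here's a rough bash sketch that loops over a list of devices and prints the ones whose osd_uuid matches the UUID you pass in. The function names (label_field, find_osd_device) and the BLUESTORE_TOOL override are made up for this example, not part of Ceph, and the parsing assumes one "key": "value" pair per line like the output above.

```shell
# Print the value of one quoted field from show-label JSON read on stdin.
# Assumption: each field sits on its own line, as in the output above.
label_field() {
    local field="$1"
    sed -n "s/^[[:space:]]*\"$field\":[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}

# Print every device in the argument list whose label carries the wanted UUID.
# BLUESTORE_TOOL is a hypothetical override, handy for testing without Ceph.
find_osd_device() {
    local wanted_uuid="$1" dev uuid
    shift
    for dev in "$@"; do
        uuid=$("${BLUESTORE_TOOL:-ceph-bluestore-tool}" show-label --dev "$dev" 2>/dev/null \
            | label_field osd_uuid)
        if [ "$uuid" = "$wanted_uuid" ]; then
            echo "$dev"
        fi
    done
}
```

Run as root, something like find_osd_device 1d303492-bde6-4afb-b35d-9ea4668e4176 /dev/sd? should print the matching device. Devices without a BlueStore label are skipped quietly.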
Once you've found the device, look up its stable path in /dev/disk/by-id:
root@emerald-k8s-w10:~# ls -lah /dev/disk/by-id | grep sdc
lrwxrwxrwx 1 root root 9 Feb 24 12:22 scsi-0QEMU_QEMU_HARDDISK_drive-scsi3 -> ../../sdc
lrwxrwxrwx 1 root root 10 Feb 24 12:22 scsi-0QEMU_QEMU_HARDDISK_drive-scsi3-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Feb 24 12:22 scsi-0QEMU_QEMU_HARDDISK_drive-scsi3-part3 -> ../../sdc3
So the path is /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi3 (the entry that points at sdc itself, not one of the -partN links).
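The ls | grep lookup can also be done by resolving each symlink and comparing it to the device node, which avoids accidentally matching sdc1 or sdc3. This is only a sketch: by_id_paths is a name invented here, the optional second argument exists so you can point it at a different directory, and it relies on readlink -f being available (it is on typical Linux systems).

```shell
# Print the by-id symlinks that resolve to exactly the given device node.
# $1: device node, e.g. /dev/sdc
# $2: directory to search (defaults to /dev/disk/by-id)
by_id_paths() {
    local dev dir link
    dev=$(readlink -f "$1")
    dir="${2:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            echo "$link"
        fi
    done
}
```

A disk can have more than one by-id name, so this may print several lines; any of them that points at the whole disk should work as the new path.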
Now we need to export the deployment for that OSD, back it up, edit it to replace the old path with the new one, and then recreate the deployment.
wings:rescue/ (main✗) $ kubectl -n rook-ceph get deployment rook-ceph-osd-26 -o yaml > osd26.yaml
wings:rescue/ (main✗) $ cp osd26.yaml osd26.yaml.bak
Set the value of both ROOK_BLOCK_PATH entries to the new path, save the file, and recreate the deployment.
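If you'd rather not do the edit by hand, a sed one-liner can probably handle it. This sketch assumes the exported YAML puts each value: line directly under its name: ROOK_BLOCK_PATH line, and that the new path contains no | character. set_block_path is a made-up helper name. It writes to a new file instead of editing in place, so you can diff the result before applying it.

```shell
# Write a copy of a deployment manifest with every ROOK_BLOCK_PATH value replaced.
# $1: new device path, $2: input YAML, $3: output YAML
# Assumption: "value:" is on the line right after "name: ROOK_BLOCK_PATH".
set_block_path() {
    local new_path="$1" in_file="$2" out_file="$3"
    sed -e '/name: ROOK_BLOCK_PATH$/{' \
        -e 'n' \
        -e "s|value: .*|value: $new_path|" \
        -e '}' "$in_file" > "$out_file"
}
```

For example, set_block_path /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi3 osd26.yaml.bak osd26.yaml, followed by diff osd26.yaml.bak osd26.yaml to check that only the expected lines changed.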
wings:rescue/ (main✗) $ code osd26.yaml
wings:rescue/ (main✗) $ kubectl delete -f osd26.yaml ; kubectl apply -f osd26.yaml
deployment.apps "rook-ceph-osd-26" deleted
deployment.apps/rook-ceph-osd-26 created
Your OSD should spin back to life.
PS: If you are doing a bunch of OSDs, partially script it. It'll save you a lot of effort.
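As a starting point for that, here's a small bash sketch that wraps the export/backup and delete/apply steps from above into two functions, leaving the actual edit in the middle for you to review. The function names and the osdN.yaml file naming are just what was used in this post, not anything Rook provides.

```shell
# Export the deployment for one OSD ID and keep a .bak copy next to it.
# $1: OSD ID, $2: output directory (defaults to the current directory)
export_osd_deployment() {
    local osd_id="$1" out_dir="${2:-.}"
    local file="$out_dir/osd$osd_id.yaml"
    kubectl -n rook-ceph get deployment "rook-ceph-osd-$osd_id" -o yaml > "$file" || return 1
    cp "$file" "$file.bak"
    echo "$file"
}

# Delete and re-apply a deployment from an edited manifest, as done by hand above.
recreate_osd_deployment() {
    local file="$1"
    kubectl delete -f "$file"
    kubectl apply -f "$file"
}

# Example loop (commented out so nothing runs by accident):
# for id in 26 27; do
#     export_osd_deployment "$id"
#     # ...edit osd$id.yaml here, check the diff against osd$id.yaml.bak...
#     recreate_osd_deployment "osd$id.yaml"
# done
```

Keeping the edit as a manual step means a bad path gets caught in the diff rather than in a crash-looping pod.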