Fixing Rook Ceph OSD paths when they change

I rebooted monstar and ambellina and the paths for all the OSDs changed. This broke Rook in half with no clear way to fix it.

Here’s the fix.

Now, ideally you don't break enough OSDs to actually break your cluster, but if you're desperate, here you go...

  • Identify the broken OSD, in this case 26
  • Use ceph-bluestore-tool on your raw block devices to find your missing OSD (don't worry, it's read-only for this operation). The tool lives in the ceph-osd package (feel free to uninstall it afterwards). Check osd_uuid for the UUID that matches your actual OSD; you can get the expected UUID from ceph osd dump, which lists one per OSD. If you have several disks to check, see the loop sketch after the example output below. Ignore the "magic" field here: it doesn't mean "OSD 26", it means version 26 of the volume format.
root@emerald-k8s-w10:~# ceph-bluestore-tool show-label --dev /dev/sdc
{
    "/dev/sdc": {
        "osd_uuid": "1d303492-bde6-4afb-b35d-9ea4668e4176",
        "size": 1920383410176,
        "btime": "2024-02-19T23:36:55.559570+0000",
        "description": "main",
        "bfm_blocks": "468843648",
        "bfm_blocks_per_key": "128",
        "bfm_bytes_per_block": "4096",
        "bfm_size": "1920383410176",
        "bluefs": "1",
        "ceph_fsid": "3025d3f7-5572-4075-9b54-ee8237b6fa06",
        "ceph_version_when_created": "ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)",
        "created_at": "2024-02-19T23:37:00.295664Z",
        "kv_backend": "rocksdb",
        "magic": "ceph osd volume v026",
        "mkfs_done": "yes",
        "osd_key": "AQAW5tNlwUcGHhAA5zXFjlsTgwhbuIQyaAM4Mg==",
        "ready": "ready",
        "require_osd_release": "18",
        "whoami": "29"
    }
}
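
If you have several candidate disks to check, a small loop saves time. A rough sketch, assuming a Debian/Ubuntu node and that the data disks show up as /dev/sd? (adjust the glob for your hardware):

apt install ceph-osd            # pulls in ceph-bluestore-tool; remove it afterwards if you like
for dev in /dev/sd?; do
  echo "== $dev =="
  # show-label is read-only; pull out just the fields we care about
  ceph-bluestore-tool show-label --dev "$dev" 2>/dev/null | grep -E '"osd_uuid"|"whoami"'
done

The whoami field in the label is the OSD's numeric ID, which makes a handy cross-check (in the example above, /dev/sdc turns out to be OSD 29, so it isn't the one we're after).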

Once you've found the right device, get its stable path from /dev/disk/by-id:

root@emerald-k8s-w10:~# ls -lah /dev/disk/by-id | grep sdc
lrwxrwxrwx 1 root root   9 Feb 24 12:22 scsi-0QEMU_QEMU_HARDDISK_drive-scsi3 -> ../../sdc
lrwxrwxrwx 1 root root  10 Feb 24 12:22 scsi-0QEMU_QEMU_HARDDISK_drive-scsi3-part1 -> ../../sdc1
lrwxrwxrwx 1 root root  10 Feb 24 12:22 scsi-0QEMU_QEMU_HARDDISK_drive-scsi3-part3 -> ../../sdc3

So the path is /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi3.
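
If you want to double-check that the by-id symlink resolves to the disk you just inspected:

readlink -f /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi3    # should print /dev/sdc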

Now we need to export the deployment for that OSD, back it up, edit it to swap the old path for the new one, and re-create it.

wings:rescue/ (main✗) $ kubectl -n rook-ceph get deployment rook-ceph-osd-26 -o yaml > osd26.yaml                                            [12:54:07]
wings:rescue/ (main✗) $ cp osd26.yaml osd26.yaml.bak                                                                                         

Replace the old path with the new one in both places ROOK_BLOCK_PATH appears, save the file, and recreate the deployment.
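
A quick grep before and after editing makes it easy to see both occurrences:

grep -n -A1 ROOK_BLOCK_PATH osd26.yaml    # -A1 shows the value line under each "name: ROOK_BLOCK_PATH"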

wings:rescue/ (main✗) $ code osd26.yaml                                                                                                      [12:54:13]
wings:rescue/ (main✗) $ kubectl delete -f osd26.yaml ; kubectl apply -f osd26.yaml                                                           [12:54:35]
deployment.apps "rook-ceph-osd-26" deleted
deployment.apps/rook-ceph-osd-26 created
wings:rescue/ (main✗) $                      

Your OSD should spin back to life.
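
To confirm it actually came back, check the pod and then the OSD itself (run the ceph commands from wherever you normally do, e.g. the rook-ceph-tools pod):

kubectl -n rook-ceph get pods | grep osd-26    # the osd pod should be Running again
ceph osd tree down                             # the fixed OSD should drop off this list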

PS: If you are doing a bunch of OSDs, partially script it. It'll save you a lot of effort.

export OSDID=17
kubectl -n rook-ceph get deployment rook-ceph-osd-$OSDID -o yaml > osd$OSDID.yaml ; cp osd$OSDID.yaml osd$OSDID.yaml.bak
code osd$OSDID.yaml
kubectl delete -f osd$OSDID.yaml ; kubectl apply -f osd$OSDID.yaml

Set OSDID to your OSD's number. Replace "code" with your favourite editor.
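
And if you already know the old and new path for each OSD, you can fold the edit in with sed and skip the editor entirely. A rough sketch, not battle-tested; OLDPATH and NEWPATH are placeholders you fill in per OSD, with a diff thrown in so you can eyeball the change before applying it:

export OSDID=17
export OLDPATH=/dev/old-path-goes-here                 # placeholder: whatever is in the deployment now
export NEWPATH=/dev/disk/by-id/new-path-goes-here      # placeholder: the by-id path you found earlier
kubectl -n rook-ceph get deployment rook-ceph-osd-$OSDID -o yaml > osd$OSDID.yaml
cp osd$OSDID.yaml osd$OSDID.yaml.bak
sed -i "s|$OLDPATH|$NEWPATH|g" osd$OSDID.yaml
diff osd$OSDID.yaml.bak osd$OSDID.yaml                 # sanity-check the substitution
kubectl delete -f osd$OSDID.yaml ; kubectl apply -f osd$OSDID.yaml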