Commit 1378554d authored by Alberto Colla
Update restore-ceph-from-mon-disaster.rst

Another command is `lsblk`::
    disk
    ├─nvme2n1p1 259:2 0 3.9T 0 part
    └─ceph--5fd8f96e--2ccb--460f--87b8--359ff81cff8a-osd--block--5fd8f96e--2ccb--460f--87b8--359ff81cff8a 253:0 0 3.9T 0 lvm
In our case we need to zap (delete everything, including the partition table) device nvme3n1p1, the device that was added but not initialized, and then re-add it to Ceph.

To recover, first fix the keys on the OSD units.
On a mon, get the client.bootstrap-osd and client.osd-upgrade keys::

    ceph auth get client.bootstrap-osd
    ceph auth get client.osd-upgrade
If they are not present, create them with::

    ceph auth get-or-create client.bootstrap-osd mon "allow profile bootstrap-osd"
    ceph auth get-or-create client.osd-upgrade mon "allow command \"config-key\"; allow command \"osd tree\"; allow command \"config-key list\"; allow command \"config-key put\"; allow command \"config-key get\"; allow command \"config-key exists\"; allow command \"osd out\"; allow command \"osd in\"; allow command \"osd rm\"; allow command \"auth del\""
Replace the key value in the following files on EACH OSD unit::

    /var/lib/ceph/bootstrap-osd/ceph.keyring                <---- client.bootstrap-osd key
    /var/lib/ceph/osd/ceph.client.osd-upgrade.keyring       <---- client.osd-upgrade key
Those keys were created when the new mons were installed.
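For reference, a minimal sketch of what the bootstrap-osd keyring on an OSD unit should look like after the replacement (the key value below is a placeholder for the one printed by `ceph auth get`; the osd-upgrade keyring uses a `[client.osd-upgrade]` section in the same way)::

    [client.bootstrap-osd]
        key = <key from 'ceph auth get client.bootstrap-osd'>
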
Now, FOR EACH OSD:
- ssh to the OSD unit and zap nvme3n1p1::

    !!! Please note this will DESTROY all data on nvme3n1p1 and it CANNOT be recovered. !!!
    !!! Please double check before proceeding. !!!

    ceph-volume lvm zap /dev/nvme3n1p1 --destroy
- recreate the partition (a verification sketch follows this list)::

    parted -a optimal /dev/nvme3n1 mkpart primary 0% 4268G
- Now go back to the juju client machine and use juju to remove the device from juju's internal db::

    juju run-action --wait ceph-osd/X zap-disk devices=/dev/nvme3n1p1 i-really-mean-it=true
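Before moving on, it can be worth double-checking from the OSD unit that the zap removed the old ceph LVM metadata and that the new partition is in place; a minimal sketch using the device names from this example::

    # no ceph LVM volume should remain, and nvme3n1p1 should exist again
    lsblk /dev/nvme3n1
    parted /dev/nvme3n1 print
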
Then check juju status; it can be forced to update with::

    juju run --unit ceph-osd/X 'hooks/update-status'
At this stage the OSD status should be back to normal (green).

If not, run the following commands::

    juju run-action --wait ceph-osd/21 zap-disk devices=/dev/nvme3n1p1 i-really-mean-it=true
    juju run-action ceph-osd/21 add-disk osd-devices="/dev/nvme3n1p1"
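Once the disk has been re-added, the recovery can be cross-checked from a mon with the usual status commands, for example::

    ceph -s          # overall cluster health
    ceph osd tree    # the recovered OSD should be listed as up/in again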