I have a Hetzner server (the server that hosts this blog, as it happens) which runs Debian. While waiting to give a presentation at the day job over Zoom, I decided this was a good moment to upgrade the system from Debian Bullseye (now EOL) to Bookworm.

This is something I’ve done before, so I didn’t anticipate any problems.

I was wrong.

Anyway, after performing the upgrade in the usual way, I rebooted and was greeted with… well… nothing.

The server failed to restart.

I logged in to Hetzner Robot and booted the rescue image, but I couldn’t find anything of note in the server logs. Indeed, the logs appeared to be completely untouched. To me, this pointed to a problem with the boot loader.

Thankfully, Hetzner supports a vKVM, so I booted into that. Again, nothing.

Hmm…

While playing around, however, I did notice that if you boot the vKVM (which automatically starts the rescue image) and then trigger a soft reset from within the KVM itself, the KVM remains attached during the boot process, which allowed me to see… a grub error.

Joy. But at least I had identified the fault.

The error in question was complaining about not being able to find normal.mod, which is fairly critical. I poked around the recovery console, mounting the various filesystems in the RAID, but couldn’t find the file. So I attempted to load the linux module by hand using insmod, hoping to boot the kernel manually… only to be treated to another error complaining about linux.mod not being found.

So, grub was completely b0rked.

This, however, gave us the answer…

The fix

The problem is that, for whatever reason, grub (the bootloader) had got messed up. So, we need to rebuild it. Since we can’t boot the system, we need to do this from the rescue system. Hetzner does have an installimage tool, but I felt a little wary about running that, since my understanding was that it would wipe everything… a bit of a nuclear option.

Thankfully, there was a lower impact solution we could try first.

  1. Confirm your software RAID is working by taking a look at /proc/mdstat, and list the structure using lsblk. Your RAID should already be assembled, but if it isn’t, you can run mdadm --assemble --scan
  2. Next, find your mount points. For me:
    • md0 = swap (ignore)
    • md1 = /boot
    • md2 = /
    • md3 = home directories (leave this alone)
  3. Now you’re ready to set up a chroot into your installed system.
    • Mount your root drive (md2) to /mnt – mount /dev/md2 /mnt
    • Mount your boot drive (md1) inside – mount /dev/md1 /mnt/boot
    • Bind various system drives
      • mount --bind /dev /mnt/dev
      • mount --bind /proc /mnt/proc
      • mount --bind /sys /mnt/sys
    • Finally, create your chroot: chroot /mnt
  4. Now, rebuild and reinstall grub
    • grub-install /dev/sda (I took a guess that this was where my boot loader lived, which is usually the case)
    • update-grub
  5. Exit the chroot, unmount, and reboot
    • exit
    • umount /mnt/dev
    • umount /mnt/proc
    • umount /mnt/sys
    • umount /mnt/boot
    • umount /mnt
    • reboot
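
For reference, the steps above can be collected into a small script. Treat this as a rough sketch rather than something I ran as-is: it assumes the same layout as mine (md2 is /, md1 is /boot, boot loader on /dev/sda), and the DRY_RUN switch and the run helper are made up for illustration. By default it only prints the commands, so you can sanity check them before letting it loose.

```shell
#!/bin/sh
# Sketch: reinstall grub from the rescue system via a chroot.
# Assumptions (adjust for your box): md2 is /, md1 is /boot, and the
# boot loader lives on /dev/sda. DRY_RUN and run() are illustrative helpers.
ROOT_DEV="${ROOT_DEV:-/dev/md2}"
BOOT_DEV="${BOOT_DEV:-/dev/md1}"
GRUB_DISK="${GRUB_DISK:-/dev/sda}"
TARGET="${TARGET:-/mnt}"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually run the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repair_grub() {
    # Mount the installed system and the bits grub needs to see.
    run mount "$ROOT_DEV" "$TARGET" || return 1
    run mount "$BOOT_DEV" "$TARGET/boot" || return 1
    for fs in dev proc sys; do
        run mount --bind "/$fs" "$TARGET/$fs" || return 1
    done

    # Reinstall the boot loader and regenerate its config inside the chroot.
    run chroot "$TARGET" grub-install "$GRUB_DISK" || return 1
    run chroot "$TARGET" update-grub || return 1

    # Unmount in reverse order. If a step above fails, the mounts are
    # left in place so you can poke around by hand.
    for fs in dev proc sys boot; do
        run umount "$TARGET/$fs"
    done
    run umount "$TARGET"
}

repair_grub
```

Run it as-is to see the plan, then with DRY_RUN=0 from the rescue system to do it for real; the reboot is left to you. If your RAID spans two disks, you may also want to repeat grub-install for the second disk. I only did /dev/sda.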

All being well, your server should be back up and running. For me, however, this wasn’t quite the end of the story.

After rebooting, my server was still inaccessible. I repeated the vKVM trick and fully expected to see a grub error, but the server was booting normally.

Using the root password, I logged in to the console and, sure enough, my server was running, but there was no network connectivity.

A bit of poking around showed that, for some reason, the network interface name had changed, and /etc/network/interfaces was hard-coded to use the old one.

I used ip link show to find the correct network interface name, updated /etc/network/interfaces to match, and restarted.
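
If you want to check for this before you reboot next time, something like the following would do it. It is only a sketch: the function names are my own, and it assumes a simple /etc/network/interfaces with plain iface lines (it ignores anything pulled in via source, and alias names like eth0:0 will show up as false positives). It lists interface names that the config refers to but that have no entry under /sys/class/net.

```shell
#!/bin/sh
# Sketch: list interface names that /etc/network/interfaces refers to
# but which the kernel does not currently know about.
# Function names and the optional path arguments are illustrative.

# Print every interface named on an "iface" line, one per line.
configured_ifaces() {
    awk '$1 == "iface" { print $2 }' "${1:-/etc/network/interfaces}" | sort -u
}

# Print configured interfaces that have no entry under /sys/class/net.
missing_ifaces() {
    conf="${1:-/etc/network/interfaces}"
    sysnet="${2:-/sys/class/net}"
    for name in $(configured_ifaces "$conf"); do
        [ -e "$sysnet/$name" ] || echo "$name"
    done
}

# Only run against the live system if the config file is there.
if [ -r /etc/network/interfaces ]; then
    missing_ifaces
fi
```

Anything it prints is a name the config expects but the system doesn’t have, which is exactly the situation I ended up in.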

Boom, server back up… and now I can tell you about it here!

Hope this is of use to someone.

So, I recently performed some long overdue upgrades on my work computers, which involved performing a complete from-scratch rebuild of the house server.

This involved, among other things, moving the OpenLDAP directory in which I store my email address book. Easier said than done.

Previously, when I had done this, it was a simple matter of installing slapd, copying /etc/ldap and /var/lib/ldap and restarting slapd. This time proved more tricky; slapd initially spat out errors complaining that the database was corrupt.

Fixing the corrupt database could be done in one of two ways: 1) deleting /var/lib/ldap/DB_CONFIG, forcing slapd to recover the database, or 2) running db_recover directly. After this was done, slapcat displayed the directory as expected and slapd restarted. However, any query on the database – via phpldapadmin, ldapsearch or the email address book – caused the answering slapd thread to lock up with no error. Each subsequent query would cause another slapd thread to lock, eventually bringing down the whole LDAP server.

Hmm…

I’m not entirely sure what was going on. Perhaps it was a bug in my particular version of slapd, or perhaps it was the fact that I was moving from 32-bit to 64-bit; who knows. I eventually found a solution, which I’m writing down primarily as a note for myself.

The workaround

    1. Install slapd, copy the config and db over, then run a db recover on the database:

      db_recover -v -h /var/lib/ldap

    2. You should now be able to list the directory with slapcat, so dump it to a file:

      slapcat > /tmp/directory.ldif

    3. Next, we need to nuke the database and force slapd to create a fresh blank database. There’s probably an easy way of doing this, but the quickest way I found was to delete /var/lib/ldap/* and reinstall slapd:

      cd /var/lib/ldap && rm *; apt-get remove slapd; apt-get install slapd

The apt-get remove will remove the program but, by default, it’ll leave your configuration intact. If it doesn’t, you might need to copy your config back as well.

    4. Next, stop slapd and import your ldif file:

      /etc/init.d/slapd stop; slapadd -l /tmp/directory.ldif; /etc/init.d/slapd start

If you had the same problem as me, your OpenLDAP directory should now be back in place, imported into the new database. For me, my address book now functions correctly and phpldapadmin no longer hangs.
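
Pulled together, the workaround looks roughly like the script below. Again, this is a sketch rather than what I actually typed: DRY_RUN, run and the empty-dump check are additions of mine for illustration, and the paths are simply the ones used above. It prints the commands by default instead of running them.

```shell
#!/bin/sh
# Sketch of the dump-and-reload workaround. DRY_RUN, run() and the
# empty-dump guard are illustrative additions; paths are from this post.
LDAP_DIR="${LDAP_DIR:-/var/lib/ldap}"
LDIF="${LDIF:-/tmp/directory.ldif}"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually run the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Dump the directory to an LDIF file (needs a redirect, so handled here).
dump_directory() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ slapcat > $LDIF"
    else
        slapcat > "$LDIF"
    fi
}

rebuild_ldap() {
    # 1. Recover the copied database so slapcat can read it.
    run db_recover -v -h "$LDAP_DIR" || return 1

    # 2. Dump it, and refuse to go further if the dump came out empty.
    dump_directory || return 1
    if [ "$DRY_RUN" != "1" ] && [ ! -s "$LDIF" ]; then
        echo "dump $LDIF is empty, refusing to wipe $LDAP_DIR" >&2
        return 1
    fi

    # 3. Wipe the old database files and let the package create a fresh one.
    run rm -f "$LDAP_DIR"/*
    run apt-get remove slapd
    run apt-get install slapd

    # 4. Import the dump into the fresh database.
    run /etc/init.d/slapd stop
    run slapadd -l "$LDIF"
    run /etc/init.d/slapd start
}

rebuild_ldap
```

If slapd refuses to start after the import, it is worth checking who owns the files in /var/lib/ldap; running slapadd as root can leave them owned by root rather than the user slapd runs as.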

There are probably easier ways, but I was tired. Let me have your thoughts in the comments below!