Friday, June 19, 2009

The Missing Link (or: When UDEV Bites)

It happened again. The nightly backup job failed. What a drag.

After a short investigation I found that the external backup disk disappeared. It was physically there of course, but the symbolic link to its device path was gone.

A while ago I added the following UDEV rule that creates a symbolic link /dev/elements to point to the device path representing the external backup disk (e.g. /dev/sda1) when it is first connected to the computer:
KERNEL=="sd?1", ATTRS{serial}=="574341523030323834303731", ACTION=="add", SYMLINK+="elements"
(see this post).

This allows me to always address the disk as /dev/elements in scripts and configuration files, instead of using the actual device path. This is useful because the device path can and does change, depending on the order of its detection by the kernel (i.e. it may appear as /dev/sdb1 instead of /dev/sda1).

I could've used the UUID link to the device: /dev/disk/by-uuid/306fc694-5328-4a6d-b6ec-4d1310c2feb8, but my custom symlink is shorter to type, easier to remember and I find it to be somehow prettier.

Anyhow, my nice link was gone, but the device path /dev/sda1 was still there, as were the various other links to this device in /dev/disk/by-uuid, /dev/disk/by-path, etc.

I was able to mount the disk via the UUID link, so it didn't appear to be hardware related.

I tried re-triggering UDEV events:
udevadm trigger
and was pleased to find that the link re-appeared.

I chalked it up to my incomplete grasp of reality. It's a fluke. A one-time event. Nothing to worry about.

All was well for a few days, and then I had to reboot my box. I was rather upset to find out that the link was missing again. I had no time to investigate this and simply added udevadm trigger to /etc/init.d/bootmisc.sh and restarted the machine. This seemed to work fine.

Close to two months passed by. Every time I had to reboot my box I recalled that incident but was not inclined to investigate any further - I am trying to curb my pathological dislike to workarounds.

I almost forgot about it, until I tried to mount my encrypted live-HDD.

Epic fail.

The script that I use to do this chore has worked flawlessly for more than a year. I intend to post it here soon, but till then suffice it say that the first step is to open the disk with cryptsetup:
cryptsetup luksOpen /dev/aluminum-crypto aluminum-crypto
and yes, /dev/aluminum-crypto is a custom symlink pointing to the encrypted partition, setup by an UDEV rule similar to the one shown earlier for my backup disk.

And guess what? that link was missing too. I re-triggered UDEV events and saw the symlink re-appear. Been there, done that.

I tried running cryptsetup again, and to my surprise it failed, right after querying me for the encryption password.

And my custom link was gone!

I tried it a few more times, and used strace to find that cryptsetup managed to successfully open and close /dev/aluminum-crypto several times, before it failed to open it. This was weird.

I decided to dig deeper. My plan was simple: run cryptsetup from within a debugger, break close to the failing open statement and step through the code until my link disappeared.

This meant that I needed to get the source code for cryptsetup and build it:

apt-get build-dep cryptsetup
cd /path/to/temporary/dir
apt-get source cryptsetup
cd cryptsetup-1.0.6
fakeroot -- debian/rules build

Unfortunately, cryptsetup failed to build from source (FTBFS - see Debian bug #522338). Fortunately, the bug was already fixed in the unstable package, so I retried with that package:

apt-get source -t unstable cryptsetup
cd cryptsetup-1.0.6+20090405.svn49
fakeroot -- debian/rules build

This time it worked, and I was left with an executable at src/cryptsetup. I launched gdb, stepped through the code and then I realized that the link got removed as soon as the first close function completed.

Did I mention how weird this is? All I did was open and close /dev/aluminum-crypto, which was enough to remove it - this shouldn't happen.

I had to verify that I actually saw what I thought I saw, so I wrote the following little program, named rmlink.c:

/* -*- compile-command:"gcc rmlink.c -g -o rmlink"; -*- */
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
int fd;

if (argc != 2) {
fprintf(stderr, "usage: rmlink <symlink>\n");
exit(-1);
}

fd = open(argv[1], O_RDWR);
if (fd < 0) {
perror("rmlink");
exit(-1);
}
close(fd);
return 0;
}

This little program successfully removed my custom device symlinks by simply opening and then closing them.

Curiously, I was not able to remove other device symlinks (e.g. the ones under /dev/disk/by-uuid).

What was going on here? is this a personal thing that UDEV has against me? after all, UDEV is responsible for all those symlinks, both standard and custom, and they are all generated using UDEV rules which look similar to mine (see /lib/udev/rules.d/60-persistent-storage.rules).

I used the following magic to persuade UDEV to verbosely log its actions to /var/log/syslog:
udevadm control --log-priority=debug

I then saw the following messages appear, after running ./rmlink /dev/gigapod (yet another external USB disk that I use):
udevd-event[24908]: update old name, '/dev/gigapod' no longer belonging to '/devices/pci0000:00/0000:00:0a.0/0000:02:00.2/usb5/5-1/5-1.4/5-1.4:1.0/host1/target1:0:0/1:0:0:0/block/sda/sda1'
udevd-event[24908]: no reference left, remove '/dev/gigapod'


Gotcha!

But why was this happening? and why now? It used to work...

I scrutinized my own UDEV rules and pondered the rules in /lib/udev/rules.d/60-persistent-storage.rules and noticed two things that I hadn't noticed before:
  1. the standard rules are fired both for "add" and "change" events, while mine only address the "add" event
  2. the standard rules enable an option: OPTIONS+="watch"

I couldn't find any documentation for the "watch" option. But one of the commenters in Ubuntu bug #332270 explained that this option causes the Kernel to trigger "change" events for inotify events (such as closing a block device...), and it seems to be a recent addition...

The fix was easy enough to figure out now: add the "change" action to my rules, so that the symlinks were re-added when the "change" event got triggered:
KERNEL=="sd?1", ATTRS{serial}=="574341523030323834303731", ACTION=="add|change", SYMLINK+="elements"
Later I dropped the ACTION condition altogether.

Finally, I removed the UDEV event triggering workaround from /etc/init.d/bootmisc.sh, and restarted the machine.

All is fine.

For now.

1 comment:

  1. Thank you, thank you, THANK YOU!

    I also searched half a day for why my backup script didn't work when firing cryptsetup with a symlink. Now all works fine :-)

    Best regards from Germany,
    Christoph D. (www.el-th.de)

    ReplyDelete