Thursday, April 26, 2007

Backup/Extract/Convert Outlook Express Messages

[18 Jun. 2008] UPDATE: I now use UnDBX to facilitate fast incremental backups of DBX files.

My wife's email client of choice is Outlook Express. That, in itself, is OK with me. Except that she doesn't delete messages. Ever. She just can't be bothered. Her OE storage folder currently takes up 1.4GB of disk space.

The real problem is backup. All the messages in each OE message folder are stored in a single monolithic dbx file. This means that every day, during the incremental backup process, Bacula encounters very large files that were modified and need to be backed up. A daily incremental backup of over 1GB is unacceptable, since my storage medium is a 60GB hard disk that needs to hold the full and incremental backups of both our computers.

The solution I came up with was pretty simple: extract all the email messages from the dbx files to a different folder, so that Bacula only needs to back up new email messages during an incremental backup. Little did I know how difficult it would be to set up such a scheme.
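To make the reasoning concrete, here is a minimal sketch of why one file per message shrinks the incremental backup. It assumes a simplified model in which an incremental backup picks up every file modified since the previous backup; the `FileInfo` record, the `incremental_size` helper, and the file sizes are all made up for illustration and are not taken from Bacula.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    name: str
    size: int     # bytes
    mtime: float  # modification time, seconds since the epoch


def incremental_size(files, last_backup_time):
    """Total bytes picked up by a backup that selects files by mtime."""
    return sum(f.size for f in files if f.mtime > last_backup_time)


MB = 1024 * 1024
last_backup = 1000.0

# One monolithic folder file: a single new message modifies the whole file.
monolithic = [FileInfo("Inbox.dbx", 1400 * MB, mtime=2000.0)]

# The same mail as one file per message: only the new message is newer
# than the last backup (hypothetical 50KB messages).
per_message = [FileInfo(f"msg{i}.eml", 50 * 1024, mtime=500.0) for i in range(3)]
per_message.append(FileInfo("new.eml", 50 * 1024, mtime=2000.0))

print(incremental_size(monolithic, last_backup))   # the whole dbx file
print(incremental_size(per_message, last_backup))  # just the new message
```

In this model the monolithic layout costs the full size of the dbx file every day, while the per-message layout costs only the size of whatever arrived since the last run.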

It took me quite a while to find a command line tool that can extract eml files from dbx files, and is free. Searching Google for "extract eml dbx" or "convert eml dbx" brings up a lot of links to shareware tools, and most of these cannot be used from a script.

I tried using tools like xdelta to build and back up binary delta files, but this proved to be problematic - all the tools I tried required too much memory and took so long to run that they were impractical.

I even started toying with the idea of writing such a tool. Finally, during my search for the OE dbx file format specification, I found DbxConv - a nice little utility that does exactly what I wanted it to do.

The complete solution is a bit more complex than just running DbxConv. Before every backup job, the Bacula Director Daemon instructs the Bacula File Daemon on my wife's computer to run a VB script (available here), as specified in /etc/bacula/bacula-dir.conf:

Job {
...
ClientRunBeforeJob = "c:/windows/system32/cscript.exe c:/backup/tools/run-before-job.vbs %n"
...
}

This script attempts to shut down Outlook Express, calls DbxConv to extract eml messages from the dbx files to a scratchpad folder, and then uses cygwin's rsync utility to synchronize the content of the scratchpad folder with an eml storage folder that is marked for backup in the Bacula Director's configuration. The scratchpad folder is then erased, and the backup process continues.
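The synchronization step is what keeps the incremental backup small, so it is worth sketching. The following is not the VB script or rsync itself, but a rough Python stand-in for what the rsync step accomplishes: files in the storage folder are rewritten only when their content differs from the fresh extraction, so unchanged messages keep their old timestamps. The function name `sync_folders` and the returned counters are hypothetical, and deleting messages that no longer appear in the scratchpad is an assumption about the desired behavior.

```python
import filecmp
import shutil
from pathlib import Path


def sync_folders(scratch: Path, storage: Path) -> dict:
    """Make storage mirror scratch, touching only files that differ."""
    stats = {"copied": 0, "deleted": 0, "unchanged": 0}
    storage.mkdir(parents=True, exist_ok=True)

    # Copy messages that are new or whose content has changed.
    for src in scratch.rglob("*"):
        if not src.is_file():
            continue
        dst = storage / src.relative_to(scratch)
        if dst.is_file() and filecmp.cmp(src, dst, shallow=False):
            stats["unchanged"] += 1  # left alone, so its mtime stays old
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        stats["copied"] += 1

    # Assumption: messages missing from the fresh extraction are removed.
    for dst in list(storage.rglob("*")):
        if dst.is_file() and not (scratch / dst.relative_to(storage)).is_file():
            dst.unlink()
            stats["deleted"] += 1

    return stats
```

Calling `sync_folders(Path("c:/backup/scratch"), Path("c:/backup/eml"))` (both paths are placeholders) after each extraction would leave the storage folder with only the new and changed eml files carrying fresh modification times.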

This process requires extra free disk space of about twice the size of the OE storage folder, but that is a small price to pay compared to the daily savings in backup disk space.


5 comments:

  1. EASIER -- In Outlook Express, just create a new folder. Label it "Old" or "Pre-2008" or whatever you want. Drag all the old messages into it (leave a few months of the current ones). Do the File-Folder-Compact process to compact the original folder (effectively deleting from it all the messages that you moved to the new folder). And tadaaa, you now have an unchanging archive that won't need updating, and a new smaller folder that will change with all the new email that comes in. Much simpler to do.

  2. True. Both simpler and easier.

    Still, it's a manual process that has to be repeated each month or so, in order to keep the size of the mailbox close to constant.

    I'd rather have a robust automated process, that doesn't require my intervention.

    As for compacting folders - I had some bad experience with it.

    There's also the issue of size: a few months' worth of messages in the Inbox folder amounts to a few hundred megabytes of data to back up daily. With my current (admittedly elaborate) setup, the size of an incremental backup is at least ten times smaller.

    I'm considering using duplicity, but it'll take some time before I get to it.

    Bottom line though, is that I'm a geek bent on doing things my own way, with enough programming skills to back my personality disorders ;-)

  3. AND IT DOESN'T WORK!

    I actually tried it - I moved all of my wife's Inbox content to an archive folder, and did a full backup.

    I expected the next incremental backup job to only back up the Inbox folder dbx file, and ignore the already backed-up archive dbx file.

    But OE does update the archive dbx, even if its contents haven't been modified - and I ended up with a 1.5 to 2 GB update each night.

    It was worth a try though.

  4. If backup of Outlook Express seems to be a problem, then Microsoft Outlook was a nightmare to back up.

    Luckily we found datamills edgesafe solution for our company. It is fast and dedicated to Outlook PST files.

  5. Microsoft Outlook is related to Outlook Express only by name. It's a completely different animal.

    Unlike OE, Outlook is programmable - one can write a program to interact with it. So instead of reverse engineering its file formats, I'd probably write a Visual Basic or C# program to extract the messages (and other stuff).

    Note that Microsoft has released a backup tool for Outlook, a free download, that might be used as part of an incremental binary backup scheme using tools like xdelta or duplicity.
