Tuesday, June 17, 2008

UnDBX: Extract E-Mail Messages from Outlook Express DBX files

If you can't beat them, join them.

Here's my tiny, hopefully half decent, contribution to the FOSS universe: UnDBX, a command-line utility that I've developed to extract e-mail messages from Outlook Express DBX files.

There are many such utilities around, so why write another one? Because I had to.

As I described on this blog some time ago, I used to back up my wife's mailboxes with a combination of DbxConv and rsync, launched from a VB script run by the Bacula file daemon (phew!). This allowed me to back up a few megabytes of data a day (i.e. just the new messages), instead of several gigabytes (i.e. a bunch of very large monolithic DBX files). The objective was to save precious disk space on my backup device (an external USB hard disk). The price was a complicated backup scheme, wasted disk space on my wife's PC, and long backup jobs (more than 3 hours every night!).

This backup scheme failed mysteriously several times. Debugging it was a real pain, simply because a backup job takes so long to complete. Almost three months ago I finally decided to stop using it and back up the gigantic DBX files directly, until I could come up with a better solution.

My original intent was to add an incremental extraction option to DbxConv, so that it would only extract to disk e-mail messages that hadn't been extracted yet. That would make the extraction process much shorter, and also save disk space, because a scratch folder would no longer be needed. As I browsed through the DbxConv source code I realized that I couldn't modify it, because it uses MFC, and MFC is not available in MinGW, which is the toolchain I have available in Debian.

The solution? UnDBX - the DBX extraction tool.

I ported the DbxConv DBX parsing code from C++ with MFC to plain C, and wrote a main function that extracts messages from all the DBX files in a specified folder to a sub-folder of a given output folder. The first run works very much like DbxConv - all messages are extracted to disk as EML files. Subsequent runs only extract new messages to disk, and also delete EML files on the disk that do not correspond to messages in the DBX files (i.e. deleted messages).
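
To make the incremental idea concrete, here is a rough sketch in Python (not the actual UnDBX code, which is plain C). It assumes each message can be mapped to a stable EML file name. The names plan_sync and sync, and the dict of messages standing in for the output of a DBX parser, are made up for illustration.

```python
import os
import tempfile


def plan_sync(messages, out_dir):
    """Decide what to extract and what to delete for one mailbox.

    messages: dict mapping a (hypothetical) stable EML file name to the
              message bytes, standing in for what a DBX parser would yield.
    out_dir:  folder holding the EML files left by earlier runs.
    """
    on_disk = {f for f in os.listdir(out_dir) if f.endswith(".eml")}
    in_dbx = set(messages)
    to_extract = sorted(in_dbx - on_disk)  # in the mailbox, not yet on disk
    to_delete = sorted(on_disk - in_dbx)   # on disk, no longer in the mailbox
    return to_extract, to_delete


def sync(messages, out_dir):
    """Bring out_dir in line with the mailbox, touching only the differences."""
    os.makedirs(out_dir, exist_ok=True)
    to_extract, to_delete = plan_sync(messages, out_dir)
    for name in to_extract:
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(messages[name])
    for name in to_delete:
        os.remove(os.path.join(out_dir, name))
    return to_extract, to_delete
```

On the first run the output folder is empty, so everything gets extracted. On later runs only the set differences are written or removed, and the untouched EML files keep their timestamps, which is what keeps the subsequent backup small.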

Unlike DbxConv, UnDBX cannot convert DBX files to MBOX files - its sole purpose is to facilitate fast incremental backup of DBX files.

Backup jobs are down to 8 minutes! That's with 14 DBX files, over 35000 messages, and 3.5GB of data - a nightmare. I hope some of you will find it useful too. Enjoy.
