Friday, January 30, 2009

Back Up a Blogger Blog

I've decided to back up this blog because I'd hate to lose the stuff I've accumulated here.

The Google OS Blog suggests two options for blog backup:
  1. save the contents of the following link, together with images:

  2. save the contents of the following URLs (posts and comments) as XML files:
(replace blogname with your blog name, and N with the number of posts - or a really large number, say 10000).

While the second option seems to be more useful for machine consumption and has the benefit of saving the comments too, it does not provide a full backup - at least for this blog. I can't get all the posts from that URL. According to the readers' comments to the Google OS post, this seems to be a problem for others too.
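One way to see how many posts a saved feed actually contains is to count its entries. This is only a rough sketch, assuming the feed was saved as an Atom XML file in which every post is an `<entry>` element; the function name `count_entries` and the file name `posts.xml` are made up for illustration.

```shell
#!/bin/sh
# count_entries FILE: print the number of <entry> elements in a saved Atom feed.
# Assumption: each post (or comment) appears as exactly one <entry ...> element.
count_entries() {
    # grep -o prints every match on its own line, so wc -l counts matches
    # even when the whole feed is stored on a single line.
    grep -o '<entry[ >]' "$1" | wc -l | tr -d ' '
}

# Usage (hypothetical file name):
#   count_entries posts.xml
```

If the number printed is smaller than the number of posts on the blog, the feed is incomplete.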

Another comment pointed me to a relevant post on Lifehacker, which pointed me to HTTrack - the Website Copier.

The first step should be obvious - install it like this:
aptitude install httrack

A simple-minded experiment convinced me that this method needs some tweaking: HTTrack followed every link on the blog, saving the same files under different names over and over again. I stopped the mirroring process after more than 100 megabytes of data had been downloaded...
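To confirm that a mirror really contains the same files saved under different names, the files can be grouped by checksum. This is a minimal sketch, assuming GNU `md5sum` and GNU `uniq` are installed (as they are on Debian); the function name `list_duplicates` is hypothetical.

```shell
#!/bin/sh
# list_duplicates DIR: print groups of files under DIR whose contents are identical.
# Assumption: GNU md5sum and GNU uniq are available.
list_duplicates() {
    # md5sum prints a 32-character hash followed by the file name; comparing
    # only the first 32 characters groups files that have the same content.
    find "$1" -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
}

# Usage (hypothetical directory name):
#   list_duplicates ./mirror
```

Each group of lines in the output is one set of identical files; files with unique content are not listed.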

After some more experimentation, I decided to start the mirroring process at the link provided in the first backup tip from Google OS, together with some filtering (I also disable HTTrack's animated progress messages, so that I can run this from a cron job):
httrack -%v0 --verbose --update "" \
-"" \
-"*widgetType=BlogArchive*" \
-"*" \
-"*_archive.html*" \
-"*" \
-"*.html?showComment=*" \
+"*.gif" \

This way I back up one page with all the posts, and one page per post with its associated comments. This is good enough, as I don't really care about restoring the blog - I just want the contents saved.
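For the cron job, a small wrapper that runs the mirroring command in a fixed directory and keeps a log can be handy. The following is just a sketch under my own assumptions: the function name `run_backup`, the log file name `backup.log` and the directory in the usage comment are hypothetical, and it assumes HTTrack writes the mirror into the current directory when no output path is given.

```shell
#!/bin/sh
# run_backup DIR COMMAND [ARGS...]: run COMMAND inside DIR and append its
# output to DIR/backup.log. Returns the exit status of COMMAND.
run_backup() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    # The subshell keeps the caller's working directory unchanged.
    (
        cd "$dir" || exit 1
        echo "=== backup started: $(date) ==="
        "$@"
    ) >> "$dir/backup.log" 2>&1
}

# Usage (hypothetical directory; pass the httrack command and filters shown above):
#   run_backup "$HOME/blog-backup" httrack ...
```

Because all output goes to the log file, cron has nothing to mail unless the script itself reports a failure.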
