Friday, April 29, 2011

Backup a Blogger Blog - Revisited

I recently found out that the method I used to back up this blog automatically has stopped working. It hinged on the observation that one could retrieve a single web page containing the full text of all the posts of a given Blogger blog, by requesting the link
http://blogname.blogspot.com/search?max-results=N
with a large enough N, e.g. 10000. This is no longer the case - at the moment I can only retrieve the latest 42 posts on this blog, out of a total of 178.
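
To see how many posts the old trick actually returns, one can count the post titles in the page it serves. A quick sketch - the post-title class name below matches the default Blogger templates, so it may need adjusting for customized ones:

# count post titles in the page served by the old max-results trick;
# "class='post-title" is an assumption based on the default Blogger templates
curl -s "http://blogname.blogspot.com/search?max-results=10000" \
    | grep -o "class='post-title" | wc -l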

I'm not sure when this method stopped working, or, for that matter, if it ever really worked. The fact remains, however, that I have a blog that I want to back up, so I spent the better part of an evening figuring out how to do this properly.

My new Blogger blog backup script, shown below, uses the Google Data services API to export and download the blog archive in XML format, extracts from it the links of all the posts, and then mirrors these pages locally with HTTrack:


#! /bin/bash

# Blogger account credentials and blog identifiers (see the notes below)
BLOGGER_EMAIL=user@gmail.com
BLOGGER_PASSWD=password
BLOGGER_BLOGID=000000000000000000
BLOGGER_BLOG=blogname

# where the backup is stored
DEST_DIR=/path/to/backup/directory/
mkdir -p ${DEST_DIR}
cd ${DEST_DIR}

# log in to the Google ClientLogin service; the response contains an
# "Auth=<token>" line, which the eval below turns into a shell variable $Auth
eval $( \
    curl -s "https://www.google.com/accounts/ClientLogin" \
    --data-urlencode Email=${BLOGGER_EMAIL} --data-urlencode Passwd=${BLOGGER_PASSWD} \
    -d accountType=GOOGLE \
    -d source=MachineCycle-cURL-BlogBackup \
    -d service=blogger | grep 'Auth='
)

# export the full blog archive as a single XML document, and pretty-print
# it with xml_pp (from XML::Twig) for readability
curl -s "http://www.blogger.com/feeds/$BLOGGER_BLOGID/archive" \
    --header "Authorization: GoogleLogin auth=$Auth" \
    --header "GData-Version: 2" \
    | xml_pp > ${BLOGGER_BLOG}.blogspot.com.archive.xml

# pull the permalink of every post out of the archive into a link list
grep -o -e '<link href="http://'$BLOGGER_BLOG'.blogspot.com/..../[^\.]*\.html" rel="alternate" title=' \
    ${BLOGGER_BLOG}.blogspot.com.archive.xml | \
    sed -e 's@.link href="@@g' -e 's@" rel="alternate" title=@@g' | \
    sort -ur > ${BLOGGER_BLOG}.links

# mirror every post in the link list with HTTrack; the filters below skip
# archive, search, feed and comment-permalink pages, while still fetching
# embedded images
mkdir -p ${BLOGGER_BLOG}
cd ${BLOGGER_BLOG}
httrack \
    -%v0 \
    -%e0 \
    -X0 \
    --verbose \
    --update \
    -%L ../${BLOGGER_BLOG}.links \
    -"http://${BLOGGER_BLOG}.blogspot.com/" \
    -"${BLOGGER_BLOG}.blogspot.com/*widgetType=BlogArchive*" \
    -"${BLOGGER_BLOG}.blogspot.com/search*" \
    -"${BLOGGER_BLOG}.blogspot.com/*_archive.html*" \
    -"${BLOGGER_BLOG}.blogspot.com/feeds/*" \
    -"${BLOGGER_BLOG}.blogspot.com/*.html?showComment=*" \
    +"*.gif" \
    +"*.jpg" \
    +"*.png"

A few comments are in order:
  1. the script contains the Blogger username and password - keep it safe! (a sketch of keeping the credentials in a separate file follows this list)
  2. the blog id is the number that appears after the blogID= part in the URL of most links accessible from the Blogger dashboard
  3. the XML blog archive may later be used to restore or migrate the blog
  4. local mirroring isn't really necessary - I just like being able to view the blog contents offline
  5. another unnecessary step: I use xml_pp to beautify the exported XML file
  6. currently, the script performs no error checking - I may add some checks if and when I observe failures (a minimal sketch is shown after this list)
  7. sources: Using cURL to interact with Google Data services, Blogger export format
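
Regarding the first point, one way to reduce the risk is to keep the credentials in a separate file, readable only by its owner, and source it from the script. A minimal sketch, with a made-up file name:

# ~/.blogger-backup.rc - protect it with: chmod 600 ~/.blogger-backup.rc
# contents:
#     BLOGGER_EMAIL=user@gmail.com
#     BLOGGER_PASSWD=password

# in the backup script, replace the hard-coded credentials with:
. ~/.blogger-backup.rc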

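As for error checking, here's a minimal sketch of the kind of checks I have in mind: make the script abort on the first failing command, and verify that an authentication token was actually obtained before using it:

#! /bin/bash
set -e                          # abort on the first failing command

# ... after the ClientLogin request:
if [ -z "$Auth" ]; then
    echo "ClientLogin failed - check email/password" >&2
    exit 1
fi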