http://blogname.blogspot.com/search?max-results=N with a large enough N, e.g. 10000. This is no longer true: at the moment I can only retrieve the latest 42 posts on this blog, out of a total of 178. I'm not sure when this method stopped working, or, for that matter, if it ever really worked. The fact remains, however, that I have a blog that I want to back up, so I spent the better part of an evening figuring out how to do this properly.
My new Blogger backup script, shown below, uses the Google Data services API to export and download the blog archive in XML format, extracts the links of all the posts from that archive, and then mirrors these pages locally with HTTrack:
#! /bin/bash
BLOGGER_EMAIL=user@gmail.com
BLOGGER_PASSWD=password
BLOGGER_BLOGID=000000000000000000
BLOGGER_BLOG=blogname
DEST_DIR=/path/to/backup/directory/
mkdir -p "${DEST_DIR}"
# stop here if the backup directory cannot be entered
cd "${DEST_DIR}" || exit 1
# keep only the value of the Auth= line, without eval'ing the server response
Auth=$( \
curl -s "https://www.google.com/accounts/ClientLogin" \
--data-urlencode "Email=$BLOGGER_EMAIL" --data-urlencode "Passwd=$BLOGGER_PASSWD" \
-d accountType=GOOGLE \
-d source=MachineCycle-cURL-BlogBackup \
-d service=blogger | sed -n 's/^Auth=//p'
)
curl -s "http://www.blogger.com/feeds/$BLOGGER_BLOGID/archive" \
--header "Authorization: GoogleLogin auth=$Auth" \
--header "GData-Version: 2" \
| xml_pp > ${BLOGGER_BLOG}.blogspot.com.archive.xml
grep -o -e '<link href="http://'$BLOGGER_BLOG'.blogspot.com/..../[^\.]*.html" rel="alternate" title=' \
${BLOGGER_BLOG}.blogspot.com.archive.xml | \
sed -e 's@.link href="@@g' -e 's@" rel="alternate" title=@@g' | \
sort -ur > ${BLOGGER_BLOG}.links
mkdir -p "${BLOGGER_BLOG}"
cd "${BLOGGER_BLOG}" || exit 1
httrack \
-%v0 \
-%e0 \
-X0 \
--verbose \
--update \
-%L ../${BLOGGER_BLOG}.links \
-"http://${BLOGGER_BLOG}.blogspot.com/" \
-"${BLOGGER_BLOG}.blogspot.com/*widgetType=BlogArchive*" \
-"${BLOGGER_BLOG}.blogspot.com/search*" \
-"${BLOGGER_BLOG}.blogspot.com/*_archive.html*" \
-"${BLOGGER_BLOG}.blogspot.com/feeds/*" \
-"${BLOGGER_BLOG}.blogspot.com/*.html?showComment=*" \
+"*.gif" \
+"*.jpg" \
+"*.png"
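As a rough illustration of the link-extraction step on its own, here is the same grep/sed/sort pipeline run against a made-up archive fragment. The sample entries and the extract_links helper are placeholders for illustration only. They are not taken from a real export, and a real archive may well contain link elements this sketch doesn't anticipate:

```shell
#!/bin/bash
# Sketch: pull post permalinks out of a Blogger export.
# "blogname" and the sample entries below are made up for illustration.
BLOGGER_BLOG=blogname

# Hypothetical helper: the grep/sed/sort pipeline from the script,
# reading the archive XML from stdin instead of from a file.
extract_links() {
    grep -o -e '<link href="http://'"$BLOGGER_BLOG"'.blogspot.com/..../[^\.]*.html" rel="alternate" title=' \
    | sed -e 's@.link href="@@g' -e 's@" rel="alternate" title=@@g' \
    | sort -ur
}

# Two distinct posts (one of them listed twice) and one feed link
# that must not be picked up.
sample='<link href="http://blogname.blogspot.com/2009/01/first-post.html" rel="alternate" title="First"/>
<link href="http://blogname.blogspot.com/2009/03/second-post.html" rel="alternate" title="Second"/>
<link href="http://blogname.blogspot.com/2009/01/first-post.html" rel="alternate" title="First"/>
<link href="http://blogname.blogspot.com/feeds/posts/default" rel="self"/>'

links=$(printf '%s\n' "$sample" | extract_links)
printf '%s\n' "$links"
```

Since sort -ur removes duplicates and sorts in reverse, each post URL should appear once, with the newer post first when the URLs start with the year and month.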
A few comments are in order:
- the script contains the Blogger username and password - keep it safe!
- the blog id is the number that appears in the URL of most links accessible from the Blogger dashboard, after the blogID= part
- the XML blog archive may later be used to restore/migrate the blog
- local mirroring isn't really necessary - I just like it that I can view the blog contents offline
- another unnecessary step: I use xml_pp to beautify the exported XML file
- currently, the script performs no error checking - I may add some checks if and when I observe failures
- sources: Using cURL to interact with Google Data services, Blogger export format
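
On the error-checking point, a minimal sketch of the kind of checks that could be added might look like the following. The helper names (require_cmds, check_auth, check_archive) are hypothetical and not part of the script above, and this is only one possible approach:

```shell
#!/bin/bash
# Sketch of basic sanity checks; all helper names here are hypothetical.

# Succeed only if every named command is available on PATH.
require_cmds() {
    local cmd
    local missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing required command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Succeed only if the login step produced a non-empty token.
check_auth() {
    if [ -z "$1" ]; then
        echo "login failed: no Auth token in the response" >&2
        return 1
    fi
}

# Succeed only if the downloaded archive exists and is not empty.
check_archive() {
    if [ ! -s "$1" ]; then
        echo "archive is missing or empty: $1" >&2
        return 1
    fi
}

# Possible placement inside the backup script (left commented out here):
# require_cmds curl xml_pp httrack || exit 1
# check_auth "$Auth" || exit 1
# check_archive "${BLOGGER_BLOG}.blogspot.com.archive.xml" || exit 1
```

Each helper only reports the problem and returns a non-zero status, so the caller decides whether to abort with exit 1 or carry on.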