Friday, December 12, 2008

Cropping Pages in Scanned PDF Files

Here's a script for scanned PDF documents whose pages are surrounded by an annoying black margin. It extracts all page images from the file, crops each one to the same geometry, and then joins the results back into a single PDF:

#!/bin/bash
# usage:
#   pdf-crop.sh path/to/file.pdf geometry
# see 'man convert' for geometry syntax (example: 100%x90%+750)

# Use a fresh temporary directory so leftovers from earlier runs are not picked up.
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT

echo "Extracting images..."
pdfimages -j "$1" "$tmpdir/image"

echo "Cropping images..."
# Collect the cropped pages in an array; names with spaces stay intact without eval.
pages=()
while IFS= read -r file; do
    pdffile="${file}.pdf"
    convert "$file" -crop "$2" "$pdffile"
    pages+=("$pdffile")
done < <(find "$tmpdir" \( -name "image-*.pbm" -o -name "image-*.ppm" -o -name "image-*.jpg" \) | sort)

echo "Joining images..."
# The output lands next to the input, with .pdf replaced by .cropped.pdf.
pdfjoin --outfile "${1/%.pdf/.cropped.pdf}" "${pages[@]}"
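
The output name comes from bash's `${var/%pattern/replacement}` expansion, which only replaces the pattern when it matches at the end of the value. Here is a small sketch of how that behaves. The helper name `cropped_name` and the sample file names are made up for illustration and are not part of the script:

```shell
#!/bin/bash
# cropped_name is a hypothetical helper, not part of the original script.
cropped_name() {
    # Replace a trailing ".pdf" with ".cropped.pdf"; anything else is left alone.
    printf '%s\n' "${1/%.pdf/.cropped.pdf}"
}

cropped_name "my scans/report 2008.pdf"   # prints: my scans/report 2008.cropped.pdf
cropped_name "report.PDF"                 # prints: report.PDF (no match, name unchanged)
```

Note the second case: the match is case-sensitive by default, so a file ending in `.PDF` would give an output name identical to the input name. It is probably worth checking for that before joining.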

The script depends on pdfimages, convert and pdfjoin:
aptitude install xpdf-utils imagemagick pdfjam
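
If you'd rather have the script complain early than fail halfway through, a check along these lines could go near the top. This is only a sketch: `missing_tools` is a name I made up, and it relies on nothing beyond the shell builtin `command -v`:

```shell
#!/bin/bash
# missing_tools is a hypothetical helper: it prints every named command
# that cannot be found on PATH, one per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools pdfimages convert pdfjoin)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are printed on a single line.
    echo "Missing tools:" $missing >&2
fi
```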

And just in case you're wondering - the script started out simple, but there are spaces in the name of the PDF file that I used for testing, which turned out to be rather tricky to handle.
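
To see why spaces are a nuisance, here is a tiny experiment (a sketch with made-up file names, and `count_args` is just a throwaway helper). A bash array expanded inside double quotes keeps each name as one argument, while an unquoted expansion gets split at every space:

```shell
#!/bin/bash
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

files=("scan page 1.pdf" "scan page 2.pdf")

# Quoted array expansion: every element is passed as exactly one argument.
quoted=$(count_args "${files[@]}")

# Unquoted expansion: word splitting cuts each name at its spaces.
unquoted=$(count_args ${files[@]})

echo "quoted: $quoted, unquoted: $unquoted"   # prints: quoted: 2, unquoted: 6
```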
