Convert two-page color scan of book into monochrome single pdf

Two days in a row now I’ve had to visit the physical library to retrieve an old paper, which makes me feel very authentic as an academic. Our library has free scanning facilities, but the resulting PDF usually has a few problems. When I scan a book, each page of the PDF actually contains two pages of the book. Depending on the scanner settings, those two pages might also end up running vertically instead of horizontally. Finally, if I forget to set the color settings on the scanner, I get a low-contrast color image instead of a high-contrast monochrome scan.

Here’s a preview of a PDF of an article from a book I scanned that has all these problems:

[Image: scanned low-contrast color PDF]

If this pdf is in input.pdf then I call the following commands to create output.pdf:

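# Extract the embedded page images into files named .scan-000, .scan-001, ... (PPM/PBM format)
pdfimages input.pdf .scan
# Convert each image to a monochrome PNG, rotate it, and split it into left and right halves
mogrify -format png -monochrome -rotate 90 -crop 50%x100% .scan*
# Combine the half-page PNGs, in filename order, into a single PDF
convert +repage .scan*png output.pdf
# Remove the temporary files
rm .scan*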

[Image: output monochrome PDF]

I’m pretty happy with the output. There are some speckles, but the simple -monochrome flag does a fairly good job.
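For repeated use, the same steps could be wrapped in a small shell function. This is only a sketch: the name split_scan, the optional angle argument, and the use of a temporary directory (instead of .scan* files in the current directory) are my assumptions, not part of the commands above.

```shell
# split_scan INPUT.pdf OUTPUT.pdf [ANGLE]
# Hypothetical wrapper around the same pdfimages / mogrify / convert steps,
# working in a temporary directory. ANGLE defaults to 90.
split_scan() {
    local input="${1:-}" output="${2:-}" angle="${3:-90}"
    local tmp status
    if [ ! -f "$input" ] || [ -z "$output" ]; then
        echo "usage: split_scan INPUT.pdf OUTPUT.pdf [ANGLE]" >&2
        return 2
    fi
    tmp=$(mktemp -d) || return 1
    # Extract the page images, then convert, rotate, and halve them,
    # then bundle the PNG halves into one PDF; stop at the first failure.
    pdfimages "$input" "$tmp/scan" &&
        mogrify -format png -monochrome -rotate "$angle" -crop 50%x100% "$tmp"/scan* &&
        convert +repage "$tmp"/scan*png "$output"
    status=$?
    rm -rf "$tmp"   # clean up even if a step failed
    return "$status"
}
```

With the function loaded, `split_scan input.pdf output.pdf` should behave like the original commands, and `split_scan input.pdf output.pdf 0` would skip the rotation for scans that are already upright.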

I use Adobe Acrobat Pro to run OCR so that the text is selectable (I haven’t found a good command-line solution for that yet).

Note: I think the -rotate 90 is needed because the images are stored rotated by -90 degrees but the input.pdf is compositing them after rotation. This hints that this script won’t generalize to complicated pdfs. But we’re safe here because a scanner will probably apply the same transformation to each page.
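One rough way to check whether the rotation is needed is to look at the dimensions of the extracted images before running mogrify. The sketch below assumes that a two-page spread should be wider than it is tall, so a portrait image needs a quarter turn. The function names rotation_for and report_rotation are hypothetical, and dimensions alone cannot tell a 90 degree turn from a 270 degree one, so the result still needs a visual check.

```shell
# rotation_for WIDTH HEIGHT: print a suggested angle for -rotate.
# Assumption: a two-page spread should be landscape, so a portrait
# image (taller than wide) needs a quarter turn.
rotation_for() {
    if [ "$2" -gt "$1" ]; then
        echo 90
    else
        echo 0
    fi
}

# report_rotation FILE...: print each image's size and suggested angle.
report_rotation() {
    local f dims w h
    for f in "$@"; do
        # ImageMagick's identify prints width and height in pixels here
        dims=$(identify -format '%w %h' "$f") || return 1
        w=${dims% *}
        h=${dims#* }
        echo "$f: ${w}x${h}, suggested -rotate $(rotation_for "$w" "$h")"
    done
}
```

After the pdfimages step, `report_rotation .scan*` would list each extracted image with its size. If every page reports the same angle, that supports the guess that the scanner applied one transformation throughout.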


One Response to “Convert two-page color scan of book into monochrome single pdf”

  1. jdumas says:

    Have you tried -auto-orient instead of -rotate 90 if the images are stored rotated?

    For the cleaning up part, I recently came across this cool imagemagick one-liner (it might not work as well on scanned documents though):
    https://gist.github.com/lelandbatey/8677901

    About the OCR, there are some open-source pipelines such as this one:
    https://github.com/danielquinn/paperless
    But I don’t know how they compare to Adobe Acrobat.
