Two days in a row now I’ve had to visit the physical library to retrieve an old paper. Makes me feel very authentic as an academic. Our library has free scanning facilities, but the resulting PDF will have a couple of problems. If I’m scanning a book, then each page of the PDF actually contains two pages of the book. Depending on the scanner settings, I might also accidentally have my two pages running vertically instead of horizontally. Finally, if I forget to set the color settings on the scanner, then I get a low-contrast color image instead of a high-contrast monochrome scan.
Here’s a preview of the PDF of an article from a book I scanned that has all these problems:
If this PDF is in input.pdf, then I call the following commands to create output.pdf:

    pdfimages input.pdf .scan
    mogrify -format png -monochrome -rotate 90 -crop 50%x100% .scan*
    convert +repage .scan*png output.pdf
    rm .scan*

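The same steps can be wrapped in a small shell function. This is only a sketch under a few assumptions: bash, with Poppler's pdfimages and ImageMagick's mogrify and convert on the PATH. The name split_scan and its optional rotation argument are hypothetical, made up for illustration. It works in a scratch directory instead of writing .scan* files next to the PDF, and it places +repage after the input images, where it applies to every image read so far.

```shell
#!/usr/bin/env bash
# Sketch: split a two-pages-per-sheet scan into single monochrome pages.
# split_scan is a hypothetical helper name, not part of any tool.
# Usage: split_scan INPUT.pdf OUTPUT.pdf [ROTATION_DEGREES]
split_scan() {
  local input=$1 output=$2 rotation=${3:-90}
  local tmp status

  if [ ! -f "$input" ]; then
    echo "split_scan: no such file: $input" >&2
    return 1
  fi

  # Scratch directory, so no intermediate files land in the working directory.
  tmp=$(mktemp -d) || return 1

  # 1. pdfimages extracts the raw page images as scan-000.*, scan-001.*, ...
  # 2. mogrify writes monochrome PNGs, rotated and cut into two half-width
  #    tiles per image (scan-000-0.png, scan-000-1.png, ...).
  # 3. convert drops the crop offsets (+repage) and bundles the tiles into
  #    one PDF in glob order (left tile first, assuming a left-to-right book).
  pdfimages "$input" "$tmp/scan" &&
    mogrify -format png -monochrome -rotate "$rotation" -crop 50%x100% "$tmp"/scan-* &&
    convert "$tmp"/scan-*.png +repage "$output"
  status=$?

  rm -rf "$tmp"
  return "$status"
}
```

After sourcing the file, `split_scan input.pdf output.pdf` mirrors the commands above, and `split_scan input.pdf output.pdf 0` skips the rotation for scans whose pages already sit side by side.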
I’m pretty happy with the output. There are some speckles, but the simple -monochrome flag does a fairly good job.
I use Adobe Acrobat Pro to run OCR so that the text is selectable (I haven’t found a good command-line solution for that yet).
Note: I think the -rotate 90 is needed because the images are stored rotated by -90 degrees, and input.pdf rotates them back when it composites each page. This hints that this script won’t generalize to complicated PDFs. But we’re safe here because a scanner will probably apply the same transformation to each page.