Posts Tagged ‘pdf’

Convert two-page color scan of book into monochrome single pdf

Friday, November 25th, 2016

Two days in a row now I’ve had to visit the physical library to retrieve an old paper. Makes me feel very authentic as an academic. Our library has free scanning facilities, but the resulting PDF will have a couple problems. If I’m scanning a book then each page of the pdf actually contains 2 pages of the book. Depending on the scanner settings, I might also accidentally have my 2 pages running vertically instead of horizontally. Finally, if I forgot to set the color settings on the scanner, then I get a low-contrast color image instead of a high-contrast monochrome scan.

Here’s a preview of pdf of an article from a book I scanned that has all these problems: scanned low contrast color pdf

If this pdf is in input.pdf then I call the following commands to create output.pdf:

pdfimages input.pdf .scan
mogrify -format png -monochrome -rotate 90 -crop 50%x100% .scan*
convert +repage .scan*png output.pdf
rm .scan*

output monochrome pdf

I’m pretty happy with the output. There are some speckles, but the simple -monochrome flag does a fairly good job.

I use Adobe Acrobat Pro to run OCR so that the text is selectable (haven’t found a good command line solution for that, yet).

Note: I think the -rotate 90 is needed because the images are stored rotated by -90 degrees but the input.pdf is compositing them after rotation. This hints that this script won’t generalize to complicated pdfs. But we’re safe here because a scanner will probably apply the same transformation to each page.

Rasterize everything in pdf except text

Wednesday, October 19th, 2016

I had an issue including a PDF with transparency as a subfigure to another PDF. This lead me down a dark path of trying to rasterize everything in a pdf except for the text. I tried rasterizing everything and just running OCR on top of the text but OCR-ized selection is weird and the text recognition wasn’t perfect. Not to mention that would have been a really round about way to solve this.

Here’s the insane pipeline I settled on:

  • open the PDF in illustrator
  • save as input.svg, under options “use system fonts”,
  • run ./rasterize-everything-but-text.sh input.svg output.svg (see below)
  • open output.svg in illustrator, save as raster-but-text.pdf

The bash script ./rasterize-everything-but-text.sh is itself an absurd, likely very fragile text manipulation and rasterization of the .svg files:

#!/bin/bash
#
# Usage:
#
#     rasterize-everything-but-text.sh input.svg output.svg
#
input="$1"
output="$2"
# suck out header from svg file
header=`dos2unix < $input | tr '\n' '\00' | sed 's/\(.*<svg[^<]*>\).*/\1/' | tr '\00' '\n'`
# grab all text tags
text=`cat $input | grep     "<text.*"`
# create svg file without text tags
notextsvg="no-text.svg"
notextpng="no-text.png"
cat $input | grep  -v "<text.*" > $notextsvg
# convert to png
rsvg-convert -h 1000 $notextsvg > $notextpng
# convert back to svg (containing just <image> tag)
rastersvg="raster.svg"
convert $notextpng $rastersvg
# extract body (image tag)
body=`dos2unix < $rastersvg | tr '\n' '\00' | sed 's/\(.*<svg[^<]*>\)\(.*\)<\/svg>/\2/' | tr '\00' '\n'`
# piece together original header, image tag, and text
echo "$header
$body
$text
</svg>" > "$output"
# Fix image tag to have same size as document
dim=`echo "$header" | grep -o 'width=".*" height="[^"]*"' | tr '"' "'"`
sed -i '' "s/\(image id=\"image0\" \)width=\".*\" height=\"[^\"]*\"/\1$dim/" $output

Create a low resolution pdf of a LaTeX paper

Thursday, April 21st, 2016

Here’re the steps I use to create a low resolution version of a pdf created by LaTeX.

The simple thing to do is to follow the steps to create a camera ready (high res) pdf but replace the JPG settings to downsample the images and perhaps reduce the quality.

If you have a lot of raster images, this will work OK.

However, if you have a lot of vector graphics images or just a lot of images, this will not truly bring down the size of the final pdf.

For this I make sure that all of my \includegraphics commands in LaTeX are of the form:

\includegraphics[width=\linewidth]{\figs/figure-file-name}

Notice three things:

  1. the width is explicitly specified,
  2. the figure directory is a command/macro \figs, and
  3. the filename does not have an extension (figure-file-name instead of figure-file-name.pdf)

Then at the beginning of my file I have defined \figs. For the normal high resolution pdf I use:

\newcommand{\figs}{}
\def\figs/{figs/}

When I switch to low resolution I use

\newcommand{\figs}{}
%\def\figs/{figs/}
\def\figs/{figs/low-res/}

And before compiling I run the commands:

mkdir -p figs/low-res/
mogrify -colorspace RGB -resize 400x -background white -alpha remove -quality 100 -format jpg -path figs/low-res/ figs/*.{pdf,png}[0]

There are a couple import flags here to ensure that the background is white and only a single output is created for pdfs with accidentally multiple art-boards/pages.

Make the most recent tex document in the current directory and open it

Wednesday, January 13th, 2016

Here’s a little bash script to compile (pdflatex, bitex, 2*pdflatex,etc.) the most recent .tex file in your current directory that contains begin{document} (i.e. the main document):

#!/bin/bash
if [ -z "$LMAKEFILE" ]; then
  echo "Error: didn't find LMAKEFILE environment variable"
  exit 1
fi
TEX=$( \
  grep -Il1dskip "begin{document}" *.tex | \
  xargs stat -f "%m %N" | \
  sort -r | \
  head -n 1 | \
  sed -e "s/^[^ ]* //")
BASE="${TEX%.*}"
if [ -z "$TEX" ]; then
  echo "Error: Didn't find begin{document} in any .tex files"
  exit 1
fi
make -f $LMAKEFILE $BASE && open $BASE.pdf

Simply use it:

texmake

Save As Optimized PDF using Acrobat Pro via the command line

Saturday, January 2nd, 2016

Here’s a tremendously hacky way to automate the procedure of optimizing a PDF using Acrobat Pro (with default settings) from the command line. It’s an applescript sending mouse clicks and keyboard signals so don’t get too excited.

However, I’m doing this all the time and it will hopefully save clicking through menus.

#!/usr/bin/osascript
on run argv
    if (count of argv) < 2 then
        do shell script "echo " & "\"optimizepdf path/to/input.pdf simple-output-name\""
    else
        set p to item 1 of argv
        set out_name to item 2 of argv
        set abs to do shell script "[[ \"" & p & "\" = /* ]] && echo \"" & p & "\" || echo \"$PWD/\"" & p & "\"\""
        set a to POSIX file abs
        tell application "Adobe Acrobat Pro"
            activate
            open a
            tell application "System Events"
                click menu item "Optimized PDF..." of ((process "Acrobat")'s (menu bar 1)'s ¬
                    (menu bar item "File")'s (menu "File")'s ¬
                    (menu item "Save As")'s (menu "Save As"))
                tell process "Acrobat"
                    keystroke return
                    keystroke out_name
                    keystroke return
                    keystroke "r" using {command down}
                end tell
            end tell
            close document 1
        end tell
    end if
end run

Then you can run this with something like:

optimizepdf path/to/input.pdf simple-output-name

overwrite warning: this will overwrite the output file (and potentially files named similarly if the keystrokes fail or get garbled).

Oddly, it seems to work fastest if the input document is not already open in acrobat pro.

This code above is written for Acrobat Pro Version 10.1.16.

Update: Here’s a legacy version for Acrobat Pro Version 9.5.1

#!/usr/bin/osascript
on run argv
    if (count of argv) < 2 then
        do shell script "echo " & "\"optimizepdf path/to/input.pdf simple-output-name\""
    else
        set p to item 1 of argv
        set out_name to item 2 of argv
        set abs to do shell script "[[ \"" & p & "\" = /* ]] && echo \"" & p & "\" || echo \"$PWD/\"" & p & "\"\""
        set a to POSIX file abs
        tell application "Adobe Acrobat Pro"
            activate
            open a
            tell application "System Events"
                click menu item "PDF Optimizer..." of ((process "Acrobat")'s (menu bar 1)'s ¬
                    (menu bar item "Advanced")'s (menu "Advanced"))
                tell process "Acrobat"
                    keystroke return
                    keystroke out_name
                    keystroke return
                    keystroke "r" using {command down}
                end tell
            end tell
            close document 1
        end tell
    end if
end run

Remove prince annotation from pdf

Tuesday, November 10th, 2015

Here’s a little perl script to remove the prince watermark note from a pdf:

perl -p -e '!$x && s:/Annots \[[0-9]+ 0 R [0-9]+ 0 R ?([^\]]+)\]:/Annots [\1]: && ($x=1)' input.pdf > output.pdf

Update:

So, unfortunately the simple perl hack will “damage” the pdf. It seems that most viewers will ignore this, but I was alerted that a popular ipad reader “GoodReader” produces an ominous “This file is damaged” warning (though it then renders OK).

I couldn’t quite reverse engineer why, but here’s a temporary albeit heavy-handed fix. After running the perl script, also repair the pdf with ghostscript:

gs -o output.pdf  -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress input.pdf

Note that output.pdf cannot be the same as input.pdf or it quietly creates an empty pdf instead.

Update:

Even better you can directly remove the specific Annotation from prince:

perl -i -p -e 's:/(Rect \[572\.0000 716\.0000 597\.0000 741\.0000\]|Contents \(This document was created with Prince, a great way of getting web content onto paper.\))::' input-and-output.pdf

Update:

Or even even better better:

  perl -i -pe 'BEGIN{undef $/;} s:/Rect.*?Contents \(This document was created with Prince, a great way of getting web content onto paper.\)::smg' input-and-output.pdf

Remove annotations from a pdf

Wednesday, July 29th, 2015

I lost track of where I found this online and had to dig it out of my bash history:

perl -pi -e 's:/Annots \[[^]]+\]::g' input-and-output.pdf

Emoji in LaTeX documents

Tuesday, August 5th, 2014

Update: Hilariously, it turns out that either wordpress or my wordpress markdown plugin is choking on the emoji inserted into this post. Thus, to see the actually emoji commands see this plaintext version


We were recently joking around about using emoji in math equations. The idea was to satire of the bad rap exterior calculus symbols like the the Hodge star operator (★) and the “musical isomorphisms” (♭,♯) get in the computer graphics community.

I found a solution for upTeX. This works by first extracting all of the emojis as pdfs and then including the pdfs via (includegraphics) whenever a \coloremoji{...} command is found. This unfortunately did not work with my TexLive pdflatex setup. With some help, I’ve redesigned a coloremoji.sty stylesheet that allows you to directly include emoji in your LaTeX documents.

A Hello, EmojiWorld LaTeX document would look like this:

WordPress dies on emoji, see plaintext version

This produces something that should look like:

hello world emoji

You can also use emoji in mathmode:

WordPress dies on emoji, see plaintext version

alligator power integral math emoji latex

Download the coloremoji package and simply add \usepackage{coloremoji} to the top of your document.

Actually, Nobuyuki Umetani gave a talk where he used graphic icons in math (I believe successfully!) to explain Sensitive Couture (change in clothing with respect to change in mouse interaction):

Nobuyuki's emoji math

Update: I’m now hosting this style package on github.

Tricking latexmk into using -draftmode

Thursday, June 12th, 2014

A friend recently showed me latexmk, an alternative to my current setup of using make to compile complicated tex documents (calling bibtex and pdflatex). Latexmk seems cool because it’s really tracking all the dependencies and not just recompiling everything. However, it seems to always run pdflatex in “final draft” mode, even for early passes which might as well use the -draftmode option. On my thesis (218 page document with figures on roughly every other page), this meant latexmk took 50s and my makefile routine took 23s (assuming a cold start).

I guess the sad answer is that it’s impossible to know if the current run of pdflatex should be the last (and hence should not be using -draftmode). So I guess latexmk plays it safe and runs everything without -draftmode.

My makefile assumes that I only ever need 3 passes, which I guess is pretty common but by no means universal.

I came up with gnarly alternative, which almost needs a makefile itsel:

latexmk -f -pdf -pdflatex="touch thesis.pdf && pdflatex -draftmode" thesis.tex && rm thesis.pdf && latexmk -pdf thesis.tex

First it runs latexmk forcing pdflatex to use -draftmode, but also always touching the pdf so latexmk is convinced that it succeeded in making its targets, then before running a final pass with latexmk I need to remove the pdf so that latexmk thinks there’s something to do. Draftmode passes cost very little so this also runs at about 24s on my thesis.

Wonder if there’s a cleaner way. Especially if thesis.pdf could be inferred nicely from thesis.tex (I guess using basename) and whether I can safely wrap this into an alias:

alias latexmk="..."

Update: Here’s a better version, my friend came up with. It still needs the tex filename twice, but at least it’s using substitution in the pdflatex “command” and the -g option forces latexmk to run at least once.

latexmk -f -pdf -pdflatex='pdflatex -draftmode %O %S && touch %D' thesis && latexmk -pdf -g thesis

Open IEEE pdf without banner

Tuesday, February 11th, 2014

The IEEE Xplore library has a very annoying way of serving up PDFs. The page has two frames, one’s a small banner at the top and the bother is the PDF. Both Safari and Chrome keep the banner in focus so when I try to zoom I’m just zooming the banner. Here’s a javascript bookmarklet to open that second frame as a proper page:

javascript:(function(){window.location.href=document.getElementsByTagName("frame")[1].src;})()