Rasterize everything in pdf except text

Alec Jacobson

October 19, 2016

weblog/

I had an issue including a PDF with transparency as a subfigure to another PDF. This lead me down a dark path of trying to rasterize everything in a pdf except for the text. I tried rasterizing everything and just running OCR on top of the text but OCR-ized selection is weird and the text recognition wasn't perfect. Not to mention that would have been a really round about way to solve this.

Here's the insane pipeline I settled on:

The bash script ./rasterize-everything-but-text.sh is itself an absurd, likely very fragile text manipulation and rasterization of the .svg files:

#!/bin/bash
#
# Usage:
#
#     rasterize-everything-but-text.sh input.svg output.svg
#
input="$1"
output="$2"
# suck out header from svg file
header=`dos2unix < $input | tr '\n' '\00' | sed 's/\(.*<svg[^<]*>\).*/\1/' | tr '\00' '\n'`
# grab all text tags
text=`cat $input | grep     "<text.*"`
# create svg file without text tags
notextsvg="no-text.svg"
notextpng="no-text.png"
cat $input | grep  -v "<text.*" > $notextsvg
# convert to png
rsvg-convert -h 1000 $notextsvg > $notextpng
# convert back to svg (containing just <image> tag)
rastersvg="raster.svg"
convert $notextpng $rastersvg
# extract body (image tag)
body=`dos2unix < $rastersvg | tr '\n' '\00' | sed 's/\(.*<svg[^<]*>\)\(.*\)<\/svg>/\2/' | tr '\00' '\n'`
# piece together original header, image tag, and text
echo "$header
$body
$text
</svg>" > "$output"
# Fix image tag to have same size as document
dim=`echo "$header" | grep -o 'width=".*" height="[^"]*"' | tr '"' "'"`
sed -i '' "s/\(image id=\"image0\" \)width=\".*\" height=\"[^\"]*\"/\1$dim/" $output