Posts Tagged ‘tr’

Unwrap hard-wrapped text via command line

Monday, October 24th, 2016

I searched for a bash/sed/tr combination to unwrap text that is hard-wrapped at 80 characters per line, like:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed imperdiet felis
suscipit odio fringilla, pharetra ullamcorper felis interdum. Aenean ut mollis
est. Maecenas mattis convallis enim. Nullam eget maximus mi. Vivamus nec risus
suscipit, facilisis nunc at, eleifend massa. Aliquam erat volutpat. Aenean
malesuada velit vel libero cursus, et aliquam nibh imperdiet. Maecenas
ultrices, orci eu posuere commodo, leo diam ultricies velit, sed hendrerit odio
leo sed erat.

Pellentesque at enim id lacus tristique blandit. Duis at suscipit odio, eu
ullamcorper lorem. Interdum et malesuada fames ac ante ipsum primis in
faucibus. Sed non massa urna. Cum sociis natoque penatibus et magnis dis
parturient montes, nascetur ridiculus mus. Etiam blandit metus eget sem
consequat tincidunt. Vivamus auctor pharetra sapien non iaculis. Curabitur quis
fermentum est. Mauris laoreet augue finibus, rhoncus enim et, finibus nibh.
Praesent varius neque mi, id tempor massa facilisis eget. Nulla consectetur,
massa sed tempus laoreet, nisl purus posuere ipsum, eu gravida purus arcu nec
ante.

Pellentesque dapibus ultrices purus, et accumsan sapien ultrices a. Nulla
ultricies odio sit amet tellus tempus, et gravida dui feugiat. Aenean pretium
in lectus vitae molestie. Proin in rhoncus eros. Donec in ultricies nisi,
volutpat ultrices lacus. Suspendisse gravida hendrerit ipsum vitae feugiat.
Phasellus pharetra malesuada orci et euismod. Proin luctus nunc sit amet
gravida pulvinar. Nam quis dapibus mauris. Nulla accumsan nisl vel turpis
lobortis vulputate. Integer sem orci, lobortis ut blandit quis, consequat eget
purus. Fusce accumsan magna eu mi placerat rhoncus.

Into single lines per paragraph, like this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed imperdiet felis suscipit odio fringilla, pharetra ullamcorper felis interdum. Aenean ut mollis est. Maecenas mattis convallis enim. Nullam eget maximus mi. Vivamus nec risus suscipit, facilisis nunc at, eleifend massa. Aliquam erat volutpat. Aenean malesuada velit vel libero cursus, et aliquam nibh imperdiet. Maecenas ultrices, orci eu posuere commodo, leo diam ultricies velit, sed hendrerit odio leo sed erat.

Pellentesque at enim id lacus tristique blandit. Duis at suscipit odio, eu ullamcorper lorem. Interdum et malesuada fames ac ante ipsum primis in faucibus. Sed non massa urna. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Etiam blandit metus eget sem consequat tincidunt. Vivamus auctor pharetra sapien non iaculis. Curabitur quis fermentum est. Mauris laoreet augue finibus, rhoncus enim et, finibus nibh. Praesent varius neque mi, id tempor massa facilisis eget. Nulla consectetur, massa sed tempus laoreet, nisl purus posuere ipsum, eu gravida purus arcu nec ante.

Pellentesque dapibus ultrices purus, et accumsan sapien ultrices a. Nulla ultricies odio sit amet tellus tempus, et gravida dui feugiat. Aenean pretium in lectus vitae molestie. Proin in rhoncus eros. Donec in ultricies nisi, volutpat ultrices lacus. Suspendisse gravida hendrerit ipsum vitae feugiat. Phasellus pharetra malesuada orci et euismod. Proin luctus nunc sit amet gravida pulvinar. Nam quis dapibus mauris. Nulla accumsan nisl vel turpis lobortis vulputate. Integer sem orci, lobortis ut blandit quis, consequat eget purus. Fusce accumsan magna eu mi placerat rhoncus.

This is useful, for example, when editing a plain text entry with vi that is ultimately pasted into a web form.

I couldn’t find a good Unix-tools solution, so I settled on a Python script I found. Here’s the slightly edited version I saved as unwrap:

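#!/usr/bin/env python

import sys

paragraph = []
for line in sys.stdin:
    line = line.strip()
    if line:
        paragraph.append(line)
    else:
        # blank line: the paragraph is complete, print it as one line
        print(' '.join(paragraph).replace('  ', ' '))
        paragraph = []
# print the last paragraph (the input may not end with a blank line)
print(' '.join(paragraph).replace('  ', ' '))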

Then I call it with

unwrap < my-text-file.txt
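For use outside a pipe, the same idea can be sketched as a small function. This is only a sketch: the name unwrap and the sample string are made up for illustration, and it assumes paragraphs are separated by one or more blank lines. Unlike the script above, it puts a blank line between the unwrapped paragraphs.

```python
import re

def unwrap(text):
    """Join each hard-wrapped paragraph of `text` into a single line."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r'\n\s*\n', text.strip())
    # split() with no arguments drops newlines and collapses runs of spaces.
    return '\n\n'.join(' '.join(p.split()) for p in paragraphs)

sample = "one two\nthree\n\nfour\nfive six\n"
print(unwrap(sample))
```

Splitting on whitespace and re-joining also collapses any run of spaces, which the single replace('  ', ' ') in the script only does for pairs.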

Rasterize everything in pdf except text

Wednesday, October 19th, 2016

I had an issue including a PDF with transparency as a subfigure in another PDF. This led me down a dark path of trying to rasterize everything in a PDF except for the text. I tried rasterizing everything and just running OCR on top of the result, but selecting OCR-ized text is weird and the text recognition wasn’t perfect. Not to mention that would have been a really roundabout way to solve this.

Here’s the insane pipeline I settled on:

  • open the PDF in Illustrator
  • save as input.svg, under options “use system fonts”
  • run ./rasterize-everything-but-text.sh input.svg output.svg (see below)
  • open output.svg in Illustrator, save as raster-but-text.pdf

The bash script ./rasterize-everything-but-text.sh is itself an absurd, likely very fragile text manipulation and rasterization of the .svg files:

#!/bin/bash
#
# Usage:
#
#     rasterize-everything-but-text.sh input.svg output.svg
#
input="$1"
output="$2"
# suck out header from svg file
header=`dos2unix < "$input" | tr '\n' '\00' | sed 's/\(.*<svg[^<]*>\).*/\1/' | tr '\00' '\n'`
# grab all text tags (assumes each <text> element sits on a single line)
text=`grep "<text.*" "$input"`
# create svg file without text tags
notextsvg="no-text.svg"
notextpng="no-text.png"
grep -v "<text.*" "$input" > "$notextsvg"
# convert to png
rsvg-convert -h 1000 "$notextsvg" > "$notextpng"
# convert back to svg (containing just <image> tag)
rastersvg="raster.svg"
convert "$notextpng" "$rastersvg"
# extract body (image tag)
body=`dos2unix < "$rastersvg" | tr '\n' '\00' | sed 's/\(.*<svg[^<]*>\)\(.*\)<\/svg>/\2/' | tr '\00' '\n'`
# piece together original header, image tag, and text
echo "$header
$body
$text
</svg>" > "$output"
# Fix image tag to have same size as document
dim=`echo "$header" | grep -o 'width=".*" height="[^"]*"' | tr '"' "'"`
# (sed -i '' is the BSD/macOS syntax; GNU sed expects plain -i)
sed -i '' "s/\(image id=\"image0\" \)width=\".*\" height=\"[^\"]*\"/\1$dim/" "$output"
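The step that separates the text tags from everything else is plain line filtering, and it can be sketched in Python to make the assumption explicit. The function name split_text_lines and the sample SVG are hypothetical. Like the grep calls in the script, this only works when every <text> element sits on one line.

```python
def split_text_lines(svg_source):
    """Return (source without '<text' lines, only the '<text' lines)."""
    text_lines, other_lines = [], []
    for line in svg_source.splitlines():
        # Same test as grep "<text.*": any line containing "<text" counts,
        # so a <text> element spanning several lines would be split wrongly.
        if '<text' in line:
            text_lines.append(line)
        else:
            other_lines.append(line)
    return '\n'.join(other_lines), '\n'.join(text_lines)

# Made-up sample input, one element per line.
sample = '\n'.join([
    '<svg width="10" height="10">',
    '<path d="M0 0L1 1"/>',
    '<text x="1" y="2">label</text>',
    '</svg>',
])
graphics, text = split_text_lines(sample)
print(text)
```

The first return value plays the role of no-text.svg and the second is what gets pasted back in after the rasterized image tag.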

Copy text from a LaTeX file without comments or newlines

Monday, January 20th, 2014

Here’s a bash one-liner to copy the raw text from a LaTeX file. It strips comments (be careful if you use a literal \% in your text, since everything after it on that line is removed too), removes newlines, thins whitespace and finally pipes to the clipboard.

cat abstract.tex | sed -e "s/%.*$//" | tr '\n' ' ' | sed -e "s/  */ /g" | sed -e "s/^ //g" | pbcopy

If you’re issuing this from vim then you need to escape the % sign, because vim otherwise replaces it with the current file name:

    cat abstract.tex | sed -e "s/\%.*$//" | tr '\n' ' ' | sed -e "s/  */ /g" | sed -e "s/^ //g" | pbcopy
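To see what each stage of the pipeline does, here is a rough Python equivalent. It is a sketch, not a replacement: the function name strip_latex and the sample string are made up, and it returns the text rather than copying it to the clipboard.

```python
import re

def strip_latex(source):
    """Mirror the sed/tr pipeline: drop comments, join lines, thin spaces."""
    # sed -e "s/%.*$//": cut from the first % to the end of each line
    text = re.sub(r'%.*$', '', source, flags=re.MULTILINE)
    # tr '\n' ' ': turn every newline into a space
    text = text.replace('\n', ' ')
    # sed -e "s/  */ /g": collapse runs of spaces into one
    text = re.sub(r' +', ' ', text)
    # sed -e "s/^ //g": drop a single leading space
    return re.sub(r'^ ', '', text)

sample = "We present a method % TODO: name it\nthat is   fast.\n"
print(repr(strip_latex(sample)))
```

As in the one-liner, the final newline becomes a trailing space, and a literal \% is treated as the start of a comment.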