Posts Tagged ‘sed’

Unwrap hard-wrapped text via command line

Monday, October 24th, 2016

I searched for a bash/sed/tr combination to unwrap text that is hard-wrapped at 80 characters per line, like:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed imperdiet felis
suscipit odio fringilla, pharetra ullamcorper felis interdum. Aenean ut mollis
est. Maecenas mattis convallis enim. Nullam eget maximus mi. Vivamus nec risus
suscipit, facilisis nunc at, eleifend massa. Aliquam erat volutpat. Aenean
malesuada velit vel libero cursus, et aliquam nibh imperdiet. Maecenas
ultrices, orci eu posuere commodo, leo diam ultricies velit, sed hendrerit odio
leo sed erat.

Pellentesque at enim id lacus tristique blandit. Duis at suscipit odio, eu
ullamcorper lorem. Interdum et malesuada fames ac ante ipsum primis in
faucibus. Sed non massa urna. Cum sociis natoque penatibus et magnis dis
parturient montes, nascetur ridiculus mus. Etiam blandit metus eget sem
consequat tincidunt. Vivamus auctor pharetra sapien non iaculis. Curabitur quis
fermentum est. Mauris laoreet augue finibus, rhoncus enim et, finibus nibh.
Praesent varius neque mi, id tempor massa facilisis eget. Nulla consectetur,
massa sed tempus laoreet, nisl purus posuere ipsum, eu gravida purus arcu nec
ante.

Pellentesque dapibus ultrices purus, et accumsan sapien ultrices a. Nulla
ultricies odio sit amet tellus tempus, et gravida dui feugiat. Aenean pretium
in lectus vitae molestie. Proin in rhoncus eros. Donec in ultricies nisi,
volutpat ultrices lacus. Suspendisse gravida hendrerit ipsum vitae feugiat.
Phasellus pharetra malesuada orci et euismod. Proin luctus nunc sit amet
gravida pulvinar. Nam quis dapibus mauris. Nulla accumsan nisl vel turpis
lobortis vulputate. Integer sem orci, lobortis ut blandit quis, consequat eget
purus. Fusce accumsan magna eu mi placerat rhoncus.

into a single line per paragraph, like this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed imperdiet felis suscipit odio fringilla, pharetra ullamcorper felis interdum. Aenean ut mollis est. Maecenas mattis convallis enim. Nullam eget maximus mi. Vivamus nec risus suscipit, facilisis nunc at, eleifend massa. Aliquam erat volutpat. Aenean malesuada velit vel libero cursus, et aliquam nibh imperdiet. Maecenas ultrices, orci eu posuere commodo, leo diam ultricies velit, sed hendrerit odio leo sed erat.

Pellentesque at enim id lacus tristique blandit. Duis at suscipit odio, eu ullamcorper lorem. Interdum et malesuada fames ac ante ipsum primis in faucibus. Sed non massa urna. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Etiam blandit metus eget sem consequat tincidunt. Vivamus auctor pharetra sapien non iaculis. Curabitur quis fermentum est. Mauris laoreet augue finibus, rhoncus enim et, finibus nibh. Praesent varius neque mi, id tempor massa facilisis eget. Nulla consectetur, massa sed tempus laoreet, nisl purus posuere ipsum, eu gravida purus arcu nec ante.

Pellentesque dapibus ultrices purus, et accumsan sapien ultrices a. Nulla ultricies odio sit amet tellus tempus, et gravida dui feugiat. Aenean pretium in lectus vitae molestie. Proin in rhoncus eros. Donec in ultricies nisi, volutpat ultrices lacus. Suspendisse gravida hendrerit ipsum vitae feugiat. Phasellus pharetra malesuada orci et euismod. Proin luctus nunc sit amet gravida pulvinar. Nam quis dapibus mauris. Nulla accumsan nisl vel turpis lobortis vulputate. Integer sem orci, lobortis ut blandit quis, consequat eget purus. Fusce accumsan magna eu mi placerat rhoncus.

This is useful, for example, when editing a plain text entry with vi that is ultimately pasted into a web form.

I couldn’t find a good unix tools solution so I settled on a python script I found. Here’s the slightly edited version I save in unwrap:

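#!/usr/bin/env python
# Join the hard-wrapped lines of each paragraph read from stdin.
import sys

paragraph = []
for line in sys.stdin:
    line = line.strip()
    if line:
        paragraph.append(line)
    else:
        # blank line: the paragraph is complete, print it on one line
        print(' '.join(paragraph).replace('  ', ' '))
        paragraph = []
# print the last paragraph (the input may not end with a blank line)
print(' '.join(paragraph).replace('  ', ' '))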

Then I call it with

unwrap < my-text-file.txt
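For reuse from other scripts, the same idea can be wrapped in a function. This is only a sketch: the name `unwrap` as a function and the choice to put one blank line between the unwrapped paragraphs (as in the example output above) are my assumptions, not part of the script I found.

```python
def unwrap(text):
    """Join the lines of each paragraph; blank lines separate paragraphs."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            # a blank line closes the paragraph collected so far
            paragraphs.append(' '.join(current))
            current = []
    if current:
        paragraphs.append(' '.join(current))
    # assumption: keep one blank line between unwrapped paragraphs
    return '\n\n'.join(paragraphs)


wrapped = "Lorem ipsum dolor\nsit amet.\n\nPellentesque at\nenim id lacus.\n"
print(unwrap(wrapped))
```

Unlike the stdin script, this version skips runs of several blank lines instead of printing an empty line for each one.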

Rasterize everything in pdf except text

Wednesday, October 19th, 2016

I had an issue including a PDF with transparency as a subfigure in another PDF. This led me down a dark path of trying to rasterize everything in a pdf except for the text. I tried rasterizing everything and just running OCR on top of the text, but selecting OCR-ized text is weird and the text recognition wasn’t perfect. Not to mention that would have been a really roundabout way to solve this.

Here’s the insane pipeline I settled on:

  • open the PDF in illustrator
  • save as input.svg, choosing “use system fonts” under options
  • run ./rasterize-everything-but-text.sh input.svg output.svg (see below)
  • open output.svg in illustrator, save as raster-but-text.pdf

The bash script ./rasterize-everything-but-text.sh is itself an absurd, likely very fragile text manipulation and rasterization of the .svg files:

#!/bin/bash
#
# Usage:
#
#     rasterize-everything-but-text.sh input.svg output.svg
#
input="$1"
output="$2"
# suck out header from svg file
header=$(dos2unix < "$input" | tr '\n' '\00' | sed 's/\(.*<svg[^<]*>\).*/\1/' | tr '\00' '\n')
# grab all text tags
text=$(grep "<text.*" "$input")
# create svg file without text tags
notextsvg="no-text.svg"
notextpng="no-text.png"
grep -v "<text.*" "$input" > "$notextsvg"
# convert to png
rsvg-convert -h 1000 "$notextsvg" > "$notextpng"
# convert back to svg (containing just <image> tag)
rastersvg="raster.svg"
convert "$notextpng" "$rastersvg"
# extract body (image tag)
body=$(dos2unix < "$rastersvg" | tr '\n' '\00' | sed 's/\(.*<svg[^<]*>\)\(.*\)<\/svg>/\2/' | tr '\00' '\n')
# piece together original header, image tag, and text
echo "$header
$body
$text
</svg>" > "$output"
# Fix image tag to have same size as document
dim=$(echo "$header" | grep -o 'width=".*" height="[^"]*"' | tr '"' "'")
# note: sed -i '' is the BSD/macOS form; GNU sed takes -i without the empty argument
sed -i '' "s/\(image id=\"image0\" \)width=\".*\" height=\"[^\"]*\"/\1$dim/" "$output"

Determine how much space is used by .git/.svn/.hg in a directory

Thursday, November 12th, 2015

Here’s a nasty little bash one-liner to determine how much space is being “wasted” by .svn/ or .git/ or .hg/ repos in your current directory:

du -k | sed -nE 's/^([0-9]*).*\.(svn|git|hg)$/\1/p' | awk '{s+=$1*1024} END {print s}' | awk '{ sum=$1 ; hum[1024**3]="Gb";hum[1024**2]="Mb";hum[1024]="Kb"; for (x=1024**3; x>=1024; x/=1024){ if (sum>=x) { printf "%.2f %s\n",sum/x,hum[x];break } }}'
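The one-liner is dense, so here is a rough Python sketch of its two halves: summing the `du -k` lines whose path ends in .svn, .git or .hg, and printing the total in a human-readable unit. The function names `vcs_bytes` and `human` are made up for illustration, and the sample `du` output is invented.

```python
import re

# a du -k line is "<size in KiB><whitespace><path>"; keep paths ending in a VCS dir
VCS_DIR = re.compile(r'^(\d+)\s.*\.(svn|git|hg)$')


def vcs_bytes(du_output):
    """Sum, in bytes, the sizes of the .svn/.git/.hg directories listed by du -k."""
    total = 0
    for line in du_output.splitlines():
        match = VCS_DIR.match(line)
        if match:
            total += int(match.group(1)) * 1024
    return total


def human(size):
    """Format a byte count with the same units as the awk part of the one-liner."""
    for unit, name in ((1024 ** 3, 'Gb'), (1024 ** 2, 'Mb'), (1024, 'Kb')):
        if size >= unit:
            return '%.2f %s' % (size / unit, name)
    # assumption: the awk version prints nothing below 1024 bytes; show bytes instead
    return '%d bytes' % size


# invented sample of what du -k might print
sample = "8\t./a/.git/objects\n12\t./a/.git\n4\t./b/.svn\n100\t./a\n"
print(human(vcs_bytes(sample)))
```

Subdirectories such as ./a/.git/objects are not counted twice, because only lines ending in the repository directory itself match.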

Remove annotations from a pdf

Wednesday, July 29th, 2015

I lost track of where I found this online and had to dig it out of my bash history:

perl -pi -e 's:/Annots \[[^]]+\]::g' input-and-output.pdf
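For what it’s worth, the same substitution can be sketched in Python on raw bytes. The helper name `strip_annots` is hypothetical, and like the perl version this only touches `/Annots [...]` arrays that appear as plain text in the file.

```python
import re

# same pattern as the perl one-liner: "/Annots [" up to the next "]"
ANNOTS = re.compile(rb'/Annots \[[^\]]+\]')


def strip_annots(pdf_bytes):
    """Remove every inline /Annots [...] array from the given PDF bytes."""
    # note: applied to the whole file, so a match may span a line break,
    # which the line-by-line perl -p loop would not do
    return ANNOTS.sub(b'', pdf_bytes)


page = b'<< /Type /Page /Annots [12 0 R 13 0 R] /Contents 5 0 R >>'
print(strip_annots(page))
```

To use it on a file, read the PDF with `open(path, 'rb')` and write the result back in binary mode.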

Abbreviate long strings with dots using sed

Saturday, January 24th, 2015

My bash prompt lists the current directory name and I was bothered that if it’s very long then it will take up the whole line. Here’s a short sed command which will abbreviate any line of at least 5+3+5=13 characters, keeping the first five and last five characters and replacing the middle part with three dots:

echo \
"a very long directory name which will take up a whole line
short name
perfect name
just big name" | \
sed -e "s/\(.\{5\}\).\{3,\}\(.\{5,\}\)/\1...\2/"

produces

a ver... line
short name
perfect name
just ... name
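The same regular expression carries over to Python almost unchanged. A small sketch (the function name `abbreviate` is mine):

```python
import re


def abbreviate(line):
    """Keep the first and last five characters of a line of 13+ characters."""
    # the greedy middle group swallows as much as it can,
    # leaving exactly five characters for the last group
    return re.sub(r'(.{5}).{3,}(.{5,})', r'\1...\2', line, count=1)


names = [
    "a very long directory name which will take up a whole line",
    "short name",
    "perfect name",
    "just big name",
]
for name in names:
    print(abbreviate(name))
```

Lines shorter than 13 characters cannot match the pattern, so they pass through untouched.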

Scrape all torrent titles from tehconnection.eu

Monday, March 24th, 2014

Here’s a bash script to log into <tehconnection.eu> then scrape all movie titles from their list of available torrents, doing minor cleanup on the symbols.

#!/bin/bash

LOGIN_URL="https://tehconnection.eu/login.php"
USERNAME=myusername
PASSWORD=mypassword
# log in to website and save cookies
wget --post-data \
  "username=$USERNAME&password=$PASSWORD" \
  --save-cookies=cookies.txt --keep-session-cookies \
  -O /dev/null \
  -q \
  $LOGIN_URL \
  &>/dev/null

FILMS_URL="https://tehconnection.eu/torrents.php?order_by=s1&order_way=ASC"
## download first page to determine last page
RES=`wget --load-cookies=cookies.txt "$FILMS_URL" -q -O -`
LAST=`echo "$RES" | grep -m 1 -o "page[^\/]* Last" | \
  sed -e "s/page=\([0-9][0-9]*\).*/\1/g"`
#LAST=363

for p in $(seq 1 $LAST);
do
  URL="$FILMS_URL&page=$p"
  RES=`wget --load-cookies=cookies.txt "$URL" -q -O -`
  echo "$RES"| grep "torrent_title\"" | \
    sed -e "s/.*View Torrent\">\([^<]*\).*/\1/g" | ./html_entity_decode.php
  sleep 3
done

# get rid of cookies
rm cookies.txt
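The grep/sed/decode step inside the loop can be sketched in Python as well. Assumptions: the sample HTML line below is invented to match the patterns the script greps for, `extract_titles` is a made-up name, and the standard library’s `html.unescape` stands in for the ./html_entity_decode.php helper, which isn’t shown here.

```python
import html
import re

# like the sed expression: everything after the last 'View Torrent">' up to the next '<'
TITLE = re.compile(r'.*View Torrent">([^<]*)')


def extract_titles(page):
    """Return decoded titles from lines that contain torrent_title"."""
    titles = []
    for line in page.splitlines():
        if 'torrent_title"' not in line:
            continue
        match = TITLE.match(line)
        if match:
            titles.append(html.unescape(match.group(1)))
    return titles


# invented sample markup, not copied from the real site
sample = (
    '<a class="torrent_title" href="#" title="View Torrent">Fish &amp; Chips</a>\n'
    '<div class="other">ignored</div>\n'
    '<a class="torrent_title" href="#" title="View Torrent">It&#39;s a Gift</a>\n'
)
print(extract_titles(sample))
```

One difference from the shell version: a line that passes the grep but does not match the pattern is skipped here, whereas sed would print it unchanged.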

Copy text file without hard-wrap new lines

Friday, February 28th, 2014

I often format small text entries in vim and then copy them into other places like web forms. In vim I like to have a hard 80 character line wrap. But this means that after every 80 characters I have a newline character. If I just copy the file, or copy from the terminal screen, then I’ll impose this hard wrap on the submitted text. I’ve noticed that this is especially bad for academic review submissions because the submission system might then additionally impose its own hard wrap at a different width, causing a very staggered, ragged appearance.

Here’s my solution to copy a text file without newlines but keeping double newlines which indicate paragraphs.

cat % | perl -00 -pe 's/\n([^\n])/ $1/g;' | pbcopy

There must be a way to do this inside of vim properly, but I couldn’t figure it out.

Copy text from a LaTeX file without comments or newlines

Monday, January 20th, 2014

Here’s a bash one-liner to copy the raw text from a LaTeX file. It strips comments (be careful if you use % in your text), removes newlines, thins whitespace and finally pipes to the clipboard.

cat abstract.tex | sed -e "s/%.*$//" | tr '\n' ' ' | sed -e "s/  */ /g" | sed -e "s/^ //g" | pbcopy

If you’re issuing this from vim then you need to escape the % sign:

    cat abstract.tex | sed -e "s/\%.*$//" | tr '\n' ' ' | sed -e "s/  */ /g" | sed -e "s/^ //g" | pbcopy
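The text-cleaning part of that pipeline, without the clipboard step, looks roughly like this in Python. This is a sketch with a made-up function name, `latex_to_text`. Like the sed version, it treats every % as the start of a comment, including an escaped \% in your text.

```python
import re


def latex_to_text(source):
    """Strip % comments, join lines with spaces and collapse runs of spaces."""
    # drop everything from the first % to the end of each line
    lines = [re.sub(r'%.*$', '', line) for line in source.split('\n')]
    # newlines become spaces, then runs of spaces are thinned to one
    text = re.sub(r' +', ' ', ' '.join(lines))
    # small deviation: the shell pipeline only trims the leading space
    return text.strip(' ')


sample = "Hello % a comment\n  world\n% a full-line comment\nagain\n"
print(latex_to_text(sample))
```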

Remove missing files from svn (forget)

Saturday, December 14th, 2013

I often forget to delete files under version control using the svn rm command. Then I still need to issue that command but auto-complete won’t help me out. I use this snippet to delete any missing (already deleted) files.

svn rm `svn st | sed -n "s/^\! *//p"`
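The sed part just picks out the status lines that start with `!`. Here is a sketch of that filtering step in Python (the function name `missing_files` and the sample status output are invented). One side benefit is that file names containing spaces stay intact, which the backtick version does not guarantee.

```python
import re


def missing_files(status_output):
    """Return the paths that svn st marks as missing with a leading '!'."""
    paths = []
    for line in status_output.splitlines():
        match = re.match(r'! *(.*)', line)
        if match:
            paths.append(match.group(1))
    return paths


# invented sample of svn st output
sample = (
    "!       old.txt\n"
    "M       changed.c\n"
    "?       new.txt\n"
    "!       dir/gone file.h\n"
)
print(missing_files(sample))
# the resulting list could then be handed to svn, e.g.
# subprocess.run(['svn', 'rm'] + missing_files(output))
```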

See who’s been checking in frequently to mercurial repository

Friday, December 13th, 2013

Here’s a bash snippet to show who’s been checking in code to a mercurial repository:

hg log | sed -n "s/user:  *//p" | sort | uniq -c | sort -rn

This prints something like:

 262 Pablo
 108 juanita
  23 carlos
  23 Maria Castano <maria@gmail.com>
  21 Juan Hernandez (hernandez@gmail.com)
  17 psalamanca
  13 chico
   7 Maria Castano <maria.castano@yahoo.com>
   1 paco

source

Update: For an alternative measure try the churn extension.
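If you’d rather do the counting in a script, the sed/sort/uniq chain maps onto `collections.Counter`. This is a sketch only: `commits_per_user` is a made-up name and the sample log text is invented, so adjust the pattern if your `hg log` output looks different.

```python
import re
from collections import Counter


def commits_per_user(log_text):
    """Count the 'user:' lines of hg log output, most frequent first."""
    users = re.findall(r'^user: +(.*)$', log_text, flags=re.MULTILINE)
    return Counter(users).most_common()


# invented sample in the shape of hg log output
sample = (
    "changeset:   2:aaaaaaaaaaaa\n"
    "user:        Pablo\n"
    "summary:     third\n"
    "\n"
    "changeset:   1:bbbbbbbbbbbb\n"
    "user:        juanita\n"
    "summary:     second\n"
    "\n"
    "changeset:   0:cccccccccccc\n"
    "user:        Pablo\n"
    "summary:     first\n"
)
for user, count in commits_per_user(sample):
    print('%4d %s' % (count, user))
```

As in the sample output above, the same person committing under two different names or addresses is counted as two users.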