Letter and character frequencies of Faulkner and Hemingway

Alec Jacobson

April 06, 2010


As a freshman at college I took a class called Randomness and Chaos, taught by Mark Nelkin. During the section on power-law probability distributions, I remember becoming obsessed with trying to find these in nature. Later, when I had learned only a little Java programming, I wrote an (admittedly horribly inefficient) character frequency counter that I ran over plain-text versions of William Faulkner's The Sound and the Fury and Ernest Hemingway's The Old Man and the Sea. I made some charts with the intention of adding them to the letter frequency Wikipedia article, but the wikimilitia users removed them, claiming they were "original research". Hardly, I thought. No more than snapping a picture of John Kerry holding a baby is original research. Anyway, I repost them here so that at least I know where to find them, and because I think they are an interesting seed for a discussion of recognizing authorship by certain frequencies in an author's writing (probably not of characters). Also, it is nice to examine these distributions "in nature".
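For anyone who wants to try this on their own texts, here is a minimal sketch of such a counter. It is not the original Java program, just a Python illustration; the function name `relative_frequencies` and the `letters_only` flag are made up for this example, and the letters-only mode assumes that folding case and keeping only a to z is what "Latin letter frequency" means.

```python
from collections import Counter


def relative_frequencies(text, letters_only=False):
    """Return (symbol, relative frequency) pairs, most frequent first."""
    if letters_only:
        # Fold case and keep only the 26 unaccented Latin letters.
        symbols = [c for c in text.lower() if "a" <= c <= "z"]
    else:
        # Count every character: spaces, punctuation, digits, newlines.
        symbols = list(text)
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(symbol, n / total) for symbol, n in counts.most_common()]


if __name__ == "__main__":
    sample = "The old man and the sea"
    for symbol, freq in relative_frequencies(sample, letters_only=True)[:5]:
        print(f"{symbol!r}: {freq:.3f}")
```

To run it over a whole book, read the plain-text file into a string and pass that string in place of `sample`.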

Latin letter frequency in The Old Man and the Sea

relative letter frequency of the old man and the sea

Latin letter frequency in The Sound and the Fury

relative letter frequency of the sound and the fury

Latin letter frequency in English

relative letter frequency of English from wikipedia.

Character frequency in The Old Man and the Sea

relative Character frequency of the old man and the sea

Character frequency in The Sound and the Fury

relative Character frequency of the sound and the fury

The character frequencies exhibit much more of a power-law distribution than the letters, mostly because of the space character and the uncommon punctuation marks and digits.
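One rough way to check that impression: under a power law, the count at rank r is proportional to r^(-s), so a plot of log(count) against log(rank) is roughly a straight line with slope -s. The sketch below fits that slope by ordinary least squares. The function name `rank_frequency_slope` is hypothetical, and this was not used to produce the charts above. A straight line on a log-log plot is only suggestive and is not a rigorous test for a power law.

```python
import math
from collections import Counter


def rank_frequency_slope(text):
    """Least-squares slope of log(count) versus log(rank) over the symbols of text."""
    counts = sorted(Counter(text).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct symbols to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    sample = "the old man and the sea, the sound and the fury"
    print(f"fitted log-log slope: {rank_frequency_slope(sample):.2f}")
```

Comparing the fit over all characters with the fit over letters only (filter the text first) gives a crude numerical version of what the charts show.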