The text is the overture from Proust's _Swann's Way_, chosen because it is freely available on Project Gutenberg: http://gutenberg.net.au/ebooks03/0300511.txt

I have, hopefully, removed all the punctuation and special characters and converted all upper case to lower case. I ran:

    cat Example.txt | tr "\n" " " | tr -d "." | tr -d "," | tr -d ";" | tr -d ":" | tr -d "?" | tr -d "(" | tr -d ")" | sed 's/"//g' | tr [:upper:] [:lower:] | tr -d "_" | tr "-" " " | tr -d '!' > temp

I then opened the result in emacs and noticed it still contained apostrophes ('), which I removed with replace-string. I also removed the name francoise, since it contains a cedilla. Later on, I discovered there were still some é characters (e-acute) and other accented letters, so I removed those by hand in emacs.
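
For anyone repeating the clean-up, the pipeline above can probably be collapsed into three tr calls. This is only a sketch, not the command that was actually run: the function name clean_text is made up for illustration, and, unlike the original command, it also deletes the apostrophe that was removed afterwards in emacs. It does nothing about accented letters, which would still need separate handling.

```shell
#!/bin/sh
# clean_text is a hypothetical helper (not part of the original write-up).
# It reads text on stdin and writes the cleaned text to stdout.
#
# Step 1: turn newlines and hyphens into spaces (the second set is two spaces;
#         the hyphen is placed last so tr treats it as a literal character).
# Step 2: delete the punctuation handled by the original pipeline, plus the
#         apostrophe (an addition in this sketch).
# Step 3: fold upper case to lower case; the character classes are quoted so
#         the shell cannot expand them as glob patterns.
clean_text() {
    tr '\n-' '  ' |
    tr -d '.,;:?()!"_'\' |
    tr '[:upper:]' '[:lower:]'
}
```

Assuming the same file names as above, it would be used as clean_text < Example.txt > temp once the function is defined in the current shell.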