There are some terrific plain word lists online. For example, at puzzlers.org, go to "Solving tools of the Enigma" and then "grep dictionary search".
From there you can download the list (or lists; there are a lot of choices) and toss out all the items in the otherwise-promising data that are not found in the list.
Kind of a pain but doable. —Dan
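A minimal sketch of that filtering step, assuming the frequency data comes as "word count" lines and the clean list is one word per line. The function name and the small sample lists are hypothetical, chosen only for illustration; a real clean list would be one of the downloaded files.

```python
def filter_counts(count_lines, clean_words):
    """Keep only (word, count) pairs whose word appears in the clean list."""
    # Normalize the clean list once so each lookup is a set membership test.
    allowed = {w.strip().lower() for w in clean_words if w.strip()}
    kept = []
    for line in count_lines:
        parts = line.split()
        # Skip anything that is not exactly "word count" with a numeric count.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        word, count = parts
        if word.lower() in allowed:
            kept.append((word, int(count)))
    return kept


# Hypothetical sample data: a few frequency lines and a tiny clean list.
sample_counts = [
    "post 392956436",
    "stop 77749471",
    "opts 662207",
    "tsop 205591",
    "ptso 21839",
]
sample_clean = ["post", "stop", "spot", "tops", "pots", "opts"]

for word, count in filter_counts(sample_counts, sample_clean):
    print(word, count)
```

The result is only as good as the clean list: anything missing from it is dropped, even if it is a legitimate word.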
From: "Keith F. Lynch" <kfl@KeithLynch.net>
Veit Elser <ve10@cornell.edu> wrote:
1. Extract sequences of letters, including spaces (as word separators), from actual text.
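One possible reading of this step, sketched under the assumption that "sequences of letters" means maximal runs of a-z after lowercasing, with everything else treated as a separator. The function name is hypothetical, and this is not necessarily the procedure the poster had in mind.

```python
import re
from collections import Counter


def word_counts(text):
    """Count maximal runs of ASCII letters in text, ignoring case."""
    # Anything that is not a-z (spaces, digits, punctuation) acts as a separator.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


counts = word_counts("The cat sat. The cat, again!")
for word, n in counts.most_common(3):
    print(word, n)
```

Note that this kind of extraction has no notion of whether a run of letters is a real word; it counts whatever appears in the input.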
I tried that decades ago. It picks up far too much junk: Proper names, abbreviations, acronyms, jargon, computer codes, ham radio codes, misspelled words, foreign words, misspelled foreign words, etc. If I were to try it again today, now that lots of people send HTML email, it would probably tell me that "msonormal" was the most common English word.
I've tried searching online for word lists that give each word's frequency. In a perverse equivalent of Gödel's theorem, it appears that every such list is either incomplete or contains trash.
For instance http://norvig.com/ngrams/count_1w.txt starts promisingly:
the 23135851162
of 13151942776
and 12997637966
to 12136980858
a 9081174698
in 8469404971
for 5933321709
is 4705743816
but if, for instance, I search it for the anagrams of "post" I get:
post 392956436
stop 77749471
spot 26750929
tops 11771127
pots 3854743
opts 662207
tsop 205591
tpos 43379
ostp 41988
ptos 38390
otps 23858
ptso 21839
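For anyone who wants to repeat this kind of check, here is a rough sketch of an anagram search over "word count" lines. It assumes the file has one whitespace-separated word and count per line; the function name is hypothetical, and the sample lines are a few of the entries quoted above.

```python
def anagrams_in_counts(count_lines, target):
    """Return (word, count) pairs whose letters are a rearrangement of target."""
    key = sorted(target.lower())
    hits = []
    for line in count_lines:
        parts = line.split()
        # Skip anything that is not exactly "word count" with a numeric count.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        word, count = parts
        # Two words are anagrams exactly when their sorted letters match.
        if sorted(word.lower()) == key:
            hits.append((word, int(count)))
    # Most frequent first, matching the order of the listing above.
    hits.sort(key=lambda wc: wc[1], reverse=True)
    return hits


sample = [
    "the 23135851162",
    "post 392956436",
    "stop 77749471",
    "tsop 205591",
]
for word, count in anagrams_in_counts(sample, "post"):
    print(word, count)
```

To run it on the full file, pass an open file handle as count_lines, since iterating over a file yields its lines.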
Any idea where I can find a clean and complete list? Thanks.
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com https://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun