On Aug 10, 2017, at 10:53 PM, Keith F. Lynch <kfl@KeithLynch.net> wrote:
Veit Elser <ve10@cornell.edu> wrote:
1. Extract sequences of letters, including spaces (as word separators), from actual text.
I tried that decades ago. It picks up far too much junk: Proper names, abbreviations, acronyms, jargon, computer codes, ham radio codes, misspelled words, foreign words, misspelled foreign words, etc. If I were to try it again today, now that lots of people send HTML email, it would probably tell me that "msonormal" was the most common English word.
Perhaps I wasn’t clear, but I was only proposing that method for searching for “double-pangrams”, where each letter appears exactly twice. I’m still optimistic that it could work.

I tried it using Alice in Wonderland as the source, to avoid “junk”. I don’t usually do this kind of coding/computing, but I gave it a shot in Mathematica, since this book is included in their datasets. There are 52k characters and 7k unique blocks of size 4 (omitting punctuation marks). There are 31k transitions. The key number, I believe, is the ratio: 4.47 continuations, on average, per block. We might want to look for an author (Nabokov?) whose writing has a higher ratio.

Anyway, searching all continuations starting from a randomly selected set of just 10 of the 793 starting blocks, I got as far as 12 blocks, where most blocks have just one blank. Most of these are weakly comprehensible sentences made of true words:

“all mad but why the king’s crowd be off quick n…” (apostrophe added by hand)

-Veit
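For anyone who wants to reproduce the block statistics without Mathematica, here is a minimal Python sketch. It rests on assumptions the message does not spell out: that a "block" is any run of 4 characters taken at every offset of the cleaned text, and that a "transition" is a block followed immediately by the next 4 characters. The function names (normalize, block_transitions, mean_continuations) are made up for illustration, so the counts may not match the figures quoted above.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, drop everything except letters and whitespace,
    and collapse whitespace runs to single spaces."""
    letters_only = re.sub(r"[^a-z\s]", "", text.lower())
    return re.sub(r"\s+", " ", letters_only).strip()


def block_transitions(text, size=4):
    """Map each block of `size` characters to the set of blocks
    observed immediately after it (assumed meaning of 'transition')."""
    transitions = defaultdict(set)
    for i in range(len(text) - 2 * size + 1):
        block = text[i:i + size]
        following = text[i + size:i + 2 * size]
        transitions[block].add(following)
    return transitions


def mean_continuations(transitions):
    """Average number of distinct continuations per block."""
    total = sum(len(following) for following in transitions.values())
    return total / len(transitions)
```

Feeding the normalized text of a book to block_transitions and then calling mean_continuations on the result gives a ratio comparable in spirit to the one described, which would make it easy to compare authors.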
I've tried searching for word lists online that list each word by its frequency. In a perverse equivalent of Godel's theorem, it appears that every such list is either incomplete or contains trash.
For instance http://norvig.com/ngrams/count_1w.txt starts promisingly:
    the  23135851162
    of   13151942776
    and  12997637966
    to   12136980858
    a     9081174698
    in    8469404971
    for   5933321709
    is    4705743816
but if, for instance, I search it for the anagrams of "post" I get:
    post  392956436
    stop   77749471
    spot   26750929
    tops   11771127
    pots    3854743
    opts     662207
    tsop     205591
    tpos      43379
    ostp      41988
    ptos      38390
    otps      23858
    ptso      21839
Any idea where I can find a clean and complete list? Thanks.
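The anagram check above can be sketched in a few lines of Python. This assumes the list has already been downloaded and that each line holds a word and a count separated by whitespace; the function names parse_counts and anagrams_by_count are invented for this example.

```python
def parse_counts(lines):
    """Yield (word, count) pairs from lines shaped like 'word<whitespace>count'.
    Lines that do not fit that shape are skipped."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            yield parts[0], int(parts[1])


def anagrams_by_count(lines, target):
    """Return every listed word that uses exactly the letters of `target`,
    most frequent first."""
    key = sorted(target.lower())
    matches = [(word, count) for word, count in parse_counts(lines)
               if sorted(word) == key]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

With a local copy of the file, something like anagrams_by_count(open("count_1w.txt"), "post") should list the same kind of mix of real words and junk shown above.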
_______________________________________________
math-fun mailing list
math-fun@mailman.xmission.com
https://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun