Re: [math-fun] ISO a perfect pangram
Veit Elser <ve10@cornell.edu> wrote:
1. Extract sequences of letters, including spaces (as word separators), from actual text.
I tried that decades ago. It picks up far too much junk: proper names, abbreviations, acronyms, jargon, computer codes, ham radio codes, misspelled words, foreign words, misspelled foreign words, etc. If I were to try it again today, now that lots of people send HTML email, it would probably tell me that "msonormal" was the most common English word.

I've tried searching for word lists online that list each word by its frequency. In a perverse equivalent of Gödel's theorem, it appears that every such list is either incomplete or contains trash.

For instance http://norvig.com/ngrams/count_1w.txt starts promisingly:

    the  23135851162
    of   13151942776
    and  12997637966
    to   12136980858
    a     9081174698
    in    8469404971
    for   5933321709
    is    4705743816

but if, for instance, I search it for the anagrams of "post" I get:

    post  392956436
    stop   77749471
    spot   26750929
    tops   11771127
    pots    3854743
    opts     662207
    tsop     205591
    tpos      43379
    ostp      41988
    ptos      38390
    otps      23858
    ptso      21839

Any idea where I can find a clean and complete list? Thanks.
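[For concreteness, here is a minimal Python sketch of the kind of lookup Keith describes; it is not code from the thread. It assumes a local copy of count_1w.txt with one whitespace-separated "word count" pair per line, and prints every entry that is an anagram of a target word.]

    # Hypothetical sketch, not Keith's code: scan Norvig's count_1w.txt
    # (one whitespace-separated "word count" pair per line) for anagrams
    # of a target word.
    def anagrams_in_list(path, target):
        key = sorted(target)
        with open(path) as f:
            for line in f:
                word, count = line.split()
                if sorted(word) == key:
                    yield word, int(count)

    for word, count in anagrams_in_list("count_1w.txt", "post"):
        print(word, count)
    # The output reproduces Keith's point: alongside post/stop/spot/tops/pots/opts,
    # the raw web n-gram counts also contain junk entries like tsop, tpos and ostp.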
On Aug 10, 2017, at 10:53 PM, Keith F. Lynch <kfl@KeithLynch.net> wrote:
I tried that decades ago. It picks up far too much junk: Proper names, abbreviations, acronyms, jargon, computer codes, ham radio codes, misspelled words, foreign words, misspelled foreign words, etc. If I were to try it again today, now that lots of people send HTML email, it would probably tell me that "msonormal" was the most common English word.
Perhaps I wasn’t clear, but I was only proposing that method for searching “double-pangrams”, where each letter appears exactly twice. I’m still optimistic that it could work. I tried it using Alice in Wonderland as the source, to avoid “junk”. I don’t do this kind of coding/computing, but gave it a shot in Mathematica, since this book is included in their datasets.

There are 52k characters and 7k unique blocks of size 4 (omitting punctuation marks). There are 31k transitions. The key number, I believe, is the ratio: 4.47 continuations, on average, per block. We might want to look for an author (Nabokov?) whose writing has a higher ratio.

Anyway, searching all continuations starting from a randomly selected set of just 10 of the 793 starting blocks, I got as far as 12 blocks, where most blocks have just one blank. Most of these are weakly comprehensible sentences made of true words:

“all mad but why the king’s crowd be off quick n…” (apostrophe added by hand)

-Veit
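[For readers who want to reproduce the counts, the following is a rough Python re-creation of the statistics Veit reports; he used Mathematica, and none of this code is his. It assumes one particular reading of his setup: the text is lower-cased, punctuation is dropped, a "block" is any 4-character window of the cleaned text, and a "transition" is a distinct pair consisting of a block and the block that starts 4 characters later, i.e. the 4 letters you could append next. The definition of the 793 "starting blocks" is not specified in the message and is not reproduced here.]

    # Hypothetical sketch of the block/transition statistics (assumptions above).
    import re
    from collections import defaultdict

    def block_stats(text, k=4):
        # Drop apostrophes, turn every other non-letter into a space, collapse runs.
        clean = text.lower().replace("'", "").replace("\u2019", "")
        clean = re.sub(r"[^a-z]+", " ", clean).strip()

        blocks = {clean[i:i + k] for i in range(len(clean) - k + 1)}
        successors = defaultdict(set)      # block -> distinct blocks 4 chars later
        for i in range(len(clean) - 2 * k + 1):
            successors[clean[i:i + k]].add(clean[i + k:i + 2 * k])

        n_transitions = sum(len(s) for s in successors.values())
        return len(blocks), n_transitions, n_transitions / len(blocks)

    # Usage, with a plain-text copy of Alice in Wonderland (e.g. from Project
    # Gutenberg) saved locally as alice.txt:
    # with open("alice.txt") as f:
    #     print(block_stats(f.read()))
    # Veit reports roughly 7k blocks, 31k transitions, and a ratio of 4.47,
    # i.e. the average branching factor of the block-appending search.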
On Fri, Aug 11, 2017 at 9:33 AM, Veit Elser <ve10@cornell.edu> wrote:
"all mad but why the king’s crowd be off quick n…” (apostrophe added by hand)
Note that although you have only 36 letters of the 52 you are hoping for, you have already used up two each of a, e, i, o, u, and one y. This is no coincidence. The density of vowels in English text is much higher than in the alphabet (roughly 38% of the letters in running English are a, e, i, o, or u, versus 5 of the alphabet's 26 letters, about 19%), so the key will always be finding enough words with sufficiently low vowel density that also use uncommon letters. I would bet that if you generated 100 more 36-letter fragments, you would find that either 99 or 100 of them had used up aaeeiioouu. So you're not nearly as close to finding a solution as it might appear.

Andy
-- Andy.Latto@pobox.com
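[Andy's tally is easy to check mechanically. Here is a short Python snippet, again not from the thread, that counts the letters of the quoted 36-letter fragment:]

    from collections import Counter

    fragment = "all mad but why the kings crowd be off quick n"  # apostrophe dropped
    counts = Counter(c for c in fragment if c.isalpha())

    print(sum(counts.values()))              # 36 letters used so far, out of 52
    print({v: counts[v] for v in "aeiouy"})  # {'a': 2, 'e': 2, 'i': 2, 'o': 2, 'u': 2, 'y': 1}
    # Every vowel is already at its quota of two, so the remaining 16 letters of a
    # double pangram would have to be spelled without a, e, i, o or u.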
participants (3):
- Andy Latto
- Keith F. Lynch
- Veit Elser