[math-fun] ISO a perfect pangram
Bill Gosper's recent anagrams reminded me of my search for a perfect pangram, i.e. a sentence in English containing every letter once and only once. For instance, "Cwm fjord bank glyphs vext quiz" and "Mr. Jock, TV quiz PhD, bags few lynx," neither of which is really satisfactory. My approach was to find a complete list of words, discard any words containing duplicate letters (e.g. "containing," as it contains two "n"s (and also two "i"s)), turn each of them into a 32-bit integer which indicates which letters are present ("of" becomes 2^4 + 2^13 as it contains the 5th and 14th letters). Anagrams are merged (e.g. the numbers for "opts," "post," "pots," "spot," "stop," and "tops" are the same). A depth-first tree search is then done during which the numbers are ORed together. If an AND gets a non-zero result, the branch is immediately abandoned. If 2^26-1 is reached, the result is logged, so that I can manually try to arrange the words (or anagrams of them) into a coherent sentence. I tried this about 30 years ago, on a 286. Needless to say, I didn't get anywhere. It probably would have run for eons. I'm considering trying again, on the same general plan, but only with the most common ten thousand words. Better yet, the most common thousand words containing "a", the most common thousand words containing "b", ..., and the most common thousand words containing "z," with the many duplicates merged, for a total of perhaps ten thousand. I might also include the most common thousand words *not* containing "a", the most common thousand words not containing "b", ..., and the most common thousand words not containing "z." Again, duplicates would be merged. And words that are anagrams would move up in the rankings as the usage of each anagram would be summed as if they were all the same word. With memory as cheap as it is, I might also include every *pair* of words in the list as a word, even though that would increase memory usage from 40 kilobytes to 400 megabytes. Has anyone tried this already? Does anyone have a better algorithm? Thanks.
Have you considered generalizing to sentences where each letter appears exactly twice? It should still be a challenge, but will improve the likelihood of being able to form a coherent statement. -Veit
On Aug 3, 2017, at 12:46 AM, Keith F. Lynch <kfl@KeithLynch.net> wrote:
Bill Gosper's recent anagrams reminded me of my search for a perfect pangram, i.e. a sentence in English containing every letter once and only once. For instance, "Cwm fjord bank glyphs vext quiz" and "Mr. Jock, TV quiz PhD, bags few lynx," neither of which is really satisfactory.
My approach was to find a complete list of words, discard any words containing duplicate letters (e.g. "containing," as it contains two "n"s (and also two "i"s)), turn each of them into a 32-bit integer which indicates which letters are present ("of" becomes 2^4 + 2^13 as it contains the 5th and 14th letters). Anagrams are merged (e.g. the numbers for "opts," "post," "pots," "spot," "stop," and "tops" are the same). A depth-first tree search is then done during which the numbers are ORed together. If an AND gets a non-zero result, the branch is immediately abandoned. If 2^26-1 is reached, the result is logged, so that I can manually try to arrange the words (or anagrams of them) into a coherent sentence.
I tried this about 30 years ago, on a 286. Needless to say, I didn't get anywhere. It probably would have run for eons.
I'm considering trying again, on the same general plan, but only with the most common ten thousand words. Better yet, the most common thousand words containing "a", the most common thousand words containing "b", ..., and the most common thousand words containing "z," with the many duplicates merged, for a total of perhaps ten thousand. I might also include the most common thousand words *not* containing "a", the most common thousand words not containing "b", ..., and the most common thousand words not containing "z." Again, duplicates would be merged. And words that are anagrams would move up in the rankings as the usage of each anagram would be summed as if they were all the same word.
With memory as cheap as it is, I might also include every *pair* of words in the list as a word, even though that would increase memory usage from 40 kilobytes to 400 megabytes.
Has anyone tried this already? Does anyone have a better algorithm? Thanks.
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com https://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
Veit Elser: "Have you considered generalizing to sentences where each letter appears exactly twice? It should still be a challenge, but will improve the likelihood of being able to form a coherent statement." For a year I've been playing with this concept applied to factorization "sentences" in base ten. The letters are of course the ten digits. A sentence is an integer equated to its normally expressed factorization (greater-than-one powers of strictly increasing primes). No leading zeros. When each digit appears exactly k times, the sentences may be said to be "k-balanced". https://oeis.org/A273260 There are 4 solutions for k = 1 and 13022 solutions for k = 2. I recently showed that 4546782683595318279169 (= 7^10 * 2003^4) is the largest integer leading to a k = 3 solution (785984586660 is the smallest) and 19260075803546226131439208984375 (= 5^18 * 7 * 947^6) is the largest integer leading to a k = 4 solution (we'll likely never know the smallest). I have a solution for k = 13. Can you find a larger one?
The quick brown fox jumps over the lazy dog. Although not quite what you want, I can still recall typing this sentence (using thumbs only for the space bar!!) for hours during typing class while in junior high school in the early 1960's. A closely allied problem from the 1960's: what sequence of characters, when printed on an IBM 1403 chain printer, will break the chain? Is there an English sentence which could do this? (IBM went to a lot of trouble to make sure that no such sentence existed; I wonder if they ever published their research on this topic.) At 09:46 PM 8/2/2017, Keith F. Lynch wrote:
Bill Gosper's recent anagrams reminded me of my search for a perfect pangram, i.e. a sentence in English containing every letter once and only once. For instance, "Cwm fjord bank glyphs vext quiz" and "Mr. Jock, TV quiz PhD, bags few lynx," neither of which is really satisfactory.
My approach was to find a complete list of words, discard any words containing duplicate letters (e.g. "containing," as it contains two "n"s (and also two "i"s)), turn each of them into a 32-bit integer which indicates which letters are present ("of" becomes 2^4 + 2^13 as it contains the 5th and 14th letters). Anagrams are merged (e.g. the numbers for "opts," "post," "pots," "spot," "stop," and "tops" are the same). A depth-first tree search is then done during which the numbers are ORed together. If an AND gets a non-zero result, the branch is immediately abandoned. If 2^26-1 is reached, the result is logged, so that I can manually try to arrange the words (or anagrams of them) into a coherent sentence.
I tried this about 30 years ago, on a 286. Needless to say, I didn't get anywhere. It probably would have run for eons.
I'm considering trying again, on the same general plan, but only with the most common ten thousand words. Better yet, the most common thousand words containing "a", the most common thousand words containing "b", ..., and the most common thousand words containing "z," with the many duplicates merged, for a total of perhaps ten thousand. I might also include the most common thousand words *not* containing "a", the most common thousand words not containing "b", ..., and the most common thousand words not containing "z." Again, duplicates would be merged. And words that are anagrams would move up in the rankings as the usage of each anagram would be summed as if they were all the same word.
With memory as cheap as it is, I might also include every *pair* of words in the list as a word, even though that would increase memory usage from 40 kilobytes to 400 megabytes.
Has anyone tried this already? Does anyone have a better algorithm? Thanks.
Here’s another approach for searching for these things: Also, let’s look for “doubled-pangrams”, where each letter appears exactly twice. 1. Extract sequences of letters, including spaces (as word separators), from actual text. Each sequence should have 8 letters, but no more than two a’s, b’s etc. (the number of characters will vary, because of spaces). A sequence should begin/end with a space if the sequence begins/ends as a word. 2. Split each sequence in half, each containing 4 letters. Use these to construct a giant transition matrix, as you would for a Markov chain. The more transition elements (more processed text), the greater chance of success. 3. Do a backtracking search, using the transition matrix to extend the sentence in chunks of 4 letters. The first chunk should start with a blank. Reject branches when a new chunk exceeds the limit of two copies of each letter. If you reach depth = 13 (52 letters), limiting the final chunk to one that ends in a space, declare success. 4. There’s going to be a tradeoff between the size of chunks and the rate at which the search produces complete sentences. Small chunks will have a higher rate than large chunks. Large chunks will generate better sentences but require more memory for the transition matrix. -Veit
On Aug 3, 2017, at 12:46 AM, Keith F. Lynch <kfl@KeithLynch.net> wrote:
Bill Gosper's recent anagrams reminded me of my search for a perfect pangram, i.e. a sentence in English containing every letter once and only once. For instance, "Cwm fjord bank glyphs vext quiz" and "Mr. Jock, TV quiz PhD, bags few lynx," neither of which is really satisfactory.
My approach was to find a complete list of words, discard any words containing duplicate letters (e.g. "containing," as it contains two "n"s (and also two "i"s)), turn each of them into a 32-bit integer which indicates which letters are present ("of" becomes 2^4 + 2^13 as it contains the 5th and 14th letters). Anagrams are merged (e.g. the numbers for "opts," "post," "pots," "spot," "stop," and "tops" are the same). A depth-first tree search is then done during which the numbers are ORed together. If an AND gets a non-zero result, the branch is immediately abandoned. If 2^26-1 is reached, the result is logged, so that I can manually try to arrange the words (or anagrams of them) into a coherent sentence.
I tried this about 30 years ago, on a 286. Needless to say, I didn't get anywhere. It probably would have run for eons.
I'm considering trying again, on the same general plan, but only with the most common ten thousand words. Better yet, the most common thousand words containing "a", the most common thousand words containing "b", ..., and the most common thousand words containing "z," with the many duplicates merged, for a total of perhaps ten thousand. I might also include the most common thousand words *not* containing "a", the most common thousand words not containing "b", ..., and the most common thousand words not containing "z." Again, duplicates would be merged. And words that are anagrams would move up in the rankings as the usage of each anagram would be summed as if they were all the same word.
With memory as cheap as it is, I might also include every *pair* of words in the list as a word, even though that would increase memory usage from 40 kilobytes to 400 megabytes.
Has anyone tried this already? Does anyone have a better algorithm? Thanks.
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com https://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
This approach will get you some false positives. If abcde is a word, and bcedfg is a word, you will get sentences with abcdefg in them, which may not be a word. The sentences you get, if you get any, and they happen to consist of real words, will be better, but finding a set of words that use all the letters once is constrained enough that I suspect that this more restrictive version won't find any matches at all, even with a pretty large corpus. Look at the best sentences that manual investigations have found; you'll never find any 8-letter sequences from any of these in a corpus except in the part of it that is discussing this problem. Andy On Thu, Aug 3, 2017 at 10:49 AM, Veit Elser <ve10@cornell.edu> wrote:
Here’s another approach for searching for these things:
Also, let’s look for “doubled-pangrams”, where each letter appears exactly twice.
1. Extract sequences of letters, including spaces (as word separators), from actual text. Each sequence should have 8 letters, but no more than two a’s, b’s etc. (the number of characters will vary, because of spaces). A sequence should begin/end with a space if the sequence begins/ends as a word.
2. Split each sequence in half, each containing 4 letters. Use these to construct a giant transition matrix, as you would for a Markov chain. The more transition elements (more processed text), the greater chance of success.
3. Do a backtracking search, using the transition matrix to extend the sentence in chunks of 4 letters. The first chunk should start with a blank. Reject branches when a new chunk exceeds the limit of two copies of each letter. If you reach depth = 13 (52 letters), limiting the final chunk to one that ends in a space, declare success.
4. There’s going to be a tradeoff between the size of chunks and the rate at which the search produces complete sentences. Small chunks will have a higher rate than large chunks. Large chunks will generate better sentences but require more memory for the transition matrix.
-Veit
On Aug 3, 2017, at 12:46 AM, Keith F. Lynch <kfl@KeithLynch.net> wrote:
Bill Gosper's recent anagrams reminded me of my search for a perfect pangram, i.e. a sentence in English containing every letter once and only once. For instance, "Cwm fjord bank glyphs vext quiz" and "Mr. Jock, TV quiz PhD, bags few lynx," neither of which is really satisfactory.
My approach was to find a complete list of words, discard any words containing duplicate letters (e.g. "containing," as it contains two "n"s (and also two "i"s)), turn each of them into a 32-bit integer which indicates which letters are present ("of" becomes 2^4 + 2^13 as it contains the 5th and 14th letters). Anagrams are merged (e.g. the numbers for "opts," "post," "pots," "spot," "stop," and "tops" are the same). A depth-first tree search is then done during which the numbers are ORed together. If an AND gets a non-zero result, the branch is immediately abandoned. If 2^26-1 is reached, the result is logged, so that I can manually try to arrange the words (or anagrams of them) into a coherent sentence.
I tried this about 30 years ago, on a 286. Needless to say, I didn't get anywhere. It probably would have run for eons.
I'm considering trying again, on the same general plan, but only with the most common ten thousand words. Better yet, the most common thousand words containing "a", the most common thousand words containing "b", ..., and the most common thousand words containing "z," with the many duplicates merged, for a total of perhaps ten thousand. I might also include the most common thousand words *not* containing "a", the most common thousand words not containing "b", ..., and the most common thousand words not containing "z." Again, duplicates would be merged. And words that are anagrams would move up in the rankings as the usage of each anagram would be summed as if they were all the same word.
With memory as cheap as it is, I might also include every *pair* of words in the list as a word, even though that would increase memory usage from 40 kilobytes to 400 megabytes.
Has anyone tried this already? Does anyone have a better algorithm? Thanks.
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com https://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com https://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
-- Andy.Latto@pobox.com
On Aug 3, 2017, at 10:56 AM, Andy Latto <andy.latto@pobox.com> wrote:
This approach will get you some false positives. If abcde is a word, and bcedfg is a word, you will get sentences with abcdefg in them, which may not be a word. If the rate of false positives is too high, make the chunks larger (10=5+5 instead of 8=4+4).
The sentences you get, if you get any, and they happen to consist of real words, will be better, but finding a set of words that use all the letters once is constrained enough that I suspect that this more restrictive version won't find any matches at all, even with a pretty large corpus. Look at the best sentences that manual investigations have found; you'll never find any 8-letter sequences from any of these in a corpus except in the part of it that is discussing this problem. I recognized that and propose this method only for the generalized problem, where each letter appears exactly k times. Surely there are solutions for sufficiently large k, maybe even k=2.
Andy
On Thu, Aug 3, 2017 at 10:49 AM, Veit Elser <ve10@cornell.edu> wrote:
Here’s another approach for searching for these things:
Also, let’s look for “doubled-pangrams”, where each letter appears exactly twice.
1. Extract sequences of letters, including spaces (as word separators), from actual text. Each sequence should have 8 letters, but no more than two a’s, b’s etc. (the number of characters will vary, because of spaces). A sequence should begin/end with a space if the sequence begins/ends as a word.
2. Split each sequence in half, each containing 4 letters. Use these to construct a giant transition matrix, as you would for a Markov chain. The more transition elements (more processed text), the greater chance of success.
3. Do a backtracking search, using the transition matrix to extend the sentence in chunks of 4 letters. The first chunk should start with a blank. Reject branches when a new chunk exceeds the limit of two copies of each letter. If you reach depth = 13 (52 letters), limiting the final chunk to one that ends in a space, declare success.
4. There’s going to be a tradeoff between the size of chunks and the rate at which the search produces complete sentences. Small chunks will have a higher rate than large chunks. Large chunks will generate better sentences but require more memory for the transition matrix.
-Veit
On Aug 3, 2017, at 12:46 AM, Keith F. Lynch <kfl@KeithLynch.net> wrote:
Bill Gosper's recent anagrams reminded me of my search for a perfect pangram, i.e. a sentence in English containing every letter once and only once. For instance, "Cwm fjord bank glyphs vext quiz" and "Mr. Jock, TV quiz PhD, bags few lynx," neither of which is really satisfactory.
My approach was to find a complete list of words, discard any words containing duplicate letters (e.g. "containing," as it contains two "n"s (and also two "i"s)), turn each of them into a 32-bit integer which indicates which letters are present ("of" becomes 2^4 + 2^13 as it contains the 5th and 14th letters). Anagrams are merged (e.g. the numbers for "opts," "post," "pots," "spot," "stop," and "tops" are the same). A depth-first tree search is then done during which the numbers are ORed together. If an AND gets a non-zero result, the branch is immediately abandoned. If 2^26-1 is reached, the result is logged, so that I can manually try to arrange the words (or anagrams of them) into a coherent sentence.
I tried this about 30 years ago, on a 286. Needless to say, I didn't get anywhere. It probably would have run for eons.
I'm considering trying again, on the same general plan, but only with the most common ten thousand words. Better yet, the most common thousand words containing "a", the most common thousand words containing "b", ..., and the most common thousand words containing "z," with the many duplicates merged, for a total of perhaps ten thousand. I might also include the most common thousand words *not* containing "a", the most common thousand words not containing "b", ..., and the most common thousand words not containing "z." Again, duplicates would be merged. And words that are anagrams would move up in the rankings as the usage of each anagram would be summed as if they were all the same word.
With memory as cheap as it is, I might also include every *pair* of words in the list as a word, even though that would increase memory usage from 40 kilobytes to 400 megabytes.
Has anyone tried this already? Does anyone have a better algorithm? Thanks.
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com https://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com https://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
-- Andy.Latto@pobox.com
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com https://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
Better algorithm: No need to restrict the word list, gets a full search, fairly fast, no tree search. Divide words into two sets, those with a A, and those without. Form all legal combinations of each set. (Treat the words with earliest letter B as a group, attempting merge with A list and with (1 element, no bits, no words). Then merge in the Cs, etc. This avoids trying to form the power set directly.) No list can have more than 2^25 elements. Sort the two lists. Reverse the no-A list. Do a "merge" of the two lists, looking for an arithmetic sum of exactly 2^26-1. Record successes. Either redo the run to notice the ingredients in the matches, or, in the original list build, record the words or bit patterns for each item. Rich ________________________________________ From: math-fun [math-fun-bounces+rschroe=sandia.gov@mailman.xmission.com] on behalf of Keith F. Lynch [kfl@KeithLynch.net] Sent: Wednesday, August 2, 2017 10:46 PM To: math-fun@mailman.xmission.com Subject: [EXTERNAL] [math-fun] ISO a perfect pangram Bill Gosper's recent anagrams reminded me of my search for a perfect pangram, i.e. a sentence in English containing every letter once and only once. For instance, "Cwm fjord bank glyphs vext quiz" and "Mr. Jock, TV quiz PhD, bags few lynx," neither of which is really satisfactory. My approach was to find a complete list of words, discard any words containing duplicate letters (e.g. "containing," as it contains two "n"s (and also two "i"s)), turn each of them into a 32-bit integer which indicates which letters are present ("of" becomes 2^4 + 2^13 as it contains the 5th and 14th letters). Anagrams are merged (e.g. the numbers for "opts," "post," "pots," "spot," "stop," and "tops" are the same). A depth-first tree search is then done during which the numbers are ORed together. If an AND gets a non-zero result, the branch is immediately abandoned. If 2^26-1 is reached, the result is logged, so that I can manually try to arrange the words (or anagrams of them) into a coherent sentence. I tried this about 30 years ago, on a 286. Needless to say, I didn't get anywhere. It probably would have run for eons. I'm considering trying again, on the same general plan, but only with the most common ten thousand words. Better yet, the most common thousand words containing "a", the most common thousand words containing "b", ..., and the most common thousand words containing "z," with the many duplicates merged, for a total of perhaps ten thousand. I might also include the most common thousand words *not* containing "a", the most common thousand words not containing "b", ..., and the most common thousand words not containing "z." Again, duplicates would be merged. And words that are anagrams would move up in the rankings as the usage of each anagram would be summed as if they were all the same word. With memory as cheap as it is, I might also include every *pair* of words in the list as a word, even though that would increase memory usage from 40 kilobytes to 400 megabytes. Has anyone tried this already? Does anyone have a better algorithm? Thanks. _______________________________________________ math-fun mailing list math-fun@mailman.xmission.com https://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
participants (6)
-
Andy Latto -
Hans Havermann -
Henry Baker -
Keith F. Lynch -
Schroeppel, Richard -
Veit Elser