On Aug 3, 2017, at 10:56 AM, Andy Latto <andy.latto@pobox.com> wrote:
This approach will get you some false positives. If abcde is a word, and bcdefg is a word, you will get sentences with abcdefg in them, which may not be a word. If the rate of false positives is too high, make the chunks larger (10=5+5 instead of 8=4+4).
The sentences you get, if you get any and if they happen to consist of real words, will be better, but finding a set of words that use all the letters once is constrained enough that I suspect this more restrictive version won't find any matches at all, even with a pretty large corpus. Look at the best sentences that manual investigations have found: you'll never find any 8-letter sequence from any of these in a corpus, except in the part of it that is discussing this problem.

I recognized that, and propose this method only for the generalized problem, where each letter appears exactly k times. Surely there are solutions for sufficiently large k, maybe even k=2.
Andy
On Thu, Aug 3, 2017 at 10:49 AM, Veit Elser <ve10@cornell.edu> wrote:
Here’s another approach for searching for these things:
Also, let’s look for “doubled-pangrams”, where each letter appears exactly twice.
1. Extract sequences of letters, including spaces (as word separators), from actual text. Each sequence should have 8 letters, but no more than two a's, two b's, etc. (the number of characters will vary because of the spaces). A sequence should begin/end with a space if it begins/ends at a word boundary.
2. Split each sequence in half, each half containing 4 letters. Use these halves to construct a giant transition matrix, as you would for a Markov chain. The more transition elements (i.e. the more text processed), the greater the chance of success.
3. Do a backtracking search, using the transition matrix to extend the sentence in chunks of 4 letters. The first chunk should start with a blank. Reject a branch when a new chunk would exceed the limit of two copies of any letter. If you reach depth = 13 (52 letters), restricting the final chunk to one that ends in a space, declare success.
4. There’s going to be a tradeoff between the size of the chunks and the rate at which the search produces complete sentences. Small chunks will have a higher rate than large chunks; large chunks will generate better sentences but require more memory for the transition matrix. (A rough code sketch of steps 1-3 follows below.)
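Here is a rough, untested Python sketch of steps 1-3 for the doubled-pangram case. The corpus file name and every identifier are made up for illustration, and word boundaries are handled in a simplified way, by tagging each letter that starts a word with a leading space:

# Sketch only: build the chunk transition table from a corpus and run the
# backtracking search.  "corpus.txt" and every name below are illustrative.
import re
from collections import defaultdict

CHUNK = 4                        # letters per half-sequence
COPIES = 2                       # doubled pangram: each letter at most twice
DEPTH = 26 * COPIES // CHUNK     # 13 chunks of 4 letters = 52 letters

def tokens(text):
    """One token per letter; a leading space marks a word boundary before that letter."""
    out = []
    for word in re.findall(r"[a-z]+", text.lower()):
        out.append(" " + word[0])
        out.extend(word[1:])
    return out

def build_transitions(toks):
    """Map each first-half chunk (4 tokens) to the second-half chunks seen right after it.
    Each second half is flagged if it ends at a word boundary (usable as a final chunk)."""
    trans = defaultdict(set)
    for i in range(len(toks) - 2 * CHUNK + 1):
        first = tuple(toks[i:i + CHUNK])
        second = tuple(toks[i + CHUNK:i + 2 * CHUNK])
        end = i + 2 * CHUNK
        ends_word = end == len(toks) or toks[end].startswith(" ")
        trans[first].add((second, ends_word))
    return trans

def add_counts(counts, chunk):
    """Return updated letter counts, or None if some letter would exceed COPIES."""
    new = dict(counts)
    for tok in chunk:
        c = tok[-1]
        new[c] = new.get(c, 0) + 1
        if new[c] > COPIES:
            return None
    return new

def search(trans):
    """Backtracking search: grow the sentence one chunk at a time via the transition table."""
    def rec(sentence, counts):
        if len(sentence) == DEPTH:
            # 13 chunks of 4 letters under the two-copy limit means
            # every letter has been used exactly twice.
            yield "".join(t for ch in sentence for t in ch).strip()
            return
        for nxt, ends_word in trans.get(sentence[-1], ()):
            new = add_counts(counts, nxt)
            if new is None:
                continue
            if len(sentence) + 1 == DEPTH and not ends_word:
                continue          # the final chunk must end at a word boundary
            yield from rec(sentence + [nxt], new)

    for start in trans:
        if start[0].startswith(" "):      # the sentence must begin at the start of a word
            counts = add_counts({}, start)
            if counts is not None:
                yield from rec([start], counts)

# Example use (any large plain-text file will do):
#   trans = build_transitions(tokens(open("corpus.txt").read()))
#   for sentence in search(trans):
#       print(sentence)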
-Veit
On Aug 3, 2017, at 12:46 AM, Keith F. Lynch <kfl@KeithLynch.net> wrote:
Bill Gosper's recent anagrams reminded me of my search for a perfect pangram, i.e. a sentence in English containing every letter once and only once. For instance, "Cwm fjord bank glyphs vext quiz" and "Mr. Jock, TV quiz PhD, bags few lynx," neither of which is really satisfactory.
My approach was to find a complete list of words, discard any word containing duplicate letters (e.g. "containing," as it contains two "n"s (and also two "i"s)), and turn each remaining word into a 32-bit integer indicating which letters are present ("of" becomes 2^5 + 2^14, as it contains the 6th and 15th letters). Anagrams are merged (e.g. the numbers for "opts," "post," "pots," "spot," "stop," and "tops" are the same). A depth-first tree search is then done during which the numbers are ORed together; if ANDing a word's number with the letters already used gives a non-zero result, the branch is immediately abandoned. If 2^26-1 is reached, the result is logged, so that I can manually try to arrange the words (or anagrams of them) into a coherent sentence.
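In rough, untested Python, that search looks something like this ("words.txt" and all the names below are just for illustration):

# Sketch of the bitmask depth-first search: each candidate word becomes a 26-bit
# mask of its letters, words sharing a letter (non-zero AND) cannot coexist, and
# success is reaching the mask 2^26 - 1.  "words.txt" is illustrative.

def mask(word):
    """26-bit mask of the word's letters, or None if any letter repeats."""
    m = 0
    for c in word.lower():
        bit = 1 << (ord(c) - ord("a"))
        if m & bit:
            return None
        m |= bit
    return m

def load(path):
    """Map each mask to the words (mutual anagrams) that produce it."""
    groups = {}
    with open(path) as f:
        for w in f:
            w = w.strip()
            if w.isascii() and w.isalpha():
                m = mask(w)
                if m is not None:
                    groups.setdefault(m, []).append(w)
    return groups

def search(masks, used=0, start=0, chosen=()):
    """Depth-first search over anagram classes whose masks OR to all 26 letters."""
    if used == (1 << 26) - 1:
        yield chosen
        return
    for i in range(start, len(masks)):
        m = masks[i]
        if used & m:          # shares a letter with what's already used: abandon branch
            continue
        yield from search(masks, used | m, i + 1, chosen + (m,))

groups = load("words.txt")
masks = sorted(groups)        # one number per anagram class
for solution in search(masks):
    print([groups[m] for m in solution])
    break                     # remove to log every solution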
I tried this about 30 years ago, on a 286. Needless to say, I didn't get anywhere. It probably would have run for eons.
I'm considering trying again, on the same general plan, but only with the most common ten thousand words. Better yet, the most common thousand words containing "a", the most common thousand words containing "b", ..., and the most common thousand words containing "z", with the many duplicates merged, for a total of perhaps ten thousand. I might also include the most common thousand words *not* containing "a", the most common thousand words not containing "b", ..., and the most common thousand words not containing "z". Again, duplicates would be merged. And words that are anagrams would move up in the rankings, since the usage counts of all the anagrams would be summed as if they were the same word.
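In rough, untested Python, that list construction might look like this ("freq.txt", its "word count" line format, and all the names are just illustrative):

# Sketch of the candidate-list construction: drop words with repeated letters,
# merge anagrams by summing their frequencies, then keep, for each letter, the
# most frequent thousand entries containing it and the most frequent thousand
# not containing it.  "freq.txt" (lines of "word count") is illustrative.
from collections import defaultdict
import string

def load_frequencies(path):
    freq = defaultdict(int)   # anagram class (sorted letters) -> summed count
    rep = {}                  # anagram class -> one representative spelling
    with open(path) as f:
        for line in f:
            word, count = line.split()
            word = word.lower()
            if not word.isalpha() or len(set(word)) != len(word):
                continue      # repeated letters can never appear in a perfect pangram
            key = "".join(sorted(word))
            freq[key] += int(count)
            rep.setdefault(key, word)
    return freq, rep

def candidates(freq, per_letter=1000):
    ranked = sorted(freq, key=freq.get, reverse=True)
    keep = set()
    for letter in string.ascii_lowercase:
        keep.update([k for k in ranked if letter in k][:per_letter])
        keep.update([k for k in ranked if letter not in k][:per_letter])
    return keep

freq, rep = load_frequencies("freq.txt")
words = sorted(candidates(freq), key=freq.get, reverse=True)
print(len(words), [rep[k] for k in words[:10]])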
With memory as cheap as it is, I might also include every *pair* of words in the list as a word, even though that would increase memory usage from 40 kilobytes to 400 megabytes.
Has anyone tried this already? Does anyone have a better algorithm? Thanks.
-- Andy.Latto@pobox.com