[math-fun] number of longest common substrings problem
I ran into my old friend Al Aho today for the first time in many years, and he mentioned the following possibly unsolved problem. Given two strings X and Y of length n from some fixed alphabet, let f(X,Y) = number of longest common substrings of X and Y. What is a(n) = max f(X,Y) over all choices of X and Y? Remarks: S is a substring of X if S can be obtained by deleting some of the entries of X (not necessarily adjacent). You can choose any alphabet you like. Here is a definition of "longest common substring" from http://www.cs.sunysb.edu/~algorith/files/longest-common-substring.shtml (see also http://www.nist.gov/dads/HTML/longestCommonSubstring.html) Given two strings X and Y, what is the longest string S such that the characters of S appear as a scattered subsequence of both X and Y? Then the question is, how many different longest common substrings can occur? If someone would compute the initial terms, we could look it up in the OEIS and see if it a solved problem! NJAS
I snooped around in http://www.cs.sunysb.edu/~algorith/files/longest-common-substring.shtml and, under header 'Implementations', dived into Combinatorica (Mathematica) (rating 2) only to find the absence of an implementation. This must be the mathematical equivalent of 'vacuum energy', the famous object that is symmetric to anything you throw at it, the ultimate nothingness ? ;-) W. ----- Original Message ----- From: "N. J. A. Sloane" <njas@research.att.com> To: <math-fun@mailman.xmission.com>; <seqfan@ext.jussieu.fr> Cc: <njas@research.att.com> Sent: Saturday, June 12, 2004 5:40 AM Subject: number of longest common substrings problem I ran into my old friend Al Aho today for the first time in many years, and he mentioned the following possibly unsolved problem. Given two strings X and Y of length n from some fixed alphabet, let f(X,Y) = number of longest common substrings of X and Y. What is a(n) = max f(X,Y) over all choices of X and Y? Remarks: S is a substring of X if S can be obtained by deleting some of the entries of X (not necessarily adjacent). You can choose any alphabet you like. Here is a definition of "longest common substring" from http://www.cs.sunysb.edu/~algorith/files/longest-common-substring.shtml (see also http://www.nist.gov/dads/HTML/longestCommonSubstring.html) Given two strings X and Y, what is the longest string S such that the characters of S appear as a scattered subsequence of both X and Y? Then the question is, how many different longest common substrings can occur? If someone would compute the initial terms, we could look it up in the OEIS and see if it a solved problem! NJAS
"N. J. A. Sloane" <njas@research.att.com> wrote: :I ran into my old friend Al Aho today for :the first time in many years, and he mentioned :the following possibly unsolved problem. : :Given two strings X and Y of length n from some fixed :alphabet, let f(X,Y) = number of longest common substrings of X and Y. :What is a(n) = max f(X,Y) over all choices of X and Y? : :Remarks: : :S is a substring of X if S can be :obtained by deleting some of the entries of X :(not necessarily adjacent). : :You can choose any alphabet you like. : :Here is a definition of "longest common substring" from : :http://www.cs.sunysb.edu/~algorith/files/longest-common-substring.shtml : :(see also http://www.nist.gov/dads/HTML/longestCommonSubstring.html) : :Given two strings X and Y, :what is the longest string S such that the characters of S appear :as a scattered subsequence of both X and Y? : :Then the question is, how many different longest common substrings :can occur? Here you introduce the word "different" for the first time. Are two substrings "different" if the substrings themselves differ, or is it sufficient for the letters of the substring to be drawn from different positions in X or Y? Ie given: X = "aabaa", Y = "aaabb" are there 1 or 4 occurrences of the LCS "aaa"? Hugo van der Sanden
At 11:40 PM 6/11/2004, N. J. A. Sloane wrote:
If someone would compute the initial terms, we could look it up in the OEIS and see if it a solved problem!
Don't send me something like this a few minutes before midnight - I was thinking about how to do it when I was trying to go to sleep! :-) But if no one else wants to do it, I will probably do it tonight or Sunday.
Is it obvious that the best answer is always going to use a two-character alphabet? It seems to be true, empirically. My search program would be a lot faster if it only checked the two-character alphabet. Russ
I think the C program below computes the initial terms of this sequence, for a two-character alphabet or an arbitrary alphabet. For a two-character alphabet: len 1: 1 lcs of length 1 for a a len 2: 2 lcs of length 1 for aa ab len 3: 4 lcs of length 2 for aab abb len 4: 10 lcs of length 2 for abba baab len 5: 24 lcs of length 2 for abbba baaab len 6: 46 lcs of length 3 for aabbba abaaab len 7: 100 lcs of length 4 for aaaaabb aabbbbb len 8: 225 lcs of length 4 for aaaaaabb aabbbbbb len 9: 525 lcs of length 5 for aaaaaaabb aaabbbbbb len 10: 1225 lcs of length 6 for aaaaaaabbb aaabbbbbbb len 11: 3136 lcs of length 6 for aaaaaaaabbb aaabbbbbbbb len 12: 7056 lcs of length 7 for aaaaaaaaabbb aaaabbbbbbbb The search finds the alphabetically earliest pair of strings that exhibit the maximum. It's interesting that len 4, 5, 6 don't follow the same string pattern as all the rest. Over alphabets of arbitrary size, the results match the above through len 6. Beyond that, the search takes longer than I am patient. The sequence 1,2,4,10,24,46 does not appear in the OEIS. I'd be more comfortable submitting it if the terms were corroborated by someone else. Russ #include <stdio.h> #include <stdlib.h> #define MAXN 20 #define debug 1 typedef struct Match Match; struct Match { int len; int count; }; /* * Compute number of lcs for a, b using dynamic programming. * The lcs for the substrings a[0:i] b[0:j] is one of * * if a[i]==b[j], lcs(a[0:i-1], b[0:j-1])+a[i] * lcs(a[0:i-1], b[0:j]) * lcs(a[0:i], b[0:j-1]) * * We can compute the number of lcs by just adding up the * number from each place, except that if we take them from * lcs(a[0:i-1], b[0:j]) and lcs(a[0:i], b[0:j-1]) and those in turn * didn't actually add any new ones, then we'll double-count * the ones from lcs(a[0:i-1], b[0:j-1]), so we need to subtract * those out. */ int numlcs(char *a, char *b, int n, int *plen) { int i, j, d, t, len, count; Match match[MAXN][MAXN]; for(d=0; d<2*n-1; d++){ i = 0; j = d-i; if(j >= n){ j = n-1; i = d-j; } for(; i<n && j>=0; i++, j--){ /* start with nothing */ len = 0; count = 0; /* try adding to a previous match */ if(a[i] == b[j]){ if(i>0 && j>0){ len = match[i-1][j-1].len+1; count = match[i-1][j-1].count; }else{ len = 1; count = 1; } } /* try matches not containing a[i] */ if(i>0){ if((t=match[i-1][j].len) > len){ len = t; count = match[i-1][j].count; }else if(t == len) count += match[i-1][j].count; } /* try matches not containing b[j] */ if(j>0){ if((t=match[i][j-1].len) > len){ len = t; count = match[i][j-1].count; }else if(t == len) count += match[i][j-1].count; } /* we may have double-counted matches not containing either */ if(i>0 && j>0 && match[i-1][j-1].len==len) count -= match[i-1][j-1].count; match[i][j].len = len; match[i][j].count = count; } } if(plen) *plen = match[n-1][n-1].len; return match[n-1][n-1].count; } /* * Enumerate all pairs of strings of length n over an alphabet of size nalpha * and run numlcs on them. buf contains the concatenation of the two strings; * only nbuf of the characters have been filled in so far. The nbuf characters * only use the first nused characters in the alphabet, so we need only * consider the first nused+1 alphabet characters as next possible. * * Accumulate the current max in bestcount, bestlen, besta, bestb. */ int bestcount; int bestlen; char besta[MAXN+1]; char bestb[MAXN+1]; char *alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"; void search(int nalpha, int nused, int n, char *buf, int nbuf) { int count, len; char *p, *ep; if(nbuf == 2*n){ count = numlcs(buf, buf+n, n, &len); if(count > bestcount){ bestcount = count; bestlen = len; memmove(besta, buf, n); memmove(bestb, buf+n, n); } }else{ ep = alpha+nalpha; if(nused < nalpha) ep = alpha+nused+1; for(p=alpha; p<ep; p++){ buf[nbuf] = *p; search(nalpha, (p<alpha+nused ? nused : nused+1), n, buf, nbuf+1); } } } void usage(void) { fprintf(stderr, "usage: lcs count aabaa aaabb\n"); fprintf(stderr, " or lcs search n\n"); fprintf(stderr, " or lcs search2 n\n"); exit(1); } int main(int argc, char **argv) { int i, n, len; char buf[2*MAXN]; if(argc < 2) usage(); if(strcmp(argv[1], "count") == 0){ if(argc != 4) usage(); n = strlen(argv[2]); if(n != strlen(argv[3])){ fprintf(stderr, "string lengths must match\n"); exit(1); } n = numlcs(argv[2], argv[3], n, &len); printf("%d of length %d\n", n, len); }else if(strcmp(argv[1], "search") == 0){ if(argc != 3) usage(); n = atoi(argv[2]); search(2*n, 0, n, buf, 0); printf("len %d: %d lcs of length %d for %s %s (all alphabets)\n", n, bestcount, bestlen,besta, bestb); }else if(strcmp(argv[1], "search2") == 0){ if(argc != 3) usage(); n = atoi(argv[2]); search(2, 0, n, buf, 0); printf("len %d: %d lcs of length %d for %s %s (just alphabet=ab)\n", n, bestcount, bestlen, besta, bestb); }else usage(); return 0; }
On Fri, 11 Jun 2004, N. J. A. Sloane wrote:
Given two strings X and Y of length n from some fixed alphabet, let f(X,Y) = number of longest common substrings of X and Y. What is a(n) = max f(X,Y) over all choices of X and Y?
There seem to be many different ways to interpret this problem. In particular internet seaches indicate there is a difference between common substrings and common subsequences. From Neil's description it seems he is referring to what some call common "subsequences". Maple, for example, in its StringTools package has the two procedures LongestCommonSubString(s1,s2) and LongestCommonSubSequence(s1,s2) where s1 and s2 are strings. These have the following descriptions (which seem to agree with other definitions found by googling around.) -------------------------------------------------------------------- A substring of a string S is a contiguous sequence of the characters appearing in S. The empty string is a substring of every string. A subsequence of a string S is a sequence of characters from S, which may not be contiguous in S. Every substring of S is a subsequence of S. For example, "bc" is a substring of "abc", and "ac" is a subsequence of "abc" which is not a substring. The LongestCommonSubString(s1, s2) command returns from its input strings, s1 and s2, a common substring of maximum length. Many common substrings of maximum length may exist. Which among the candidates is returned depends upon the suffix structure of the pair of strings, but is deterministic. The LongestCommonSubSequence(s1, s2) command is similar, but searches for subsequences of the pair of input strings rather than substrings. -------------------------------------------------------------------- Once we know what a longest common subsequence or substring is, there is the problem of how to count them: with multiplicities or not? Anyhow, I did one of the simplest interpretations, presumably NOT one that Aho intended, nevertheless perhaps new and of interest: I'm counting (contiguous) substrings and not counting multiplicity: What I did was to find the set CS(s1,s2) of common substrings (in the above sense) for binary strings s1 and s2 of length n and count the number A(s1,s2) of strings in CS(s1,s2) that are the longest. Then a(n) is the max of A(s1,s2) over all such strings. If I haven't made an error this gives the sequence a(n), n = 1,...,11: 1, 2, 2, 2, 3, 3, 4, 6, 6, 7, 7 For example the fact that a(7) = 4 is given by the two strings s = 0001011,t = 0011010 for which the set of maximum length common substrings is {011,001,101,010} and the fact that no other pair of strings of length 7 has more than 4 common maximum length substrings. Apparently this sequence is not is the OEIS. But it seems simple enough that maybe someone can find a formula for it. And maybe someone using something faster than Maple can compute more terms even if no one can figure out a formula. --Edwin
participants (6)
-
Edwin Clark -
hv@crypt.org -
Jud McCranie -
N. J. A. Sloane -
Russ Cox -
wouter meeussen