[math-fun] Summary so far: number of longest common substrings problem

12 Jun 2004

      Since I posted the original message to both the math-fun
and seq-fan lists, and some readers of one list may not
have been able to post their replies to both lists,
the following gives all the replies I have received to date
(13:00 Sat Jun 12 2004), beginning with the original posting.
There are 355 lines.
NJAS
...
From njas@research.att.com  Fri Jun 11 23:40:54 2004
Date: Fri, 11 Jun 2004 23:40:53 -0400 (EDT)
From: "N. J. A. Sloane" <njas@research.att.com>
Message-Id: <200406120340.XAA13275@fry.research.att.com>
To: math-fun@mailman.xmission.com, seqfan@ext.jussieu.fr
Subject: number of longest common substrings problem
I ran into my old friend Al Aho today for
the first time in many years, and he mentioned
the following possibly unsolved problem.

Given two strings X and Y of length n from some fixed
alphabet, let f(X,Y) = number of longest common substrings of X and Y.
What is a(n) = max f(X,Y) over all choices of X and Y?

Remarks: 

S is a substring of X if S can be
obtained by deleting some of the entries of X
(not necessarily adjacent).

You can choose any alphabet you like.

Here is a definition of "longest common substring" from

http://www.cs.sunysb.edu/~algorith/files/longest-common-substring.shtml

(see also http://www.nist.gov/dads/HTML/longestCommonSubstring.html)

Given two strings X and Y,
what is the longest string S such that the characters of S appear
as a scattered subsequence of both X and Y?

Then the question is, how many different longest common substrings
can occur?

If someone would compute the initial terms,
we could look it up in the OEIS and see if it a solved problem!

NJAS
...
From hv@crypt.org  Sat Jun 12 10:25:07 2004
Cc: math-fun@mailman.xmission.com, seqfan@ext.jussieu.fr
Subject: Re: number of longest common substrings problem
Here you introduce the word "different" for the first time. Are two
substrings "different" if the substrings themselves differ, or is it
sufficient for the letters of the substring to be drawn from different
positions in X or Y? Ie given:
  X = "aabaa", Y = "aaabb"
are there 1 or 4 occurrences of the LCS "aaa"?

Hugo van der Sanden

My reply:

The latter, I think, since the goal of the question is to
determine how rapidly these numbers can grow.

So in your example, the LCS has length 3, and
there seem to be 10 ways to get 3:

X = aabaa  Y = aaabb
---------------------
    aab..      aa.b.
    aab..      aa..b
    aab..      a.ab.
    aab..      a.a.b
    aab..      .aab.
    aab..      .aa.b
    aa.a.      aaa..
    aa..a      aaa..
    a..aa      aaa..
    .a.aa      aaa..

so a(5) >= 10.

NJAS
...
From Sterten@aol.com  Sat Jun 12 11:15:42 2004
Date: Sat, 12 Jun 2004 11:15:06 EDT
Subject: re: number of longest common substrings problem
let m be the size of the  alphabet.
I get :  (n=1,2,...)
m=2:1,2,2,4,4,6,8,11,15,20,
m=3:1,2,3,4,6,10,
m=4:1,2,3,4,7,
m=5:1,2,3,4,7,

I assume that the maximum is usually attained for  
X=123..m123..m...  , except for some small values of n.

fixing X this way  gives:
m=2:1,2,2,3,4,6,7,10,14,20,26,36,51,70,96,141,
m=3:1,2,3,4,6,10,14,24,36,58,
m=4:1,2,3,4,7,10,19,28,51,
m=5:1,2,3,4,6,10,15,28,
m=6:1,2,3,4,6,9,14,

with hillclimbing my best values  were:
m=2:1,2,2,4,4,6,8,11,15,20,26,36,50,70,96,141,192,
m=3:1,2,3,4,6,10,14,24,36,58,88,145,
m=4:1,2,3,4,7,10,19,28,51,78,141,220,
m=5:1,2,3,4,7,10,19,28,44,69,112,186,

my lower bounds for m=5 are obviously bad, or there could
even be a bug  in my program.

I didn't find the above sequences in the OEIS

Guenter
...
From j.mccranie@adelphia.net  Sat Jun 12 11:16:50 2004
Date: Sat, 12 Jun 2004 11:16:08 -0400
To: njas@research.att.com
From: Jud McCranie <j.mccranie@adelphia.net>
Subject: Re: number of longest common substrings problem
Don't send me something like this a few minutes before midnight - I was 
thinking about how to do it when I was trying to go to sleep! :-)

But if no one else wants to do it, I will probably do it tonight or Sunday.
...
From russcox@gmail.com  Sat Jun 12 12:46:49 2004
Date: Sat, 12 Jun 2004 12:46:17 -0400
From: Russ Cox <russcox@gmail.com>
To: njas@research.att.com
Subject: Re: [math-fun] number of longest common substrings problem
I think the C program below computes the initial terms
of this sequence, for a two-character alphabet or an
arbitrary alphabet.  For a two-character alphabet:

len 1: 1 lcs of length 1 for a a
len 2: 2 lcs of length 1 for aa ab
len 3: 4 lcs of length 2 for aab abb
len 4: 10 lcs of length 2 for abba baab
len 5: 24 lcs of length 2 for abbba baaab
len 6: 46 lcs of length 3 for aabbba abaaab
len 7: 100 lcs of length 4 for aaaaabb aabbbbb
len 8: 225 lcs of length 4 for aaaaaabb aabbbbbb
len 9: 525 lcs of length 5 for aaaaaaabb aaabbbbbb
len 10: 1225 lcs of length 6 for aaaaaaabbb aaabbbbbbb
len 11: 3136 lcs of length 6 for aaaaaaaabbb aaabbbbbbbb
len 12: 7056 lcs of length 7 for aaaaaaaaabbb aaaabbbbbbbb

The search finds the alphabetically earliest pair of strings
that exhibit the maximum.  It's interesting that len 4, 5, 6 don't
follow the same string pattern as all the rest.

Over alphabets of arbitrary size, the results match the above
through len 6.  Beyond that, the search takes longer than
I am patient.

The sequence 1,2,4,10,24,46 does not appear in the OEIS.
I'd be more comfortable submitting it if the terms were 
corroborated by someone else.

Russ

#include <stdio.h>
#include <stdlib.h>

#define MAXN 20
#define debug 1

typedef struct Match Match;
struct Match
{
	int len;
	int count;
};

/*
 * Compute number of lcs for a, b using dynamic programming. 
 * The lcs for the substrings a[0:i] b[0:j] is one of
 *
 *	if a[i]==b[j], lcs(a[0:i-1], b[0:j-1])+a[i]
 *	lcs(a[0:i-1], b[0:j])
 *	lcs(a[0:i], b[0:j-1])
 *
 * We can compute the number of lcs by just adding up the
 * number from each place, except that if we take them from
 * lcs(a[0:i-1], b[0:j]) and lcs(a[0:i], b[0:j-1]) and those in turn
 * didn't actually add any new ones, then we'll double-count
 * the ones from lcs(a[0:i-1], b[0:j-1]), so we need to subtract
 * those out.
 */
int
numlcs(char *a, char *b, int n, int *plen)
{
	int i, j, d, t, len, count;
	Match match[MAXN][MAXN];

	for(d=0; d<2*n-1; d++){
		i = 0;
		j = d-i;
		if(j >= n){
			j = n-1;
			i = d-j;
		}
		for(; i<n && j>=0; i++, j--){
			/* start with nothing */
			len = 0;
			count = 0;

			/* try adding to a previous match */
			if(a[i] == b[j]){
				if(i>0 && j>0){
					len = match[i-1][j-1].len+1;
					count = match[i-1][j-1].count;
				}else{
					len = 1;
					count = 1;
				}
			}

			/* try matches not containing a[i] */
			if(i>0){
				if((t=match[i-1][j].len) > len){
					len = t;
					count = match[i-1][j].count;
				}else if(t == len)
					count += match[i-1][j].count;
			}

			/* try matches not containing b[j] */
			if(j>0){
				if((t=match[i][j-1].len) > len){
					len = t;
					count = match[i][j-1].count;
				}else if(t == len)
					count += match[i][j-1].count;
			}

			/* we may have double-counted matches not containing either */
			if(i>0 && j>0 && match[i-1][j-1].len==len)
				count -= match[i-1][j-1].count;

			match[i][j].len = len;
			match[i][j].count = count;
		}
	}

	if(plen)
		*plen = match[n-1][n-1].len;
	return match[n-1][n-1].count;
}

/*
 * Enumerate all pairs of strings of length n over an alphabet of size nalpha
 * and run numlcs on them.  buf contains the concatenation of the two strings; 
 * only nbuf of the characters have been filled in so far.  The nbuf characters
 * only use the first nused characters in the alphabet, so we need only
 * consider the first nused+1 alphabet characters as next possible.
 * 
 * Accumulate the current max in bestcount, bestlen, besta, bestb.
 */
int bestcount;
int bestlen;
char besta[MAXN+1];
char bestb[MAXN+1];
char *alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

void
search(int nalpha, int nused, int n, char *buf, int nbuf)
{
	int count, len;
	char *p, *ep;

	if(nbuf == 2*n){
		count = numlcs(buf, buf+n, n, &len);
		if(count > bestcount){
			bestcount = count;
			bestlen = len;
			memmove(besta, buf, n);
			memmove(bestb, buf+n, n);
		}
	}else{
		ep = alpha+nalpha;
		if(nused < nalpha)
			ep = alpha+nused+1;
		for(p=alpha; p<ep; p++){
			buf[nbuf] = *p;
			search(nalpha, (p<alpha+nused ? nused : nused+1),
				n, buf, nbuf+1);
		}
	}
}

void
usage(void)
{
	fprintf(stderr, "usage: lcs count aabaa aaabb\n");
	fprintf(stderr, "  or   lcs search n\n");
	fprintf(stderr, "  or   lcs search2 n\n");
	exit(1);
}

int
main(int argc, char **argv)
{
	int i, n, len;
	char buf[2*MAXN];

	if(argc < 2)
		usage();

	if(strcmp(argv[1], "count") == 0){
		if(argc != 4)
			usage();
		n = strlen(argv[2]);
		if(n != strlen(argv[3])){
			fprintf(stderr, "string lengths must match\n");
			exit(1);
		}
		n = numlcs(argv[2], argv[3], n, &len);
		printf("%d of length %d\n", n, len);
	}else if(strcmp(argv[1], "search") == 0){
		if(argc != 3)
			usage();
		n = atoi(argv[2]);
		search(2*n, 0, n, buf, 0);
		printf("len %d: %d lcs of length %d for %s %s (all alphabets)\n", 
			n, bestcount, bestlen,besta, bestb);
	}else if(strcmp(argv[1], "search2") == 0){
		if(argc != 3)
			usage();
		n = atoi(argv[2]);
		search(2, 0, n, buf, 0);
		printf("len %d: %d lcs of length %d for %s %s (just alphabet=ab)\n",
			n, bestcount, bestlen, besta, bestb);
	}else
		usage();
	return 0;
}

N. J. A. Sloane

tags

participants (1)