[math-fun] Re: number of longest common substrings problem
Hugo asks:
Here you introduce the word "different" for the first time. Are two substrings "different" if the substrings themselves differ, or is it sufficient for the letters of the substring to be drawn from different positions in X or Y? Ie given: X = "aabaa", Y = "aaabb" are there 1 or 4 occurrences of the LCS "aaa"?
Me: The latter, I think, since the goal of the question is to determine how rapidly these numbers can grow. So in your example, the LCS has length 3, and there seem to be 10 ways to get 3: X = aabaa Y = aaabb --------------------- aab.. aa.b. aab.. aa..b aab.. a.ab. aab.. a.a.b aab.. .aab. aab.. .aa.b aa.a. aaa.. aa..a aaa.. a..aa aaa.. .a.aa aaa.. so a(5) >= 10. NJAS
"N. J. A. Sloane" <njas@research.att.com> wrote: :Hugo asks: : :> Here you introduce the word "different" for the first time. Are two :> substrings "different" if the substrings themselves differ, or is it :> sufficient for the letters of the substring to be drawn from different :> positions in X or Y? Ie given: :> X = "aabaa", Y = "aaabb" :> are there 1 or 4 occurrences of the LCS "aaa"? : :Me: The latter, I think, since the goal of the question is to :determine how rapidly these numbers can grow. In that case, here are some initial thoughts: - a(n) has a lower bound of A001405 (C(n, floor(n/2))), since we can always construct an LH string of all 'a's, and a RH string of floor(n/2) a's followed by ceil(n/2) b's. - a couple of hand-crafted examples of a(n) exceeding that lower bound: (aab, abb) => a(3) >= 4 > C(3,1) (abbbc, aabcc) => a(5) >= 12 > C(5,2) (aaabb, abbbb) => a(5) >= 18 (abbbbc, aabccc) => a(6) >= 24 > C(6,3) (abbbbc, aabbcc) => a(6) >= 24 (aaaaab, aabbbb) => a(6) >= 40 I think this serves to demonstrate the style of string pairs that will maximise the count; I'm not yet sure whether it is ever beneficial to have non-monotonic strings such as "abababab"; if not, this essentially boils down to finding <x> and <y>, two ordered partitions of n such that prod(C(max(x_i,y_i), min(x_i,y_i))) is maximised; that product would then be a(n). If I've calculated correctly, that gives a sequence starting: 1 2 4 9 18 40 100 225 525 ... .. but I don't see that currently in EIS: have I miscalculated? Those figures can be generated by the following strings: n a(n)? strings 1 1 a a 2 2 aa ab 3 4 aab abb 4 9 aaab abbb 5 18 aaaab aabbb 6 40 aaaaab aabbbb 7 100 aaaaabb aabbbbb 8 225 aaaaaabb aabbbbbb 9 525 aaaaaaabb aaabbbbbb By eye, it appears that the maximum product is always generated by two-element partitions, in which case it further simplifies to selecting 0 <= x <= y <= n such that C(y, x) * C(n-x, n-y) is maximised, which lets me quickly extend the above sequence: 1 2 4 9 18 40 100 225 525 1225 3136 7056 17640 44100 108900 261360 637065 1656369 4008004 10020010 25050025 64128064 155739584 393853824 1012766976 2538950544 6347376360 15868440900 41408180100 102252852900 261312846300 667799496100 1709566710016 4273916775040 10851741811625 28214528710225 71170904601225 181162302621300 461140406672400 This sequence exhibits some fascinating patterns: if we look at the factorisation of these as a vector of powers of p_n, we get this: 1: (0) 2: 1 (1) 3: 2 (2) 4: 0 2 (2) 5: 1 2 (3) 6: 3 0 1 (4) # ie f(6) = 40 = 2^3 * 3^0 * 5^1 (3 + 0 + 1 = 4) 7: 2 0 2 (4) 8: 0 2 2 (4) 9: 0 1 2 1 (4) 10: 0 0 2 2 (4) 11: 6 0 0 2 (8) 12: 4 2 0 2 (8) 13: 3 2 1 2 (8) 14: 2 2 2 2 (8) 15: 2 2 2 0 2 (8) 16: 4 3 1 0 2 (10) 17: 0 4 1 0 2 1 (8) 18: 0 4 0 0 2 2 (8) 19: 2 0 0 2 2 2 (8) 20: 1 0 1 2 2 2 (8) 21: 0 0 2 2 2 2 (8) 22: 6 0 0 2 2 2 (12) 23: 6 0 0 1 2 2 1 (12) 24: 7 2 0 1 0 2 2 (14) 25: 8 4 0 0 0 2 2 (16) 26: 4 2 0 0 0 2 2 2 (12) 27: 3 2 1 0 0 2 2 2 (12) 28: 2 2 2 0 0 2 2 2 (12) 29: 2 4 2 2 0 0 2 2 (14) 30: 2 4 2 0 2 0 2 2 (14) 31: 2 2 2 0 2 0 2 2 1 (13) 32: 2 0 2 0 2 0 2 2 2 (12) 33: 8 0 0 0 2 0 2 2 2 (16) 34: 7 0 1 0 2 0 2 2 2 (16) 35: 0 0 3 0 2 1 2 2 2 (12) 36: 0 0 2 0 2 2 2 2 2 (12) 37: 0 6 2 0 2 2 0 2 2 (16) 38: 2 6 2 1 1 2 0 2 2 (18) 39: 4 6 2 2 0 2 0 2 2 (20) .. and this could generate further interesting new sequences (eg f(n) is odd; f(n) is not square (do the first differences of this form a cyclce of period 7?); f(n) has an odd number of prime factors). In case others find it useful, I've put up a perl program to calculate the LCS count for two strings at <http://www.crypt.org/hv/puzzle/lcsc>. (It requires the additional perl module Algorithm::Loops, found at <http://search.cpan.org/~tyemq/Algorithm-Loops>.) This program is fast for strings up to a certain complexity, but beyond that consumes large amounts of memory as the number of different subsequences undergoes a combinatorial explosion. Hugo
By eye, it appears that the maximum product is always generated by two-element partitions, in which case it further simplifies to selecting 0 <= x <= y <= n such that C(y, x) * C(n-x, n-y) is maximised, which lets me quickly extend the above sequence:
This conjecture appears to be true for large n, but it's not true for all smallish n. For length 4, a non-monotonic pair of strings actually works best: abba and baab, which have 1(aa)+1(bb)+4(ab)+4(ba) = 10 matches. The same is true of 5 and 6. Russ
participants (3)
-
hv@crypt.org -
N. J. A. Sloane -
Russ Cox