[math-fun] Re: genetics-fun question

James Propp

29 Jun 2006

10:30 a.m.

So figuring out who's mother and who's daughter is as easy as labelling their genomes a,b and c,d and noticing which of them has the property that, eg, ccccccccccccc is the same as aaaabbbaabbbb.

Neat! (And not even all that subtle; if I'd thought a bit harder about what I supposedly learned in high school, I could've figured it out.) This fact about the asymmetry of the parent-child relationship suggests a more general question: How much can you deduce about the precise way in which two blood-relatives are related simply from looking at their genes? This can be a real-world issue ("And is it your sworn testimony before this court, as an expert in the field of genetics, that this man cannot possibly be the nephew of Howard Hughes?"), but I could imagine it also giving rise to some mathematically amusing albeit unrealistic puzzles ("Assume an infinite genome..."). Michael? Jim
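The mother/daughter test described above can be sketched in a few lines. This is only an illustration under strong assumptions that are not in the thread: both haplotypes of each person are known exactly, there are no mutations, and each haplotype is written as a string with one allele label per site. The names `is_mosaic`, `could_be_parent` and `who_is_parent` are hypothetical.

```python
def is_mosaic(hap, x, y):
    """True if, at every site, hap carries the allele found on x or on y."""
    if not (len(hap) == len(x) == len(y)):
        return False
    return all(h in (p, q) for h, p, q in zip(hap, x, y))


def could_be_parent(parent, child):
    """A child must carry one haplotype that is a patchwork of the parent's two."""
    x, y = parent
    return any(is_mosaic(h, x, y) for h in child)


def who_is_parent(first, second):
    """Return 'first', 'second', or 'undetermined' for a pair of (hap, hap) tuples."""
    forward = could_be_parent(first, second)
    backward = could_be_parent(second, first)
    if forward and not backward:
        return "first"
    if backward and not forward:
        return "second"
    return "undetermined"


# Toy data: the mother carries haplotypes a and b; the daughter's haplotype c
# is a patchwork of a and b, while d comes from an unrelated father.
mother = ("AAAAAAAAAA", "CCCCCCCCCC")
daughter = ("AAAACCCCCC", "GGGGGGGGGG")
print(who_is_parent(mother, daughter))
```

With real data the comparison would have to tolerate mutations and genotyping errors, and it could also require that the patchwork switch between a and b only a few times per chromosome.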


Schroeppel, Richard

29 Jun

11:48 a.m.

New subject: [math-fun] Re: genetics-fun question -- another one

I have a puzzle from c. 1970. Back then, it was science fiction, but now it looks less farfetched.

Suppose we have a big computer with everyone's genome data. [Looking less farfetched every day, in fact.] What can we deduce about the genomes of our ancestors? Could we reconstruct an estimated genome of Fermat, Buddha, or Caesar?

There's some information loss from generation to generation: Parents with N children will (on average) omit 2^-N of their genes from (the set of) their children. Some of that lost information could be deduced from knowing genomes of cousins (-> aunts & uncles -> grandparents), so it's hard to see exactly how much is erased.

If we throw in present-day location information, we can probably figure out who moved where, when, and perhaps deduce population movements from long ago.

Rich

-----Original Message-----
From: math-fun-bounces+rschroe=sandia.gov@mailman.xmission.com on behalf of James Propp
Sent: Thu 6/29/2006 10:29 AM
To: math-fun@mailman.xmission.com
Subject: [math-fun] Re: genetics-fun question

...

So figuring out who's mother and who's daughter is as easy as labelling their genomes a,b and c,d and noticing which of them has the property that, eg, ccccccccccccc is the same as aaaabbbaabbbb.
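The 2^-N figure in Rich's message follows from each child independently receiving a given parental allele copy with probability 1/2. A minimal sketch that checks this numerically; treating every allele copy as independent ignores linkage, which changes the spread but not the average. The function names and parameters are illustrative only.

```python
import random


def lost_fraction(n_children):
    """Expected fraction of a parent's allele copies passed to none of N children."""
    return 0.5 ** n_children


def simulate_lost_fraction(n_children, n_alleles=100_000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    lost = 0
    for _ in range(n_alleles):
        # Each child receives this particular copy with probability 1/2.
        if not any(rng.random() < 0.5 for _ in range(n_children)):
            lost += 1
    return lost / n_alleles


for n in (1, 2, 3, 5):
    print(n, lost_fraction(n), simulate_lost_fraction(n))
```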

Tom Knight

12:34 p.m.

New subject: [math-fun] Re: genetics-fun question -- another one

On Jun 29, 2006, at 1:03 PM, Schroeppel, Richard wrote:

...

I have a puzzle from c. 1970. Back then, it was science fiction, but now it looks less farfetched. Suppose we have a big computer with everyone's genome data. [Looking less farfetched every day, in fact.] What can we deduce about the genomes of our ancestors? Could we reconstruct an estimated genome of Fermat, Buddha, or Caesar? There's some information loss from generation to generation: Parents with N children will (on average) omit 2^-N of their genes from (the set of) their children. Some of that lost information could be deduced from knowing genomes of cousins (-> aunts & uncles -> grandparents), so it's hard to see exactly how much is erased. If we throw in present-day location information, we can probably figure out who moved where, when, and perhaps deduce population movements from long ago.

They are busy doing exactly this in Iceland, where the genealogies are carefully recorded. Needless to say, there are many embarrassing revelations. I'm not sure this would ever be allowed politically. There is too much to be lost by too many people and cultures. Imagine what the reaction to Hitler's Jewishness would have been, for example.

Mike Stay

12:43 p.m.

New subject: [math-fun] Re: genetics-fun question -- another one

On 6/29/06, Tom Knight <tk@csail.mit.edu> wrote:

...

They are busy doing exactly this in Iceland, where the genealogies are carefully recorded. Needless to say, there are many embarrassing revelations. I'm not sure this would ever be allowed politically. There is too much to be lost by too many people and cultures. Imagine what the reaction to Hitler's Jewishness would have been, for example.

A Maori friend of mine told a story of a large family reunion he attended; they invited all the descendants of a great-grandfather, and one white family that showed up was horrified to learn that they had Maori ancestors. They demanded that their family name be taken off the large family tree that had been set up, and left in a huff. Fortunately, no one else was offended in return; the other families thought they were being ridiculous and had a good laugh.

--
Mike Stay
metaweta@gmail.com
http://math.ucr.edu/~mike

Michael Kleber

30 Jun

2:32 p.m.

New subject: [math-fun] Re: genetics-fun question -- another one

Okay, time to ramble on a bit about that other branch in the conversation... Rich started it off by asking:

...

Suppose we have a big computer with everyone's genome data. What can we deduce about the genomes of our ancestors? Could we reconstruct an estimated genome of Fermat, Buddha, or Caesar?

The more straightforward question is what we can deduce about the family tree of all people currently alive. Tom Knight already mentioned the deCODE project to create a genome map of the entire population of Iceland, which is remarkable for both its scientific value and its tricky ethics.

If we were haploid -- that is, we reproduced asexually, and you were just a slightly imperfect clone of one parent -- then this would be a well-studied problem. That kind of thing is done all the time at the level of species, not individuals; it's called "phylogenetic tree reconstruction". It's an area of active research, and people are getting pretty good at it; that phrase will allow interested folks armed with Google to learn more.

In that case, the methods do indeed produce a family tree in which each deduced ancestor is labelled with a most-likely genome -- though the problem of determining which node on the tree corresponds to Fermat/Buddha/Caesar remains a sticking point. But our knowledge of the mutation rate in human DNA is good enough that we can pretty well tell how long ago the various individuals in the tree lived -- so applying this to the Y chromosome, we can indeed say that *someone* around the 1200s was a direct male-line ancestor of about 8% of all men alive today in the areas the Mongols visited, and by looking at historical records we can make guesses about it being Genghis Khan (GK) himself, and can make a darn good guess as to what his Y-chromosome was. (GK is probably an *ancestor* of everyone there; for 8% of the people, he is their father(father(father(...))), which is a *much* stronger statement.) In the same way, we can tell that some male less than 100,000 years ago is the direct male-line ancestor of all males alive today.

But the hopes of reconstructing the full genomes of long-dead people are completely dashed by that messy sex stuff.

Mike Stay has already plugged Dawkins's book "The Ancestor's Tale", and I'd like to loudly second that recommendation; it's a superbly-written introduction to how to think about all this. (Note to self: next time someone asks me for a pointer to a book to learn the math that goes with all of this, push this instead!)

At the risk of repeating some of what Mike wrote yesterday... For any population -- say, all Europeans, or all humans alive today -- you can ask, who is the Most Recent individual who is a Common Ancestor to all of them? This MRCA changes over time. Before the 1400s, eg, the MRCA(Europe+America) was surely from before the last time there was a Bering land bridge, maybe 20,000 years ago. But today there are probably no pure-blooded descendants of the Paleo Indians left, due to either lineages dying out or intermarrying, and that MRCA is surely more recent than that. (Estimates of the MRCA of all humans vary wildly, and I don't know enough to get into that debate.)

More surprising is that, once you go back far enough that *someone* is an ancestor of, say, all people living today, you don't need to go much farther back before *everyone* is either (a) an ancestor of everyone alive today, or (b) an ancestor of no one, a person whose line has died out. Moreover, about 80% of individuals will be (a)'s. Of course, barriers to communication between separated groups of humans get in the way of this happening, but such barriers are almost never absolute.

Note that this pretty well puts the kibosh on reconstructing Caesar's genome: his descendants might include, say, everyone in Europe, but so what? That doesn't distinguish him from 80% of the population of Italy at that time.

Let me just add that when you first hear this, you may well have the overpowering reaction "What, what about natural selection?!" After all, if the fittest have the most kids, but at the same time 80% of people end up ancestors of everyone, where's the advantage to being fit? Dawkins addresses this superbly: natural selection doesn't happen on the level of individuals, but on the level of genes. Every *gene* has only one parent, and we're back in the nice clean world of mostly-perfect clones.
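The "everyone or no one, with about 80% in the first group" claim can be explored with a toy simulation. The sketch below assumes a model that is not spelled out in the thread: a constant population size, non-overlapping generations, and each child picking two parents uniformly at random (possibly the same one twice). The function name `simulate_ancestry` and its parameters are hypothetical.

```python
import random


def simulate_ancestry(pop_size, seed=0, max_generations=200):
    """Walk back in time until every individual is an ancestor of all or of none.

    Returns (generations_back, fraction_who_are_ancestors_of_everyone),
    or None if that point is not reached within max_generations.
    """
    rng = random.Random(seed)
    everyone = (1 << pop_size) - 1
    # Bit i of a mask is set when present-day person i descends from that individual.
    descendants = [1 << i for i in range(pop_size)]
    for generations_back in range(1, max_generations + 1):
        parents = [0] * pop_size
        for child_mask in descendants:
            for _ in range(2):  # two randomly chosen parents per child
                parents[rng.randrange(pop_size)] |= child_mask
        descendants = parents
        if all(mask in (0, everyone) for mask in descendants):
            universal = sum(mask == everyone for mask in descendants)
            return generations_back, universal / pop_size
    return None


print(simulate_ancestry(300, seed=1))
```

In this model the share of universal ancestors settles near the solution of x = 1 - exp(-2x), roughly 0.8, which matches the figure quoted above. Real populations have geographic structure and varying size, so the number of generations needed would differ.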

...

If we throw in present-day location information, we can probably figure out who moved where, when, and perhaps deduce population movements from long ago.

The human HapMap project, a large amount of which happens in the building where I'm typing this, can do this and vastly more. But I'm out of time, so that will have to wait for another day. --Michael Kleber -- It is very dark and after 2000. If you continue you are likely to be eaten by a bleen.

Michael Kleber

29 Jun

4:03 p.m.

Wow! I had no idea that there was so much interest in genetics-fun simmering just below the surface here. I only have a little time now, so I'll blather about one thing here, and try to get to the rest of the discussion soon. Jim Propp wrote:

...

This fact about the asymmetry of the parent-child relationship suggests a more general question: How much can you deduce about the precise way in which two blood-relatives are related simply from looking at their genes?

I don't know the general answer to the question "what pairs of familial relations map to isomorphic genetic comparisons," but this begins to get into the trickier things that we happily ignored in Jim's first question; I'm not even sure I could figure out all that apply. The key, obviously, is the mixing between the two copies of the parental genome, and while biologists have a good basic model of it, even this is imperfectly understood -- recent research is just starting to reveal how frequent some things are that were previously believed to be rare (eg loss of heterozygosity).

But even with the baby model, things are tricky. Your genome is broken up into 22 pairs of autosomes (chromosomes other than X and Y), one of each from each parent. The mixing happens by these pairs exchanging corresponding sections, and the number of such crossover events is constrained, by cell biology, to be a small but nonzero integer. The result is that the amount of your genome that you get from each grandparent is 25% on average, but that there is much more variation than you might think. (And even more if you're, say, a possum, with only 8 pairs of chromosomes to our 22.)

Before I finish not answering, let me make two key points in an attempt to keep us aware of what's realistic and accurate:

1) We're talking about what information we could obtain, theoretically, if we knew both haplotypes of an individual's entire genome. That's completely unrealistic. First, almost all ways of obtaining genetic information mix the two haplotypes together, so you really get a mixture of the signal from the two parents' copies. Second, when you do eg a paternity test, they don't look at the, say, ten million places where your genome differs from someone else's. Instead they look at some 20 locations where people's genomes tend to be highly variable -- say, a spot where there's a repeated section that goes "acgacgacgacg..." somewhere between 5 and 11 times.

2) I'd like to point out that we're talking about looking at someone's genome, *not* at their "genes". You have about 20,000 genes -- that is, pieces of DNA which are used for building proteins. These together make up perhaps 2% of your genome. A total of about 5% of your genome is DNA which appears to be useful in *some* way, though we don't know entirely what. The other 95% -- the part that isn't acted on by selective pressures -- is the part that is the most free to vary among individuals; mutations easily creep in over time. If you're trying to work out the ancestry of all living things, that's the information gold mine. (But if you're trying to learn about *important* events in evolution, then you *do* want to look at the genes.)

--Michael Kleber

-- It is very dark and after 2000. If you continue you are likely to be eaten by a bleen.
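The "25% on average, but with more variation than you might think" remark about grandparents can be illustrated with a rough simulation of the baby model. The assumptions here are mine, not the thread's: all 22 autosomes have equal length, each gets 1, 2 or 3 crossovers (chosen uniformly) at uniformly random positions, and the gamete starts on a randomly chosen grandparental copy. The names `grandparent_share` and `simulate_shares` are hypothetical.

```python
import random
import statistics

N_AUTOSOMES = 22


def grandparent_share(rng):
    """Fraction of a child's autosomal genome that comes from one given grandparent."""
    total = 0.0
    for _ in range(N_AUTOSOMES):
        n_crossovers = rng.choice((1, 2, 3))  # assumed "small but nonzero integer"
        cuts = sorted(rng.random() for _ in range(n_crossovers))
        bounds = [0.0] + cuts + [1.0]
        from_target = rng.random() < 0.5  # which grandparental copy comes first
        for left, right in zip(bounds, bounds[1:]):
            if from_target:
                total += right - left
            from_target = not from_target  # each crossover switches copies
    # One parent's gamete supplies half of the child's genome.
    return 0.5 * total / N_AUTOSOMES


def simulate_shares(trials=2000, seed=0):
    """Return (mean, standard deviation, min, max) of the share over many children."""
    rng = random.Random(seed)
    shares = [grandparent_share(rng) for _ in range(trials)]
    return statistics.mean(shares), statistics.stdev(shares), min(shares), max(shares)


mean, sd, lo, hi = simulate_shares()
print(f"mean={mean:.3f} sd={sd:.3f} range=({lo:.3f}, {hi:.3f})")
```

Lowering `N_AUTOSOMES` to 8 gives a feel for the possum remark: fewer chromosomes means fewer independent pieces and a wider spread around 25%.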

Bill Thurston

30 Jun

6:26 a.m.

I've been told by people who are seriously involved in this that the rule of thumb is that the number of crossover events ~ the number of chromosome pairs, so that each chromosome of a child on average is divided into two segments from different parents.

Two siblings share on average the same amount of genetic material as a parent and child, but that relationship is easily distinguished because the sharing is unevenly distributed over the genome --- about 1/4 of the length has nothing in common, 1/2 shares one copy, and 1/4 shares both copies. Furthermore, assuming the rule of thumb above for the "small integer", the siblings' autosomal matches are broken into about 3*22 segments, while the parent-child is broken into about 2*22. Not all crossovers would be directly observable from just the sibling genomes, though, so I'm not sure how much this would (theoretically) help.

When Jim posted his original question I thought I saw asymmetry between the mother and daughter, as Michael wrote, but I no longer see this. Unfortunately I've deleted those messages, but: the entire length of their genomes share one identical copy. We could label them so that aaaaaaaa matches cccccccccc ... I don't see what more information there is.

Bill Thurston

On Jun 29, 2006, at 6:03 PM, Michael Kleber wrote:

...

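Bill Thurston's 1/4, 1/2, 1/4 split for siblings can be checked by brute-force enumeration at a single site. A minimal sketch, assuming unrelated parents whose four copies are all distinguishable at that site; the function name is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sibling_sharing_distribution():
    """Probability that two full siblings share 0, 1 or 2 copies at one site."""
    counts = Counter()
    # Each sibling gets one of the mother's two copies (0 or 1) and one of the
    # father's two copies (0 or 1); all 16 combinations are equally likely.
    for m1, f1, m2, f2 in product((0, 1), repeat=4):
        shared = (m1 == m2) + (f1 == f2)
        counts[shared] += 1
    total = sum(counts.values())
    return {k: Fraction(v, total) for k, v in sorted(counts.items())}


dist = sibling_sharing_distribution()
print(dist)
print("mean copies shared:", sum(k * p for k, p in dist.items()))
```

The mean comes out to one shared copy, the same as for a parent and child, but a parent and child share exactly one copy everywhere, whereas siblings vary between zero and two along the genome.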

Michael Kleber

10:05 a.m.

Okay, the crush has relaxed; I think I'll have more time to spend on this today... Bill Thurston wrote:

...

I've been told by people who are seriously involved in this that the rule of thumb is that the number of crossover events ~ the number of chromosome pairs, so that each chromosome of a child on average is divided into two segments from different parents.

That's a lower bound and not so far from the true answer, for humans. Human chromosome 22 (the shortest; they're numbered in order of length) averages one crossover per meiosis, while chromosome 1 averages four. I was hoping to include some information about the probability distribution function for these numbers, but the guy I'd ask -- Simon Myers, one floor down from me, and author of last year's Science paper with the best map of recombination across the genome to date -- just left for a week in Hawaii.

...

[...] Furthermore, assuming the rule of thumb above for the "small integer", the siblings' autosomal matches are broken into about 3*22 segments, while the parent-child is broken into about 2*22.

This is actually a pretty bad assumption because of the extent to which recombination events cluster: about 80% of recombinations take place in about 10-20% of the sequence -- even more for some people, since recombination probabilities depend on your genotype. This gets into hard biology, though, so we should probably drop it. (The paper in question is Science 14 Oct 2005 pp. 321-324, "A Fine-Scale Map of Recombination Rates and Hotspots Across the Human Genome", http://www.sciencemag.org/cgi/content/abstract/310/5746/321 if you're at an institution with access to Science on-line. But it's not terribly relevant to most of the present discussion; biologists usually focus on recombination rates on scales much smaller than full chromosomes.)

...

When Jim posted his original question I thought I saw asymmetry between the mother and daughter, as Michael wrote, but I no longer see this. Unfortunately I've deleted those messages, but: the entire length of their genomes share one identical copy. We could label them so that aaaaaaaa matches cccccccccc ... I don't see what more information there is.

My answer ("yes, you can tell easily") was based on Jim's hypothetical in which we assumed that we really knew the full sequence of each of the two haplotypes in each individual -- that is, for each person you had two sequences, the "from mom" and "from dad" ones (though unlabelled). This is what's really present in the cell, but *not* what you generally get when you genotype someone, as we discussed yesterday. The answer to Jim's question depends on what you're assuming about the data you're able to gather.

--Michael Kleber

-- It is very dark and after 2000. If you continue you are likely to be eaten by a bleen.
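The difference between phased haplotypes and ordinary genotype data can be made concrete with a small sketch. It assumes, purely for illustration, that genotyping returns an unordered pair of alleles per site, and it uses the simplest possible comparison (do the two people share at least one allele at every site?). The toy sequences and the names `unphase` and `share_allele_everywhere` are hypothetical, and more refined statistical methods may recover more than this check does.

```python
def unphase(hap1, hap2):
    """Collapse two haplotypes into per-site unordered allele sets."""
    return [frozenset(pair) for pair in zip(hap1, hap2)]


def share_allele_everywhere(genotype_a, genotype_b):
    """True if the two genotypes have at least one allele in common at every site."""
    return all(a & b for a, b in zip(genotype_a, genotype_b))


# Toy data: the daughter's first haplotype is a patchwork of the mother's two.
mother = unphase("AAAAAAAAAA", "CCCCCCCCCC")
daughter = unphase("AAAACCCCCC", "GGGGGGGGGG")

# The check passes in both directions, so by itself it cannot say who is the parent.
print(share_allele_everywhere(mother, daughter))
print(share_allele_everywhere(daughter, mother))
```

Because set intersection is symmetric, this check gives the same answer whichever person is listed first, which is the sense in which the unphased data no longer shows the asymmetry.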
