nyuanshin

Adaptive Acceleration: Data

Dec 14, 2007 15:58

(Part I: Theory.)

The empirical method Hawks et al. used to check human genomic data for signs of adaptive acceleration is called the Linkage Disequilibrium Decay (LDD) test, pioneered by co-authors Eric Wang & Robert Moyzis. Linkage here refers to a statistical tendency of blocks of DNA to be inherited together as a unit. In long-run equilibrium, linkage is no greater than would be expected by pure chance, so abnormally strong linkage signals disequilibrium. But disequilibrium is of course a transient state, and linkage disequilibrium (LD) tends to decay relatively rapidly on a geological timescale due to the random genetic shuffling that goes on with every reproductive event.
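To put a rough shape on that decay: in the standard textbook model of random mating, the LD coefficient D between two loci shrinks by a factor of (1 - r) every generation, where r is the recombination rate between them. Here's a minimal Python sketch of that relationship. The function names and the example value of r are mine, chosen purely for illustration; none of this is taken from the Hawks et al. paper, and the LDD test itself is considerably more involved.

```python
import math


def ld_after(d0, r, generations):
    """Expected LD coefficient D after a number of generations.

    Textbook model: D_t = D_0 * (1 - r)**t, where r is the
    recombination rate between the two loci per generation.
    """
    return d0 * (1.0 - r) ** generations


def generations_to_decay(r, fraction):
    """Generations needed for D to fall to `fraction` of its starting value."""
    return math.log(fraction) / math.log(1.0 - r)


if __name__ == "__main__":
    r = 0.01  # illustrative assumption: 1% recombination per generation
    half_life = generations_to_decay(r, 0.5)
    print(f"LD half-life at r={r}: about {half_life:.0f} generations")
```

The closer together two loci sit (the smaller r is), the longer their association survives, which is why a long intact block around a common allele is a hint that the allele rose in frequency recently and quickly.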

When an allele is undergoing positive selection, it spreads through the population more quickly than genetic recombination can break up the uniformity of the genomic region around it, giving rise to unusually long blocks of DNA (called haplotypes) that are highly similar across many members of a population. The stronger the selection, the more LD you see, so you can compare the expected equilibrium length of the haplotype to its actual length and use this as a rough-and-ready indicator of the strength of recent selection. And when you're looking at loci that are still polymorphic within a population (as this test does; they looked only at alleles with frequencies between 22% and 78%), you can estimate the ages of the mutants by comparing the differences between the older and newer variants.

Using these methods, Hawks et al. scanned several hundred genomes taken from four different populations (Yoruba, Han Chinese, Japanese, and Utahns of European extraction) from the HapMap database, and came up positive for a whopping 7% of loci on the human genome. The really remarkable thing about this is that they took multiple steps to make sure this estimate was conservative. They excluded alleles at both very high and very low frequencies, which means they'd necessarily miss both older alleles that were very strongly selected and newer alleles that were weakly selected. The frequency of a selected allele as it sweeps to fixation follows a sigmoid curve, which means it spends less than 1/4 of its time within the intermediate frequency window where it'd be cleanly "visible" to the LDD test. Additionally, they counted only loci that met a 99.5% confidence standard on the LDD test, in order to minimize false positives. (A picture is worth a thousand words, and John Hawks has a double-scoop of both over at his blog. You should go read him after this.)
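The "less than 1/4 of its time" point about the sigmoid curve is easy to sanity-check. Here's a minimal sketch, assuming the simplest deterministic model of a sweep (logistic growth in allele frequency, no drift), in which the time spent between two frequencies is proportional to the difference in their log-odds. The population size N and the choice to start the sweep at a single copy, 1/(2N), are my own illustrative assumptions, not figures from the paper.

```python
import math


def logit(p):
    """Log-odds of an allele frequency p (0 < p < 1)."""
    return math.log(p / (1.0 - p))


def fraction_of_sweep_in_window(lo, hi, start, end):
    """Share of a deterministic logistic sweep spent between lo and hi.

    Under logistic growth, logit(p) rises linearly with time, so the
    selection coefficient cancels out of the ratio.
    """
    return (logit(hi) - logit(lo)) / (logit(end) - logit(start))


if __name__ == "__main__":
    N = 10_000            # assumed population size (illustrative only)
    p0 = 1.0 / (2 * N)    # assumed start: one new copy among 2N chromosomes
    visible = fraction_of_sweep_in_window(0.22, 0.78, p0, 1.0 - p0)
    print(f"Fraction of sweep spent between 22% and 78%: {visible:.3f}")
```

With these assumptions the window covers roughly 13% of the sweep, comfortably under a quarter. In this toy model the exact share depends on where you say the sweep starts and ends, but not on how strong the selection is.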

Once they had these ascertained variants in hand, they looked for ways to test the alternative hypothesis to their own (that there has been no acceleration) using other sources of data. The most obvious test is to extrapolate the recent rate backward in time and see what number of selective sweeps it implies between humans and chimps; they did this, and got a self-evidently absurd number, about two orders of magnitude bigger than what's actually observed. Another test is to look for selected variants above the 78% frequency threshold; if the rate has been constant then there should be a whole lot of those. Empirically, only 50 were found; even assuming that's an undercount by an entire order of magnitude, that's a clear suggestion of acceleration. They also tested what a constant rate would imply for genome diversity (since selective sweeps tend to reduce diversity), and found the predictions came out about an order of magnitude lower than what's actually observed.

So the data seems to accord very well with the theory, though of course other people can and should take their best shot at testing it in whatever other ways they can. Unfortunately, now we hit a bit of a wall: we know there's been a lot of selection, and we have a few ideas about what the selection was for, but we really don't know what the majority of these genes do at the biochemical level. But that'll change soon, and now that we have a powerful hypothesis-generating theoretical framework to work with, we can get to work on the nitty-gritty specific cases.

There are a lot of implications to be hashed out from this, which I'll discuss in a future post. (Likely a little far in the future: I'll probably be gone for a couple of weeks. In the meantime, go read Hawks!)

genetics, evolution