gpuccio's clarifications on estimates of the amount of functional information: mns2012

mns2012

Aug 29, 2018 13:41

In the discussion of one of my earlier posts, where I presented gpuccio's calculation, a question came up about how upper bounds on functional information in the parameter space of proteins are derived.

I put this question to gpuccio:
If you don’t mind, a question slightly related to this topic as well since you have pictures of information quantities.

When you calculate the number of states that can be visited by evolutionary walk, you arrive at a value of 2^140 states. Then you reason about information deltas: if the absolute value of a delta related to some biological change is greater than 140 bits, then we infer design. It is this “then” that is not entirely clear to me.

Could you expound a bit on how this estimate of 140 bits relates to the methods of determining the functional information quantities in polypeptides? What is not clear to me in this reasoning is how we connect the dots between these two things: (a) the estimate based on the number of states visitable by evolutionary walk and (b) the methods of calculating functional information in an amino acid sequence.

E.g. an estimate of the absolute value of information content for the human genome is about 70 MB. Does it mean that any deltas I can get by evolution are within 140 bits on top? What about duplication and recombination? From information theory books, we can get that information gain per generation with sexual reproduction is of the order sqrt(G) where G is the genome size (if I remember rightly). Surely it can be greater than 140 bits.

Can you see my question?

Given its importance, I am posting his answer here:

Thank you for the interesting question, which requires a detailed answer.

“When you calculate the number of states that can be visited by evolutionary walk, you arrive at a value of 2^140 states.”

That’s correct. Only, remember that this is really a big overestimation, just to be on the safe side. The real number of individual states that can be reached is probably much lower.

I would like to clarify that by the word “state” I mean some specific new genomic configuration, as it can derive from reproduction that involves some genomic variation. The reason for that is that the whole genome, as it emerges from reproduction, is the functional unit that is subject to natural selection, if any.

“Then you reason about information deltas: if the absolute value of a delta related to some biological change is greater than 140 bits, then we infer design.”

This is not really correct, if you are referring to absolute information content, as it seems from your following remarks. I believe that this could be your main misunderstanding, so I will try to clarify it better. I apologize in advance if instead the concept was already clear to you.

The important point is: I never reason in terms of absolute information content, only in terms of functional information. Indeed, all the “jumps” I analyze and discuss are jumps (deltas) in functional information.

Why? Because I have applied a special procedure, which I have tried to explain in some detail when possible.

What I measure is “human conserved information”, IOWs the bits of homology to the human form of the protein. Another way to say it is that I use human proteins as “probes” to measure the evolutionary history of proteins in relation to the form that the protein assumes in humans.

I could as well use as probes the proteins in bees, for example (indeed, I have done that in some cases, for specific comparisons). My choice of human proteins as “measuring probes” has, however, a few important motivations:

a) Of course, we are naturally interested in human functions

b) The human proteome is probably the best investigated and reviewed

c) Humans are a very recent species, so human proteins can be considered, in their final form, a recent result

d) I am especially interested in the transition to vertebrates, and humans are recent vertebrates. So, the time distance between the original split to vertebrates (and then from cartilaginous fish to bony fish) and the recent split to humans is more than 400 million years

So, why is a jump in human conserved information observed after the split to vertebrates (in my graphs, that is the transition from non vertebrate deuterostomes to cartilaginous fish) a good measure of a variation in functional information?

Well, the reasoning is simple, and it relies on assumptions that cannot be easily denied, especially by neo-darwinists.

If some specific sequence appears for the first time after the split to vertebrates, and is then conserved for more than 400 My, then we can safely assume that the sequence is functional. Indeed, the degree to which it is conserved (IOWs, the bits of information that appear for the first time in vertebrates and are conserved to humans) is a measure of its functional constraint.

So, if a homolog of some human protein (let’s call it A) has a maximum homology hit with the human form, before the appearance of vertebrates, of say 300 bits, and then we find 1000 bits of homology in cartilaginous fish, then we can say that 700 bits of human conserved, and therefore functional, information have appeared at the transition to vertebrates. If the protein A is, say, 1000 AAs long, that is a jump of 0.7 bits per amino acid site (about one third of the total information content of the protein in a blast comparison).
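The arithmetic of this example can be written out as a minimal sketch. It is only an illustration of the numbers quoted above, not gpuccio's own code; the function name `conserved_information_jump` and its parameters are hypothetical, and the bit scores are the example values from the text rather than real blast results.

```python
def conserved_information_jump(bits_before, bits_after, length_aa):
    """Return (jump in bits, jump in bits per amino acid site).

    bits_before: best bit score against the human protein before the transition
    bits_after:  best bit score against the human protein after the transition
    length_aa:   length of the human protein in amino acids
    All names are hypothetical; the inputs are taken from the worked example.
    """
    if length_aa <= 0:
        raise ValueError("protein length must be positive")
    jump = bits_after - bits_before
    return jump, jump / length_aa


# Example values from the text: 300 bits before vertebrates,
# 1000 bits in cartilaginous fish, protein A is 1000 AAs long.
jump_bits, jump_per_site = conserved_information_jump(300, 1000, 1000)
print(jump_bits, jump_per_site)  # 700 bits in total, 0.7 bits per site
```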

The 400 million year gap is important: indeed, it is an evolutionary distance that guarantees that any non functional sequence homology will be completely cancelled by neutral variation, as can be easily seen in synonymous sites. Therefore, any homology that is conserved for such a time is under extremely strong functional constraint.

So, I hope that I have explained my “methods of calculating functional information in an amino acid sequence”, your b) question. This is not the only way to do it, of course. The Durston method, which has inspired all my reasonings, is slightly different. However, I find this method quick and reliable.

The connection with your a) point (“the estimate based on the number of states visitable by evolutionary walk”) is rather direct: if we exclude that natural selection is a relevant factor for complex functions (the arguments to exclude it are different in nature, and you can find them in my OP, linked here):

What are the limits of Natural Selection? An interesting open discussion with Gordon Davisson

https://uncommondescent.com/intelligent-design/what-are-the-limits-of-natural-selection-an-interesting-open-discussion-with-gordon-davisson/

then the only way for a non design system to reach a functional island is to have probabilistic resources comparable to the functional complexity of that functional island.

Remember that the functional complexity, measured as described above, is a measure not of the absolute information content, but of the ratio between the target space and the search space, IOWs a measure of the probability to find the target space by a single random search. That’s why blast homologies are directly transformed, in the blast algorithm, into E values, which are a slightly different, but related concept (the expected number of random hits of that level by the type of search done by the algorithm in the available database).
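The two quantities in this paragraph can be sketched numerically. The first function expresses the ratio definition as -log2(target space / search space). The second uses the standard relation between a normalized blast bit score S' and its E value, E = m * n * 2^(-S'), where m is the query length and n the database length. The function names and all the numbers below are illustrative assumptions, and real blast runs use effective lengths, so actual reported E values will differ somewhat.

```python
import math


def functional_information_bits(target_space, search_space):
    """Functional information as -log2 of the target/search ratio.

    The ratio is the probability of hitting the target in one random draw.
    """
    return -math.log2(target_space / search_space)


def blast_e_value(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score.

    Uses E = m * n * 2**(-S'); raw lengths stand in for the effective
    lengths that blast itself would use (an assumption of this sketch).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical sizes: a target of 2**10 sequences in a space of 2**150.
print(functional_information_bits(2**10, 2**150))

# Hypothetical search: 1000-residue query against a 10**9-residue database.
print(blast_e_value(50, 1000, 10**9))
```

Each extra bit of score halves the E value, which is why a bit score can be read as a (database-size dependent) probability statement.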

I am not sure why you say that:

“an estimate of the absolute value of information content for the human genome is about 70 MB”

The absolute value of information content for the whole human genome (3 Gbp) is certainly much more than that. The number you give is probably related to protein coding genes only.

However, as said, the total information content is not relevant in ID: only the functional information counts.

Duplication is not really an increase in functional information. It is similar to printing two copies of the same book.

Of course, if duplication in itself generates a new function, then the limited functional information linked to that event should be computed again by dividing the target space (the possible duplications that generate that function) by the search space (all the possible duplication events).

The same is true for recombinations.

The important point is that duplications and recombinations reuse existing functional information at the sequence level: they don’t create it. The only new functional information can derive from the new disposition.

So, let’s say that, in a very simple case, a recombination shifts two sequences in a protein, maybe changing the function. However, using a blast comparison, the sequence homology will not be significantly changed (because blast is a local alignment).

In the same way, a blast hit of a human protein against a group of organisms does not depend on how many copies of that sequence are present in the organism: it just gives the highest homology hit in the group, the single sequence that is most similar to the human one.

So, neither sexual reproduction nor duplication nor recombination can generate any new functional sequence of more than 140 bits of functional information, because 140 bits means that the event has a probability of 1:2^140 to happen, and the probabilistic resources of our biological world just cannot do it.
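The comparison between functional information and probabilistic resources can be made explicit with a small sketch. It takes the premises of the text as given: at most 2^140 states are visited, each visit is an independent random trial, and selection plays no role. Under those assumptions the expected number of hits on a target of b bits is 2^(140 - b). The function names are hypothetical, and the Poisson approximation used for the "at least one hit" probability is a choice of this sketch, not something stated in the original.

```python
import math

# log2 of the upper bound on visitable states quoted in the text.
RESOURCE_BITS = 140


def expected_hits(functional_bits, resource_bits=RESOURCE_BITS):
    """Expected number of hits: (number of trials) * (probability per trial)."""
    return 2.0 ** (resource_bits - functional_bits)


def prob_at_least_one_hit(functional_bits, resource_bits=RESOURCE_BITS):
    """Poisson approximation 1 - exp(-N * p), accurate when p is tiny."""
    return -math.expm1(-expected_hits(functional_bits, resource_bits))


# A target exactly at the threshold, and the 700-bit jump from the example.
for bits in (140, 700):
    print(bits, expected_hits(bits), prob_at_least_one_hit(bits))
```

At exactly 140 bits the expected number of hits is 1, so the threshold marks the point where a single success stops being likely under these assumptions; every additional bit halves the expectation.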

Please feel free to ask new questions, if my answers are not clear or sufficient. This is an important point, and it is not at all easily grasped.

gpuccio, functional information, intelligent design