Are all monomers of a polypeptide chain equally important? Are some merely incidental --- does something needs to be there, but it doesn't matter what? Lockless and Ranganathan developed a way1 of answering this in the context of thermodynamic mutant cycles, and found that only sparse pairs of amino acids appear to be important.
The mathematics of their approach doesn't require a thermodynamic mutant cycle, or any reference to thermodynamics at all. The Boltzmann distribution2 arises in a much more general context.
Assume that we have an amino acid sequence, and we are interested in a particular function for it. It either functions or it doesn't. We may also be more fine grained: we only consider sequences that function, but they function in quantitatively different ways.
The Lockless-Ranganathan formalism replaces this deterministic fact with a probabilistic one: if we consider some sequence in an ensemble of sequences, we say that it has some probability of functioning and some probability of not functioning. This may seem troubling at first, but it has served physicists well for a hundred years (see, for instance, 3, section III.2). Where there is a pronounced difference in the probability one way or another, we hope that will reflect the reality. In physics, the size of the ensembles we work with is so enormous ( or more states) that in practice it always converges. Even for small ensembles we can hope for some ability to discriminate.
In all that follows we will operate on ensembles. An ensemble is simply the set of all possible states of the system. For particles in a box, each element of the ensemble is some configuration of all the positions and momenta of all particles. In our case at hand, an element of the ensemble will be a sequence and a state of the function of that sequence. Thus each sequence will appear in the ensemble with each possible functioning state. If we are interested in functions/doesn't function, and we have sequences of two positions, each of which can be X or Y, then the ensemble consists of:
Once we have an ensemble, we turn to defining random variables. A random variable simply associates some value with each element of the ensemble. Let be the first element of the sequence in the ensemble above. Then at both (XX,functions) and (XX,doesn't function), X. Similarly we can define for the second element, and for whether it functions or not (we'll write if it functions, and if it doesn't, just to save space).
We can also write down more complicated random variables, such as (though what X may mean, I have no idea), or (which is really as boring as itself).
Further, we can put a probability distribution on the ensemble. The probabilities of every element of the ensemble must add to 1, but aside from that, our distribution is arbitrary. Once we have probabilities, we can take averages and expectations of our random variables:
where ranges over all the elements of the ensemble, and is the probability of the th element.
From the end of the previous section, it should be obvious that our goal is to construct a sensible probability distribution on our ensemble. Note that this distribution can take no account of the phylogenetic structure of the sequences which go into the ensemble. By its very nature the ensemble contains entirely independent sequences. Trying to impose phylogenetic information in this formalism is analagous to asking how close a kinship two positions of a bouncing ball have to each other.
When we are thinking of ensembles and probabilities, the random variables are the things we measure. When we perform such a measurement, we get a certain amount of information. If the variable is like above, we get 1 bit of information (1 or 0). Shannon took this kind of argument to another level. If we have a random variable with states , each of which can occur with probability , then he defined the information we obtain from a measurement of the random variable to be
When we make a measurement, we would like the information content of the measurement not to depend on the details of the system. Thus if we count the C's in a nucleic acid sequence, we would like to get the same amount of information as if we counted them in another sequence. There may be a different number of C's, but intuitively, the amount more that we know after counting is the same either way. If we regard as a function of , then we want , for the 's some arbitrary small constants. We do insist that the changes in the underlying system be small: if we start counting C's in an protein sequence instead, we're in for a shock!
If we Taylor expand around the 's, we have
The sum must be zero, and the 's are arbitrary (though small), so we must have
Unfortunately, this isn't quite right. We also need to constrain
To do this we use Lagrange multipliers. Define
If doesn't change as we change , it means
If we do this, then we can eliminate one of the 's, say, . When we do this we get for alll .
But what if we know something more about the system? What if we've measure, say, the average of a random variable ? We will denote the average by , and the value of in the th state by . We handle it by adding another term to of the form
which just says that the average of must equal our measured value. If we carry this minimization through we get where
a normalization factor. This is a Boltzmann distribution, and provides us with an a priori way of imposing a probability distribution on an ensemble once we know something about a random variable on that ensemble.
In this section we're going to construct a massive mathematical object: .
Consider a position in an amino acid sequence. It can take on twenty one values: any of the twenty amino acids, or a null (let's write it X) if it doesn't exist. The null shows up when you want to compare sequences. Now let's number these values 1 to 21, and write down a twenty one element vector to represent a position in an amino acid sequence, where ranges from 1 to 21. If we number tyrosine as 1, then if the position contains a tyrosine, and all other elements of . If it contains some other amino acid, then that amino acid's corresponding position will be 1, and the others 0.
If we now consider a bunch of positions, we can label them from 1 to , and put an extra index on the to denote which position we're talking about: where goes from 1 to . For convenience, if we want to talk about two different positions in the sequence, we'll write for one of them and for another. It's the same object, but we're just looking at different parts of it, so we call those parts by different letters.
is completely equivalent to original sequence of letters. We can generalize it immediately to a random variable on a sequence ensemble.
Now we'll define things called correlation functions. Consider two random variable and . We would like to know if they vary independently or note over the ensemble. In the manner of Feynman4, we can find a sensible measure: the random variable . If and are completely independent, then . If is less than its mean when is greater, we get a negative value, and if they are both greater or less than their means at the same time, we get a positive value. This object is called the two point correlation function.
With random variables , , and , the three point correlation function would be . For a four point correlation, you would also multiply by and so on. Be cautious: the higher order your correlation function, the more computationally intensive it is to calculate, and the more data it requires to get a good estimate of it. The substance of the papers by Ranganathan and Lockless was that for a couple of small binding domains which they were studying, a two point correlation function between sequence sites (if they considered only sequences which bound the molecule in question in approximating their random variable) encoded most of the relevant structural information.
We turn back to our sequence ensembles now. Let's say we have our sequence and a random variable representing function (whether it fluoresces red or green, say, or even its actual wavelength). Let , the three point correlation function. Here's our object at last.
It is easy to imagine extensions. For instance, if a protein is under two separate evolutionary pressures, we might want to include two random variables describing function. We could then disentangle the two forces by including sequences which knew were subject to only one of the forces in our ensemble.
We have a random variable . We plug this directly into our formalism to get a probability of a given pair of amino acids at a given position having functional consequences.
There's only one problem: we actually have to measure the expectation of . It's obviously impractical to make all possible sequences and actually check them --- and it would render the whole exercise pointless. Instead, we take the sequences which we know already and use them as a small ensemble. For any given sequence we already have we can write down : it's some deterministic set of numbers. This amino acid is a leucine, and that one is a serine, and the whole thing fluoresces red. The actual process is easiest to see by a couple examples.
Take a sequence of length two which either functions () or not (). Say we actually have elements GG with , TG , LG with , GK with . Then can be 1 or 2, and we can ignore all the amino acids besides the four that show up here. I'll write as the actual letters. Only when talking to a computer will it be necessary to actually number them.