
The Lockless-Ranganathan formalism

Status: Done
Confidence: Certain


Are all monomers of a polypeptide chain equally important? Are some merely incidental, where something needs to be there but it doesn’t matter what? Lockless and Ranganathan developed a way[1] of answering this in the context of thermodynamic mutant cycles, and found that only sparse pairs of amino acids appear to be important.

The mathematics of their approach doesn’t require a thermodynamic mutant cycle, or any reference to thermodynamics at all. The Boltzmann distribution[2] arises in a much more general context.

Assume that we have an amino acid sequence, and we are interested in a particular function for it. It either functions or it doesn’t. We may also be more fine-grained: consider only sequences that function, but which function in quantitatively different ways.

The Lockless-Ranganathan formalism replaces this deterministic fact with a probabilistic one: if we consider some sequence in an ensemble of sequences, we say that it has some probability of functioning and some probability of not functioning. This may seem troubling at first, but it has served physicists well for a hundred years (see, for instance, [3], section III.2). Where there is a pronounced difference in the probability one way or the other, we hope that it will reflect reality. In physics, the size of the ensembles we work with is so enormous ($10^{23}$ or more states) that in practice it always converges. Even for small ensembles we can hope for some ability to discriminate.

Random Variables and Ensembles

In all that follows we will operate on ensembles. An ensemble is simply the set of all possible states of a system. For particles in a box, each element of the ensemble is some configuration of the positions and momenta of all the particles. In the case at hand, an element of the ensemble will be a sequence together with a state of that sequence’s function. Thus each sequence will appear in the ensemble with each possible functioning state. If we are interested in functions/doesn’t function, and we have sequences of two positions, each of which can be X or Y, then the ensemble consists of:

(XX,functions)
(XX,doesn’t function)
(XY,functions)
(XY,doesn’t function)
(YY,functions)
(YY,doesn’t function)
(YX,functions)
(YX,doesn’t function)

Once we have an ensemble, we turn to defining random variables. A random variable simply associates some value with each element of the ensemble. Let $S_1$ be the first element of the sequence in the ensemble above. Then at both (XX,functions) and (XX,doesn’t function), $S_1 = \textrm{X}$. Similarly we can define $S_2$ for the second element, and $F$ for whether it functions or not (we’ll write $F=1$ if it functions, and $F=0$ if it doesn’t, just to save space).

We can also write down more complicated random variables, such as $F + S_1$ (though what $1+\textrm{X}$ may mean, I have no idea), or $\sin(F)$ (which is really as boring as $F$ itself).

Further, we can put a probability distribution on the ensemble. The probabilities of every element of the ensemble must add to 1, but aside from that, our distribution is arbitrary. Once we have probabilities, we can take averages and expectations of our random variables:

$$E[F] = \sum_i F(i) P(i)$$

where $i$ ranges over all the elements of the ensemble, and $P(i)$ is the probability of the $i$th element.
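
To make this concrete, here is a minimal sketch in Python of the toy ensemble above and the expectation formula. The uniform distribution is just one arbitrary choice, since nothing so far pins $P$ down:

    from itertools import product

    # Each element of the ensemble is (sequence, functions?), where the
    # sequence has two positions, each X or Y, and 1 means "functions".
    ensemble = [(s1 + s2, f)
                for s1, s2 in product("XY", repeat=2)
                for f in (1, 0)]

    # A probability for every element; the only requirement is that
    # they sum to 1. Here: uniform, an arbitrary choice.
    P = {elem: 1 / len(ensemble) for elem in ensemble}

    # Random variables associate a value with each element.
    def F(elem):
        return elem[1]          # 1 if the sequence functions, else 0

    def S1(elem):
        return elem[0][0]       # the first letter of the sequence

    # E[F] = sum over elements of F(i) P(i).
    E_F = sum(F(elem) * P[elem] for elem in ensemble)
    print(E_F)                  # 0.5 under the uniform distribution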

From the end of the previous section, it should be obvious that our goal is to construct a sensible probability distribution on our ensemble. Note that this distribution can take no account of the phylogenetic structure of the sequences which go into the ensemble. By its very nature the ensemble contains entirely independent sequences. Trying to impose phylogenetic information in this formalism is analogous to asking how close a kinship two positions of a bouncing ball have to each other.

Shannon Information and Boltzmann Distributions

When we are thinking of ensembles and probabilities, the random variables are the things we measure. When we perform such a measurement, we get a certain amount of information. If the variable is like $F$ above, we get 1 bit of information (1 or 0). Shannon took this kind of argument to another level. If we have a random variable with states $i = 1, \dots, n$, each of which can occur with probability $p_i$, then he defined the information we obtain from a measurement of the random variable to be

$$I = -\sum_i p_i \log p_i$$
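
For the two-state variable $F$ above with both outcomes equally likely, this gives $I = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$ bit, matching the intuition above; natural logarithms would just change the unit.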

When we make a measurement, we would like the information content of the measurement not to depend on the details of the system. Thus if we count the C’s in a nucleic acid sequence, we would like to get the same amount of information as if we counted them in another sequence. There may be a different number of C’s, but intuitively, the amount more that we know after counting is the same either way. If we regard $I$ as a function of $p_1, p_2, \dots$, then we want $I(p_1,\dots,p_n) = I(p_1+\varepsilon_1,\dots,p_n+\varepsilon_n)$, for the $\varepsilon_i$’s some arbitrary small constants. We do insist that the changes in the underlying system be small: if we start counting C’s in a protein sequence instead, we’re in for a shock!

If we Taylor expand $I$ around the $p_i$’s, we have

$$I(p_1+\varepsilon_1,\dots) = I(p_1,\dots) + \sum_i \varepsilon_i \frac{\partial I}{\partial p_i}(p_1,\dots) + \dots$$

The sum must be zero, and the $\varepsilon_i$’s are arbitrary (though small), so we must have

$$\frac{\partial I}{\partial p_i}(p_1,\dots,p_n) = 0$$

Unfortunately, this isn’t quite right. We also need to constrain

$$\sum_i p_i = 1$$

To do this we use Lagrange multipliers. Define

$$I^{\ast} = I + \alpha\left(\sum_i p_i - 1\right)$$

If $I^{\ast}$ doesn’t change as we change $\alpha$, it means

$$\sum_i p_i - 1 = 0$$

(Equivalently, we could eliminate one of the $p_i$’s by hand, say $p_1 = 1 - p_2 - p_3 - \dots$, and maximize freely over the rest.) Either way, carrying the maximization through gives $p_i = 1/n$ for all $i$.
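
Spelling out that last step: differentiating $I^{\ast}$ with respect to each $p_i$ and setting the result to zero gives

$$\frac{\partial I^{\ast}}{\partial p_i} = -\log p_i - 1 + \alpha = 0$$

so every $p_i$ equals the same constant $e^{\alpha - 1}$, and the normalization constraint forces that constant to be $1/n$.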

But what if we know something more about the system? What if we’ve measured, say, the average of a random variable $X$? We will denote the average by $\langle X \rangle$, and the value of $X$ in the $i$th state by $x_i$. We handle it by adding another term to $I^{\ast}$ of the form

$$-\beta\left(\sum_i x_i p_i - \langle X \rangle\right)$$

which just says that the average of $X$ must equal our measured value. If we carry this maximization through we get $p_i = e^{-\beta x_i}/\mathcal{N}$, where

$$\mathcal{N} = \sum_i e^{-\beta x_i}$$

a normalization factor. This is a Boltzmann distribution, and provides us with an a priori way of imposing a probability distribution on an ensemble once we know something about a random variable on that ensemble.
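
The algebra is the same as in the uniform case, with one extra term from the new constraint:

$$\frac{\partial I^{\ast}}{\partial p_i} = -\log p_i - 1 + \alpha - \beta x_i = 0$$

so $p_i = e^{\alpha - 1} e^{-\beta x_i}$. Normalization fixes $e^{\alpha - 1} = 1/\mathcal{N}$, and $\beta$ is fixed implicitly by the requirement that $\sum_i x_i p_i = \langle X \rangle$.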

Sequence Ensembles and Correlation Functions

In this section we’re going to construct a massive mathematical object: Sijkl(f)S_{ij}^{kl}(f).

Consider a position in an amino acid sequence. It can take on twenty-one values: any of the twenty amino acids, or a null (let’s write it X) if it doesn’t exist. The null shows up when you want to compare sequences. Now let’s number these values 1 to 21, and write down a twenty-one element vector $A^k$ to represent a position in an amino acid sequence, where $k$ ranges from 1 to 21. If we number tyrosine as 1, then if the position contains a tyrosine, $A^1 = 1$ and all other elements $A^k = 0$. If it contains some other amino acid, then that amino acid’s corresponding position will be 1, and the others 0.

If we now consider a bunch of positions, we can label them from 1 to $m$, and put an extra index on the $A^k$ to denote which position we’re talking about: $A^k_i$ where $i$ goes from 1 to $m$. For convenience, if we want to talk about two different positions in the sequence, we’ll write $A^k_i$ for one of them and $A^l_j$ for another. It’s the same object, but we’re just looking at different parts of it, so we call those parts by different letters.

$A^k_i$ is completely equivalent to the original sequence of letters. We can generalize it immediately to a random variable on a sequence ensemble.
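
Here is a sketch of this encoding in Python. The alphabet ordering below is arbitrary (the text happens to number tyrosine first):

    import numpy as np

    # Twenty amino acids in one-letter code, plus 'X' for the null value.
    ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"

    def encode(seq):
        """Return A[i, k]: 1 if position i of seq holds letter k, else 0."""
        A = np.zeros((len(seq), len(ALPHABET)), dtype=int)
        for i, aa in enumerate(seq):
            A[i, ALPHABET.index(aa)] = 1
        return A

    # Exactly one 1 per row, so the encoding loses no information.
    assert all(row.sum() == 1 for row in encode("GXK"))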

Now we’ll define things called correlation functions. Consider two random variables $A$ and $B$. We would like to know whether they vary independently over the ensemble. In the manner of Feynman[4], we can find a sensible measure: the random variable $(A-E[A])(B-E[B])$. If $A$ and $B$ are completely independent, then $E[(A-E[A])(B-E[B])] = E[A-E[A]]\,E[B-E[B]] = 0$. If $A$ is less than its mean when $B$ is greater, we get a negative value, and if they are both greater or less than their means at the same time, we get a positive value. This object is called the two point correlation function.
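
As a quick numerical check of this intuition, assuming equal weights on the ensemble elements:

    import numpy as np

    def two_point(a, b):
        """Estimate E[(A - E[A])(B - E[B])] from paired samples."""
        a, b = np.asarray(a, float), np.asarray(b, float)
        return np.mean((a - a.mean()) * (b - b.mean()))

    # Varying together gives a positive value; independently, zero.
    print(two_point([0, 0, 1, 1], [0, 0, 1, 1]))   # 0.25
    print(two_point([0, 1, 0, 1], [0, 0, 1, 1]))   # 0.0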

With random variables $A$, $B$, and $C$, the three point correlation function would be $(A-E[A])(B-E[B])(C-E[C])$. For a four point correlation, you would also multiply by $(D-E[D])$, and so on. Be cautious: the higher order your correlation function, the more computationally intensive it is to calculate, and the more data it requires to get a good estimate. The substance of the papers by Ranganathan and Lockless was that for a couple of small binding domains they were studying, a two point correlation function between sequence sites (approximating the random variable using only sequences which bound the molecule in question) encoded most of the relevant structural information.

We turn back to our sequence ensembles now. Let’s say we have our sequence $A^k_i$ and a random variable $f$ representing function (whether it fluoresces red or green, say, or even its actual wavelength). Let $S^{kl}_{ij}(f) = (A^k_i - E[A^k_i])(A^l_j - E[A^l_j])(f - E[f])$, the three point correlation function. Here’s our object at last.
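
Estimating the expectation of $S^{kl}_{ij}(f)$ from a sample is the same pattern one order higher. In the sketch below, $a$ and $b$ stand for the columns $A^k_i$ and $A^l_j$ of the encoding across sequences; the function name is mine, not the paper’s:

    import numpy as np

    def three_point(a, b, f):
        """Estimate E[(A - E[A])(B - E[B])(f - E[f])] from equally
        weighted parallel samples over the sequences."""
        a, b, f = (np.asarray(x, float) for x in (a, b, f))
        return np.mean((a - a.mean()) * (b - b.mean()) * (f - f.mean()))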

It is easy to imagine extensions. For instance, if a protein is under two separate evolutionary pressures, we might want to include two random variables describing function. We could then disentangle the two forces by including sequences which we knew were subject to only one of the forces in our ensemble.

Sequence Probabilities

We plug $S^{kl}_{ij}(f)$ directly into our formalism to get a probability of a given pair of amino acids at given positions having functional consequences.

There’s only one problem: we actually have to measure the expectation of $S^{kl}_{ij}(f)$. It’s obviously impractical to make all possible sequences and actually check them, and doing so would render the whole exercise pointless. Instead, we take the sequences we already know and use them as a small ensemble. For any given sequence we already have, we can write down $S^{kl}_{ij}(f)$: it’s some deterministic set of numbers. This amino acid is a leucine, and that one is a serine, and the whole thing fluoresces red. The actual process is easiest to see by example.

Take a sequence of length two which either functions ($f=1$) or not ($f=0$). Say we actually have the sequences GG with $f=1$, TG with $f=0$, LG with $f=0$, and GK with $f=1$. Then $i,j$ can be 1 or 2, and we can ignore all the amino acids besides the four that show up here. I’ll write $k,l$ as the actual letters; only when talking to a computer will it be necessary to number them.

$E[A^G_1] = \frac{1}{2}$
$E[A^T_1] = \frac{1}{4}$
$E[A^L_1] = \frac{1}{4}$
$E[A^G_2] = \frac{3}{4}$
$E[A^T_2] = 0$
$E[A^L_2] = 0$
$E[A^K_2] = \frac{1}{4}$

These values should be obvious: I take the number of sequences with the right amino acid at a given position and divide by the total number of sequences. We can calculate the average of $f$ the same way: $E[f] = 1/2$. Next I’ll calculate one value in $S^{kl}_{ij}(f)$ (the rest follow the same way), in particular

$$S^{TG}_{12}(f) = \left(A^T_1 - \frac{1}{4}\right)\left(A^G_2 - \frac{3}{4}\right)\left(f - \frac{1}{2}\right)$$

For the actual sequence pairs, we get

Sequence    $S^{TG}_{12}(f)$
GG    $(0 - \frac{1}{4})(1 - \frac{3}{4})(1 - \frac{1}{2}) = -0.03125$
TG    $(1 - \frac{1}{4})(1 - \frac{3}{4})(0 - \frac{1}{2}) = -0.09375$
LG    $(0 - \frac{1}{4})(1 - \frac{3}{4})(0 - \frac{1}{2}) = 0.03125$
GK    $(0 - \frac{1}{4})(0 - \frac{3}{4})(1 - \frac{1}{2}) = 0.09375$

We average these numbers just as we would for any other random variable: $E[S^{TG}_{12}(f)] = (-0.03125 - 0.09375 + 0.03125 + 0.09375)/4 = 0$. In other words, over this tiny ensemble the positive and negative deviations cancel exactly: four sequences are too few to resolve a correlation between a T at the first position, a G at the second, and function. (The strong signal in this ensemble is elsewhere: we only see function when the first amino acid is a G.)
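
Checking the arithmetic numerically, with the four sequences equally weighted (the arrays are the columns $A^T_1$, $A^G_2$, and $f$ over GG, TG, LG, GK):

    import numpy as np

    a = np.array([0., 1., 0., 0.])   # A_1^T over GG, TG, LG, GK
    b = np.array([1., 1., 1., 0.])   # A_2^G over the same sequences
    f = np.array([1., 0., 0., 1.])   # whether each sequence functions

    terms = (a - a.mean()) * (b - b.mean()) * (f - f.mean())
    print(terms)                     # [-0.03125 -0.09375  0.03125  0.09375]
    print(terms.mean())              # 0.0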

Next we can turn this into a probability. It will not be the probability of the whole sequence functioning![5] It will be the probability of a sequence consisting of TG functioning. We could have done a sequence that wasn’t in our initial ensemble, but this way we can see what the formalism does to our data.

We assume that our calculated expectation of $S^{TG}_{12}(f)$ is reasonably near the real expectation. This is an assumption about our data: that it adequately samples the space of sequences that meet our criteria. We also need a "temperature" (the $\beta$ in our Boltzmann distribution). We could beg the question by calculating it from our data, but it is better to look elsewhere for an independent set of data.
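
The text doesn’t spell out this final computation. As one plausible sketch, and nothing more, here is what plugging measured expectations into a Boltzmann distribution would look like, with $\beta$ left as a free parameter and the example values invented:

    import numpy as np

    def boltzmann(values, beta):
        """Turn a list of measured expectations into Boltzmann
        probabilities exp(-beta * x) / N."""
        w = np.exp(-beta * np.asarray(values, float))
        return w / w.sum()

    # Hypothetical E[S] values for three candidate pairs of positions:
    print(boltzmann([0.0, -0.09375, 0.03125], beta=10.0))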

Unfortunately, I don’t know how to do this, and Eric Siggia’s description of current methods when I asked him was "socially acceptable."


  1. Michael Socolich, Steve W. Lockless, William P. Russ, Heather Lee, Kevin H. Gardner, and Rama Ranganathan. Evolutionary information for specifying a protein fold. Nature, 437(7058):512-518, Sep 2005.

  2. This is a probability distribution of the form $\exp(-X/T)$, where $T$ is a controlling parameter.

  3. N. G. Van Kampen. Stochastic Processes in Physics and Chemistry. North Holland, July 2001.

  4. Think really hard, then write down the answer.

  5. Well, in this case it will, but as soon as we have three amino acids, it won’t be anymore.