# The Lockless-Ranganathan formalism

Status: Done

Confidence: Certain

*Math in this page not rendering? See the fix*

Are all monomers of a polypeptide chain equally important? Are some
merely incidental — does something needs to be there, but it doesn’t
matter what? Lockless and Ranganathan developed a way^{1} of
answering this in the context of thermodynamic mutant cycles, and found
that only sparse pairs of amino acids appear to be important.

The mathematics of their approach doesn’t require a thermodynamic
mutant cycle, or any reference to thermodynamics at all. The Boltzmann
distribution^{2} arises in a much more general
context.

Assume that we have an amino acid sequence, and we are interested in a particular function for it. It either functions or it doesn’t. We may also be more fine grained: we only consider sequences that function, but they function in quantitatively different ways.

The Lockless-Ranganathan formalism replaces this deterministic fact
with a probabilistic one: if we consider some sequence in an ensemble of
sequences, we say that it has some probability of functioning and some
probability of not functioning. This may seem troubling at first, but it
has served physicists well for a hundred years (see, for instance, ^{3}, section III.2). Where there is a
pronounced difference in the probability one way or another, we hope
that will reflect the reality. In physics, the size of the ensembles we
work with is so enormous
($10^{23}$
or more states) that in practice it always converges. Even for small
ensembles we can hope for some ability to discriminate.

## Random Variables and Ensembles

In all that follows we will operate on ensembles. An ensemble is simply the set of all possible states of the system. For particles in a box, each element of the ensemble is some configuration of all the positions and momenta of all particles. In our case at hand, an element of the ensemble will be a sequence and a state of the function of that sequence. Thus each sequence will appear in the ensemble with each possible functioning state. If we are interested in functions/doesn’t function, and we have sequences of two positions, each of which can be X or Y, then the ensemble consists of:

(XX,functions)

(XX,doesn’t function)

(XY,functions)

(XY,doesn’t function)

(YY,functions)

(YY,doesn’t function)

(YX,functions)

(YX,doesn’t function)

Once we have an ensemble, we turn to defining random variables. A random variable simply associates some value with each element of the ensemble. Let $S_1$ be the first element of the sequence in the ensemble above. Then at both (XX,functions) and (XX,doesn’t function), $S_1 = \textrm{X}$. Similarly we can define $S_2$ for the second element, and $F$ for whether it functions or not (we’ll write $F=1$ if it functions, and $F=0$ if it doesn’t, just to save space).

We can also write down more complicated random variables, such as $F + S_1$ (though what $1+\textrm{X}$ may mean, I have no idea), or $\sin(F)$ (which is really as boring as $F$ itself).

Further, we can put a probability distribution on the ensemble. The probabilities of every element of the ensemble must add to 1, but aside from that, our distribution is arbitrary. Once we have probabilities, we can take averages and expectations of our random variables:

$E[F] = \sum_i F(i) P(i)$

where $i$ ranges over all the elements of the ensemble, and $P(i)$ is the probability of the $i$th element.

From the end of the previous section, it should be obvious that our goal is to construct a sensible probability distribution on our ensemble. Note that this distribution can take no account of the phylogenetic structure of the sequences which go into the ensemble. By its very nature the ensemble contains entirely independent sequences. Trying to impose phylogenetic information in this formalism is analagous to asking how close a kinship two positions of a bouncing ball have to each other.

## Shannon Information and Boltzmann Distributions

When we are thinking of ensembles and probabilities, the random variables are the things we measure. When we perform such a measurement, we get a certain amount of information. If the variable is like $F$ above, we get 1 bit of information (1 or 0). Shannon took this kind of argument to another level. If we have a random variable with states $i = 1, \dots, n$, each of which can occur with probability $p_i$, then he defined the information we obtain from a measurement of the random variable to be

$I=- \sum_i p_i \log p_i$

When we make a measurement, we would like the information content of the measurement not to depend on the details of the system. Thus if we count the C’s in a nucleic acid sequence, we would like to get the same amount of information as if we counted them in another sequence. There may be a different number of C’s, but intuitively, the amount more that we know after counting is the same either way. If we regard $I$ as a function of $p_1,p_2,\dots$, then we want $I(p_1,\dots,p_n) = I(p_1+\varepsilon_1,\dots,p_n+\varepsilon_n)$, for the $\varepsilon_i$’s some arbitrary small constants. We do insist that the changes in the underlying system be small: if we start counting C’s in an protein sequence instead, we’re in for a shock!

If we Taylor expand $I$ around the $p_i$’s, we have

$I(p_1+\varepsilon_1,\dots) = I(p_1,\dots) + \sum_i \varepsilon_i \frac{\partial I}{\partial p_i}(p_1,\dots) + \dots$

The sum must be zero, and the $\varepsilon_i$’s are arbitrary (though small), so we must have

$\frac{\partial I}{\partial p_i}(p_1,\dots,p_n) = 0$

Unfortunately, this isn’t quite right. We also need to constrain

$\sum_i p_i = 1$

To do this we use Lagrange multipliers. Define

$I^{\ast} = I + \alpha(\sum_i p_i - 1)$

If $I^{\ast}$ doesn’t change as we change $\alpha$, it means

$\sum_i p_i -1 = 0$

If we do this, then we can eliminate one of the $p_i$’s, say, $p_1 = 1 - p_2 - p_3 - \dots$. When we do this we get $p_i = 1/n$ for alll $i$.

But what if we know something more about the system? What if we’ve measure, say, the average of a random variable $\textrm{X}$? We will denote the average by $\langle \textrm{X} \rangle$, and the value of $\textrm{X}$ in the $i$th state by $x_i$. We handle it by adding another term to $I^{\ast}$ of the form

$\beta(\sum_i x_i p_i - \langle X \rangle)$

which just says that the average of $\textrm{X}$ must equal our measured value. If we carry this minimization through we get $p_i = \textrm{exp}(-\beta x_i)/\mathcal{N}$ where

$\mathcal{N} = \sum_i e^{-\beta x_i}$

a normalization factor. This is a Boltzmann distribution, and
provides us with an *a priori* way of imposing a probability
distribution on an ensemble once we know something about a random
variable on that ensemble.

## Sequence Ensembles and Correlation Functions

In this section we’re going to construct a massive mathematical object: $S_{ij}^{kl}(f)$.

Consider a position in an amino acid sequence. It can take on twenty one values: any of the twenty amino acids, or a null (let’s write it X) if it doesn’t exist. The null shows up when you want to compare sequences. Now let’s number these values 1 to 21, and write down a twenty one element vector $A^k$ to represent a position in an amino acid sequence, where $k$ ranges from 1 to 21. If we number tyrosine as 1, then if the position contains a tyrosine, $A^1 = 1$ and all other elements of $A^k=0$. If it contains some other amino acid, then that amino acid’s corresponding position will be 1, and the others 0.

If we now consider a bunch of positions, we can label them from 1 to $m$, and put an extra index on the $A^k$ to denote which position we’re talking about: $A^k_i$ where $i$ goes from 1 to $m$. For convenience, if we want to talk about two different positions in the sequence, we’ll write $A^k_i$ for one of them and $A^l_j$ for another. It’s the same object, but we’re just looking at different parts of it, so we call those parts by different letters.

$A^k_i$ is completely equivalent to original sequence of letters. We can generalize it immediately to a random variable on a sequence ensemble.

Now we’ll define things called correlation functions. Consider two
random variable
$A$
and
$B$.
We would like to know if they vary independently or note over the
ensemble. In the manner of Feynman^{4}, we can find a sensible
measure: the random variable
$(A-E[A])(B-E[B])$.
If
$A$
and
$B$
are completely independent, then
$E[(A-E[A])(B-E[B])] = E[A-E[A]] E[B-E[B]] = 0$.
If
$A$
is less than its mean when
$B$
is greater, we get a negative value, and if they are both greater or
less than their means at the same time, we get a positive value. This
object is called the two point correlation function.

With random variables $A$, $B$, and $C$, the three point correlation function would be $A-E[A])(B-E[B])(C-E[C])$. For a four point correlation, you would also multiply by $(D-E[D])$ and so on. Be cautious: the higher order your correlation function, the more computationally intensive it is to calculate, and the more data it requires to get a good estimate of it. The substance of the papers by Ranganathan and Lockless was that for a couple of small binding domains which they were studying, a two point correlation function between sequence sites (if they considered only sequences which bound the molecule in question in approximating their random variable) encoded most of the relevant structural information.

We turn back to our sequence ensembles now. Let’s say we have our sequence $A^k_i$ and a random variable $f$ representing function (whether it fluoresces red or green, say, or even its actual wavelength). Let $S^{kl}_{ij}(f) = (A^k_i - E[A^k_i)(A^l_j - E[A^l_j])(f-E[f])$, the three point correlation function. Here’s our object at last.

It is easy to imagine extensions. For instance, if a protein is under two separate evolutionary pressures, we might want to include two random variables describing function. We could then disentangle the two forces by including sequences which knew were subject to only one of the forces in our ensemble.

## Sequence Probabilities

$S^{kl}_{ij}(f)$. We plug this directly into our formalism to get a probability of a given pair of amino acids at a given position having functional consequences.

There’s only one problem: we actually have to measure the expectation
of
$S^{kl}_{ij}(f)$.
It’s obviously impractical to make all possible sequences and actually
check them — and it would render the whole exercise pointless. Instead,
we take the sequences which we know already and use them as a small
ensemble. For any given sequence we already have we can write down
$S^{kl}_{ij}(f)$:
it’s some deterministic set of numbers. *This* amino acid is a
leucine, and *that* one is a serine, and the whole thing
fluoresces red. The actual process is easiest to see by a couple
examples.

Take a sequence of length two which either functions ($f=1$) or not ($f=0$). Say we actually have elements GG with $f=1$, TG $f=0$, LG with $f=0$, GK with $f=1$. Then $i,j$ can be 1 or 2, and we can ignore all the amino acids besides the four that show up here. I’ll write $k,l$ as the actual letters. Only when talking to a computer will it be necessary to actually number them.

$E[A^G_1] = \frac{1}{2}$

$E[A^T_1] = \frac{1}{4}$

$E[E^L_1]=\frac{1}{4}$

$E[A^G_2] = \frac{3}{4}$

$E[A^T_2] = 0$

$E[A^L_2] = 0$

$E[A^K_2] = \frac{1}{4}$

These values should be obvious: I take the number of sequences with the right amino acid at this position and divide it by the total number of sequences. We can calculate the average of $f$ the same way: $E[f] = 1/2$. Next I’ll calculate one value in $S^{kl}_{ij}(f)$ (the rest follow the same way), in particular

$S^{TG}_{12}(1) = (A^{T}_1 - \frac{1}{4})(A^G_2 - \frac{3}{4})(f - \frac{1}{2})$

For the actual sequence pairs, we get

Sequence | $S^{TG}_{12}(1)$ |
---|---|

GG | $(0 - \frac{1}{4})(1 - \frac{3}{4})(1 - \frac{1}{2}) = -0.03125$ |

TG | $(1 - \frac{1}{4})(1 - \frac{3}{4})(0 - \frac{1}{2}) = -0.09375$ |

LG | $(0 - \frac{1}{4})(1 - \frac{3}{4})(0 - \frac{1}{2}) = 0.03125$ |

GK | $(0 - \frac{1}{4})(1 - \frac{3}{4})(1 - \frac{1}{2}) = -0.03125$ |

We average these numbers just as we would for any other random variable: $E[S^{TG}_{12}(1) = (-0.03125 - 0.09375 + 0.03125 - 0.03125)/4 = -0.03125$. In order words, this is inversely correlated with function, which makes sense since we only see function when the first amino acid is a G.

Next we can turn this into a probability. It will not be the
probability of the whole sequence functioning!^{5} It
will be the probability of a sequence consisting of TG functioning. We
could have done a sequence that wasn’t in our initial ensemble, but this
way we can see what the formalism does to our data.

We assume that our calculated expectation of $S^{TG}_{12}(f)$ is reasonably near the real expectation. This is an assumption about our data: it adequately samples the space of sequences meet our criteria. We also need a "temperature" (the $\beta$ in our Boltzmann distribution). We could beggar the question by calculating it from our data, but it is better to look elsewhere for an independent set of data.

Unfortunately, I don’t know how to do this, and Eric Siggia’s description of current methods when I asked him was "socially acceptable."

Michael Socolich, Steve W. Lockless, William P. Russ, Heather Lee, Kevin H. Gardner, and Rama Ranganathan. Evolutionary information for specifying a protein fold.

*Nature*, 437(7058):512-518, Sep 2005.↩︎This is a probability distributions of the form $\textrm{exp}(X/T)$ where $T$ is a controlling parameter.↩︎

N. G. Van Kampen.

*Stochastic Processes in Physics and Chemistry*. North Holland, July 2001.↩︎Think really hard, then write down the answer.↩︎

Well, in this case it will, but as soon as we have three amino acids, it won’t be anymore.↩︎