Feuding schools of inference
Status: Needs revision
Confidence: Likely
Every so often some comparison of Bayesian and frequentist statistics comes to my attention. Today it was on a blog called Pythonic Perambulations. It’s the work of amateurs. Their description of noninformative priors is simplified to the point of distortion. They insist on kludging their tools instead of fixing their model when it is clearly misspecified. They use a naive construction for 95% confidence intervals, are surprised when it fails miserably, and even use this as an argument against 95% confidence intervals.1 Normally I would shrug and move on, but it happened to catch me in a particularly grumpy mood, so here we are.
Essays discussing frequentist versus Bayesian statistics follow a fairly standard form. The author lays out both positions, then argues for the one he (it seems invariably to be a he) likes. The two positions are both quite subtle, but each tries to make the concept of a probability correspond to something in the real world. Frequentists operationalize probability as the fraction of elements of an ensemble of hypothetical outcomes of a trial that have a certain property. Bayesians operationalize probability as degree of belief. Both have mathematical models to justify their interpretation, and all of those models rest on assumptions which are rarely justified in practice. Which one is right?
The answer, as usual when faced with a dichotomy, is neither. van Kampen wrote a paper2 about quantum mechanics that has some dicta which can be translated almost directly to statistics, notably:
The quantum mechanical probability is not observed but merely serves as an intermediate stage in the computation of an observable phenomenon.
and
Whoever endows $\psi$ with more meaning than is needed for computing observable phenomena is responsible for the consequences.
Probability, as a mathematical theory, has no need of an interpretation. Mathematicians studying combinatorics use it quite happily with nothing in sight that a frequentist or Bayesian would recognize. The real battleground is statistics, and the real purpose is to choose an action based on data. The formulation that everyone uses for this, from machine learning to the foundations of Bayesian statistics, is decision theory. A decision theoretic formulation of a situation has the following components:3
- a set $\Theta$ of possible states of nature
- a set $X$ of values that will result from a trial meant to measure some aspect of that state of nature
- a set $A$ of possible actions to take based on the outcome of that trial
- a loss function $L(\theta, a)$, giving the cost of taking a particular action $a$ when one of the possible states of nature $\theta$ is the true one
Given these components, the task is to find a function $\delta$ from $X$ to $A$ which minimizes the loss. The loss is a function, though, not a single value, and there are many ways we can make this well defined. Each of those ways has different uses.
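To make those components concrete, here is a minimal sketch in Python of a toy version of the setup. Everything in it is invented for illustration (the coin-flipping states of nature, the loss values, the particular threshold rule), not drawn from the essay or its references:

```python
# A toy decision problem: decide whether a coin is fair or biased toward heads
# after observing the number of heads in 10 flips. All numbers here are
# illustrative assumptions.

from math import comb

THETA = ["fair", "biased"]                      # possible states of nature
X = range(11)                                   # possible trial outcomes: heads in 10 flips
A = ["call it fair", "call it biased"]          # possible actions

def loss(theta, action):
    """Cost of taking `action` when `theta` is the true state of nature."""
    if theta == "fair" and action == "call it biased":
        return 1.0    # false alarm
    if theta == "biased" and action == "call it fair":
        return 5.0    # missing a biased coin is assumed to be the costlier mistake
    return 0.0

def p_x_given_theta(x, theta):
    """Sampling model: Binomial(10, p), with p depending on the state of nature."""
    p = 0.5 if theta == "fair" else 0.8
    return comb(10, x) * p**x * (1 - p)**(10 - x)

def delta(x):
    """One candidate procedure: a threshold rule mapping outcomes to actions."""
    return "call it biased" if x >= 8 else "call it fair"

def risk(theta, procedure):
    """Expected loss of a procedure when `theta` is the true state of nature."""
    return sum(p_x_given_theta(x, theta) * loss(theta, procedure(x)) for x in X)

for theta in THETA:
    print(theta, risk(theta, delta))
```

Even after averaging over the trial outcomes, the loss is still a function of the unknown state of nature, which is exactly the ambiguity the next two paragraphs resolve in different ways.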
For example, if we are engaged in a contest against an opponent, we may want to minimize the maximum loss we can have. Thus we choose $\delta$ to minimize the maximum value $L(\theta, \delta(x))$ achieves over any combination of $\theta$ and $x$ which can occur.
Alternatively, we can choose $\delta$ to minimize the integral of $L(\theta, \delta(x))$ against some measure $\mu$ on $\Theta \times X$. Usually we decompose the measure into a measure on $X$ given $\theta$ (the probability of getting a particular value from $X$ given that some element $\theta$ of $\Theta$ is the true state of nature) and a measure on $\Theta$. This is a Bayes procedure, with the measure on $\Theta$ the prior. We could also integrate over $X$ but not $\Theta$ and use some other technique to eliminate that variable.
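Here is a sketch, continuing the same toy coin problem (with an entirely illustrative prior), of how the minimax and Bayes criteria each collapse the risk function to a single number and so pick out a procedure:

```python
# Choosing among candidate procedures in the toy coin problem: minimax picks
# the rule with the smallest worst-case risk over the states of nature, a Bayes
# procedure picks the rule with the smallest risk averaged against a prior.
# The prior (90% fair, 10% biased) is an illustrative assumption.

from math import comb

THETA, X = ["fair", "biased"], range(11)

def loss(theta, action):
    if theta == "fair" and action == "call it biased":
        return 1.0
    if theta == "biased" and action == "call it fair":
        return 5.0
    return 0.0

def p_x_given_theta(x, theta):
    p = 0.5 if theta == "fair" else 0.8
    return comb(10, x) * p**x * (1 - p)**(10 - x)

def risk(theta, procedure):
    return sum(p_x_given_theta(x, theta) * loss(theta, procedure(x)) for x in X)

# Candidate procedures: every threshold rule "call it biased if at least k heads".
def make_rule(k):
    return lambda x: "call it biased" if x >= k else "call it fair"

candidates = {k: make_rule(k) for k in range(12)}

# Minimax: minimize the maximum risk over the states of nature.
minimax_k = min(candidates, key=lambda k: max(risk(t, candidates[k]) for t in THETA))

# Bayes: minimize the risk integrated against the prior on the states of nature.
prior = {"fair": 0.9, "biased": 0.1}
bayes_k = min(candidates, key=lambda k: sum(prior[t] * risk(t, candidates[k]) for t in THETA))

print("minimax threshold:", minimax_k)   # 6 with these numbers
print("Bayes threshold:  ", bayes_k)     # 8 with these numbers
```

With these particular numbers the two norms disagree: the prior, which puts most of its weight on fair coins, makes the Bayes rule demand more heads before acting than the minimax rule does. Probability appears in both calculations only as a weight in a sum.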
Almost any of the tricks for defining norms that you can dig out of functional analysis can be used and will have a use, but in the end you have a procedure $\delta$. You apply it to the data from your trial and take the action it dictates. Probability does not enter the picture.4
We can and should fight over the specification of the states of nature $\Theta$, of the possible decisions $A$, and over the loss function $L$.5 We should discuss the norm we use to choose our optimal procedure $\delta$. These are hard questions. There is no reason to make the situation any more difficult by attaching unnecessary ideas to probability, which is a tool for calculation and no more.
Naive constructions typically fail wildly for non-Gaussian distributions. See Brown, Cai, and DasGupta, “Interval estimation for a binomial proportion,” Statistical Science, Vol. 16, No. 2 (2001), pp. 101–133, for the binomial case.↩︎
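For a sense of scale, here is a small simulation sketch of the usual naive (Wald) interval $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$; the particular $p$, $n$, and number of trials are chosen purely for illustration:

```python
# Simulate the coverage of the naive Wald 95% interval for a binomial
# proportion. The true p and sample size n below are illustrative choices
# that sit in the regime where the interval does badly.

import random
from math import sqrt

def wald_covers(p, n):
    heads = sum(random.random() < p for _ in range(n))
    p_hat = heads / n
    half_width = 1.96 * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width <= p <= p_hat + half_width

random.seed(0)
p, n, trials = 0.02, 50, 20000
coverage = sum(wald_covers(p, n) for _ in range(trials)) / trials
print(f"nominal 95%, actual coverage: {coverage:.1%}")
```

With these settings the actual coverage comes out around 63–64%, well below the nominal 95%.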
N. G. van Kampen, “Ten theorems about quantum mechanical measurements,” Physica A 153 (1988), pp. 97–113.↩︎
I learned this from Kiefer’s Introduction to Statistical Inference.↩︎
This was the lesson I took away from Leonard Savage’s Foundations of Statistics; everyone else seems to have read a different book than I did.↩︎
In practice, we try to find procedures that are optimal under a range of loss functions to make this decision less subjective.↩︎