Statistics resembles the apocryphal elephant being examined by blind men. Each person uses, and often only knows, a particular set of statistical tools and, when passing on that knowledge, has no picture to impart of the general structure of statistics. Yet that structure consists of only four pieces: planning experiments and trials; exploring patterns and structures in the resulting data; making reproducible inferences from that data; and designing the experience of interacting with the results of an analysis. These parts are known in the field as the design of experiments, exploratory data analysis, inference, and visualization.
Sadly, this basic structure of statistics doesn’t seem to be written down anywhere, particularly not in books accessible to the beginner.
The knowledge of how to plan experiments is largely specific to particular fields, but among that knowledge are certain universal aspects that are properly part of statistics. Unfortunately, because these aspects tend to be transmitted along with knowledge specific to a field, each field usually knows only a fraction of the useful methods available to it. The universal aspects are defined on an abstract model of an experiment: the experimenter selects several factors to vary, producing distinct conditions, and then makes a sequence of measurements under those conditions. In particular, there are methods for choosing the optimal values of a numerical factor, for deciding how many measurements to make per condition, and for choosing which conditions to use in what order.
Though the methods of the design of experiments are universal, they are not magical, and the optimal designs produced depend on assumptions about how the results of the measurements, called the response, will vary with the conditions. For example, if we expect the response to vary linearly over the values of a numerical factor, the optimal values to use are a pair of points spaced as far apart as possible, though in practice we would usually add another couple of measurements between them in case the response is not linear. On the other hand, if we expect a step function, with the step occurring in some known range, the optimal values are points evenly spaced over that range.
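To make the linear case concrete, here is a minimal sketch, assuming a straight-line response with independent noise of equal variance at every point. Under that assumption the variance of the least-squares slope is the noise variance divided by the sum of squared deviations of the factor values from their mean, so spreading the points out shrinks it. The function name and the three six-measurement designs on the interval from 0 to 1 are illustrative choices, not taken from any particular field.

```python
def slope_variance(xs, sigma=1.0):
    """Variance of the least-squares slope when measuring at positions xs.

    Assumes y = a + b*x + noise, with independent noise of standard
    deviation sigma at every position (an assumption of this sketch).
    """
    mean = sum(xs) / len(xs)
    sxx = sum((x - mean) ** 2 for x in xs)
    return sigma ** 2 / sxx


# Three hypothetical ways to spend six measurements on the range [0, 1].
endpoints = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]   # everything at the extremes
hedged = [0.0, 0.0, 1 / 3, 2 / 3, 1.0, 1.0]  # a couple of interior checks
even = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]        # evenly spaced

for name, design in [("endpoints", endpoints), ("hedged", hedged), ("even", even)]:
    print(name, round(slope_variance(design), 3))
```

The endpoint design gives the smallest slope variance, the evenly spaced design the largest, and the hedged design sits between them, which is the price paid for being able to notice curvature.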
Choosing the values of a factor to use is well understood if it only has a small number of discrete values, such as the sex of a mouse, or if it is numerical. For other cases, there may not be any universal theory available. Sometimes there may be a generalizable, though not quite universal theory, such as methods in software engineering for handling discrete factors with large numbers of values, such as program inputs.
Once we have chosen the values to use for each factor in our experiment, we must allocate measurements to combinations of those values. We can get the basics from common sense: no one (I hope) would allocate all men to the control group of a clinical trial and all women to the treatment group. But we quickly find ourselves in very involved calculations. How do you assort three values of fertilizer and three values of watering over five fields in an agronomy trial? Can we plan an experiment that stops early if its outcome becomes clear?
The optimal allocation depends on how many of the factors and their interactions we expect to be important. If we expect all of our factors and their interactions to be important, we must test all possible combinations, which is called a factorial design. If we expect only a subset of the factors or their interactions to be important, we can reduce the number of measurements required in various ways.
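As a rough illustration of the two cases, the sketch below enumerates a full factorial design and then one standard reduction, a half fraction of a two-level design. The factor names and level labels are hypothetical, and the half fraction shown (keeping the runs whose coded levels multiply to +1) is only one of the various reductions available; it saves half the runs at the cost of confounding each main effect with a two-factor interaction.

```python
from itertools import product


def full_factorial(levels_by_factor):
    """Every combination of the given levels, one dict per run."""
    names = list(levels_by_factor)
    return [dict(zip(names, combo))
            for combo in product(*levels_by_factor.values())]


def half_fraction(runs):
    """Keep the two-level runs whose coded levels (-1/+1) multiply to +1."""
    kept = []
    for run in runs:
        sign = 1
        for level in run.values():
            sign *= level
        if sign == 1:
            kept.append(run)
    return kept


# Hypothetical agronomy factors with three levels each: 3 x 3 = 9 conditions.
agronomy = full_factorial({
    "fertilizer": ["low", "medium", "high"],
    "watering": ["low", "medium", "high"],
})

# Three two-level factors in coded form: 8 runs in full, 4 in the half fraction.
coded = full_factorial({"A": [-1, 1], "B": [-1, 1], "C": [-1, 1]})
reduced = half_fraction(coded)

print(len(agronomy), len(coded), len(reduced))
```

Each factor still appears at its low and high level equally often in the reduced design, so main effects remain estimable as long as the interactions they are confounded with really are negligible.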
The universal part of the design of experiments also has techniques to calculate the number of measurements needed to achieve some desired precision. This precision is usually couched in terms of what effect size can be detected or the probability of missing a rare event.
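Two of the simplest such calculations can be sketched as follows. The first assumes a comparison of two group means with a known, common standard deviation and a two-sided test, using the normal approximation; the second assumes independent trials, each with the same chance of showing the rare event. Both function names and the default values for `alpha` and `power` are illustrative assumptions, and real studies often need more careful formulas.

```python
from math import ceil, log
from statistics import NormalDist


def n_per_group(effect, sigma, alpha=0.05, power=0.8):
    """Measurements per group needed to detect a difference in means.

    Normal approximation for a two-sided, two-sample comparison with a
    known common standard deviation sigma (assumptions of this sketch).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sigma / effect) ** 2)


def trials_to_see_event(event_prob, miss_prob=0.05):
    """Independent trials needed so that the chance of never seeing an
    event of probability event_prob is at most miss_prob."""
    return ceil(log(miss_prob) / log(1 - event_prob))


print(n_per_group(effect=0.5, sigma=1.0))
print(trials_to_see_event(event_prob=0.01))
```

Because the effect size enters as a square in the denominator, halving the effect to be detected roughly quadruples the number of measurements required.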
The calculations described above are all closely linked to those used in inference. Indeed, the design of experiments is in general very closely linked to inference. A particular design constrains the methods of inference to use, and many methods of inference make assumptions that can only be satisfied by design.
Exploring patterns and structure in data sets goes by the name of exploratory data analysis, from the title of the book by John Tukey that gave the subject academic respectability.[1] Exploratory data analysis is inherently iterative: the analyst maps the data in such a way that a pattern’s presence or absence becomes easily visible, and uses the result to guide the choice of her next step.
The key technical material of exploratory data analysis is the mappings from data to some form that indicates next steps to try. The mapping may be to almost any mathematical structure that is easy to reason about. Some mappings change the form of the data to make it look like a more familiar data set, such as taking a logarithm of skewed data to make it look more like a bell curve, or smoothing a time sequence to find trends. Other mappings consist of fitting a model and examining the details of the fit and the deviations of the data from the fit, such as fitting a sine wave of a particular frequency and looking at the phase, the amplitude, and the residual, non-sinusoidal behavior in the data. The data mining community favors models in the form of rules, clusters, and classifications, but any structure that is the output of inference, including graphs and persistent homology, is a candidate mapping.
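The logarithm mapping mentioned above can be sketched in a few lines. The data values here are made up for illustration, and the moment-based skewness used as a summary is just one convenient way to check whether the transformed data looks more symmetric; in practice the analyst would usually look at a plot as well.

```python
from math import log


def skewness(values):
    """Moment-based sample skewness: about zero for symmetric data,
    positive when there is a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, strongly right-skewed measurements (all positive, so log is defined).
raw = [1, 2, 2, 3, 4, 5, 8, 13, 30, 100]
logged = [log(v) for v in raw]

print(round(skewness(raw), 2), round(skewness(logged), 2))
```

The skewness drops sharply after the transformation, which suggests that the next step, whatever it is, may be easier to carry out on the logged values.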
Exploratory data analysis is hard to learn. By its nature, it is a skill, not a set of results, and, like playing an instrument, benefits dramatically from working with a skilled practitioner. The material that exists is scattered. Some is in older works from before ubiquitous computing (Tukey’s book was based on his experience working without fast computers to hand, and emphasizes methods that are fast and reliable to perform by hand). Some is tied to early graphical computer environments that seem primitive today. The rest is in the data mining literature, which is focused much more on specific mappings than on using them in practice. Forty years after it was written, Tukey’s book remains the best starting point.
Finally, exploratory data analysis, and its relation with inference, provokes a perennial, acrimonious dispute. Since exploratory data analysis by its nature involves matching many patterns to data, naively doing inference on the patterns found can lead to inflated strength of inference. Either inference should be done on an independently collected data set, or the analyst doing exploratory data analysis must keep track of how many patterns she has tried in her exploration and use that count in a multiple testing correction.
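As a minimal sketch of the second option, the code below applies the Bonferroni correction, which is the simplest multiple testing correction and only one of several in use. The function name and the p-values are hypothetical. The point it illustrates is that the divisor is the number of patterns tried during exploration, not the number of patterns that ended up looking interesting.

```python
def bonferroni_survivors(p_values, patterns_tried, alpha=0.05):
    """Flag which reported p-values remain significant after a Bonferroni
    correction for every pattern tried, reported or not."""
    if patterns_tried < len(p_values):
        raise ValueError("cannot have tried fewer patterns than were reported")
    threshold = alpha / patterns_tried
    return [p <= threshold for p in p_values]


# Hypothetical exploration: 20 patterns tried, 3 looked promising.
reported = [0.001, 0.01, 0.04]
print(bonferroni_survivors(reported, patterns_tried=20))
```

With twenty patterns tried, the threshold falls from 0.05 to 0.0025, and only the first of the three reported patterns survives.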
Inference is about making decisions in a reproducible way. Its initial motivation was analyzing data in a way that other scientists could accept without knowing the personal foibles of the experimenter. Inference was viewed this way until 1950, when Wald’s decision theory sank its foundations in game theory,[2] though inference usually still appears in textbooks couched in the old way. However, decision theory, along with high speed computing and a move away from simple statistical models, drove inference in very different directions in the second half of the twentieth century.
The body of techniques from before decision theory is usually referred to as “classical inference”: parametric hypothesis tests, point estimates, confidence intervals, and various forms of regression. Its methods are still ubiquitous in the sciences, and are still developing. The proper confidence intervals for a binomial distribution were only found after the turn of the 21st century. Much of the material of classical inference also underlies the calculations in the design of experiments. Statisticians must know this material, and much of the original literature, such as the papers of William Gosset, remains wonderful reading.
Classical inference is entirely concerned with estimating the values of a fixed number of numerical parameters in closed form mathematical models. Such methods can fail in subtle ways when reality does not match the model, and reality, unfortunately, usually does not match the model. This weakness drove the development of “robust” statistics and eventually of methods that depend only on topological properties of the data. Such methods go under the name “nonparametric statistics”. By the turn of the 21st century, nonparametric statistics was a mature field, with a steady growth of techniques.
The conjunction of decision theory and high speed computing drove two major areas of inference. One aimed to solve problems similar to those of nonparametric statistics by fitting enormously flexible classes of models, such as neural networks or support vector machines, directly to example data. This approach turns out to generalize beautifully to high dimensions. The other area driven by computing and decision theory brought randomness into the decision procedure itself via Monte Carlo sampling. Random methods had been used to compile tables for decades. Such sampling, a Monte Carlo technique, was how Gosset arrived at his first table of values for the t-test in 1908.[3] However, random procedures seemed illegitimate to the intuition of statisticians before decision theory, and were too expensive to perform in practice. Once decision theory made them acceptable and computers made them practical, they appeared first as resampling statistics, and then in increasingly sophisticated forms such as Markov Chain Monte Carlo in Bayesian statistics and bagging and boosting in machine learning.
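To give a flavor of resampling statistics, here is a minimal sketch of a percentile bootstrap confidence interval for a median. The data values, the function name, the number of resamples, and the fixed seed are all illustrative assumptions, and the percentile method is the simplest of several bootstrap interval constructions rather than the best one.

```python
import random
from statistics import median


def bootstrap_interval(data, stat=median, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for stat(data).

    Resamples the data with replacement, recomputes the statistic each
    time, and reads the interval off the sorted resampled statistics.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    tail = (1 - level) / 2
    low = stats[int(tail * n_resamples)]
    high = stats[int((1 - tail) * n_resamples) - 1]
    return low, high


# Made-up measurements with one outlying value.
sample = [2.1, 3.4, 1.9, 5.6, 2.8, 3.1, 4.4, 2.5, 3.9, 12.0, 2.2, 3.0]
low, high = bootstrap_interval(sample)
print(median(sample), (low, high))
```

No closed form model of the data is assumed here: the randomness of the resampling stands in for the sampling distribution that classical inference would derive mathematically.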
Classical and nonparametric inference at this point are well understood, though important results are still steadily emerging in both. Meanwhile, the revolution in inference caused by the combination of decision theory and high speed computers is still unfolding.
The experience of interacting with data or the result of an analysis has generally meant producing a table or plot, and the principles of producing those graphics are well understood. That experience is no longer the whole of visualization, though. Interacting with the analysis process itself has become a huge visualization problem as more and more analysis is done by non-statisticians.
In many ways, visualization is the most organized of the four areas. Reading a few works covers the important parts of producing tables and plots. Edward Tufte’s four books[4] set forth the basic types and principles of statistical graphics, and Wilkinson’s The Grammar of Graphics[5] sets out a formalism for constructing plots by computer. Reading these five books is a completely adequate introduction to the subject.
Ubiquitous computers with graphical displays have produced a new visualization problem by enabling interactive experiences that blur the lines between doing analysis and consuming its results. These interactions range from simple filters and sorting in programs that capture and display information to general tools like pivot tables in Microsoft Excel or the various graphical interfaces for running one of a variety of hypothesis tests. This part of visualization is not well understood, though neighboring fields like information design and user experience provide the basic tools.
[1] John Tukey, Exploratory Data Analysis. (1977)
[2] Wald, Statistical Decision Functions. Wiley (1950)
[3] Student, “The probable error of a mean.” Biometrika (1908): 1-25.
[4] Edward Tufte, The Visual Display of Quantitative Information (1983), Envisioning Information (1990), Visual Explanations (1997), Beautiful Evidence (2006). Graphics Press.
[5] Leland Wilkinson, The Grammar of Graphics. Springer Science & Business Media (2006)