Making better inferences from statistical graphics

September 1, 2013  |  Edward Tufte

The rage-to-conclude bias sees patterns in data lacking such patterns. The bias leads to premature, simplistic, and false inferences about causality. Good statistical analysis seeks to calm down the rage to conclude, to align the reality of the evidence with the inferences made from that evidence. The calming-down process often involves statistical analysis sorting out the “random” from the “nonrandom,” the “significant” from the dreaded “not significant,” and the “explained” from the “unexplained” variance.

Alas, this statistical language is a swamp of confounding puns and double-talk that play on the differences between technical and everyday language. Perversely, these puns also play to the rage to conclude, as in “explained variance” derived from a cherry-picked, over-specified model on a small data set, which is very unlike actually explaining something.

In everyday work, the integrity of statistical practice is too often compromised. Researchers easily finesse standard statistical validation practices by posing ridiculous alternative hypotheses, by searching millions of fitted models and publishing one, and by little tilts in data analysis (and 3 or 4 little tilts will generate a publishable finding). The conventional practices for ensuring analytical integrity, as implemented in day-to-day research, have not headed off a vast literature of false findings. See John P. A. Ioannidis, “Why Most Published Research Findings Are False,” and the stunning recent report that only 6 of 53 “landmark studies” in clinical oncology could be replicated (one researcher turned out to have done an experiment 6 times, it “worked” once, and only the one-off was published).

Can our data displays be designed to reduce biases in viewing data, such as the rage to conclude, cherry-picking, seeing the nonrandom in the random, recency bias, and confirmation bias? How can visual inferences from data be improved to separate the real from the random, the explained from the unexplained? How can the inferential integrity of graphics viewing, as in exploratory data analysis, be improved?

Here are some practical ways to improve the quality of inferences made from data displays.

All statistical displays should be accompanied by a unique documentation box

For both consumers and producers of data displays, a fundamental task is to assess the credibility and integrity of graphical displays. The first-line quality-control mechanism for analytical integrity is documentation of the data sources and the process of data analysis.

The data documentation box accompanying the display should provide links to the data matrix used in the graphic display, the location of those data in the full data set, the names of the analysts responsible for choosing and constructing the graphic, and whether the project’s research design specified the graphic in advance (or was the published graphic an objet trouvé, a found object, such as Marcel Duchamp’s famous conceptual artwork Fountain, 1917?).
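As a sketch only (no plotting library currently offers such a feature, and every field name here is invented for illustration), the contents of a documentation box might be carried in a small structure like this:

```python
from dataclasses import dataclass

@dataclass
class DocumentationBox:
    """Hypothetical metadata printed alongside a statistical graphic --
    a sketch of the documentation box proposed in the text, not any
    library's actual API."""
    data_source: str               # link to the data matrix plotted
    location_in_full_dataset: str  # where those data sit in the full set
    analysts: list                 # who chose and constructed the graphic
    prespecified: bool             # was the graphic specified in advance?

    def render(self) -> str:
        status = ("pre-specified in the research design" if self.prespecified
                  else "post hoc (a found object)")
        return (f"Data: {self.data_source}\n"
                f"Subset: {self.location_in_full_dataset}\n"
                f"Analysts: {', '.join(self.analysts)}\n"
                f"Design: {status}")

box = DocumentationBox("https://example.org/data.csv",
                       "rows 1-365, columns 2-5",
                       ["A. Analyst"], prespecified=False)
print(box.render())
```

The `prespecified` flag does the anti-cherry-picking work: it forces the producer to state whether the graphic was planned or found.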


Some journals require documentation of the data analysis process. For example, from the British Journal of Sports Medicine:

Contributors: All authors had full access to the data in the study. JLV takes responsibility for the integrity of the data and the accuracy of the data analysis. He is guarantor. JLV and GNH designed the study. GNH, DWD, NO, JLV and LJC acquired the data. JLV, EAHW, LJC and GNH performed the analysis and interpreted the data. JLV drafted the manuscript, which was critically revised for intellectual content by all co-authors.

Note the assignment of personal responsibility: a guarantor of data integrity and of the accuracy of analysis. (The study’s finding was that watching a lot of television was correlated with a substantially shorter lifespan. However intuitively pleasing, that finding was hopelessly compromised by the use of nonexperimental data that confounded multiple causes and effects.)

A documentation box makes cherry-picking of evidence more obvious, and may thereby serve as a check on cherry-picking by researchers. A documentation box also allows reviewers of the study to assess the robustness of the results against the plausible alternative hypothesis that the findings were generated by data selection rather than by a fair reading of the overall data. Published data displays too often serve as research sanctification trophies, serving up pitches to the gullible, based on unstable, over-selected results.

All computer programs for producing statistical displays should routinely provide a documentation box, just as they now provide a plotting space, labels, and scales of measurement. Why didn’t those programs do so years ago?

Diluting Perceptual Cluster/Streak Bias: Informal, Inline, Interocular Trauma Tests

When people look at random number tables, they see all kinds of clusters and streaks (in a completely random set of data). Similarly, when people are asked to generate a random series of bits, they generate too few long streaks (such as 6 identical bits in a row), because their model of what is random greatly underestimates the amount of streakiness in truly random data.
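That underestimate is easy to check by simulation. In this sketch (a hypothetical check, not part of the original post), truly random coin flips almost always contain a streak of 6 or more identical outcomes within 100 flips:

```python
import random

def longest_run(bits):
    """Length of the longest streak of identical consecutive bits."""
    if not bits:
        return 0
    best = cur = 1
    for prev, nxt in zip(bits, bits[1:]):
        cur = cur + 1 if nxt == prev else 1
        best = max(best, cur)
    return best

random.seed(0)
trials, n = 10_000, 100
runs = [longest_run([random.randint(0, 1) for _ in range(n)])
        for _ in range(trials)]

# People asked to fake randomness rarely write a streak this long,
# yet genuine coin flips produce one almost every time.
share = sum(r >= 6 for r in runs) / trials
print(f"P(streak of 6+ in {n} fair flips) is about {share:.2f}")
```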

Sports and election reporters are notorious for their streak/cluster/momentum/turning-point/trendspotting narrative over-reach. xkcd did this wonderful critique:

To dilute streak-guessing, randomize on time over the same data, and compare random streaks with the observed data. Below, the top sparkline shows the season’s win-loss sequence (the little horizontal line = home games, no line = road games). Weighting by the overall record of wins/losses and by home/road effects yields ten random sparklines. It is hard to see the difference between the real and the random. The 10 random sparkline sequences can be regenerated again and again by, oddly enough, clicking on “Regenerate random seasons.”

The test of the 10 randomized sparklines vs. the actual data is an “Interocular Trauma Test” because the comparison hits the analyst right between the eyes. This little randomization check-up, which can be repeated again and again, is seen by the analyst at the very moment of making inferences based on a statistical graphic of observed data.

(Thanks to Adam Schwartz for his excellent work on randomized sparklines. ET)
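A minimal version of this randomization can be sketched in a few lines. The schedule, win rates, and “observed” season below are hypothetical stand-ins for the real team data behind the display:

```python
import random

def simulate_season(p_home, p_road, schedule, rng):
    """One random season: True = win. schedule holds 'H'/'R' flags,
    so home and road games get different win probabilities."""
    return [rng.random() < (p_home if g == 'H' else p_road) for g in schedule]

def longest_win_streak(results):
    best = cur = 0
    for won in results:
        cur = cur + 1 if won else 0
        best = max(best, cur)
    return best

rng = random.Random(2013)
schedule = [rng.choice('HR') for _ in range(82)]       # hypothetical schedule
observed = simulate_season(0.65, 0.45, schedule, rng)  # stand-in for real data

# Ten random seasons weighted by the same overall record and home/road
# effect -- the analyst compares their streaks with the observed sequence.
for i in range(10):
    season = simulate_season(0.65, 0.45, schedule, rng)
    print(f"random season {i}: longest win streak {longest_win_streak(season)}")
print(f"observed season:  longest win streak {longest_win_streak(observed)}")
```

Rerunning the loop regenerates fresh random seasons, the text-only analogue of clicking “Regenerate random seasons.”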

This is looking a bit like a bootstrap calculation. For the real and amazing bootstrap, applied to data graphics and contour lines, see Persi Diaconis and Bradley Efron, “Computer-Intensive Methods in Statistics.” For recent developments on identifying spurious random structures in statistical graphics, see the lovely paper by Hadley Wickham, Dianne Cook, Heike Hofmann, and Andreas Buja, “Graphical inference for infovis.”
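The core idea in that paper is the “lineup”: hide the plot of the real data among decoy plots generated under a null hypothesis, and see whether viewers can pick it out. A toy sketch, with invented data and a permutation null:

```python
import random

def lineup(real_data, null_generator, n_decoys=19, seed=None):
    """Return the real dataset shuffled in among null datasets, plus the
    secret position of the real one (withheld from the viewer)."""
    rng = random.Random(seed)
    panels = [null_generator(rng) for _ in range(n_decoys)]
    pos = rng.randrange(n_decoys + 1)
    panels.insert(pos, real_data)
    return panels, pos

# Toy data with a genuine x-y trend; each null permutes y, destroying the
# relationship while keeping the marginal distribution intact.
rng0 = random.Random(0)
x = list(range(20))
y = [0.5 * xi + rng0.gauss(0, 1) for xi in x]

def null_gen(rng):
    ys = y[:]
    rng.shuffle(ys)
    return list(zip(x, ys))

panels, secret = lineup(list(zip(x, y)), null_gen, seed=42)
print(f"{len(panels)} panels; the real data is hidden at one position")
```

If a viewer who has not seen `secret` reliably points at the real panel, the trend is unlikely to be an artifact of the rage to conclude.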


Edmond Murphy in 1964 wrote about dubious inferences of two underlying causal mechanisms based on staring at bimodal distributions (from Edward Tufte, The Visual Display of Quantitative Information, p. 169):

[image: bimodal distributions, from The Visual Display of Quantitative Information, p. 169]

Sparklines can help reduce recency bias, encourage being approximately right rather than exactly wrong, and provide high-resolution context

From Edward Tufte, Beautiful Evidence, pages 50-51.

Sparklines have obvious applications for financial and economic data—by tracking and comparing changes over time, by showing overall trend along with local detail. Embedded in a data table, this sparkline depicts an exchange rate (dollar cost of one euro) for every day for one year:

[image: sparkline of the dollar cost of one euro, daily for one year, embedded in a data table]

Colors help link the sparkline with the numbers: red = the oldest and newest rates in the series; blue = yearly low and high for daily exchange rates. Extending this graphic table is straightforward; here, the price of the euro versus 3 other currencies for 65 months and for 12 months:

[image: the price of the euro versus 3 other currencies, for 65 months and for 12 months]

Daily sparkline data can be standardized and scaled in all sorts of ways depending on the content: by the range of the price, inflation-adjusted price, percent change, percent change off of a market baseline. Thus multiple sparklines can describe the same noun, just as multiple columns of numbers report various measures of performance. These sparklines reveal the details of the most recent 12 months in the context of a 65-month daily sequence (shown in the fractal-like structure below).
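The standardizations mentioned above are all simple transformations of the same daily series. A sketch with hypothetical prices (indexing to the first day, daily percent change, and scaling to the range are shown; an inflation adjustment or market baseline would just be another such transform):

```python
# Hypothetical daily euro prices; each transformation below could feed
# its own sparkline describing the same noun.
prices = [1.02, 1.05, 1.01, 1.08, 1.12, 1.09]

base = prices[0]
indexed = [p / base * 100 for p in prices]                  # first day = 100
pct_change = [(b - a) / a * 100
              for a, b in zip(prices, prices[1:])]          # daily % change
lo, hi = min(prices), max(prices)
ranged = [(p - lo) / (hi - lo) for p in prices]             # scaled to range

print(f"indexed last day:   {indexed[-1]:.1f}")
print(f"first daily change: {pct_change[0]:+.2f}%")
```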

Consuming a horizontal length of only 14 letter spaces, each sparkline in the big table above provides a look at the price and the changes in price for every day for years, and the overall time pattern. This financial table reports 24 numbers accurate to 5 significant digits; the accompanying sparklines show about 14,000 numbers readable from 1 to 2 significant digits. The idea is to be approximately right rather than exactly wrong. 1

By showing recent change in relation to many past changes, sparklines provide a context for nuanced analysis—and, one hopes, better decisions. Moreover, the year-long daily history reduces recency bias, the persistent and widespread over-weighting of recent events in making decisions. Tables sometimes reinforce recency bias by showing only current levels or recent changes; sparklines improve the attention span of tables.

Tables of numbers attain maximum densities of only 300 characters per square inch or 50 characters per square centimeter. In contrast, graphical displays have far greater resolutions; a cartographer notes “the resolving power of the eye enables it to differentiate to 0.1 mm where provoked to do so.” 2  Distinctions at 0.1 mm mean 250 per linear inch, which implies 60,000 per square inch or 10,000 per square centimeter, which is plenty.

[image: the most recent 12 months in the context of the 65-month daily sequence]

1 On being “approximately right rather than exactly wrong,” see John W. Tukey, “The Technical Tools of Statistics,” American Statistician, 19 (1965), 23-28.

2 D.P. Bickmore, “The Relevance of Cartography,” in J.C. Davis and M.J. McCullagh, eds., Display and Analysis of Spatial Data (London, 1975), 331.

Here is a conventional financial table comparing various return rates of 10 popular mutual funds: 1

[image: conventional financial table comparing return rates of 10 mutual funds]

This is a common display in data analysis: a list of nouns (mutual funds, for example) along with some numbers (assets, changes) that accompany the nouns. The analyst’s job is to look over the data matrix and then decide whether or not to go crazy—or at least to make a decision (buy, sell, hold) about the noun based on the data. But along with the summary clumps of tabular data, let us also look at the day-to-day path of prices and their changes for the entire last year. Here is the sparkline table: 2

[image: sparkline table of the 10 mutual funds, daily prices for one year]

Astonishing and disconcerting, the finely detailed similarities of these daily sparkline histories are not all that surprising, after the fact anyway. Several funds use market index-tracking or other copycat strategies, and all the funds are driven daily by the same amalgam of external forces (news, fads, economic policies, panics, bubbles). Of the 10 funds, only the unfortunately named PIMCO, the sole bond fund in the table, diverges from the common pattern of the 9 stock funds, as seen by comparing PIMCO’s sparkline with the stacked pile of 9 other sparklines below.

In newspaper financial tables, down the deep columns of numbers, sparklines can be added to tables set at 8 lines per inch (as in our example above). This yields about 160 sparklines per column, or 400,000 additional daily graphical prices and their changes per 5-column financial page. Readers can scan the sparkline tables, making simultaneous multiple comparisons, searching for nonrandom patterns in the random walks of prices.

[image: PIMCO sparkline compared with the stacked sparklines of the 9 stock funds]

1  “Favorite Funds,” The New York Times, August 10, 2003, p. 3-1.

2  In our redesigned table, the typeface Gill Sans does quite well compared to the Helvetica in the original Times table. Smaller than the Helvetica, the Gill Sans appears sturdier and more readable, in part because of the increased white space that results from its smaller x-height and reduced size. The data area (without column labels) for our sparkline table is only 21% larger than the original’s data area, and yet the sparklines provide an approximate look at 5,000 more numbers.

Topics: E.T., Science