Making better inferences from statistical graphics

Edward Tufte
The rage-to-conclude bias sees patterns in data lacking such patterns.
The bias leads to premature, simplistic, and false inferences about causality.
Good statistical analysis seeks to calm down the rage to conclude,
to align the reality of the evidence with the inferences made from that evidence.
The calming down process often involves statistical analysis sorting out
the “random” from the “nonrandom,” the “significant” from the dreaded
“not significant,” and the “explained” from the “unexplained” variance.
Alas, this statistical language is a swamp of confounding puns and double-talk
that play on the differences between technical and everyday language.
Perversely, these puns also play to the rage to conclude, as in “explained
variance” derived from a cherry-picked, over-specified model on a small
data set, which is very unlike actually explaining something.
In everyday work, the integrity of statistical practice is too often compromised.
Researchers easily finesse standard statistical validation practices by posing
ridiculous alternative hypotheses, by searching millions of fitted models and
publishing one, and by little tilts in data analysis (and 3 or 4 little tilts will
generate a publishable finding).
The conventional practices for ensuring analytical integrity, as implemented in
day-to-day research, have not headed off a vast literature of false findings.
See John P. A. Ioannidis, “Why Most Published Research Findings Are False,”
and the stunning recent report that only 6 of 53 “landmark studies” in preclinical
oncology could be replicated (one researcher turned out to have done an
experiment 6 times, it “worked” once, and only the one-off was published).
Can our data displays be designed to reduce biases in viewing data,
such as the rage to conclude, cherry-picking, seeing the nonrandom in
the random, recency bias, confirmation bias? How can visual inferences
from data be improved to separate the real from the random, the
explained from the unexplained? How can the inferential integrity of
graphics viewing, as in exploratory data analysis, be improved?
Here are some practical ways to improve the quality of inferences
made from data displays.
All statistical displays should be accompanied by a unique documentation box
For both consumers and producers of data displays, a fundamental task is to assess
the credibility and integrity of graphical displays. The first-line quality control
mechanism for analytical integrity is documentation of the data sources and the
process of data analysis.
The data documentation box accompanying the display should provide links to the
data matrix used in the graphic display, the location of those data in the full data set,
names of analysts responsible for choosing and constructing the graphic, and whether
the project’s research design specified the graphic in advance (or was the published
graphic an objet trouvé, a found object, such as Marcel Duchamp’s famous conceptual
artwork Fountain, 1917?).
Some journals require documentation of the data analysis process. For example, from
the British Journal of Sports Medicine:
Contributors: All authors had full access to the data in the study. JLV takes responsibility
for the integrity of the data and the accuracy of the data analysis. He is guarantor.
JLV and GNH designed the study. GNH, DWD, NO, JLV and LJC acquired the data.
JLV, EAHW, LJC and GNH performed the analysis and interpreted the data.
JLV drafted the manuscript, which was critically revised for intellectual content
by all co-authors.
Note the assignment of personal responsibility, a guarantor of data integrity/accuracy
of analysis. (The study’s finding was that watching television a lot was correlated
with a substantially shorter lifespan. However intuitively pleasing, that finding
was hopelessly compromised by use of nonexperimental data that confounded
multiple causes and effects.)
A documentation box makes cherry-picking of evidence more obvious, and may
thereby serve as a check on cherry-picking by researchers. A documentation box
also allows reviewers of the study to assess the robustness of the results against
the plausible alternative hypothesis that the findings were generated by data-selection
rather than by a fair reading of the overall data. Published data displays too often
serve as research sanctification trophies, pitches to the gullible based on
unstable, over-selected results.
All computer programs for producing statistical displays should routinely provide
a documentation box, just as they now provide a plotting space, labels, and scales
of measurement. Why didn’t those programs do so years ago?
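As a sketch of what such a routine facility might look like, here is a minimal example in Python, assuming matplotlib; the helper add_documentation_box and its field names are hypothetical, chosen to mirror the documentation-box contents described above.

```python
import matplotlib.pyplot as plt

def add_documentation_box(fig, *, data_source, data_location, analysts, prespecified):
    """Hypothetical helper: attach a documentation box beneath a figure.

    The fields follow the documentation-box contents suggested above;
    matplotlib itself provides no such built-in facility.
    """
    lines = [
        f"data source: {data_source}",
        f"location in full data set: {data_location}",
        f"analysts: {', '.join(analysts)}",
        f"graphic specified in research design: {'yes' if prespecified else 'no'}",
    ]
    fig.subplots_adjust(bottom=0.30)            # make room below the plot
    fig.text(0.02, 0.02, "\n".join(lines),
             fontsize=7, family="monospace", va="bottom")

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [2.1, 1.4, 3.0, 2.6])
add_documentation_box(
    fig,
    data_source="https://example.org/survey.csv",   # placeholder URL
    data_location="rows 120-180 of the full file",  # placeholder location
    analysts=["A. Analyst"],
    prespecified=True,
)
plt.show()
```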
Diluting Perceptual Cluster/Streak Bias:
Informal, Inline, Interocular Trauma Tests
When people look at random number tables, they see all kinds of clusters
and streaks (in a completely random set of data). Similarly, when people are
asked to generate a random series of bits, they generate too few long streaks
(such as 6 identical bits in a row), because their model of what is random
greatly underestimates the amount of streakiness in truly random data.
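A small simulation makes the point; this sketch (Python, with an arbitrary seed) estimates how often a run of 6 or more identical bits appears in 100 random bits:

```python
import random

def longest_run(bits):
    """Length of the longest run of identical consecutive bits."""
    best = run = 1
    for prev, cur in zip(bits, bits[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

random.seed(1)
trials = 10_000
hits = sum(
    longest_run([random.randint(0, 1) for _ in range(100)]) >= 6
    for _ in range(trials)
)
# a run of 6+ identical bits shows up in roughly 80% of
# 100-flip sequences, far more often than intuition suggests
print(f"P(run of 6+ in 100 random bits) = {hits / trials:.2f}")
```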
Sports and election reporters are notorious for their
streak/cluster/momentum/turning-point/trendspotting
narrative over-reach. xkcd did a wonderful critique of this.
To dilute streak-guessing, randomize on time over the same data,
and compare random streaks with the observed data.
Below, the top sparkline shows the season’s win-loss sequence
(the little horizontal line = home games, no line = road games).
Weighting by overall record of wins/losses and home/road effects
yields ten random sparklines. Hard to see the difference between
real and random.
The 10 random sparkline sequences can be regenerated again and
again by, oddly enough, clicking on “Regenerate random seasons.”
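The randomization behind such a display might be sketched as follows (Python); the permutation-within-home/road scheme here is one plausible reading of “weighting by overall record of wins/losses and home/road effects,” not necessarily the scheme Adam Schwartz implemented:

```python
import random

def random_season(results, homes):
    """One randomized season: permute home-game results among home games
    and road-game results among road games, preserving the overall
    win-loss record and any home/road effect."""
    home_results = [r for r, h in zip(results, homes) if h]
    road_results = [r for r, h in zip(results, homes) if not h]
    random.shuffle(home_results)
    random.shuffle(road_results)
    home_iter, road_iter = iter(home_results), iter(road_results)
    return [next(home_iter) if h else next(road_iter) for h in homes]

# a made-up season: 1 = win, 0 = loss; True = home game
results = [1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
homes = [True, True, False, False, True, False, True, False,
         True, True, False, False, True, False, True, False]

for _ in range(10):   # ten random seasons to set beside the real one
    print("".join("W" if r else "L" for r in random_season(results, homes)))
```

Each call produces a fresh random season, which is all that “Regenerate random seasons” needs to do; drawing the shuffled sequences as sparklines then invites the eye-level comparison with the observed season.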
The test of the 10 randomized sparklines vs. the actual data is an
“Interocular Trauma Test” because the comparison hits the analyst right
between the eyes. This little randomization check-up, which can be repeated
again and again, is seen by the analyst at the very moment of making
inferences based on a statistical graphic of observed data.
(Thanks to Adam Schwartz for his excellent work on randomized sparklines. ET)
This is looking a bit like a bootstrap calculation. For the real and amazing
bootstrap, applied to data graphics and contour lines, see Persi Diaconis
and Bradley Efron, “Computer-Intensive Methods in Statistics,” Scientific American (May 1983).
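For readers who have not met it, the plain bootstrap resamples the observed data with replacement and recomputes the statistic each time. A minimal sketch in Python, with made-up data (this is the basic method, not the Diaconis-Efron application to contour lines):

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000):
    """Bootstrap distribution of the sample mean: resample the
    observed data with replacement, recompute the mean each time."""
    n = len(data)
    return [statistics.mean(random.choices(data, k=n))
            for _ in range(n_resamples)]

data = [2.1, 3.4, 2.9, 5.6, 4.2, 3.8, 2.7, 4.9]   # made-up observations
means = sorted(bootstrap_means(data))
# crude 90% interval from the 5th and 95th percentiles
print(f"mean {statistics.mean(data):.2f}, "
      f"90% bootstrap interval ({means[50]:.2f}, {means[949]:.2f})")
```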
For recent developments on identifying spurious random structures
in statistical graphics, see the lovely paper by Hadley Wickham, Dianne Cook,
Heike Hofmann, and Andreas Buja, “Graphical inference for infovis,”
IEEE Transactions on Visualization and Computer Graphics (2010).
In 1964 Edmond Murphy wrote about dubious inferences of two underlying causal
mechanisms based on staring at bimodal distributions (see Edward Tufte,
The Visual Display of Quantitative Information, p. 169).
Sparklines can help reduce recency bias, encourage being approximately right rather than exactly wrong, and provide high-resolution context
From Edward Tufte, Beautiful Evidence, pages 50-51.
Sparklines have obvious applications for financial and economic data—by tracking and comparing changes over time, by showing overall trend along with local detail. Embedded in a data table, this sparkline depicts an exchange rate (dollar cost of one euro) for every day for one year:
Colors help link the sparkline with the numbers: red = the oldest and newest rates in the series; blue = yearly low and high for daily exchange rates. Extending this graphic table is straightforward; here, the price of the euro versus 3 other currencies for 65 months and for 12 months:
Daily sparkline data can be standardized and scaled in all sorts of ways depending on the content: by the range of the price, inflation-adjusted price, percent change, percent change off of a market baseline. Thus multiple sparklines can describe the same noun, just as multiple columns of numbers report various measures of performance. These sparklines reveal the details of the most recent 12 months in the context of a 65-month daily sequence (shown in the fractal-like structure below).
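Each of those rescalings is a simple transformation of the raw series; here is a sketch (Python, with made-up numbers) of two of them, daily percent change and percent change off a market baseline:

```python
def percent_change(series):
    """Day-to-day percent change of a price series."""
    return [100.0 * (b - a) / a for a, b in zip(series, series[1:])]

def change_off_baseline(prices, baseline):
    """Daily percent change net of a market baseline (e.g., an index)."""
    return [p - b for p, b in
            zip(percent_change(prices), percent_change(baseline))]

euro = [1.1020, 1.1135, 1.1089, 1.1240, 1.1198]   # made-up daily rates
index = [100.0, 100.8, 100.5, 101.6, 101.2]       # hypothetical baseline
print([f"{c:+.2f}%" for c in percent_change(euro)])
print([f"{c:+.2f}%" for c in change_off_baseline(euro, index)])
```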
Consuming a horizontal length of only 14 letter spaces, each sparkline in the big table above provides a look at the price and the changes in price for every day for years, and the overall time pattern. This financial table reports 24 numbers accurate to 5 significant digits; the accompanying sparklines show about 14,000 numbers readable from 1 to 2 significant digits. The idea is to be approximately right rather than exactly wrong. 1
By showing recent change in relation to many past changes, sparklines provide a context for nuanced analysis—and, one hopes, better decisions. Moreover, the year-long daily history reduces recency bias, the persistent and widespread over-weighting of recent events in making decisions. Tables sometimes reinforce recency bias by showing only current levels or recent changes; sparklines improve the attention span of tables.
Tables of numbers attain maximum densities of only 300 characters per square inch or 50 characters per square centimeter. In contrast, graphical displays have far greater resolutions; a cartographer notes “the resolving power of the eye enables it to differentiate to 0.1 mm where provoked to do so.” 2 Distinctions at 0.1 mm mean 250 per linear inch, which implies 62,500 per square inch or 10,000 per square centimeter, which is plenty.
1 On being “approximately right rather than exactly wrong,” see John W. Tukey, “The Technical Tools of Statistics,” American Statistician, 19 (1965), 23-28.
2 D.P. Bickmore, “The Relevance of Cartography,” in J.C. Davis and M.J. McCullagh, eds., Display and Analysis of Spatial Data (London, 1975), 331.
Here is a conventional financial table comparing various return rates of 10 popular mutual funds: 1
This is a common display in data analysis: a list of nouns (mutual funds, for example) along with some numbers (assets, changes) that accompany the nouns. The analyst’s job is to look over the data matrix and then decide whether or not to go crazy—or at least to make a decision (buy, sell, hold) about the noun based on the data. But along with the summary clumps of tabular data, let us also look at the day-to-day path of prices and their changes for the entire last year. Here is the sparkline table: 2
Astonishing and disconcerting, the finely detailed similarities of these daily sparkline histories are not all that surprising, after the fact anyway. Several funds use market index-tracking or other copycat strategies, and all the funds are driven daily by the same amalgam of external forces (news, fads, economic policies, panics, bubbles). Of the 10 funds, only the unfortunately named PIMCO, the sole bond fund in the table, diverges from the common pattern of the 9 stock funds, as seen by comparing PIMCO’s sparkline with the stacked pile of 9 other sparklines below.
In newspaper financial tables, down the deep columns of numbers, sparklines can be added to tables set at 8 lines per inch (as in our example above). This yields about 160 sparklines per column, or 800 per 5-column financial page; at roughly 250 daily prices and 250 daily changes apiece, that is some 400,000 additional graphical numbers per page. Readers can scan the sparkline tables, making simultaneous multiple comparisons, searching for nonrandom patterns in the random walks of prices.
1 “Favorite Funds,” The New York Times, August 10, 2003, p. 3-1.
2 In our redesigned table, the typeface Gill Sans does quite well compared to the Helvetica in the original Times table. Smaller than the Helvetica, the Gill Sans appears sturdier and more readable, in part because of the increased white space that results from its smaller x-height and reduced size. The data area (without column labels) for our sparkline table is only 21% larger than the original’s data area, and yet the sparklines provide an approximate look at 5,000 more numbers.