This question concerns how to reduce information in 4 dimensions to a representation (or set of representations) in 2 dimensions.
For background, consider the economic duopoly situation, with two firms producing non-identical products. A standard textbook way of picturing the solution is to construct "Reaction Curves" RC1 and RC2, showing firm #1's optimal quantity response Q1 to the other's entire range of possible quantity choices Q2, and vice versa. The Cournot-Nash solution is at the intersection of the two Reaction Curves thus constructed, where each firm's choice is a best response to the other's. (The associated prices are not pictured, but are implicitly determined by the market demand function for the two products.)
Now here's the question. Suppose the firms can vary not only quantities produced (Q1 and Q2) but also the nature of their products -- measured as V1 and V2 on some numerical scale.
Instead of the previous 2-dimensional space on Q1,Q2 axes there is now a 4-dimensional space on Q1,Q2, V1,V2 axes. In this space the previous Reaction Curves now become Reaction Surfaces: firm #1 chooses a profit-maximizing (Q1,V1) vector in response to any (Q2,V2) choice on the part of its rival, and vice versa. As before, the Cournot-Nash condition will be met at the intersection, so that the firms' choices are best responses to one another.
I believe the two Reaction Surfaces will intersect generically at a single point if all the functions are ideally well-behaved, or at any rate at a finite number of points.
But is there any way of representing this solution that would permit some intuitive glimpse of the process? I've been playing without success with linked pairs of 2-dimensional diagrams. For example one diagram on Q1,Q2 axes and the other on V1,V2 axes.
-- Jack Hirshleifer (email)
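As a concrete baseline for the two-dimensional case in the question above, here is a minimal sketch of two Reaction Curves and their intersection. It assumes linear inverse demand for differentiated products, P_i = a - b*Q_i - d*Q_j, and a constant marginal cost c. These functional forms, the parameter values, and the function names are illustrative assumptions and do not come from the original post.

```python
def best_response(q_other, a=100.0, b=1.0, d=0.5, c=10.0):
    """Profit-maximizing quantity for one firm given the rival's quantity.

    Assumes inverse demand P_i = a - b*Q_i - d*Q_j and constant marginal
    cost c (illustrative assumptions). Setting d(profit)/dQ_i = 0 gives
    Q_i = (a - c - d*Q_j) / (2*b), floored at zero.
    """
    return max(0.0, (a - c - d * q_other) / (2.0 * b))


def cournot_nash(tol=1e-10, max_iter=10_000, **params):
    """Find the intersection of the two reaction curves by iterating
    best responses from (0, 0) until neither firm wants to change."""
    q1, q2 = 0.0, 0.0
    for _ in range(max_iter):
        new_q1 = best_response(q2, **params)
        new_q2 = best_response(q1, **params)
        if abs(new_q1 - q1) < tol and abs(new_q2 - q2) < tol:
            return new_q1, new_q2
        q1, q2 = new_q1, new_q2
    raise RuntimeError("best-response iteration did not converge")


q1_star, q2_star = cournot_nash()
# Closed form for this symmetric case: Q* = (a - c) / (2*b + d)
closed_form = (100.0 - 10.0) / (2.0 * 1.0 + 0.5)
print(q1_star, q2_star, closed_form)
```

In the four-dimensional version of the problem, each firm's best response would return a (Q, V) pair rather than a single quantity, but the fixed-point idea is the same.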
If the material involves surfaces, work on 4D visualization is definitely worth looking at; for example, Thomas F. Banchoff, Beyond the Third Dimension: Geometry, Computer Graphics, and Higher Dimensions (1990). Tom is in the math department at Brown University and you might pose your question directly to him.
See Paul A. Tukey and John W. Tukey, "Data-Driven View Selection: Agglomeration and Sharpening," in Vic Barnett, ed., Interpreting Multivariate Data (1981). They show X1 on X2 scatterplots in a scatterplot matrix that simultaneously conditions on X3 and X4. This example uses empirical data although the architecture might also be relevant for theoretical curves. An example from Tukey and Tukey is reproduced in my The Visual Display of Quantitative Information, p.114.
See also the example of 3D objects placed within a 2D matrix, in my Visual Explanations on page 80, and note the possibility of parallel matrices.
Or lose a dimension; any sort of collinearity at times here?
What makes this especially difficult is that none of the variables are space or time. Nobody said that 4 simultaneous dimensions was going to be easy!
-- Edward Tufte
Would it be meaningful to reduce the variables to two "dimensionless" variables, Q1/Q2 and V1/V2? Or, if V1 and V2 are rather abstract, leave them alone and just reduce Q1 and Q2 to Q1/Q2; then a family of curves might be generated.
-- Kelly Runyon (email)
How about a property map that compares the two companies' output along two dimensions that describe the product? If the products are identical they overlap, and if they differ in the two properties they are separated in 2D. Then superimpose the products on a demand contour, which would show the volume aspect. That way the demand aspect, which is orthogonal to the product-description aspect, could separate the nature of the two companies. For example, suppose one company makes a very hard and long product that satisfies the market very well. It would plot in the upper right-hand corner, over a demand contour that is at a peak. Likewise, if the other company makes a soft but long product, it would plot in a different place in 2D, but over a demand contour that peaks at a level perhaps higher or lower than the first company's peak. Naturally the first company's data would be in blue and the second's in red, two other orthogonal values. Questions?
-- CLSims (email)
See visual models of classical thermodynamics, in particular, the Gibbs Models at http://www.public.iastate.edu/~jolls/homepage.html
The site shows visualizations of thermodynamic surfaces, with work by Kenneth R. Jolls and Daniel C. Coy at Iowa State University.
-- Edward Tufte
This isn't, or at least needn't be, a matter of intersecting surfaces. Assume that Firm #1 wants to depict its optimal responses to all actions of Firm #2. A single surface plot in perspective will do. The floor of a cube represents the Q2*V2 plane. Plot a surface where the vertical dimension represents Q1, and encode V1 as the surface's color (or vice versa, although V may not be quantitative whereas Q definitely is). A complementary graph would depict the response of Firm #2 to #1 (surface Q2*V2 in space Q1*V1).
If the graphic tool allows for transparency, you could plot both the Q and V responses as separate translucent interposed surfaces, but this would be hard to interpret unless the viewer could rotate the image in 3D.
If you really want a 2D graph, not just a 2D representation, then use an iconographic approach: X and Y axes are Q2 and V2, and in this plane you depict discrete Q1&V1 combinations with a tiny glyph that depicts both values simultaneously (e.g., an oval where width = Q1 and height = V1). If there are only few values on the X & Y axes, then instead of glyphs you could use what Tufte calls "small multiples," a matrix of little Reaction Curve graphs.
Another reader's suggestion to use the dimensionless Q1/Q2 seems insufficient, as more than one combination of values can produce the same ratio: e.g., 10/30 = 200/600, but the component quantities are quite different. If you plot a point for Q1/Q2 = .33 in a V1-by-V2 plane, the viewer won't know whether this is in response to Q2=30 or Q2=600 (or infinitely many other possibilities); because while the response to Q2=30 is Q1=10, the response to Q2=600 might be something other than Q1=200.
-- Brian O'Hearn (email)
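The point about ratios losing information can be checked with a few lines of arithmetic. This is only an illustration: the first two (Q1, Q2) pairs are the ones from the post above, and the third is a hypothetical addition.

```python
from fractions import Fraction

# (Q1, Q2) pairs: the first two come from the post, the third is hypothetical.
pairs = [(10, 30), (200, 600), (1, 3)]

# Exact ratios, so no floating-point rounding can hide or create differences.
ratios = {Fraction(q1, q2) for q1, q2 in pairs}

# Three distinct quantity pairs collapse to a single value of Q1/Q2.
print(len(pairs), "pairs ->", len(ratios), "distinct ratio(s):", ratios)
```

A display built on Q1/Q2 alone would therefore plot all three situations at the same position.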
I know this is a very late response ... but the question remains posted, so here's a thought: It seems to me that this may be a 3D or a 3 1/2 dimensional problem (RCx, Qx, Vx). Am currently playing around with "heat maps" in which shared variance vs mean are plotted on the x-y coordinates, and color intensity of the data points indicates process variation. Displayed on a web page, a mouse-over can popup a tool-tip which provides additional information for each data point (the 1/2 dimension, if you will). A glimpse at the map is intended to quickly show which processes are in control (variation), which are important (shared variance), and which are performing well (mean). Additionally, all data have been converted to percentages so that relative comparisons (among data points and statistics) are easily seen and understood (raw data are available in another table).
This 3D example may or may not serve you, but the idea of a heat map may spark an idea.
-- George Chynoweth (email)
I have had some success with 2D visualization of 4D information, using 6 2D views. Each view is comprised of a pair of dimensions. The key to making it work is making it dynamic. If you can rotate your set of data points in one dimension-pair (I was using geometric figures, but you could use small iconography as described above) then you can see what happens to them in the other 5 sets (4 dims taken 2 at a time) of dimension pairs. No doubt the reason the original writer didn't experience success was because he was only operating with one set of linked pairs, and couldn't rotate the dimensions dynamically.
Of course, this approach probably works best if your dimensions are all of the same type, for my case spatial. Still, I'd think that for data not so orthogonal, this dynamic small multiple approach would have a good chance of showing the relationship he thinks could exist.
-- George Girton (email)
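The six views mentioned above are simply every way of choosing 2 dimensions out of 4. A minimal sketch, assuming the four dimensions are named after the variables in the original question and using a made-up data point:

```python
from itertools import combinations

# Dimension names borrowed from the original question (an assumption;
# the post above used four spatial dimensions).
dims = ["Q1", "Q2", "V1", "V2"]

# Every unordered pair of dimensions: 4 choose 2 = 6 views.
views = list(combinations(dims, 2))


def project(point, view, dims=dims):
    """Return the 2D coordinates of a 4D point in one dimension-pair view."""
    i, j = dims.index(view[0]), dims.index(view[1])
    return point[i], point[j]


point = (10.0, 30.0, 0.4, 0.7)  # hypothetical (Q1, Q2, V1, V2)
panels = {view: project(point, view) for view in views}

for view, xy in panels.items():
    print(view, xy)
```

Linking the panels, so that moving points in one view updates the other five, is what the post describes as the key step; this sketch covers only the static projection.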
What are the best ways to display multi-year time information on single plan, elevation, and section 2D photos, plans, or diagrams? For example, what is the best way to show progressive development on a single "bird's eye" view/aerial photo or topo map? Thanks. J.D.
-- J. D. McCubbin (email)
-- Edward Tufte
After attending the recent course in Manhattan, I've been implementing a number of changes in the way that my bank represents data, including finding it actually very easy to generate sparklines within Excel (all of our work is Excel-centric). On the issue of using bivariate x-y plots as a better description of cause/effect, the problem is to show how the relationship has progressed through time. The solution I've used - and implemented very easily in Excel - is to set the fill colour on the earliest data point to white and the penultimate point to black, with all intermediate points becoming increasingly dark. The very last point (significant in my area, as this needs to highlight where we are currently) is indicated as a red data point. I don't know how I can post the example here but it does seem to work very well.
-- Will Oswald (email)
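The shading scheme described above can be sketched outside Excel as a function that returns one fill colour per point. The linear darkening, the hex colour format, and the function name are assumptions for illustration; the post does not say how the intermediate shades were spaced.

```python
def time_shades(n_points, last_color="#ff0000"):
    """Return one hex fill colour per point, oldest first.

    The earliest point is white, the penultimate point is black, the points
    in between darken linearly, and the final point gets last_color.
    """
    if n_points < 1:
        return []
    if n_points == 1:
        return [last_color]
    n_gray = n_points - 1  # every point except the last one
    shades = []
    for k in range(n_gray):
        # Fraction of the way from white (0.0) to black (1.0).
        t = k / (n_gray - 1) if n_gray > 1 else 1.0
        level = round(255 * (1.0 - t))
        shades.append(f"#{level:02x}{level:02x}{level:02x}")
    return shades + [last_color]


print(time_shades(5))
```

The resulting list can be passed as per-point colours to whatever plotting tool is in use.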
I designed some risk charts for use in cardiovascular prevention in Europe. They use principles that will be well-recognised by anyone familiar with Tufte's work, and present, on a flat surface, risk of cardiovascular disease by sex, age, cholesterol, blood pressure and smoking. That's 6 dimensions. They also present risk numerically and using a green-to-red colour scheme. By the way, can anyone figure out how to add an extra dimension without making the result cumbersome? I would like to show risk for diabetics and non-diabetics. (Never satisfied...) A small skuzzy version of the charts is at http://www.escardio.org/initiatives/prevention/SCORE+Risk+Charts.htm and you can see how they are used in the actual guidelines http://www.escardio.org/NR/rdonlyres/E2EA65FA-93EC-4C03-AB1D-1D944AFA1690/0/cvdprevention_ex_sum.pdf
-- Ronan Conroy (email)
The plots by Hirshleifer are wonderful because they do (at least) two things. First, they present highly multivariate data in a high-resolution, easy-to-read format.
Second, they demonstrate two design points that I have come to understand after doing similar high-dimension plots for the last couple of years. The first principle is that color, and shades and intensities thereof, seems to work better for showing response variables than for showing design, or predictor, variables. Note that the color shade gives the predicted risk of fatal CVD (the response) while the design variables (sex, smoking status, cholesterol, age, blood pressure) form the grid. Perhaps this point is explicitly mentioned in ET's books; perhaps it is only implied. So, point 1 is "(typically) use color for response, not design, variables", I think.
The second principle then follows beautifully from Hirshleifer's plots once one sees the design versus response partition: in data displays, be they tables or graphs or whatever, THE DISPLAY IS THE MODEL, with emphasis on the word "is". Hirshleifer's plots show the increasing risk of fatal CVD as one moves from women to men, from young to old, from non-smoker to smoker, from low to high blood pressure. The consistency in the color stripes, from upper left to lower right, drives this home. Based on this consistency, although I didn't read the paper, I can guess the form of the model from which the response numbers are generated.
Minard's plot of Napoleon's march to (and from) Moscow tells us that the size of the army is a function of latitude and longitude, date, direction of travel, temperature, and (in the case of the river) land forms.
Understanding, and making inference from, data requires understanding the sources of variability in the data; understanding requires a model, at least a mental one. The typical time series plot fails to generate understanding because it implies that the only source of change in the response is the unrelenting, impassionate drumbeat of time. Show cause. Show effect.
The data display IS the model.
-- Rafe Donahue (email)
My bad. Of course, I mean the plots by Ronan Conroy. I was looking at the original post, not the (at the time) most recent response.
"We apologize for any inconvenience this may have caused."
-- Rafe Donahue (email)
OK, I'm on a roll here so I'll just keep making comments.
I'm not sure that I agree or disagree with Mr Simmon. Dividing the squares works great if the comparison of interest is to compare diabetic with nondiabetic while controlling for the other factors. I think, however, that as a 'look-up' table/display, this end might be better served by just making two versions of Conroy's Figure 1, one on top, say, for the diabetics and one on the bottom for the nondiabetics. Then we would have a grid of grids of grids: major rows are diabetes status, major columns are sex, minor rows are age groups, minor columns are smoking status, elemental rows are blood pressure, elemental columns are cholesterol, and color is the response! Whee!
One might note that Conroy already has a dimension that he hasn't mentioned: risk region of Europe (low versus high). This is shown in Figure 1 versus Figure 2.
Of course, it is certainly possible to "transpose" all these dimensions around so as to alter the fundamental or elemental comparisons. Unless there is a reason not to do so, I think I like the design variables with the largest numbers of support points further along in the dimension hierarchy. Maybe this idea means that diabetic status (yes versus no) belongs more toward the "outside" of the plot. Yes, I think so: Don't split the squares; make diabetes the major rows!
-- Rafe Donahue (email)
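The "grid of grids of grids" above amounts to nesting the design variables so that each combination of levels lands in exactly one cell of a flat table. A minimal sketch follows; the number of levels per variable is a hypothetical choice for illustration and is not taken from Conroy's charts.

```python
from itertools import product

# Levels per design variable (hypothetical counts, for illustration only).
LEVELS = {
    "diabetes": 2, "sex": 2,                 # major rows, major columns
    "age": 5, "smoking": 2,                  # minor rows, minor columns
    "blood_pressure": 4, "cholesterol": 5,   # elemental rows, elemental columns
}


def cell_position(diabetes, sex, age, smoking, blood_pressure, cholesterol):
    """Map six zero-based level indices to a (row, col) cell in the flat grid."""
    row = (diabetes * LEVELS["age"] + age) * LEVELS["blood_pressure"] + blood_pressure
    col = (sex * LEVELS["smoking"] + smoking) * LEVELS["cholesterol"] + cholesterol
    return row, col


n_rows = LEVELS["diabetes"] * LEVELS["age"] * LEVELS["blood_pressure"]
n_cols = LEVELS["sex"] * LEVELS["smoking"] * LEVELS["cholesterol"]

# Every combination of levels, in the same argument order as cell_position.
order = ["diabetes", "sex", "age", "smoking", "blood_pressure", "cholesterol"]
cells = {cell_position(*idx) for idx in product(*(range(LEVELS[k]) for k in order))}

print(n_rows, n_cols, len(cells))
```

The response (risk) would then be drawn as the colour of each cell. Swapping which variable is major, minor, or elemental is the "transposing" discussed in the post.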
See Albers on the Interaction of Color, or the examples in Envisioning Information, the chapter on color and information.
Color is intensely contextual, like design elements in general.
To see the effect here, take 2 pieces of papers, cover everything except a thin line between A and B.
-- Edward Tufte
A group in Hungary has developed an ingenious method called the "holographic design strategy" for visualizing 4 (and higher) dimensional data. It requires that all the predictor variables be quantitative. The method uses an inspired arrangement in which one-half of the experimental variables are placed along the X-axis of a graph and the other half along the Y-axis. The points are arranged so that each forms a wave of different periodicity. The results are placed in this X-Y space coded by color or gray-scale intensity. This arrangement forms a data-rich graphic, which can easily allow visualization of several hundred thousand points! For example, the eye can easily view and resolve a 200x200 mm grid at 0.5 mm pitch, for a total of 160,000 points. 0.25 mm pitch gives 640,000 points.
I have an EXCEL file with an example if anyone is interested.
Reference: Vegvari, L., et al., Holographic research strategy for catalyst library design: Description of a new powerful optimisation method. Catalysis Today, 2003. 81(3): p. 517-527.
-- James Cawse (email)
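The point counts quoted above follow from dividing the side length by the pitch and squaring. A quick check, counting one point per pitch interval along each side as the post's figures do (the function name is made up for this sketch):

```python
def grid_points(side_mm, pitch_mm):
    """Points in a square grid, counting one point per pitch interval
    along each side (so 200 mm at 0.5 mm pitch gives 400 per side)."""
    per_side = round(side_mm / pitch_mm)
    return per_side * per_side


print(grid_points(200, 0.5))   # 400 x 400
print(grid_points(200, 0.25))  # 800 x 800
```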
The small multiples concept works wonders for reducing dimensionality. A two-dimensional grid of two-dimensional grids yields essentially 4 dimensions. Drop in a response via color or shape and one is up to five or six dimensions without even breaking a sweat.
Of course, with statistical graphics, the point is to display the distribution of responses and how this distribution is influenced by the various sources of variation. The model then, as shown by the graphic, shows what the sources of variation are and how they affect the distribution of response. This is why the small multiples grid works so well.
I recently had the opportunity to make an 81-by-15 grid of 4-by-4 grids in showing the results of a survey of parents of autistic children. Shrinking things way down and using a large sheet of paper (11x17 inches), all 2182 data points, and the margins, along with titles and explanation are shown on the plot. The plot (approx 360kb) in wonderfully rich pdf (with discussion and annotation) and a short description of the source of the problem are available for your viewing pleasure or fodder.
p.s. Can I take this opportunity to rave about the ability to preview the response and edit the source code on the same page? Very, very nice. Thank you, ET's IT staff!
-- rafe donahue (email)
Regarding the visualization of objects in more than one dimension, I'm intrigued by the work of Charles Hinton, who in 1906 published a book called The Fourth Dimension, in which he outlines the means for visualizing four spatial dimensions. Hinton devised a system of colored cubes to aid readers in developing this facility; it seems an interesting specimen in the history of envisioned information. Hinton's book is available for download here. I learned about his work at an interesting blog; there, I read that Hinton was the son-in-law of George Boole (inventor of Boolean logic), that he coined the word "tesseract" (made popular by the books of Madeleine L'Engle), and that some people find his book and its system of 4-D graphics strangely mesmerizing.
-- Matthew Battles (email)
Some CNC robot arms have six axes of movement: if you imagine holding a paper plane in your hand, you can move it (1) up and down, (2) left and right, (3) in and out, and rotate it round (4) the up axis, (5) the right axis, and (6) the in axis.
On a graph or chart, the size, shape and colour of symbols placed at X-Y coordinates can add extra dimensions - although often not mutually exclusive ones. This is also like a weather map: we have (1) X, (2) Y position of the weather, (3) symbol (shape) for how sunny (cloudy) or wet, (4) perhaps a colour for temperature, (5) size of an arrow for how windy, (6) animation to show time.
-- Bobby (email)
Sometimes we think we need multi-dimensional diagrams where in fact the several dimensions can all be seen as a sequence, i.e. mostly dependent on one variable (e.g. time or position).
I tried to use this to visualize the bucket filling motion of a front loader (wheel loader). See figure 5 on page 9 of this paper: http://www.arxiv.org/pdf/cs/0503087
Here, I show bucket position (x and y) and orientation (angle) in a side view together with the operator control commands (gas pedal/throttle, lift and tilt lever). Since the time between marked bucket tip position is 0.5s, we also get a feeling for speed.
I am not sure if this diagram holds in light of E.T.'s teaching (discovered the books only after this paper was written), but I still find it quite informative.
Anyway, my motto is to avoid colors as long as I can. This will save you a disappointing experience with miserable prints but will also help your color-blind colleagues quite a lot. Not to mention your copying students.
3D diagrams are good but one has to take great care to get the right perspective without distortions. Sometimes a good 2D representation is easier to understand.
-- Reno Filla (email)
The technique of flow cytometry allows biologists (especially immunologists) to quantify the copy number of a protein or other molecule of interest on each cell in a population of tens of thousands of cells, in minutes to hours. The technique as currently practiced commonly has a dynamic measurement range spanning more than four orders of magnitude, and advanced instruments allow simultaneous quantification of over a dozen molecules at a time, along with measurements of cell size and other optical parameters. In other words, high-precision datasets with large sample sizes and ten or more dimensions are becoming reasonably common. Issues pertinent to this and other threads now are being confronted by scientists and clinicians, and analytical errors can mean years of effort or lives lost.
As a consequence, I was delighted to see this superb review article [mirrored copy] which is largely about the visual display of quantitative flow cytometry data. (Nonspecialist readers will be happy to discover the explanatory box hidden on the last page, which contains a glossary of terminology). A few things jump out. First, small multiples of two-dimensional plots are (here, and in the literature at large) widely viewed as the correct way to present higher-dimensional data. Second, the authors spend a great deal of the article describing how instrument noise, leading to zero or negative values, makes the use of (otherwise optimal) logarithmic scales problematic, because low values pile up against the bottom of the scale or are even omitted altogether. The alternative is the "Logicle" double-exponential baseline, which allows a visual display that reflects what the data (and the instruments used to collect them) are actually doing. This is, in my view, a major advance that should and probably will propagate well beyond its field of origin. Third, the authors lament:
Over the years, these plots have become such a staple in FACS [fluorescence-activated cell sorting] plots that published figures often (and inexcusably) omit axis values and even axis 'tick marks' that identify the scale as logarithmic.
Apparently, no field is, um, immune to this sort of thing.
-- Alexey Merz (email)
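To see why zero and negative values are a problem on a logarithmic axis, compare a plain log scale with one that is linear near zero and logarithmic further out. The sketch below uses the inverse hyperbolic sine as a simple stand-in with that behaviour. It is not the Logicle transform itself, and the cofactor value and sample measurements are hypothetical.

```python
import math


def log_scale(x):
    """Ordinary log10 display scale; undefined for zero or negative values."""
    return math.log10(x) if x > 0 else None


def asinh_scale(x, cofactor=150.0):
    """Inverse hyperbolic sine scale (a stand-in, not Logicle).

    Roughly linear for |x| much smaller than cofactor and roughly
    logarithmic for |x| much larger, and defined for every real x.
    """
    return math.asinh(x / cofactor)


# Hypothetical measurements, including noise-driven zero and negative values.
values = [-50.0, 0.0, 12.0, 1500.0, 90000.0]

for v in values:
    print(v, log_scale(v), round(asinh_scale(v), 3))
```

On the log scale the first two values cannot be placed at all, while the alternative scale keeps them on the axis and in the right order.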
ET has suggested that I post in this thread a solution to a three-and-a-bit dimensional problem.
A proposed new electoral system, PR-Squared, rewards coalitions formed before the election. It is good that voters see these coalitions: it is good that voters choose a government, rather than vaguely influencing post-election haggling. The chart shows, or is meant to show, the incentive for parties to form such coalitions. I like it—the chart and the electoral system. But perhaps readers of this bulletin board can suggest improvements to the chart.
-- Julian D. A. Wiseman (email)