Boxplots data test

January 16, 2014  |  Edward Tufte

Construct a bunch (5 to 8, maybe) of x/y scatterplots, each with 50 < n < 500, with widely divergent scatter patterns
that all generate the identical set of, say, 5 vertical boxplots.

N can vary across scatterplots if you wish.

This is a sort of reverse of Anscombe’s Quartet, pages 13-14 in my book The Visual Display of Quantitative Information.

 image1

The point is that the binning of the x variable, and the summarizing of y, can sometimes mask what is going on in the underlying data, and may at times create the appearance of a stronger and smoother relationship between x and y than actually appears in the data.

For example, you might show scatters that one might expect for that set of boxplots, but also make a scatter that rides a sine wave from lower left to upper right on the xy plane. See Anscombe’s Quartet for other ideas.
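One way to approach the construction above, offered as a minimal sketch rather than a prescribed method: fix one pool of y values per x-bin, then hand those same values out to the points inside the bin in different orders. Every scatter then yields exactly the same boxplot in each bin, while the within-bin pattern is free to vary. The names `make_scatters`, `five_number`, and `assign`, the unit-wide bins, and the normal pools are all assumptions made for illustration, and n is held equal across scatters for simplicity.

```python
import numpy as np

def five_number(y):
    # min, Q1, median, Q3, max (NumPy's default linear-interpolation quartiles)
    return np.percentile(y, [0, 25, 50, 75, 100])

def make_scatters(n_bins=5, per_bin=40, seed=0):
    rng = np.random.default_rng(seed)
    # Bin label for every point; bin b covers x in [b, b + 1)
    bins = np.repeat(np.arange(n_bins), per_bin)
    x = bins + rng.uniform(0.0, 1.0, size=bins.size)
    # One fixed, sorted pool of y values per bin; every scatter reuses these
    pools = [np.sort(rng.normal(loc=b, scale=1.0, size=per_bin))
             for b in range(n_bins)]

    def assign(target):
        # Within each bin, the point with the k-th smallest target value
        # receives the k-th smallest pool value.
        y = np.empty(x.size)
        for b in range(n_bins):
            idx = np.flatnonzero(bins == b)
            ranked = idx[np.argsort(target[idx], kind="stable")]
            y[ranked] = pools[b]
        return y

    scatters = {
        # no within-bin structure
        "shuffled": assign(rng.uniform(size=x.size)),
        # one sine cycle per bin, riding the upward trend across bins
        "sine": assign(np.sin(2 * np.pi * x)),
        # y falls within every bin even though the bins rise
        "sawtooth_down": assign(-x),
    }
    return x, bins, scatters

x, bins, scatters = make_scatters()
for name, y in scatters.items():
    print(name, np.round(five_number(y[bins == 0]), 3))
```

The printed summaries for the first bin agree across all three scatters, and the same holds for the other bins, because each bin contains the same multiset of y values in every scatter. The raw data matrix for each scatter is just `np.column_stack([x, y])`.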

Provide the matrix of the raw data for each scatterplot.

Since many boxplots are used to report categorical data, the 2D issue above can be turned into a one-dimensional issue of how well boxplots represent the behavior of the data within a category and how well they support comparisons across categories.

Different categories may have very different n’s. Sometimes pseudo-categories are created by binning data. Sometimes the categories are ordered if not measured more exactly. Or the categories and the boxplots may all reside in a pool of noise and misclassification. Thus the 1D y-binning of box plots should be carefully examined. Again provide the matrix of plotted data.
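For the 1D version, here is a hedged sketch of one possible construction: pin the five summary values at the order statistics that define them, then fill the gaps between them however you like. With n = 4k + 1 points, the quartiles under NumPy's default linear-interpolation rule fall exactly on the sorted values at positions 0, k, 2k, 3k, and 4k, so the filler cannot move them. The function `sample_with_summary`, the three fill shapes, and the summary values (0, 2, 5, 6, 10) are illustrative assumptions, and other quartile conventions would need a different pinning rule.

```python
import numpy as np

def sample_with_summary(summary, k, fill, rng):
    """Return 4*k + 1 values whose (min, Q1, median, Q3, max) equal `summary`."""
    summary = np.asarray(summary, dtype=float)
    parts = []
    for lo, hi in zip(summary[:-1], summary[1:]):
        # k - 1 filler values kept inside [lo, hi]; `fill` returns values in [0, 1]
        inner = np.clip(lo + (hi - lo) * fill(rng, k - 1), lo, hi)
        parts.append(np.concatenate(([lo], inner)))
    parts.append(summary[-1:])
    return np.concatenate(parts)

# Three categories with very different n and very different shapes
categories = {
    "even_n41": (10, lambda rng, m: rng.uniform(0.0, 1.0, m)),
    "hugging_quartiles_n101": (25, lambda rng, m: rng.beta(0.2, 0.2, m)),
    "clumped_in_gaps_n401": (100, lambda rng, m: rng.beta(20.0, 20.0, m)),
}

rng = np.random.default_rng(1)
summary = (0.0, 2.0, 5.0, 6.0, 10.0)
samples = {name: sample_with_summary(summary, k, fill, rng)
           for name, (k, fill) in categories.items()}

for name, values in samples.items():
    print(name, len(values), np.percentile(values, [0, 25, 50, 75, 100]))
```

Each category reports the same five numbers even though one is spread evenly, one piles up against the quartile values, and one clusters in the middle of each gap. Note that many boxplot implementations draw whiskers at 1.5 times the interquartile range rather than at the minimum and maximum, so the whisker setting would need to match the min/max convention assumed here for the drawn boxplots to coincide.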

Post down below via the CONTRIBUTE button.

Provide a working email (masked to our readers) so I can get in touch with you.

The 5+ best answers get a set of all 4 of my books, autographed. At some point, I will probably publish some of the contributed material.

I will give you credit for your contribution.
