Data mining coincidences: Bellwether electoral districts

December 3, 2008  |  Edward Tufte

Lurking in the story below are general points about data-mining and the constant need for genuinely predictive replication. There are big incentives in data-mining to produce (false) alarms, for what data-mining consultant is going to say “We found nothing at all in your data”? Also, in many domains (such as medical testing and security systems), the money is in the false alarms. The current notorious example is prostate cancer testing, which requires mutilating surgery on 47 men to extend the life of one man. That is, there are 46 false alarms, 46 unnecessary surgeries, and $1,000,000 in medical fees. Our paper suggests methodologies for validating, or failing to validate, the results of data mining.

For years journalists have data-mined election returns to find “bellwethers,” geographic units whose overall vote division mimicked the national result in election after election. I published an article (along with my Princeton undergraduate student Richard Sun) showing that bellwether electoral districts have no predictive value (at least winner-take-all bellwethers; we did find some evidence for barometric and swingometric districts, which we constructed for fun). Journalists haven’t always gotten this news, and so 30 years later I still get emails from reporters asking about some area that has voted for the electoral winner in the last 00 elections and whether that miracle will continue in the upcoming election.

Edward R. Tufte and Richard A. Sun, “Are There Bellwether Electoral Districts?”
Public Opinion Quarterly, 39 (1975), 1-18.

The article shows some ways to test data-mined results and avoid being caught by coincidences.
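
A small simulation can illustrate how coincidence alone produces impressive-looking bellwethers. The sketch below is not the method of the article. It assumes a hypothetical 3,000 counties, each matching the national winner independently with probability 0.5 in every election (a pure coin flip), selects the counties with a flawless record over 8 past elections, and then checks them against one held-out election. All function names and parameter values are illustrative assumptions.

```python
import random


def simulate_bellwethers(n_counties=3000, n_past=8, p_match=0.5, seed=0):
    """Count coincidental bellwethers and how they fare in the next election.

    Assumption (not from the article): every county matches the national
    winner independently with probability p_match in each election.
    """
    rng = random.Random(seed)
    perfect = 0     # counties with a flawless record over the past elections
    next_hits = 0   # of those, how many also match in the held-out election
    for _ in range(n_counties):
        past = [rng.random() < p_match for _ in range(n_past)]
        held_out = rng.random() < p_match
        if all(past):
            perfect += 1
            if held_out:
                next_hits += 1
    return perfect, next_hits


def expected_perfect(n_counties=3000, n_past=8, p_match=0.5):
    """Expected number of flawless records under pure chance."""
    return n_counties * p_match ** n_past


if __name__ == "__main__":
    perfect, next_hits = simulate_bellwethers()
    print(f"expected flawless records by chance: {expected_perfect():.1f}")
    print(f"simulated flawless records: {perfect}")
    print(f"of these, correct in the next election: {next_hits}")
```

Under these assumptions about a dozen counties are expected to show a flawless eight-election record (3000 × 0.5^8 ≈ 11.7), yet each of them has only an even chance of matching the next result. Checking the selected counties against an election that played no part in selecting them is what separates postdiction from prediction.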

Here’s the summary at the end of paper, concluding with a wonderful quote from Somerset Maugham:

“Are there bellwether electoral districts? No, at least not if they are chosen before the fact. Some counties are more barometric than others, both in retrospect and in prospect. While spectacular in their postdictions, these counties are not sufficiently barometric or swingometric in their predictions to provide a precise or reliable guide to upcoming elections. Several alternative methods of prediction are also preferred because their underlying inferential logic is more secure than the unknown mechanisms producing the highly variable barometric and swingometric behavior observed in our data.

The all-or-nothing counties are only a curiosity and probably should be forgotten. It is a waste of time to send reporters out to interview non-randomly selected citizens of Crook County a week or two before the election–at least from any sort of scientific point of view.

There perhaps remains a magical air about the bellwethers of the past. Some of these districts, considered individually, seem to have such phenomenal records, and while we know better than to take them seriously, still. . . . It may be best to look not to the election returns for the source of the mystery, but rather to ourselves. Somerset Maugham once wrote:

‘The faculty for myth is innate in the human race. It seizes with avidity upon any incidents, surprising or mysterious, in the career of those who have distinguished themselves from their fellows, and invents a legend to which it then attaches a fanatical belief. It is the protest of romance against the commonplace of life.’”
