Trouble With Difference In Differences

Jeremy also told me about a fascinating paper he'd recently read that turned a standard statistical tool of economic research on its ear. How Much Should We Trust Differences-in-Differences Estimates? is the title of the paper, and the answer is "not all that much."

The background here is that the kinds of things economists want to measure in the wild -- say, wages, unemployment, spending patterns, crime, or any number of important trends -- are changing all the time. If a few states alter their welfare laws in a particular way, you can't just compare their unemployment rates before and after the change, because you might be misled by national changes in overall unemployment. On the other hand, you can't just compare states that enacted these changes to other states directly, because they might have had different unemployment rates to begin with (and this difference is likely to have been a factor in legislative decisions).

What you have to do is fill out the whole box. You make a "difference-in-differences" comparison, by looking at the changes in the important quantity in both your experimental and control groups. That is, you treat the before-to-after changes in your control group as a baseline, and see how much the before-to-after changes in your experimental group deviate from this baseline. Where your experimental and control groups are both sizeable, you can use standard statistical tools to see how much of the before-to-after changes are attributable to the change whose effects you're trying to measure.
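To make the two-by-two arithmetic concrete, here is a minimal sketch in Python. The function name and the unemployment figures are invented for illustration; they are not taken from the paper or from any real data.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Return the difference-in-differences estimate from four lists of outcomes."""
    def mean(values):
        return sum(values) / len(values)

    # Before-to-after change in the states that changed their law.
    treated_change = mean(treated_after) - mean(treated_before)
    # Before-to-after change in the comparison states: the baseline.
    control_change = mean(control_after) - mean(control_before)
    # How far the treated change deviates from the baseline.
    return treated_change - control_change


# Hypothetical unemployment rates (percent), two states per group.
estimate = diff_in_diff(
    treated_before=[6.0, 7.0],
    treated_after=[5.0, 6.0],
    control_before=[5.0, 5.0],
    control_after=[4.5, 4.5],
)
```

In this made-up example unemployment fell by a full point in the treated states but by half a point everywhere else, so only the remaining half-point drop is a candidate for being an effect of the law.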

There is nothing wrong with this procedure as I've described it. But, in practice, difference-in-differences papers often go one step further. They look at changes in the "dependent" variable as a time series. That is, if the legal changes were enacted in 1987, say the authors, and we have data every year from 1980 to 2000, we should try to fit a curve to that data. We weight every year equally, use standard statistical heuristics (say, least-squares regression) to fit a curve, and then compare control curves with experimental curves. Surely this technique gives us a more finely-tuned result than just lumping together "before" and "after" data for each state?

Actually, no, say the authors of the paper Jeremy showed me. The problem is that -- especially for data measured at the state level -- there are strong year-to-year correlations that have nothing to do with anything susceptible to manipulation. Treating each year as an independent measurement vastly overstates the useful information present in the time series. The years aren't independent measurements randomly clumped around some "true" underlying value. The year-to-year "error" (which is actually the normal economic noise induced by all the things economists don't understand) tends to persist and replicate itself. In the context of a difference-in-differences measurement, disaggregating the time series tends to mistake these persistent effects for genuine changes caused by something. Lumping all the "before" and "after" data together brings the problem back to the simple two-by-two matrix that difference-in-differences is designed to handle.
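One rough way to see how much information gets overstated is to ask how many independent observations a correlated series is "worth" when you average it. The sketch below assumes the noise follows a first-order autoregressive (AR(1)) process with correlation rho between adjacent years. That model and the value rho = 0.8 are my assumptions for illustration, not figures from the paper.

```python
def effective_sample_size(n_years, rho):
    """Number of independent draws whose mean is as precise as the mean of
    n_years observations from a stationary AR(1) process with parameter rho."""
    # Variance of the mean is inflated by 1 + 2 * sum_{k=1}^{n-1} (1 - k/n) * rho^k.
    inflation = 1.0 + 2.0 * sum(
        (1.0 - k / n_years) * rho ** k for k in range(1, n_years)
    )
    return n_years / inflation


# 21 annual observations (1980 through 2000) with strong persistence.
worth = effective_sample_size(21, 0.8)
```

Under these assumptions, 21 years of data carry about as much information about the average as roughly three independent years would. With rho set to zero the function returns 21, the case the cookbook regression implicitly assumes.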

As a particularly striking demonstration of the trap, the crowd who wrote w8841 tried running difference-in-differences on random groups of states. That is, they pulled 25 state names out of a hat, and pretended that these 25 states had enacted a placebo "law." They then looked at the difference-in-differences effect of this placebo law on female wages, using the standard, cookbook, statistical treatment I described above. The result: almost half the time, they found an "effect" traceable to their law's passage.
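The flavor of this experiment can be reproduced with a toy simulation. The sketch below is not the authors' procedure: they used real wage data, whereas this generates pure AR(1) noise for 50 hypothetical states over 21 years, gives 25 randomly chosen states a placebo law in the eighth year, and then tests for an "effect" two ways. The persistence parameter (0.8), the single enactment year, the 1.96 cutoff, and every function name are assumptions made for illustration.

```python
import math
import random
import statistics


def simulate_panel(rng, n_states=50, n_years=21, rho=0.8):
    """State-by-year panel of pure AR(1) noise: no law has any real effect."""
    panel = []
    for _ in range(n_states):
        # Draw the first year from the stationary distribution.
        x = rng.gauss(0.0, 1.0 / math.sqrt(1.0 - rho ** 2))
        series = []
        for _ in range(n_years):
            series.append(x)
            x = rho * x + rng.gauss(0.0, 1.0)
        panel.append(series)
    return panel


def double_demean(m):
    """Remove row (state) and column (year) means from a balanced panel."""
    n, t_len = len(m), len(m[0])
    row = [sum(r) / t_len for r in m]
    col = [sum(m[i][t] for i in range(n)) / n for t in range(t_len)]
    grand = sum(row) / n
    return [[m[i][t] - row[i] - col[t] + grand for t in range(t_len)]
            for i in range(n)]


def naive_t(panel, treated, first_post_year):
    """t-statistic from least squares with state and year effects,
    treating every state-year as an independent observation."""
    n, t_len = len(panel), len(panel[0])
    law = [[1.0 if (i in treated and t >= first_post_year) else 0.0
            for t in range(t_len)] for i in range(n)]
    y, d = double_demean(panel), double_demean(law)
    pairs = [(d[i][t], y[i][t]) for i in range(n) for t in range(t_len)]
    sxx = sum(x * x for x, _ in pairs)
    beta = sum(x * v for x, v in pairs) / sxx
    rss = sum((v - beta * x) ** 2 for x, v in pairs)
    dof = n * t_len - n - t_len  # observations minus fixed effects minus slope
    return beta / math.sqrt(rss / dof / sxx)


def collapsed_t(panel, treated, first_post_year):
    """t-statistic after lumping each state into one before-to-after change."""
    groups = {True: [], False: []}
    for i, s in enumerate(panel):
        change = (statistics.mean(s[first_post_year:])
                  - statistics.mean(s[:first_post_year]))
        groups[i in treated].append(change)
    a, b = groups[True], groups[False]
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def placebo_rejection_rates(n_sims=200, seed=0, first_post_year=7):
    """Share of placebo laws declared 'significant' by each method."""
    rng = random.Random(seed)
    naive_hits = collapsed_hits = 0
    for _ in range(n_sims):
        panel = simulate_panel(rng)
        treated = set(rng.sample(range(len(panel)), 25))
        naive_hits += abs(naive_t(panel, treated, first_post_year)) > 1.96
        collapsed_hits += abs(collapsed_t(panel, treated, first_post_year)) > 1.96
    return naive_hits / n_sims, collapsed_hits / n_sims


if __name__ == "__main__":
    naive_rate, collapsed_rate = placebo_rejection_rates()
    print(f"naive: {naive_rate:.2f}  collapsed: {collapsed_rate:.2f}")
```

If the test were honest, a law that does nothing should look "significant" about 5% of the time. In this toy setup the year-by-year regression should flag the placebo far more often than that, while the lumped before-and-after comparison should stay close to the nominal rate. The exact rates depend on the assumed persistence and on the random seed.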


I'm curious whether this discovery will induce a wave of retractions in the economics world. How many recently-publicized results depended on this particular piece of shaky methodology? What else that we think we know is wrong?