Apr 13, 2012
 

In dealing with thoughtful people who have either never taken a stats class or have not been students of the discipline for many years, I have often run across a few interesting false premises or areas of ignorance. One worth mentioning in passing (and this falls into the paradigm of a little knowledge being a bad thing) is the belief that to run any kind of experiment you can only manipulate one variable at a time. This belief falls squarely in the “I’m going to tell you something right now to help you understand a general concept, but later you will be told to forget it because it is false” camp, but since most folks only take that first stats class they never get to the “it is false… and here is why” part. For those of you in that camp who are waiting for me to explain how this insane proclamation can be true, I have some sad news… I’m not going to cover it here. However, there are plenty of great online resources that will clue you in on how this is possible – check out descriptions of regression analysis, ANOVA (but not one-way ANOVA), and MANOVA to get yourself started down the path of complex experimental design.
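As a small taste of what those resources cover, here is a minimal sketch (mine, not pulled from any particular reference) of a design that manipulates two variables at once and analyzes them with a two-way ANOVA. It assumes Python with pandas and statsmodels, and the factors, sample sizes, and simulated scores are all made up for illustration.

```python
# Hypothetical two-factor design: two manipulated variables (teaching method
# and class size) analyzed together in a single two-way ANOVA.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
n = 40  # observations per cell (assumed)

data = pd.DataFrame({
    "method": np.repeat(["novel", "traditional"], 2 * n),
    "class_size": np.tile(np.repeat(["small", "large"], n), 2),
})
# Simulated test scores with a small bump for each factor (numbers are made up).
data["score"] = (
    70
    + 3 * (data["method"] == "novel")
    + 2 * (data["class_size"] == "small")
    + rng.normal(0, 10, size=len(data))
)

# Both factors plus their interaction tested in one model -- no need to
# manipulate only one variable at a time.
model = ols("score ~ C(method) * C(class_size)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))
```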

I specifically call out “thoughtful people” in the opening paragraph because these folks tend to make quick associations, recognize patterns easily, and apply knowledge from other sources to solve or contemplate current concerns. This, at times, can be a crutch due to the misapplication of a concept or set of rules. Stats tend to be tricky, and especially vex those who are mathematically inclined but who don’t have the full statistical picture. While the discipline of statistics isn’t exactly math, it relies quite a bit on sometimes complex mathematical functions, and as such folks like engineers tend to think about statistics with the mindset of applying the rigid rules they would otherwise apply to mathematical processes. Silly engineers. An interesting observation I made while in grad school was that the majority of us who were stats-savvy were actually math-poor. I have never taken calculus, and the last time I really used complex algebra was in high school. Statistics as a field has a lot of essentially black-box processes that we let computers munge through while we get on with the business of interpreting the results. This can and does lead to unintentional misuse, but it also lets us graduate in 4-6 years rather than 10. That said, I am a math-nerd wanna-be… one day I’ll understand the full picture.

Now that I have waxed philosophical on the mystery and magic of statistics (or the misunderstanding of what statistics, as a discipline, actually is), I’ll get to the meat of the post: practically nonsignificant interpretations of statistically significant results. It’s pretty dead simple when you break it down – there are two classes of significance: statistical and practical. Statistical significance is simple, rules-based, math-based, and straightforward – take the data, run your test, and see what the math tells you. Practical significance is squishy, relative, and as such, confusing to those not used to having to add in a “furthermore” after calculating a mathematical outcome. Practical significance refers to the cost/benefit of the statistically significant outcome. “What?” I hear you say… bear with me (or is it bare… looked it up, bear it is).

Say I ran an extremely well-designed experiment (perfect sampling, impeccable design, in a method so perfect that nobody could fault it in any way) whereby one set of children was taught math in a novel way and another set was taught in a traditional way. The question is whether the novel method produces significant improvement on the final math test in the course. Again, remember there is nothing wrong with the methodology of the design, sampling, etc. After the test is run, it is found that, lo and behold, children in the “novel technique” group score significantly higher on their tests. Champagne corks are popped, a new era has dawned, and math instruction is changed forever. Right? Well, the first thing any discerning statistician will ask (once she is convinced that the design and test were valid and appropriate, and damn it, for this thought experiment I’m telling you they were!) is: what was the effect size? To keep it simple, let’s ask “what was the difference in the average scores between the groups?” This is half of the equation (and sometimes enough to answer the greater question at large). The other half of the equation (sometimes irrelevant) is “what was the cost” of this new and amazing technique?
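To make the “run your test and see what the math tells you” step concrete, here is a quick sketch of that calculation – a plain two-sample t-test with SciPy. The scores are simulated stand-ins for the hypothetical experiment; the group means, spreads, and sample sizes are my assumptions, not results from anywhere.

```python
# Statistical significance in its barest form: compare two groups of scores
# and let the test report a p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical final-exam scores (0-100) for the two groups.
novel = rng.normal(loc=72, scale=10, size=200)        # novel technique
traditional = rng.normal(loc=70, scale=10, size=200)  # traditional technique

t_stat, p_value = stats.ttest_ind(novel, traditional)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A p-value below your chosen alpha (commonly 0.05) is "statistically
# significant" -- but that alone says nothing about whether the difference
# matters in practice.
```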

The importance of the first question is to make a judgment about the actual expected improvement one will see when employing this new technique. If the result were an average difference of 15 points on a 100-point test, we may determine that it is a nice, healthy lift in scores. If the result were a meager 2 points (it is absolutely possible to have a small but significant result – parametric statistical tests scale differences by the standard deviation of scores, so when the scores within each group “clump together” tightly, even small absolute differences can be significant), we may not be as impressed. Two points… hmm… is the cost of the alternate method worth it for 2 points? This is where we get into the practical. How much does it cost (where “cost” can have lots of definitions)? If to apply the new technique you have to retrain all math teachers for 50 hours and buy special course materials @ $55 per student… do the math. It may be determined that while the technique improves student scores, it’s just too damn costly to implement for what you get out of it. In other words, it’s impractical – statistically significant, but practically not worthwhile. One is mathematically derived, the other is not.

As an opposite example (where a very small difference still has a major impact), imagine Google running a UX experiment whereby they significantly increase ad click rates by one-tenth of a percent by changing the font from X to Y. Considering the change is trivial and the impact is an extra ten million dollars assuming $10B a year in ad revenue, it’s obvious that this tiny effect is not only statistically significant but probably practically significant as well. The real point here is that finding a statistically significant result is not the last step in your experimental quest – it’s somewhere in the middle, as it lets you get to the next step of asking “so what”. The follow-up questions centering on the practical implications are essential to the decision around whether the findings fall into the camp of “OMG” or “BFD”.
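For those who want to “do the math”, here is a back-of-the-envelope sketch of the practical side. The 2-point lift, 50 retraining hours, $55 per-student materials, $10B in revenue, and one-tenth-of-a-percent lift come from the examples above; the number of teachers, hourly training cost, number of students, and the assumed score spread are numbers I have plugged in purely for illustration.

```python
# Back-of-the-envelope "practical significance" math for the two examples above.

# --- Classroom example: what does each point of improvement cost? ---
score_lift = 2                  # average improvement on a 100-point test
n_teachers = 100                # assumed district size
retrain_hours = 50              # per teacher, from the example above
hourly_cost = 40                # assumed fully loaded cost per training hour ($)
n_students = 3_000              # assumed
materials_per_student = 55      # $ per student, from the example above

total_cost = n_teachers * retrain_hours * hourly_cost + n_students * materials_per_student
print(f"Total cost: ${total_cost:,}")                              # $365,000
print(f"Cost per point of average lift: ${total_cost / score_lift:,.0f}")

# A standardized effect size puts the lift in context: with an assumed
# within-group standard deviation of ~10 points, a 2-point lift is a
# Cohen's d of about 0.2 -- conventionally a "small" effect.
assumed_sd = 10
print(f"Cohen's d: {score_lift / assumed_sd:.2f}")

# --- Google example: a tiny relative lift on a huge base ---
annual_ad_revenue = 10_000_000_000    # $10B a year, from the example above
relative_lift = 0.001                 # one-tenth of a percent
print(f"Extra revenue: ${annual_ad_revenue * relative_lift:,.0f}")  # $10,000,000
```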