Nov 25, 2012

So Nate Silver is the stats nerd of the year for his great (or lucky, if you hate science) methodology of poll aggregation and the poll weighting algorithm he employed to predict the outcome of the recent national elections. Congratulations, Nate. If I didn’t live in a country with Byzantine banking laws, I would have made a tidy sum using your leg work (among others – I firmly believe in leveraging the wisdom of crowds of experts) to bet on “Obama to win” via the event-based market InTrade. I haven’t been able to find any apologies from the demonizers who suggested Nate was just another political hack (like them?) rooting for the wrong team and trying to hide it behind some sort of magical thinking in the guise of science, but I can’t say I looked too hard.

While the disappointing part of the whole Nate-Silver-predicting-the-elections bit lies in the constant misinterpretation of what Nate actually did to come by his numbers, thanks to the general public’s pseudo-understanding of statistics, the beauty of the press he received both before and after the election is that it has elevated the role of data in decision making – even messy social data like poll results (essentially surveys, with all their inherent issues). The age-old “gut feeling” as the sole driver of decision making (i.e. guessing) is coming under needed scrutiny in an age where having current and historical information is finally possible. Those who fail to incorporate data, especially data that is readily available or easily gathered, will be left behind; even when their guesses succeed (expertise does have its place), they will be less efficient.

It is my firm opinion that gut feeling is a garnish best placed on top of data-driven analysis, where the depth of gut needed is (roughly) inversely proportional to the data available. Nate doesn’t use gut feelings; he uses data, which can then be handed to those responsible for making decisions.

So how does Nate Silver make my job easier? When Jon Stewart asked him on The Daily Show what it would have meant if his model had been wrong, Nate responded, “It would have been bad, I think, because for some reason 538 became invested with this symbolic power, and you know symbolic power isn’t particularly rational, right, but it became this symbol for people who were believing in hey, let’s look at the polls. Let’s do some empirical research, right.” Empirical research was shown to best gut feelings. That research ran contrary to a huge contingent of, not surprisingly, biased observers, yet it proved superior to all other estimations, guesses, scrying-stone proclamations, etc., even those made by individuals with a vested interest in Obama winning. His. Model. Won. Data won. Because data won over individual “expert” opinion in such a highly scrutinized, over-thought, expensive contest, my job got easier. The hugely symbolic power of that specific use of data serves as a powerful example of what data can do. When talking to organizations about the value of data, the value of quality data, and the usefulness of measurement to drive business decisions, I now have an example that everyone knows and, in some small way, understands. Am I comparing myself to Nate Silver? Not particularly – we come from very different backgrounds, education, approaches, etc. But one thing is certain: he has just made the human interaction part of my job a lot easier – the part where I am convincing a client to invest in data resources, to care about data quality, data completeness, and data-driven decision making. Thanks, Nate.

Jul 30, 2012

A friend sent me a great blog post (see #1 in the list at the end of this post) around testing that has been buzzing around (and should be read and debated if you care about such things even a little bit). The post introduces (as in “brings to our attention,” not as in “invents”) a method of easily coding an epsilon greedy strategy for testing web (or whatever) optimization and claims that it is superior to the well-established standby of A/B testing (oooh, thems fightin words!). This post has inspired a number of responses by folks who run A/B tests, folks who optimize and test websites, and computer nerds interested in arguing about this type of stuff in general.
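For those who haven’t read the post, the core idea is small enough to sketch. Below is a minimal, illustrative Python version of an epsilon greedy chooser – my own sketch, not the code from the post; the class name, the 10% exploration rate, and the button-variant names are all just placeholders:

```python
import random

class EpsilonGreedy:
    """Minimal epsilon greedy chooser: with probability epsilon show a
    random variant (explore), otherwise show the variant with the best
    observed conversion rate so far (exploit)."""

    def __init__(self, arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in arms}   # times each variant was shown
        self.rewards = {arm: 0 for arm in arms}  # conversions per variant

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))          # explore
        return max(self.counts, key=self._conversion_rate)   # exploit

    def update(self, arm, converted):
        self.counts[arm] += 1
        self.rewards[arm] += 1 if converted else 0

    def _conversion_rate(self, arm):
        return self.rewards[arm] / self.counts[arm] if self.counts[arm] else 0.0

# Hypothetical usage with two made-up button variants:
bandit = EpsilonGreedy(["green_button", "orange_button"])
arm = bandit.choose()               # decide which variant to serve this visitor
bandit.update(arm, converted=True)  # record whether the visitor converted
```

The appeal being sold is exactly this: no test-length math and no significance tables, just serve, record, and let the running counts steer traffic.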

The normal array of folks weighs in – the engineers who almost understand* the experimental/statistical approach to A/B testing, statistician purists who sing about the superiority of A/B testing, and the practical coders who just ask for something that is simple, intuitive, and works towards the end goal of making money on their site. It’s the interplay between these folks that I find interesting (and entertaining). Full disclosure – I’m not a master statistician but have plenty of experience in experimental design, A/B testing, and general statistics. I am not by any stretch a website optimizer or an engineer.

At the end of the day, the majority of those who are interested in testing – and I’m not talking about Google or Bing but rather the rest of the world – want something that works very well and that converges on an optimal or close-to-optimal solution quickly. Maybe we should even get fuzzy with the term “optimal” by saying it means extracting maximum price/conversion/experience AND painlessness/stability/implementation ease. The main arguments against the A/B testing framework are that, while it is a well-established, experimentally sound methodology, it takes time to execute and collect the data, requires knowledge of statistics to interpret accurately (not to mention to know how long to run the test, how to set up the test, and why you don’t just “watch the numbers until it gets significant”), and needs to finish before you can tweak and deploy your “winning” solution. The epsilon greedy algorithm is relatively self-tuning based on the rules given to it, which lets it get close to an optimum (assuming a static universe) relatively quickly**. One big argument against the epsilon greedy strategy is that it can mistakenly optimize based on a temporal correlate or something similar (e.g. your site shows breakfast and dinner ads at 8 a.m. – guess which one gets more clicks? That ad is then considered optimal and keeps getting shown past noon and into the evening until, finally, the dinner ads get enough clicks to flip it – but way later than is optimal). Maybe some good strategies are to reset/retest every X hours, or to decay older clicks against newer ones for a faster “flip” when such a pattern emerges; a rough sketch of both ideas follows below.
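To make those two mitigations concrete, here is a sketch (again my own illustration, not code from any of the linked pieces) of how decay and reset could be bolted onto the counts/rewards bookkeeping from the sketch above; the 0.99 decay factor and the “every X hours” cadence are arbitrary placeholders:

```python
def decay_stats(counts, rewards, decay=0.99):
    """Age every variant's accumulated history so newer clicks dominate.
    Run this on each update (or on a timer) so a morning winner such as
    the breakfast ad cannot coast on stale clicks into the evening."""
    for arm in counts:
        counts[arm] *= decay
        rewards[arm] *= decay

def reset_stats(counts, rewards):
    """The blunter alternative: wipe the slate every X hours and re-learn."""
    for arm in counts:
        counts[arm] = 0.0
        rewards[arm] = 0.0

# e.g. with two ad variants tracked as plain dicts:
counts = {"breakfast_ad": 900.0, "dinner_ad": 100.0}
rewards = {"breakfast_ad": 90.0, "dinner_ad": 5.0}
decay_stats(counts, rewards)  # every old observation now weighs 1% less
```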

My take is that if you don’t care about the statistical rigor or experimental soundness angles to optimization – and if you aren’t particularly keen on analyzing every tweak, covariate, and adjustment you make to your website (again, excluding Google and the big boys here who definitely do care), then the epsilon greedy algo is worth implementing. That is not a dig by any means – sometimes you care and sometimes you want this thing to get rolling as fast as possible, you will never revisit it, etc. If you are trying to sell a service to scientists or stats nerds, need the experimental rigor, need to optimize for the long term, expect to run tests at all times, and want to use as much data as possible to craft the ultimate presentation models (or whatever it is you are testing for) then you should probably be using the slower-but-steady A/B testing approach or something similar. As with most things – use the tool that meets your goals, but figure out your goals ahead of time for optimal tool selection.
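For the slower-but-steady camp, the “is it significant yet” question usually comes down to something like a two-proportion z-test, run once a pre-planned sample size has been reached rather than by peeking at the numbers. A self-contained sketch follows; the conversion counts are made up for illustration:

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test: returns (z, two-sided p-value) for
    the difference in conversion rate between variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 1 - erf(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical counts: 120/2400 conversions for A vs. 150/2400 for B.
z, p = two_proportion_z_test(120, 2400, 150, 2400)
print(f"z = {z:.2f}, p = {p:.3f}")  # judge only after the planned sample size is hit
```

The point is less this particular test than the discipline around it: a fixed sample size, one look at the numbers, and only then a decision.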

In the end, I feel like the debate around which method to use consists of folks discussing and exploring some of the pros and cons of each approach without enumerating the actual uses. They mistakenly assume that both are used for exactly the same reason. While that is true at the broadest level (optimizing between choices), the actual reasons behind the use of one over the other are vastly different. Hatchet versus lathe – both cut wood, but which one does it better?

* I say “almost” because in the discussions, many of them fail to point out simple errors others are making in assumptions. If they were statistically savvy engineers they would say things like “you should reset your test metrics every time you make a change, and this is true whether you use the epsilon greedy strategy or A/B testing”.

** I’m ignoring the cases where the groups have disgustingly similar conversion rates.

Here are some articles and reference pieces for your perusal:

  1. Original post: “20 lines of code that will beat A/B testing every time”
  2. Wikipedia article on the Multi-Armed Bandit problem and the concept of the epsilon greedy algorithm
  3. Blog on why bandit algos are better than A/B testing
  4. Discussion and debate around the original article on y-combinator
  5. Blog on some hard knocks and learning around A/B testing in general
  6. Blog summarizing a cool article by Microsoft Bing folks on weird outcomes experienced with online testing
  7. Actual MSFT article on weird A/B outcomes explained (PDF)