A friend sent me a great blog post (see #1 in the list at the end of this post) around testing that has been buzzing around (and should be read and debated if you care about such things even a little bit). The post introduces (as in “brings to our attention” not as in “invents”) a method of easily coding an epsilon greedy strategy for testing web (or whatever) optimization and claims that it is superior to the well-established standby of A/B testing (oooh, thems fightin words!) This post has inspired a number of responses by folks who run A/B tests, folks who optimize and test websites, and computer nerds interested in arguing about this type of stuff in general.
The normal array of folks weigh in – the engineers who almost understand* the experimental/statistical approach to A/B testing, statistician purists who sing about the superiority of A/B testing, and the practical coders who just ask for something that is simple, intuitive, and works towards the end goal of making money on their site. It’s the interplay between these folks that I found interesting (and entertaining). Full disclosure – I’m not a master statistician but have plenty of experience in experimental design, A/B testing, and general statistics. I am not by any stretch a website optimizer or an engineer.
At the end of the day, for the majority of those who are interested in testing – and I’m not talking about Google or Bing but rather the rest of the world, they want something that works very well and that converges on an optimal or close to optimal solution quickly. Maybe we should even get fuzzy with the term “optimal” by saying it means exacting maximum price/conversion/experience AND painlessness/stability/implementation ease. The main arguments against the A/B testing framework is that while it is a well-established, experimentally sound methodology, it takes time to execute and collect the data, requires knowledge of statistics to accurately interpret (not to mention know how long to run the test, how to set up the test, and to understand why you don’t just “watch the numbers until it gets significant”) and needs to finish before you can tweak and deploy your “winning” solution. The epsilon greedy algorithm is relatively self tuning based on the rules given to it, making it get closer to an optimization (assuming a static universe) relatively quickly**. One big argument against the epsilon greedy strategy is that it can mistakenly optimize based on a temporal correlate or something similar (e.g. your site shows breakfast and dinner ads at 8 a.m. – guess which one gets more clicks? That ad is then considered optimal and keeps getting shown past noon, into the evening until, finally, the dinner ads get enough clicks to flip it – but way later than is optimal). Maybe some good strategies are to reset/retest every X hours, or to decay older clicks against newer ones for a faster “flip” when such a pattern emerges.
My take is that if you don’t care about the statistical rigor or experimental soundness angles to optimization – and if you aren’t particularly keen on analyzing every tweak, covariate, and adjustment you make to your website (again, excluding Google and the big boys here who definitely do care), then the epsilon greedy algo is worth implementing. That is not a dig by any means – sometimes you care and sometimes you want this thing to get rolling as fast as possible, you will never revisit it, etc. If you are trying to sell a service to scientists or stats nerds, need the experimental rigor, need to optimize for the long term, expect to run tests at all times, and want to use as much data as possible to craft the ultimate presentation models (or whatever it is you are testing for) then you should probably be using the slower-but-steady A/B testing approach or something similar. As with most things – use the tool that meets your goals, but figure out your goals ahead of time for optimal tool selection.
In the end, I feel like the debate around the method to use consists of folks who are discussing and exploring some of the pros and cons of each approach without enumerating the actual uses. They mistakenly assume that both are used for exactly the same reason. While this is, at the broadest level (optimizing between choices) is true, the actual reasons behind the use of one over the other is vastly different. Hatchet versus lathe – both cut wood, but which one does it better?
* I say “almost” because in the discussions, many of them fail to point out simple errors others are making in assumptions. If they were statistically savvy engineers they would say things like “you should reset your test metrics every time you make a change, and this is true whether you use the epsilon greedy strategy or A/B testing”.
** I’m ignoring the cases where the groups have disgustingly similar conversion rates.
Here are some articles and reference pieces for your perusal:
- Original post “20 lines of code that will beat A/B testing every time”
- Wikipedia article on Multi Armed Bandit problem and the concept of the epsilon greedy algorithm
- Blog on why Bandit algos are better than A/B testing
- Discussion and debate around the original article on y-combinator
- Blog on some hard-knocks and learning around A/B testing in general
- Blog summarizing a cool article by Microsoft Bing folks on weird outcomes experienced with online testing:
- Actual MSFT article on weird A/B outcomes explained (PDF)