Jul 30, 2012

A friend sent me a great blog post (see #1 in the list at the end of this post) about testing that has been buzzing around (and should be read and debated if you care about such things even a little bit). The post introduces (as in “brings to our attention,” not as in “invents”) a method of easily coding an epsilon greedy strategy for testing web (or whatever) optimizations and claims that it is superior to the well-established standby of A/B testing (oooh, thems fightin’ words!). The post has inspired a number of responses from folks who run A/B tests, folks who optimize and test websites, and computer nerds interested in arguing about this type of stuff in general.

The normal array of folks weigh in – the engineers who almost understand* the experimental/statistical approach to A/B testing, statistician purists who sing about the superiority of A/B testing, and the practical coders who just ask for something that is simple, intuitive, and works towards the end goal of making money on their site. It’s the interplay between these folks that I found interesting (and entertaining). Full disclosure – I’m not a master statistician but have plenty of experience in experimental design, A/B testing, and general statistics. I am not by any stretch a website optimizer or an engineer.

At the end of the day, the majority of those who are interested in testing – and I’m not talking about Google or Bing but rather the rest of the world – want something that works very well and that converges on an optimal or close-to-optimal solution quickly. Maybe we should even get fuzzy with the term “optimal” by saying it means extracting maximum price/conversion/experience AND painlessness/stability/ease of implementation. The main arguments against the A/B testing framework are that, while it is a well-established, experimentally sound methodology, it takes time to execute and collect the data, requires knowledge of statistics to interpret accurately (not to mention to know how long to run the test, how to set it up, and why you don’t just “watch the numbers until it gets significant”), and needs to finish before you can tweak and deploy your “winning” solution. The epsilon greedy algorithm, by contrast, is relatively self-tuning based on the rules given to it, letting it converge on an optimum (assuming a static universe) relatively quickly**.

One big argument against the epsilon greedy strategy is that it can mistakenly optimize based on a temporal correlate or something similar (e.g. your site shows breakfast and dinner ads at 8 a.m. – guess which one gets more clicks? That ad is then considered optimal and keeps getting shown past noon and into the evening until, finally, the dinner ads get enough clicks to flip it – but far later than is optimal). Some reasonable counter-strategies are to reset/retest every X hours, or to decay older clicks relative to newer ones so the “flip” happens faster when such a pattern emerges.
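For the curious, the core of an epsilon greedy strategy really is tiny. Here’s a minimal sketch in Python (my own toy version, not the code from the linked post – all names and numbers are mine):

```python
import random

class EpsilonGreedy:
    """Toy epsilon greedy bandit: explore a random arm with probability
    epsilon, otherwise exploit the arm with the best observed mean reward."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # times each arm has been tried
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))  # explore
        return self.values.index(max(self.values))     # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        # incremental mean: new = old + (reward - old) / n
        # (swapping 1/n for a constant step size would decay older clicks,
        # one of the fixes for the breakfast-ad problem mentioned above)
        self.values[arm] += (reward - self.values[arm]) / n

# toy usage: two ads with hypothetical click rates of 5% and 10%
random.seed(42)
bandit = EpsilonGreedy(n_arms=2, epsilon=0.1)
rates = [0.05, 0.10]
for _ in range(10000):
    arm = bandit.select_arm()
    bandit.update(arm, 1.0 if random.random() < rates[arm] else 0.0)
# the better arm should end up with the lion's share of the traffic
```

Note the self-tuning property: the more an arm wins, the more traffic it gets, with the epsilon slice reserved for continued exploration.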

My take is that if you don’t care about the statistical rigor or experimental soundness angles to optimization – and if you aren’t particularly keen on analyzing every tweak, covariate, and adjustment you make to your website (again, excluding Google and the big boys here, who definitely do care) – then the epsilon greedy algo is worth implementing. That is not a dig by any means – sometimes you care, and sometimes you just want this thing to get rolling as fast as possible and will never revisit it. If you are trying to sell a service to scientists or stats nerds, need the experimental rigor, need to optimize for the long term, expect to run tests at all times, and want to use as much data as possible to craft the ultimate presentation models (or whatever it is you are testing for), then you should probably be using the slower-but-steady A/B testing approach or something similar. As with most things – use the tool that meets your goals, but figure out your goals ahead of time for optimal tool selection.

In the end, I feel like the debate around which method to use consists of folks discussing and exploring some of the pros and cons of each approach without enumerating the actual uses. They mistakenly assume that both are used for exactly the same reason. While this is true at the broadest level (optimizing between choices), the actual reasons for using one over the other are vastly different. Hatchet versus lathe – both cut wood, but which one does it better?

* I say “almost” because in the discussions, many of them fail to point out simple errors others are making in assumptions. If they were statistically savvy engineers they would say things like “you should reset your test metrics every time you make a change, and this is true whether you use the epsilon greedy strategy or A/B testing”.

** I’m ignoring the cases where the groups have disgustingly similar conversion rates.

Here are some articles and reference pieces for your perusal:

  1. Original post: “20 lines of code that will beat A/B testing every time”
  2. Wikipedia article on the multi-armed bandit problem and the concept of the epsilon greedy algorithm
  3. Blog on why bandit algorithms are better than A/B testing
  4. Discussion and debate around the original article on Y Combinator
  5. Blog on some hard-knocks lessons around A/B testing in general
  6. Blog summarizing a cool article by Microsoft Bing folks on weird outcomes experienced with online testing
  7. Actual MSFT article on weird A/B outcomes explained (PDF)
Apr 13, 2012

In dealing with thoughtful people who have either never taken a stats class or have not been students of the discipline for many years, I have often run across a few interesting false premises or areas of ignorance. One worth mentioning in passing (and this falls into the paradigm of a little knowledge being a bad thing) is the belief that to run any kind of experiment you can only manipulate one variable. This belief falls squarely in the “I’m going to tell you something right now to help you understand a general concept, but later you will be told to forget it because it is false” camp, but since most folks only take that first stats class they never get to the “it is false… and here is why”. For those of you in that camp who are waiting for me to explain how this insane proclamation can be true, I have some sad news… I’m not going to cover it here. However, there are plenty of great online resources that will clue you in on how this is possible – check out descriptions of regression analysis, ANOVA (but not one-way ANOVA) and MANOVA to get yourself started down the path of complex experimental design.
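To make that concrete, here’s a toy sketch of a 2×2 factorial design – one experiment, two manipulated variables, yielding two main effects plus an interaction. The data and condition names are entirely made up for illustration:

```python
from statistics import mean

# hypothetical scores from ONE experiment manipulating TWO variables
scores = {
    ("old_layout", "old_copy"): [60, 62, 58],
    ("old_layout", "new_copy"): [65, 66, 64],
    ("new_layout", "old_copy"): [70, 71, 69],
    ("new_layout", "new_copy"): [82, 81, 83],
}

cell_means = {cond: mean(vals) for cond, vals in scores.items()}

def marginal_mean(factor, level):
    """Mean of the cell means where the given factor (0 or 1) is at `level`."""
    return mean(m for cond, m in cell_means.items() if cond[factor] == level)

# main effect of each factor: difference of its marginal means
layout_effect = marginal_mean(0, "new_layout") - marginal_mean(0, "old_layout")
copy_effect = marginal_mean(1, "new_copy") - marginal_mean(1, "old_copy")

# interaction: does the copy effect differ across layouts?
interaction = ((cell_means[("new_layout", "new_copy")]
                - cell_means[("new_layout", "old_copy")])
               - (cell_means[("old_layout", "new_copy")]
                  - cell_means[("old_layout", "old_copy")]))
```

A full ANOVA adds significance tests on top of these effects, but the point stands: nothing about experimentation limits you to one manipulated variable.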

I specifically call out “thoughtful people” in the opening paragraph because these folks tend to make quick associations, recognize patterns easily, and apply knowledge from other sources to solve or contemplate current concerns. This, at times, can be a crutch due to the misapplication of a concept or set of rules. Stats tend to be tricky, and especially vex those who are mathematically inclined but who don’t have the full statistical picture. While the discipline of statistics isn’t exactly math, it relies on sometimes complex mathematical functions quite a bit, and as such folks like engineers tend to think about statistics with the mindset of applying the rigid rules they would otherwise apply to mathematical processes. Silly engineers. An interesting observation I made while in grad school was that the majority of us who were stats savvy were actually math-poor. I have never taken calculus, and the last time I really used complex algebra was high school. Statistics as a field has a lot of essentially black-box processes that we let computers munge through while we get on with the business of interpreting the results. This can and does lead to unintentional misuse, but also lets us graduate in 4-6 years rather than 10. That said, I am a math nerd wanna-be… one day I’ll understand the full picture.

Now that I have waxed philosophical on the mystery and magic of statistics (or the misunderstanding of what statistics, as a discipline, actually is), I’ll get to the meat of the post: nonsignificant interpretations of statistically significant results. It’s pretty dead simple when you break it down – there are two classes of significance: statistical and practical. Statistical significance is simple, rules-based, math-based, and straightforward – take the data, run your test, and see what the math tells you. Practical significance is squishy, relative, and, as such, confusing to those not used to having to add in a “furthermore” after calculating a mathematical outcome. Practical significance refers to the cost/benefit of the statistically significant outcome. “What?” I hear you say… bear with me (or is it bare… looked it up, bear it is). Say I ran an extremely well designed experiment (perfect sampling, impeccable design, in a method so perfect that nobody could fault it in any way) whereby one set of children was taught math in a novel way and another set was taught in a traditional way. The question is whether the novel method produces significant improvement on the final math test in the course. Again, remember there is nothing wrong with the methodology of the design, sampling, etc. After the test is run, it is found that, lo and behold, children in the “novel technique” group score significantly higher on their tests. Champagne corks are popped, a new era has dawned, and math instruction is changed forever. Right? Well, the first thing any discerning statistician will ask (once she is convinced that the design and test were valid and appropriate – and damn it, for this thought experiment I’m telling you they were!) is: what was the effect size? To keep it simple, let’s ask “what was the difference in the average scores between the groups?” This is half of the equation (and sometimes enough to answer the greater question at large). The other half of the equation (sometimes irrelevant) is “what was the cost” of this new and amazing technique?

The importance of the first question is to make a judgment about the actual expected improvement one will see when employing this new technique. If the average difference seen was 15 points on a 100-point test, we may determine that it is a nice, healthy lift in scores. If the result was a meager 2 points (it is absolutely possible to have a small but significant result – parametric statistical tests are based around standard deviations of scores, so a test where the scores of a group “clump together” can have small absolute differences that are still significant), we may not be as impressed. Two points… hmm… is the cost of the alternate method worth it for 2 points? This is where we get into the practical. How much does it cost (where “cost” can have lots of definitions)? If, to apply the new technique, you have to retrain all math teachers for 50 hours and buy special course materials at $55 per student… do the math. It may be determined that while the technique improves student scores, it’s just too damn costly to implement for what you get out of it. In other words, it’s impractical – statistically significant, but practically not worthwhile. One is mathematically derived, the other is not. As an opposite example (where a very small difference still has a major impact), imagine Google running a UX experiment whereby they significantly increase ad click rates by one-tenth of a percent by changing the font from X to Y. Considering the change is trivial and the impact is an extra ten million dollars (assuming $10B a year in ad revenue), this tiny effect is not only statistically significant but probably practically significant as well. The real point here is that finding a statistically significant result is not the last step in your experimental quest – it’s somewhere in the middle, as it lets you get to the next step of asking “so what?” The follow-up questions centering on the practical implications are essential to deciding whether the findings fall into the camp of “OMG” or “BFD”.
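To put toy numbers on the 2-point scenario (hypothetical scores and a hand-rolled helper, purely for illustration), the standard move is to report an effect size like Cohen’s d alongside the raw difference:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Standardized effect size: mean difference over the pooled std dev."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (((na - 1) * stdev(group_a) ** 2
                   + (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2))
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var)

# hypothetical test scores: a 2-point lift with tightly clumped scores
traditional = [70, 71, 69, 70, 72, 70]
novel = [72, 73, 71, 72, 74, 72]

raw_difference = mean(novel) - mean(traditional)  # the 2-point lift
d = cohens_d(traditional, novel)  # large d, because the spread is tiny
```

Here the standardized effect is large precisely because the scores clump together – which is exactly why a statistically impressive result can still flunk the “is it worth $55 per student” question.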

Dec 18, 2011

When I was in graduate school hustling to get my dissertation data in order, it took me over a year to collect everything I needed. Granted, the research took place via both an alumni survey and in a social-services organization, with each posing its own set of challenges. Once I had the data, it was entered into spreadsheets and loaded into SPSS and AMOS, where I ran my analyses after a bit of cleaning. Today in my work we have an A/B testing platform that allows certain experiments to be run by changing configuration files in real time on the live system. For those of you out of the loop, A/B testing is the term used for running split-test experiments (traditionally in marketing). For me it’s a tool to manipulate a very small percentage of user experiences in order to measure the impact of potential system-wide changes (visual or algorithmic). If the test works, the change gets rolled out to everyone. If the test fails, we try something else.

When I was collecting data in grad school, one of my primary concerns was making sure I had a large enough sample so that, after throwing out mistakes and goofballs (always have a few special questions to detect those who blindly fill in the bubbles), I would have enough data points to detect differences in the populations. Additionally, one problem with survey techniques (and most experimental designs in the field) is that you never know if the people who respond are actually representative of the population you are trying to measure. To use a lame example, you may want to measure people of all dispositions, but it may be that only really happy people answer your survey, and that skews the results in some particular way. The beauty of the online experiments I am running is that users often don’t know they are in a test, or that tests even exist. This cuts down on the goofball and selective-respondent issues. Also, in the online world, getting the needed sample size is the least of my worries. In fact, I can gather enough data in 5 minutes to fuel 10 dissertations. My biggest concern falls in the representativeness category – is my sample representative, and how can I ensure it is?

First, by gathering a lot of data I am more likely to approximate the total population of users. A question I am often asked is how many people need to be in the study for it to be valid or for us to collect enough data. The answer is not straightforward (it depends on the effect size) and I would argue that it doesn’t actually matter because we get responses from thousands of users every hour. In other words, I could take the time to figure out the minimum necessary number of unique users needed for a high degree of confidence, but in the time it takes me to do the calculation we will have gathered more than that many users’ responses to the test in our logs.  No, my biggest concern is not the number of users needed. My concern is actually around temporal effects.
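For reference, the calculation I’m waving at is the standard normal-approximation sample size formula for comparing two proportions. A sketch (my own helper; the defaults assume a two-sided alpha of 0.05 and roughly 80% power):

```python
from math import ceil, sqrt

def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate users needed per group for a two-proportion z-test."""
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * pooled * (1 - pooled))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# hypothetical: detect a lift from a 5% to a 6% conversion rate
needed = n_per_group(0.05, 0.06)  # on the order of eight thousand per group
```

At thousands of responses per hour, a number like that really is gathered faster than the arithmetic takes – which is the point: effect size, not raw volume, drives the requirement.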

Seasonal behaviors are well known to the public (retail sales soar in late November, etc.). What is less well known is that mobile phone application usage shows regular temporal fluctuations by hour and day of week. The busiest time is in the evening, and weekends have higher traffic volume than weekdays. I don’t know if the makeup of weekend users matches weekday users, or if 7am users are similar to 9pm users. Because of this, I want to cover a good swath of time when running my tests, so that the effects of day of week and time of day inform the overall results evenly. Think of it like this: if I run an experiment for five minutes because that gets me 10x the data I will need, how can I be sure that it was a “good” or “representative” five minutes? Is 10:00 – 10:05 on Sunday night the same as 3:15 – 3:20 on Friday afternoon? The only way to know is to test those time periods after collecting the data, but that is beside the point. Until I know better, my belief is that a proper A/B test should run against a small percentage of randomly selected users for 28 days. This gives several daily cycles and four examples of each day of the week. The good news is you can run several tests on incredibly small populations over the month and still get a heap of data for each. The bad news is you have to wait a while to answer your questions. As I run more tests I may find that 28 days never gives better data than 14 or 7 or any given 10-minute period. Until then, I will stick with the slow-and-steady methodology. I’d love to hear others’ experiences regarding appropriate time periods for running online experiments.
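Testing those time periods after the fact is simple enough to sketch. Here’s a toy helper (names and data are mine, purely illustrative) that buckets logged responses by hour of day or day of week and compares rates:

```python
from collections import defaultdict

def rate_by_bucket(events, bucket_fn):
    """Conversion rate per bucket, where events are (hour, converted) pairs."""
    shown = defaultdict(int)
    converted = defaultdict(int)
    for hour, conv in events:
        bucket = bucket_fn(hour)
        shown[bucket] += 1
        converted[bucket] += int(conv)
    return {b: converted[b] / shown[b] for b in shown}

# toy data: 28 days of hourly logs where only evening users "convert"
events = [(h, h % 24 in (20, 21)) for h in range(24 * 28)]

by_hour = rate_by_bucket(events, lambda h: h % 24)            # hour-of-day slices
by_weekday = rate_by_bucket(events, lambda h: (h // 24) % 7)  # day-of-week slices
```

If slices like these disagree wildly, a five-minute sample was never going to be representative, no matter how many users it captured.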