Dec 182011

When I was in graduate school hustling to get my dissertation data in order, it took me over a year to collect everything I needed. Granted, the research took place via both an alumni survey and in a social-services organization with each posing its own set of challenges.  Once I had the data it was entered into spreadsheets and loaded into SPSS and AMOS whereby I ran my analyses after a bit of cleaning. Today in my work we have an A/B testing platform that allows certain experiments to be run by changing configuration files in real time on the live system. For those of you out of the loop, A/B testing is the term used for running split test experiments (traditionally in marketing).  For me it’s a tool to manipulate a very small percentage of user experiences in order to measure the impact of potential system-wide changes (visual or algorithmic).  If the test works, the change gets rolled out to everyone. If the test fails, we try something else.

When I was collecting data in grad school, one of my primary concerns was making sure I had a large enough sample to ensure that after throwing out mistakes and goofballs, (always have a few special questions to detect those who blindly fill in the bubbles), I would have enough data points to detect differences in the populations. Additionally, one problem with survey techniques (and most experimental designs in the field) is that you never know if the people who respond are actually representative of the population you are trying to measure. To use a lame example, you may want to measure people of all dispositions, but it may be that only really happy people answer your survey, and that skews the results in some particular way. The beauty of the online experiments I am running is that the user often doesn’t know they are in a test or that tests even exist.  This cuts down on the goofball and selective respondent issues. Also, in the online world, getting the needed sample size is the least of my worries. In fact, I can gather enough data in 5 minutes to fuel 10 dissertations. My biggest concern falls in the representativeness category – is my sample representative, and how can I help this?

First, by gathering a lot of data I am more likely to approximate the total population of users. A question I am often asked is how many people need to be in the study for it to be valid or for us to collect enough data. The answer is not straightforward (it depends on the effect size) and I would argue that it doesn’t actually matter because we get responses from thousands of users every hour. In other words, I could take the time to figure out the minimum necessary number of unique users needed for a high degree of confidence, but in the time it takes me to do the calculation we will have gathered more than that many users’ responses to the test in our logs.  No, my biggest concern is not the number of users needed. My concern is actually around temporal effects.

Seasonal behaviors are well known (retail sales soar in late November, etc.) to the public. What is less well known is that mobile phone application usage behavior shows regular temporal fluctuations by hour and day of week. The busiest time is in the evening, and weekends have higher traffic volume than weekdays. I don’t know if the makeup of weekend users matches weekday users, or if 7am users are similar to 9pm users. Because of this I want to get a good swath of time when running my tests, so that the effects of day of week and time of day inform the overall results evenly. Think of it like this: if I run an experiment for five minutes  because it gets me 10x the data I will need, how can I be sure that it was a “good” or “representative” five minutes? Is 10:00 – 10:05 on Sunday night the same as 3:15 – 3:20 on Friday afternoon? The only way to know is to test those time periods after collecting the data, but that is beside the point. Unfortunately, until I know better, my belief is that a proper A/B test should run against a small percentage of randomly selected users for 28 days. This gives several daily cycles and four example of each day of the week. The good news is you can run several tests on incredibly small populations over the month and still get a heap of data for each. The bad news is you have to wait a while to answer your questions. As I run more tests I may find that 28 days never gives better data than 14 or 7 or any given 10 minute period. Until then, I will stick with the slow and steady methodology. I’d love to hear other’s experiences regarding appropriate time periods for running online experiments.