Oct 07 2013

When taking on an analytics project or designing a reporting system (dashboard or otherwise), a core component of superior execution is properly understanding the question(s) the vehicle is expected to answer. This may seem like an obvious statement, but it is amazing how often metrics are chosen for convenience rather than impact. Additionally, dashboards and reports are often (at least initially) put together by individuals with little training in design and business reporting. A monkey can make a graph, but it takes a bit of thought and planning to make something impactful. I would argue that the state of business intelligence in general suffers from this issue – people undervalue the opportunities for using data to make great business decisions because they have learned that the data available to them is not useful for doing so. Instead of insisting that the metrics and reports be useful for business decisions, they write off the full potential of the data and go back to the inefficiencies of traditional gut-based decision making. What they fail to realize, or are not made aware of, is the wide variety of available data that is not being used. Empowering decision makers to utilize data is a core purpose of an analyst.

While assigning fault for the poor state of BI affairs isn’t particularly helpful, it’s worth noting that it is a systemic issue based on the past delivery of inferior metrics and reports coupled with limited decision-making timeframes. It can also be compounded by general ignorance of what data is available. The analyst’s job is to right these wrongs and retrain the organization in how data is approached, assembled, and utilized for business decisions. This reorganization begins with the most basic step in the analytics chain: defining the right questions.

Question definition is key because all further analytical planning stems from it. Question definition, while the job of the analyst, requires input from the consumers of the information. Interviewing those who will use the analyst’s output is key to deciding what any given analytical product will contain, including how many metrics are needed, in what format, and how frequently they will need to be updated. The unsophisticated organization derives metrics by default, in a post-hoc manner, from whatever data happens to have been collected. This approach is not likely to have the impact that a more carefully planned set of information tailored to the business needs will.

Additionally, some decision makers will believe they know exactly what data they want and need, and it is important that the analyst probe to make sure this is correct. Finding out what a particular metric will be used for, or why a specific piece of information is useful, can uncover unintended interpretation mistakes (e.g. when a unique ID is not indicative of a unique user due to cookie expiration). It is safe to say that while business owners often do not understand the data used to create particular metrics, they often have strong opinions about what the metrics mean. It is the job of the analyst to educate and to redefine metrics around these shortcomings. Furthermore, the analyst should be aware of the myriad data sources available for creating metrics, and should guide the business owner through discovery. This is a major reason it is critical to get the BI consumer/business decision maker to talk about the questions they are trying to answer rather than to expect them to rattle off a full list of needed metrics. The analyst defines the metrics based on the needs of the business owner. It is crucial that the analyst take an active, participatory role in the design stage rather than a passive “fast food order” style of design. You are a data educator – act like one.

In closing, there are a number of additional steps to defining the mechanism necessary for answering the questions settled upon. Some questions require multiple metrics, new data collection, derivative metrics (e.g. time between purchases is derived from last purchase timestamp minus previous purchase timestamp), or a re-statement of the question due to limitations in available data. This layer of the design lies in between the initial question definition and the visual display of data, but again is key to successful execution of an analytic output. The analyst is an artist, a teacher, a designer, a data counselor, and a storyteller as well as being the one who knows the data. You can’t design an output mechanism if you don’t know what the input question is.
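To make the idea of a derivative metric concrete, here is a minimal Python sketch of the "time between purchases" example. The function name and the sample timestamps are made up for illustration; real purchase data would come from whatever system the organization actually uses.

```python
from datetime import datetime


def time_between_purchases(timestamps):
    """Return the gaps between consecutive purchases, oldest to newest."""
    ordered = sorted(timestamps)
    # Each gap is a later purchase timestamp minus the one just before it.
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical purchase history for a single customer.
purchases = [
    datetime(2013, 9, 1, 12, 0),
    datetime(2013, 9, 15, 9, 30),
    datetime(2013, 10, 1, 18, 45),
]

gaps = time_between_purchases(purchases)

# The most recent gap is "last purchase minus previous purchase", in days.
latest_gap_days = gaps[-1].total_seconds() / 86400
print(f"Days between the last two purchases: {latest_gap_days:.1f}")
```

Note that the raw data only holds timestamps; the metric the business owner cares about exists only after the analyst derives it.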

Question –> translates to data –> summarized by metrics –> interpreted by analysis –> converted to display –> answers or informs question

Nov 25 2012

So Nate Silver is the stats nerd of the year for his great (or lucky, if you hate science) methodology around poll aggregation and the poll weighting algorithm he employed to predict the outcome of the recent national elections. Congratulations, Nate. If I didn’t live in a country with Byzantine banking laws, I would have made a tidy sum using your legwork (among others – I firmly believe in leveraging the wisdom of crowds of experts) to invest on “Obama to win” via the event-based market InTrade. I haven’t been able to find any apologies by the demonizers who suggested Nate was just another political hack (like them?) who was rooting for the wrong team and trying to hide it behind some sort of magical thinking in the guise of science, but I can’t say I looked too hard.

The disappointing part of the whole Nate-Silver-predicting-the-elections bit lies in the constant misinterpretation of what Nate actually did to come by his numbers, owing to the general public’s pseudo-understanding of statistics. The beauty of it is that the press he received both before and after the election has elevated the role of data in decision making – even messy social data like poll results (essentially surveys, with all their inherent issues). The age-old “gut feeling” as the sole driver of decision making (i.e. guessing) is coming under needed scrutiny in an age where having current and historical information is finally possible. Those who fail to incorporate data, especially data that is readily available or easily gathered, will be left behind or, when successful in their guesses (expertise does have its place), will be less efficient.

It is my firm opinion that gut feeling is a garnish best placed on top of data-driven analysis, where the depth of gut needed is (roughly) inversely proportional to the data available. Nate doesn’t use gut feelings; he uses data, which can then be handed to those responsible for making decisions.

So how does Nate Silver make my job easier? When Jon Stewart asked him on the Daily Show what it would have meant if his model had been wrong, Nate responded, “It would have been bad, I think, because for some reason 538 became invested with this symbolic power, and you know symbolic power isn’t particularly rational, right, but it became this symbol for people who were believing in hey, let’s look at the polls. Let’s do some empirical research, right.” Empirical research was shown to best guts. This research was contrary to a huge contingent of, not surprisingly, biased observers but was shown to be superior to all other estimations, guesses, scrying stone proclamations, etc., even those made by individuals with a vested interest in Obama winning. His. Model. Won. Data won. Because data won out over individual “expert” opinion in such a highly scrutinized, over-thought, expensive contest, my job got easier. The hugely symbolic power of that specific use of data serves as a powerful example of what data can do. When talking to organizations about the value of data, the value of quality data, and the usefulness of measurement to drive business decisions, I now have an example that everyone knows and, in some small way, understands. Am I comparing myself to Nate Silver? Not particularly – we come from very different backgrounds, education, approaches, etc. But one thing is certain – he has just made the human interaction part of my job a lot easier – that part where I am convincing a client to invest in data resources, to care about data quality, data completeness, and data-driven decision making. Thanks, Nate.

Jul 30 2012

A friend sent me a great blog post (see #1 in the list at the end of this post) about testing that has been buzzing around (and should be read and debated if you care about such things even a little bit). The post introduces (as in “brings to our attention” not as in “invents”) a method of easily coding an epsilon greedy strategy for testing web (or whatever) optimization and claims that it is superior to the well-established standby of A/B testing (oooh, thems fightin words!). This post has inspired a number of responses by folks who run A/B tests, folks who optimize and test websites, and computer nerds interested in arguing about this type of stuff in general.

The normal array of folks weigh in – the engineers who almost understand* the experimental/statistical approach to A/B testing, statistician purists who sing about the superiority of A/B testing, and the practical coders who just ask for something that is simple, intuitive, and works towards the end goal of making money on their site. It’s the interplay between these folks that I found interesting (and entertaining). Full disclosure – I’m not a master statistician but have plenty of experience in experimental design, A/B testing, and general statistics. I am not by any stretch a website optimizer or an engineer.

At the end of the day, the majority of those who are interested in testing – and I’m not talking about Google or Bing but rather the rest of the world – want something that works very well and that converges on an optimal or close to optimal solution quickly. Maybe we should even get fuzzy with the term “optimal” by saying it means exacting maximum price/conversion/experience AND painlessness/stability/implementation ease. The main arguments against the A/B testing framework are that while it is a well-established, experimentally sound methodology, it takes time to execute and collect the data, requires knowledge of statistics to accurately interpret (not to mention to know how long to run the test, how to set up the test, and to understand why you don’t just “watch the numbers until it gets significant”), and needs to finish before you can tweak and deploy your “winning” solution. The epsilon greedy algorithm is relatively self-tuning based on the rules given to it, so it gets close to an optimum (assuming a static universe) relatively quickly**. One big argument against the epsilon greedy strategy is that it can mistakenly optimize based on a temporal correlate or something similar (e.g. your site shows breakfast and dinner ads at 8 a.m. – guess which one gets more clicks? That ad is then considered optimal and keeps getting shown past noon and into the evening until, finally, the dinner ads get enough clicks to flip it – but way later than is optimal). Maybe some good strategies are to reset/retest every X hours, or to decay older clicks against newer ones for a faster “flip” when such a pattern emerges.
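For readers who want to see the moving parts, here is a rough Python sketch of an epsilon greedy chooser with the optional "decay older clicks" tweak. It is not the code from the linked post; the class name, the `decay` parameter, the click rates, and the breakfast/dinner simulation are all assumptions made up for illustration.

```python
import random


class EpsilonGreedy:
    """Show the best-looking option most of the time; explore at random otherwise."""

    def __init__(self, arms, epsilon=0.1, decay=1.0, rng=None):
        self.arms = list(arms)
        self.epsilon = epsilon  # fraction of traffic spent exploring
        self.decay = decay      # 1.0 keeps all history; < 1.0 fades old data
        self.rng = rng or random.Random()
        self.shows = {arm: 0.0 for arm in self.arms}
        self.clicks = {arm: 0.0 for arm in self.arms}

    def rate(self, arm):
        # Arms never shown get an optimistic 1.0 so each is tried at least once.
        if self.shows[arm] == 0:
            return 1.0
        return self.clicks[arm] / self.shows[arm]

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)   # explore
        return max(self.arms, key=self.rate)    # exploit

    def record(self, arm, clicked):
        # Fade every arm's history a little before adding the new observation.
        for a in self.arms:
            self.shows[a] *= self.decay
            self.clicks[a] *= self.decay
        self.shows[arm] += 1
        if clicked:
            self.clicks[arm] += 1


def simulate(decay, seed=0, steps_per_phase=2000):
    """Count how often the dinner ad is shown during a simulated evening."""
    rng = random.Random(seed)
    bandit = EpsilonGreedy(["breakfast", "dinner"], epsilon=0.1, decay=decay, rng=rng)
    # Made-up click rates: breakfast wins in the morning, dinner in the evening.
    phases = [
        {"breakfast": 0.30, "dinner": 0.05},
        {"breakfast": 0.05, "dinner": 0.30},
    ]
    evening_dinner_shows = 0
    for phase_index, click_rates in enumerate(phases):
        for _ in range(steps_per_phase):
            ad = bandit.choose()
            bandit.record(ad, rng.random() < click_rates[ad])
            if phase_index == 1 and ad == "dinner":
                evening_dinner_shows += 1
    return evening_dinner_shows


print("Evening dinner-ad shows, no decay:  ", simulate(decay=1.0))
print("Evening dinner-ad shows, with decay:", simulate(decay=0.99))
```

Running the simulation with and without decay is a quick way to see how long the "flip" takes under each setting, at least for these invented click rates.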

My take is that if you don’t care about the statistical rigor or experimental soundness angles to optimization – and if you aren’t particularly keen on analyzing every tweak, covariate, and adjustment you make to your website (again, excluding Google and the big boys here who definitely do care), then the epsilon greedy algo is worth implementing. That is not a dig by any means – sometimes you care and sometimes you want this thing to get rolling as fast as possible, you will never revisit it, etc. If you are trying to sell a service to scientists or stats nerds, need the experimental rigor, need to optimize for the long term, expect to run tests at all times, and want to use as much data as possible to craft the ultimate presentation models (or whatever it is you are testing for) then you should probably be using the slower-but-steady A/B testing approach or something similar. As with most things – use the tool that meets your goals, but figure out your goals ahead of time for optimal tool selection.

In the end, I feel like the debate around which method to use consists of folks who are discussing and exploring some of the pros and cons of each approach without enumerating the actual uses. They mistakenly assume that both are used for exactly the same reason. While this is true at the broadest level (optimizing between choices), the actual reasons behind the use of one over the other are vastly different. Hatchet versus lathe – both cut wood, but which one does it better?

* I say “almost” because in the discussions, many of them fail to point out simple errors others are making in assumptions. If they were statistically savvy engineers they would say things like “you should reset your test metrics every time you make a change, and this is true whether you use the epsilon greedy strategy or A/B testing”.

** I’m ignoring the cases where the groups have disgustingly similar conversion rates.

Here are some articles and reference pieces for your perusal:

  1. Original post “20 lines of code that will beat A/B testing every time”
  2. Wikipedia article on Multi Armed Bandit problem  and the concept of the epsilon greedy algorithm
  3. Blog on why Bandit algos are better than A/B testing
  4. Discussion and debate around the original article on y-combinator
  5. Blog on some hard-knocks and learning around A/B testing in general
  6. Blog summarizing a cool article by Microsoft Bing folks on weird outcomes experienced with online testing
  7. Actual MSFT article on weird A/B outcomes explained (PDF)

Apr 13 2012

In dealing with thoughtful people who have either never taken a stats class or have not been students of the discipline for many years, I have often run across a few interesting false premises or areas of ignorance. One worth mentioning in passing (and this falls into the paradigm of a little knowledge being a bad thing) is the belief that to run any kind of experiment you can only manipulate one variable. This belief falls squarely in the “I’m going to tell you something right now to help you understand a general concept, but later you will be told to forget it because it is false” camp, but since most folks only take that first stats class they never get to the “it is false… and here is why”. For those of you in that camp who are waiting for me to explain how this insane proclamation can be true, I have some sad news… I’m not going to cover it here. However, there are plenty of great online resources that will clue you in on how this is possible – check out descriptions of regression analysis, ANOVA (but not one-way ANOVA) and MANOVA to get yourself started down the path of complex experimental design.

I specifically call out “thoughtful  people” in the opening paragraph because these folks tend to make quick associations, recognize patterns easily, and apply knowledge from other sources to solve or contemplate current concerns. This, at times, can be a crutch due to the misapplication of a concept or set of rules. Stats tend to be tricky, and especially vex those who are mathematically inclined but who don’t have the full statistical picture. While the discipline of statistics isn’t exactly math, it relies on sometimes complex mathematical functions quite a bit, and as such folks like engineers tend to think about statistics with the mindset of applying the rigid rules they would otherwise apply to mathematical processes. Silly engineers. An interesting observation I made while in grad school was that the majority of us who were stats savvy were actually math-poor. I have never taken calculus, and the last time I really used complex algebra was high school. Statistics as a field has a lot of essentially black-box processes that we let computers munge through while we get on with the business of interpreting the results. This can and does lead to unintentional misuse, but also lets us graduate in 4-6 years rather than 10. That said, I am a math nerd wanna-be… one day I’ll understand the full picture.

Now that I have waxed philosophical on the mystery and magic of statistics (or the misunderstanding of what statistics, as a discipline, actually is) I’ll get to the meat of the post: nonsignificant interpretations from statistically significant results. It’s pretty dead simple when you break it down – there are two classes of significance: statistical and practical. Statistical significance is simple, rules-based, math-based, and straightforward – take the data, run your test, and see what the math tells you. Practical significance is squishy, relative, and as such, confusing to those not used to having to add in a “furthermore” after calculating a mathematical outcome. Practical significance refers to the cost/benefit of the statistically significant outcome. “What?” I hear you say… bear with me (or is it bare… looked it up, bear it is). Say I ran an extremely well designed experiment (perfect sampling, impeccable design, in a method so perfect that nobody could fault it in any way) whereby one set of children was taught math in a novel way and another set was taught in a traditional way. The question is whether the novel method produces significant improvement on the final math test in the course. Again, remember there is nothing wrong with the methodology of the design, sampling, etc. After the test is run, it is found that, lo and behold, children in the “novel technique” group score significantly higher on their tests. Champagne corks are popped, a new era has dawned, and math instruction is changed forever. Right? Well, the first thing any discerning statistician will ask (once she is convinced that the design and test were valid and appropriate, and damn you I’m telling you for this thought experiment, they were!) is “what was the effect size?” To keep it simple, let’s ask “what was the difference in the average scores between the groups?” This is half of the equation (and sometimes enough to answer the greater question at large).
The other half of the equation (sometimes irrelevant) is “what was the cost” of this new and amazing technique?

The importance of the first question is to make a judgment about the actual expected improvement one will see when employing this new technique. If the results were that the average difference seen was 15 points on a 100 point test, we may determine that it is a nice, healthy lift in scores. If the results were a meager 2 points (it is absolutely possible to have a small but significant result – parametric statistical tests are based around standard deviations of scores, so a test where scores of a group “clump together” can have small absolute differences that are still significant) we may not be as impressed. Two points… hmm… is the cost of the alternate method worth it for 2 points? This is where we get into the practical. How much does it cost (where “cost” can have lots of definitions)? If to apply the new technique you have to retrain all math teachers for 50 hours and buy special course materials @ $55 per student… do the math. It may be determined that while the technique improves student scores, it’s just too damn costly to implement for what you get out of it. In other words, it’s impractical – statistically significant, but practically not worthwhile. One is mathematically derived, the other is not. As an opposite example (where a very small difference still has a major impact) imagine Google running a UX experiment whereby they significantly increase ad click rates by one-tenth of a percent by changing the font from X to Y. Considering the change is trivial and the impact is an extra ten million dollars assuming $10B a year in ad revenue, it’s obvious that this tiny effect is not only statistically significant, but probably practically significant as well. The real point here is that finding a statistically significant result is not the last step in your experimental quest – it’s somewhere in the middle, as it lets you get to the next step of asking “so what”.
The follow up questions centering on the practical implications are essential to the decision around whether the findings fall into the camp of “OMG” or “BFD”.
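Here is a small Python sketch of the arithmetic behind both examples. Everything numeric except the 2-point gap, the 50 training hours, the $55 materials cost, and the $10B revenue figure is invented for illustration (the scores, the headcounts, the hourly rate), and the significance check is a simple large-sample z-test rather than whatever test a real study would call for.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

# Made-up, tightly clumped test scores: 100 children per group, 2-point gap.
novel = [78, 79, 80, 81, 82] * 20
traditional = [76, 77, 78, 79, 80] * 20

# Statistical significance: large-sample z-test on the difference in means.
difference = mean(novel) - mean(traditional)
standard_error = sqrt(variance(novel) / len(novel)
                      + variance(traditional) / len(traditional))
z_score = difference / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z_score)))

# Practical significance: what does the lift cost? (hypothetical district)
teachers = 10          # assumption
hourly_rate = 40       # assumption, dollars per training hour
students = 300         # assumption
training_hours = 50    # from the example in the text
materials_cost = 55    # dollars per student, from the example in the text

rollout_cost = teachers * training_hours * hourly_rate + students * materials_cost
cost_per_point = rollout_cost / students / difference

print(f"Difference: {difference} points, p = {p_value:.4f}")
print(f"Rollout cost: ${rollout_cost:,}, or ${cost_per_point:.2f} per student per point")

# The opposite case: a tiny lift on a huge base.
annual_ad_revenue = 10_000_000_000           # $10B, as assumed in the text
extra_revenue = annual_ad_revenue // 1000    # one-tenth of a percent
print(f"Extra ad revenue: ${extra_revenue:,}")
```

The test statistic answers "is there a difference?"; the cost lines answer "so what?", and only the second depends on who is paying.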

Dec 18 2011

When I was in graduate school hustling to get my dissertation data in order, it took me over a year to collect everything I needed. Granted, the research took place via both an alumni survey and in a social-services organization, with each posing its own set of challenges. Once I had the data, it was entered into spreadsheets and loaded into SPSS and AMOS, where I ran my analyses after a bit of cleaning. Today in my work we have an A/B testing platform that allows certain experiments to be run by changing configuration files in real time on the live system. For those of you out of the loop, A/B testing is the term used for running split-test experiments (traditionally in marketing). For me it’s a tool to manipulate a very small percentage of user experiences in order to measure the impact of potential system-wide changes (visual or algorithmic). If the test works, the change gets rolled out to everyone. If the test fails, we try something else.

When I was collecting data in grad school, one of my primary concerns was making sure I had a large enough sample to ensure that after throwing out mistakes and goofballs, (always have a few special questions to detect those who blindly fill in the bubbles), I would have enough data points to detect differences in the populations. Additionally, one problem with survey techniques (and most experimental designs in the field) is that you never know if the people who respond are actually representative of the population you are trying to measure. To use a lame example, you may want to measure people of all dispositions, but it may be that only really happy people answer your survey, and that skews the results in some particular way. The beauty of the online experiments I am running is that the user often doesn’t know they are in a test or that tests even exist.  This cuts down on the goofball and selective respondent issues. Also, in the online world, getting the needed sample size is the least of my worries. In fact, I can gather enough data in 5 minutes to fuel 10 dissertations. My biggest concern falls in the representativeness category – is my sample representative, and how can I help this?

First, by gathering a lot of data I am more likely to approximate the total population of users. A question I am often asked is how many people need to be in the study for it to be valid or for us to collect enough data. The answer is not straightforward (it depends on the effect size) and I would argue that it doesn’t actually matter because we get responses from thousands of users every hour. In other words, I could take the time to figure out the minimum necessary number of unique users needed for a high degree of confidence, but in the time it takes me to do the calculation we will have gathered more than that many users’ responses to the test in our logs.  No, my biggest concern is not the number of users needed. My concern is actually around temporal effects.
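To show why the answer "depends on the effect size", here is a rough Python sketch of a common textbook approximation for the number of users needed per group when comparing two conversion rates. The function name, the conversion rates, and the defaults (5% significance, 80% power) are assumptions for illustration, not the calculation used on any real platform.

```python
from math import ceil
from statistics import NormalDist


def users_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect p_control vs p_variant.

    Uses the normal-approximation formula for a two-sided test of two
    proportions; treat the result as a ballpark, not an exact requirement.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    spread = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_control - p_variant
    return ceil((z_alpha + z_power) ** 2 * spread / effect ** 2)


# Hypothetical conversion rates: a big lift versus a subtle one.
big_lift = users_per_group(0.10, 0.12)
small_lift = users_per_group(0.10, 0.105)

print(f"Users per group for 10% vs 12%:   {big_lift:,}")
print(f"Users per group for 10% vs 10.5%: {small_lift:,}")
```

The smaller the difference you hope to detect, the more users you need, which is why the question has no single answer. With thousands of responses arriving every hour, either number is reached quickly.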

Seasonal behaviors (retail sales soar in late November, etc.) are well known to the public. What is less well known is that mobile phone application usage shows regular temporal fluctuations by hour and day of week. The busiest time is in the evening, and weekends have higher traffic volume than weekdays. I don’t know if the makeup of weekend users matches weekday users, or if 7am users are similar to 9pm users. Because of this I want to get a good swath of time when running my tests, so that the effects of day of week and time of day inform the overall results evenly. Think of it like this: if I run an experiment for five minutes because it gets me 10x the data I will need, how can I be sure that it was a “good” or “representative” five minutes? Is 10:00 – 10:05 on Sunday night the same as 3:15 – 3:20 on Friday afternoon? The only way to know is to test those time periods after collecting the data, but that is beside the point. Unfortunately, until I know better, my belief is that a proper A/B test should run against a small percentage of randomly selected users for 28 days. This gives several daily cycles and four examples of each day of the week. The good news is you can run several tests on incredibly small populations over the month and still get a heap of data for each. The bad news is you have to wait a while to answer your questions. As I run more tests I may find that 28 days never gives better data than 14 or 7 or any given 10 minute period. Until then, I will stick with the slow and steady methodology. I’d love to hear others’ experiences regarding appropriate time periods for running online experiments.
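As a sketch of what "a small percentage of randomly selected users for 28 days" could look like in code, here is one common approach in Python: hash the user ID to assign users to a test, and count how many times each weekday falls in the test window. The function names, the test name, and the start date are all hypothetical; this is not a description of the platform I actually use.

```python
import hashlib
from collections import Counter
from datetime import date, timedelta


def in_test(user_id, test_name, percent):
    """Place roughly `percent` percent of users in the test, consistently.

    Hashing the test name with the user ID gives a pseudo-random bucket that
    stays the same for a given user across the whole test window.
    """
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000   # buckets 0..9999
    return bucket < percent * 100       # e.g. 1% -> buckets 0..99


def weekday_coverage(start, days):
    """Count how many times each weekday (0=Monday .. 6=Sunday) is covered."""
    return Counter((start + timedelta(days=i)).weekday() for i in range(days))


# Hypothetical test: 5% of users, starting on an arbitrary date.
sampled = sum(in_test(user_id, "new_font", 5) for user_id in range(20_000))
coverage = weekday_coverage(date(2011, 12, 1), 28)

print(f"Users in test: {sampled} of 20,000")
print(f"Times each weekday is covered in 28 days: {dict(coverage)}")
```

Any window whose length is a multiple of 7 days covers each weekday equally, so 7 or 14 days would balance the weekdays too, just with fewer repeats of each.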