Dec 02, 2013

While the information gathered about the functioning of, and user interaction with, a product or service can be priceless to multiple organizational parties for making key business decisions, data is often treated as a secondary by-product (exhaust), or as secondary to basic product functionality. In my experience this occurs for a few reasons:

  • Data collection is not necessary for a product or service to function
  • Data collection has no positive effect on real time product performance
  • Data collection often reflects a complex mixture of needs from multiple internal parties, muddying ownership and responsibility


As such, due to market pressure and a lack of data advocacy, products and services can launch and gain initial traction with no data collection mechanisms whatsoever. This creates a false sense of hierarchy regarding the place of data-oriented development within the product. In an agile framework especially, this can be a severe impediment to getting a robust data collection system in place: unless data collection (and its dependencies) is perceived as having equal footing alongside high-ranking product features, it is constantly ranked below the line when man-hours are assigned.


At heart, I am a philosopher, and I encourage business owners to humor me with their own philosophies (or visions) around the product they are creating or the organization they are in charge of. This exploration allows me to understand their overarching positions and expectations and gives me opportunities to design data strategies (i.e. collection, analysis) towards meeting those stated expectations and, most importantly, the unstated future expectations. It is this approach that lets me meaningfully rank order system requirements, anticipate pivots, and produce analytic output in anticipation of its necessity.


System-level thinking around a company vision drives the expectations and needs of business intelligence and also gives solid reasoning around what is needed in each phase of a product lifecycle. A product is derived from the expectations of how to solve the identified problem in the market. The role of data lies in providing information regarding the extent to which the problem exists, whether it is addressed and solved or lessened by the product functions, and whether the nature of the problem changes over time. By defining data needs from the vision level, systems can be designed (green field, retro-fit, refactor, or otherwise) with the flexibility to grow comfortably into the realization of the vision as the product matures. A focus on providing narrow intelligence for an initial product, or relying too heavily on a reporting template, is a sure path to mediocrity and sub-par output.


The vision, when properly evaluated, gives key insight into the minimally viable BI product, the importance of particular kinds of data over others, the importance of data quality for each data source, the extent to which a product must be instrumented, key milestones in the creation of a mature data system, and a number of other basic needs for a successful system of analytics. In other words, the enunciation of the vision allows for an appropriate data strategy to be created and implemented alongside product features. This allows multiple data consumers the opportunity to engage with data throughout the product lifecycle, promotes data-centric thinking amongst owners, and helps engineers and other builders to design systems that include robust data pipelines. It changes the conversation between a data team and engineering from “do you want me to collect that?” to “how do you want me to collect that?”

Jan 02, 2013

The data buzz-phrase of the current century, “Big Data”, is often approached as a magical construct that one might lash oneself to and, like Odin on Yggdrasil, walk away from with great knowledge after a time – maybe just by being near it. The idea is that this toolset is THE way to extract value from your data. I’m not the first to say it, but this is similar to how relational databases have been sold for years, only now the promise extends to unstructured and semi-structured data. Pro tip – you still have to manipulate the data to get anything worthwhile from it, and that assumes you collected the right stuff to begin with.


It’s unfortunate that a lot of people in the organizational position to make investments in data infrastructures, technologies, and tools get stuck playing a game of Mad Libs instead of figuring out what each tool can do and, more importantly, what they need each tool to do to be useful. By that I mean they have a sentence that goes something like “If only I had _____ technology, all my _____ problems would be solved.” On the flip side, companies trying to sell Big Data services love these kinds of decision makers, promising them that “cloud based, big data solutions” solve all data problems. I mean, take any kind of data (structured, unstructured, semi-structured), upload it to the cloud, throw it into HBase, run a map/reduce job against it in Hadoop and BAM! Cool… then what? Cloud storage is infinitely sized, safe, and, depending on how much you rent it for, geo-redundant. Problems solved, right? Or are they?


Let’s back up and start… at the beginning. If you have a business that can potentially generate a lot of data (transactional, operational, etc.), you fall into one of two camps: either you currently have a ton of data you are warehousing/archiving, or you do not have a ton of data (for one or more of several reasons) but soon will, once you instrument your systems to spit out proper logs.


Let’s assume you are in the first camp and have a ton of data. What kind of data have you gathered, and in what format? How much data do you generate every day? Lastly – could you vastly shrink the amount of useful data you gather by applying simple ETL jobs? I’d argue that most organizations (not all) that are looking into big data solutions are doing so very prematurely. The ability to suddenly collect and infinitely store every piece of data your servers generate, the output from your web logs, and all public mentions of your organization on Twitter and Facebook is probably more a curse than a blessing – the concept of cheap, infinite storage promotes an unthinking “dump it in here and we will sort it out later” approach to data collection. It’s true, storage is cheap, but paying developers to pick through the garbage later (often over and over again) is mind-numbingly expensive. A better solution is to structure your data collection intelligently, write ETL jobs that make your data compact and accessible, and let your developers spend their time using the data to improve your business instead of picking through garbage (potentially over and over again).
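To make the idea concrete, here is a minimal sketch of the kind of ETL compaction described above. The log format, field names, and sample lines are illustrative assumptions, not any real system’s layout – the point is only how a simple extract-and-aggregate pass turns verbose raw logs into a compact daily summary:

```python
from collections import Counter

# Hypothetical whitespace-delimited web-log lines:
# "timestamp user_id url status bytes"
RAW_LOGS = [
    "2013-01-02T10:00:01 u1 /home 200 5120",
    "2013-01-02T10:00:05 u2 /search 200 2048",
    "2013-01-02T10:01:00 u1 /search 404 512",
]

def etl(lines):
    """Keep only the fields analysts actually use (day, url, status)
    and roll them up into counts, discarding the verbose raw records."""
    daily_hits = Counter()
    for line in lines:
        ts, user, url, status, _bytes = line.split()
        day = ts.split("T")[0]  # truncate timestamp to the day
        daily_hits[(day, url, status)] += 1
    return daily_hits

summary = etl(RAW_LOGS)
print(summary[("2013-01-02", "/search", "200")])  # 1
```

Three raw lines become three tiny aggregate rows here; at production volume the same pass collapses millions of lines into a table small enough to query cheaply, over and over.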


Now switch to the second camp – no data now, but lots ASAP. What kinds of data can you and should you gather? How should it be structured? What will you do with it? The nature of these questions suggests that trying to choose your tools without an initial grounding in what data you can have and what you will use the tools for makes the choices premature at best. But the experts you talk to may suggest you just start collecting as much as fast as you can, since storage is cheap and…


This “I have a hammer and everything looks like a nail” approach to capturing and deriving value from data (or data exhaust, as it is sometimes called) with a particular tool alone is really shortsighted, and a recipe for expensive failure, as you hire expensive experts to troll through your piles of garbage looking for gold rather than setting up your organization for successful insights ahead of time. Use the current fixation on big data to promote your data strategy and to get developers instrumenting your products and services deeply, in the hopes that you will soon have a high quality data asset that screams out for some tool to tame it. This may be a big data tool like Hadoop, or it may be a set of Perl scripts, or (gasp) an Excel spreadsheet. The point is that Hadoop and the rest are tools to be pointed and fired at specific issues in specific situations. You are not Google, and you probably don’t need the tools Google uses. You do need to be smart about data, which is something the big data buzz has highlighted. The beauty of the current landscape is that if you actually need massive-scale processing that fits the map/reduce paradigm, you can have it. In other words, you are no longer limited (or forced to sample) when you have a large set of data. All the other issues with data quality that have plagued us forever are still present, important (maybe more so), and in need of attention. Don’t be lulled into a false sense of security just because you have a larger bucket for use in panning – you still have to sift through it all to find the gold, IF you captured enough of the right types of data to begin with.

Sep 26, 2011

First off, read this.

So Netflix says to the SEC that churn is not important to them. Except that they didn’t actually say that. They said “the churn metric is a less reliable measure of business performance, specifically consumer acceptance of the service,” meaning that the metric, for them, is broken and therefore should not be used to compare them to others in the marketplace. The cynic would respond “what are you hiding?” but the truth is that they are correct: in their business, churn is so different that trying to compare it across companies is a disservice to the naive public. The information would be misconstrued and therefore should not be revealed. I am generally of the mind that you let the consumer of information decide about its quality, but here I am with Netflix – the consumer is likely to misuse it and ignorantly, accidentally compare it to other types of “churn” (a difficult metric to define to begin with).

In the BI universe, we use KPIs to monitor the progress or success of a product, system, business, etc., but we also use them to compare and benchmark against like products. Pageviews, time on page, click-through rates, etc. are the common bellwethers of awesomeness or supreme suck in the web world. But what happens if you make a website that uses a continuous scrolling method… like, say, an image search results page? Suddenly your pageviews per user drop massively compared to industry standards! I would argue that continuous scrolling image search is superior to the tired paging image search (in fact, so superior that Google ripped off Bing to some degree… a rarity to be sure), but I have heard through the grapevine that one specific search-y company refused to drink the continuous scrolling Kool-Aid due to the impact it would have on third-party web reporting metrics. Sacrifice the user experience for the sake of the KPI. So what does this mean for Netflix vs. the SEC?

When the paradigm changes, it’s often hard to jump out of the traditional KPI rut. Those KPIs are comparable, comfortable, expected (here with NFLX we are talking about quarterly churn, but we could be talking about unique users, time on page, page views, or any other metric). Remember P/E ratio arguments during the Dot Com boom? I find the same issues in my job as a BI manager – we have a product that is an Android widget and someone asks me a question about pageviews – what the hell is a pageview on a widget? Is every time a user focuses on the widget a “pageview”? Are all actions in the widget separate pageviews or part of the same initial pageview? Lastly, (and more importantly) does a pageview count matter or is it (or something similar) only useful when used as an internal metric?

The key, in my opinion, is the use of internal versus comparative metrics. Netflix is saying that giving out churn numbers as they are traditionally calculated is a great way to confuse and freak out their investors, since so many customers “quit” and then rejoin a few months later. The definition of churn is too narrow (“users who quit the service” / “total users”) because a user who quits in January and rejoins in March has technically “churned” in Q1 even though they are now a customer again. As an internal metric, understanding their churn from quarter to quarter makes sense. They might want to (and surely do) offset that by calculating a metric of “sticky churn”, i.e. people who, in the words of Marsellus Wallace in Pulp Fiction, “when you’re gone, you stay gone.” Or even better would be a whole suite of metrics around churn and churn-like behaviors – new, never-before-seen people; people returning after a short break; people returning after a long break; people totally gone from the system as far as we know. Lots of options, nothing perfect, nothing overly clear, and everything confusing to investors who only know how to compare the metrics they know and love within and across companies. I don’t blame Netflix for keeping the numbers to themselves. Of course, it would be nice for them to release a case study on all the cool and weird ways people migrate around their services – not for investors’ sakes, but for my own nerdy curiosity.
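A toy sketch of the distinction, under a hypothetical subscription history (the user names and quarters are made up for illustration): traditional churn flags anyone absent in the next quarter, while “sticky churn” counts only those who never return at all.

```python
# Hypothetical subscription history: user_id -> set of quarters in which
# the user was an active subscriber.
HISTORY = {
    "a": {1, 2, 3, 4},   # never leaves
    "b": {1},            # quits after Q1 and never returns
    "c": {1, 3},         # quits after Q1 but rejoins in Q3
}

def users_in(history, q):
    """Users active in quarter q."""
    return {u for u, quarters in history.items() if q in quarters}

def naive_churn(history, q):
    """Traditional churn: share of quarter-q users absent in quarter q+1."""
    base = users_in(history, q)
    gone = {u for u in base if u not in users_in(history, q + 1)}
    return len(gone) / len(base)

def sticky_churn(history, q):
    """Share of quarter-q users who never appear in any later quarter."""
    base = users_in(history, q)
    gone_forever = {u for u in base if max(history[u]) <= q}
    return len(gone_forever) / len(base)

print(naive_churn(HISTORY, 1))   # 2/3 -- "b" and "c" both look churned
print(sticky_churn(HISTORY, 1))  # 1/3 -- only "b" truly stayed gone
```

Note how user "c" doubles the naive churn number even though they are a paying customer again by Q3 – exactly the distortion Netflix is describing.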

Jul 06, 2011

I have an intern right now, which affords me a lot of teaching opportunities, which I love. In order to give her a real piece of work to chew on and master, I handed her the analysis of a new product my company has launched into the Android market. It’s a mobile coupon aggregation service called Lotza, and the idea behind it is that with all the daily deal sites popping up everywhere, like Groupon and Tippr, it would be nice if there was something that a) showed all the deals in one place, b) only showed the deals most likely to be of interest, and c) stored all purchased offers in a single wallet for simple (virtual) retrieval. There’s a lot more to it in the development pipeline, but at beta, that is the core functionality. Using our analytics backend to power a direct-to-consumer product is a great opportunity for us to own the data we generate, and to experiment freely with layout, design, and analytic algorithms. Being the complete masters of our own system comes with all the accompanying responsibilities you might expect.

As per the norm, my Business Intelligence team drove product instrumentation and logging requirements to capture the user experience to the fullest. As is often true in software development, after all was said and done, everything was not perfect (but was close), making the derivation of metrics less straightforward than simple aggregations across fields. Multiple tables with specialized information about certain aspects of a user profile, in-session experiences, and behaviors make for a great multidimensional ball of potential confusion.

My intern began working with me after these requirements had been written, after the schemas were defined, and after most of the logging was already implemented. To bring her up to speed, I had her create a logging guide document showing every page in the product and all user actions possible per page, along with the associated logging. In running actual tests on a phone and tailing the logs live, she found several small bugs, proving to me that she was paying attention and understanding the structure of the data (even though it was complex in parts). Once this document existed, a set of reports was defined for a presentation she is expected to deliver to the internal stakeholders next week. This presentation will be a general overview of the product and a discussion of usage so far, including important KPIs such as sessions per user, the conversion funnel, and page abandonment rates. None of these metrics are as straightforward as they would be in a perfect product world, which is one point of this post. The important stuff is often under the initial layer of data, and requires special filters along with an intimate understanding of both what the logging requirements and definitions were and how they have actually been implemented.
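To illustrate why a metric like sessions per user is not a simple count, here is a sketch of one common approach: sessionizing a raw event stream by an inactivity gap. The event data, field layout, and 30-minute threshold are illustrative assumptions (the 30-minute cutoff is a widespread web-analytics convention, not our product’s actual definition):

```python
from itertools import groupby

# Hypothetical event stream: (user_id, unix_timestamp) pairs from the logs.
EVENTS = [("u1", 0), ("u1", 100), ("u1", 2500), ("u2", 50)]

SESSION_GAP = 30 * 60  # a 30-minute silence starts a new session

def sessions_per_user(events, gap=SESSION_GAP):
    """Count sessions per user: sort events, group by user, and start a
    new session whenever the gap between consecutive events exceeds `gap`."""
    counts = {}
    for user, grouped in groupby(sorted(events), key=lambda e: e[0]):
        times = [t for _, t in grouped]
        breaks = sum(1 for a, b in zip(times, times[1:]) if b - a > gap)
        counts[user] = 1 + breaks
    return counts

print(sessions_per_user(EVENTS, gap=1800))  # {'u1': 2, 'u2': 1}
```

Even this toy version embeds judgment calls – the gap threshold, how to treat events straddling midnight, whether a background refresh counts as activity – which is exactly why the analyst has to know the logging, not just the aggregate.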

The second point is that the analyst has to know these nuanced methods for counting what would otherwise be a simple sum or count across unique values. The analyst who isn’t able to roll up her sleeves and dive into the (potential) logging morass is really entrusting the derived insights to the gods of perfect logging. My intern has recognized anomalies, and worked around them. When she gives her talk next week, she will be armed with the ability to answer almost any question thrown at her – in other words, the “why” to her “what”. Even though this data wading has eaten up a lot of her time (she worked over July 4th and the 5th, a company holiday), she thanked me for the opportunity. She is confident in her findings, has learned a lot, and knows more about the logging system for our product than I currently do (and I wrote the logging spec and defined the original schema!). Her approach of looking at logs, reading documents (that are sometimes out of date), and tailing actual logs to identify examples and verify accurate data capture (and her own understanding of what is happening) represents the many hats I believe an analyst should wear. With so much data being generated every day, and the complexity of that data increasing, the analyst must be a data generalist. I’m not proposing that all analysts must be masters of SQL and statistics and technical writing and math and the art of visual presentation – I’m suggesting that the best analysts will utilize whatever tools they need (including engineers) to get the right data in the right format to the right people. It is my opinion that anyone so lopsided in their training as to only know one of the above-mentioned skills is likely to underutilize what they find – the data will be inaccurate, confusing, mysterious, overwhelming. In short, it will be a disservice to their organization, and due to their narrow focus, they might not even realize the problem. Come on analysts, diversify!

In closing: Data is messy. Roll up your sleeves. Question your results. Triangulate to verify. Cross-reference values. Perform sniff tests. As an analyst you should be the most engaged and knowledgeable person with the data you own – your consumers rely on you to play the role of translator and to represent your confidence in the accuracy of your findings. When you project this confidence and integrity, then in those times when you find absolutely horrendous data or unbelievable (in a bad way) information, you can either find ways to salvage some useful information from it or proclaim with certainty that the data is, for the most part, so dirty that your analyses are unreliable and therefore not worth the effort. Your consumers may not always like the answers you provide, but they will respect you for declaring as much, especially when you know the data so well you can pinpoint the major issues that bring reliability into doubt. Of course, when you find amazing insights you can confidently present them (and show backup verification that you didn’t make some simple error – you know the data that well).

Apr 20, 2011

As a business analyst, I live and die by logging. This makes me vigilant about what products are being developed by my organization, and how they change from concept to wireframes to implementation. Rarely do these three stages look the same, and sometimes the end product is a far cry from the original beast due to time pressures, build vs. buy decisions, scope creep, and a number of other fun issues. Regardless of my vigilance, I find that logging, and thoughts around instrumentation almost always come last. I am not alone in my observations as other analyst friends have made the same comment. In fact, this was verified by a development lead at a large organization recently when he commented to me “you know, we always wait until it’s too late to add logging, if we even consider it in the first place.”

Why is it that engineers have such an aversion to extended, non-performance instrumentation, and find it so onerous or unimportant? They write unit tests. They instrument for speed of throughput, heartbeat, and error messaging but tend to ignore the basics of user behavior on the products they have built.  It is seen as extraneous, performance impacting, nonsensical even. This is unfortunate.

When I was in graduate school, my dissertation focused on how individuals’ beliefs about the degree to which their organization in general, and their supervisor specifically, cared about them impacted their work behaviors. In other words, if you think your supervisor cares about you as a person, does that make you work harder? What about your overall organization – does that matter? Are there special traits of supervisors that make you more or less likely to do your job well, to help others, to protect the organization from lawsuits or other problems, to decide to stay instead of quitting? It took me almost two years to collect enough data to answer this set of questions. Two years. Today, I can ask interesting, in-depth questions about the data I collect every two minutes. The only reason this is possible is that the damn products are instrumented like mad to tell me everything the user is doing, seeing, interacting with (and choosing to ignore). This information is powerful for understanding usability, discovery, and annoying product issues like confusing pages or buttons. Predictive analytic models can be built off of this behavior (user X likes this stuff, hates that stuff, buys this stuff, ignores that stuff, etc.) – but only if it is logged. With both a strong BI opportunity and predictive analytics opportunities, why is logging so often ignored, perfunctory, or offloaded to companies like Google – almost as an afterthought?

My theory is that because the nuances of logging often make it fragile and complex, it isn’t easy to determine whether it is accurate during development. As the underlying systems change – whether through schema shuffling or enumerated-value redefinition (or recycling), for example – and many hands touch the code that creates the product, it makes sense to wait until things settle down before adding the measurement devices. Unfortunately, there are often special cases introduced – invisible to an end user, but obvious under the hood – that make straightforward logging difficult. The end result is often a pared-down version of logging that is seen as “good enough” but not ideal. The classic “we’ll do this right in vNext” is my most hated phrase to hear.

The workaround to this malady, when possible, is to introduce clear, concise, standardized logging requirements that engineers can leverage across products. Often a block of specific types of values (timestamp, screen size, operating system, IP, user-id, etc.) describes the majority of what the analyst needs for pivoting, monitoring, etc. The remaining portion of a schema can then contain the pieces that are unique to the specific product (like “query string” if searching is a possible action in one product but not others).
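A sketch of what such a standardized schema might look like, here as Python dataclasses. The class and field names are hypothetical, chosen only to mirror the pattern described above: a common block every product logs, plus a product-specific extension.

```python
from dataclasses import dataclass, field
import time

@dataclass
class BaseLogEvent:
    """Common block: fields every product logs identically, so analysts
    can pivot and monitor across products with one set of queries."""
    user_id: str
    operating_system: str
    screen_size: str
    ip: str
    timestamp: float = field(default_factory=time.time)

@dataclass
class SearchLogEvent(BaseLogEvent):
    """Product-specific extension: only products with a search action
    add a query string on top of the shared base fields."""
    query_string: str = ""

evt = SearchLogEvent(
    user_id="u42",
    operating_system="Android",
    screen_size="480x800",
    ip="203.0.113.7",
    query_string="coupons seattle",
)
```

Because every event type inherits the same base block, a report that pivots on operating system or screen size never needs per-product special cases – only the extension fields do.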

The analyst must be vigilant, aware, engaged, and on the lookout for implementations that introduce actions or behaviors that are currently unlogged or that break expectations, so that he or she can engage engineers proactively, before it’s too late, to add functionality to logging and ensure that important and essential user behavioral data does not go down the tube of the dreaded “vNext”.