Jan 02 2013

The data buzz-phrase of the current century, “Big Data”, is often approached as a magical construct that one might lash oneself to and, like Odin to Yggdrasil, walk away from with great knowledge after a time – maybe just by being near it. The idea is that using this toolset is THE way to extract value from your data. I’m not the first to say it, but this is similar to how relational databases have been sold for years, only now the promise extends to unstructured and semi-structured data. Pro tip – you still have to manipulate the data to get anything worthwhile from it, and that assumes you collected the right stuff to begin with.


It’s unfortunate that a lot of people in the organizational position to make investments in data infrastructures, technologies, and tools get stuck playing a game of Mad Libs instead of figuring out what each tool can do and, more importantly, what they need each tool to do to be useful. By that I mean that they have a sentence that goes something like “If only I had _____ technology, all my _____ problems would be solved”. On the flip side, companies trying to sell Big Data services love these kinds of decision makers, promising them that “cloud based, big data solutions” solve all data problems. I mean, take any kind of data (structured, unstructured, semi-structured), upload it to the cloud, throw it into HBase, run a map/reduce job against it in Hadoop and BAM! Cool… then what? Cloud storage is infinitely sized, safe, and, depending on how much you pay for it, geo-redundant. Problems solved, right? Or are they?


Let’s back up and start…at the beginning. If you have a business that can potentially generate a lot of data (transactional, operational, etc.) you fall into one of two camps: either you currently have a ton of data that you are warehousing/archiving, or you do not have a ton of data (for one or more of several reasons) but could once you instrument your systems to spit out proper logs.


Let’s assume you are in the first camp and have a ton of data. What kind of data have you gathered, and in what format? How much data do you generate every day? Lastly – could you vastly shrink the amount of data you keep, down to the useful part, by applying simple ETL jobs? I’d argue that most organizations (not all) that are looking into big data solutions are actually doing so very prematurely. Being able to suddenly collect and infinitely store every piece of data your servers generate, the output from your web logs, and all public mentions of your organization on Twitter and Facebook is probably more a curse than a blessing – the concept of infinite storage for cheap promotes an unthinking “dump it in here and we will sort it out later” approach to data collection. It’s true, storage is cheap, but paying developers to pick through the garbage later (often over and over again) is mind-numbingly expensive. A better solution is to structure your data collection intelligently, write ETL jobs that make your data compact and accessible, and let your developers spend their time using the data to improve your business instead of picking through the garbage.
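To make “simple ETL job” a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the log line format, the sample data, and the function names (`extract`, `transform`, `load`, `run_etl`) are all hypothetical, and a real job would read from your actual log files rather than a string. The point is only that a few lines of code can turn raw, noisy lines into a compact daily summary that someone can actually use.

```python
import csv
import io
from collections import Counter

# Hypothetical raw log lines: "<ISO timestamp> <HTTP method> <path> <status>"
RAW_LOG = """\
2013-01-02T10:00:01 GET /products/42 200
2013-01-02T10:00:02 GET /products/42 200
2013-01-02T10:00:05 GET /cart 500
garbage line that does not parse
2013-01-03T09:15:00 GET /products/42 200
"""


def extract(text):
    """Yield (day, path, status) tuples, skipping lines that do not parse."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # not in the assumed four-field format
        timestamp, _method, path, status = parts
        if not status.isdigit():
            continue  # status code must be numeric
        yield timestamp[:10], path, int(status)


def transform(records):
    """Collapse individual requests into hit counts per (day, path, status)."""
    return Counter(records)


def load(counts, out):
    """Write the compact summary as CSV to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(["day", "path", "status", "hits"])
    for (day, path, status), hits in sorted(counts.items()):
        writer.writerow([day, path, status, hits])


def run_etl(text):
    """Run extract, transform, and load in sequence; return the CSV text."""
    buffer = io.StringIO()
    load(transform(extract(text)), buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(run_etl(RAW_LOG))
```

The summary keeps one row per day, path, and status code instead of one row per request, and the unparseable line is dropped at collection time rather than left for a developer to trip over later. Whether daily counts are the right level of detail depends entirely on the questions you intend to ask of the data.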


Now switch to the second camp – no data now, but lots ASAP. What kinds of data can you and should you gather? How should it be structured? What will you do with it? The nature of these questions suggests that choosing your tools without an initial grounding in what data you can have and what you will use the tools for is premature at best. But the experts you talk to may suggest you just start collecting as much as you can, as fast as you can, since storage is cheap and…


This “I have a hammer and everything looks like a nail” approach to capturing and deriving value from data (or data exhaust, as it is sometimes called) by using a particular tool alone is really shortsighted, and a recipe for expensive failure as you hire expensive experts to trawl through your piles of garbage looking for gold rather than setting up your organization for successful insights ahead of time. Use the current fixation on big data to promote your data strategy, to get developers instrumenting your products and services deeply, in the hopes that you will soon have a high quality data asset that screams out for some tool to tame it. This may be a big data tool like Hadoop, or it may be a set of Perl scripts, or (gasp) an Excel spreadsheet. The point is that Hadoop and the rest are tools to be pointed and fired at specific issues in specific situations. You are not Google, and you probably don’t need the tools Google uses. You do need to be smart about data, which is something the big data buzz has highlighted. The beauty of the current landscape is that if you actually need massive scale processing that fits in the map/reduce paradigm, you can have it. In other words, you are no longer limited (or forced to sample) when you have a large set of data. All the other issues with data quality that have plagued us forever are still present, important (maybe more so), and in need of attention. Don’t be lulled into a false sense of security just because you have a larger bucket for use in panning – you still have to sift through it all to find the gold, IF you captured enough of the right types of data to begin with.