Organic Data Modeling in the Age of the Extrabase #analytics
Sorry for the buzzwordy title of this post, but hopefully you'll agree that a buzzword can sometimes be useful for communicating an important Zeitgeist.
I'm working with one of our clients right now to develop a new, advanced business intelligence capability that uses state-of-the-art in-memory data visualization tools like Tableau and Spotfire and will ultimately connect multiple data sets to answer a range of important questions. I've also been involved recently in a major analysis of advertising effectiveness that drew on a number of data sources that were external to the organization, non-traditional, or both. In both cases, these efforts are likely to evolve toward predictive models of behavior that help prioritize efforts and allocate scarce resources.
Simultaneously, today's NYT carried an article about Clear Story, a Silicon Valley startup that aggregates APIs to public data sources about folks and provides a highly simplified interface to those APIs for analysts and business execs. I haven't yet tried their service, so I'll save that for a separate post. The point here is that the emergence of services like this represents an important step in the evolution of Web 2.0 -- call it Web 2.2 -- that's very relevant for marketing analytics in enterprise contexts.
So, what's significant about these experiences?
Readers of Ralph Kimball's classic Data Warehouse Toolkit will appreciate both the wisdom of his advice and how much the context for that advice has changed. Kimball is absolutely an advocate for starting with a clear idea of the questions you'd like to answer and for making pragmatic choices about how to organize information to answer them. However, the major editions of the book were written in a time when three things were true:
- You needed to organize information more thoughtfully up front, because computing resources to compensate for poor initial organization were less capable and more expensive
- The number of data sources you could integrate was far more limited, allowing you to be more definitive up front about the data structures you defined to answer your target questions
- The questions themselves, and the range of possible answers to them, were more limited and less dynamic, because the market context itself was more stable
Together, these things made for business intelligence / data warehouse / data management efforts that were longer, and a bit more "waterfall" and episodic in execution. Over the past decade, however, many have critiqued such efforts for their high failure rates, mostly cases in which they collapsed under their own weight: too much investment, too much complexity, too few results. Call this Planned Data Modeling.
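To make the contrast concrete, here's a minimal sketch of what Planned Data Modeling typically produces: a Kimball-style star schema designed up front, before anyone has run a single query against real usage. The table and column names below are purely hypothetical, and sqlite3 stands in for whatever database you'd actually use.

```python
import sqlite3

# A hypothetical star schema, fully specified before any analysis happens --
# the essence of "Planned Data Modeling." Names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key   INTEGER PRIMARY KEY,
    segment        TEXT,
    region         TEXT
);

CREATE TABLE dim_date (
    date_key       INTEGER PRIMARY KEY,
    calendar_date  TEXT,
    fiscal_quarter TEXT
);

CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    units_sold   INTEGER,
    revenue      REAL
);
""")
```

The schema itself isn't the problem; the problem is committing to it before you know which questions the business will actually keep asking.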
Now back to the first experience I described above. We're using the tools I mentioned to simultaneously hunt for valuable insights that will help pay the freight of the effort, define useful interfaces that users will keep coming back to, and, through these efforts, determine the optimal data structures we need underneath to scale from the few million rows in the one big flat file we've started with to something that will no doubt be larger, more multi-faceted, and thus more complex. In particular, we're using these tools' ability to calculate synthetic variables on the fly from the raw data to point the way toward the summaries and indices we'll eventually have to develop in our data repository. This improves the likelihood that the way we architect that repository will directly support real reporting and analysis requirements, prioritized based on actual usage in initial pilots, rather than speculative requirements obtained through more conventional means. Call this Organic Data Modeling.
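Here's a minimal sketch of that workflow outside the visualization tools themselves, using pandas. The flat-file columns, the synthetic variables, and the "analysis date" are all hypothetical; the point is simply that ad hoc calculations which prove popular in the pilot become candidates for real summary tables or indexed columns later.

```python
import pandas as pd

# Hypothetical stand-in for the one big flat file the pilot started with.
df = pd.DataFrame({
    "channel":     ["display", "search", "display", "email"],
    "touch_date":  ["2012-01-15", "2012-03-02", "2011-11-20", "2012-02-10"],
    "media_spend": [1200.0, 800.0, 1500.0, 300.0],
    "responses":   [40, 55, 0, 25],
})
df["touch_date"] = pd.to_datetime(df["touch_date"])

# "Synthetic variables" computed on the fly -- the kind of ad hoc calculation
# Tableau or Spotfire lets analysts define directly during exploration.
df["cost_per_response"] = df["media_spend"] / df["responses"].where(df["responses"] > 0)

as_of = pd.Timestamp("2012-04-01")  # hypothetical analysis date
df["recency_bucket"] = pd.cut(
    (as_of - df["touch_date"]).dt.days,
    bins=[0, 90, 365, float("inf")],
    labels=["<90d", "90-365d", ">1yr"],
)

# If analysts keep reaching for these fields, that usage is the signal to
# promote them into summary tables or indexed columns in the repository,
# rather than guessing at the structures up front.
candidate_summary = (
    df.groupby(["recency_bucket", "channel"], observed=True)["cost_per_response"]
      .mean()
      .reset_index()
)
print(candidate_summary)
```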
Further, the work we've done anticipates that we will be weaving together a number of new data sources, many of them externally provided, and that we'll likely swap sources in and out as we find some more useful than others. It occurred to me that this large, heterogeneous, and dynamic collection of data sources has sufficiently different analytic and administrative implications that a different name altogether might be in order for the sum of the pieces. Hence, the Extrabase.
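To illustrate the administrative side of that idea, here's a minimal sketch of an Extrabase treated as a registry of swappable sources. The registry class, source names, and loader functions are all hypothetical, not a description of any particular product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

import pandas as pd


@dataclass
class Extrabase:
    """A hypothetical registry of external, swappable data sources."""
    sources: Dict[str, Callable[[], pd.DataFrame]] = field(default_factory=dict)

    def register(self, name: str, loader: Callable[[], pd.DataFrame]) -> None:
        # Swapping a source in, or replacing a weak one, is just re-registering.
        self.sources[name] = loader

    def retire(self, name: str) -> None:
        # Dropping a source that turned out not to earn its keep.
        self.sources.pop(name, None)

    def load(self, name: str) -> pd.DataFrame:
        return self.sources[name]()


# Illustrative loaders standing in for external APIs or vendor feeds.
def census_demographics() -> pd.DataFrame:
    return pd.DataFrame({"zip": ["02139", "10001"], "median_income": [89000, 67000]})

def weather_history() -> pd.DataFrame:
    return pd.DataFrame({"date": ["2012-03-01"], "avg_temp_f": [48]})


xb = Extrabase()
xb.register("demographics", census_demographics)
xb.register("weather", weather_history)
xb.retire("weather")  # swap out a source that isn't pulling its weight
print(xb.load("demographics"))
```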
These terms are not meant to dress up a cop-out. Some might say that mashing up a bunch of files in an in-memory visualization tool reflects, and could further contribute to, a lack of the intellectual discipline and wherewithal to get it right. In our case, we're hedging that risk by having the data modelers responsible for figuring out the optimal repository structure work extremely closely with the "front-end" analysts, so that as potential data-structure implications flow out of the rubber-meets-the-road analysis, we're able to sift them and decide which should stick and which we can ignore.
But, as they say sometimes in software, "that's a feature, not a bug." Mashing up files in these tools and seeing what's useful is a way of paying for the back-end data management process and disciplining it more rigorously, so that what gets built is based on what folks actually need, and gets delivered faster to boot.