By James Kobielus (@jameskobielus)
Big data is a complex, tricky thing to govern. Often, it’s an unholy siloed mess of disparate databases under various business units, on various data platforms, and managed by various “stewards” with various tools and workflows.
Consolidation of your big-data assets must be an ongoing initiative, both to reduce overhead and to free up the insights that come from correlating disparate data sets. But you can scarcely consolidate such a mission-critical resource without addressing the administrative issue of big-data governance head-on. Presumably, you already have some level of governance (aka data stewardship or master data management) in your data warehousing and business intelligence practices.
Smart big-data consolidation demands the following double-barreled approach to governing the assets that matter:
- Governing analytic data: Keeping your big data under control means, among other things, determining what small subset of it should be managed with tight stewardship. Usually, that subset is the system-of-record relational data you’ve long managed within the master tables of your enterprise data warehouse. In other words, your official records on customers, finances, human resources, and the supply chain will still be governed tightly in the era of big data, probably on your scaled-up enterprise data warehouse. But the larger volume of unstructured data (such as social marketing intelligence, real-time sensor data feeds, browser clickstream sessions, and IT system logs) can remain outside your governance practice until such time as it is linked to systems of record.
- Governing analytic models: Big-data applications ride on a never-ending stream of new statistical, predictive, segmentation, behavioral, and other advanced analytic models. As you ramp up your data scientist teams and give them more powerful modeling tools, you will soon be swamped with models. Big data analytics demands governance of analytic models if they’re to be deployed into production business applications. Key governance features include check-in/check-out, change tracking, version control, and collaborative development and validation. Your big-data sandboxing platforms and modeling tools should ensure consistent governance automation and managed collaboration across multidisciplinary teams working on your most challenging big data analytics initiatives.
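To make the model-governance features above a little more concrete, here is a minimal Python sketch of check-in/check-out, version control, and change tracking. It is an illustration only, not any vendor's product or API: the `ModelRegistry` class, its method names, and the idea of storing a model as a plain dictionary of parameters are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ModelVersion:
    """One immutable, numbered snapshot of a model (hypothetical structure)."""
    number: int
    author: str
    note: str
    payload: dict
    created_at: datetime


@dataclass
class ModelRecord:
    """All versions of one named model, plus who currently holds the lock."""
    name: str
    versions: list = field(default_factory=list)
    locked_by: Optional[str] = None


class ModelRegistry:
    """In-memory registry; a real one would persist to a database."""

    def __init__(self):
        self._records = {}

    def check_out(self, name, user):
        """Lock the model for `user` and return a copy of its latest payload."""
        record = self._records.setdefault(name, ModelRecord(name))
        if record.locked_by not in (None, user):
            raise PermissionError(f"{name} is checked out by {record.locked_by}")
        record.locked_by = user
        return dict(record.versions[-1].payload) if record.versions else {}

    def check_in(self, name, user, payload, note):
        """Store a new version, record who changed what, and release the lock."""
        record = self._records.get(name)
        if record is None or record.locked_by != user:
            raise PermissionError(f"{user} has not checked out {name}")
        version = ModelVersion(
            number=len(record.versions) + 1,
            author=user,
            note=note,
            payload=dict(payload),
            created_at=datetime.now(timezone.utc),
        )
        record.versions.append(version)
        record.locked_by = None
        return version.number

    def history(self, name):
        """Change log as (version number, author, note) tuples, oldest first."""
        return [(v.number, v.author, v.note) for v in self._records[name].versions]


registry = ModelRegistry()
registry.check_out("churn_model", "ana")
registry.check_in("churn_model", "ana", {"threshold": 0.5}, "initial model")
draft = registry.check_out("churn_model", "raj")
draft["threshold"] = 0.4
registry.check_in("churn_model", "raj", draft, "lower threshold after validation")
print(registry.history("churn_model"))
```

The lock keeps two data scientists from overwriting each other's work, and the append-only version list gives the audit trail needed before a model is promoted into a production application. Collaborative validation, such as requiring a second reviewer to approve a version, would be layered on top of this and is left out of the sketch.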
No, governance is not the sexy side of big data. It’s often an afterthought in big-data projects. But it’s absolutely essential if you wish to keep your data clean, your models fit, and your big-data applications delivering reliable insights throughout the business.
James Kobielus is an IBM Big Data evangelist.