By James Kobielus (@jameskobielus)
The best scientists speak with authority that is grounded in their mastery of empirical observations and of the tools and methods needed to find powerful truths. In modern society, we’d like to think that scientists of any stripe are unimpeachable authorities, because, after all, isn’t science a noble calling? Or, if individual scientists are fallible and occasionally dishonest human beings, isn’t the scientific process supposed to expose their lies, sanction them severely, and possibly end their careers?
So it’s especially disturbing when we find that some scientists abuse our trust by falsifying the data and models at the heart of their work. This recent InformationWeek article (http://www.informationweek.com/big-data/news/big-data-analytics/big-data-fakers-5-warning-signs/240152921) discusses several scientific researchers who were caught fabricating data, models, and experiments.
None of the cited examples specifically involves data scientists working with big data in commercial organizations, but they make you wonder whether the same thing could happen there. Data scientists are the rock stars of the big-data revolution, and they carry an increasing amount of perceived authority. This category includes statistical analysts, data miners, predictive modelers, computational linguists, and other smart people whose job is to find deep insights in large, complex data sets.
Data scientists are like any skilled person in any esteemed profession. Most are honest, have professional integrity, and stand behind their work. But there’s always the opportunity for an unscrupulous data scientist, in any context, to fake their work. To the extent that a secretly dishonest data scientist operates autonomously, without independent oversight or peer vetting of their work, they can do incalculable damage to your big-data initiatives. If every other data scientist in your organization uses their (falsified) data and their (bogus) models that were trained on that data, it might take a long time before you realize you’ve been had, if you ever do.
Trustworthy data science demands trustworthy data scientists. But trustworthiness, of course, requires continual independent verification. Where big data is concerned, do you have full lineage, access and version controls, and audit trails for every record stored in your data science sandboxes, and also an equivalent governance process that applies to all models built on that data?
If you don’t, your big-data initiatives may be operating on a single version of a lie. And that can expose your business to significant legal, operational, and strategic risks.
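For readers who want something concrete, here is one way the audit-trail idea could be sketched in Python. This is a minimal illustration, not a governance product: the `AuditTrail` class, the `entry_hash` helper, and the field names (`actor`, `action`, `artifact`) are all hypothetical names invented for this example. Each log entry stores a hash of the data or model it describes plus the hash of the previous entry, so a later edit to any earlier entry can be detected.

```python
import hashlib
import json


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + body).hexdigest()


class AuditTrail:
    """Hypothetical append-only log; every entry is chained to the one before it."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self.entries = []

    def record(self, actor: str, action: str, artifact: str, content: bytes) -> dict:
        """Append an entry describing who did what to which data set or model."""
        payload = {
            "actor": actor,
            "action": action,
            "artifact": artifact,
            "content_sha256": hashlib.sha256(content).hexdigest(),
        }
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        entry = {"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; return False if any entry was altered or reordered."""
        prev = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev:
                return False
            if entry["hash"] != entry_hash(prev, entry["payload"]):
                return False
            prev = entry["hash"]
        return True


if __name__ == "__main__":
    trail = AuditTrail()
    trail.record("alice", "load", "sales.csv", b"id,amount\n1,100\n")
    trail.record("bob", "train", "churn_model_v1", b"serialized-model-bytes")
    print("chain intact:", trail.verify())
```

A chain like this only helps if someone other than the log's author checks it. Anyone who can rewrite the entire log can also recompute every hash, so the latest hash would need to be stored somewhere the data scientist cannot alter, which is the independent verification argued for above.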