Enterprise IT Watch Blog

Jun 21 2016   3:20PM GMT

Eliciting high-quality data science from non-traditional sources

Michael Tidmarsh

Data Science
Data scientist



By James Kobielus (@jameskobielus)

I’m a pragmatist. I like to think that you are what you do. So if you look, walk, and quack like a data scientist, you’re a data scientist, aren’t you?

This is not a metaphysical inquiry. As we encourage more people to acquire data science tools and skills, what point is there in distinguishing between data scientists and those who, for all intents and purposes, are of the same species, albeit without traditional track records, tools, and certifications?

This question occurred to me as I was reading about a new DARPA program called Data-Driven Discovery of Models (D3M). Its aim is to enable greater automation throughout the data-science lifecycle. The program recognizes that many of the most critical tasks will be performed by people who are new to this field and who may not fit the traditional profile of the professional data scientist. As stated by the agency, the program’s goal is to “develop algorithms and software to help overcome the data-science expertise gap by facilitating non-experts to construct complex empirical models through automation of large parts of the model-creation process.”

What’s exciting about this initiative is that it focuses on the imperative of multiplying the productivity of data-science teams. It seeks innovative approaches that use automated machine-learning algorithms to accelerate the upfront process of composing data-scientific models that are best suited to a particular analytic challenge. I like the fact that it focuses on giving subject matter experts the tools to specify the analytic challenge to be addressed, identify the data to be analyzed, and evaluate the findings from the machine-learning models that are automatically composed. I think it’s good that they’re building hooks into this environment that would allow established data scientists to evaluate the results of automated methods. And I’m encouraged that the program will also address automation of data-science initiatives that are underspecified in terms of the features to be modeled and the data sets to be analyzed.

If it realizes its objectives, DARPA’s program will enable everybody everywhere to enjoy the fruits of high-quality data-science tools. However, I take issue with the self-contradictory notion, as expressed by DARPA in its solicitation, that the subject matter experts who would use such a tool are “non-experts.” Fortunately, the agency expresses its intention more cogently at another point in the document when it states its aim of enabling “users with subject matter expertise but no data science background [to] create empirical models of real, complex processes.”

But that statement also suffers from a fundamental conceptual flaw. What DARPA spells out sounds very much like the core competency of an expert data scientist, rather than a “non-expert” dabbler. After all, the core competency of data scientists is the creation and testing of complex empirical models. No matter what their academic or professional background, data scientists specialize in identifying analytic problems to be solved; defining the principal features of that problem that can be statistically modeled; acquiring, evaluating, cleansing, and preparing data sources to be used in the modeling; and building, testing, evaluating, and refining the resultant models.

At another point in the solicitation, DARPA states that one of its program’s core objectives is to develop a framework for “formal definition of modeling problems and curation of automatically constructed models by users who are not data scientists.” But that’s a self-devouring distinction. If someone, of any background, is able to use such a tool to perform this entire lifecycle of data-science tasks, they are thereby a genuine data scientist. They are not merely some incorporeal “virtual data scientist” or robotic “automated data scientist” (to cite two marginalizing phrases that Network World uses in this article about the DARPA program). And they are not necessarily a “citizen data scientist,” in the “impassioned amateur” sense in which many construe that phrase.

When deciding whether a subject matter expert is also a bona fide data scientist, the fact that they performed these data-science functions in a largely tool-automated fashion, rather than through manual techniques, is irrelevant. DARPA’s discussion seems to be hung up on the bogus notion that “curation,” the core of their “non-data scientist” distinction, is something less than full-blooded data science. Essentially, the agency uses this term to refer to two distinct data-science lifecycle tasks: evaluating the relevance of data sources to a specific modeling problem, and assessing the predictive fit of a constructed model to that same problem. However, by anybody’s reckoning, these tasks are at the heart of professional data science. The former is central to data engineering, and the latter to data modeling.

But I should point out that, for all its scoping flaws, DARPA’s initiative is on the right track. Modeling automation initiatives such as this are driving the new era of democratized data science. If subject matter experts everywhere embrace self-service tools for high-quality data science, we will unlock a world of data-driven creativity and innovation.
