By James Kobielus (@jameskobielus)
Big data is not a volume fetish, though some cynics regard it as such. More context is better than less when you’re analyzing data to ascertain its full significance. Likewise, more content is better than less when you’re trying to identify all of the variables, relationships, and patterns in your problem domain at a finer degree of granularity.
The bottom line is this: more context plus more content usually equals more data. That’s the central reason why big data can be a powerful analytics tool. To justify your investment in big-data technologies, you must have a clear sense of which analytical use cases can best achieve their objectives at greater scale. Big data’s core applications are any scenarios where objectives can best be achieved at data volumes, velocities, and/or varieties beyond the ordinary. See my IBM Big Data Hub blog from earlier this year for a detailed discussion of big data’s core use cases.
Analytic algorithms are obviously essential to most models that distill big data. However, there’s a growing industry consensus that, even at “small data” scales, incorporating more data into your models usually yields better results than introducing newer, more complex, more arcane statistical algorithms.
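That trade-off is easy to demonstrate in a toy simulation. The sketch below (my own illustration, not from Wu’s article; the data-generating function and sample sizes are arbitrary assumptions) fits a needlessly complex degree-9 polynomial to a dozen noisy observations of a simple linear relationship, then fits a plain straight line to two thousand observations of the same relationship, and compares their errors on held-out points:

```python
# Hedged sketch: on the same underlying relationship, a simple model
# trained on more data can beat a fancier model trained on less.
# The "true" relationship, noise level, and sample sizes are all
# assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(seed=0)

def truth(x):
    return 2.0 * x + 1.0  # the real (linear) relationship

def sample(n):
    """Draw n noisy observations of the true relationship."""
    x = rng.uniform(0, 1, size=n)
    y = truth(x) + rng.normal(scale=0.3, size=n)
    return x, y

# Held-out evaluation grid (noise-free ground truth).
x_test = np.linspace(0.05, 0.95, 50)
y_test = truth(x_test)

# "Arcane" model, little data: degree-9 polynomial fit to 12 points.
x_s, y_s = sample(12)
fancy = np.polyval(np.polyfit(x_s, y_s, deg=9), x_test)

# Simple model, lots of data: straight line fit to 2,000 points.
x_l, y_l = sample(2000)
simple = np.polyval(np.polyfit(x_l, y_l, deg=1), x_test)

err_fancy = np.sqrt(np.mean((fancy - y_test) ** 2))
err_simple = np.sqrt(np.mean((simple - y_test) ** 2))
print(f"RMSE, complex model / small data: {err_fancy:.3f}")
print(f"RMSE, simple model / large data:  {err_simple:.3f}")
```

The complex model fits its few training points almost perfectly but oscillates between them; the simple model, fed more data, pins down the real relationship.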
This recent article by Garrett Wu presents a powerful argument for that point of view. His core thesis is: “having more data allows the ‘data to speak for itself,’ instead of relying on unproven assumptions and weak correlations.”
In other words, having less data in your training set means you are exposing yourself to the following modeling risks. First, you are more likely to overlook some key predictive variables when you build your statistical model. Also, you are more likely to skew the model toward non-representative samples. And you are more likely to find spurious correlations that would disappear if you had a more complete data set revealing the underlying relationships (linear or non-linear, parametric or nonparametric) at work.
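The spurious-correlation risk in particular is easy to see in simulation. This sketch (my own illustration; sample sizes and trial count are arbitrary assumptions) repeatedly draws two series of random noise that are, by construction, completely unrelated, and records the strongest “correlation” observed:

```python
# Hedged sketch: two independent noise series can show a strong
# "correlation" in a small sample that vanishes as the sample grows.
import numpy as np

rng = np.random.default_rng(seed=7)

def max_abs_corr(n, trials=200):
    """Largest |Pearson r| seen between two independent noise
    series across repeated draws of sample size n."""
    best = 0.0
    for _ in range(trials):
        x = rng.normal(size=n)
        y = rng.normal(size=n)  # independent of x by construction
        r = np.corrcoef(x, y)[0, 1]
        best = max(best, abs(r))
    return best

small = max_abs_corr(n=10)      # tiny training set
large = max_abs_corr(n=10_000)  # more complete data set

print(f"strongest |r| found at n=10:     {small:.2f}")
print(f"strongest |r| found at n=10,000: {large:.2f}")
```

With only ten observations, you will routinely stumble on apparently strong relationships between variables that have none; at ten thousand observations, those phantom correlations collapse toward zero.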
Accumulating more detail data, he argues, allows the data scientist to engage in explorations that “make fewer initial assumptions about the underlying model and let the data guide which model is most appropriate.” By “detail data,” he refers to the “attributes and interactions of entities—usually users or customers…preferences, impressions, clicks, ratings and transactions are all examples of detail data.” In other words, detail data provides, by definition, deeper context on the entities and relationships of interest.
Clearly, algorithms are important in data science, but they’re the cart that must follow the data horse, rather than vice versa. You should be accumulating context, in the form of more detail data, in order to identify the most appropriate algorithmic modeling approach. These data-science practices are valid at any scale of data, big or small, on which you’re developing your model.