Enterprise IT Watch Blog

Mar 31 2014   2:22PM GMT

Extract maximum insights before you purge your data

Posted by: Michael Tidmarsh



By James Kobielus (@jameskobielus)

Data is overrated. You don’t actually need data in order to generate brilliant insights. At the logical extreme, you can get by with sheer intuition: the human mind’s ability to grasp intrinsic patterns from little or no a priori data. However, data isn’t completely outside the intuition equation. If your intuition isn’t subsequently confirmed through data (such as the evidence of your own senses), then it was simply a bad guess. And if your intuitions prove right only about 50 percent of the time, then they are guesswork, no better than a coin flip, rather than the insights of a perceptive mind.

When our thoughts turn to retaining data, we tend to forget that its payload, not the physical or logical representation of that payload, is what we should hold onto. The true payload of data consists of the real-world entities, properties, and relationships that it denotes (e.g., customer purchases, employee profiles, financial accounts), and the correlations, trends, forecasts, and other statistical patterns it describes. Anything in the data that is superfluous, tangential, or irrelevant to this payload can safely be discarded. That rule is essentially what guides data professionals’ routine decisions to purge, deduplicate, and compress the data they hold.

Compression involves reducing your retained data’s bitload down to its irreducible payload. But some data resists efficient compression, for the simple reason that it contains no significant patterns that would allow further reduction without sacrificing payload. As this recent article by Vincent Granville notes (http://ow.ly/tUHlr), “any algorithm will compress some data sets, and make some other data sets bigger after compression. Data that looks random, that has no pattern, cannot be compressed…. In fact, the vast majority of all data sets, are almost random and not compressible.”
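To make the contrast between patterned and random-looking data concrete, here is a minimal sketch using Python's standard zlib module. The sample payloads and the helper name compressed_ratio are illustrative assumptions, not anything taken from Granville's article.

```python
import os
import zlib


def compressed_ratio(payload: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(payload, 9)) / len(payload)


# Hypothetical sample data: one repeated header line versus random bytes
# of exactly the same length.
patterned = b"customer_id,region,status\n" * 4000
random_like = os.urandom(len(patterned))

patterned_ratio = compressed_ratio(patterned)
random_ratio = compressed_ratio(random_like)

print(f"patterned: {patterned_ratio:.4f}")
print(f"random:    {random_ratio:.4f}")
```

The repetitive payload should shrink to a small fraction of its size, while the random bytes should come out no smaller than they went in, and typically a few bytes larger because of the format's own overhead.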

Actually, I take issue with that last statement. Most structured data sets have patterns, either at the row or column level, and can thereby be compressed to varying degrees. Most unstructured data sets can be compressed using dictionary encoding. And most video, audio, and image files can be compressed through extraction and encoding of their patterns. None of these objects are random in any true sense of that word.
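As a rough illustration of dictionary encoding, the sketch below replaces each distinct value in a column with a small integer code and keeps a lookup table to reverse the mapping. The function names and the sample "regions" column are hypothetical; real database and file formats implement the idea with far more care.

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer code, in order of first use."""
    dictionary = {}
    codes = []
    for value in values:
        # setdefault assigns the next free code only the first time a value is seen.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original values from the codes and the dictionary."""
    lookup = {code: value for value, code in dictionary.items()}
    return [lookup[code] for code in codes]


# Hypothetical low-cardinality column.
regions = ["EMEA", "APAC", "EMEA", "AMER", "APAC", "EMEA"]
dictionary, codes = dictionary_encode(regions)
restored = dictionary_decode(dictionary, codes)

print(dictionary)
print(codes)
```

The savings come from repetition: the fewer distinct values a column holds relative to its length, the more the codes shrink the stored representation, with no loss of payload.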

And many complex data sets are reducible to the statistical models that a data scientist might extract from them. Conceivably, once those models are built and trained, you might purge the bulk of the data that went into them. The models themselves are the core insights, the patterns, that you were amassing all this data for in the first place.
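A toy sketch of that idea: fit a straight line to ten thousand noisy points by ordinary least squares, keep only the two fitted parameters, and discard the raw points. The synthetic data, the fit_line helper, and the choice of a linear model are all assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data (an assumption): a known linear trend plus Gaussian noise.
rng = random.Random(0)
xs = list(range(10_000))
ys = [3.0 * x + 5.0 + rng.gauss(0, 1) for x in xs]

slope, intercept = fit_line(xs, ys)
model = {"slope": slope, "intercept": intercept}

# The two parameters now stand in for 10,000 points.
del xs, ys

print(model)
```

The parameters can replace the raw points only to the extent that the model actually captures the pattern; whatever it misses is lost once the data is purged.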
