By James Kobielus (@jameskobielus)
Data is overrated. You don’t actually need data to generate brilliant insights. At the logical extreme, you can get by on sheer intuition: the human mind’s ability to grasp intrinsic patterns from little or no a priori data. However, data isn’t completely outside the intuition equation. If your intuition isn’t subsequently confirmed by data (such as the evidence of your own senses), then it was simply a bad guess. And if your intuitions, over time, are right only about 50 percent of the time, they are guesswork rather than the insights of a perceptive mind.
When our thoughts turn to retaining data, we tend to forget that its payload, not the physical or logical representation of that payload, is what we should hold onto. The true payload of data consists of the real-world entities, properties, and relationships that it denotes (e.g., customer purchases, employee profiles, financial accounts), and the correlations, trends, forecasts, and other statistical patterns it describes. Anything in the data that is superfluous, tangential, or irrelevant to any of this can safely be discarded. And that latter rule is essentially what guides data professionals’ routine decisions to purge, deduplicate, and compress the data they hold.
Compression involves reducing your retained data’s bitload down to its irreducible payload. But some data resists efficient compression, for the simple reason that it contains no significant patterns that would allow further reduction without sacrificing payload. As this recent article by Vincent Granville notes (http://ow.ly/tUHlr), “any algorithm will compress some data sets, and make some other data sets bigger after compression. Data that looks random, that has no pattern, cannot be compressed…. In fact, the vast majority of all data sets, are almost random and not compressible.”
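You can see this asymmetry with any general-purpose compressor. Here is a quick sketch in Python, using the standard library’s zlib as a stand-in for whatever compressor you prefer: a heavily patterned byte string shrinks dramatically, while cryptographically random bytes of the same length come out essentially unchanged (or very slightly larger, because the compressor adds its own framing overhead).

```python
import os
import zlib

# Highly patterned data: one short motif repeated many times.
patterned = b"customer_purchase;" * 10_000

# Data with no exploitable pattern: cryptographically random bytes.
random_like = os.urandom(len(patterned))

def ratio(raw: bytes) -> float:
    """Compressed size as a fraction of original size."""
    return len(zlib.compress(raw, 9)) / len(raw)

print(f"patterned:   {ratio(patterned):.4f}")   # far below 1.0
print(f"random-like: {ratio(random_like):.4f}") # at or just above 1.0
```

The exact ratios depend on the compressor and the input, but the shape of the result does not: structure compresses, randomness does not.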
Actually, I take issue with that last statement. Most structured data sets have patterns, either at the row or column level, and can thereby be compressed to varying degrees. Most unstructured data sets can be compressed using dictionary encoding. And most video, audio, and image files can be compressed through extraction and encoding of their patterns. None of these objects are random in any true sense of that word.
And many complex data sets are reducible to the statistical models that a data scientist might extract from them. Conceivably, you might purge the bulk of the data that you used to build and train these models. The models themselves are the core insights, the patterns, that you were amassing all this data for in the first place.