Data reduction is one of those “compass point” activities, meaning that it’s fundamental to so many other storage processes that it has become an objective in and of itself. Data compression was an early data reduction technology, one that evolved into a standard process for most storage components. Deduplication is now in a similar position as it’s evolved from a technology enabling cost-effective disk backup into another fundamental data reduction process, one that’s getting integrated into more and more storage components. With deduplication moving further up the stack to primary storage, it’s being included in storage systems that also have compression. This has prompted the questions of which technology is better for primary storage and whether they can be used in conjunction with each other.
Basically, dedupe requires (surprise) duplicate data blocks in the same data set in order to generate the opportunity to reference a block multiple times. The applications that suit this process are (surprise, again) backup, virtual machine images, traditional office documents and email, to name a few. These applications involve multiple instances of the same data associated with redundant operations or multiple people doing similar work.
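To make the block-referencing idea concrete, here is a minimal sketch of fixed-size block deduplication. It is an illustration only: the 4 KB block size, the SHA-256 fingerprint and the function names `dedupe`, `restore` and `dedupe_ratio` are assumptions made for this example, not a description of how any particular storage product works.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def dedupe(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep each unique block once."""
    store = {}  # fingerprint -> block contents (stored a single time)
    refs = []   # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).digest()
        store.setdefault(fingerprint, block)  # a repeated block adds nothing new
        refs.append(fingerprint)
    return store, refs


def restore(store, refs) -> bytes:
    """Rebuild the original data by following the block references."""
    return b"".join(store[fingerprint] for fingerprint in refs)


def dedupe_ratio(store, refs) -> float:
    """Blocks referenced divided by blocks actually stored."""
    return len(refs) / len(store) if store else 1.0
```

Feeding this ten copies of one block plus a single different block stores only two blocks while keeping eleven references, which is the kind of redundancy backups and virtual machine images tend to produce. A data set made of unique blocks would show a ratio of 1.0, meaning dedupe saved nothing.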
Compression, thankfully, works well on data types that don’t do well with dedupe. It works on all data, although with varying effectiveness, and is a good solution for data sets that contain essentially unique data blocks. Media content, like audio and video files, is a natural fit, as are scientific and sensor-based data sets. These files can get very large, so even a small percentage of effective compression can make a difference. But what about running compression and dedupe together?
When dedupe first came out, it was assumed that it wouldn’t work on compressed data. But this wasn’t accurate; combining compression and dedupe is actually a good practice, especially in primary storage systems, where data sets include a range of file types and software applications. In fact, compression and dedupe together can produce a higher rate of data reduction than the sum of what each achieves when run separately.
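One way to picture the two techniques working together is a small sketch that dedupes first and then compresses whatever unique blocks remain. The ordering, the 4 KB block size, the use of zlib and the name `reduce_data` are all assumptions made for this example, and the sketch ignores the space taken by block references and other metadata, so treat the numbers it reports as rough.

```python
import hashlib
import zlib


def reduce_data(data: bytes, block_size: int = 4096) -> dict:
    """Report sizes after dedupe alone and after dedupe plus compression."""
    unique = {}  # fingerprint -> block contents, one entry per distinct block
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        unique.setdefault(hashlib.sha256(block).digest(), block)

    # Size of the unique blocks with no compression applied.
    after_dedupe = sum(len(block) for block in unique.values())
    # Size when each unique block is also compressed on its own.
    after_both = sum(len(zlib.compress(block)) for block in unique.values())

    return {
        "raw": len(data),
        "after_dedupe": after_dedupe,
        "after_dedupe_and_compression": after_both,
    }
```

On a mixed data set, such as a repeated block of text alongside a block of random-looking bytes, dedupe removes the repeats and compression then shrinks the repetitive text that is left. The random-looking block gains little or nothing from compression, which is why results depend so heavily on the data.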
Primary storage systems are coming to market with deduplication embedded in their controllers, and dedupe technologies are evolving to meet this new use case. Like compression before it, dedupe delivers lower overall reduction rates on primary storage than it does on backup data. But the combination of dedupe and compression for primary storage can produce better results.