Deduplication was effectively introduced by Data Domain about 10 years ago, as a storage appliance. It was presented to the backup software as either a disk LUN or, more frequently, a NAS mount point. This dedupe technology’s early success was due largely to the complexity of the disk-to-disk backup alternative of the day, virtual tape libraries (VTLs).
Another dedupe tool was available at about the same time, but it was installed on the client server. The storage appliance implementation was the overwhelming favorite, since it didn’t require replacing the backup software and didn’t tax client servers with the dedupe processing overhead.
Deduplication algorithms differ somewhat, but all use some method of examining each block of data, assigning it a unique identifier (a “hash key”) and comparing it with an index (a “hash table”) of all previous blocks. In this way, duplicate blocks can be represented by a reference (or pointer) to the original block, with the space savings coming from recording the reference for each duplicate block instead of the entire block. This hash table can grow very large and in some cases become the gating factor on how large the dedupe device itself can grow, since it must usually reside in memory. Currently, different dedupe tools use different block sizes, different ways to index the hash keys, etc., but all still perform these same classify and look-up steps on data blocks that enter the dedupe engine.
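To make the classify and look-up steps concrete, here is a minimal sketch in Python. It assumes fixed-size 4 KB blocks, SHA-256 as the hash key and an in-memory dictionary as the hash table. The DedupeStore class and its method names are hypothetical, invented for illustration, and do not describe any vendor's implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products use various sizes


class DedupeStore:
    """Toy dedupe engine: one stored copy per unique block, references for the rest."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # hash key -> block bytes (the in-memory "hash table")

    def write(self, data):
        """Split data into blocks and return the list of hash keys that rebuilds it."""
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            key = hashlib.sha256(block).hexdigest()  # classify step
            if key not in self.blocks:               # look-up step
                self.blocks[key] = block             # first sighting: store the block
            recipe.append(key)                       # duplicates cost only a reference
        return recipe

    def read(self, recipe):
        """Rebuild the original data by following the references."""
        return b"".join(self.blocks[key] for key in recipe)

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = DedupeStore()
backup = b"A" * 4096 * 3 + b"B" * 4096  # four blocks, only two of them unique
recipe = store.write(backup)
print(len(recipe), len(store.blocks), store.stored_bytes())  # 4 2 8192
```

The dictionary gains one entry per unique block, which is why the index can become the limit on how large a device grows.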
The deduplication algorithm is either applied “inline,” meaning it’s run as data enters the dedupe engine, or “post-process,” which caches data first, then runs the deduplication algorithm as a second step. Which method is better probably depends on the environment, and the issue is certainly debated. But in general, inline can require less storage space since it dedupes all data first, and post-process can require less backup window time, as it writes all data to disk without waiting to dedupe it. But where the dedupe process occurs is probably more important than how it’s done.
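The storage side of that tradeoff can be sketched in a few lines of Python. The function names are hypothetical and the model is deliberately simplified: it tracks only bytes on disk, ignores timing entirely, and counts only the staging area toward the post-process peak.

```python
import hashlib


def _key(block):
    return hashlib.sha256(block).hexdigest()


def _size(store):
    return sum(len(block) for block in store.values())


def inline_ingest(blocks):
    """Dedupe each block as it arrives; only unique blocks ever reach disk."""
    store = {}
    peak = 0
    for block in blocks:
        store.setdefault(_key(block), block)
        peak = max(peak, _size(store))
    return peak, _size(store)  # (peak bytes on disk, final bytes on disk)


def post_process_ingest(blocks):
    """Land every block in a staging area first, then dedupe in a second pass."""
    staging = list(blocks)
    peak = sum(len(block) for block in staging)  # simplification: staging area only
    store = {}
    for block in staging:
        store.setdefault(_key(block), block)
    return peak, _size(store)


blocks = [b"x" * 1024] * 8 + [b"y" * 1024] * 2
print(inline_ingest(blocks))        # (2048, 2048)
print(post_process_ingest(blocks))  # (10240, 2048)
```

Both paths end with the same deduplicated footprint. The difference is the temporary space post-process needs to hold the raw backup, which it trades for not hashing during the backup window.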
Some dedupe tools perform this process on the client server (source-side or client-side) and others on the storage array itself (target-side). Source-side dedupe reduces data volumes before they’re sent over the network, so it can reduce real backup window times and bandwidth requirements but puts the CPU load of running the deduplication algorithm on the client server. It also doesn’t typically include as much data in the dedupe comparison as target-side dedupe, so its compression numbers aren’t usually as good. Target-side dedupe technology is easier to implement (there’s one appliance as opposed to loading software on each client) and typically produces better compression. But it also loads the network with the entire backup data set and doesn’t reduce backup window time appreciably.
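A rough Python model of the network and capacity tradeoff described above follows. It rests on two loudly stated assumptions: each source-side client dedupes only against its own data, and the traffic for hash keys is ignored. Real source-side products may consult a shared index, so treat the numbers as illustrative only. All function names are hypothetical.

```python
import hashlib


def unique_blocks(blocks, index=None):
    """Return the blocks not already in index, adding their keys to it."""
    index = set() if index is None else index
    new = []
    for block in blocks:
        key = hashlib.sha256(block).digest()
        if key not in index:
            index.add(key)
            new.append(block)
    return new


def source_side(clients):
    """Each client dedupes locally (assumed per-client index) before sending."""
    network = stored = 0
    for blocks in clients:
        sent = sum(len(b) for b in unique_blocks(blocks))
        network += sent
        stored += sent
    return network, stored


def target_side(clients):
    """Clients send everything; the appliance dedupes against one global index."""
    index = set()
    network = stored = 0
    for blocks in clients:
        network += sum(len(b) for b in blocks)
        stored += sum(len(b) for b in unique_blocks(blocks, index))
    return network, stored


os_image = [bytes([i]) * 1024 for i in range(4)]  # data common to both clients
client_a = os_image * 2 + [b"a" * 1024]
client_b = os_image + [b"b" * 1024]

print(source_side([client_a, client_b]))  # (10240, 10240)
print(target_side([client_a, client_b]))  # (14336, 6144)
```

Under these assumptions, source-side sends less over the wire while target-side stores less, because only the appliance sees that both clients carry the same blocks.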
As dedupe technology has evolved, several variants have appeared and the technology has found its way into most storage device categories. Backup is still the primary application, with more successful source-side products available, but dedupe is also being put into primary storage arrays as an optimization technology, usually for lower tiers. Primary storage dedupe savings are much lower than they are for backup data (typically single-digit percentages), due to the lack of redundancy in primary data, but the effort is still worthwhile.
Deduplication isn’t the “gee whiz” technology that will get you appointments like it used to. It’s almost a checkbox item for backup storage products and not the differentiator it used to be. But as a VAR, you should keep up with dedupe technology enough to understand which product to introduce into each application. Given the breadth of solutions available, more than a single dedupe product should be on the line card. Understanding how each works can also help you win deals against other dedupe technologies.
In general, the more storage processes that occur after dedupe, the better, as they will enjoy a smaller data set. This would indicate that, everything else being equal, source-side dedupe is preferable to target-side and any dedupe is a good idea before off-site replication. That said, everything usually isn’t equal and the dedupe technologies can be close enough in performance that there will be multiple appropriate solutions. This puts dedupe into the familiar “choosing a vendor” decision matrix that every VAR deals with on a regular basis.
Follow me on Twitter: EricSSwiss.