Understanding Inline vs. Post-Process De-duplication
A major difference among de-duplication products is when the de-duplication occurs: “inline” (in real time, as the data is flowing and before it is written) or “post-process” (after the data has been written). The relative benefits and drawbacks of the two approaches are much debated.
In the “inline” process, the de-duplication hash calculations are made in real time as the data enters the device. If an incoming block matches a block already stored on the system, the new block is not stored; only a reference to the existing block is added. The obvious benefit of inline de-duplication is that it requires less storage, because duplicate data is never written. But since the hash calculations and lookups have to take place on the write path, throughput may be lower.
The process of de-duplication is CPU-intensive and involves processing every bit of information in a given volume or backup. It is therefore argued that if de-duplication occurs during backup processing, for example, the backup would slow down. To refute this argument, vendors of inline de-duplication have demonstrated product performance similar to that of their post-process counterparts.
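The inline write path can be illustrated with a minimal sketch. It assumes fixed-size 4 KB blocks, SHA-256 fingerprints and an in-memory index; the class and method names are hypothetical and do not describe any vendor's product.

```python
import hashlib


class InlineDedupStore:
    """Toy block store that de-duplicates on the write path (illustrative only)."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # fingerprint -> block bytes, each unique block stored once
        self.files = {}   # file name -> ordered list of fingerprints

    def write(self, name, data):
        refs = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            # The hash is computed before anything is stored.
            fingerprint = hashlib.sha256(block).hexdigest()
            # The lookup also happens on the write path.
            if fingerprint not in self.blocks:
                self.blocks[fingerprint] = block
            # A duplicate block only adds a reference.
            refs.append(fingerprint)
        self.files[name] = refs

    def read(self, name):
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = InlineDedupStore()
payload = b"A" * 4096 + b"B" * 4096 + b"A" * 4096  # third block repeats the first
store.write("backup1", payload)
print(len(payload), store.stored_bytes())
```

Every call to `write` pays for hashing and an index lookup before it returns, which is the throughput cost discussed above.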
In “post-process” de-duplication, the new data is first stored on the device and the analysis for de-duplication happens at a later time. The obvious benefit is that no hash calculation or lookup is required before storing the data, so storage performance is not impacted. The essential drawback is that the duplicate data is stored, albeit for a short time before de-duplication, and so may require more storage space than is ultimately needed (the worst case being that the storage is already near full capacity).
Post-process de-duplication vendors respond that their solutions are designed so that de-duplication completes quickly, and definitely before the next scheduled backup or vault to tape (i.e., the extra storage requirement is not high enough to count as a major drawback).
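A rough sketch of the post-process flow follows. It assumes a staging area that receives raw data and a separate pass that hashes fixed-size blocks later; `PostProcessStore` and its methods are hypothetical names used only for illustration.

```python
import hashlib


class PostProcessStore:
    """Toy store that lands raw data first and de-duplicates in a later pass."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.staging = {}  # file name -> raw bytes, not yet de-duplicated
        self.blocks = {}   # fingerprint -> unique block bytes
        self.files = {}    # file name -> ordered list of fingerprints

    def write(self, name, data):
        # No hashing or lookup here: the write path only stores the data.
        self.staging[name] = data

    def deduplicate(self):
        # Scheduled later, for example after the backup window closes.
        for name in list(self.staging):
            data = self.staging.pop(name)
            refs = []
            for i in range(0, len(data), self.block_size):
                block = data[i:i + self.block_size]
                fingerprint = hashlib.sha256(block).hexdigest()
                self.blocks.setdefault(fingerprint, block)
                refs.append(fingerprint)
            self.files[name] = refs

    def read(self, name):
        if name in self.staging:
            return self.staging[name]
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def stored_bytes(self):
        staged = sum(len(data) for data in self.staging.values())
        unique = sum(len(block) for block in self.blocks.values())
        return staged + unique


store = PostProcessStore()
payload = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
store.write("backup1", payload)
before = store.stored_bytes()  # duplicates still occupy space
store.deduplicate()
after = store.stored_bytes()   # space is reclaimed only after the pass
print(before, after)
```

The gap between `before` and `after` is the temporary extra capacity that the drawback above refers to.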
The inability to do forward referencing is also mentioned as a drawback of “inline” de-duplication.
The typical approach in de-duplication (also referred to as “reverse referencing”) is to eliminate the most recent duplicate data and create pointers to the previously stored data. The alternative, forward referencing (which currently seems to be supported by only one vendor, SEPATON), keeps the most recent data and replaces the previously stored data with pointers to it. Though forward referencing requires more work, the benefit is that in an emergency the most recent backup is in complete form for a faster restore.
Some points in favor of “inline” de-duplication (mostly as claimed by the vendors DataDomain and IBM) are:
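The difference between the two referencing directions can be sketched with a toy model. It assumes each backup is a list of entries, either `("data", block)` or `("ref", backup_name, position)`; this layout and the function names are hypothetical and are not SEPATON's or any other vendor's actual design.

```python
def store_reverse(backups, name, blocks):
    """Reverse referencing: duplicates in the new backup point back to older data."""
    index = {}  # block content -> (backup name, position) of the stored copy
    for old_name, entries in backups.items():
        for pos, entry in enumerate(entries):
            if entry[0] == "data":
                index[entry[1]] = (old_name, pos)
    new_entries = []
    for pos, block in enumerate(blocks):
        if block in index:
            new_entries.append(("ref",) + index[block])
        else:
            new_entries.append(("data", block))
            index[block] = (name, pos)
    backups[name] = new_entries


def store_forward(backups, name, blocks):
    """Forward referencing: the new backup stays whole; older copies become pointers."""
    newest = {}  # block content -> first position in the new backup
    for pos, block in enumerate(blocks):
        newest.setdefault(block, pos)
    for entries in backups.values():
        for pos, entry in enumerate(entries):
            if entry[0] == "data" and entry[1] in newest:
                entries[pos] = ("ref", name, newest[entry[1]])
    backups[name] = [("data", block) for block in blocks]


def restore(backups, name):
    """Rebuild a backup, following pointers until real data is reached."""
    result = []
    for entry in backups[name]:
        while entry[0] == "ref":
            _, target, pos = entry
            entry = backups[target][pos]
        result.append(entry[1])
    return result


reverse, forward = {}, {}
for store, target in ((store_reverse, reverse), (store_forward, forward)):
    store(target, "monday", [b"a", b"b"])
    store(target, "tuesday", [b"a", b"c"])

print(reverse["tuesday"])  # newest backup contains a pointer to monday
print(forward["tuesday"])  # newest backup contains only real data
```

In this model `store_forward` has to rewrite entries in older backups on every run, which reflects the extra work mentioned above, while restoring the newest backup never follows a pointer.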
- Inline de-duplication is the most efficient and economical method of de-duplication: it reduces the raw capacity needed in the system because the full, not-yet-de-duplicated data is never written to storage. If the data stored runs to terabytes, the storage saved by “inline” de-duplication is significant.
- Replication can occur concurrently with (as part of) backup and inline de-duplication. This can reduce network bandwidth by up to 95% and, more importantly, optimize time-to-DR (disaster recovery), as there is no need to wait for the data to be written completely and then de-duplicated (which is the case in post-process, where replication to the remote site has to wait). In other words, post-process de-duplication increases the lag before de-duplication is complete and, by extension, before replication completes.
- DataDomain points out that post-process de-duplication may create operational issues, as there are two storage zones to manage, each with different policies and behaviors. In some cases, since the redundant storage zone is the default and more important design for some vendors, the dedupe zone is also much less performant and resilient.
- IBM points out that while the post-process de-duplication vendors claim their approach shortens the “backup window,” they fail to mention that it adds a new “de-duplication window” that needs to be scheduled and managed. This approach increases the amount of storage needed to temporarily store un-deduplicated data and more than doubles CPU cycles and I/O requirements, meaning higher costs and more headaches.
- Another trend, pointed out in http://www.datacenterpost.com/2011/03/data-deduplication-not-just-for-backup.html, is that third-generation inline data de-duplication systems, instead of using expensive DRAM with limited space as a cache to improve de-duplication performance, use high-speed solid state disk (SSD) that can hold the entire de-duplication database. The result is a versatile storage system that provides excellent I/O performance with the added efficiency of compression and data de-duplication.
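The capacity argument in the points above can be made concrete with a back-of-the-envelope calculation. This is a simplified model, not a vendor formula: it assumes the post-process landing zone holds the full raw backup until the de-duplicated copy has been built, and the 10 TB backup and 10:1 ratio are made-up figures.

```python
def peak_capacity_tb(backup_tb, dedup_ratio, post_process):
    """Return the assumed peak disk usage in TB for a single backup."""
    if dedup_ratio < 1:
        raise ValueError("dedup_ratio must be at least 1")
    deduplicated_tb = backup_tb / dedup_ratio
    if post_process:
        # Assumption: raw and de-duplicated copies coexist until staging is freed.
        return backup_tb + deduplicated_tb
    # Inline: only the de-duplicated data is ever written.
    return deduplicated_tb


inline_peak = peak_capacity_tb(10, 10, post_process=False)
post_peak = peak_capacity_tb(10, 10, post_process=True)
print(inline_peak, post_peak)
```

Real systems may release staging space incrementally, so the post-process figure is best read as an upper bound under this assumption.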
Some vendors (NetApp, Quantum) are expected to let the user configure or toggle between inline and post-process de-duplication, depending on the available CPU, storage size and workload.