Storage Soup

May 2 2007   1:14PM GMT

At the brink of the data deduplication wars

Nicole D'Amour

OK folks, the data deduplication war has begun. At the center of the war are vendors such as Data Domain, Diligent Technologies, ExaGrid, FalconStor, Quantum and SEPATON. It is only a question of time before EMC, NetApp and Symantec join the fray. To understand what is happening, let’s start with what has happened in the last five years relative to disk-to-disk technologies. The early pioneers included the six vendors mentioned above. Diligent, Quantum and SEPATON brought their VTL products to market approximately three years ago and started marketing the value of secondary disk for backup and restore. By making disk look like tape, they correctly maintained that backup procedures would require no alterations and yet you would see vast improvements in backup and restore speeds and reliability. I think all of them have adequately proved that value. Many of you have told me you have seen speed improvements of 3x in backups, with 30-50% improvements very common.

None of these vendors said a thing about data deduplication at the time they entered the market. Data Domain, on the other hand, took a very different tack. They came to market with a disk-based product targeting the same space but focused on data deduplication, front and center. Their premise from the beginning was that by eliminating duplication of data at a sub-file level, one could keep months of backup data on disk and therefore have fast access not just to what was backed up yesterday but to data that was months, even years, old. When viewed through the data deduplication lens, Data Domain took the lion’s share of the market, with 53% of the storage with data deduplication in 2006, according to our estimates. Along with Avamar (now EMC), another data deduplication-centric backup software vendor, they presented an argument for changing the role of tape to that of very long-term retention.

The liftoff for Data Domain took some time in the market, and they focused initially on the SMB market. This was no surprise to me because all paradigm-shifting ideas take time to sink in. And frankly, you did the right thing in testing the waters before jumping in head first. But the idea made sense. If you could keep months of backup data on disk but do it at prices that came close to tape, why wouldn’t you? Once the concept was validated and you built trust in the vendor, you started buying hundreds of terabytes of secondary disk.

While Data Domain was pushing data deduplication, they were also inherently pushing disk as a medium for long-term storage of backups. At the same time, others were presenting their VTL solutions and convincing you of the merits of secondary disk but without any data deduplication. But, behind the scenes, they all knew they had to add data deduplication as quickly as possible to compete in this nascent but $1B+ market. Each worked on different ways to squeeze redundancy out of backup data.

At the concept level, they all do the same thing. The way full and incremental backups have been done for years, there is a lot of redundancy built in. Take, for instance, the full backups that you typically do once a week. How much of that data is the same week to week? 90% would not be a bad guess. Why keep copying the same stupid thing again and again? Even with incremental backups, existing files that have even a single byte changed are backed up again. Why? It is best to not get me going on that front. I happen to think that the legacy backup vendors did a miserable job on that front. But we will leave that aside for now. Back to data deduplication. The idea is to break the file into pieces and keep each unique piece only once, replacing redundant uses of it with small pointers that point to the original piece. As long as you keep doing full and incremental backups using legacy products from Symantec, EMC (Legato), IBM Tivoli, HP or CA, you will continue to see vast amounts of redundant data that can be eliminated. The value of eliminating this redundant data has been made abundantly clear in the past year by Data Domain customers.
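To make the "unique pieces plus pointers" idea concrete, here is a minimal sketch in Python. It is not any vendor's implementation: the fixed 4 KB chunk size, the use of SHA-256 as the fingerprint, and the `ChunkStore` class are all assumptions chosen for illustration. Real products differ in how they cut pieces and how they detect matches.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed piece size; purely illustrative


class ChunkStore:
    """Hypothetical store that keeps each unique piece exactly once."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> piece bytes

    def put(self, data: bytes) -> list:
        """Store a backup stream; return its 'recipe' (list of pointers)."""
        recipe = []
        for start in range(0, len(data), CHUNK_SIZE):
            piece = data[start:start + CHUNK_SIZE]
            fingerprint = hashlib.sha256(piece).hexdigest()
            # Only the first occurrence of a piece consumes disk space.
            self.chunks.setdefault(fingerprint, piece)
            recipe.append(fingerprint)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Rebuild the original stream by following the pointers."""
        return b"".join(self.chunks[fp] for fp in recipe)

    def stored_bytes(self) -> int:
        return sum(len(piece) for piece in self.chunks.values())


# Two weekly "full backups" of ten pieces each; only one piece changes.
week1 = b"".join(bytes([i]) * CHUNK_SIZE for i in range(10))
week2 = week1[:3 * CHUNK_SIZE] + bytes([200]) * CHUNK_SIZE + week1[4 * CHUNK_SIZE:]

store = ChunkStore()
recipe1 = store.put(week1)
recipe2 = store.put(week2)

logical = len(week1) + len(week2)
print(f"logical bytes: {logical}, stored bytes: {store.stored_bytes()}")
```

In this toy run the second full backup adds only the one changed piece to the store, and both backups can still be restored in full from their recipes.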

2006 saw data deduplication offerings from the VTL players: Diligent Technologies, FalconStor and Quantum (via its acquisition of ADIC, which had just acquired Rocksoft, an Australian vendor focused strictly on deduplication technologies). ExaGrid, an SMB player, uses the NAS interface and had deduplication integral to their product. Each does data deduplication differently. Some use hashing algorithms such as MD5, SHA-1 or SHA-2. Others use content awareness, versions or “similar” data patterns to identify byte-level differences. Each claims to get 20:1 data reduction ratios and more over time. Each presents its value proposition and achievable ROI, based on its internal testing. Some do inline data deduplication; others perform backups without deduplication first and then reduce the data in a separate process, after the backup is finished. Each presents its solution as the best. Are you surprised? I am not.
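Why do reduction ratios grow "over time"? A toy model helps. The sketch below assumes only weekly full backups of a constant size, with a fixed fraction of the data unchanged from one week to the next (the 90% guess used earlier in this post). It ignores daily incrementals, compression and differences in how each product finds duplicates, so it is not a prediction of what any vendor will deliver. The function name and parameters are hypothetical.

```python
def reduction_ratio(weeks_retained: int, unchanged: float = 0.9) -> float:
    """Logical-to-physical ratio for weekly fulls under a toy model.

    Assumptions (all illustrative): every full backup has the same size,
    and a fraction `unchanged` of each full duplicates the previous one.
    """
    if weeks_retained < 1:
        raise ValueError("need at least one backup")
    logical = float(weeks_retained)                      # N full copies sent
    physical = 1 + (weeks_retained - 1) * (1 - unchanged)  # first full + changes
    return logical / physical


for weeks in (1, 4, 12, 52):
    print(f"{weeks:>2} weeks retained -> {reduction_ratio(weeks):.1f}:1")
```

The point of the model is the shape, not the numbers: the longer the retention, the higher the ratio, which is why a quoted ratio means little unless you know the retention period and backup schedule behind it.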

What is clear to me is the following:

1. The value proposition of using disk for backup and restore is clear. No one can argue that anymore. The proof points are abundant and clear.

2. The merits of data deduplication are also abundantly clear.

3. However, the merits of various methods of data deduplication and the resultant reduction ratios achieved are not clear to you today (in general).

4. The market for these is huge (Taneja Group has projected $1,022M for capacity-optimized (i.e. with data deduplication) versions of VTL and $1,615M for all capacity-optimized versions of disk-based products in 2010).

5. Both the VTL and NAS interfaces will prevail. The battlefront is on data deduplication.

6. Vendors will do all they can in 2007 to convince you of their solution’s advantages. Hundreds of millions of dollars are at stake here.

7. By the end of this year we will see the separation between winners and losers. Of course, without de-duping I believe a product is dead in any case.

So, be prepared to see a barrage of data coming your way. I am suggesting to the vendor community that they run their products using a common dataset to identify the differences in approaches. I think you should insist on it. Without that, the onus is on you to convert their internal benchmarks to how it might perform in your environment. You may even need to try the system using your own data. This area of data protection is so important that I think we need some standard approach. We are doing our part in causing this to happen. You should do yours.

I think we have just seen the beginnings of a war between vendors on this issue alone. To make matters even more interesting, we will see EMC apply the data deduplication algorithms from their Avamar acquisition to other data protection products, maybe even the EMC Disk Library product (OEM’d from FalconStor). I expect NetApp to throw a volley out there soon. Symantec has data deduplication technology acquired from DCT a few years ago, but it is currently applied only to their PureDisk product. IBM and Sun, both OEMs of FalconStor, may use Single Instance Repository (SIR) from FalconStor or something else; no one is sure. I certainly am not. But I am certain that none of the major players in the data protection market dare stay out of this area.

Data deduplication is such a game-changing technology that the smart ones know they have to play. What I can say to you is simple: Evaluate data deduplication technologies carefully before you standardize on one. Three years from now, you will be glad you did. Remember that, for your environment, whether you get a 15:1 reduction ratio or 25:1 will translate into millions of dollars in terms of disk capacity purchased. I will be writing more about the subtle differences in these technologies. So stay tuned!
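The arithmetic behind the 15:1 versus 25:1 comparison is easy to run for your own environment. The sketch below is only an illustration: the 3,000 TB of logical backup data is a made-up figure, and the function name is hypothetical. Substitute your own retained backup volume and multiply the difference by your own price per terabyte.

```python
def physical_tb(logical_tb: float, ratio: float) -> float:
    """Disk needed to hold `logical_tb` of backups at a given reduction ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return logical_tb / ratio


LOGICAL_TB = 3000.0  # hypothetical amount of retained backup data

at_15 = physical_tb(LOGICAL_TB, 15)
at_25 = physical_tb(LOGICAL_TB, 25)
extra_tb = at_15 - at_25
extra_fraction = extra_tb / at_25

print(f"15:1 needs {at_15:.0f} TB, 25:1 needs {at_25:.0f} TB")
print(f"difference: {extra_tb:.0f} TB ({extra_fraction:.0%} more disk at 15:1)")
```

Whatever the starting volume, a 15:1 system needs about two-thirds more physical disk than a 25:1 system for the same data, which is why the achieved ratio matters so much at scale.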

6  Comments on this Post

  • Paul
    You incorrectly threw IBM's Tivoli Storage Manager product into the same class of backup products that do regular full and incremental backups. In fact, TSM does what they call "Progressive Incremental Backup," which does not require any periodic full backups. In addition, TSM has had something they call "Dynamic Subfile Backup," which can be used to incrementally back up only that *part* of a file that has changed since the last backup. We are running in this environment. I have been following with interest the developments in the deduplication wars, as you call them, and I am unclear how large the benefit of dedup technology would be in the TSM world. To be sure, you could dedup redundant copies of operating system and application files that are duplicated across multiple systems. While this will result in some reduction, it won't be near the amount of reduction that you would see with a traditional father/son/grandson backup scheme. The other wild card that concerns me is encryption. The need for encrypting backup data is increasing. If data is encrypted before it hits the VTL, then deduplication would be foiled for that encrypted data. If data is not encrypted until it hits the VTL, then it may be in cleartext while it traverses the network. I think there is still lots of room for improving how deduplication, encryption, and compression can all work together in a coordinated manner. ..Paul
  • Kevin Perry
    Paul, your comment about encryption is incorrect. If the same key is used on the data, you will get the same result. In short, you will lose your deduplication gain only if you change your key on a very frequent basis. Remember, be it Blowfish, DES or DES-3, it is the key that is important. Additionally, one should take into account the disk technology this intellectual property is sitting on. Yes, SATA drives are better today than yesterday, but do you really think spinning media will last 10+ years? OK... there are tricks such as rewriting the data on a scheduled basis, spinning disks down to rest for extended periods, and of course double-parity RAID protection. Bottom line: moving things break, and to think that Mr. DMX, Shark or Filer will be around forever is a big mistake. Do the math.
  • Jim R.
    Paul's points about de-dupe in a TSM environment taking advantage of compression are valid. File data, using the progressive backup method of TSM, will not reap the same reward as traditional fulls + incrementals on a weekly schedule. Where TSM users can benefit is when backing up large databases. A common demand is doing full DB backups nightly. Here you'll see good compression, as the changes within those DBs are minimal. Typically TSM doesn't have a lot of tape resources because a lot of backups are done to disk and then destaged to tape. A VTL's multiple virtual mount points, using LAN-Free and only recording changed data (even though fulls are being sent), can increase backup performance. And replication of backup data is now possible. Using new TSM 5.4 features, you can segregate your "Active" backup data in TSM on a dedupe STG pool and perhaps replicate that off-site, as only the block-level changes of your backup are being sent, not the entire file (provided there are existing similarities).
  • Peter Elliman
    This article does not point out that there are two areas where data deduplication can occur - at the source (like PureDisk and Avamar) or at the disk target (e.g., Data Domain, Diligent, etc.). These are two different use cases for data deduplication - the former is to improve network-based backups, especially for distributed data. The latter use case is geared towards managing storage growth, recovering from disk from multiple locations (via replication), and reducing reliance on tape for data with short retention spans. The data deduplication wars will not be over in 2007, by any means. Media hype precedes enterprise (read: data center) adoption by several years. How long has disk been hyped as the next thing, and yet how many large enterprises use all disk and no tape? Disk is only now becoming the final target for backups and is still less than 50% of most backup data. The math for figuring out dedupe claims can be understood if you separate bandwidth reductions (which client-side dedupe claims) from storage reductions. Most of the storage reductions relate to comparisons that include total retention time of data - a major driver of the reduction factor - and involve some assumptions about the tape-type backup (weekly full / daily incremental or daily fulls), since some dedupe vendors state that every backup is like a full. NetBackup has synthetic backups - the same as TSM's incremental forever. And all source-side dedupe products have large reductions in data versus this approach. Why use encryption on a disk (VTL) in the data center? I can see the need to encrypt before you go to tape (a more transportable medium).
  • Mike Sanders
    In addition to the other vendors mentioned, Asigra also provides data de-duplication with its Televaulting ROBO backup software. In the case of Asigra the data de-dupe is designed to minimize WAN bandwidth requirements.
  • Randy Wilcox
    In your opinion, what vendor/company/partner would you trust the most to run an enterprise level Fortune 100 data dedup project from the ground up? I have a large scale project beginning in the November time frame and need to start locating resources immediately. I would like to start by locating a superstar technical PM, if possible.....Thanks for any help!