Storage Soup

May 12, 2008 3:33 PM GMT

VendorFights: Data Deduplication Edition

Beth Pariseau

With data deduplication in the news today, I recommend checking out the responses to Jon Toigo’s questionnaire for data deduplication vendors. I found his questions about backing up deduped data to tape and the potential legal ramifications of changing data through dedupe especially interesting. The vendors’ responses so far about hardware-based hashing are also interesting, in that they seem to break down according to whether the responding company sells a hardware-based or a software-based product.

It would be pretty disappointing if Hifn’s announcement of hardware-based hashing led to a religious war around software- vs. hardware-based dedupe systems. It’s clear (and has been generally accepted, or so I thought) that hardware performs better than software, meaning it’s in users’ best interest to improve the throughput of data deduplication systems by moving processor-intensive calculations to hardware. And the dedupe market is full of enough FUD as it is.
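To put the “processor-intensive” part in concrete terms, here is a rough sketch of my own (with SHA-1 standing in for whatever hash a given product actually uses) that measures how fast a CPU can fingerprint data in software; hardware hashing aims to take exactly this work off the processor:

```python
# Rough illustration of why hashing is the processor-intensive step a
# card like Hifn's would offload: time a software hash over sample data
# and estimate throughput. SHA-1 is a stand-in here, not any vendor's
# confirmed algorithm.
import hashlib
import time

CHUNK = b"x" * (8 * 1024 * 1024)  # 8 MB of sample data
ROUNDS = 50

start = time.perf_counter()
for _ in range(ROUNDS):
    hashlib.sha1(CHUNK).digest()  # the fingerprint a dedupe lookup would use
elapsed = time.perf_counter() - start

mb_hashed = ROUNDS * len(CHUNK) / (1024 * 1024)
print(f"software SHA-1: {mb_hashed / elapsed:.0f} MB/s on this CPU")
```

Whatever number that prints is the ceiling a software-only hasher puts on ingest for that box, which is the throughput argument for offload.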

Speaking of which, Data Domain and EMC are getting all slapper-fight about dedupe thanks to today’s product announcement from Data Domain (and attendant comparisons to EMC/Avamar), and the fact that EMC is planning to finally roll out deduping tape libraries at EMC World (based on Quantum’s dedupe).

EMC blogger Storagezilla calls Data Domain’s press-release claim that its new product is 17 times faster than Avamar’s RAIN grid “nose gold” (props for the phraseology, at least), and then points out that Avamar’s back end doesn’t actually do any deduping, which is something I still don’t quite get.

So Data Domain’s box is faster at de-dup than the Avamar back end which doesn’t do any de-dup.

Since the de-dup is host based and only globally unique data leaves the NIC do I get to count the aggregate de-dup performance of all the hosts being backed up?

Yes, I do!

How does Avamar decide what data is ‘globally unique’? If this is determined before data leaves the host, then that processing must be done at the host. ‘Zilla even says he can count the aggregate performance of all the hosts being backed up in the dedupe performance equation... which brings me back to the first point again: Avamar’s back end doesn’t do de-dupe, but it’s faster at dedupe than Data Domain anyway?

Chris Mellor explored this further:

According to EMC, Avamar moves data at 10 GB/hr per node (moving unique sub-file data only). Avamar reduces typical file system data by 99.7 percent or more, so only 0.3 percent is moved daily in comparison to the amount that Data Domain has to move in conjunction with traditional backup software. This equals a 333x reduction compared to a traditional full backup (Avamar has customer data indicating as much as 500X, but 333X is a good average).
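The 333x number is just the reciprocal of the fraction of data that still moves; a quick sanity check of EMC’s arithmetic:

```python
# Sanity-checking the quoted figures: if only 0.3% of the data moves
# daily, the reduction versus a traditional full backup is the
# reciprocal of that fraction.
moved_fraction = 1 - 0.997  # 0.3% of a typical file system moves
print(f"reduction vs. full backup: {1 / moved_fraction:.0f}x")  # -> 333x
```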

‘An EMC spokesperson’ (should we assume it was, or wasn’t, Storagezilla himself?) further stated to Mellor:

“Remember that Data Domain has to move all of the data to the box, so naturally they’re focusing on getting massive amounts of data in quickly. EMC Avamar never has to move all of that data, so instead we focus on de-dupe efficiency, high-availability and ease of restore. Attributes that are more meaningful to the customer concerned with effective backup operations.”

Again I ask, where does the determination that data is ‘globally unique’ take place? It’s got to be taking up processor cycles somewhere. The rate at which it makes those determinations, and where it makes those determinations, would be the apples-to-apples comparison with DD, which is making those calculations as data is fed into its single-box system.

All of that overlooks that the real meat and potatoes when it comes to dedupe is single-stream performance anyway; total aggregate throughput over groups of nodes (which is really what both vendors are talking about) doesn’t mean as much. For one thing, Data Domain’s aggregate isn’t really aggregate, because it doesn’t have a global namespace yet. For another, I fail to see how EMC can even quote an aggregate TB/hr figure when talking about a group of networked nodes. Doesn’t network speed factor pretty heavily into that equation?
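For a rough sense of why network speed matters here, consider what an aggregate figure implies about the wires feeding those nodes. The numbers below are illustrative assumptions of mine, not either vendor’s:

```python
# Back-of-the-envelope: what an aggregate TB/hr claim implies about the
# network feeding a group of nodes. All figures here are assumptions.
link_gbps = 1.0  # assume gigabit Ethernet per node
tb_per_hr_per_link = link_gbps / 8 * 3600 / 1000  # Gbps -> GB/s -> TB/hr

quoted_aggregate_tb_hr = 10.0  # a hypothetical aggregate claim
links_needed = quoted_aggregate_tb_hr / tb_per_hr_per_link

print(f"{tb_per_hr_per_link:.2f} TB/hr per fully saturated GbE link")  # ~0.45
print(f"a {quoted_aggregate_tb_hr:g} TB/hr claim needs ~{links_needed:.0f} saturated links")
```

Even a fully saturated gigabit link moves well under half a terabyte an hour, so any multi-TB/hr aggregate claim is quietly assuming a lot of links running flat out.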

Personally, I don’t think either vendor is really putting it on the line in this discussion (c’mon guys, get MAD out there ;)!). And if Avamar really performs better than Data Domain, why isn’t its dedupe IP being used in EMC’s forthcoming VTLs? (EMC continues to deny this officially, or at least refuses to confirm, but there’s internal documentation floating around at this point that indicates Quantum is the partner.)

Meanwhile, according to EMC via Mellor:

EMC says Data Domain continues to compare apples and oranges because it wants to avoid the discussion that there are a number of different backup solutions that fit a variety of unique customer use cases.

I have to admit this made me chuckle. Most of the discussions I’ve had about EMC over the last year or so have involved their numerous backup and replication products and what the heck they’re going to do with them all long-term. Finally, it seems we have an answer: Turn it into a marketing talking point!

I don’t think Data Domain even really wants to avoid that subject, either. They’re well aware that there are a number of different products out there that fit different use cases, given their positioning specifically for SMBs who want to eliminate tape.

At the same time, it’s interesting to watch the EMC marketing machine fire itself up in anticipation of a new major announcement; the scale and coordination are something to behold. This market has already been a contentious one. It’ll be interesting to see what happens now that EMC’s throwing more of its chips on the table.

19 Comments on this Post

  • Scott (EMC)
    Beth: Let me answer the one question that I can without getting into the fracas over speeds, feeds, and who has the right approach to dedup, EMC or DD. That is a discussion for another time. The question is: how does Avamar do dedup at the client and only send globally unique data? And the answer is: each client runs a backup. That backup identifies unique segments (bigger than blocks, smaller than files). Those segments are hashed (twice, to prevent collisions). The hashes (only, so far) are sent to the Avamar server. The server identifies which of those hashes/segments it already has, and communicates that back to the client. The client then identifies the remaining segments, which, by definition, are globally unique, and sends them to the server. Done. And the server doesn't need to perform any deduplication operations at all in this model. I will provide a fuller description of the workings of Avamar on my blog over the next few days.
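For anyone who wants to see that exchange spelled out, here is a minimal sketch of the pattern described in the comment above: hash at the client, ask the server which fingerprints it already holds, ship only the rest. This is my own illustration under those assumptions, not Avamar code:

```python
# Illustrative client-side dedupe exchange, not Avamar's implementation.
# The client fingerprints segments, the server reports which fingerprints
# it has never seen, and only globally unique segments cross the wire.
import hashlib

def segment(data: bytes, size: int = 4096):
    """Naive fixed-size segmenting; real products pick smarter boundaries."""
    return [data[i:i + size] for i in range(0, len(data), size)]

class DedupeServer:
    def __init__(self):
        self.store = {}  # fingerprint -> segment bytes

    def missing(self, fingerprints):
        # Tell the client which fingerprints we do not already hold.
        return [fp for fp in fingerprints if fp not in self.store]

    def put(self, fp, seg):
        self.store[fp] = seg

def backup(client_data: bytes, server: DedupeServer):
    segs = {hashlib.sha256(s).digest(): s for s in segment(client_data)}
    needed = server.missing(list(segs))  # only hashes cross the wire here
    for fp in needed:
        server.put(fp, segs[fp])         # only unique segments are sent
    return len(needed), len(segs)

server = DedupeServer()
sent, total = backup(b"hello world" * 2000, server)
print(f"first backup sent {sent} of {total} unique segments")
sent, total = backup(b"hello world" * 2000, server)
print(f"second backup sent {sent} of {total} (all already on the server)")
```

Note that the server side here is just a lookup table; all the hashing happens at the client, which is exactly where Beth’s processor-cycles question points.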
  • Scott (EMC)
    Hmmm... well, my first attempt at commenting was mysteriously deleted. Let's chalk it up to the caprice of the internets. I attempted to answer the question: how does Avamar globally dedup, but not dedup on the server? The answer is that every client dedups its data set by segment (bigger than a block, smaller than a file). It then hashes these segments (twice, to prevent collisions). It then sends unique hashes to the server, which compares them to the list of hashes already stored. The server sends back the list of hashes/segments which it does not already have, and the client transmits these unique segments to the server. Voila, global deduplication with all deduplication performed at the client level. I will have more on the workings of Avamar on my blog over the next few days.
  • draft_ceo
    Hifn does not do de-dupe. It just gives a hardware interface to compute a hash function, and someone still has to write all the remaining 99.99% of the software to get de-dupe. It is amazing that the Hifn guys have managed to sneak in their name without actually doing de-dupe.
  • Beth Pariseau
    appreciate the clarifications, guys. if i'm understanding you correctly, the hashing and comparisons are done in software, but broken up so the performance impact on each server is minimized. what i'm saying is, at each client, are processor cycles being used? how fast are those calculations done at the client level? as data is being sent back and forth, i would imagine that both ends of the wire are doing SOME processing. so what amount is it? how fast is it done? is it done entirely in software? if so, this would seem to lend itself to rumblings i've heard, that avamar doesn't scale. it would make sense if a software-only product hit a wall at some point, no matter how many smaller pieces it's broken up into, given a large enough amount of data... also, i'm still hoping someone can clarify for me how the aggregate performance of Avamar's clients is calculated?
  • Scott (EMC)
    Beth: hashing and comparisons are performed in software, at the client. Just like any other backup application, a client is a server or host of something you want to protect. So yes client cycles are being used. More than a "traditional" backup application would too. However, what we find is that the backup occurs much faster than a traditional backup, often 90% faster, so we use more cycles for much less time. As for the "doesn't scale" comment--I don't know what to say to that. If you could be more specific, then I could probably give you a good answer. All I can say is that we have Avamar customers protecting thousands of systems with 100s of TB. And I don't mean some marketing double speak there: as in thousands of customers, and individual customers are protecting 1000s of clients, and 100s of TB. That strikes me as scaling pretty well!
  • David
    Draft_ceo... you should read more of Beth's articles... Hifn does DeDup: http://searchstorage.techtarget.com/news/article/0,289142,sid5_gci1304236,00.html#
  • Storage Dork
    My big question is "What's EMC's future with FalconStor?" These new EDLs are a clear slap in the face to FalconStor's SIR technology. It has been rumored that EMC approached FalconStor to do a sole-source agreement for its SIR technology, and when FalconStor said "No," EMC found another partner. Is this the beginning of the end for FalconStor's EMC business? EMC bought NearTek at the end of 2006 and now this partnership with Quantum. Clearly EMC is not happy with FalconStor at some level. Do you expect EMC to ever create a NearTek-based VTL solution?
  • Beth Pariseau
    //Beth: hashing and comparisons are performed in software, at the client. Just like any other backup application, a client is a server or host of something you want to protect. So yes client cycles are being used. More than a "traditional" backup application would too. However, what we find is that the backup occurs much faster than a traditional backup, often 90% faster, so we use more cycles for much less time.//

    OK. So how fast - at what throughput in MBps / GBps - do each of those clients do the calculations? The 10 TB / hour figure Mellor quoted? Something else? What role does network latency play in the overall performance of the system?

    David and draft_, oddly enough, you're kind of both right. Hifn does the hashing calculations, which are the most processor-intensive part of the dedupe process, but it's still reliant on its OEMs to write software around it for the other parts. So does Hifn 'do' dedupe? Yes and no. I'm not sure where draft_ceo's comment was directed, however, since I was pretty careful to refer to 'hashing calculations' when it comes to Hifn's product.

    Storage Dork, we're all going to have to wait till EMC officially announces Quantum to ask that question, and we may never really get a straight answer. I have heard the same thing - that EMC wanted exclusivity on SIR. Then again, I was also told that SIR didn't make it through IBM's qualification process, which led to the Diligent acquisition. Exactly why there have been these issues has been a mystery - first with the delay in qualification of SIR by OEMs, now with the outright lack of qualification by major OEMs - for about a year now.
  • Scott (EMC)
    Beth: Chris Mellor was talking about Avamar nodes. Nodes are component pieces of the server "grid" (hate the term under the circumstances, but it is as good as any). There is no correlation between node/grid speed and client speed. Specific client speed is, as you can imagine, dependent on the speed and power of the client. The best way I can put it (and most accurate without a specific machine sizing to refer to) is that it is "somewhat" more intensive than a typical backup client. But, as I said, it is also usually 90%+ faster to complete the backup job.

    Network latency plays a surprisingly small role. That was one of my first questions too. :) Having said that, the biggest area you might think latency enters is the checking of hashes between client and server to determine if the server already has a segment. But Avamar bundles many of these hashes into a single packet, to reduce latency. About the only other specific comment I can make is the anecdotal one that I have run an Avamar client under a wide variety of network conditions (some hotel networks are really bad) and performance and backup time did not seem to degrade much at all irrespective of the network I was on. For a typical client, so little data is transmitted, and backups require so little time to complete, that the whole issue is a non-issue.

    Finally, note that there are a whole bunch of options on how to deploy the Avamar server: from virtual editions under VMware, to full-blown multi-node grids, to single nodes, all with and without replication as you choose. So there is significant flexibility. We really are discussing apples and oranges when it comes to this and DD; more on my blog in the next little while... I don't want to monopolize your space!
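The packet-bundling point in the comment above is plain request batching; here is a hypothetical sketch of why it blunts latency (the round-trip time, hash count, and batch size are made-up numbers of mine):

```python
# Hypothetical illustration of hash bundling: checking fingerprints one
# per round trip pays the network latency every time, while bundling
# many per request pays it once per batch. All numbers are assumptions.
RTT_MS = 30.0     # assumed client-to-server round-trip time
HASHES = 10_000   # fingerprints to check before a backup
BATCH = 500       # fingerprints bundled into one request

one_per_trip = HASHES * RTT_MS / 1000       # seconds of pure latency
batched = (HASHES / BATCH) * RTT_MS / 1000  # seconds of pure latency

print(f"one hash per round trip: {one_per_trip:.0f} s of latency")  # 300 s
print(f"batches of {BATCH}: {batched:.1f} s of latency")            # 0.6 s
```

With those assumptions, batching cuts the latency bill by the batch size, which is why a chatty hash-check protocol can still behave well on a bad hotel network.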
  • Stephan
    In a physical server, the CPU cost is probably not a big deal, given most physical hosts are grossly underutilized in terms of CPU cycles. However, the same statement cannot be made for a server virtualization environment. So even if backups complete faster, the CPU costs are "more than a traditional application would too." What that says to me is that I have to schedule backups in VMware environments in such a way as to incur as minimal an impact as possible on the physical host. Dealing with 1,200 VMs and 95 servers in my environment, this has to be a backup scheduling nightmare, although the dedup aspect is rather intriguing. Stephan
  • Beth Pariseau
    "first with the delay in qualification of SIR by OEMs, now with the outright lack of qualification by major OEMs" I agree and I do hope someone from falconstor steps in here and explains this for us (or an on the ball author gives them a call ;-) ). But, even if they're late, they do offer something different from the "host-based" and "in-band" solutions out there. There are a lot of good reasons to implement a SIR solution.
  • Beth Pariseau
    no worries on monopolizing, scott, this is a good discussion and informative for everybody, i think. still, i have to look back on the original posts that sparked this discussion and wonder... we all seem to agree that the products are apples and oranges (and avamar's performance is dependent on so many variables), so how could either side be claiming one performs better than the other? now i'm REALLY curious as to where data domain's multiple came from in their press release...
  • Jesse
    Performance evaluations are always subjective, especially when they are being done by the company that made the product. Testing is done in a pristine environment without much regard for the real world, yet the results are used to justify introduction of these products into the real world. Any product can be set up to perform well in a lab setting that may or may not be indicative of "production" situations. I always take such performance claims with a grain of salt. I've worked in R&D, specifically in qualification and test, and I remember times when the same test run back to back produced slightly skewed results. Not enough to invalidate the test, but definitely enough to make you wonder what else is going on in the background of the software.

    I'm not a big fan of compression (or de-duplication, as they're calling it now). I remember having a sales guy come in to pitch Avamar to me at one of my last jobs, and the fact that they wanted to sell me a product for tens of thousands of dollars to save a few bucks on my tape footprint was just staggering. De-duplication is essentially the same process used in zip/tar operations: you're taking repetitive blocks of data within a stream and replacing them with a kind of count/pointer system. All of it takes cycles. If it is done on the host, it takes CPU cycles; if it's done in-line via an appliance, it increases latency. Just my $2.12 (adjusted for inflation)
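Jesse's count/pointer description can be made concrete with a toy example (mine, not his): keep each unique block once in a table and replace repeats with references into it.

```python
# Toy version of the count/pointer idea in the comment above: repeated
# blocks are stored once, and the stream becomes a list of references
# (pointers) into a table of unique blocks.
import hashlib

def dedupe(blocks):
    table, recipe, index = [], [], {}
    for b in blocks:
        fp = hashlib.sha256(b).digest()
        if fp not in index:          # first time we've seen this block
            index[fp] = len(table)
            table.append(b)
        recipe.append(index[fp])     # store a pointer, not the block
    return table, recipe

def rehydrate(table, recipe):
    return b"".join(table[i] for i in recipe)

blocks = [b"AAAA", b"BBBB", b"AAAA", b"AAAA", b"CCCC", b"BBBB"]
table, recipe = dedupe(blocks)
assert rehydrate(table, recipe) == b"".join(blocks)
print(f"{len(blocks)} logical blocks stored as {len(table)} unique blocks")
```

Jesse's cycles-versus-latency trade-off shows up here too: the hashing happens wherever dedupe() runs, whether that is the host or an in-line appliance.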
  • Beth Pariseau
    love the inflation adjustment, Jesse. i think you're right... and it's also very dependent on the environment, as we've seen in this discussion. still, i've at least gotten DD to cop to a single-stream transfer rate of about 150 MBps in the past. was hoping to have some kind of genuinely comparative number like that w/r/t avamar. to at least have specific, numerical, competing *claims* would be a start... :) EMC'ers always act shocked when i bring up that scalability thing about avamar, but they've got to know that persistent viewpoint is out there. i've tried to dig into it in the past, and haven't yet personally spoken with any avamar users with 1000s of servers protecting hundreds of TB as was mentioned above. i've spoken with one who had an overall environment that could be described that way, but had a more limited amount of avamar in production. tho really, at this point, all of the above might be moot in the face of the 'rip and replace' factor when it comes to EMC's ability to use / sell Avamar's software...
  • Beth Pariseau
    Isn't EMC shelving Avamar? I know I've heard this somewhere out there. Is there any truth to this?
  • Beth Pariseau
    "Isn’t EMC shelving Avamar?" I've done some research. They are keeping Avamar but are refining the market they want to address with it.
  • Scott (EMC)
    Final piece of follow-up here. First, we are absolutely not shelving Avamar. Not in any way, shape, or form. I suppose the comment that we are refining the market would be more accurately represented as "we refined" the market about 18 months ago. Not to say that there aren't overzealous salespeople or partners, or that it wasn't pitched as a panacea for backup by one of them. But it isn't. It does have places where it is a better fit than others. But it will be a core part of EMC's backup strategy. Secondly, I have started to throw up some basic information about Avamar on my blog; feel free to read and question. Client and server descriptions are first, use cases come next, with a focus on VMware. That may come after EMC World, but it will come.
  • Jon Toigo
    Beth, thanks for the reference to my blog questionnaire about de-dupe at DrunkenData. We have about 10 vendor responses now and I was told at Storage Decisions to expect others from IBM/Diligent, FalconStor and Data Domain. Missing is any response from EMC. I have been told via email by an insider that they probably see no value in posting any responses since "there is no upside in it for them." Curious that.
  • Paul Clifford
    Why do we de-dupe? Simply because intuitively we know that a lot of the data we manage is multiple copies, and we are already storing enough stuff. It is growing too fast, costing too much, and continuing to spiral out of control. So de-dupe makes sense - but wait a minute. Somewhere between 70-80% of purchased disk is wasted. Now, I am not saying it isn't used, but it is wasted, as the traditional approach of "carving" out a LUN 10 times larger than the data we have creates all this white space. We create this canister that is seldom filled up, on very expensive disks.

    The problem is traditional storage - building RAID sets and then stacking them in a box. New GUIs and features don't fix the underlying architectural problem: expensive filing cabinets. This worked in an older time, when 100 GB of data was a "boatload." It is not working today. So think about it: only about 20% of the space on all the disks I buy contains data. Then, out of all the data I have, only about 20% of it is even accessed! So I buy all these 15K FC disks for what amounts to only 5% active data. This is ludicrous. But folks, this is the world of traditional storage. This is why we use de-dupe to try and reclaim some of this precious space.

    How about changing the architecture? How about designing systems with technology that meets performance and storage requirements? How about designing systems to match IOPS requirements, and storing everything else on SATA? Utilizing innovative technologies like "real" thin provisioning (only really available from 3PAR and Compellent) makes a huge leap forward. Automated ILM then manages the data so the right stuff is in the right place. After we have restructured the architecture, then let's de-dupe. We can bolt it on after we gain control, rather than leading with it.

    Paul Clifford
    Davenport Group
    www.davenportgroup.com
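Taking Paul's percentages at face value (his estimates, not measurements of any particular shop), the arithmetic lands roughly where he says:

```python
# Working through the utilization figures in the comment above; both
# percentages are the commenter's estimates, not measured values.
allocated_used = 0.20  # ~20% of purchased disk actually holds data
data_accessed = 0.20   # ~20% of stored data is actively accessed

active_share = allocated_used * data_accessed
# Prints 4%, i.e. roughly the "5% of active data" Paul cites.
print(f"active data as a share of purchased capacity: {active_share:.0%}")
```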
