Scott W |
Let me answer the one question that I can without getting into the fracas over speeds, feeds, and who has the right approach to dedup, EMC or DD. That is a discussion for another time.
The question is: how does Avamar do dedup at the client and only send globally unique data? And the answer is: each client runs a backup. That backup identifies unique segments (bigger than blocks, smaller than files). Those segments are hashed (twice, to prevent collisions). Only the hashes, so far, are sent to the Avamar server. The server identifies which of those hashes/segments it already has, and communicates that back to the client. The client then sends the remaining segments, which by definition are globally unique, to the server. Done. And the server doesn’t need to perform any deduplication operations at all in this model.
I will provide a fuller description of the workings of Avamar on my blog over the next few days.
Hmmm… well, my first attempt at commenting was mysteriously deleted. Let’s chalk it up to the caprice of the internets.
I attempted to answer the question: how does Avamar globally dedup, but not dedup on the server?
The answer is that every client dedups its data set by segment (bigger than a block, smaller than a file). It then hashes these segments (twice, to prevent collisions). It then sends the unique hashes to the server, which compares them to the list of hashes already stored. The server sends back the list of hashes/segments it does not already have, and the client transmits those unique segments to the server. Voila, global deduplication with all deduplication performed at the client level.
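The exchange Scott describes can be sketched in a few lines of Python. Everything here (class and function names, SHA-1, fixed-size segments) is illustrative only, not Avamar's actual implementation; Avamar's real segments are variable-size.

```python
import hashlib

def segment(data: bytes, size: int = 4096):
    """Split a byte stream into segments (fixed-size here for simplicity)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

class Server:
    def __init__(self):
        self.store = {}  # hash -> segment bytes, each stored exactly once

    def missing(self, hashes):
        """Which of these hashes has the server never seen?"""
        return {h for h in hashes if h not in self.store}

    def receive(self, segments):
        for seg in segments:
            self.store[hashlib.sha1(seg).hexdigest()] = seg

def client_backup(data: bytes, server: Server):
    segs = segment(data)
    hashes = [hashlib.sha1(s).hexdigest() for s in segs]
    wanted = server.missing(set(hashes))     # round trip #1: hashes only
    sent, seen = [], set()
    for s, h in zip(segs, hashes):
        if h in wanted and h not in seen:    # one copy per globally unique segment
            seen.add(h)
            sent.append(s)
    server.receive(sent)                     # round trip #2: unique segments only
    return len(sent), len(segs)

srv = Server()
first = client_backup(b"A" * 8192 + b"B" * 4096, srv)   # sends 2 of 3 segments
second = client_backup(b"A" * 8192 + b"B" * 4096, srv)  # sends 0 of 3 segments
```

The second backup ships nothing but hashes over the wire, which is the whole point of the scheme.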
I will have more on the workings of Avamar on my blog over the next few days.
Hifn does not do de-dupe. It just provides a hardware interface to compute a hash function, and someone still has to write the remaining 99.99% of the software to get de-dupe. It is amazing that the Hifn guys have managed to sneak in their name without actually doing de-dupe.
Beth Pariseau |
appreciate the clarifications, guys. if i’m understanding you correctly, the hashing and comparisons are done in software, but broken up so the performance impact on each server is minimized.
what i’m saying is, at each client, are processor cycles being used? how fast are those calculations done at the client level? as data is being sent back and forth, i would imagine that both ends of the wire are doing SOME processing. so what amount is it? how fast is it done? is it done entirely in software?
if so, this would seem to lend itself to rumblings i’ve heard, that avamar doesn’t scale. it would make sense if a software-only product hit a wall at some point, no matter how many smaller pieces it’s broken up into, given a large enough amount of data…
also, i’m still hoping someone can clarify for me how the aggregate performance of Avamar’s clients is calculated?
Scott W |
Beth: hashing and comparisons are performed in software, at the client. Just like any other backup application, a client is a server or host of something you want to protect. So yes, client cycles are being used, and more than a “traditional” backup application would use. However, what we find is that the backup completes much faster than a traditional backup, often 90% faster, so we use more cycles for much less time.
As for the “doesn’t scale” comment, I don’t know what to say to that. If you could be more specific, then I could probably give you a good answer. All I can say is that we have Avamar customers protecting thousands of systems with 100s of TB. And I don’t mean that as marketing doublespeak: we have thousands of customers, and individual customers are protecting 1000s of clients and 100s of TB. That strikes me as scaling pretty well!
Storage Dork |
My big question is: what’s EMC’s future with FalconStor? These new EDLs are a clear slap in the face to FalconStor’s SIR technology. It has been rumored that EMC approached FalconStor about a sole-source agreement for its SIR technology and, when FalconStor said “No,” EMC found another partner. Is this the beginning of the end for FalconStor’s EMC business? EMC bought NearTek at the end of 2006, and now this partnership with Quantum. Clearly EMC is not happy with FalconStor at some level. Do you expect EMC to ever create a NearTek-based VTL solution?
Beth Pariseau |
//Beth: hashing and comparisons are performed in software, at the client. Just like any other backup application, a client is a server or host of something you want to protect. So yes client cycles are being used. More than a “traditional” backup application would too. However, what we find is that the backup occurs much faster than a traditional backup, often 90% faster, so we use more cycles for much less time.//
OK. So how fast – at what throughput in MBps / GBps – do each of those clients do the calculations? The 10 TB / hour figure Mellor quoted? Something else? What role does network latency play in the overall performance of the system?
David and draft_, oddly enough, you’re kind of both right. Hifn does the hashing calculations, which are the most processor-intensive in the dedupe process, but it’s still reliant on its OEMs to write software around it for other parts of the dedupe process. So does Hifn ‘do’ dedupe? Yes and no. I’m not sure where draft_ceo’s comment was directed, however, since I was pretty careful to refer to ‘hashing calculations’ when it comes to Hifn’s product.
Storage Dork, we’re all going to have to wait till EMC officially announces Quantum to ask that question, and we may never really get a straight answer. I have heard the same thing – that EMC wanted exclusivity on SIR. Then again, I was also told that SIR didn’t make it through IBM’s qualification process, which led to the Diligent acquisition. Exactly why there have been these issues has been a mystery – first with the delay in qualification of SIR by OEMs, now with the outright lack of qualification by major OEMs – for about a year now.
Scott W |
Beth: Chris Mellor was talking about Avamar nodes. Nodes are component pieces of the server “grid” (hate the term under the circumstances, but it is as good as any). There is no correlation between node/grid speed and client speed.
Specific client speed is, as you can imagine, dependent on the speed and power of the client. The best way I can put it (and most accurate without a specific machine sizing to refer to) is that it is “somewhat” more intensive than a typical backup client. But, as I said, it is also usually 90%+ faster to complete the backup job.
Network latency plays a surprisingly small role. That was one of my first questions too. Having said that, the biggest place you might think latency enters is the checking of hashes between client and server to determine whether the server already has a segment. But Avamar bundles many of these hashes into a single packet to reduce round trips. About the only other specific comment I can make is the anecdotal one that I have run an Avamar client under a wide variety of network conditions (some hotel networks are really bad), and performance and backup time did not seem to degrade much at all irrespective of the network I was on. For a typical client, so little data is transmitted, and backups require so little time to complete, that the whole issue is a non-issue.
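A bit of back-of-the-envelope arithmetic shows why bundling hash lookups hides latency so effectively. All of these numbers are illustrative assumptions, not Avamar measurements:

```python
# Why batching hash lookups hides latency: compare one lookup per round
# trip against bundled lookups. Numbers below are assumed for illustration.
rtt_ms = 80.0          # assumed round-trip time on a bad hotel network
num_hashes = 100_000   # hash lookups needed for one backup job
per_packet = 1_000     # hashes bundled into a single request (assumed)

naive_s = num_hashes * rtt_ms / 1000              # one round trip per hash
batched_s = num_hashes / per_packet * rtt_ms / 1000

print(f"one lookup per round trip: {naive_s:,.0f} s of pure latency")
print(f"bundled lookups:           {batched_s:,.0f} s")
```

With these assumptions the pure-latency cost drops from hours to seconds, which matches the observation that backup time barely degrades on bad networks.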
Finally, note that there are a whole bunch of options on how to deploy the Avamar server: from virtual editions under VMware, to full-blown multi-node grids, to single nodes, all with or without replication as you choose. So there is significant flexibility.
We really are discussing apples and oranges when it comes to this and DD; more on my blog in the next little while… I don’t want to monopolize your space!
Stephan Sebuit |
In a physical server the CPU cost is probably not a big deal given most physical hosts are grossly underutilized in terms of CPU cycles.
However, the same statement cannot be made for a server virtualization environment. So even if backups complete faster, the CPU cost is still “more than a traditional backup application would” use.
What that says to me is that I have to schedule backups in VMware environments in such a way as to incur as little impact as possible on the physical host.
Dealing with 1200 VMs and 95 servers in my environment, this has to be a backup scheduling nightmare, although the dedup aspect is rather intriguing.
Storage Dork |
“first with the delay in qualification of SIR by OEMs, now with the outright lack of qualification by major OEMs”
I agree, and I do hope someone from FalconStor steps in here and explains this for us (or an on-the-ball author gives them a call). But even if they’re late, they do offer something different from the “host-based” and “in-band” solutions out there. There are a lot of good reasons to implement a SIR solution.
Beth Pariseau |
no worries on monopolizing, scott, this is a good discussion and informative for everybody, i think. still, i have to look back on the original posts that sparked this discussion and wonder… we all seem to agree that the products are apples and oranges (and avamar’s performance is dependent on so many variables), so how could either side claim one performs better than the other? now i’m REALLY curious as to where data domain’s multiple came from in their press release…
Jesse |
Performance evaluations are always subjective, especially when they are being done by the company that made the product. Testing is done in a pristine environment without much regard to the real world, yet the results are used to justify the introduction of these products into the real world.
Any product can be set up to perform well in a lab setting that may or may not be indicative of “production” situations.
I always take such performance claims with a grain of salt. I’ve worked in R&D, specifically in qualification and test, and I remember times when the same test run back to back produced slightly skewed results. Not enough to invalidate the test, but definitely enough to make you wonder what else is going on in the background of the software.
I’m not a big fan of compression (or de-duplication, as they’re calling it now). I remember having a sales guy come in to pitch Avamar to me at one of my last jobs, and the fact that they wanted to sell me a product for tens of thousands of dollars to save a few bucks on my tape footprint was just staggering.
De-duplication is essentially the same process used in zip/tar operations. You’re taking repetitive blocks of data within a stream and replacing them with a kind of count/pointer system. All of it takes cycles. If it is done on the host, it takes CPU cycles; if it’s done in-line via an appliance, it increases latency.
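Jesse's count/pointer description can be sketched roughly like this; the block size, hash choice, and function names are arbitrary picks for illustration, not any product's design:

```python
import hashlib

def dedupe(stream: bytes, block: int = 8):
    """Store each repeated block once; the stream becomes a pointer list."""
    store, pointers = {}, []
    for i in range(0, len(stream), block):
        chunk = stream[i:i + block]
        h = hashlib.md5(chunk).hexdigest()
        store.setdefault(h, chunk)   # keep exactly one copy per unique block
        pointers.append(h)           # the stream itself is now just pointers
    return store, pointers

def rehydrate(store, pointers):
    """Reverse the process: follow each pointer back to its stored block."""
    return b"".join(store[h] for h in pointers)

store, ptrs = dedupe(b"ABCDEFGH" * 10)   # 80 bytes of identical 8-byte blocks
restored = rehydrate(store, ptrs)        # round-trips back to the original
```

Eighty bytes of repeated data collapse to one stored block plus ten pointers, and every pointer lookup and hash computation is a cycle spent somewhere, which is Jesse's point.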
Just my $2.12 (adjusted for inflation)
Beth Pariseau |
love the inflation adjustment, Jesse. i think you’re right…and it’s also very dependent on the environment, as we’ve seen in this discussion.
still, i’ve at least gotten DD to cop to a single-stream transfer rate of about 150 MBps in the past. i was hoping to have some kind of genuinely comparative number like that w/r/t avamar. to at least have specific, numerical, competing *claims* would be a start… :)
EMC’ers always act shocked when i bring up that scalability thing about avamar, but they’ve got to know that persistent viewpoint is out there. i’ve tried to dig into it in the past, and haven’t yet personally spoken with any avamar users with 1000s of servers protecting hundreds of TB as was mentioned above. i’ve spoken with one who had an overall environment that could be described that way, but had a more limited amount of avamar in production.
tho really, at this point, all of the above might be moot in the face of the ‘rip and replace’ factor when it comes to EMC’s ability to use / sell Avamar’s software…
Storage Dork |
Isn’t EMC shelving Avamar? I know I’ve heard this somewhere out there. Is there any truth to this?
Storage Dork |
“Isn’t EMC shelving Avamar?”
I’ve done some research. They are keeping Avamar but are refining the market they want to address with it.
Scott W |
Final piece of follow-up here. First, we are absolutely not shelving Avamar, not in any way, shape, or form. I suppose the comment that we are refining the market would be more accurately put as “we refined” the market about 18 months ago. That’s not to say there aren’t overzealous salespeople or partners, or that one of them hasn’t pitched it as a panacea for backup. But it isn’t one. It does have places where it is a better fit than others, and it will be a core part of EMC’s backup strategy. Second, I have started to put some basic information about Avamar on my blog; feel free to read and question. Client and server descriptions are first; use cases come next, with a focus on VMware. That may come after EMC World, but it will come.
Jon Toigo |
Beth, thanks for the reference to my blog questionnaire about de-dupe at DrunkenData. We have about 10 vendor responses now and I was told at Storage Decisions to expect others from IBM/Diligent, FalconStor and Data Domain.
Missing is any response from EMC. I have been told via email by an insider that they probably see no value in posting any responses since “there is no upside in it for them.” Curious that.
Paul Clifford |
Why do we de-dupe? Simply because, intuitively, we know that a lot of the data we manage is multiple copies, and we are already storing enough stuff. It is growing too fast, costing too much, and continuing to spiral out of control. So de-dupe makes sense. But wait a minute: somewhere between 70 and 80% of purchased disk is wasted. Now, I am not saying it isn’t used, but it is wasted, as the traditional approach of “carving” out a LUN 10 times larger than the data we have creates all this white space. We create this canister that is seldom filled up, on very expensive disks.
The problem is traditional storage: building RAID sets and then stacking them in a box. New GUIs and features don’t fix the underlying architectural problem, which is expensive filing cabinets. This worked in an older time when 100 GB of data was a “boatload.” It is not working today.
So think about it: only about 20% of the space on all the disks I buy contains data. Then, out of all the data I have, only about 20% of it is even accessed! So I buy all these 15K FC disks for what amounts to about 4 to 5% of active data. This is ludicrous. But folks, this is the world of traditional storage. This is why we use de-dupe to try and reclaim some of this precious space.
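Paul's back-of-the-envelope math, spelled out; the percentages are his rough estimates, not measured figures:

```python
# Fraction of purchased 15K FC capacity that actually holds active data,
# using Paul's estimates: ~20% of capacity holds data, ~20% of data is accessed.
disk_with_data = 0.20
data_accessed = 0.20

active = disk_with_data * data_accessed
print(f"{active:.0%} of purchased capacity holds active data")
```

That works out to roughly 4%, which is the "ludicrous" figure driving the argument.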
How about changing the architecture? How about designing systems with technology that meets performance and storage requirements? How about designing systems to match IOPS requirements, and store everything else on SATA?
Utilizing innovative technologies like “real” thin provisioning (only truly available from 3PAR and Compellent) is a huge leap forward. Automated ILM then manages the data so the right stuff is in the right place.
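The thin-provisioning idea can be sketched abstractly; the class and method names here are invented for illustration and are not any vendor's API:

```python
# Thin provisioning: the array advertises a large LUN up front, but
# physical blocks are consumed only when they are first written.

class ThinLun:
    def __init__(self, advertised_blocks: int):
        self.advertised_blocks = advertised_blocks
        self.allocated = {}               # block index -> data

    def write(self, block: int, data: bytes):
        if not 0 <= block < self.advertised_blocks:
            raise IndexError("write beyond advertised size")
        self.allocated[block] = data      # physical space consumed on first write

    def physical_used(self) -> int:
        return len(self.allocated)

lun = ThinLun(advertised_blocks=1000)     # "carve out" a 1000-block LUN
lun.write(0, b"boot")
lun.write(42, b"data")
# only 2 of the 1000 advertised blocks occupy physical disk
```

Contrast this with the thick "carving" Paul criticizes, where all 1000 blocks would be reserved on day one regardless of use.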
After we have restructured the architecture, then let’s de-dupe. We can bolt it on after we gain control, rather than leading with it.