OK folks, the data deduplication war has begun. At the center of the war are vendors such as Data Domain, Diligent Technologies, ExaGrid, FalconStor, Quantum and SEPATON. It is only a question of time before EMC, NetApp and Symantec join the fray. To understand what is happening, let’s start with what has happened in the last five years relative to disk-to-disk technologies. The early pioneers included the six vendors mentioned above. Diligent, Quantum and SEPATON brought their VTL products to market approximately three years ago and started marketing the value of secondary disk for backup and restore. By making disk look like tape, they correctly maintained that backup procedures would require no alterations and yet you would see vast improvements in backup and restore speeds and reliability. I think all of them have adequately proved that value. Many of you have told me you have seen speed improvements of 3x in backups, with 30-50% improvements very common.
None of these vendors said a thing about data deduplication at the time they entered the market. Data Domain, on the other hand, took a very different tack. They came to market with a disk-based product targeting the same space but focused on data deduplication, front and center. Their premise from the beginning was that by eliminating duplication of data at a sub-file level one could keep months of backup data on disk and therefore have fast access not just to what was backed up yesterday but to data that was months, even years old. When viewed through the data deduplication lens, Data Domain took the lion’s share of the market, with 53% of the storage with data deduplication in 2006, according to our estimates. Along with Avamar (now part of EMC), another deduplication-centric backup software vendor, they presented an argument for changing the role of tape to that of very long term retention.
The liftoff for Data Domain took some time, and they focused initially on the SMB market. This was no surprise to me because all paradigm-shifting ideas take time to sink in. And frankly, you did the right thing in testing the waters before jumping in head first. But the idea made sense: if you could keep months of backup data on disk at prices that came close to tape, why wouldn’t you? Once the concept was validated and you built trust in the vendor, you started buying hundreds of terabytes of secondary disk.
While Data Domain was pushing data deduplication, they were also inherently pushing disk as a medium for long term storage of backups. At the same time, others were presenting their VTL solutions and convincing you of the merits of secondary disk, but without any data deduplication. Behind the scenes, though, they all knew they had to add data deduplication as quickly as possible to compete in this nascent but $1B+ market. Each worked on different ways to squeeze redundancy out of backup data.
At the concept level, they all do the same thing. The way full and incremental backups have been done for years, there is a lot of redundancy built in. Take, for instance, the full backups that you typically do once a week. How much of that data is the same week to week? 90% would not be a bad guess. Why keep copying the same stupid thing again and again? Even with incremental backups, an existing file that has even a single byte changed is backed up again in its entirety. Why? It is best not to get me going on that front. I happen to think the legacy backup vendors did a miserable job there. But we will leave that aside for now. Back to data deduplication. The idea is to break the file into pieces and keep each unique piece only once, replacing redundant uses of it with small pointers that point to the original piece. As long as you keep doing full and incremental backups using legacy products from Symantec, EMC (Legato), IBM Tivoli, HP or CA, you will continue to see vast amounts of redundant data that can be eliminated. The value of eliminating this redundant data has been made abundantly clear in the past year by Data Domain customers.
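The chunk-and-pointer idea can be sketched in a few lines of Python. This is a toy illustration, not any vendor's algorithm: it uses fixed-size chunks and SHA-1 fingerprints, while real products differ on how they pick chunk boundaries and identify matches.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed-size chunks; real products vary


def deduplicate(data: bytes):
    """Split data into chunks; store each unique chunk once and
    represent the stream as an ordered list of pointers (hashes)."""
    store = {}      # hash -> chunk bytes (the single stored copy)
    pointers = []   # ordered references that reconstruct the stream
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha1(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk   # first time we've seen this piece
        pointers.append(digest)     # redundant uses become pointers
    return store, pointers


def reconstruct(store, pointers):
    return b"".join(store[p] for p in pointers)


# Two "weekly fulls" that are nearly identical collapse to one set of
# unique chunks plus the changed pieces.
week1 = bytes(1024 * 1024)                        # 1 MB baseline full
week2 = week1[:-CHUNK_SIZE] + b"x" * CHUNK_SIZE   # one chunk changed
store, ptrs = deduplicate(week1 + week2)
assert reconstruct(store, ptrs) == week1 + week2
```

Two mostly identical fulls dedupe down to little more than one copy plus the changed pieces, which is exactly where the vendors' large reduction-ratio claims come from.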
2006 saw data deduplication offerings from the VTL players: Diligent Technologies, FalconStor and Quantum (via its acquisition of ADIC, which had just acquired Rocksoft, an Australian vendor focused strictly on deduplication technologies). ExaGrid, an SMB player, uses a NAS interface and had deduplication integral to its product. Each does data deduplication differently. Some use hashing algorithms such as MD5, SHA-1 or SHA-2. Others use content awareness, versioning or “similar” data patterns to identify byte-level differences. Each claims to achieve data reduction ratios of 20:1 and more over time. Each presents its value proposition and achievable ROI based on its internal testing. Some do inline data deduplication; others perform backups without deduplication first and then reduce the data in a separate process after the backup is finished. Each presents its solution as the best. Are you surprised? I am not.
What is clear to me is the following:
1. The value proposition of using disk for backup and restore is clear. No one can argue that anymore. The proof points are abundant and clear.
2. The merits of data deduplication are also abundantly clear.
3. However, the merits of various methods of data deduplication and the resultant reduction ratios achieved are not clear to you today (in general).
4. The market for these is huge. (Taneja Group has projected $1,022M for the capacity-optimized (i.e. with data deduplication) version of VTL and $1,615M for all capacity-optimized versions of disk-based products in 2010.)
5. Both VTL and NAS interface will prevail. The battlefront is on data deduplication.
6. Vendors will do all they can in 2007 to convince you of their solution’s advantages. Hundreds of millions of dollars are at stake here.
7. By the end of this year we will see the separation between winners and losers. Of course, without de-duping I believe a product is dead in any case.
So, be prepared to see a barrage of data coming your way. I am suggesting to the vendor community that they run their products against a common dataset to identify the differences in approaches. I think you should insist on it. Without that, the onus is on you to translate their internal benchmarks into how a product might perform in your environment. You may even need to try the system using your own data. This area of data protection is so important that I think we need some standard approach. We are doing our part to make this happen. You should do yours.
I think we have just seen the beginnings of a war between vendors on this issue alone. To make matters even more interesting, we will see EMC apply the data deduplication algorithms from their Avamar acquisition to other data protection products, maybe even the EMC Disk Library product (OEM’d from FalconStor). I expect NetApp to throw a volley out there soon. Symantec has data deduplication technology acquired from DCT a few years ago, but currently applies it only to their PureDisk product. IBM and Sun, both OEMs of FalconStor, may use Single Instance Repository (SIR) from FalconStor or something else; no one is sure. I certainly am not. But I am certain that none of the major players in the data protection market dare stay out of this area.
Data deduplication is such a game changing technology that the smart ones know they have to play. What I can say to you is simple: Evaluate data deduplication technologies carefully before you standardize on one. Three years from now, you will be glad you did. Remember that whether your environment gets a 15:1 or a 25:1 reduction ratio will translate into millions of dollars of disk capacity purchased. I will be writing more about the subtle differences in these technologies. So stay tuned!
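The back-of-the-envelope arithmetic is worth doing yourself. The figures below are purely hypothetical placeholders (neither the retained capacity nor the dollars-per-TB comes from any vendor), but they show how the gap between ratios turns into real money:

```python
# Hypothetical figures for illustration only.
raw_backup_tb = 2_000    # assumed raw backup data retained on disk (2 PB)
cost_per_tb = 20_000     # assumed fully loaded cost per usable TB, dollars

cost_at = {}
for ratio in (15, 25):
    disk_needed_tb = raw_backup_tb / ratio
    cost_at[ratio] = disk_needed_tb * cost_per_tb
    print(f"{ratio}:1 -> buy {disk_needed_tb:.1f} TB of disk, "
          f"${cost_at[ratio]:,.0f}")

print(f"difference: ${cost_at[15] - cost_at[25]:,.0f}")
```

Under these assumptions, the spread between 15:1 and 25:1 is over a million dollars of disk. Plug in your own capacity and pricing before you believe anyone's ROI slide.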
Do all companies fit neatly into the emerging corporate data protection architecture? The predominant model into which most enterprise data protection products peg corporate users is backing up data at remote offices and then sending a copy of that backed-up data to a central data center for long term management. One only has to look at recent vendor announcements to see support for this belief.
With NetBackup 6.1, Symantec started to provide more integration between it and their NetBackup PureDisk product by including an export function in NetBackup PureDisk 6.0. This allows NetBackup 6.1 to import and centrally catalog the remote office data backed up by the PureDisk product. Symantec also announced that they are more closely integrating Veritas Backup Reporter with NetBackup PureDisk. This will allow users to create reports that quantify how much raw data NetBackup PureDisk is backing up at remote sites and what type of deduplication ratios users are realizing using NetBackup PureDisk.
Enterprise VTL vendors are also making similar enhancements to their product lines to centralize data backed up at remote offices. SEPATON’s new S2100-DS2 Series 500, targeted at either SMBs or remote offices, offers lower storage capacity and less processing power than the enterprise model but uses the same software features. This allows users to configure it to copy virtual tapes from remote-site VTLs back to the central corporate VTL.
But here is the problem with pegging most users into this model: what if a company has no central data center with a staff dedicated to managing all of its remote office backup data? Suppose a company is made up of loosely federated remote offices that do not have, do not want or cannot afford this model. Why standardize on it? Obviously, this does not preclude a company structured this way from fitting into the model, but that is a little like fitting a square peg into a round hole.
That being said, I find Arkeia Software’s forthcoming EdgeFort Backup Appliance intriguing on two fronts. By itself, the EdgeFort Backup Appliance includes backup software (Arkeia Software’s Network Backup), internal disk that functions as a disk target and, on some of its models, internal tape drives. This gives companies with remote offices a one-stop shop for deploying a backup product into their remote offices with all of the functionality they need to get going and stabilize their environment.
But what makes this product worth a deeper look is that it also includes a policy engine that allows one site to set up policies that can be applied universally to all sites or to just one specific site. This is especially important for companies with a small IT staff that may have a couple of people adept at using the product but not responsible for managing each site’s data. It allows companies to create a template of global and/or site-specific policies, apply and update them as needed, and leave the more burdensome and time-consuming daily administration tasks of monitoring and handling backups to the designated administrator at each local site.
Is Arkeia Software’s new EdgeFort Backup Appliance perfect? Certainly not. It still does not remove the human factor at remote sites from the backup process, and it lacks deduplication, which may force remote offices to consume more disk and to handle and manage tapes more frequently than they would like. But I do not think one should lightly dismiss this product offering either. The fact that Arkeia Software takes loosely structured companies down an evolutionary rather than a revolutionary path means it has more than a fighting chance of winning over some corporate customers, especially those that don’t fit neatly into the general data protection hole into which backup software vendors often have users pegged.
Updated to add: Hitachi Data Systems’ Hu Yoshida also has a response on his blog.
1) HSM, ILM, ITIL still lack the appropriate policy management software to truly allow storage to move between tiers. Sad, really – processing power is at an all-time high price/performance ratio, yet we don’t have software that can effectively leverage this power to even “brute force” analyze data and provide automated, policy-driven data migration.
2) The midrange storage space is about to explode with new products. All of the SAS products coming to market are going to put increased pressure on FC margins and revenue. Who knows? In three years, we may see some really nifty products in attractive, inexpensive packages. I’m waiting for storage vendors to start offering enclosures with hundreds of 2.5” drives. The thought of a thousand 2.5” drives per rack solves lots of problems for me.
3) Things are tough all over. No, really – every facet of IT spending seems to be falling under scrutiny. Many of my peers, who once bought IBM, EMC and HP are talking with – and buying from – Hitachi, NEC, Pillar and a host of others, just because these players are offering more (more professional services, migrations, price breaks, etc).
So far, our posts on reactions to the newly proposed FCoE standard have drawn quite a discussion. Adding to that discussion last week were two stories by Senior News Writer Nicole Lewis on our sister site, SearchStorageChannel.com, which detailed the response to the proposal by iSCSI vendors and resellers. The first story, iSCSI vendors: Fibre Channel over Ethernet (FCoE) proposal is purely defensive, from April 17, also highlights that the ANSI T11 standards body has heard arguments for a slightly different “Fibre Channel over Convergence Enhanced Ethernet” (FCoCEE) standard.
The second, posted this week, is a two-part point-counterpoint Q & A with Doug Ingraham, Brocade’s senior director of product management, and John Joseph, EqualLogic’s vice president of marketing. Ingraham argues FC and iSCSI are still going after different markets; Joseph comes out swinging with the statement that “this announcement means Fibre Channel vendors are planning to get rid of the FC wire, but are keeping the protocol, which is hard for customers to implement and manage, but preserves vendors’ high-margin equipment, professional services and peripheral sales.”
As expected, Overland Storage has been forced to make some drastic cuts in the wake of losing two major OEM customers: HP and Dell. The company announced it has laid off 54 employees, or 14% of its workforce, in the second round of job cuts. Here’s a story that includes a response from the CEO on how the company hopes to pick up the pieces.
Replication tips from JPMorgan-Chase session on DR
JPMorgan-Chase vice president and senior architect Dmitri Ryutov (who pointed out his initials are DR) said one of his business units recently decided on asynchronous replication with guaranteed write order as a compromise between the performance hit associated with synchronous replication, the cost associated with multi-hop and the potential for error when asynchronously replicating highly active databases. However, Young, who is using EMC Corp.’s SRDF-A between two sites 160 miles apart, warned that regular point-in-time copies are necessary at the secondary DR site, because a lost WAN link will destroy the write order—something other users in the session said they were glad to learn.
Another user in Ryutov’s session said an under-addressed issue when it comes to wide-area disaster recovery is compliance with international regulations. The user cited Chinese restrictions on Web content, which a traditional open WAN link for replication may violate, and said he hasn’t yet found a tool that adequately addresses this issue.
Finally, another attendee at the session pointed out that users frequently forget to factor telco costs into their disaster recovery plans. JPMorgan-Chase’s replication plan, a seven-figure deal for hardware and software according to Ryutov, probably also includes as much as $50,000 a month in telco costs for the four 1 Gbps WAN links between the financial giant’s two sites, the attendee estimated.
UK email archiver zips into US market
British email archiving company C2C, which was working the press briefing circuit at this year’s show in the hopes of drumming up interest in the US market, incorporates an interesting data-reduction feature into its archiving software—the software zips and unzips every attachment in the archive automatically as it stores messages, and can also be used to zip attachments in existing archives.
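The data-reduction mechanism here is plain transparent compression. A toy sketch of the idea, using Python's zlib as a stand-in for C2C's actual zip handling:

```python
import zlib

# A hypothetical, highly compressible email attachment.
attachment = b"quarterly report boilerplate " * 1000

stored = zlib.compress(attachment)      # "zipped" on the way into the archive
restored = zlib.decompress(stored)      # "unzipped" transparently on retrieval

assert restored == attachment           # lossless round trip
print(f"{len(attachment)} bytes -> {len(stored)} bytes stored")
```

No deduplication across messages, just per-attachment compression, so the savings depend entirely on how compressible the attachments are.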
Data integrity initiative
Oracle, Emulex, LSI Logic and Seagate were briefing journalists in their hotel suite about the new Data Integrity Initiative, which will see each of the vendors add the T10 Data Integrity Field (DIF) standard—a checksum scheme that appends an 8-byte integrity field to each block of data—into their silicon. The companies will be demonstrating and promoting an end-to-end DIF package consisting of each of their products, but say there are no plans as yet for a bundled product offering. Each said that T10 DIF will become a permanent part of its product line. Products are expected next year.
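For the curious, the guard portion of that 8-byte field is a CRC-16 computed with the T10 polynomial 0x8BB7. A minimal bitwise sketch, assuming a 512-byte block and the standard zero initial value (real implementations run this in silicon, table-driven):

```python
def crc16_t10_dif(data: bytes) -> int:
    """Bitwise CRC-16 over the data block using the T10 DIF
    polynomial 0x8BB7 (init 0, no reflection, no final XOR)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc


block = bytes(512)                 # one 512-byte data block
guard = crc16_t10_dif(block)
# The DIF appended to each block carries this 2-byte guard tag,
# alongside a 2-byte application tag and a 4-byte reference tag.
```

The point of carrying the checksum end to end, rather than per hop, is that a block corrupted anywhere between the application and the platter fails verification at the far end.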
The big picture
An emerging trend in the storage industry—if this year’s show is any indication—is a growing shift from current, technology-only discussions to matters of global and long-term impact. The discussion on energy efficiency that began Monday carried into Tuesday’s keynotes, in a talk given by IBM VP of system storage Barry Rudolph. “Energy savings has gone from a recommendation to a mandate,” he said.
Cora Carmody, Chief Information Officer for Science Applications International Corporation (SAIC), also gave a speech that tied the Book of Revelation together with the future of technology. Carmody predicted that “personally addressable humans” would be connected to “presence awareness” systems—an example is a kitchen that starts cooking dinner when your car is a certain distance from home. Most interestingly, though Carmody dismissed the concept as superstition, she acknowledged that bionics—computerizing the human body—are believed by some to be the “Mark of the Beast,” a permanent bodily device required to do business in a global economy brought about by the Antichrist and a precursor to the apocalypse detailed in the Biblical Book of Revelation.
A little bit off the beaten path, maybe, but more evidence that the storage industry is starting to look outward and forward.
Another SNW trend: 51 percent of respondents in the main ballroom Tuesday morning voted in a poll that server and storage virtualization was the fastest-growing priority in their IT shop; implementing an information-based management strategy came in second at 28 percent, and service-oriented architectures scored third with 20 percent.
As I continue to delve more deeply into next generation data protection technologies, I continue to talk to users about their experiences. Of these technologies, there are always some that users find more relevant than others and with no technology does that seem more true than with backup deduplication.
Granted, the users I interview for the different columns and articles I write are often supplied by the vendors, so they are certainly not going to be examples of failed installs. Users who do agree to interviews also tend to put their best foot forward, since no one wants to go on the record sounding like they made a bad decision. But having worked as an end-user, I can usually tell pretty quickly by the tone and inflection of a user’s voice how much of their experience is genuine and how much is contrived.
And what I am hearing – and maybe more importantly, sensing – from those employing deduplication is that it is working as well as vendors advertise – at least in SMB and remote office environments – and in some cases maybe better. Too often I find vendors exaggerating the benefits of specific technologies, but in talking to users employing deduplication I don’t sense that is happening here.
When talking to users about their deduplication deployments using either backup software or VTL products, they seem genuinely content. While admittedly everyone has had some issues, none appear beyond the scope associated with deploying any new product, and they certainly pale in comparison to the ones users encountered on a daily and weekly basis with their previous tape-based approaches.
Most users simply sound relieved that they have had success dealing with their daily backups and can now finally begin to turn their focus to more important strategic initiatives, like performing tests to ensure they can recover their data, and offsite disaster recovery. And while the user experiences and emotions I am discussing here certainly shouldn’t translate into anyone going out and buying a backup deduplication product, I think they certainly merit taking a closer look at this class of technology.
It’s a dirty job working in this kind of environment,
but somebody’s gotta do it…
Propellerheads and communication
Among the sessions at a new professional development track being tried out at SNW this year was a talk entitled “Interpersonal Communication Skills for Propellerheads” led by Deborah Johnson, CEO of Infinity I/O. As far as we could tell said propellerheads, some 30 in all, didn’t object to the moniker.
Johnson addressed attendees about other humans using their native language, i.e. technical jargon, telling her audience that interpersonal communication requires the befuddled propellerhead to “assess the ‘map’ of the person you’re talking to—” similarly to how they’d map a drive, we surmise. “Understand your goal and the audience context to select the right channel for your communication,” Johnson told the group, approximately one-third of whom were thumbing away busily on Blackberries or typing on laptops.
“Sometimes it takes more than one communication event to get your message across,” Johnson continued, further encouraging attendees to “ask questions to ensure you are decoding messages correctly [from others].”
So was it useful? “I do need to work on my communications skills,” said one self-professed propellerhead, adding that he has recently begun to cut his emails down from several pages to a strict one-page limit. Oy vey!
Deep dive on dedupe
An early session on deduplication turned into a standing-room-only event, with Curtis Preston, VP of data protection services at GlassHouse, a.k.a. Mr. Backup, holding court on the topic. He went through the different products, in-line versus post-process, and the different schemes for identifying redundant data. But two points really stood out. First, data deduplication products are currently only appropriate for small to medium-sized environments, he said. “Do not take a 100 TB Oracle database and throw it at a data dedupe.” Second, ignore all the claims about deduplication ratios. “Your data will dedupe very differently to the guy next to you.”
Talking to a couple of users after the session, one thought dedupe could go the way of CDP. “It’s the big topic this year but we’ll see if it’s still around next year,” he said. Another user, with 800 TB to deal with, said the economics were too good for this technology to be a flash in the pan, if, and that was a big if in his mind, the products are robust and scalable enough. He noted that FalconStor’s SIR (single instance repository) doesn’t ship in volume for another couple of months, so it’s still very early days for this technology.
Users feeling out file virtualization
Comcast Media Center manager of server and storage operations Paul Yarborough gave a talk Monday afternoon on his company’s decision to virtualize NetApp 3020 filers with Acopia Networks’ ARX switch. Another presenter on file virtualization, Stephen Warner of Quest Diagnostics, has also deployed Acopia, to virtualize EMC Celerra boxes. Some tidbits that arose out of the presentations:
Yarborough’s company was so strapped for space on the NetApp filers, due to the 16 TB filesystem restriction, that they were regularly spending dozens of man-hours reingesting digital content that had been deleted from overutilized disk.
Meanwhile, Yarborough said he evaluated NetApp’s OnTap GX as a means to solve the filesystem limit, but he remained pointedly noncommittal on his findings, saying only that it was a very new product when he evaluated it.
Warner, who heads up an EMC shop, said he believes that truly vendor-agnostic virtualization will not come from a large vendor. In Acopia, he said, his company found a startup it could influence (of course, it helps to have just under a petabyte of data under management if you’re looking to influence other companies).
Other users’ questions during Yarborough’s session were as interesting as the presentation itself. During the Q&A users peppered Yarborough with questions about performance impact, how much training it had required to get his staff up to speed on the Acopia product, and whether or not Acopia was truly effective in virtualizing Windows and Linux systems equally. Yarborough answered that there had been no performance impact that he’d seen, that training on the Acopia switch had taken a little longer since it operates on a switch and his admins are not used to managing switches, and that yes, Acopia is effective in virtualizing heterogeneous OSes.
Another user questioned the fact that Comcast had installed an Acopia agent on its domain controller. “That would never fly in our environment,” the commenter said.
First Intel. Now the White House. We’d love to know what, if any, email archiving products these guys are using. And we don’t know what’s worse–if they have implemented archiving procedures, or if they haven’t.