Managed service provider RenewData briefed us today on the launch of a data migration service specifically for transferring email archives from one archiving product to another while maintaining a legal chain of custody.
Renew has partnerships with EMC, Symantec and CA (for the former iLumin product) that allow its proprietary data migration software to bypass the archiving application and extract data directly from the archive for quicker transfers. According to James Smith, vice president of enterprise solutions for RenewData, the company had already been offering these migration services to customers on an on-demand basis and has performed dozens of migrations; the formal packaging and marketing of the service is what’s new.
No firms were willing to speak to the press about their use of the service, but the fact that Renew anticipates a market for one is interesting evidence of the influence that e-discovery, and email archiving in particular, have had on the storage industry of late. It’s difficult to tell what a large market for assisted migration between email archiving tools would mean at this point. Would it mean that users are not making the best choice of archiving system the first time? Or would it mean that email archiving systems are not delivering on their promises?
The bottom line is that this service anticipates at least some market because in many ways email archiving, as well as migration between archives, can be a painful and proprietary exercise. According to Smith, the service can be used to create a “baseline” copy of data in an intercustodial, deduplicated format. The service can also export to “standard” formats such as HTML or XML. Most often, however, the service has been used to migrate from one proprietary archive to another, Smith said.
“Very few products out there archive the pure message file,” he said. “They put it in their own format so that it’s more painful to migrate away.”
If you think data protection management (DPM) tools that monitor and report on backup successes and failures are going to disappear with the introduction of virtual tape libraries (VTLs), think again.
It is easy to view DPM tools only in the context of an all-tape environment, since that is where most backup troubles originate and where most of these tools’ value has been derived. However, this can lead one to mistakenly assume that by bringing in a VTL, one can eliminate both the more vexing problems associated with backups and the need for DPM software. Unfortunately, VTLs create their own unique sets of problems, and users still need DPM software to help identify and report on them.
This was made abundantly clear to me in a case study that Agite Software recently shared with me, in which a company had installed and tested Agite’s backupVISUAL DPM software to monitor its backup environment. The company had recently begun to use a Sepaton VTL and wanted to document how much the backup situation had improved since the switch from tape to disk. Much to everyone’s surprise, backups to the Sepaton VTL were still failing 30% of the time.
But this was neither a Sepaton nor a backup software problem, per se. It was an oversight on the part of the administrators. The company had decommissioned the tape drives that the servers were previously using as their backup target, leaving the servers with nowhere to direct their backup jobs, which subsequently failed.
While this is obviously an extreme case (and I am sure one that Agite Software brought to my attention to demonstrate the value of their product), it does illustrate that there is always more to consider when purchasing any new product than just plugging it in and letting it rip.
In today’s environments, where everything is so interconnected and interdependent, no one should believe any vendor’s claim that their product is “Plug’N’Play”. And even if everything appears to work fine on the surface, rest assured that almost any level of examination will reveal some glaring gaps in service and performance.
Things have gotten kicked off in earnest out here in the Windy City at this year’s Storage Decisions conference in Chicago. Today was the first full day of sessions at this year’s edition of the conference, and attendees heard discussions of hot topics from blue-chip companies including United Airlines, Federal Reserve Bank, and Bank of America.
Gary Pilafas, managing director of enterprise architecture for United Airlines (UAL), gave a presentation this morning about his company’s DR plans, much of which centered around classifying data according to criticality, and setting disaster recovery levels appropriately, a common trend in DR of late. Pilafas said he steered application admins away from insisting on Tier 1 DR (after all, no application admin wants to say his data isn’t of top importance) by emphasizing cost.
On this he was challenged by Michael Thomas, storage architect for the Federal Reserve, who said he’d seen that kind of planning go awry in some cases after 9/11 and Hurricane Katrina. “Some business units had [scaled back] DR plans based on cost, but then their SLAs didn’t match their true business requirements,” Thomas said. “They still expected IT to respond, and we did, but not in as timely a manner as they would have liked in some cases.”
Pilafas acknowledged that getting a true sense of business requirements and managing application interdependencies made tiering for DR a tricky project. However, he said UAL is currently testing service-bus software products, including IBM’s Websphere MQ and BEA’s Aqualogic, layered over Hitachi Data Systems’ Universal Storage Platform for a service-oriented architecture. That plan, he said, will decouple data services from individual business units, specific applications or devices, eliminating the issue of application interdependencies. He said it will also go a long way toward addressing the confusion about business units and their priorities. “This way we can discuss each business unit’s priorities, map it back to services, and the higher-priority services float to the top,” he said. “It’s like taking the opposite of the lowest common denominator.”
Thomas himself had a different approach to making DR plans more effective, which is to go back to the drawing board with testing. “One of the big problems in this industry is that a lot of people don’t really test their DR plans,” he said. “They send people out a week in advance and prepare, and then test.” Thomas advocated more spontaneous tests and recounted one test in an earlier position where employees were “toe-tagged” at random to more realistically simulate a disaster scenario.
Meanwhile, if there’s anything that requires as much careful planning and precise procedure as DR, it’s e-discovery. On hand with a keynote speech on that subject was Daniel Blair, who handles e-discovery, investigation and incident support within the information security and business continuity division of Bank of America (say that five times fast).
Among the nuggets offered by Blair was the estimate that for every 1 GB of data produced for e-discovery, 6.25 GB of storage space is needed for multiple working copies, indexing and conversion to TIFF format, as well as the production of copies for opposing counsel. BOA’s approach to cutting storage costs is to put the original “golden” copy of data onto lower-performing, high-capacity SATA disk (backed up vigorously, of course) and use higher-performing FC storage for the processing.
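To put that multiplier in concrete terms, here is a minimal back-of-envelope sketch. The 6.25x factor is Blair's estimate; the 100 GB collection size is purely an assumption for illustration.

```python
# Illustrative math for the e-discovery storage multiplier described above.
# The 6.25x factor comes from Blair's estimate; the collection size below
# is a hypothetical example, not a BOA figure.

def ediscovery_footprint_gb(source_gb, multiplier=6.25):
    """Estimate total working storage needed to process a volume of source data
    (working copies, indexes, TIFF renditions, productions for opposing counsel)."""
    return source_gb * multiplier

# A hypothetical 100 GB email collection:
total = ediscovery_footprint_gb(100)
print(f"100 GB of source data -> {total:.0f} GB of working storage")  # 625 GB
```

The numbers add up quickly, which is exactly why BOA parks the "golden" copy on cheap SATA and reserves FC disk for the processing stage.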
Blair wasn’t able to discuss specifics because of the sensitive nature of corporate litigation, but he did say that so far, he has yet to find a single comprehensive product for e-discovery. He also said that BOA uses a combination of in-house work and outsourcing, specifically with TIFF conversion, to lighten the workload and save financially.
Ultimately, Blair said that while the new federal rules of civil procedure could make e-discovery a more bearable undertaking (since they recognize a “good faith” effort to preserve data), increased attention on e-discovery also means that savvier practitioners will find new ways to key on process vulnerabilities during a lawsuit.
As the pressure grows, Blair said there’s plenty of room for improvement in the technology space. “Real-time indexing, content categorization, records management for the lifecycle, true policy-based management, and better scalability,” he listed off immediately when asked for ideas.
One other item of note: Compellent was the name on everybody’s lips during the expo on the show floor tonight. Users said they had always liked Compellent’s automated tiered storage feature but had been waiting for the emerging company to show more customer traction and product maturity.
So, what are you hearing at the show? Give us your thoughts in the comments section.
Is your data on fire? In this case, I am not talking about how frequently your data is accessed or how valuable the information it contains is. I am talking about data that is literally on fire.
Why do I ask? This week I am attending the PRISM International conference (www.prismintl.org) in Savannah, Ga., and one of the focuses of the conference is the lessons learned from last year’s Iron Mountain fire in London. In the first of two sessions on this topic, attendees were asked how many records management companies have had fires in their facilities. Out of the 200 to 250 attendees in the room, two or three raised their hands. Sure, that’s only about 1% of those in attendance, but from my perspective, that is a lot. And judging from the soberness of those in the room, their sentiments match mine: the entire records management industry, and Iron Mountain in particular, are taking this occurrence very seriously and taking steps to prevent it from happening again.
To its credit, one of the steps Iron Mountain took was an attitude of full disclosure and cooperation with the public fire officials in the U.K. An outside independent consultancy found that Iron Mountain’s fire and security systems were properly maintained but its building services were not. That sounds worse than it is: it means items like pallets or a dumpster holding flammable materials (cardboard, paper, etc.) were too close to the building. In that circumstance, if a fire does get started, even with the other systems in place, it has a fuel source and a steady supply of oxygen, which can overwhelm those systems and lead to a catastrophic loss, as in Iron Mountain’s case.
What is most disconcerting is that in London, 60% of fires are the result of arson, according to Mike Murphy, a director with Osborn Associated, Ltd., the independent fire protection consulting firm in the U.K. that assisted with the Iron Mountain investigation. Unfortunately, the statistics in the U.S. are similar. According to the most recent statistics on the U.S. Fire Administration’s Web site, there were 31,500 intentionally set fires in 2005, which caused 315 deaths and $664 million in structural losses.
So, what does this mean for the rest of us? We should not assume we are immune from something similar happening, either at our records management provider or at our own facility. We need to make sure the grounds around our own company’s facilities are clear of flammable debris of any kind. While such materials obviously cannot catch fire by themselves, with 50% of the fires in the U.S. set by juveniles, why give anyone the temptation? Also, be sure to ask your records management provider to do the same, and maybe even occasionally drive by and check its facility, because its standards for protecting your offsite data should be no less than your own.
Jay Kidd, NetApp’s chief of emerging technologies, dropped some bombs about virtualization in our Q & A with him April 26. The article touched off an immediate response, which has branched out into ongoing discussion about storage virtualization in the blogosphere among both execs and users. Here are the latest responses and counter-responses:
Acopia’s Kirby Wadsworth responds to Kidd’s comments directly with a post simply entitled, “Shame.”
Nigel over at Ruptured Monkey also gives the Q&A a link, saying, “I tend to read through blogs and storage articles while munching on some food. The thing is, this is becoming an increasingly hazardous pastime… these vendor evangelists are casually dropping in comments that are causing me to choke…”
It’s been said in the IT industry that big companies can’t innovate. EMC is at least giving it the old college try with the announcement of the EMC Innovation Network, essentially a framework for feeding research findings through its product development and marketing machine. The key to this plan, according to Jeff Nick, EMC’s senior vice president and Chief Technology Officer, is that EMC will use its global services staff to get new products into users’ storage shops for proof-of-concept faster.
More interesting than the “innovation network”–which so far consists of a lot of people agreeing to a lot of things on paper–is what EMC is promising to deliver in its products, including highly scalable “Web 2.0” storage. According to Dr. Burt Kaliski, formerly chief scientist for RSA Laboratories and now head of the research network reporting to Nick, that scalability will span geographic distance as well as individual systems. Targeting Web 2.0 will also mean developing “semantic Web, search, context, and ontological views” of information through software. EMC is also looking to get into grids, multi-site virtualization and service-oriented architecture. There isn’t much of a time frame for products yet, of course, but EMC officials are promising that at least some fruits of the new R&D network will appear later this year.
OK folks, the data deduplication war has begun. At the center of the war are vendors such as Data Domain, Diligent Technologies, ExaGrid, FalconStor, Quantum and SEPATON. It is only a question of time before EMC, NetApp and Symantec join the fray. To understand what is happening, let’s start with what has happened in the last five years relative to disk-to-disk technologies. The early pioneers included the six vendors mentioned above. Diligent, Quantum and SEPATON brought their VTL products to market approximately three years ago and started marketing the value of secondary disk for backup and restore. By making disk look like tape, they correctly maintained that backup procedures would require no alterations, yet you would see vast improvements in backup and restore speeds and reliability. I think all of them have adequately proved that value. Many of you have told me you have seen speed improvements of 3x in backups, with 30-50% improvements very common.
None of these vendors said a thing about data deduplication at the time they entered the market. Data Domain, on the other hand, took a very different tack. They came to market with a disk-based product targeting the same space but with data deduplication front and center. Their premise from the beginning was that by eliminating duplication of data at a sub-file level, one could keep months of backup data on disk and therefore have fast access not just to what was backed up yesterday but to data that was months, even years, old. When viewed through the data deduplication lens, Data Domain took the lion’s share of the market, with 53% of the storage shipped with data deduplication in 2006, according to our estimates. Along with Avamar (now EMC), another data deduplication-centric backup software vendor, they presented an argument for changing the role of tape to that of very long-term retention.
Liftoff for Data Domain took some time, and they focused initially on the SMB market. This was no surprise to me, because all paradigm-shifting ideas take time to sink in. And frankly, you did the right thing in testing the waters before jumping in head first. But the idea made sense. If you could keep months of backup data on disk at prices that came close to tape, why wouldn’t you? Once the concept was validated and you built trust in the vendor, you started buying hundreds of terabytes of secondary disk.
While Data Domain was pushing data deduplication, it was also inherently pushing disk as a medium for long-term storage of backups. At the same time, others were presenting their VTL solutions and convincing you of the merits of secondary disk, but without any data deduplication. Behind the scenes, though, they all knew they had to add data deduplication as quickly as possible to compete in this nascent but $1B+ market. Each worked on different ways to squeeze redundancy out of backup data.
At the concept level, they all do the same thing. The way full and incremental backups have been done for years builds in a lot of redundancy. Take, for instance, the full backups that you typically do once a week. How much of that data is the same week to week? 90% would not be a bad guess. Why keep copying the same stupid thing again and again? Even with incremental backups, an existing file with even a single byte changed is backed up again. Why? It is best not to get me going on that front. I happen to think the legacy backup vendors did a miserable job there. But we will leave that aside for now. Back to data deduplication. The idea is to break the file into pieces and keep each unique piece only once, replacing redundant uses of it with small pointers that point to the original piece. As long as you keep doing full and incremental backups using legacy products from Symantec, EMC (Legato), IBM Tivoli, HP or CA, you will continue to see vast amounts of redundant data that can be eliminated. The value of eliminating this redundant data has been made abundantly clear in the past year by Data Domain customers.
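The chunk-and-pointer idea above can be sketched in a few lines of code. This is a minimal illustration, not any vendor's actual implementation: real products use variable-size chunking, content awareness and tuned hash choices, while the fixed 4 KB chunks and SHA-256 below are assumptions for clarity.

```python
# Minimal sketch of sub-file deduplication: split data into chunks, hash
# each chunk, keep each unique chunk once, and record pointers (hashes)
# in place of repeats. Chunk size and hash algorithm are illustrative.
import hashlib

CHUNK_SIZE = 4096  # bytes; real systems tune or vary this

def dedupe(data: bytes, store: dict) -> list:
    """Return a 'recipe' of chunk hashes; unique chunks land in `store`."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # keep each unique piece only once
        recipe.append(digest)            # pointer back to the stored piece
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original data from its recipe of pointers."""
    return b"".join(store[d] for d in recipe)

# Two weekly 'full backups' that differ by one chunk dedupe almost entirely:
store = {}
week1 = bytes(1024 * 1024)                        # 1 MB of identical data
week2 = week1[:-CHUNK_SIZE] + b"x" * CHUNK_SIZE   # one chunk changed
r1, r2 = dedupe(week1, store), dedupe(week2, store)
assert restore(r1, store) == week1 and restore(r2, store) == week2
logical = len(week1) + len(week2)
stored = sum(len(c) for c in store.values())
print(f"logical: {logical} bytes, physically stored: {stored} bytes")
```

In this toy example, two 1 MB "fulls" reduce to just two unique chunks on disk, which is the essence of why weekly full backups dedupe so dramatically.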
2006 saw data deduplication offerings from the VTL players: Diligent Technologies, FalconStor and Quantum (via its acquisition of ADIC, which had just acquired Rocksoft, an Australian vendor focused strictly on deduplication technologies). ExaGrid, an SMB player, uses a NAS interface and had deduplication integral to its product. Each does data deduplication differently. Some use hashing algorithms such as MD5, SHA-1 or SHA-2; others use content awareness, versions or “similar” data patterns to identify byte-level differences. Each claims to achieve data reduction ratios of 20:1 and more over time. Each presents its value proposition and achievable ROI based on its internal testing. Some do inline data deduplication; others perform backups without deduplication first and then reduce the data in a separate process after the backup is finished. Each presents its solution as the best. Are you surprised? I am not.
What is clear to me is the following:
1. The value proposition of using disk for backup and restore is clear. No one can argue that anymore. The proof points are abundant and clear.
2. The merits of data deduplication are also abundantly clear.
3. However, the merits of the various data deduplication methods, and the reduction ratios they actually achieve, are generally not yet clear to you.
4. The market for these products is huge. (Taneja Group has projected $1,022M for the capacity-optimized (i.e., with data deduplication) version of VTL and $1,615M for all capacity-optimized disk-based products in 2010.)
5. Both VTL and NAS interfaces will prevail. The battlefront is data deduplication.
6. Vendors will do all they can in 2007 to convince you of their solution’s advantages. Hundreds of millions of dollars are at stake here.
7. By the end of this year we will see the separation between winners and losers. Of course, without deduplication I believe a product is dead in any case.
So, be prepared to see a barrage of data coming your way. I am suggesting to the vendor community that they run their products using a common dataset to identify the differences in approaches. I think you should insist on it. Without that, the onus is on you to convert their internal benchmarks to how it might perform in your environment. You may even need to try the system using your own data. This area of data protection is so important that I think we need some standard approach. We are doing our part in causing this to happen. You should do yours.
I think we have just seen the beginnings of a war between vendors on this issue alone. To make matters even more interesting, we will see EMC apply the data deduplication algorithms from its Avamar acquisition to other data protection products, maybe even the EMC Disk Library product (OEM’d from FalconStor). I expect NetApp to throw a volley out there soon. Symantec has data deduplication technology acquired from DCT a few years ago, but it is currently applied only to its PureDisk product. IBM and Sun, both OEMs of FalconStor, may use Single Instance Repository (SIR) from FalconStor or something else; no one is sure. I certainly am not. But I am certain that none of the major players in the data protection market dare stay out of this area.
Data deduplication is such a game-changing technology that the smart ones know they have to play. What I can say to you is simple: Evaluate data deduplication technologies carefully before you standardize on one. Three years from now, you will be glad you did. Remember that whether you get a 15:1 or a 25:1 reduction ratio in your environment will translate into millions of dollars in disk capacity purchased. I will be writing more about the subtle differences in these technologies. So stay tuned!
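The cost impact of that ratio gap is easy to work out. A minimal sketch follows; the protected data volume and the price per terabyte are illustrative assumptions, not vendor figures, so plug in your own numbers.

```python
# Back-of-envelope math for why 15:1 vs. 25:1 matters. The retained backup
# volume and $/TB below are assumptions for illustration only.

def disk_needed_tb(protected_tb, ratio):
    """Physical disk required to hold `protected_tb` of logical backup data
    at a given deduplication ratio."""
    return protected_tb / ratio

protected = 500.0    # TB of backup data retained on disk (assumption)
cost_per_tb = 5000   # dollars per TB of secondary disk (assumption)

for ratio in (15, 25):
    tb = disk_needed_tb(protected, ratio)
    print(f"{ratio}:1 -> {tb:.1f} TB of physical disk, ~${tb * cost_per_tb:,.0f}")
```

At these assumed numbers, the difference between the two ratios is roughly a third of the physical disk purchase; scale the retained volume into the petabytes and the gap reaches the millions of dollars the column describes.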
Do all companies fit neatly into the emerging corporate data protection architecture? The predominant model into which most enterprise data protection products peg corporate users is backing up data at remote offices and then sending a copy of that backed-up data to a central data center for long-term management. One only has to look at recent vendor announcements to support this belief.
With NetBackup 6.1, Symantec started to provide more integration with its NetBackup PureDisk product by including an export function in NetBackup PureDisk 6.0. This allows NetBackup 6.1 to import and centrally catalog the remote office data backed up by PureDisk. Symantec also announced that it is more closely integrating Veritas Backup Reporter with NetBackup PureDisk, which will allow users to create reports that quantify how much raw data PureDisk is backing up at remote sites and what deduplication ratios users are realizing.
Enterprise VTL vendors are making similar enhancements to their product lines to centralize data backed up at remote offices. SEPATON’s new S2100-DS2 Series 500, targeted at either SMBs or remote offices, offers lower storage capacity and less processing power than the enterprise model but uses the same software features. This allows users to configure it to copy virtual tapes from remote site VTLs back to the central corporate VTL.
But here is the problem with pegging most users into this model: what if a company has no central data center with a staff dedicated to managing all of its remote office backup data? Suppose a company is made up of loosely federated remote offices that do not have, do not want or cannot afford this model. Why standardize on it? Obviously, this does not preclude a company structured this way from fitting into the model, but that is a little like fitting a square peg into a round hole.
That being said, I find Arkeia Software’s forthcoming EdgeFort Backup Appliance intriguing on two fronts. By itself, the EdgeFort Backup Appliance includes backup software (Arkeia Software’s Network Backup), internal disk that functions as a disk target and, on some models, internal tape drives. This gives companies with remote offices a one-stop shop for deploying a backup product into their remote offices with all the functionality they need to get going and stabilize their environment.
But what makes this product worth a deeper look is its policy engine, which allows one site to set up policies that can be applied universally to all sites or to just one specific site. This is especially important for companies with a small IT staff, which may have a couple of people adept at using the product who are not responsible for managing each site’s data. Such companies can create a template of global and/or site-specific policies, apply and update them as needed, and leave the more burdensome, time-consuming daily tasks of monitoring and handling backups to the designated administrator at each local site.
Is Arkeia Software’s new EdgeFort Backup Appliance perfect? Certainly not. It still does not remove the human factor at remote sites from the backup process, and it lacks deduplication, which may force remote offices to consume more disk and to handle and manage tapes more frequently than they would like. But I do not think one should lightly dismiss this product offering either. The fact that Arkeia Software takes loosely structured companies down an evolutionary rather than a revolutionary path means it has more than a fighting chance of winning over some corporate customers, especially those that don’t fit neatly into the general data protection hole into which backup software vendors often have users pegged.
Updated to add: Hitachi Data Systems’ Hu Yoshida also has a response on his blog.