Yesterday I met with execs from a company called Gluster, which is developing an open-source, software-only, scale-out NAS system for unstructured data. As we discussed their market, products and competitors, we got into the nitty-gritty of their technical differentiation as well – pasted below is an extended Q&A with CTO and co-founder Anand Babu Periasamy about Gluster’s way of handling metadata, most often a bugaboo when it comes to file system scalability.
Beth Pariseau: So as I’m sure you’re aware, there are many scale-out file system products out there addressing unstructured data growth. What’s Gluster’s differentiation in this increasingly crowded market?
ABP: What we figured out was that centralized and distributed metadata both have their own problems, so you have to get rid of them both. That’s the most important advantage when it comes to the Gluster file system. The reason why we got to a production-ready stage very quickly – we wrote the file system in six months and took it to production, because a customer had already paid for it, and they had a desperate need to scale with seismic data that was very critical, and they could no longer reason with that data because it was all sitting on tapes. I looked around, there was no file system around – the file systems they had used before were for scratch data, they had found a scalability advantage [to scale-out], but the problem was metadata.
The problem with the metadata is if you have centralized metadata it becomes a choke point, and distributed gets extremely complicated, and the problem with both is if your metadata server is lost, your data is gone, it’s finished…[it] became very clear we had to get rid of the metadata server. The moment you separate data and metadata you are introducing cache coherency issues that are incredibly complicated to solve. By eliminating the need to separate data and metadata we made the file system resilient. On the underlying disks, you can format them with any standard file system – we don’t need any of its features. We just want the disk to be accessible by a standard interface, so even tomorrow if you don’t like the Gluster file system or there is a serious crash in your data center, you can just pull the drives out, put them in a different machine and have your data with you – you’re not tied to Gluster at all. Because we didn’t have any metadata, the data can be kept as files and folders, the way users copied it onto the global namespace.
Within the file system the scalability problem became seamless because we didn’t have to put a lock on metadata and slow down the whole thing, we can pretty easily scale because every machine in the system is self-contained and intelligent, equally, as all other machines. So if you want more capacity, more performance, you just throw more machines at it, and the file system pretty much linearly scales, because there’s nothing centrally holding the scalability.
BP: So it’s an aggregation of multiple file systems rather than one coherent file system that has to maintain consistency?
ABP: No. The disk file system is just a matter of formatting the drives. The Gluster file system is a complete storage operating system stack. We did not rely on the underlying operating system at all, because we figured out very quickly [things like] memory manager, volume manager, software RAID, we even already support RDMA over 10 Gigabit Ethernet or InfiniBand, you pretty much have the entire storage operating system stack that’s a lot more scalable than a Unix or Linux kernel. We treat the underlying kernel more like a hypervisor, or a microkernel and don’t rely on any of its features. By pushing everything to user space we were able to very quickly innovate new complicated things that were not possible before and pretty much scale very nicely across multiple machines.
Gluster VP of marketing Jack O’Brien: The three big architectural decisions we made early on…one is that we were in user space rather than kernel space and the second is that rather than having a centralized or distributed metadata store, there’s this concept called elastic caching where essentially you algorithmically determine where the data lies and the metadata is with the data rather than being separated. And the third is open source.
BP: Did you see the EMC announcement about VPlex or are you familiar with YottaYotta and what they did with cache coherency, having a pointer rather than having to make sure all data is replicated across all nodes? Is it similar to that?
ABP: What it sounds like they’re describing is basically asynchronous replication with locking, that’s how they bring you the cache coherency issue. But what I explained was, the file system is completely self-contained and distributed so we don’t have to handle the cache coherency issue. The cache coherency issue comes when you separate the data and metadata so when you’re modifying a file you have to hold the lock until a change appears…because we don’t have to hold metadata separately, we don’t have to hold the lock in the first place because we don’t have the cache coherency issue.
JO: Another way to think of it is, every node in the cluster runs the same algorithm to identify where a file’s located and every file has its own unique filename. The hash translates that into a unique ID—
BP: Oh, so it’s object based.
ABP: It is a hashing algorithm inside the file system, but for the end user it’s still files and folders.
BP: But this is how Panasas is, too, right? Underneath that file system interface to the user they have an object-based system with unique IDs…
ABP: But those IDs are stored in a distributed metadata server. We don’t have to do that.
JO: Our ID is part of the extended attributes of the file itself.
ABP: The back end file name is already unique enough, you don’t really need to store it in a separate index in a separate metadata server, we figured out we can come up with a smarter approach to do this. The reason [competitors] all had complications is because they parallelized it at the block layer, basically they took a disk file system and parallelized it, it’s a very complicated problem…you should parallelize at a much higher layer, at the VFS layer, and have a much simpler, more elegant approach.
JO: So a node doesn’t have to look up something centrally and it doesn’t have to ask anybody else in the cluster. It knows where the file’s located algorithmically.
BP: I think that’s what’s giving me the ice-cream headache here. So each node has a database within it? The thing I’m sticking on is ‘algorithmically knows where to look for it.’
ABP: At the highest level…given a file name and path that’s already unique, if you hash it, it comes out to a number. If you have 10 machines, the number has to be between 1 and 10. No matter how many times from wherever you calculate it, you get the same number. So if the number is, for example, seven, then the file has to be on the seventh node on the seventh particular data tree. The problem in hashing is when you add the 11th and 12th node you have to rehash everything. Hashing is a static logic, as you copy more and more data you can easily get hot spots and you can’t solve that problem. The others parallelize at the block layer and put the blocks across. Because we solve the problem at the file level, if you want to find a file…internally what happens is the operating system sends a lookup call…to verify whether the home directory exists and [the user] has the necessary permissions…and then it sends an open call on the file. Internally what happens is by the time the directory calls come, the call on the directory…has all the information about the file properties. We also send information about a bit map.
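The plain hashing scheme described above, and the rehash problem that comes with it, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Gluster’s code: the function name node_for, the choice of SHA-256, and the sample paths are all hypothetical.

```python
import hashlib

def node_for(path: str, node_count: int) -> int:
    """Map a file path to a node number in 1..node_count (illustrative only)."""
    # SHA-256 is an arbitrary choice here; any stable hash gives the same effect.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count + 1

# Hypothetical file names, used only to show the effect of adding nodes.
paths = [f"/home/user{i}/survey{i}.dat" for i in range(1000)]

before = {p: node_for(p, 10) for p in paths}  # 10 machines
after = {p: node_for(p, 12) for p in paths}   # 11th and 12th node added

moved = sum(1 for p in paths if before[p] != after[p])
print(f"{moved} of {len(paths)} files land on a different node after growing to 12 nodes")
```

Every client computes the same node number for the same path, so no lookup service is needed, but changing the node count changes the answer for most paths, which is the "rehash everything" problem.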
Instead of taking a simple plain hash logic which cannot scale…you don’t have to physically think that you only have 10 machines. You can think logically, mathematically, you can think you have a thousand machines, there is nothing stopping you from doing that, it’s the idea of a virtual storage solution. It’s like with virtual machines, you may have only 10 machines but you think you have a thousand virtual machines, so we mathematically think we have a thousand disks. It can be any bigger number, and the actual number is really big. Then we present each logical disk as a bit, so the entire information is basically just a bit array, and the bit array is stored as a standard attribute on the data tree itself. By the time the OS or application tries to open the file, a stat call comes and the stat call already has this bitmap, and the hash logic will index into a virtual disk which really doesn’t exist, it could be some 33,000th cluster disk. And whichever directory wants that bit, you know that the file is in that machine, and don’t need to ask the metadata server, ‘tell me where my block is, hold the lock on the metadata because I need to change this bit.’
BP: But then if two people want to write to the same file at the same time…
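One way to picture the logical-disk idea is the Python sketch below. It is a loose illustration under assumptions, not Gluster’s implementation: the count of 1,024 virtual disks, the function names (virtual_disk, assign, locate, add_node), the round-robin assignment and the rebalancing rule are all hypothetical, and the bitmaps live in a dictionary here rather than in attributes on the directory tree.

```python
import hashlib

VIRTUAL_DISKS = 1024  # hypothetical number of logical disks, far more than real nodes

def virtual_disk(path: str) -> int:
    """Hash a path to a logical disk index that never changes as nodes are added."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % VIRTUAL_DISKS

def assign(node_names):
    """Give each node a bitmap (a Python int) with one bit per logical disk it owns."""
    bitmaps = {name: 0 for name in node_names}
    for disk in range(VIRTUAL_DISKS):
        owner = node_names[disk % len(node_names)]
        bitmaps[owner] |= 1 << disk
    return bitmaps

def locate(path: str, bitmaps) -> str:
    """Return the node whose bitmap has the bit for this path's logical disk."""
    bit = 1 << virtual_disk(path)
    for name, bitmap in bitmaps.items():
        if bitmap & bit:
            return name
    raise LookupError(f"no node owns the logical disk for {path}")

def add_node(bitmaps, new_name):
    """Return new bitmaps in which the new node takes a fair share of bits from others."""
    grown = dict(bitmaps)
    grown[new_name] = 0
    for _ in range(VIRTUAL_DISKS // len(grown)):
        # Take one bit from whichever existing node currently owns the most.
        donor = max((n for n in grown if n != new_name),
                    key=lambda n: bin(grown[n]).count("1"))
        lowest = grown[donor] & -grown[donor]  # isolate the lowest set bit
        grown[donor] ^= lowest
        grown[new_name] |= lowest
    return grown

nodes = [f"node{i}" for i in range(1, 11)]
before = assign(nodes)
after = add_node(before, "node11")

paths = [f"/home/user{i}/survey{i}.dat" for i in range(1000)]  # hypothetical files
moved = sum(1 for p in paths if locate(p, before) != locate(p, after))
print(f"{moved} of {len(paths)} files move when an 11th node is added")
```

Because the hash always targets the fixed set of logical disks, adding a machine only reassigns some bits, so only the files behind those bits change location instead of nearly everything being rehashed.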
ABP: We have a distributed locking mechanism. Because the knowledge of files is there across the stack, we only had to write a locking module that knows how to handle one file.
Hitachi Data Systems’ latest earnings results show a modest year-over-year increase as the recession fades. They also show an interesting shift in HDS sales towards services and software.
Remember when HDS was known as a high-end storage array vendor with little software or services? That’s no longer the case. HDS’ $882 million in revenue last quarter increased 6% over the previous year, despite a “single digit” decline in revenue from its USP enterprise storage platform. The USP platform still makes up most of HDS revenue, but services accounted for 30% and software 15% of its revenue last quarter.
HDS VP of corporate marketing Asim Zaheer says services and software both increased in double digits over last year, as did the HDS Adaptable Modular Storage platform. File and content (archiving) storage grew 200%, thanks to a new midrange NAS system that HDS gets from its OEM relationship with BlueArc.
USP sales may be impacted by customers waiting for a widely anticipated product refresh, although HDS execs won’t confirm any upgrades are coming. They say the change in product mix reflects a shift in buying patterns away from traditional high-end enterprise arrays towards modular SAN and file-based storage, as well as tiering enabled by virtualization.
Claus Mikkelsen, CTO of storage architectures for HDS, says customers are combining the USP virtualization capability with lower-cost disk and using features such as dynamic provisioning to save money through tiering and prolong the life of their storage arrays.
“We view the USP as a virtualization engine, and not a storage array per se,” he said. “There is clearly a blurring of lines in terms of tiering storage. We’re starting to see this new tier one-and-a-half that seems to be emerging, bringing high-end feature sets to other use cases that traditionally have not been considered high end.”
Zaheer says the lower-priced NAS system based on BlueArc’s Mercury platform “revitalized our NAS portfolio” by making it more attractive to mainstream shops. The HDS execs say their midrange AMS storage platform grew in sales each quarter since it was introduced in late 2008. Mikkelsen says a lot of that storage is being used behind the USP virtualization controller.
“We used to talk about high-end customers and midrange customers, but I think that was the wrong way of looking at it,” he said. “It’s more a case of customers that have different needs. Now we have more native software support in the midrange with features such as replication, copy on write, and dynamic provisioning.”
Mikkelsen also said customers are looking at storage costs differently now. “It’s no longer about dollars per gigabyte, that went out about 20 years ago,” he said. “Now you factor in storage, maintenance, power and cooling, and the burden rate for employees.”
Mikkelsen says Oracle’s decision to end the Sun OEM deal for the USP platform won’t hurt sales.
“If customers used to buying Hitachi storage from Sun can’t do that anymore, they’ll buy it from HDS,” he said.
The Justice Dept. today said EMC paid $87.5 million to settle a lawsuit that charged the vendor with false pricing claims and taking part in a kickback scheme with consulting firms who do business with government agencies.
The Justice Dept. claims EMC committed fraud by inducing the General Services Administration (GSA) to enter a contract with prices that were higher than they should have been. The GSA purchases products for the federal government. The Justice Dept. said EMC claimed during contract negotiations that for each government order under the contract, the vendor would conduct a price comparison to ensure that the government received the lowest price provided to any of its commercial customers – claims EMC could not live up to because it could not make such price comparisons.
Under the kickback scheme detailed in the Justice Dept. press release, EMC paid consulting companies fees whenever the consultants recommended that a government agency buy EMC products. EMC is not alone here – the DOJ said it has settled with three other technology companies and other investigations are pending. It did not name the other vendors.
“Misrepresentations during contract negotiations and the payment of kickbacks or illegal inducements undermine the integrity of the government procurement process,” Tony West, assistant Attorney General for the Civil Division of the Department of Justice, said in the Justice Dept. release. “The Justice Department is acting to ensure that government purchasers of commercial products can be assured that they are getting the prices they are entitled to.”
EMC denied any wrongdoing when the charges were first made public in March of 2009, and an EMC spokesman today emailed a statement to StorageSoup saying the vendor “has always denied these allegations and will continue to deny any liability arising from the allegations made in this case. We’re pleased that the expense, distraction and uncertainty of continued litigation are behind us.”
The EMC spokesman said some of the charges are almost 10 years old.
Saying it’s looking to appeal to larger shops with its online data backup service, Iron Mountain Digital released version 7.0 of its LiveVault SaaS product today with new support for multithreaded applications and larger data sets.
Previously, LiveVault’s “sweet spot” was protecting servers up to 1 TB, according to Jackie Su, senior product marketing manager for Iron Mountain Digital. The new version will protect up to 7 TB thanks to beefier processors and memory in the LiveVault TurboRestore on-site appliance, and the Data Shuttle option becoming a built-in feature. Previously, if users wanted to transport large data sets on portable hard drives, it was done only on request in special circumstances. The new TurboRestore appliance can now hold up to 24 TB of disk, and has a 64-bit memory cache.
Iron Mountain claims it’s seen growing adoption of cloud data protection among midsized enterprises in its LiveVault customer base, and cites this shift as the reason for the scalability updates in this release. The company did not provide a specific number of midsized customers, the percentage growth in those customers compared with last year, or average deal size, though chief marketing officer TM Ravi said deal sizes are growing, which “indicates we’re covering larger and larger environments.”
Online data backup so far has been among the most popular uses of cloud data storage, particularly among enterprise users, but according to Storage Magazine’s most recent storage purchasing survey, “it’s still more hype than happening”.
Hewlett-Packard Co. added another scale-out NAS system to its portfolio yesterday when it announced DataDirect Networks (DDN)’s S2A9900 disk array will be bundled with the Lustre File System resold by the Scalable Computing and Infrastructure (SCI) group within HP.
HP began collecting scale-out file systems when it acquired PolyServe in 2007, then saw some false starts with its ExDS9100 product for Web 2.0 and HPC use cases. HP continued its track record of acquiring its partners in the space with the acquisition of Ibrix last July. Yet HP still found a gap in its scale-out file system portfolio for DataDirect and Lustre with this agreement, according to Ed Turkel, manager of business development for SCI.
“Basically, both the X9000 [based on Ibrix] and [the new offering with] DDN are scale-out file systems sold as an appliance model,” Turkel said. But Lustre is geared more toward “the unique demands of HPC users” in which multiple servers in a cluster simultaneously read and write to a single file, requiring very high single-file bandwidth. “The X9000 is more general purpose, with scalable aggregate bandwidth” rather than high single-file performance.
DDN’s VP of marketing Jeff Denworth said the two vendors have “a handful” of joint customers already, but Denworth and Turkel both dismissed the idea that DDN could be HP’s next scale-out acquisition. “If I respond to that question in any fashion, I’m probably going to get my hand slapped, but it’s certainly not the purpose of this announcement,” Turkel said. However, this product will replace a previous offering HP launched in 2006, also based on Lustre, called the Scalable File Share (SFS).
DDN is now partnered for storage with every large HPC OEM vendor there is — previously it has announced reseller and OEM relationships with IBM, Dell and SGI. “This sounds similar to the arrangement that DDN has with IBM, Dell and SGI to provide a turnkey solution to certain niche customers, more likely aligned with the HP server group than the storage group,” wrote StorageIO founder and analyst Greg Schulz in an email to Storage Soup.
Amazon Web Services today added a new offering for its Simple Storage Service (S3) called Reduced Redundancy Storage (RRS). RRS offers users the ability to choose fewer “hops” of object replication among Amazon’s facilities for a lower cost per gigabyte. With RRS, objects would survive one complete data center failure, but wouldn’t be replicated enough times to survive two concurrent data center failures. It’s like RAID 6 vs. RAID 5 storage tiering writ large.
Some users like the CDN capabilities Amazon offers with S3, and Amazon officials say those capabilities will still be offered with RRS, claiming no difference in performance between RRS and S3. However, the cloud data storage vendors that have introduced gateway and caching devices for S3 will have to update their support to offer users the option of RRS on the back end. I’m sure we can anticipate a flurry of announcements from companies such as Nasuni, StorSimple and TwinStrata in the coming months (ETA: at least where Nasuni is concerned, I stand corrected…).
Ten cents per GB is already raising eyebrows, but that’s actually just the starting price for RRS. According to an emailed statement from Amazon S3 general manager Alyssa Henry, “Base pricing for Reduced Redundancy Storage covers the first 50 TB of RRS storage in a month. This tier is charged at a price of $0.10 per GB per month. As customers increase their storage, the price declines to as low as $0.037 per GB per month for customers with more than 5 petabytes of RRS storage.”
Henry was mum on whether Amazon has any more gradations of storage tiering up its sleeve, saying, “The RRS offering was the result of feedback from customers who, for their particular use cases, did not require the level of durability that Amazon S3 provides today. We’ll continue to listen to feedback from our customers on what’s important to them in terms of future functionality but have no other announcements today.”
S3 customers we’ve gotten in touch with so far seem intrigued by the new offering. Stay tuned for a followup in the coming days about reaction to this announcement in the market.
Coraid today added a ZFS-based NAS to its platform of Ethernet SANs.
Coraid’s base product is a non-iSCSI IP SAN called EtherDrive based on ATA over Ethernet (AoE), but the vendor has been looking to expand its product line since closing a $10 million funding round and hiring Kevin Brown as CEO in January.
The new EtherDrive Z-Series NAS includes two models. The Z2000 has four cores, 32 GB of RAM and either eight Gigabit Ethernet or four 10-Gigabit Ethernet ports. The Z3000 has eight cores, 48 GB of RAM, level 2 SSD cache, and either eight GigE or four 10-GigE ports.
Coraid relies on ZFS for features such as inline deduplication, replication, unlimited snapshots and automatic tiering.
The Z-Series replaces Coraid’s Linux-based CLN NAS platform. “ZFS is a better fit for our EtherDrive systems,” said Carl Wright, Coraid’s VP of sales and product management. “We’ve had a lot of requests from our customers for open-source ZFS systems.”
Wright described the Z-Series as a scale-out architecture because “as customers need capacity, they add EtherDrive data blocks on back.” The EtherDrive SAN and NAS systems can be managed from the same interface, he said.
Wright says the Z-Series uses the same Intel X25-E SSDs as in the EtherDrive SRX SAN platform it launched in March, but the SSDs serve as cache only for the NAS appliance (read cache is standard, and write cache is optional).
Compellent last month launched a ZFS-based NAS option for its Storage Center SAN system. Wright says the big difference between the Coraid and Compellent NAS offerings is price. He says Coraid’s Z-Series is priced at about $1,000 per TB while Compellent’s starting price is $84,000 for 8.7 TB for new customers and $36,000 for its current SAN customers.
Double-Take Software’s executives and directors have agreed to be acquired by Vision Solutions in a deal valued at $242 million, pending approval of the company’s shareholders. Vision Solutions is a portfolio company of Thoma Bravo, LLC, a private equity firm.
Last month Double-Take disclosed it had received indications of acquisition interest, but did not name any suitors. After issuing a press release saying “The Double-Take board of directors unanimously approved the agreement and has recommended the approval of the transaction to Double-Take’s stockholders,” Double-Take officials declined further comment today.
Vision Solutions specializes in disaster recovery and HA software for IBM System i and AIX servers. Vision Solutions previously had an OEM agreement to rebrand Double-Take’s software when it needed to support x86 Windows or Linux servers. That relationship was subsequently changed to a reseller agreement that remains ongoing.
Industry observers see this deal as an exit strategy for Double-Take’s board, after the first calendar quarter of 2010 finished “weaker than expected” for Double-Take. Said one source familiar with the company and the deal, speaking on condition of anonymity: “Double-Take’s magic sauce was that it could make your Exchange on a server in, say, Chicago run just like it would in Boston. But VMware came along and said, ‘Stick it in a VM, it’s the same thing, and you don’t have to install agents or worry about third-party software.’
“The folks at Double-Take saw it coming, but they couldn’t jump out of the way of that train fast enough.”
It wasn’t just VMware eating away at Double-Take’s market; other backup and DR tools also began to crop up and undercut Double-Take’s offerings on price, our source said.
With the deal not expected to close until after July, it’s officially “business as usual” for Double-Take customers. After the deal closes, Vision Solutions is expected to continue supporting Double-Take’s existing customers. “You have $90 million cash on hand and $40 million a year in software maintenance business — Thoma Bravo has every reason to keep existing customers happy,” said the insider.
Should’ve checked the phone again before hitting ‘publish’ on my EMC World Reporter’s Notebook — there were a few more shots from the show last week I’d overlooked —