The NetApp Open Solution for Hadoop Rack includes NetApp FAS and E-Series storage along with Hewlett-Packard servers and Cisco switches. The base configuration consists of four Hadoop servers, two FAS2040 storage modules, three E2660 NetApp storage modules for 360TB of storage, 12 compute servers and two Ethernet switches. The system scales with data expansion racks made up of four NetApp E2660 modules, 16 compute servers and two Cisco switches.
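The capacity figures above imply some simple scaling arithmetic. The sketch below is a back-of-the-envelope illustration, not a NetApp sizing tool. It assumes each E2660 module contributes 120TB (inferred from 360TB across three modules) and that expansion-rack modules have the same capacity as those in the base rack. The function and constant names are hypothetical.

```python
# Hypothetical helper for illustration only. Per-module capacity is inferred
# from the article's figure of 360TB across three E2660 modules, not taken
# from a NetApp specification.
TB_PER_E2660 = 360 / 3  # 120 TB per module (assumption)

BASE_E2660_MODULES = 3       # E2660 modules in the base rack
EXPANSION_E2660_MODULES = 4  # E2660 modules in each data expansion rack


def e_series_capacity_tb(expansion_racks: int = 0) -> float:
    """Return E-Series capacity in TB for a base rack plus expansion racks."""
    if expansion_racks < 0:
        raise ValueError("expansion_racks must be non-negative")
    modules = BASE_E2660_MODULES + expansion_racks * EXPANSION_E2660_MODULES
    return modules * TB_PER_E2660


if __name__ == "__main__":
    for racks in range(3):
        print(f"{racks} expansion rack(s): {e_series_capacity_tb(racks):.0f} TB")
```

Under these assumptions, each expansion rack adds 480TB to the 360TB in the base configuration.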
The FAS2040, which provides NFS, is used for the Hadoop NameNode, and the E2660 with Hadoop Distributed File System (HDFS) is used for the DataNodes. The goal is to enable enterprises to move Apache Hadoop quickly from the test lab into production.
“We’ve taken the approach that there is an issue with the NameNode in Hadoop,” said Bill Peterson, who heads solutions marketing for NetApp’s Hadoop and “Big Data” systems. “If that crashes, you lose the entire Hadoop cluster. The community is fixing that so it will no longer be a single point of failure. We decided we would put a FAS box inside the solution, so we could do a snapshot of the NameNode. We use E-Series boxes for MapReduce jobs. So the database of record is on FAS and fast queries are on the E-Series.”
The NetApp Open Solution for Hadoop Rack became available this week.
NetApp also signed on to develop and pre-test Hadoop systems that use the new Hortonworks Data Platform (HDP), which became generally available Wednesday. Joint NetApp-Hortonworks solutions are expected later this year. NetApp also has partnerships with Apache and Cloudera, and will support all three versions of Hadoop on its Open Solution for Hadoop Rack.
“That’s why NetApp has open in the name. We want as many partnerships there as possible,” Peterson said.
For greater detail on using Hadoop with enterprise storage, I recommend the excellent series from John Webster of Evaluator Group on SearchStorage.com, beginning here.
Version 2 is the first major upgrade for Red Hat since it acquired startup Gluster last year. Current versions of Red Hat Storage on the market are re-branded versions of the GlusterFS product with tweaks to better support the Red Hat Enterprise Linux (RHEL) operating system.
Red Hat Storage Software 2.0 makes it easier to manage unstructured data across CIFS, NFS and GlusterFS mount points. The unified file and object feature allows users to save data as a file and retrieve it as an object, or save data as an object and retrieve it as a file.
“A typical use case would be a customer can choose to save something as an object or file. So you can upload a photo as a file but in the portal software it is converted into an object,” said Sarangan Rangachari, general manager for storage at Red Hat.
The 2.0 version supports Hadoop MapReduce, which is a programming model and software framework for writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes. “What we provide in this release is the underlying file system in MapReduce-based applications that use the Hadoop Distributed File System (HDFS),” Rangachari said.
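For readers new to the MapReduce model, here is a minimal single-process sketch in Python. It is not the Hadoop API and does not touch HDFS or GlusterFS. It only illustrates the three steps the framework performs: map each record to key-value pairs, shuffle the pairs so that values are grouped by key, and reduce each group to a result. All function names are hypothetical, and word counting is just the conventional teaching example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence inside one process."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

On a real cluster the map and reduce steps run in parallel on many nodes, and the shuffle moves intermediate data between them, so the underlying file system matters for throughput.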
The Red Hat Storage Software provides a global namespace capability that aggregates disk and memory resources into a unified storage volume. The software runs on commodity servers and uses a combination of open source Gluster software, which Red Hat acquired in October 2011, and Red Hat Enterprise Linux 6. In February, Red Hat also introduced the Red Hat Virtual Storage Appliance, which delivers scale-out NAS as a virtual appliance. This allows customers to deploy virtual storage servers the same way virtual machines are deployed in the cloud.
The Red Hat appliance can aggregate both Elastic Block Storage (EBS) and Elastic Compute Cloud (EC2) instances in Amazon Web Services environments.
Don Angspatt, VP of product management for Symantec’s storage and availability management group, said the vendor has a prototype of the application working and he expects the product to ship in 2012. He won’t provide many details yet, but said the concept is similar to what EMC is doing with its integration of Isilon scale-out NAS and Greenplum analytics file system. The difference is that the Symantec app will work across heterogeneous storage.
Last month, EMC gave its Isilon OneFS operating system native support for the Hadoop Distributed File System (HDFS) and released the EMC Greenplum HD on Isilon.
The idea is to remove limitations such as a single point of failure and lack of shared storage capabilities that prevent Hadoop from working well in enterprises.
Angspatt said the Hadoop product will be sold separately from Symantec’s Storage Foundation storage management suite, although it will work in a storage environment. He said the application will compete with Cloudera – which has a partnership with NetApp – and MapR Technologies software that EMC uses as part of its Greenplum HD.
“We want to make sure Hadoop and MapReduce work well in an enterprise environment,” Angspatt said. “We’re not going to do business intelligence. We’ll be involved in the infrastructure behind it to make sure it’s enterprise-ready. Our application will talk directly to Hadoop, similar to the EMC Greenplum-Isilon integration. But with us you don’t get locked into a specific hardware.”
Goldick says solid-state drives (SSDs) can help run analytics for Hadoop and NoSQL databases better in storage racks than in shared-nothing server configurations.
“We’re focused on the analytics end of Big Data – getting Hadoop and NoSQL into reliable infrastructures while getting them to scale out horizontally,” he said. “Scale-out NAS is a different part of the market.”
Today, Violin said its 3000 Series flash Memory Arrays have been certified to work with IBM’s SAN Volume Controller (SVC) storage virtualization arrays. Goldick pointed to this combination as one way that Violin technology can help optimize Big Data analytics. The vendors say SVC’s FlashCopy, Easy Tier, live migration and replication data management capabilities work with Violin arrays.
Goldick said running Violin’s SSDs with storage systems speeds the Hadoop “shuffle phase” and provides more IOPS without having to add spindles. SVC brings the management features that Violin’s array lacks.
“Hadoop is well-optimized for SATA drives, but there’s always a phase when it’s doing random I/O called the ‘shuffle phase,’ and you’re stalled waiting for disks to catch up,” said Goldick, who came to Violin from LSI to set the startup’s data management strategy. “We’re looking at a hybrid storage model for Big Data. You’ve heard of top-of-the-rack switches, we look at Violin as the middle-of-the-rack array. It gives you fault tolerance and the high performance you need to make Big Data applications run at real-time speeds.”
He said Hadoop holds data in transient data stores and persistent data stores. It’s the persistent data – which is becoming more prevalent in Hadoop architectures – where flash can help. “So you think of Hadoop not just as analytics but as a storage platform,” he said. “That’s where IBM SVC bridges a gap for us. When data is transient you don’t need data management services as much. When you start keeping the data there, it becomes a persistent data store of petabytes of information. You need data management features that enterprise users have come to expect – things like snapshotting, metro-clustering, fault tolerance over distance.”
Violin’s 3000 series is also certified on EMC’s Vplex federated storage system. EMC is talking about Big Data more than any other storage vendor, with its Isilon clustered NAS as well as its Greenplum analytics systems. EMC president Pat Gelsinger last week said Big Data technologies will be the focus of EMC’s acquisitions over the coming months.
If Goldick is correct, we’ll be hearing a lot more about Big Data analytics in storage.
“Last year Big Data was about getting it to work,” he said. “This year it’s about optimizing performance for a rack. People don’t want to run thousands of servers if they can get the efficiency from a rack.”
There are other ways of using SSDs to speed analytics – inside arrays, or as PCIe cards in storage systems or servers. Violin’s Big Data success will be determined by its performance against a crowded field of competitors.
Developed at IBM Research Almaden, the General Parallel File System-Shared Nothing Cluster (GPFS-SNC) uses the Hadoop Distributed File System (HDFS) for what IBM calls “high availability through advanced clustering, dynamic file system management and data replication, and can even continue to provide data access when the cluster experiences storage or node malfunctions.”
Prasenjit Sarkar, master inventor for storage analytics and resiliency at IBM Research, said the technology uses a distributed architecture where each node is independent and tasks are divided between computers. No node has to wait for another to perform a task. This removes bottlenecks associated with SANs because there is no single point of failure.
“The goal is to store large amounts of data as efficiently as possible,” he said. “This is an architecture for petabytes and even exabytes.”
He said the architecture includes enterprise features such as client-side caching, disk caching, wide area replication, and archiving.
Sarkar said he couldn’t talk about any product plans or roadmap for GPFS-SNC, but he said possible use cases include analytical queries, large-scale data warehousing products and cloud computing where storage is accessed in parallel. GPFS is used in IBM’s SONAS scale-out NAS product and its Smart Business Compute Cloud, so the new architecture is likely to show up there. It’s also a candidate for IBM’s recently acquired Netezza data warehousing platform.