Posted by: Dave Raffo
cloud, data warehousing, GPFS, hadoop, ibm
IBM won the Supercomputing 2010 HPC Storage Challenge this week with a technology designed to significantly improve the performance of analytics and queries on large data sets, as well as cloud applications.
Developed at IBM Research Almaden, the General Parallel File System-Shared Nothing Cluster (GPFS-SNC) uses the Hadoop Distributed File System (HDFS) to deliver what IBM calls “high availability through advanced clustering, dynamic file system management and data replication,” and the company says it “can even continue to provide data access when the cluster experiences storage or node malfunctions.”
Prasenjit Sarkar, master inventor for storage analytics and resiliency at IBM Research, said the technology uses a distributed architecture in which each node is independent and tasks are divided among computers. No node has to wait for another to complete a task. This removes the bottlenecks associated with SANs, because there is no single point of failure.
“The goal is to store large amounts of data as efficiently as possible,” he said. “This is an architecture for petabytes and even exabytes.”
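The shared-nothing design Sarkar describes can be illustrated with a short sketch. This is a hypothetical toy example, not IBM's implementation: each “node” owns a private data partition, records are routed to nodes by hash, and a query fans out to all nodes in parallel, so no node waits on another and no shared resource becomes a bottleneck.

```python
# Toy sketch of a shared-nothing cluster (illustrative names only; not GPFS-SNC code).
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 4

class Node:
    """A node with its own local storage -- nothing is shared between nodes."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.local_store = {}          # private to this node

    def put(self, key, value):
        self.local_store[key] = value

    def count_matching(self, predicate):
        # Each node scans only its own partition, independently of the others.
        return sum(1 for v in self.local_store.values() if predicate(v))

def owner(key, nodes):
    # Keys are routed by hash, so each record has exactly one owning node.
    return nodes[hash(key) % len(nodes)]

nodes = [Node(i) for i in range(NUM_NODES)]
for i in range(1000):
    owner(f"record-{i}", nodes).put(f"record-{i}", i)

# A query fans out to every node at once; partial results are merged at the end.
with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
    partials = pool.map(lambda n: n.count_matching(lambda v: v % 2 == 0), nodes)
total = sum(partials)
print(total)  # 500 even-valued records across all partitions
```

Because each partition is self-contained, losing or adding a node affects only the keys that hash to it, which is the property that lets this style of cluster keep serving data through node failures.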
He said the architecture includes enterprise features such as client-side caching, disk caching, wide area replication, and archiving.
Sarkar said he couldn’t talk about any product plans or roadmap for GPFS-SNC, but he said possible use cases include analytical queries, large-scale data warehousing products, and cloud computing where storage is accessed in parallel. GPFS is used in IBM’s SONAS scale-out NAS product and its Smart Business Compute Cloud, so the new architecture is likely to show up there. It’s also a candidate for IBM’s recently acquired Netezza data warehousing platform.