Question: What is the best way to store really large data sets?
We have all grown used to terabytes of data; 3TB hard drives are now commonly available for under $200, and 4TB drives are starting to ship at premium prices. It is not unusual for a company to have at least half a petabyte of data floating around on its storage systems these days, and a petabyte in total if you count all those forgotten databases and buried servers. Working in the storage industry, I cannot tell you how many times clients have underestimated their actual storage data set by 50% or more. SAN/NAS solutions, a well-understood technology that has been around for a while, are robust systems that reliably support storage pools of a petabyte or more. However, as enterprises’ appetite for ever-increasing amounts of data (so-called big data) grows, there is a need for new architectures that take a different approach to managing massive amounts of data (20 petabytes or more) at lower cost. That is where object stores have the advantage over traditional storage approaches: they can store data very efficiently on commodity hardware, scale horizontally to essentially unlimited size, and seamlessly handle any type of data.
As enterprise data sets grow to tens of petabytes, i.e. beyond the scale of even the largest SAN/NAS solutions available today, there are some very attractive cloud systems that address the need for those ever-expanding pools of storage. It is worth taking a minute to understand how cloud storage works for very large amounts of data. First introduced in 1993, object stores take a different approach from traditional file systems, which maintain some type of hierarchical organization using the file and folder analogy. Each file is treated as an object (hence the term object store), and the objects are placed in the store using a distributed database model. Having no central “brain” or master point of control provides greater scalability, redundancy and permanence. An object store is not a file system or a real-time data storage system, but rather a long-term storage system for more permanent, static data that can be retrieved, used, and then updated if necessary. The details vary from one implementation to another, but the ability to find objects from anywhere in the store using a distributed retrieval mechanism is what allows the stores to handle multiple petabytes of data. This makes object stores ideal for write-once, read-many data pools. Primary examples of data that best fit this storage model are virtual machine images, photo storage, email storage and backup archives.
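To make the idea of placing and finding objects without a central “brain” more concrete, here is a minimal sketch of one common technique, a consistent-hash ring. It is an illustration only, not the implementation of any particular product: the class name `HashRing`, the node names, the virtual-node count and the default of three replicas are all assumptions made for this example.

```python
import hashlib
from bisect import bisect_right


class HashRing:
    """Toy consistent-hash ring (illustrative sketch, not a real product's API)."""

    def __init__(self, nodes, vnodes=64):
        nodes = list(nodes)
        self._node_count = len(set(nodes))
        # Each node gets several points ("virtual nodes") on the ring so that
        # objects spread more evenly across the hardware.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in set(nodes)
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        # Any stable hash works; SHA-256 is used here only for convenience.
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def locate(self, object_name, replicas=3):
        """Return the distinct nodes that should hold copies of object_name."""
        wanted = min(replicas, self._node_count)
        index = bisect_right(self._keys, self._hash(object_name))
        found = []
        # Walk clockwise around the ring until enough distinct nodes are found.
        while len(found) < wanted:
            node = self._ring[index % len(self._ring)][1]
            if node not in found:
                found.append(node)
            index += 1
        return found


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.locate("photos/2013/cat.jpg"))
```

Any client that knows the list of nodes computes the same answer from the object name alone, so no master lookup table is needed. Adding a node only moves the objects whose ring positions fall next to the new node's points.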
Moving from a SAN storage solution to a cloud solution makes sense for many use cases involving very large amounts of data. Some of the advantages include:
- Widely deployed, proven technology with hundreds of petabytes of data stored in production today
- Most cost-efficient solution at this scale: substantially lower per-gigabyte, per-month storage costs
- Reduced data center floor space utilization
- Enhanced flexibility to meet fluctuating storage demands
- Potential for delivering faster throughput to applications and a better end-user experience
- Highly scalable object storage
- Capable of creating seamless storage pools across multiple back-end systems
- Ability to scale horizontally instead of vertically
- The horizontal architecture scales well beyond the 20 petabyte maximum that traditional storage architectures allow
- Uses interchangeable commodity hardware
- Simplified operations
The average cost of commercial, fully managed cloud storage is running $0.11-$0.15/GB/month. That might be a bit high for companies that have massive data storage needs, but an organization that has the wherewithal to build it in-house can bring the costs down substantially, easily to under $0.05/GB/month. Remember, for every 10 petabytes of data, every additional $0.01/GB/month of savings represents $1.2M/year. For one such model, check out Amar Kapadia’s blog on cost projections for building an OpenStack Swift store, “Can OpenStack Swift Hit Amazon S3 like Cost Points?”
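The savings arithmetic above can be checked with a few lines of Python. This is a rough sketch that assumes decimal units (1 petabyte = 1,000,000 gigabytes) and a flat price per gigabyte; the function name `annual_savings` is made up for this example.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB
MONTHS_PER_YEAR = 12


def annual_savings(petabytes, savings_per_gb_month):
    """Yearly savings in dollars for a given per-GB, per-month price reduction."""
    gigabytes = petabytes * GB_PER_PB
    return round(gigabytes * savings_per_gb_month * MONTHS_PER_YEAR, 2)


# 10 PB with $0.01/GB/month saved, the case quoted in the text.
print(f"${annual_savings(10, 0.01):,.0f} per year")
```

For 10 petabytes and $0.01/GB/month this comes out at about $1.2M per year, matching the figure quoted. By the same arithmetic, the gap between a managed price of $0.11 and an in-house cost of $0.05 is $0.06/GB/month, or roughly $7.2M per year at that volume.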
In the end, if you have more than 10 petabytes of data, it might be worth checking out cloud object storage to take advantage of its ability to scale cost-effectively and transparently to hundreds of petabytes. With the right data set, a company can achieve significant savings and support planned growth. In addition, object storage offers a more flexible architecture for future growth and improved control over operational and capital costs.
About the Author
Beth Cohen, Cloud Technology Partners, Inc. Transforming Businesses with Cloud Solutions