Posted by: Randy Kerns
There are many opinions about how to handle information storage for big data analytics. By big data analytics, I'm referring to the information associated with an analytics operation that performs its analysis in near real time to present immediately actionable results. The most common approach is to deliver the source data for the real-time analytics process to the compute nodes with minimal latency and at a high data rate.
This requirement has led many data scientists designing analytics systems to insist that data come from storage directly attached to the compute nodes. If solid-state drives (SSDs) provide that storage, all the better. This runs contrary to most IT organizations' strategy of achieving efficient storage utilization through networked storage. Approaches to sourcing the data will continue to evolve with new storage systems and methods, but currently the decisions are driven by the designers of the analytics systems.
A more consequential question is where the data goes after the initial analysis is done. Some say the data has already served its purpose and can be discarded. However, a future analysis of a larger data set with different criteria may prove valuable. The problem is where to store that potentially massive amount of data that might be used again.
The most discussed approach is to archive the data for subsequent usage. The target for the data could be:
• A local storage system as a content repository. Usually this would be a NAS system for the unstructured file content used in data analytics, but it could also be a new generation object storage system capable of handling potentially billions of objects.
• Cloud storage, with the analyzed data stored either as files or objects. Compared with adding infrastructure and archiving storage systems in IT for what may be a highly variable amount of capacity, cloud storage could reduce costs. Those costs depend on how long the data is retained.
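Because the retention cost scales with both capacity and time, a back-of-the-envelope model can make the trade-off concrete before committing to a target. The function and rate below are a minimal sketch with illustrative assumptions, not actual provider pricing:

```python
# Sketch of a retention-cost estimate for archived analytics data.
# The $/TB-month rate is a hypothetical placeholder, not a real price.

def archive_cost(capacity_tb: float, months: int, rate_per_tb_month: float) -> float:
    """Total cost of retaining capacity_tb of data for the given number
    of months at a flat per-TB-per-month storage rate."""
    return capacity_tb * months * rate_per_tb_month

# Example: 500 TB of post-analysis data retained for 3 years
# at an assumed rate of $10 per TB per month.
total = archive_cost(500, 36, 10.0)
print(f"Estimated retention cost: ${total:,.0f}")
```

The same function can be run against different retention periods, or against an amortized per-TB-month figure for on-premises archiving storage, to compare the two options for a given data set.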
Ultimately this could be a massive amount of data. Archiving storage systems are typically self-protecting with remote replication to another archiving system or to cloud storage. The requirement for data protection may be another variable depending on the value of the data.
The "big" in big data analytics can mean big money if the decisions about where to store the information and how long to retain it are not made strategically. The main focus for big data analytics so far has been the speed of the initial analysis. Where to put the data to be retained must be considered as well, and this can be a major concern for IT.
(Randy Kerns is Senior Strategist at Evaluator Group, an IT analyst firm).