Posted by: Peter Tran
analytics, Big Data, Case Studies, Data Federation, Data Integration, Data Virtualization, Products
Data is the lifeblood of analytics — the more diverse the better.
In their best-selling book, Big Data: A Revolution That Will Transform How We Live, Work, and Think, Mayer-Schönberger and Cukier describe the synergy that occurs when previously unrelated and disparate data sets are brought together to uncover hidden insights. But these advanced analytics requirements are a double-edged sword: the very diversity of sources that makes analytics powerful also complicates data integration and constrains progress.
The Analytics Pipeline
The analytics pipeline includes six major stages, typically executed iteratively:
- Find the Data
- Access the Data
- Build a Sandbox for the Data
- Build the Analytic Model
- Analyze the Results
- Develop and Communicate the Business Insight
Most analysts spend more than half their time and effort assembling the data needed for analysis, and the rise of big data and cloud computing has only made this worse. The typical analyst faces numerous data challenges that must be overcome.
Different data shapes
It used to be the case that most data was tabular, and even relational. But that has changed during the last five years with the rise of semi-structured data from web services and other non-relational data streams. Analysts must now work with data in multiple shapes, including tabular, XML, key-value pairs, and semi-structured log data.
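To make the point concrete, here is a minimal sketch of normalizing those four shapes into one tabular view. All snippets, field names, and values are made up for illustration; only Python standard-library parsers are used.

```python
import csv, io, json, re
import xml.etree.ElementTree as ET

# Hypothetical snippets standing in for the four shapes an analyst might see.
tabular = "id,amount\n1,9.99\n2,4.50"
xml_doc = "<orders><order id='3' amount='2.25'/></orders>"
kv_pairs = '{"4": 7.00}'                       # key-value: id -> amount
log_line = "2014-01-01 order id=5 amount=1.10"  # semi-structured log entry

records = []

# Tabular: csv.DictReader yields one dict per row.
records += [{"id": int(r["id"]), "amount": float(r["amount"])}
            for r in csv.DictReader(io.StringIO(tabular))]

# XML: pull the same fields out of element attributes.
records += [{"id": int(o.get("id")), "amount": float(o.get("amount"))}
            for o in ET.fromstring(xml_doc).iter("order")]

# Key-value pairs: a JSON object mapping ids to amounts.
records += [{"id": int(k), "amount": v} for k, v in json.loads(kv_pairs).items()]

# Semi-structured log: extract the fields with a regular expression.
m = re.search(r"id=(\d+) amount=([\d.]+)", log_line)
records.append({"id": int(m.group(1)), "amount": float(m.group(2))})

print(sorted(r["id"] for r in records))  # one uniform view of all four shapes
```

Each shape needs its own parsing step, which is exactly the integration overhead the section describes.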
Multiple interfaces and protocols
Accessing data has gotten more complicated. An analyst used to simply connect to a database via ODBC, or receive a spreadsheet by e-mail from a colleague. Now analysts must access data through a variety of protocols, including web services via SOAP or REST, Hadoop data via Hive, and other NoSQL stores through proprietary APIs.
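The sketch below illustrates how the same fact arrives in different envelopes depending on the protocol. The response bodies are made up; in practice they would come back from a SOAP client, an HTTP library such as requests, or an ODBC/Hive driver. Only standard-library parsers are used here.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical response bodies for the same query over two protocols.
soap_body = """<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
  <Body><GetPriceResponse><price>19.95</price></GetPriceResponse></Body>
</Envelope>"""
rest_body = '{"price": 19.95}'

# SOAP: navigate the namespaced XML envelope down to the payload.
ns = {"s": "http://schemas.xmlsoap.org/soap/envelope/"}
soap_price = float(ET.fromstring(soap_body).find(".//s:price", ns).text)

# REST: the JSON body is the payload.
rest_price = json.loads(rest_body)["price"]

print(soap_price == rest_price)  # same fact, two wire formats
```

The payload is identical, but each protocol imposes its own parsing and navigation logic on the analyst.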
Larger data sets
Data sets have grown larger and larger over the last decade, and it is no longer reasonable to assume that all the data can be assembled in one place, especially if that place is your desktop. The rise of Hadoop has been fueled by the tremendous amounts of data that can be stored easily and cheaply on that platform. Analysts must be able to work with data in place, intelligently subsetting it and combining it with data from other sources.
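One way to "leave data where it is" is to push filters and aggregates down to the source so only the subset travels to the analyst. The sketch below uses an in-memory SQLite database as a stand-in for a large remote store; the table and values are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for a large remote source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("west", 10.0), ("east", 5.0), ("west", 2.5), ("north", 1.0)])

# Anti-pattern: pull every row to the client, then filter locally.
all_rows = conn.execute("SELECT region, amount FROM events").fetchall()

# Better: push the predicate and the aggregate down to the source,
# so only one value crosses the wire instead of the whole table.
subset = conn.execute(
    "SELECT SUM(amount) FROM events WHERE region = ?", ("west",)
).fetchone()[0]

print(len(all_rows), subset)  # rows moved naively vs. one aggregated value
```

At desktop scale the difference is invisible; at Hadoop scale, pushdown is the difference between a feasible query and an impossible one.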
The analytic development process is characterized by exploration and experimentation, and this requires data sets to be iteratively assembled and updated as the exploration proceeds. In other words, data agility is an important part of successful analytics.
What is your Analytics Data Challenge?
Do you agree with these challenges? Are there others to consider as well?
Compare notes with Alpine Data Labs’ Steven Hillion, whose video describes the challenges he sees.