The Wondrous World of Data

Dec 2 2015   6:55PM GMT

Analysts and Data Scientists Need SQL-on-Everything

Rick van der Lans

Tags:
Data scientist
Hadoop

Business analysts and data scientists no longer restrict themselves to internally produced data that comes from IT-managed production systems. For their analysis they use all the data they can lay their hands on, and that includes external data sources. This is especially true for tech-savvy analysts who obtain data from the internet (such as research results), access social media data, analyze open data and public data, copy files with analysis results from their colleagues, and so on. They mix this external data with internal data to get the most complete and accurate business insights.

Unfortunately, not all of this external data has a schema and a simple structure. When it doesn’t, analysts can’t import the data into their favorite analytical tools, so it stays out of their reach. In such situations, analysts must ask IT to develop a program that imports the data into a SQL database. Because IT departments are typically backlogged, developing such a program can take quite some time, which stalls the analysis process considerably (possibly by weeks).

Do SQL-on-Hadoop engines, such as Apache Hive, Apache Phoenix, and Jethro Data, solve this problem? With SQL-on-Hadoop engines, massive amounts of data stored in Hadoop files can be queried fast. This is very useful, because it allows analysts to study big data using their analytical tools. Unfortunately, many SQL-on-Hadoop engines can only access data stored in Hadoop files, and only if that data has a simple, relational, flat structure and a schema definition exists.

In this respect Apache Drill is different. It allows analysts to use their favorite reporting or analytical tools to play with data using SQL, and in addition it offers SQL access to most of the classic and new data sources, including Hadoop, MongoDB, JSON, cloud storage, and so on. These data sources can even be accessed if no schema for the data exists and if the data doesn’t have a simple structure, but is, for example, hierarchical and contains repeating groups. Apache Drill can even access data when each record in the source has a somewhat different data structure.
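To make the idea of schema discovery at read time concrete, here is a minimal Python sketch. It is not how Drill is implemented and it does not use Drill at all; it only illustrates the principle on a few hypothetical JSON records in which each record has a somewhat different structure. The sample data and the function names (flatten, discover_columns, unnest) are assumptions made for this illustration. Nested objects become dotted column names, and a repeating group is turned into one row per element.

```python
import json

# Hypothetical sample data: one JSON document per line, each with a
# somewhat different structure (missing fields, nested objects, repeating groups).
RAW_RECORDS = """
{"id": 1, "name": "Ann", "address": {"city": "Utrecht"}, "orders": [{"sku": "A1"}, {"sku": "B2"}]}
{"id": 2, "name": "Bob", "orders": []}
{"id": 3, "address": {"city": "Leiden", "zip": "2311"}}
"""


def flatten(record, prefix=""):
    """Turn nested objects into dotted column names; lists are left untouched."""
    flat = {}
    for key, value in record.items():
        column = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, column + "."))
        else:
            flat[column] = value
    return flat


def read_rows(text):
    """Parse one JSON document per line and flatten it while reading."""
    return [flatten(json.loads(line)) for line in text.splitlines() if line.strip()]


def discover_columns(rows):
    """Collect the union of all columns in first-seen order.

    No schema is declared up front; the structure is derived from the data.
    """
    columns = []
    for row in rows:
        for column in row:
            if column not in columns:
                columns.append(column)
    return columns


def unnest(rows, column):
    """Emit one output row per element of a repeating group."""
    out = []
    for row in rows:
        for element in row.get(column) or []:
            new_row = {k: v for k, v in row.items() if k != column}
            if isinstance(element, dict):
                new_row.update(flatten(element, column + "."))
            else:
                new_row[column] = element
            out.append(new_row)
    return out


rows = read_rows(RAW_RECORDS)
columns = discover_columns(rows)
order_rows = unnest(rows, "orders")

print(columns)
for row in order_rows:
    print(row)
```

In this sketch the column list is only known after the data has been read, and records that lack a field simply have no value for that column. A real engine has to do this at scale and expose the result through SQL, which is where a tool such as Drill comes in.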

Drill is an example of a SQL-on-Everything solution. Analysts don’t have to ask IT for assistance. Analysts can use Drill against any kind of data source as Drill discovers what the structure of the data is while accessing the data. SQL-on-Hadoop is very useful technology, but what many analysts and data scientists want and need is SQL-on-Everything, because that really enriches their analytical capabilities.
