Nowadays, the term unstructured data pops up everywhere. It owes its popularity largely to the success of big data, to technologies such as NoSQL and Hadoop, and to formats such as JSON and XML. Unfortunately, several different definitions of unstructured data exist. These competing definitions confuse many people and blur many discussions on unstructured data. The reason so many definitions exist is that the term unstructured data is a misnomer, and maybe we should ban it from our discussions.
For some, unstructured data is textual data; for others, it's data that doesn't fit rigid relational data structures; and there are those who say that unstructured data refers to tables or files in which each record can have a different structure. Webopedia, for example, defines unstructured data as follows: "Unstructured data usually refers to information that doesn't reside in a traditional row-column database." Under that definition, data stored in XML and JSON documents, CSV files, and Excel files is all unstructured. Definitions can also be very vague. Take for example the definition used on Dummies.com: "Unstructured data is data that does not follow a specified format for big data."
One reason why so many different definitions exist is that the adjective "unstructured" in combination with the word data makes no sense: if we take the meaning of the word unstructured literally, then unstructured data doesn't exist. According to the Merriam-Webster online dictionary, the adjective unstructured means lacking structure or organization; not formally organized in a set or conventional pattern; and not having a system or hierarchy. Many other dictionaries use comparable definitions. The Free Dictionary adds that in psychology the word unstructured refers to something that has no intrinsic or objective meaning. And Microsoft Word proposes the words formless and shapeless as synonyms for unstructured. A development approach can be unstructured, and art can be unstructured.
So, taken literally, unstructured data is data without a shape or form, not formally organized, and without a system. Why would we want to store that type of data? If it really had all those characteristics, storing it would be useless: it would only fill up the disks, and we would not be able to process it in any way. No organization would store that type of data. Conclusion: if we take the term unstructured data literally, no one would store unstructured data and, therefore, it would not exist.
In fact, most data that is currently qualified as unstructured is quite structured. For example, XML and JSON documents are highly structured. The same applies to text. A linguist would never agree with calling text unstructured data, because text has structure. If it didn't, we would not be able to understand what is written and said. Additionally, no audio-to-text transcription software would exist, but it does.
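The claim that JSON documents are highly structured is easy to demonstrate: their structure is explicit enough that a simple schema can be derived from a document mechanically. A minimal sketch in Python (the document and its field values are invented for illustration):

```python
import json

# A JSON document of the kind often labeled "unstructured".
doc = '{"artist": "Shiloh", "year": 1970, "tracks": ["Jennifer", "Simple Little Down Home Rock And Roll Song"]}'

record = json.loads(doc)

# Derive a simple schema: field name -> JSON value type.
schema = {name: type(value).__name__ for name, value in record.items()}
print(schema)  # {'artist': 'str', 'year': 'int', 'tracks': 'list'}
```

If the data truly had no structure, no such schema could be derived at all.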
Calling audio and video unstructured makes no sense either. For example, if you open up an MP3 file, you will see that it contains an indication of the MP3 version used. It contains tags, such as Artist, Composer, Title, and Track number. Agreed, those tags are not always stored at the same spot in the file: sometimes they're placed at the beginning, sometimes at the end, and sometimes somewhere in the middle, but every program can read and understand them. MP3 files and all the other audio and video files are highly structured. Otherwise, no tools would be able to recognize and play them.
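As an illustration, the older ID3v1 layout stores those tags in a fixed 128-byte block at the end of the file (newer ID3v2 tags sit at the beginning instead). A minimal sketch of reading that block; the tag values below are invented:

```python
def parse_id3v1(data: bytes) -> dict:
    """Read the ID3v1 tag: a fixed 128-byte block at the end of an MP3 file."""
    tag = data[-128:]
    if tag[:3] != b"TAG":
        return {}  # no ID3v1 tag present

    def text(chunk: bytes) -> str:
        # Fields are null-padded, fixed-width byte strings.
        return chunk.split(b"\x00")[0].decode("latin-1").strip()

    return {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
    }

# Build a fake MP3 file in memory to demonstrate (values are invented).
fake = b"\x00" * 100  # stand-in for the audio frames
fake += (b"TAG" + b"Jennifer".ljust(30, b"\x00") + b"Shiloh".ljust(30, b"\x00")
         + b"Shiloh".ljust(30, b"\x00") + b"1970" + b"\x00" * 30 + b"\xff")
print(parse_id3v1(fake)["artist"])  # Shiloh
```

The point is precisely that the layout is fixed and documented; a player can rely on it, which is the opposite of formless.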
So, the term unstructured data is a major misnomer. Confucius once said, “The beginning of wisdom is to call things by their proper name.” So, let’s follow his advice, let’s call things by their proper names. Call it data with a fixed or variable data structure, data with repetitive and hierarchical data structures, call it textual data or audio data. But stop calling it unstructured data. Let’s ban this term from now on, it’s a misnomer.
P.S.: And if we stop using the term unstructured data, we can also stop using the term structured data, because it then becomes a pleonasm. It's like the terms wet rain and burning fire. And now that we are on this topic, what does semi-structured data mean? Is that data that is 50% structured? If so, then semi-structured data equals semi-unstructured data. Not useful either.
Numerous SQL-on-Hadoop engines are available for accessing data stored in HDFS using the familiar SQL language. They all look promising and they all support a rich SQL dialect, but which one is the fastest? Performance is important, especially when business users interactively use BI tools to access big data via these SQL-on-Hadoop engines.
So, which one is the fastest for an interactive, ad-hoc, and OLAP-like workload? Until now, there wasn’t much information available on this topic. That is, until AtScale published benchmark results across three SQL-on-Hadoop engines: Apache Hive, Cloudera Impala, and Spark SQL. Of course, we have the TPC-H and the TPC-DS benchmarks, but these two don’t represent interactive, ad-hoc, OLAP-like workloads.
AtScale, based in San Mateo, CA, is a software vendor that offers a fast MDX and SQL interface on big data stored in Hadoop. To access the data, AtScale leverages SQL-on-Hadoop engines.
They developed a benchmark that represents an interactive, ad-hoc, OLAP-like query workload. The benchmark is defined on the publicly available Star Schema Benchmark data set. AtScale complemented this data set with a set of typical OLAP queries. These queries can be classified in three groups: quick metric queries (compute a particular metric value for a period of time), product insight queries (compute metrics aggregated against a set of product- and date-based dimensions), and customer insight queries (compute metrics aggregated against a set of product-, customer-, and date-based dimensions). Together, these queries represent the types of queries appearing in real-life BI environments in which users work with tools such as Business Objects, Tableau, Excel, and QlikView.
The performance results that have come out of this benchmark are intriguing, although they may not be what some people expect. One clear result is that no single SQL-on-Hadoop engine is the fastest for all of the queries. For some queries Apache Hive is the fastest, and for others it's Spark SQL or Cloudera Impala.
Life would be easy if one of the engines were always the fastest, because then an organization that wants the fastest engine could simply pick that one. This benchmark clearly shows that this is not the case. That by itself is quite interesting, because some specialists have a favorite SQL-on-Hadoop engine and really believe their favorite is always the fastest. This benchmark does not confirm that.
It is important to understand that these three engines can access the same HDFS files and the same table descriptions documented in HCatalog. This means that solutions like AtScale, and others that generate SQL code for SQL-on-Hadoop engines, such as some ETL tools, should support all three SQL-on-Hadoop engines for accessing data in HDFS files. They must be smart enough to know which engine is the best to use for a particular SQL query. In fact, all data virtualization tools and BI-on-Hadoop tools that generate SQL code for SQL-on-Hadoop engines have to be aware of the strengths and weaknesses of these engines.
I am interested to see how this is going to evolve in the coming years. We have to thank AtScale for running this benchmark; it has given us some more information on the performance aspects of SQL-on-Hadoop engines. I strongly recommend reading the benchmark results. One thing we definitely learned from this benchmark is that we can't answer the question (yet) which SQL-on-Hadoop engine is the fastest.
Business analysts and data scientists no longer restrict themselves to internally produced data that comes from IT-managed production systems. For their analysis they use all the data they can lay their hands on and that includes external data sources. This is especially true for tech-savvy analysts who obtain data from the internet (such as research results), access social media data, analyze open data and public data, copy files with analysis results from their colleagues, and so on. They mix this external data with internal data to get the most complete and accurate business insights.
Unfortunately, not all of this external data has a schema and a simple structure. In that case, analysts can't import the data into their favorite analytical tools, so it is out of their reach. In such situations, analysts must ask IT to assist them with importing the data into some SQL database. Developing such a program can take IT quite some time, as they are typically backlogged, which stalls the analysis process considerably (possibly by weeks).
Do SQL-on-Hadoop engines, such as Apache Hive, Apache Phoenix, and Jethro Data, solve this problem? With SQL-on-Hadoop engines, massive amounts of data stored in Hadoop files can be queried fast. This is very useful, because it allows analysts to study big data using their analytical tools. Unfortunately, many SQL-on-Hadoop engines can only access data stored in Hadoop files, and only if that data has a simple, relational, flat structure and a schema definition exists.
In this respect Apache Drill is different. It allows analysts to use their favorite reporting or analytical tools to play with data using SQL, and in addition it offers SQL access to most of the classic and new data sources, including Hadoop, MongoDB, JSON, cloud storage, and so on. These data sources can even be accessed if no schema for the data exists and if the data doesn’t have a simple structure, but is, for example, hierarchical and contains repeating groups. Apache Drill can even access data when each record in the source has a somewhat different data structure.
Drill is an example of a SQL-on-Everything solution. Analysts don’t have to ask IT for assistance. Analysts can use Drill against any kind of data source as Drill discovers what the structure of the data is while accessing the data. SQL-on-Hadoop is very useful technology, but what many analysts and data scientists want and need is SQL-on-Everything, because that really enriches their analytical capabilities.
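This schema-on-read behavior can be illustrated with a toy sketch in Python (the records are invented, and this illustrates the idea only, not how Drill is actually implemented): discover the column set while scanning records whose structure varies per record, and project every record onto it.

```python
# Records with per-record structure, as an engine might read from a JSON source.
records = [
    {"id": 1, "name": "Anna"},
    {"id": 2, "name": "Ben", "tags": ["vip"]},    # extra repeating group
    {"id": 3, "address": {"city": "San Mateo"}},  # nested structure
]

# Discover the schema while reading: the union of all fields seen so far.
columns: list[str] = []
for record in records:
    for field in record:
        if field not in columns:
            columns.append(field)

# Project every record onto the discovered columns, NULL for missing fields.
rows = [tuple(r.get(c) for c in columns) for r in records]
print(columns)  # ['id', 'name', 'tags', 'address']
print(rows[2])  # (3, None, None, {'city': 'San Mateo'})
```

No schema was supplied up front; the structure was derived while accessing the data, which is what makes this style of engine usable without asking IT for a schema first.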
Not so long ago I attended a session in which the speaker was very clear on what big data is and what it is not. In his opinion, big data is unstructured data and unstructured data is big data. By unstructured data he meant textual data, such as emails, social media messages, and contracts, but also video, images, and so on. According to him, structured data was not big data, because we have been processing structured data since the dawn of IT. So, nothing new there.
I don't agree with this view at all. There are many different forms of big data. The amount of unstructured data can definitely be humongous and qualify as big data. But the same holds true for structured data. Many big data systems exist today that process, store, and analyze staggering amounts of structured data. For example, telecommunication companies monitor outages to detect drops in the service level; internet companies monitor every form of usage by website visitors to influence which ads to present or which products to recommend; power plants monitor overheating of components; banks use it for their stock tickers and for real-time fraud detection; distribution centers monitor incoming and outgoing products with their RFID readers; online gaming companies deploy it to detect malicious behavior and monitor quality of service; and the list goes on. In all these examples, large amounts of structured data are processed. It's the world of sensor data and machine-generated data.
People may not be aware of it, but in their daily lives they are responsible for generating numerous, massive data streams. For example, by driving their cars they generate several data streams, such as the speedometer that consumes the data stream coming from the speed sensor and the temperature gauge that presents data coming from the sensor that measures the outside temperature. The expectation is that in the future an average car will have 200 sensors. The amount of data all these cars will generate is phenomenal.
The car is not the only device that generates data streams that people use. Smart phones, tablets, and smart watches are continuously producing gigantic amounts of data. The apps on our phones also stream data continuously. And it doesn’t stop with the devices we carry around. When we watch TV, the provider monitors which programs we watch which influences the ads that are shown. Smart energy meters send data to utility companies.
Many of these sensor-driven systems generate massive amounts of machine-generated data. All this data is highly structured. Based on the sheer volume and the speed with which all this data has to be analyzed, it qualifies as big data.
And then we have the Internet of Things, in which countless devices talk to each other. Again, the amount of data that will flow between these devices will be staggering. Gartner forecasts that 4.9 billion connected things will be in use in 2015, and that this number will reach 25 billion by 2020. The manufacturing, utilities, and transportation industries are expected to be the top three verticals deploying the IoT. Can you imagine the amount of (structured) data being generated?
All this sensor data is highly structured data. If the amount of highly structured, machine-generated data hasn’t surpassed the amount of unstructured data being generated yet, it will do so in the near future. Almost all of this machine-generated and highly structured data is generated and stored for analytical purposes and nothing else. And that makes it big data.
So, big data is not unstructured data only, that’s a myth. Structured data can be big data as well. Let’s not distinguish what’s big data or not based on whether the data is structured or not.
The third big data myth in this series deals with how big data is defined by some. Some state that big data is data that is too big for a relational database, and with that, they undoubtedly mean a SQL database, such as Oracle, DB2, SQL Server, or MySQL.
To prove that such statements are being made, I present two examples. First, the following statement is from PredictiveAnalyticsToday.com: "Big data is data that is too large, complex and dynamic for any conventional data tools to capture, store, manage and analyze." With the term conventional they mean, among other things, the well-known SQL databases. Here is the second one: "There are times when data is either being updated too quickly or the data are simply too large to be handled practically by a relational database." Again, they probably mean a SQL database.
To be honest, this is a silly and non-constructive way to define big data and to distinguish it from “small” data. True, there are some SQL products that are really not designed to support big data workloads. Even for some of the more well-known products it’s a challenge to store hundreds of terabytes of data and still offer decent performance.
Big data systems can be developed with SQL database server technology. This has been proven not only on paper, but also in real-life projects. I will give two examples of categories of SQL products with which big data systems can be developed.
First, besides the traditional SQL database products, many so-called analytical SQL database servers exist today. These products have been designed and optimized to support analytics on big databases, and they all use SQL. With some of them, petabyte-scale big data systems have been developed. For example, as early as 2010, eBay operated a ten-petabyte database supported by Teradata. Granted, not every SQL database product is suitable for every possible type of big data workload, but that is no different from NoSQL products. Most of them are also designed and optimized for specific big data workloads.
Second, don't forget how popular SQL-on-Hadoop engines have become. Some claim that more than thirty-five of them already exist. Now, if the interface of a big data system is SQL, then that system is a SQL system, regardless of whether the SQL interface is internally supported by a classic SQL database server or by Hadoop. SQL-on-Hadoop engines running on Hadoop can support massive databases.
Conclusion: the myth that "big data is too big for SQL systems" has never made any sense, and it makes even less sense today. It's really a myth. SQL is definitely suitable for developing big data systems. Maybe not for all big data systems, but that applies to every technology. No database technology is perfect for every possible type of big data system.
Self-service analytics allows users to design and develop their own reports and do their own data analysis with minimal support from IT. Due to the availability of tools such as those from Qlik, Spotfire, and Tableau, self-service analytics has become immensely popular. Lately, self-service data preparation capabilities have become available that extend the existing analytical and visualization capabilities of self-service tools.
Self-service data preparation guides users in understanding the data and the data structures. Users don't have to work out the best way to integrate two data sources or find out which data values are potentially incorrect. All this is done automatically by the data preparation tool. In general, it becomes easier for users to analyze files and data stores that are completely new to them. Even for such data sources, they don't have to call IT for help with integration. In a nutshell, data preparation is a valuable enrichment of the palette of self-service capabilities.
As we all know, self-service analytics hasn't replaced all other forms of reporting and analytics, such as standard reporting, embedded analytics, ad-hoc reporting, and mobile BI, but complements them. Self-service analytics is one of the many forms of analytics. All these forms can be divided into two categories: IT-driven BI and business-driven BI. Self-service analytics belongs to the second category.
The challenge for organizations is to make self-service analytics cooperate with all the IT-driven BI forms. In other words, self-service analytics has to become a fully integrated part of the larger BI environment by bridging the gap between self-service analytics and the other BI forms. This means two things. First, reports initially developed by the business may have to be migrated to the IT-driven BI environment later on. This is referred to as the operationalization of self-service reports. Second, reporting specifications developed by IT must be shared with users of self-service tools. For example, the integration of data sources may be so complex that the specifications are developed by IT specialists and handed over to business users for self-service analytics. In this situation, IT enables business-driven BI.
To bridge the gap between IT-driven BI and business-driven BI, self-service analytics has to cooperate with data virtualization technology. Four different scenarios exist for how they can cooperate:
- Make data virtualization a data source for self-service analytical tools. This allows users to access a wide range of data sources, including SQL databases, XML documents, Excel spreadsheets, Hadoop and NoSQL, web services, and applications.
- Use data virtualization to make results developed with self-service tools available to all users, including those who do not develop reports themselves; reports are shared across business-driven and IT-driven BI.
- Use data virtualization to operationalize user-defined reports and results, allowing specifications developed with the self-service tools to be executed by the data virtualization server.
- Let IT specialists use self-service data preparation functionality to develop data virtualization views when new, unfamiliar data sources have to be hooked up to the data virtualization server. This shortens development time.
To summarize, it's time that BI departments take user-driven, self-service analytics out of its isolation and integrate it with the rest of the IT-driven BI forms. The solution is to let self-service analytics cooperate with data virtualization technology. For more detailed information, see the whitepaper Strengthening Self-Service Analytics with Data Preparation and Data Virtualization.
In Part 1 of this series on big data myths I indicated that the goal of most big data projects is analytics. In other words, big data systems are almost always developed to improve the analytical capabilities of an organization; big data almost always means analytics. Now some try to make us believe that the opposite is true as well: analytics almost always means big data. They see big data and analytics as two sides of the same coin: big data is to support analytics and analytics requires big data. The latter is a myth. Analytical capabilities can definitely be improved and extended with just a little bit of data. Big data is not always a prerequisite.
Let me give an example. Some time ago I ordered, on a website, an album by a band called Longbranch Pennywhistle, which was missing from my collection. When I ordered this album, the website informed me that an album by another band called Shiloh was in stock and asked if I was interested in that one as well. I was, so I bought both of them. Afterwards, I wondered how they did that, because they can't apply logic such as: 400 customers have bought product A and 250 of them bought product B as well; you're now ordering A, so you're probably also interested in B. I can guarantee you, this website can't apply that kind of logic, because nowhere near 400 copies of the album by Longbranch Pennywhistle are sold in a year; probably just one or two.
So, I kept wondering how they were able to discover a relationship between these two albums, and I decided to give them a call. When I had them on the phone, I asked them to connect me with one of their IT specialists, which, to my surprise, they did. I asked how they were able to recommend that Shiloh album. The guy explained that it was simple. They store everything they know about bands, artists, and albums in a database. It's like a network of knowledge on music. When someone buys a product, the network is navigated to find relationships. The relationship between these two bands is that a member of Longbranch Pennywhistle and a member of Shiloh went on to start another band called the Eagles. My final question was whether this network database was a big database. The answer was a definite no. In fact, measured in bytes, it was a really small database. This website was using a small database to support some of its most important forms of analytics.
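The navigation the IT specialist described can be sketched as a small graph search. The band names come from the story; the member names are the well-known facts behind the anecdote (Glenn Frey and Don Henley went on to found the Eagles), and the code is my illustration, not the site's actual implementation.

```python
from collections import deque

# A tiny "network of knowledge" on music: who played in which band.
edges = [
    ("Glenn Frey", "Longbranch Pennywhistle"),
    ("Don Henley", "Shiloh"),
    ("Glenn Frey", "Eagles"),
    ("Don Henley", "Eagles"),
]

# Build an undirected graph: people and bands are nodes.
graph: dict[str, set[str]] = {}
for person, band in edges:
    graph.setdefault(person, set()).add(band)
    graph.setdefault(band, set()).add(person)

def related(start: str, goal: str) -> list[str]:
    """Breadth-first search for the shortest chain linking two nodes."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []  # no connection found

print(related("Longbranch Pennywhistle", "Shiloh"))
# ['Longbranch Pennywhistle', 'Glenn Frey', 'Eagles', 'Don Henley', 'Shiloh']
```

Note how little data is needed: four edges are enough to connect the two albums, which is exactly the point of the anecdote.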
Conclusion, for some really fancy forms of analytics, big data is not always needed. “Small” data can be sufficient. It’s not about the size, it’s about the quality of the data and about having the right data at the right time. Some forms of analytics really require BIG data, but … not always.
Big data is an incredibly popular topic. Plenty of articles, books, and blogs have been written on this topic, and countless sessions have been presented that discuss some aspect of big data. I am a big fan of big data, because organizations have been able to improve and extend their analytical capabilities through their big data systems, resulting in various business benefits.
But I am also aware that big data is sometimes overhyped. Take for example these statements: "Big data: A revolution that will transform how we live, work and think" and "Without big data, you are blind and deaf in the middle of a freeway." These and many others are all hyperboles. The danger is that big data is being oversold, which creates a false or skewed picture of big data and leads to false expectations.
Therefore, I decided to write a number of blogs on big data myths. Each myth is based on statements that repeatedly pop up in articles and sessions. I call them the big myths on big data.
The first big myth I want to address is that big data is sometimes presented as the goal of a project or system. Wrong, the goal is never to develop a big, fat database. Can you imagine a meeting in an IT department that ends with a statement like: “Ok guys, let’s see if we can develop the biggest database on the planet.” This is not how it works in real life. That can never be a goal.
Nor is big data ever a question. Users never call the IT department with a question such as: “Please, can you develop a big database for me?”
If big data is something, it’s an answer, or, in fact, it’s part of an answer. But then, if big data is part of an answer, what’s the question? In almost all the big data systems, the question is analytics. Organizations want to increase, enrich, or extend their analytical capabilities, and quite often this requires more data, big data. So, it’s almost always analytics that’s driving big data systems.
Do not confuse analytics with reporting. Reporting is primarily about presenting what has happened, whereas analytics is mainly aimed at showing what may happen and at influencing what’s going to happen. Typical examples of analytics that may require big data are: improving product development, optimizing business processes, optimizing the operations of machines, improving the level of customer care and customer delight, and personalizing products.
So, the goal is not big data; the goal is to improve and extend the analytical capabilities of an organization. However, with all the technical discussions on Hadoop, NoSQL, and so on, we sometimes tend to forget this. The following somewhat crude saying comes to mind: "When you are up to your ass in alligators, it's difficult to remember that your initial objective was to drain the swamp." I think this applies to some big data projects. The goal is not big data itself, it's analytics.
Hadoop has become a popular and powerful platform for data storage and data processing. Data stored in Hadoop can be used by a wide range of applications and tools and for a wide range of use cases. The fact that SQL can be used to retrieve Hadoop data has opened up the data to even more tools, especially tools for reporting and analytics. A question organizations have to ask themselves is which SQL-on-Hadoop technology they should invest in. They can go for straightforward SQL-on-Hadoop engines or for data virtualization servers.
Examples of SQL-on-Hadoop engines are Drill, Hive, Impala, and Spark SQL. Many of them only allow data to be queried, but there are some, such as Splice Machine, that offer transactional support on Hadoop. Others, such as Cirro and ScleraDB, support data federation capabilities, allowing Hadoop data to be joined with data stored in SQL databases. A technical challenge for most SQL-on-Hadoop engines is how to turn all the non-relational data stored in Hadoop, such as variable data, self-describing data, and schema-less data, into flat relational structures. Not all the engines are capable of that; in other words, they can only access flat data. Nevertheless, SQL-on-Hadoop engines make it easier to use popular tools for reporting and analytics to access big data stored in Hadoop.
But they are not the only kids in town. Data virtualization servers, such as those of Cisco, Denodo, RedHat, and Stonebond, also allow Hadoop to be accessed through SQL. In fact, most data virtualization servers allow SQL access to data stored in almost any kind of file system or database server, including spreadsheets, XML and JSON documents, sequential files, pre-relational database servers, data hidden behind APIs such as SOAP and REST, and data stored in applications such as SAP and Salesforce.com. As indicated, data virtualization servers offer access to Hadoop as well, and with that they have entered the market of SQL-on-Hadoop solutions. However, when they access Hadoop, it's through one of the existing SQL-on-Hadoop engines.
Note that data virtualization servers are more than engines that translate one language into another. For example, all of them offer data federation capabilities for many non-SQL data sources, a high-level design and modeling environment with lineage and impact analysis features, caching capabilities to minimize access to data sources, advanced distributed join optimization techniques, and extensive data security features.
In a nutshell, most current SQL-on-Hadoop engines are tools that solve one technical problem: offering SQL access to Hadoop data. Data virtualization servers are broader solutions that offer access through many languages and APIs to any kind of data source. They are a more architectural solution.
It’s very likely that SQL-on-Hadoop engines will be extended with typical data virtualization features, and vice versa, data virtualization servers will be enriched with full-blown, native support for Hadoop access by embedding their own SQL-on-Hadoop technology. Because they do try to solve some comparable problems, it’s not unlikely that the two product categories will somehow converge. Some products will merge and others will be extended. This is definitely a market to keep an eye on in the coming years.
In the history of IT, IT departments have countless times proposed to top management to invest in a new technology, a new data quality program, or a new design technique, without being able to convince them. Evidently there is not just one reason why IT did not always succeed, but the lack of a data strategy is definitely a dominant one. A data strategy has always been an indispensable concept, but now that data is increasingly becoming a critical asset for many lines of business, it is crucial.
A data strategy describes a single, unified, organization-wide plan for the use of corporate data as a vital asset for every form of decision-making. It describes, for example, why an organization wants to store data, which data to store, what the plans are with respect to data usage, what the vision is with respect to data, and what the plan is to implement the data strategy. A data strategy has to be aligned with the business goals and other business strategies.
If there is no data strategy, many new IT proposals have no context. For example, if IT proposes to invest in a new analytical SQL database server, they need to explain why. The reason is probably to improve reporting performance. The question for top management is then why the performance should be improved. How does that fit in the larger context of things? Or, how can they see the value of improving data quality if there is no overall plan and it's not clear to them what the impact on the business is? It's as if IT proposes that management invest in a puzzle piece while there is no puzzle. The lack of such a puzzle makes it hard for top management to justify the investment.
The data strategy is the puzzle. Proposing new ideas, new technologies, and new programs makes more sense when such a puzzle exists and has been accepted by management. In that case, every proposal includes a description of how it forms a new puzzle piece and how it fits in the overall puzzle. For example, if a data strategy states that within four years the correctness of data visible to the company's suppliers must be higher than 99.5%, it makes sense to top management to introduce the concept of a data steward within some parts of the organization. As indicated, the data strategy makes it easier to justify the investment. Without the data strategy, there is no business context.
If an organization doesn't have a data strategy, one must be developed as soon as possible. Set aside all the proposals for new products, new projects, new techniques, and so on, and invest first in a data strategy, making sure it's accepted by top management. The data strategy describes the rules for storing, processing, and using data, as well as the data vision. It is the puzzle, and selling new ideas that fit the puzzle will be so much easier.