Let’s meet in New York, compliments of Cisco Information Server, NuoDB, Strata, and Waterline Data.
The latest McKinsey Quarterly, celebrating its 50th anniversary, suggests we need a significantly upgraded “Management intuition for the next 50 years”, explaining that “the collision of technological disruption, rapid emerging-markets growth, and widespread aging is upending long-held assumptions that underpin strategy setting, decision making, and management”. Regular readers of this blog will perhaps be surprised only in how long it has taken McKinsey to notice!
The biz-tech ecosystem concept I introduced in “Business unIntelligence” (how time flies—the book has been out almost a full year) pointed to a few other real-world trends, but the result was the same: wrap them together with the current exponential rate of change in technology, and the world of business, and indeed society as a whole, can and must transform in response. W.B. Yeats was more dramatic: “All changed, changed utterly: A terrible beauty is born”.
Much of the excitement around changing technology has focused on big data, particularly all things Hadoop. I’ve covered that in my last post and will be discussing “Drowning not Waving in the Data Lake” in more detail at Strata New York, on 16 October, as well as moderating a panel discussion “Hadoop Responsibly with Big Data Governance” with Sunil Soares, Joe DosSantos, and Jay Zaidi, sponsored by Waterline Data Science, also at Strata on 17 October.
A second important aspect is virtualization of the data resource. This becomes ever more important as data volumes grow, making it difficult or costly to copy it all into data warehouses or migrate it to Hadoop. I also dealt with that topic in a recent blog and will be addressing it at Cisco’s Data Virtualization Day, with Rick van der Lans in New York, next Wednesday, 1 October.
However, there is one other aspect that has received less attention: the practical challenge of the existing layered architecture, where the data warehouse is “copied” from the operational environment. There are many good reasons for this approach, but it also has its drawbacks, most especially the latency it introduces in the decision making environment and issues related to distributed and large scale implementations. In “Business unIntelligence”, I discussed the emerging possibility of combining the operational and informational environments, particularly with in-memory database technology. Gartner coined a new acronym, HTAP (Hybrid Transaction/Analytical Processing), last January to cover this possibility. With its harkening back to the old OLTP and OLAP phraseology, the name doesn’t inspire, but the concept is certainly coming of age.
One particularly interesting approach to this topic comes from NuoDB, whose Swifts 2.1 release went to beta a couple of weeks ago. I blogged on this almost a year ago, where I noted that “real-time decision needs also demand the ability to support both operational and informational needs on the primary data store. NuoDB’s Transaction Engine architecture and use of Multi-Version Concurrency Control together enable good performance of both read/write and longer-running read-only operations seen in operational BI applications”. With general availability of this functionality in November, NuoDB is placing emphasis on the idea that a fully distributed, in-memory, relational database is the platform needed to address the issues arising from a layered operational/informational environment. I’ll be speaking to this in New York on 15 October at NuoDB’s breakfast session, where I’ll also be signing copies of my book, compliments of the sponsor.
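The MVCC idea behind this hybrid workload support can be illustrated in miniature. The following is a hedged Python sketch of my own (not NuoDB’s implementation; all names are illustrative): writers append new versions rather than overwriting in place, so a long-running analytical read works against a consistent snapshot while operational writes proceed unblocked.

```python
import itertools

class MVCCStore:
    """Toy multi-version store: each write creates a new version
    stamped with a monotonically increasing transaction id."""
    def __init__(self):
        self._clock = itertools.count(1)
        self._versions = {}   # key -> list of (txn_id, value)

    def write(self, key, value):
        txn = next(self._clock)
        self._versions.setdefault(key, []).append((txn, value))
        return txn

    def snapshot(self):
        """A reader captures the clock once; later writes stay invisible."""
        as_of = next(self._clock)
        def read(key):
            for txn, value in reversed(self._versions.get(key, [])):
                if txn < as_of:
                    return value
            return None
        return read

store = MVCCStore()
store.write("balance", 100)
read = store.snapshot()          # long-running analytical query starts
store.write("balance", 250)      # operational write proceeds unblocked
print(read("balance"))           # snapshot still sees 100
```

The point of the sketch is the separation of concerns: read-only analytical work never takes locks that would stall the transactional path.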
So, New York, New York… here I come!
Doing big data governance can save you from drowning in the Data Lake.
So, how do you design and build a reservoir? Simplistically, you actually design and build a dam, clear the area behind it of everything of value—people, animals and things—and wait for the water to fill the existing valley, drowning everything in its wake.
Of course, I really want to talk about Data Lakes and Data Reservoirs, what the concept might mean, and its implications for data management and governance. Data Lakes, sometimes called Data Reservoirs, are all the rage at the moment. They seem to provide ideal vacationing spots for marketing folks from big data vendors. We’re treated to pictures of sailboats and waterskiing against pristine mountain backdrops. But beyond exhortations to move all our data to a Hadoop-based platform and save truckloads of money by decommissioning decades of investment in relational systems, I’ve so far found little in the way of thoughtful architecture or design.
The metaphor of a lake offers, of course, the opportunity to talk about water and data flowing in freely and users able to dip in with ease for whatever cup full they need. Playful images of recreational use suggest the freedom and fun that business users would have if only they didn’t have to worry about where the data comes from or how it’s structured. Like the crystal clear water in the lake, it is suggested that all data is the same, pure substance waiting to be consumed.
Deeper thinking, even at the level of the lake metaphor, reminds us that there’s more to it. Lake water must undergo significant treatment and cleansing before it’s considered fit to drink. Many lakes are filled with effluent from the rivers that feed them. Even the pleasure seekers on the lake understand that there may be dangerous shallows or hidden rocks.
The rush to discredit the data warehouse, with its structures and rules, its loading and cleansing processes, its governance and management, has led its detractors to throw out the baby with the lake water. It is important to remember that not all data is created equal. It varies along many axes: value, importance, cleanliness, reliability, security, aggregation, and more. Each characteristic demands thought before an individual data element is put to use in the business. At even the most basic level, its meaning must be defined before it’s used. This simple fact is at the foundation of data warehousing, but it often seems forgotten in the rush to the lakeshore.
Big data governance has to start from that most simple act of naming. Much big data arrives nameless or cryptically named. Names, relationships, and boundaries of use must all be established before the data is put to business use. It should not be forgotten that in the world of traditional data, data modelers labored long and hard to do this work before the data was allowed into the warehouse. Now, data scientists must do it for themselves, on the fly with every data set that arrives.
New tools are beginning to emerge, of course, that emphasize data governance and simplify and automate the process. What these tools do is re-create meaning and structure in the data. They differentiate between data that is suitable for this purpose or totally inappropriate for that task. And once you start that process, your data is no longer undifferentiated lake water; it has been purified and processed, drawn from the lake and bottled for a specific use.
I’ll be discussing “Drowning not Waving in the Data Lake” in more detail at Strata New York, on 16 October, as well as moderating a panel discussion “Hadoop Responsibly with Big Data Governance” with Sunil Soares, author of several books on data governance, Joe DosSantos of EMC Consulting, and Jay Zaidi, Director of Enterprise Data Management at Fannie Mae, sponsored by Waterline Data Science. Do join me at both of these sessions!
With Cisco’s Data Virtualization Day, where I’m a panelist along with Rick van der Lans, coming up fast on 1 Oct in New York, it’s a good moment to revisit the topic.
Data virtualization has come of age. For a few, it remains unheard of, even when I mention some of its noms de plume, most often data federation. For others, now a decreasing minority, especially in the data warehouse field, it remains akin to devil worship! I’m a big supporter (of virtualization, that is!), but a new challenge is emerging – the Data Lake. To understand why, let’s quickly review the history and the pros and cons of data virtualization.
As the original proponent of data warehousing in 1988, I was certainly not impressed by data virtualization when it was first talked about in 1991, as Enterprise Data Access and the EDA/SQL product from Information Builders, and as a component of IBM’s Information Warehouse architecture. (I suspect there are few enough of us left in the industry who can talk about that era from firsthand experience!) Back then, and through the 1990s, I believed that virtualization was a technology with very limited potential. Data consistency and quality were still the major drivers of the BI industry, and real-time integration of data from multiple sources was still a very high-risk endeavor.
By the turn of the millennium, however, I was having a change of heart! IBM was preparing to introduce Information Integrator, a product aimed at the market then known as Enterprise Information Integration (EII). The three principal use cases – real-time data access, combining data marts and combining data and content – were gaining traction. And they continue to be the principal use cases for data virtualization from a data warehouse point of view today. My change of heart seemed obvious to me then and now: the use cases were real and growing in importance, they could not be easily satisfied in the traditional data warehouse architecture, and the data quality of the systems to be accessed was gradually improving. Still, I was probably the first data warehousing expert to accept a role for data virtualization; it was not a very popular stance back then!
Within the past five years, data virtualization has become more mainstream. There is a broader acceptance that data exists and will continue to exist on multiple platforms, and therefore there is a need to access and join them in situ.
Long may that recognition continue! For now there is another wave of platform consolidation being proposed. It’s called the Data Lake, and it’s probably one of the most insidious concepts yet proposed – well, that’s my view. The Data Lake is a new attempt to consolidate data. In that sense, it echoes data warehouse thinking. The significant difference, however, is that there is no thought of reconciling the meanings, rationalizing the relationships or considering the timings of the data pouring in. “Just store it,” is the refrain; we’ll figure it out later. To my mind, this is as dangerous as allowing all unmonitored and likely polluted sources of water to flow into a real lake and then declaring it fit to drink. Not a good idea. In my view, as explained in “Business unIntelligence”, a combination of data warehouse and virtualization is what’s needed.
I’ll return to the Data Lake in more detail soon, and I’ll also be speaking about it at Strata New York, on 16 October.
Do join me at one or the other of these events!
Image: zhudifeng / 123RF Stock Photo
Datameer demonstrates Hadoop-based data mart.
Stefan Groschupf, CEO of @Datameer (from the German Sea of Data), has a great way with one-liners. At the #BBBT last Friday, he suggested that doing interactive SQL on Hadoop HDFS was reminiscent of a relational database using tape drives for storage. The point is that HDFS is a sequential file access approach optimized for the large read or write batch jobs typical of MapReduce. A good point that’s often overlooked in the elephant hype.
Another great sound bite was that Datameer could be seen as the Business Objects of the Hadoop world. And it’s that thought that leads me to the actual topic of this post: data marts.
As one of the oldest and most divisive debates since the earliest days of business intelligence, it’s hardly surprising that the old time-to-value discussions of data warehouse vs. data mart should reemerge in the world of Hadoop. After all, Hadoop is increasingly being used to integrate data from a wide variety of sources for analysis. Such integration always raises the question: do it in advance to ensure data quality, or do it as part of the analysis to reduce time to value? As seen in the image above, Datameer is clearly at the latter end of the spectrum. It’s a data mart.
And in the big data world, it’s certainly not the only data mart type of offering. A growing number of products built in the Hadoop ecosystem are touting typical data mart values: time to value, ease-of-use, focus on analysis and visualization, self-service, and so on. What’s different about Datameer is that it has been around for nearly 5 years and has an impressive customer base.
At an architectural level, we should consider how the quality vs. timeliness, mart vs. warehouse trade-off applies in the world of big data, including the emerging Internet of Things (IoT), discussed at length in my Business unIntelligence book. Are the characteristics of this world sufficiently different from those of traditional BI that we can reevaluate the balance between these two approaches? The answer boils down to the level of consistency and integrity demanded by the business uses of the data involved. Simple analytic uses of big data, such as sentiment analysis and churn prediction, are seldom mission-critical, so the demands on quality and integrity are lower. However, more care is required when such data is combined with business-critical transactional or reference data. This latter data is well-managed (or, at least, it should be), and combining it with poorly curated big data leads inevitably to a result set of lower quality and integrity. Understanding the limitations of such data is vital.
This is particularly important in the case of the growing—and, in my view, unfortunate—popularity of the Data Lake or Data Reservoir concept. In this approach, previously cleansed and integrated business data from operational systems is copied into Hadoop, an environment notorious for poor data management and governance. The opportunities to introduce all sorts of integration or quality errors multiply enormously. In such cases, the data mart approach may amount to nothing more than a fast track to disaster.
Cisco’s acquisition of Composite Software begins to bear fruit.
Last Friday, the Cisco / Composite crowd turned up at the #BBBT. And I do mean crowd; seven participants from Cisco were present, including four speakers. Even a full year after its acquisition by Cisco, the Composite name still lingers for me at least, even though the Cisco branding and technical emphasis were on full view.
A major portion of this year’s session focused on Cisco’s vision of an architectural framework to enable the Internet of Things (IoT) or, in Cisco’s own words, the Internet of Everything. This fascinating presentation by Jim Green, former Composite CEO and now CTO of the Data and Analytics Business Group at Cisco, provoked extensive discussion among the members. Unfortunately for readers of this post, the details are under NDA, but I look forward to writing about them in the future. However, as a teaser, I will say this. Given the enormous growth predicted in the number of sensor-enabled devices—some 50 billion by 2020—our current data warehouse and big data architectures, which focus on centralized storage and manipulation of data, will need to be significantly augmented with real-time, streaming processing distributed throughout the network to simply cope with the volumes of data being generated. I even learned a new marketing buzzword—fog computing… to bring the cloud down to earth!
For BBBT old-timers, like me, our primary interest remains data virtualization, particularly as it relates to BI, and how the acquisition helped or hindered the development of that core function. One fear I had was that this small element of a bigger picture could get lost in the big corporation messaging of Cisco. That fear remains. For example, it took me quite some time to find the “Composite portion” of the Cisco website… it’s here, if you want to bookmark it. This is an aspect of the merged positioning that needs more attention.
But what about data virtualization? Virtualization function is a mandatory component of any modern approach to business intelligence and decision making support. It is a key component (under the name reification) of the process space of my Business unIntelligence REAL architecture. BI folks always focus on the data, of course. But, there’s much more to it.
Here’s what I wrote last year soon after the acquisition and last BBBT Cisco / Composite appearance: “One of the biggest challenges for virtualization is to understand and optimize the interaction between databases and the underlying network. When data from two or more distributed databases must be joined in a real-time query, the query optimizer needs to know, among other things, where the data resides, the volumes in each location, the available processing power of each database, and the network considerations for moving the data between locations. Data virtualization tools typically focus on the first three database concerns, probably as a result of their histories. However, the last concern, the network, increasingly holds the key to excellent optimization… And who better to know about the network and even tweak its performance profile to favor a large virtualization transfer than a big networking vendor like Cisco? The fit seems just right.” (See the full blog here.)
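To make the optimization point concrete, here is a deliberately simplified Python sketch of my own (not any vendor’s optimizer; the numbers and names are illustrative): for a distributed join, the dominant cost is often shipping one side across the network, so the optimizer should ship the smaller side and, as noted above, would benefit from knowing the actual bandwidth available.

```python
def join_shipping_cost(rows_a, rows_b, row_bytes, bandwidth_mbps):
    """Illustrative cost model for a distributed join: ship the
    smaller side to the larger side's location. Returns the chosen
    strategy and the estimated transfer time in seconds."""
    smaller = min(rows_a, rows_b)
    bytes_moved = smaller * row_bytes
    seconds = bytes_moved * 8 / (bandwidth_mbps * 1_000_000)
    choice = "ship A to B" if rows_a <= rows_b else "ship B to A"
    return choice, seconds

# A 1M-row dimension table joined to a 500M-row fact table,
# 200 bytes per row, over a 100 Mbps link:
choice, secs = join_shipping_cost(1_000_000, 500_000_000, 200, 100)
print(choice, round(secs, 1))
```

Even this toy model shows why the network term matters: double the bandwidth (or let the network vendor prioritize the transfer) and the estimated cost halves, which can flip the optimizer’s choice of plan.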
Clearly, this work is progressing with combined hardware/software offerings based on the Cisco (was Composite) Information Server and Cisco’s server and router offerings, although little was provided by way of technical detail or performance claims. Within the data virtualization space itself, the emphasis is on two key aspects: (1) to simplify the use and adoption of data virtualization and (2) to expand data virtualization sources particularly for big data and the Internet of Things. While these two aspects can be taken separately, I believe that it is in their combination that most benefit can be found. It has long been clear to me that big data and IoT data do not belong in the traditional data warehouse. Nor does it make sense to move business transaction data from well-managed relational environments to Hadoop—the Data Lake approach. (In my view, copying legally binding transaction and contractual data into a loosely defined and poorly controlled Data Lake risks corrupting and losing control over vital business assets.)
Virtualization across these two environments, and others, is the most sensible—and perhaps only possible—way to enable business users to combine such very different types of data. Furthermore, providing users with the ability to understand and use such combined data through abstraction and directory services is vital. So, the direction articulated by the Cisco Data and Analytics Business Group brings together two of the vital components to address the growing need for business users to understand, find and combine information from traditional and (so-called) big data sources. A grand challenge, indeed…
I look forward to Cisco’s Data Virtualization Day in New York on 1 October, from where I expect to report further developments on this interesting roadmap.
Data management is back on the agenda, finally with a big data flavor.
I’d like to think that Teradata was driven by my blog of 10 July, “So, how do you eat a Hadoop Elephant?” in the acquisitions announced today of Hadapt and, of more interest here, Revelytix. Of course, I do know that the timing is coincidental. However, the move does emphasize my contention that it will be the traditional data warehouse companies that will ultimately drive real data management into the big data environment. And hopefully kill the data lake moniker in the process!
To recap, my point two weeks ago was: “The challenge was then [in the early days of data warehousing]—and remains now—how to unlock the value embedded in information that was not designed, built or integrated for that purpose. In fact, today’s problem is even bigger. The data in question in business intelligence was at least owned and designed by someone in the business; big data comes from external sources of often unknown provenance, limited explicit definitions and rapidly changing structures. [This demands] defining processes and methodologies for governance, and automating and operationalizing the myriad steps as far as possible. It is precisely this long and seemingly tedious process that is largely missing today from the Hadoop world.”
Revelytix is (or was) a Boston-based startup focusing on the problems of data scientists in preparing data for analytic use in Hadoop. The Revelytix process begins with structuring the incoming soft (or loosely structured) data into a largely tabular format. This is unsurprising to anyone who understands how business analysts have always worked. These tables are then explored iteratively using a variety of statistical and other techniques before being transformed and cleansed into the final structures and value sets needed for the required analytic task. The process and the tasks will be very familiar to anybody involved in ETL or data cleansing in data warehousing. The output—along with more structured data—is, of course, metadata, consisting of table and column names, data types and ranges, etc., as well as the lineage of the transformations applied. In short, the Revelytix tools produce basic technical-level metadata in the Hadoop environment, the initial component of any data management or governance approach.
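The kind of technical metadata described above can be sketched in a few lines of Python. This is a hedged illustration of my own, not Revelytix’s tooling; the function and field names are invented. It profiles loosely structured records into column names, inferred types and observed ranges, and attaches a simple lineage tag for the transformation step applied.

```python
from collections import defaultdict

def profile_records(records, source, step):
    """Derive basic technical metadata from loosely structured rows:
    column names, inferred types, observed value ranges for numeric
    columns, and a lineage entry naming the transformation step."""
    columns = defaultdict(list)
    for row in records:
        for name, value in row.items():
            columns[name].append(value)
    metadata = {}
    for name, values in columns.items():
        types = {type(v).__name__ for v in values}
        entry = {"types": sorted(types), "count": len(values)}
        if types <= {"int", "float"}:   # range only for numeric columns
            entry["range"] = (min(values), max(values))
        metadata[name] = entry
    return {"source": source, "lineage": [step], "columns": metadata}

raw = [{"cust": "a17", "amt": 12.5}, {"cust": "b02", "amt": 7.0}]
meta = profile_records(raw, source="clickstream.json", step="tabularize")
print(meta["columns"]["amt"]["range"])
```

Trivial as it looks, this is exactly the initial component of governance I mean: once names, types, ranges and lineage exist, they can be captured and managed rather than rediscovered by every data scientist.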
In my book, “Business unIntelligence”, I proposed for a variety of reasons that we should start thinking about context-setting information (or CSI, for short), rather than metadata. A key driver was to remind ourselves that this is actually information that extends far beyond the limited technical metadata we usually consider coming from ETL. And if I might be so bold as to advise Teradata on what to focus on with their new baby, I would suggest that they place emphasis on the business-related portion of the CSI being created in the world of the data scientists. It is there that the business meaning for external data emerges. And it is there that it must be captured and managed for proper data governance.
As the big data market matures, the focus shifts from the new data itself to its use in concert with traditional operational business data.
For the business analyst, big data can be very seductive. It exists in enormous quantities. It contains an extensive and expanding record of every interaction that makes up people’s various daily behaviors. According to all the experts, the previously unnoticed correlations it contains hold the potential for discovering customer preferences, understanding their next actions, and even creating brand new business models. Trailblazing businesses in every industry, especially Internet startups, are already doing this, largely based in Hadoop. The future beckons…
However, a key data source—traditional business transactions and other operational and informational data—has been largely isolated from this big data scene. And although the Hadoop store is the default destination for all big data, this older but most important data of the actual business—the customer and product records, the transactions, and so on—usually resides elsewhere entirely, in the relational databases of the business’ operational and informational systems. This data is key to many of the most useful analyses the business user may desire. The graphic above depicts how a modern customer journey accesses and creates data in a wide variety of places and formats, suggesting the range of sources required for comprehensive analytics and the importance of the final purchasing stage.
There are a number of approaches to bringing these disparate data sources together. For some businesses, copying a subset of big data to traditional platforms is a preferred tactic. Others, particularly large enterprises, prefer a data virtualization approach as described in the IDEAL architecture of Business unIntelligence. For businesses based largely in the cloud, bringing operational data into the Hadoop environment often makes sense, given that the majority of their data resides there or in other cloud platforms. The challenge that arises, however, is how to make analytics of this combined data most usable. Technical complexity and a lack of contextual information in Hadoop can be serious barriers to adoption of big data analytics on this platform by ordinary business analysts.
To overcome these issues, four areas of improvement in today’s big data analytics are needed:
1. Combine data from traditional and new sources
2. Create context for data while maintaining agile structure
3. Support iterative, speed-of-thought analytics
4. Enable business-user-friendly analytical interface
Big data is commonly loaded directly into Hadoop in any of a range of formats, such as CSV, JSON, and web logs. Operational and informational data, however, must first be extracted from its normal relational database environments before being loaded in a flat-file format. Careful analysis and modeling are needed to ensure that such extracts faithfully represent the actual state of the business. Such skills are often found in the ETL (extract-transform-load) teams responsible for traditional business intelligence systems, and should be applied here too.
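The extract step itself is mundane but worth seeing. Here is a minimal Python sketch of such a relational-to-flat-file extract, using SQLite purely as a stand-in for whatever operational database is involved; the file and table names are illustrative, and a production ETL job would add the modeling and validation discussed above.

```python
import csv
import sqlite3

def extract_to_csv(db_path, query, out_path):
    """Extract a relational result set to a flat CSV file suitable
    for loading into Hadoop; the header row carries column names."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(query)
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([d[0] for d in cur.description])
            writer.writerows(cur)
    finally:
        conn.close()

# e.g. extract_to_csv("ops.db", "SELECT id, status FROM orders", "orders.csv")
```

The faithfulness concern is precisely what this sketch omits: which snapshot of the orders table, which time zone on the timestamps, which status codes are final. Those are the ETL team’s questions.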
To process such data, users need to be able to define the meaning of the data before exploring and playing with it, in order to address improvement #2 above. Given analysts’ familiarity with tabular data formats, such as spreadsheets and relational tables, a simple modeling and enhancement tool that overlays such a structure on the data is a useful approach. This separates the user from the technical underlying programming methods.
At the level of the physical data access and processing required to return results to the users, one approach is to translate the users’ queries into MapReduce batch programs to run directly against the Hadoop file store. Another approach adds a columnar, compressed, in-memory appliance. This provides iterative, speed-of-thought analytics, in line with improvement #3, by offering an analytic data mart sourced from Hadoop. In this environment, the analyst interacts iteratively with visual dashboards. This is analogous to BI tools operating on top of a relational database. This top layer provides for the fourth required improvement: a business-user-friendly analytical interface.
The four improvement areas listed here are at the heart of Platfora’s approach to delivering big data analytics. For a more detailed explanation, as well as descriptions of a number of customer implementations, please see my white paper, “Demystifying big data analytics” or the accompanying webinar on this topic.
With MapR’s recent announcement of $110 million in funding, following on from Hortonworks’ $100 million and Cloudera’s $900 million, both in March, debate is rife about their different approaches to the market and, of course, which of this big three will eventually win out. Throw in some fear, uncertainty and doubt about the future of the current big data warehouse vendors, a plethora of other players with varying offerings, and you have the food for a real media feeding frenzy.
No doubt the market is undergoing some significant changes and there will be winners and losers. Of course, vendor funding and marketing momentum do make a difference. Certainly, the flood of data from previously untapped or even nonexistent sources expands what businesses can hope to achieve.
But, amid all the excitement, one reality remains constant. One not-so-sexy topic—or actually a related set of topics—will drive the success or failure of real-world implementations. The same topic has been at the heart of data warehousing for nearly thirty years. And whether we call it data warehouse, data lake or data hub, or whether we build it on a relational database or an elephant’s back, is largely irrelevant. This oft-overlooked topic is information (or data) management… using the term in its broadest sense.
Since the earliest days of data warehousing, a significant tension has existed between the urge to deliver early business value and the need to ensure the integrity of the underlying data. Believe it or not, business users were as excited in the 1980s about the opportunities offered by relational databases as today’s users are about big data technologies. The underlying message is not that much different: drive better decision making based on more and better data. The challenge was then—and remains now—how to unlock the value embedded in information that was not designed, built or integrated for that purpose. In fact, today’s problem is even bigger. The data in question in business intelligence was at least owned and designed by someone in the business; big data comes from external sources of often unknown provenance, limited explicit definitions and rapidly changing structures.
For old-timers like me, the open source, big data environment is very reminiscent of the early days of relational databases in the 1980s and data warehousing in the 1990s. The focus is on improving the technological underpinnings, component by individual component. A better database optimizer. Faster throughput load and update (ETL). Security and authentication tools. Moving from batch to interactive and eventually near real-time use.
In data warehousing, the focus has long shifted to the overall process of ensuring data quality and consistency, from modeling business requirements all the way through to production delivery and ongoing maintenance. We see this in tools such as WhereScape and Kalido, which have emerged from teams who had to build and support real, ongoing and changing business needs. Once the excitement of delivering the first data warehouse, lake or hub wears off, the real challenge becomes apparent—how to keep it going in the face of ever changing and increasingly urgent business demands.
So, how do you eat the Hadoop elephant? In exactly the same way as we’ve eaten relational databases, data warehouses and business intelligence: by lining up the pieces, defining processes and methodologies for governance, and automating and operationalizing the myriad steps as far as possible. It is precisely this long and seemingly tedious process that is largely missing today from the Hadoop world. Its absence is unsurprising; this is a market still in the first flush of delivering discrete helpings of business value.
But, in the long run (and it will be long), this is where the worlds of data warehousing and big data will converge. The knowledge and tooling of information management from data warehousing will be applied to big data. The roles of both relational databases and non-relational techniques will become clearly complementary. A hybrid architecture as outlined in my book, Business unIntelligence, will become the preferred approach. And maybe we’ll discover that the elephant we need to eat is that of information meaning and management rather than the basic data manipulation we see in Hadoop today.
Outrage about Facebook’s psychological experiment is misplaced.
The past weekend saw another outpouring of outrage about Facebook’s abuse of personal data. An article in the scientific journal Proceedings of the National Academy of Sciences of the United States of America reported the results of a psychological experiment into the emotional impact of seeing emotionally charged content in social media. In brief, during one week in January 2012, Facebook deliberately manipulated the levels of positive or negative posts in the News Feeds of almost 700,000 users and measured the resulting emotional behavior of the same users by the level of positivity or negativity in their subsequent posts. Commentators take issue that the people involved were neither informed of the experiment nor gave their consent. Facebook obviously disagrees.
Sorry folks, Facebook is correct. Not only did the users consent, but they (and all other users of social media) willingly participate daily in the same type of experiment. The results of these experiments are never published in respectable journals. They are silently used to target advertising and drive marketing. Facebook has been deciding which posts users see for many years, based on relevance, as determined by a proprietary algorithm. Advertisements are delivered in a similar fashion, also based on assumed relevance. The central questions are: Relevant to whom? Relevant on what basis? And how might advertisement relevance be related to the News Feed posts shown at the same time?
What this experiment has emphasized—I presume inadvertently—is that the algorithm(s) used to choose the posts and advertisements you see can be “tuned” in any manner the programmer desires and you will not be any the wiser. If your News Feed can drive your negativity through filtering of posts, is that an opportunity to advertise anti-depressants? If your friends are equally positive about products X and Y, but the social media provider can earn more from ads for X, might that lead to the favoring of posts liking X?
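The point can be made concrete with a toy sketch. This is purely illustrative—nothing here reflects Facebook’s actual, proprietary algorithm—but it shows how a single hidden “sentiment weight” in a feed-ranking score lets a programmer silently change the emotional mix of the posts a user sees:

```python
# Illustrative toy model (NOT any real social network's algorithm):
# a feed-ranking score with a tunable, invisible sentiment bias.

def rank_feed(posts, sentiment_weight=0.0):
    """Order posts by relevance plus a hidden sentiment bias.

    posts: list of dicts with 'relevance' (0..1) and 'sentiment'
    (-1 = very negative .. +1 = very positive).
    sentiment_weight > 0 favours positive posts; < 0 favours negative.
    """
    def score(post):
        return post["relevance"] + sentiment_weight * post["sentiment"]
    return sorted(posts, key=score, reverse=True)

posts = [
    {"id": "a", "relevance": 0.9, "sentiment": -0.8},
    {"id": "b", "relevance": 0.8, "sentiment": 0.9},
    {"id": "c", "relevance": 0.7, "sentiment": 0.1},
]

# With no bias, the feed is ordered purely by relevance.
neutral = [p["id"] for p in rank_feed(posts)]                        # ["a", "b", "c"]
# One small parameter change demotes the negative post to the bottom.
upbeat = [p["id"] for p in rank_feed(posts, sentiment_weight=0.3)]   # ["b", "c", "a"]
```

The user sees only the resulting order, never the weight that produced it—which is precisely why such tuning is undetectable from the outside.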
So, we return again to the issue I raised only two weeks ago. Internet services such as social media or search funded by advertising allow and invite manipulation of the data gathered for increased profit. If we agree that such services are socially desirable or now necessary, can we afford to expose them to even the possibility of such manipulation?
My bottom line is that the free internet is an oxymoron. Your individual freedom will be severely constrained by your desire for free stuff.
Although the yellow elephant continues to trample all over the world of Information Management, it is becoming increasingly difficult to say where more traditional technologies end and Hadoop begins.
Actian’s (@ActianCorp) excellent presentation by John @santaferraro and @emmakmcgrattan at the #BBBT on 24 June emphasized again—if such emphasis were needed—that the boundaries of the Hadoop world are becoming very ill-defined indeed, as more traditional engines are adapted to run on or in the Hadoop cluster. The Actian Analytics Platform – Hadoop SQL Edition embeds their existing X100 / Vectorwise SQL engine directly in the nodes of the Hadoop environment. The approach offers the full range of SQL support previously available in Vectorwise on Hadoop, and claims a 4–30 times speed improvement over Cloudera Impala in a subset of TPC-DS benchmarks.
Just as interesting architecturally, as shown in the accompanying figure, is the creation and use of column-based, binary, compressed vector files by the X100 engine for improved performance, and the subsequent replication of these files by the Hadoop system. These latter files support co-location of data for joins, giving a further performance boost.
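To see why column orientation helps—this is a generic sketch, not Actian’s actual X100 vector file format—consider that storing a column’s values contiguously groups repeated values together, making lightweight compression such as run-length encoding trivial, while a scan of one column never has to touch the others:

```python
# Minimal run-length encoding sketch, the kind of lightweight compression
# that column-oriented storage makes effective (generic illustration only,
# not the X100 engine's real encoding).

def rle_encode(values):
    """Run-length encode a column: [(value, run_length), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_decode(runs):
    """Expand runs back into the original column."""
    return [v for v, n in runs for _ in range(n)]

# A row-at-a-time layout interleaves columns; pulling one column out
# first groups the repeated values, so the runs become long.
rows = [("US", 17), ("US", 3), ("US", 9), ("IE", 4), ("IE", 11)]
country_column = [c for c, _ in rows]

runs = rle_encode(country_column)   # → [("US", 3), ("IE", 2)]
assert rle_decode(runs) == country_column
```

The co-location point in the figure is analogous: if the replicated column files for a join key sit on the same node, the join can proceed without shuffling data across the cluster.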
This is, of course, the type of integration one would expect from seasoned database developers when they migrate to a new platform. Actian is not alone in doing this. Pivotal’s HAWQ has Greenplum technology embedded. It would be surprising if IBM’s on-Hadoop Big SQL offering is not based on DB2 knowledge at the very least. These are the types of development that YARN facilitates in version two of Hadoop. Debate will rage about how deeply integrated the technologies are and how far they take advantage of the Hadoop infrastructure. But that’s just details.
The real point is that the mix and match of functionality and data seen here emphasizes the conundrum I posed at the top of the blog. Where does Hadoop end? And where does “NoHadoop” (well, if we can have NoSQL…) begin? What does this all mean for the evolution of Information Management technology over the coming few years?
As the title suggests, I believe that we are on the crest of the third wave of Hadoop. As in Alvin Toffler’s prescient 1980 book of the same name, this third wave of Hadoop could also be claimed to be post-industrial in nature. Let’s look at the three waves in context.
The first wave of Hadoop was the fertile soil of the Internet in which the cute yellow elephant would grow. The technical pioneers of the Web, particularly Google, defined and built bespoke versions of the new data management (in a loose sense of the term) ecosystem that was needed for the novel types and enormous volumes of data they were handling. Their choice of parallelized commodity hardware and software was the foundation for and driving force of the second wave.
The second wave industrialized the approach through the open source software movement. Here we saw the proliferation of Apache projects and the emergence of commercial, independent distros from the likes of Cloudera and Hortonworks. The ecosystem gradually moved from custom code built by expert developers to a parallel programming environment with a plethora of utilities to aid development, deployment and use. This wave is now receding as it has become clear that an integrated, managed and database-centric environment is now needed. Such a development is fully expected: we had exactly the same cycle in mainframes in the ’60s and ’70s and in distributed computing in the ’80s and ’90s. However, there is an important difference to consider now as the third wave of Hadoop breaks: we are no longer on a virgin shore.
The third wave of Hadoop is seeing the devaluing of the file system in favor of databases that run on top of it. Individual programs are being displaced by systems to manage resource allocation, ensure transaction integrity and provide security. While the companies and individuals who drove the second wave do recognize this shift and are developing systems such as Impala, Falcon, Sentry and more, they start from a disadvantage. The database and other system management technologies that were developed in the mainframe and distributed environments are far more robust and can be migrated to the new commodity hardware and software platform. Commercially, the vendors of these tools have no choice but to move into this market. And they are doing so. YARN has begun to unlock Hadoop from its programming origins.
I suggest that the unique strength of the Hadoop world comes not from its open source software base but from its hardware foundation of parallel commodity machines. Such hardware drives down the capital cost of playing in the big data arena. On the other hand, it increases the operational cost and management complexity. These latter aspects will militate against the open source, let-a-thousand-flowers-bloom approach that is currently being pursued; we need a data management infrastructure, including a fully functional relational database, in this environment far more than yet another NoSQL (or YANS, for short?). Realistically, such mission-critical software is more likely to come from traditional vendors, adapted from existing products, patents and skills. In this, Actian and others are showing the way.
In this third wave, of course, a new model for funding must emerge. Traditional, and often exorbitant, software pricing models cannot survive. On the other hand, the open source free-software-paid-maintenance model, while offering much innovation, is unlikely to be able to fund the dedicated, on-going development required for robust, reliable and secure infrastructure. Are any of the big players in the merging Hadoop market of this third, post-industrial wave willing to step up to this challenge?
Pictures courtesy (1) Actian; (2) Bhajju Shyam, The London Jungle Book.