Data management is back on the agenda, finally with a big data flavor.
I’d like to think that Teradata was driven by my blog of 10 July, “So, how do you eat a Hadoop Elephant?” in the acquisitions announced today of Hadapt and, of more interest here, Revelytix. Of course, I do know that the timing is coincidental. However, the move does emphasize my contention that it will be the traditional data warehouse companies that will ultimately drive real data management into the big data environment. And hopefully kill the data lake moniker in the process!
To recap, my point two weeks ago was: “The challenge was then [in the early days of data warehousing], and remains now, how to unlock the value embedded in information that was not designed, built or integrated for that purpose. In fact, today’s problem is even bigger. The data in question in business intelligence was at least owned and designed by someone in the business; big data comes from external sources of often unknown provenance, limited explicit definitions and rapidly changing structures. [This demands] defining processes and methodologies for governance, and automating and operationalizing the myriad steps as far as possible. It is precisely this long and seemingly tedious process that is largely missing today from the Hadoop world.”
Revelytix is (or was) a Boston-based startup focusing on the problems of data scientists in preparing data for analytic use in Hadoop. The Revelytix process begins with structuring the incoming soft (or loosely structured) data into a largely tabular format. This is unsurprising to anyone who understands how business analysts have always worked. These tables are then explored iteratively using a variety of statistical and other techniques before being transformed and cleansed into the final structures and value sets needed for the required analytic task. The process and the tasks will be very familiar to anybody involved in ETL or data cleansing in data warehousing. The output—along with more structured data—is, of course, metadata, consisting of table and column names, data types and ranges, etc., as well as the lineage of the transformations applied. In short, the Revelytix tools produce basic technical-level metadata in the Hadoop environment, the initial component of any data management or governance approach.
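To make the idea of technical-level metadata concrete, here is a minimal Python sketch of a tabular dataset that records the lineage of each transformation and can be profiled for column names, data types and ranges. It is not Revelytix code; the names Dataset, ColumnProfile, transform and profile are hypothetical, invented purely for illustration, and the type inference is deliberately naive.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnProfile:
    """Basic technical metadata for one column (hypothetical structure)."""
    name: str
    dtype: str
    minimum: object
    maximum: object


@dataclass
class Dataset:
    """A list of dict rows plus the lineage of transformations applied so far."""
    rows: list
    lineage: list = field(default_factory=list)

    def transform(self, description, fn):
        # Apply fn to a copy of every row and append the step to the lineage.
        new_rows = [fn(dict(row)) for row in self.rows]
        return Dataset(new_rows, self.lineage + [description])

    def profile(self):
        # Derive column names, a naive type (from the first value) and the range.
        names = sorted({key for row in self.rows for key in row})
        profiles = []
        for name in names:
            values = [row[name] for row in self.rows if row.get(name) is not None]
            dtype = type(values[0]).__name__ if values else "unknown"
            profiles.append(
                ColumnProfile(name, dtype, min(values, default=None), max(values, default=None))
            )
        return profiles


def spend_to_float(row):
    # Cleansing step: the raw feed delivers spend as text.
    row["spend"] = float(row["spend"])
    return row


raw = Dataset([{"user": "a", "spend": "12.5"}, {"user": "b", "spend": "3"}])
clean = raw.transform("cast spend to float", spend_to_float)
metadata = clean.profile()
```

The point of the sketch is that the metadata and the lineage fall out of the preparation work itself, rather than being documented after the fact.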
In my book, “Business unIntelligence”, I proposed for a variety of reasons that we should start thinking about context-setting information (or CSI, for short), rather than metadata. A key driver was to remind ourselves that this is actually information that extends far beyond the limited technical metadata we usually consider coming from ETL. And if I might be so bold as to advise Teradata on what to focus on with their new baby, I would suggest that they place emphasis on the business-related portion of the CSI being created in the world of the data scientists. It is there that the business meaning for external data emerges. And it is there that it must be captured and managed for proper data governance.
As the big data market matures, the focus shifts from the new data itself to its use in concert with traditional operational business data.
For the business analyst, big data can be very seductive. It exists in enormous quantities. It contains an extensive and expanding record of every interaction that makes up people’s various daily behaviors. According to all the experts, the previously unnoticed correlations it contains hold the potential for discovering customer preferences, understanding their next actions, and even creating brand new business models. Trailblazing businesses in every industry, especially Internet startups, are already doing this, largely based in Hadoop. The future beckons…
However, a key data source (traditional business transactions and other operational and informational data) has been largely isolated from this big data scene. And although the Hadoop store is the default destination for all big data, this older but most important data of the actual business (the customer and product records, the transactions, and so on) usually resides elsewhere entirely, in the relational databases of the business’ operational and informational systems. This data is key to many of the most useful analyses the business user may desire. The graphic above depicts how a modern customer journey accesses and creates data in a wide variety of places and formats, suggesting the range of sources required for comprehensive analytics and the importance of the final purchasing stage.
There are a number of approaches to bringing these disparate data sources together. For some businesses, copying a subset of big data to traditional platforms is a preferred tactic. Others, particularly large enterprises, prefer a data virtualization approach as described in the IDEAL architecture of Business unIntelligence. For businesses based mostly or largely in the cloud, bringing operational data into the Hadoop environment often makes sense, given that the majority of their data resides here or in other cloud platforms. The challenge that arises, however, is how to make analytics of this combined data most usable. Technical complexity and a lack of contextual information in Hadoop can be serious barriers to adoption of big data analytics on this platform by ordinary business analysts.
To overcome these issues, four areas of improvement in today’s big data analytics are needed:
1. Combine data from traditional and new sources
2. Create context for data while maintaining agile structure
3. Support iterative, speed-of-thought analytics
4. Enable business-user-friendly analytical interface
Big data is commonly loaded directly into Hadoop in any of a range of common formats, such as CSV, JSON, web logs and more. Operational and informational data, however, must first be extracted from its normal relational database environment before being loaded in a flat-file format. Careful analysis and modeling are needed to ensure that such extracts faithfully represent the actual state of the business. Such skills are often to be found in the ETL (extract-transform-load) teams responsible for traditional business intelligence systems, and should be applied here too.
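As a rough illustration of such an extract, the following Python sketch joins two relational tables and writes the result as a flat CSV file of the kind that could then be loaded into Hadoop. It uses an in-memory SQLite database as a stand-in for the operational system; the table names, column names and the extract_to_csv helper are assumptions made for the example, and the actual load into Hadoop is left out.

```python
import csv
import io
import sqlite3


def extract_to_csv(conn, query, out):
    """Run a query and write its result, with a header row, to a file-like object."""
    cursor = conn.execute(query)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Stand-in for the operational database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 25.0), (11, 1, 5.5), (12, 2, 8.0)])

# Denormalize in the extract so each flat-file row carries its own business context.
QUERY = """
    SELECT o.id AS order_id, c.name AS customer, o.amount AS amount
    FROM orders o JOIN customer c ON c.id = o.customer_id
    ORDER BY o.id
"""

buffer = io.StringIO()
row_count = extract_to_csv(conn, QUERY, buffer)
flat_file = buffer.getvalue()
```

The join is where the modeling skill matters: an extract of the orders table alone would arrive in Hadoop without the customer it belongs to.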
To process such data, users need to be able to define the meaning of the data before exploring and playing with it, in order to address improvement #2 above. Given analysts’ familiarity with tabular data formats, such as spreadsheets and relational tables, a simple modeling and enhancement tool that overlays such a structure on the data is a useful approach. This separates the user from the technical underlying programming methods.
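One way to picture such an overlay is the short Python sketch below, which maps loosely structured JSON records onto named, typed columns at read time without altering the stored data. The OVERLAY mapping, the field paths and the helper names are hypothetical, chosen only to illustrate the principle; they do not describe any particular product.

```python
import json

# Hypothetical overlay: column name -> (path into the raw record, type converter).
OVERLAY = {
    "user_id": (("user", "id"), int),
    "page": (("event", "page"), str),
    "seconds": (("event", "duration"), float),
}


def dig(record, path):
    """Follow a path of keys into nested dicts; return None if any key is missing."""
    for key in path:
        if not isinstance(record, dict) or key not in record:
            return None
        record = record[key]
    return record


def as_table(lines, overlay):
    """Present raw JSON lines as (column names, list of row tuples)."""
    columns = list(overlay)
    rows = []
    for line in lines:
        record = json.loads(line)
        row = []
        for column in columns:
            path, convert = overlay[column]
            value = dig(record, path)
            row.append(convert(value) if value is not None else None)
        rows.append(tuple(row))
    return columns, rows


raw_lines = [
    '{"user": {"id": "7"}, "event": {"page": "/home", "duration": "1.5"}}',
    '{"user": {"id": "8"}, "event": {"page": "/cart"}}',
]
columns, rows = as_table(raw_lines, OVERLAY)
```

Because the overlay is only a mapping, it can be changed when the incoming structure changes, which keeps the agility asked for in improvement #2.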
At the level of the physical data access and processing required to return results to the users, one approach is to translate the users’ queries into MapReduce batch programs to run directly against the Hadoop file store. Another approach adds a columnar, compressed, in-memory appliance. This provides iterative, speed-of-thought analytics, in line with improvement #3, by offering an analytic data mart sourced from Hadoop. In this environment, the analyst interacts iteratively with visual dashboards. This is analogous to BI tools, operating on top of a relational database. This top layer provides for the fourth required improvement: a business-user-friendly analytical interface.
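For readers unfamiliar with the first approach, here is a toy, single-process Python sketch of how a simple grouped count (the equivalent of SELECT page, COUNT(*) ... GROUP BY page) breaks down into map, shuffle and reduce steps. It only imitates the shape of a MapReduce job; the function names are my own, and a real Hadoop job would distribute these phases across the cluster.

```python
from collections import defaultdict


def map_phase(records, key_field):
    """Emit a (key, 1) pair for every input record."""
    for record in records:
        yield record[key_field], 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key to produce the final counts."""
    return {key: sum(values) for key, values in groups.items()}


def count_by(records, key_field):
    return reduce_phase(shuffle(map_phase(records, key_field)))


visits = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
counts = count_by(visits, "page")
```

Each such query becomes a batch job over the whole file store, which is why the second, in-memory approach is the one that delivers speed-of-thought iteration.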
The four improvement areas listed here are at the heart of Platfora’s approach to delivering big data analytics. For a more detailed explanation, as well as descriptions of a number of customer implementations, please see my white paper, “Demystifying big data analytics” or the accompanying webinar on this topic.
With MapR’s recent announcement of $110 million in funding, following on from Hortonworks’ $100 million and Cloudera’s $900 million, both in March, debate is rife about their different approaches to the market and, of course, which of this big three will eventually win out. Throw in some fear, uncertainty and doubt about the future of the current big data warehouse vendors, a plethora of other players with varying offerings, and you have the food for a real media feeding frenzy.
No doubt the market is undergoing some significant changes and there will be winners and losers. Of course, vendor funding and marketing momentum do make a difference. Certainly, the flood of data from previously untapped or even nonexistent sources expands what businesses can hope to achieve.
But, amid all the excitement, one reality remains constant. One not-so-sexy topic—or actually a related set of topics—will drive the success or failure of real-world implementations. The same topic has been at the heart of data warehousing for nearly thirty years. And whether we call it data warehouse, data lake or data hub, or whether we build it on a relational database or an elephant’s back, is largely irrelevant. This oft-overlooked topic is information (or data) management… using the term in its broadest sense.
Since the earliest days of data warehousing, a significant tension has existed between the urge to deliver early business value and the need to ensure the integrity of the underlying data. Believe it or not, business users were as excited in the 1980s about the opportunities offered by relational databases as today’s users are about big data technologies. The underlying message is not that much different: drive better decision making based on more and better data. The challenge was then—and remains now—how to unlock the value embedded in information that was not designed, built or integrated for that purpose. In fact, today’s problem is even bigger. The data in question in business intelligence was at least owned and designed by someone in the business; big data comes from external sources of often unknown provenance, limited explicit definitions and rapidly changing structures.
For old-timers like me, the open source, big data environment is very reminiscent of the early days of relational databases in the 1980s and data warehousing in the 1990s. The focus is on improving the technological underpinnings, component by individual component. A better database optimizer. Faster throughput load and update (ETL). Security and authentication tools. Moving from batch to interactive and eventually near real-time use.
In data warehousing, the focus has long shifted to the overall process of ensuring data quality and consistency, from modeling business requirements all the way through to production delivery and ongoing maintenance. We see this in tools such as WhereScape and Kalido, which have emerged from teams who had to build and support real, ongoing and changing business needs. Once the excitement of delivering the first data warehouse, lake or hub wears off, the real challenge becomes apparent: how to keep it going in the face of ever-changing and increasingly urgent business demands.
So, how do you eat the Hadoop elephant? In exactly the same way as we’ve eaten relational databases, data warehouses and business intelligence: by lining up the pieces, defining processes and methodologies for governance, and automating and operationalizing the myriad steps as far as possible. It is precisely this long and seemingly tedious process that is largely missing today from the Hadoop world. Its absence is unsurprising; this is a market still in the first flush of delivering discrete helpings of business value.
But, in the long run (and it will be long), this is where the worlds of data warehousing and big data will converge. The knowledge and tooling of information management from data warehousing will be applied to big data. The roles of both relational databases and non-relational techniques will become clearly complementary. A hybrid architecture as outlined in my book, Business unIntelligence, will become the preferred approach. And maybe we’ll discover that the elephant we need to eat is that of information meaning and management rather than the basic data manipulation we see in Hadoop today.
Outrage about Facebook’s psychological experiment is misplaced.
The past weekend saw another outpouring of outrage about Facebook’s abuse of personal data. An article in the scientific journal Proceedings of the National Academy of Sciences of the United States of America reported the results of a psychological experiment into the emotional impact of seeing emotionally charged content in social media. In brief, during one week in January 2012, Facebook deliberately manipulated the levels of positive or negative posts in the News Feeds of almost 700,000 users and measured the resulting emotional behavior of the same users by the level of positivity or negativity in their subsequent posts. Commentators object that the people involved were neither informed of the experiment nor asked for their consent. Facebook obviously disagrees.
Sorry folks, Facebook is correct. Not only did the users consent, but they (and all other users of social media) willingly participate daily in the same type of experiment. The results of these experiments are never published in respectable journals. They are silently used to target advertising and drive marketing. Facebook has been deciding which posts users see for many years, based on relevance, as determined by a proprietary algorithm. Advertisements are delivered in a similar fashion, also based on assumed relevance. The central questions are: Relevant to whom? Relevant on what basis? And how might advertisement relevance be related to the News Feed posts shown at the same time?
What this experiment has emphasized—I presume inadvertently—is that the algorithm(s) used to choose the posts and advertisements you see can be “tuned” in any manner the programmer desires and you will not be any the wiser. If your News Feed can drive your negativity through filtering of posts, is that an opportunity to advertise anti-depressants? If your friends are equally positive about products X and Y, but the social media provider can earn more from ads for X, might that lead to the favoring of posts liking X?
So, we return again to the issue I raised only two weeks ago. Internet services such as social media or search funded by advertising allow and invite manipulation of the data gathered for increased profit. If we agree that such services are socially desirable or now necessary, can we afford to expose them to even the possibility of such manipulation?
My bottom line is that the free internet is an oxymoron. Your individual freedom will be severely constrained by your desire for free stuff.
Although the yellow elephant continues to trample all over the world of Information Management, it is becoming increasingly difficult to say where more traditional technologies end and Hadoop begins.
Actian’s (@ActianCorp) excellent presentation by John @santaferraro and @emmakmcgrattan at the #BBBT on 24 June emphasized again (if such emphasis were needed) that the boundaries of the Hadoop world are becoming very ill-defined indeed, as more traditional engines are adapted to run on or in the Hadoop cluster. The Actian Analytics Platform – Hadoop SQL Edition embeds their existing X100 / Vectorwise SQL engine directly in the nodes of the Hadoop environment. The approach brings to Hadoop the full range of SQL support previously available in Vectorwise, and claims a 4-30 times speed improvement over Cloudera Impala in a subset of TPC-DS benchmarks.
Architecturally as interesting, shown in the accompanying figure, is the creation and use of column-based, binary, compressed vector files by the X100 engine for improved performance and the subsequent replication of these files by the Hadoop system. These latter files support co-location of data for joins for a further performance boost.
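To show in miniature why column-based, compressed storage helps analytic queries, the Python sketch below pivots rows into columns and run-length encodes one of them. This is emphatically not Actian's file format, whose details are not described here; run-length encoding is simply one well-known compression scheme I have swapped in for illustration, and all names and sample values are assumptions.

```python
from itertools import groupby


def to_columns(rows, names):
    """Pivot row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def rle_encode(values):
    """Run-length encode a column as (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run length) pairs back into the original column."""
    return [value for value, length in runs for _ in range(length)]


rows = [
    ("UK", "A", 10),
    ("UK", "A", 12),
    ("UK", "B", 7),
    ("US", "B", 3),
]
columns = to_columns(rows, ["country", "product", "qty"])
encoded_country = rle_encode(columns["country"])

# A query that only sums qty touches one column and never reads the others.
total_qty = sum(columns["qty"])
```

Storing like values together is what makes the compression effective, and reading only the columns a query needs is where much of the performance gain comes from.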
This is, of course, the type of integration one would expect from seasoned database developers when they migrate to a new platform. Actian is not alone in doing this. Pivotal’s HAWQ has Greenplum technology embedded. It would be surprising if IBM’s on-Hadoop Big SQL offering is not based on DB2 knowledge at the very least. These are the types of development that YARN facilitates in version two of Hadoop. Debate will rage about how deeply integrated the technologies are and how far they take advantage of the Hadoop infrastructure. But that’s just details.
The real point is that the mix and match of functionality and data seen here emphasizes the conundrum I posed at the top of the blog. Where does Hadoop end? And where does “NoHadoop” (well, if we can have NoSQL…) begin? What does this all mean for the evolution of Information Management technology over the coming few years?
As the title suggests, I believe that we are on the crest of the third wave of Hadoop. As in Alvin Toffler’s prescient 1980 book of the same name, this third wave of Hadoop could also be claimed to be post-industrial in nature. Let’s look at the three waves in context.
The first wave of Hadoop was the fertile soil of the Internet in which the cute yellow elephant would grow. The technical pioneers of the Web, particularly Google, defined and built bespoke versions of the new data management (in a loose sense of the term) ecosystem that was needed for the novel types and enormous volumes of data they were handling. Their choice of parallelized commodity hardware and software was the foundation for and driving force of the second wave.
The second wave industrialized the approach through the open source software movement. Here we saw the proliferation of Apache projects and the emergence of commercial, independent distros from the likes of Cloudera and Hortonworks. The ecosystem gradually moved from custom code built by expert developers to a parallel programming environment with a plethora of utilities to aid development, deployment and use. This wave is now receding as it has become clear that an integrated, managed and database-centric environment is now needed. Such a development is fully expected: we had exactly the same cycle in mainframes in the ’60s and ’70s and in distributed computing in the ’80s and ’90s. However, there is an important difference to consider now as the third wave of Hadoop breaks: we are no longer on a virgin shore.
The third wave of Hadoop is seeing the devaluing of the file system in favor of databases that run on top of it. Individual programs are being displaced by systems to manage resource allocation, ensure transaction integrity and provide security. While the companies and individuals who drove the second wave do recognize this shift and are developing systems such as Impala, Falcon, Sentry and more, they start from a disadvantage. The database and other system management technologies that were developed in the mainframe and distributed environments are far more robust and can be migrated to the new commodity hardware and software platform. Commercially, the vendors of these tools have no choice but to move into this market. And they are doing so. YARN has begun to unlock Hadoop from its programming origins.
I suggest that the unique strength of the Hadoop world comes not from its open source software base but from its hardware foundation of parallel commodity machines. Such hardware drives down the capital cost of playing in the big data arena. On the other hand, it increases the operational cost and management complexity. These latter aspects will militate against the open source, let-a-thousand-flowers-bloom approach that is currently being pursued; we need a data management infrastructure, including a fully functional relational database, in this environment far more than yet another NoSQL (or YANS, for short?). Realistically, such mission-critical software is more likely to come from traditional vendors, adapted from existing products, patents and skills. In this, Actian and others are showing the way.
In this third wave, of course, a new model for funding must emerge. Traditional, and often exorbitant, software pricing models cannot survive. On the other hand, the open source free-software-paid-maintenance model, while offering much innovation, is unlikely to be able to fund the dedicated, ongoing development required for robust, reliable and secure infrastructure. Are any of the big players in the emerging Hadoop market of this third, post-industrial wave willing to step up to this challenge?
Pictures courtesy (1) Actian; (2) Bhajju Shyam, The London Jungle Book.
In the year since Edward Snowden spoke out on governmental spying, much has been written about privacy but little enough done to protect personal information, either from governments or from big business.
It’s now a year since the material gathered by Edward Snowden at the NSA was first published by the Guardian and Washington Post newspapers. In one of a number of anniversary-related items, Vodafone revealed that secret wires are mandated in “about six” of the 29 countries in which it operates. It also noted that, in addition, Albania, Egypt, Hungary, India, Malta, Qatar, Romania, South Africa and Turkey deem it unlawful to disclose any information related to wiretapping or content interception. Vodafone’s move is to be welcomed. Hopefully, it will encourage further transparency from other telecommunications providers on governmental demands for information.
However, governmental big data collection and analysis is only one aspect of this issue. Personal data is also of keen interest to a range of commercial enterprises, from telcos themselves to retailers and financial institutions, not to mention the Internet giants, such as Google and Facebook, which are the most voracious consumers of such information. Many people are rightly concerned about how governments—from allegedly democratic to manifestly totalitarian—may use our personal data. To be frank, the dangers are obvious. However, commercial uses of personal data are more insidious, and potentially more dangerous and destructive to humanity. Governments at least purport to represent the people to a greater or lesser extent; commercial enterprises don’t even wear that minimal fig leaf.
Take, as one example among many, indoor proximity detection systems based on Bluetooth Low Energy devices such as Apple’s iBeacon and Google’s rumored upcoming Nearby. The inexorable progress of communications technology—smaller, faster, cheaper, lower power—enables more and more ways of determining the location of your smartphone or tablet and, by extension, you. The operating system or app on your phone requires an opt-in to enable it to transmit your location. However, it is becoming increasingly difficult to avoid opting-in as many apps require it to work at all. More worrying are the systems that record and track without asking permission the MAC addresses of smartphones and tablets that poll public Wi-Fi network routers, which all such devices automatically do. (See, for example, this article, subscription required.) The only way to avoid such tracking is to turn off the device’s Wi-Fi receiver. On the desktop, the situation is little better, with Facebook last week joining Google and Yahoo! in ignoring browser “do not track” settings.
It would be simple to blame the businesses involved—both the technology companies that develop the systems and the businesses that buy or use the data. They certainly must take their fair share of responsibility, together with the data scientists and other IT staff involved in building the systems. But the reality is that it is we, the general public, who hand over our personal data without a second thought about its possible uses, who must step up to demanding real change in the collection and use of such data. This demands significant rethinking in at least two areas.
First is the oft-repeated marketing story that “people want more targeted advertising”, reiterated again last week by Facebook’s Brian Boland. A more nuanced view is provided by Sara M. Watson, a Fellow at the Berkman Center for Internet and Society at Harvard University, in a recent Atlantic article Data Doppelgängers and the Uncanny Valley of Personalization: “Data tracking and personalized advertising is often described as ‘creepy.’ Personalized ads and experiences are supposed to reflect individuals, so when these systems miss their mark, they can interfere with a person’s sense of self. It’s hard to tell whether the algorithm doesn’t know us at all, or if it actually knows us better than we know ourselves. And it’s disconcerting to think that there might be a glimmer of truth in what otherwise seems unfamiliar. This goes beyond creepy, and even beyond the sense of being watched.”
I would suggest that given the choice between less irrelevant advertising or, simply, less advertising on the Web, many people would opt for the latter, particularly given the increasing invasiveness of the data collection needed to drive allegedly more accurate targeting. Clearly, this latter choice would not be in the interest of the advertising industry, a position that crystalizes in the widespread resistance to limits on data gathering, especially in the United States. An obvious first step in addressing this issue is a people-driven, legally mandated move from opt-out data gathering to a formal opt-in approach. To be really useful, of course, this would need to be preceded by a widespread mass deletion of previously gathered data.
This leads directly to the second area in need of substantial rethinking—the funding model for Internet business. Most of us accept that “there’s no such thing as a free lunch”. But a free email service, Cloud store or search engine, well apparently that’s eminently reasonable. Of course, it isn’t. All these services cost money to build and run, costs that are covered (with significant profits in many cases) by advertising. More of it and supposedly better targeted via big data and analytics.
There is little doubt that the majority of people using the Internet gain real, daily value from it. Today, that value is paid for through personal data. The loss of privacy seems barely noticed. People I ask are largely uninterested in any possible consequences. However, privacy is the foundation for many aspects of society, including democracy, as can be clearly seen in totalitarian states, where widespread surveillance and destruction of privacy are among the first orders of business. We, the users of the Web, must do the unthinkable: we must demand the right to pay real money for mobile access, search, email and so on in exchange for an end to tracking personal data.
These are but two arguably simplistic suggestions to address issues that have been made more obvious by Snowden’s revelations. A more complete theoretical and legal foundation for a new approach is urgently needed. One possible starting point is The Dangers of Surveillance by Neil Richards, Professor of Law at Washington University Law, published in the Harvard Law Review a few short months before Snowden spilled at least some of the beans.
Image courtesy Marc Kjerland