Now...Business unIntelligence


December 9, 2014  10:54 AM

Dirty Data Lakes and Dubious Analytics

Barry Devlin
Analytics, Big Data

“Gold is down almost 40% since it peaked in 2011. But it’s still up almost 350% since 2000. Although since 1980, on an inflation-adjusted basis, it’s basically flat. However, since the early-1970s it’s up over 7% per year (or about 3.4% after inflation).” Ben Carlson, an institutional investment manager, provides this wonderful example of how statistical data can be abused, in this case by playing with time horizons. Ben is talking about making investment decisions. Let me replay his conclusions, but with a more general view (my changes in bold).

“It’s very easy to cherry-pick historical data that fits your narrative to prove a point about anything. It doesn’t necessarily mean you’re right or wrong. It just means that the world is full of conflicting evidence because the results over most time frames are nowhere close to average. If the performance of everything was predictable over any given time horizon, there would be no risk.”
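To make the time-horizon trick concrete, here is a minimal Python sketch with invented year-end prices (standing in for any asset, not real gold data): the same series supports several contradictory narratives depending solely on where you start measuring.

```python
# Illustrative only: invented year-end prices for some asset, not real gold data.
prices = {1971: 40, 1980: 650, 2000: 280, 2011: 1900, 2014: 1200}

def total_return(start, end):
    """Cumulative return between two years, as a percentage."""
    return (prices[end] / prices[start] - 1) * 100

def annualised_return(start, end):
    """Compound annual growth rate between two years, as a percentage."""
    years = end - start
    return ((prices[end] / prices[start]) ** (1 / years) - 1) * 100

# Same data, four different stories, depending on the chosen horizon.
for start in (1971, 1980, 2000, 2011):
    print(f"Since {start}: total {total_return(start, 2014):+8.0f}%, "
          f"annualised {annualised_return(start, 2014):+5.1f}% per year")
```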

We have entered a period of history where information has become super-abundant. It would be wise, I suggest, to consider all the ways this information can be misinterpreted or abused. Through ignorance, so-called confirmation bias, intention to deceive, and a dozen other causes, we can mislead, be misled, or slip into analysis paralysis. How can we avoid these pitfalls? Before attempting my own answer, let’s take a look at an example of dangerous thinking that can be found even among big data experts.

Jean-Luc Chatelain, a Big Data Technology & Strategy Executive, recently declared “an end to data torture” courtesy of Data Lakes. Arguing that a leading driver is cost, he says Data Lakes “enable massive amount of information to be stored at a very economically viable point [versus] traditional IT storage hardware”. While factually correct, this latter statement says nothing about overall cost: the growth in data volumes probably exceeds the rate of decline in computing costs and, more importantly, data governance costs grow with the increasing volume and disparity of the data stored.

More worryingly, he goes on to say: “the truly important benefit that Data-Lakes bring to the ‘information powered enterprise’ is… ‘High quality actionable insights’”. This conflation of vast stores of often poorly-defined and -managed data with high quality actionable insights flies in the face of common sense. High quality actionable insights more likely stem from high quality, well-defined, meaningful information rather than from large, ill-defined data stores. Actionable insights require the very human behavior of contextualizing new information within personal or organizational experience. No amount of Lake Data can address this need. Finally, choosing actions may be based on the best estimate of whether the information offers a valid forecast about the outcome… or may be based on the desires, intentions, vision, etc. of the decision maker, especially if the information available is deemed to be a poor indicator of the likely future outcome. And Chatelain’s misdirected tirade against ETL (extract, torture and lose, as he labels it) ignores most of the rationale behind the process in order to cherry-pick some well-known implementation weaknesses.

Whether data scientist or business analyst, the first step with data—especially with disparate, dirty data—is always to structure and cleanse it; basically, to make it fit for analytic purpose. Despite a very short history, it is already recognized that 80% or more of data scientists’ effort goes into this data preparation. Attempts to automate this process and to apply good governance principles are already underway from start-ups like @WaterlineData and @AlpineDataLabs, as well as long-standing companies like @Teradata and @IBMbigdata. But, as always, the choice of what to use and how to use it depends on human skill and experience. And make no mistake, most big data analytics moves very quickly from “all the data” to a subset that is defined by its usefulness and applicability to the issue at hand. Big data rapidly becomes focused data in production situations. Returning again and again to the big data source for additional “insights” is governed by the law of diminishing returns.
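As a flavour of what that preparation effort looks like in practice, here is a minimal pandas sketch with invented column names, values and rules (illustrative only, not any vendor’s tooling): drop broken keys, de-duplicate, standardise codes, coerce types and apply a simple validity rule.

```python
import pandas as pd

# Invented example data: the kind of disparate, dirty feed an analyst receives.
raw = pd.DataFrame({
    "cust_id": ["001", "002", "002", "003", None],
    "country": ["ZA", "za ", "ZA", "South Africa", "DE"],
    "spend":   ["120.50", "n/a", "80", "-5", "300"],
})

clean = (
    raw
    .dropna(subset=["cust_id"])                         # drop rows missing the key
    .drop_duplicates(subset=["cust_id"], keep="first")  # de-duplicate on the key
    .assign(
        country=lambda d: d["country"].str.strip().str.upper()
                             .replace({"SOUTH AFRICA": "ZA"}),      # standardise codes
        spend=lambda d: pd.to_numeric(d["spend"], errors="coerce"), # coerce to numbers
    )
    .query("spend >= 0")                                # apply a simple validity rule
)

print(clean)
```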

It is my belief that our current fascination with collecting data about literally everything is taking us down a misleading path. Of course, in some cases, more data and, preferably, better data can offer a better foundation for insight and decision making. However, it is wrong to assume that more data always leads to more insight or better decisions. As in the past evolution of BI, we are again focusing on the tools and technology. Where we need to focus is on improving our human ability to contextualize data and extract valid meaning from it. We need to train ourselves to see the limits of data’s ability to predict the future and the privacy and economic dangers inherent in quantifying everything. We need to take responsibility for our intentions and insights, our beliefs and intuitions that underpin our decisions in business and in life.

“The data made me do it” is a deeply disturbing rationale.

November 25, 2014  2:27 PM

The world of Hybrid Transaction/Analytical Processing

Barry Devlin
architecture, Business Intelligence, Database

Gartner’s new acronym, HTAP. What does it mean and why should you care?

What if we lived in a world where business users didn’t have to think about using different systems, depending on whether they wanted to use current or historical data? Where staff didn’t have to distinguish between running and managing the business? Where IT didn’t have to design and manage complex processes to copy and cleanse all data from operational systems to data warehouses and marts for business intelligence (BI)? The reality of today’s accelerating business drivers is that we urgently need to enable those new world behaviors of both business and IT.

My 2013 book, “Business unIntelligence”, described how the merging of business and technology is transforming and reinventing business processes. Such dramatic changes demand that current and historical data are combined in a more integrated, closed-loop way. In many industries, the success—and even survival—of companies will depend on their ability to bridge the current divide between their operational and informational systems. In 2014, Gartner (1) coined the term hybrid transaction/analytical processing (HTAP) to describe the same need. In terms of its implementation, they pointed to the central role of in-memory databases. This technology is certainly at the core, but other hardware and software considerations come into play.

My recent white paper “The Emergent Operational/Informational World” explores this topic in depth. Starting from the original divergence of operational and informational systems in the 1970s and 1980s, the paper explains how we arrived in today’s layered data world and why it must change, citing examples from online retailing, financial services and the emerging Internet of Things. It describes the three key technological drivers enabling the re-convergence: (1) in-memory databases, (2) techniques to reduce contention in data access, and (3) scaling out of relational databases; and it shows how the modern NuoDB relational database product addresses these drivers.

For a brief introduction to the topic, join me and Steve Cellini of NuoDB on December 2nd, 1pm EST for our webinar “The Future of Data: Where Does HTAP Fit?”
(1) Gartner Press Release, “Hybrid Transaction/Analytical Processing Will Foster Opportunities for Dramatic Business Innovation”, G00259033, 28 January 2014


November 20, 2014  12:32 PM

Data-driven danger

Barry Devlin
Uncategorized

Data-driven business looks like it’s emerging as the next big buzz phrase. Should you be worried?

Back in Cape Town after six weeks on the road in the US and Europe, my first task was to step on stage at Mammoth BI and do some myth busting about data-driven business.

Mammoth BI is the brainchild of entrepreneur Jason Haddock of Saratoga, a local IT solutions company. The one-day conference was modelled on the TED format of 15-minute entertaining and informative presentations, but focused on big data, analytics and BI. This inaugural event was a big success, drawing a large audience to the Cape Town International Conference Centre, including a large number of university students, who were offered free attendance as a give-back to the community.

I was presenter number 13 of 17. Amongst 15 presenters extolling the virtues of big data and analytics, describing their successes and techniques, and one great professional comedian (Gareth Woods), my role was to be the old curmudgeon! Or more gently, the reality check against the overwhelming enthusiasm for data-driven solutions to every imaginable problem and data-based silver bullets for every opportunity. Among the many issues I could have chosen, here are the four myths I chose to bust:

  1. Collect all the data to get the best information: Not! Data Lakes epitomize this idea. Anybody who has been involved in data warehousing over the years should know that data is often dirty. Inconsistent, incomplete, simply incorrect. This is like pouring sewage into a lake. You need to be choosy about what you store and apply strict governance procedures.
  2. Decision-making is a fully rational, data-based process: Not! Lovers, advertisers and even executives know this is not true. Better and more trusted data can influence the direction of thinking, but many important decisions eventually come down to a mix of information, experience, emotions and intentions. Sometimes called gut-feel or intuition. You need a mix of (wo)man and machine.
  3. Big data can be safely anonymized: Not! The ever-increasing set of variables being collected about every man, woman and child is now so large that individuals can always be pinpointed and identified. Privacy is no longer an option. And target marketing can be a nice word for discrimination. Democracy also suffers when all opinions can be charted. (A short sketch after this list shows how few attributes it takes to single someone out.)
  4. Data-driven processes will solve world hunger: Not! While there are many benefits and opportunities to improve the world through big data and automation, the point that everybody seems to miss is that while the cost of goods drops ever lower by (among other factors) eliminating human labour, these displaced workers no longer have the cash to buy even the cheapest goods. Economists presume that new types of jobs will emerge, as happened in the industrial revolutions; unfortunately, none of them can imagine what those jobs might be.
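Taking myth 3 as promised, here is a minimal Python sketch (with invented records) of why removing names rarely anonymizes anything: a handful of ordinary attributes, the so-called quasi-identifiers, is often enough to isolate an individual.

```python
import pandas as pd

# Invented "anonymized" records: names removed, but ordinary attributes kept.
people = pd.DataFrame({
    "postcode":   ["8001", "8001", "8001", "7700", "7700", "7700"],
    "birth_year": [1980,   1980,   1975,   1990,   1990,   1964],
    "gender":     ["F",    "M",    "F",    "M",    "M",    "F"],
})

# How many records share each combination of these three quasi-identifiers?
group_sizes = people.groupby(["postcode", "birth_year", "gender"]).size()

unique_records = (group_sizes == 1).sum()
print(group_sizes)
print(f"{unique_records} of {len(people)} records are already unique "
      "on just three ordinary attributes")
```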

These four problems are presented in order of increasing impact. Everybody in the data-driven industry needs to consider them carefully. I hope that I’m being too pessimistic, especially in the last two. Please prove me wrong! I’d love to return to the next Mammoth BI, planned for August 2015, with some real answers.


September 25, 2014  1:23 PM

Smaller World, Bigger Data, Faster Change

Barry Devlin
Big Data, Data virtualization

Let’s meet in New York, compliments of Cisco Information Server, NuoDB, Strata, and Waterline Data.

The latest McKinsey Quarterly, celebrating its 50th anniversary, suggests we need a significantly upgraded “Management intuition for the next 50 years”, explaining that “the collision of technological disruption, rapid emerging-markets growth, and widespread aging is upending long-held assumptions that underpin strategy setting, decision making, and management”. Regular readers of this blog will perhaps be surprised only in how long it has taken McKinsey to notice!

The biz-tech ecosystem concept I introduced in “Business unIntelligence” (how time flies—the book has been out almost a full year) pointed to a few other real-world trends, but the result was the same: wrap them together with the current exponential rate of change in technology, and the world of business, and indeed society as a whole, can and must transform in response. W.B. Yeats was more dramatic: “All changed, changed utterly: A terrible beauty is born”.

Much of the excitement around changing technology has focused on big data, particularly all things Hadoop. I’ve covered that in my last post and will be discussing “Drowning not Waving in the Data Lake” in more detail at Strata New York, on 16 October, as well as moderating a panel discussion “Hadoop Responsibly with Big Data Governance” with Sunil Soares, Joe DosSantos, and Jay Zaidi, sponsored by Waterline Data Science, also at Strata on 17 October.

A second important aspect is virtualization of the data resource. This becomes ever more important as data volumes grow and copying all that data into data warehouses or migrating it to Hadoop becomes difficult or costly. I also dealt with that topic in a recent blog and will be addressing it at Cisco’s Data Virtualization Day, with Rick van der Lans, in New York next Wednesday, 1 October.

However, there is one other aspect that has received less attention: the practical challenge of the existing layered architecture, where the data warehouse is “copied” from the operational environment. There are many good reasons for this approach, but it also has its drawbacks, most especially the latency it introduces in the decision making environment and issues related to distributed and large scale implementations. In “Business unIntelligence”, I discussed the emerging possibility of combining the operational and informational environments, particularly with in-memory database technology. Gartner coined a new acronym, HTAP (Hybrid Transaction/Analytical Processing), last January to cover this possibility. With its harkening back to the old OLTP and OLAP phraseology, the name doesn’t inspire, but the concept is certainly coming of age.

One particularly interesting approach to this topic comes from NuoDB, whose Swifts 2.1 release went to beta a couple of weeks ago. I blogged on this almost a year ago, where I noted that “real-time decision needs also demand the ability to support both operational and informational needs on the primary data store. NuoDB’s Transaction Engine architecture and use of Multi-Version Concurrency Control together enable good performance of both read/write and longer-running read-only operations seen in operational BI applications”. With general availability of this functionality in November, NuoDB is placing emphasis on the idea that a fully distributed, in-memory, relational database is the platform needed to address the issues arising from a layered operational/informational environment. I’ll be speaking to this in New York on 15 October at NuoDB’s breakfast session, where I’ll also be signing copies of my book, compliments of the sponsor.
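To illustrate the principle behind Multi-Version Concurrency Control (a toy sketch in Python only, not NuoDB’s implementation): writers append new versions while readers work against a consistent snapshot, so a long-running analytical read neither blocks nor is blocked by transactional writes.

```python
class MVCCStore:
    """Toy multi-version store: each write appends a (commit_ts, value) version."""

    def __init__(self):
        self.clock = 0
        self.versions = {}  # key -> list of (commit_ts, value)

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))

    def snapshot(self):
        """A reader captures the current timestamp and reads as of that moment."""
        return self.clock

    def read(self, key, as_of):
        """Return the newest version committed at or before the snapshot."""
        candidates = [v for ts, v in self.versions.get(key, []) if ts <= as_of]
        return candidates[-1] if candidates else None


store = MVCCStore()
store.write("balance", 100)

snap = store.snapshot()        # long-running analytical query starts here
store.write("balance", 250)    # operational update commits meanwhile

print(store.read("balance", snap))              # analytical read still sees 100
print(store.read("balance", store.snapshot()))  # a new reader sees 250
```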

So, New York, New York… here I come!


September 18, 2014  3:23 PM

Big Data Lake Governance

Barry Devlin
Big Data, Data governance

Doing big data governance can save you from drowning in the Data Lake.

How do you design and build a lake? You don’t. That’s nature’s job.

So, how do you design and build a reservoir? Simplistically, you actually design and build a dam, clear the area behind it of everything of value—people, animals and things—and wait for the water to fill the existing valley, drowning everything in its wake.

Of course, I really want to talk about Data Lakes and Data Reservoirs, what the concept might mean, and its implications for data management and governance. Data Lakes, sometimes called Data Reservoirs, are all the rage at the moment. They seem to provide ideal vacationing spots for marketing folks from big data vendors. We’re treated to pictures of sailboats and waterskiing against pristine mountain backdrops. But beyond exhortations to move all our data to a Hadoop-based platform and save truckloads of money by decommissioning decades of investment in relational systems, I’ve so far found little in the way of thoughtful architecture or design.

The metaphor of a lake offers, of course, the opportunity to talk about water and data flowing in freely and users able to dip in with ease for whatever cup full they need. Playful images of recreational use suggest the freedom and fun that business users would have if only they didn’t have to worry about where the data comes from or how it’s structured. Like the crystal clear water in the lake, it is suggested that all data is the same, pure substance waiting to be consumed.

Deeper thinking, even at the level of the lake metaphor, reminds us that there’s more to it. Lake water must undergo significant treatment and cleansing before it’s considered fit to drink. Many lakes are filled with effluent from the rivers that feed them. Even the pleasure seekers on the lake understand that there may be dangerous shallows or hidden rocks.

The rush to discredit the data warehouse, with its structures and rules, its loading and cleansing processes, its governance and management, has led its detractors to throw out the baby with the lake water. It is important to remember that not all data is created equal. It varies along many axes: value, importance, cleanliness, reliability, security, aggregation, and more. Each characteristic demands thought before an individual data element is put to use in the business. At even the most basic level, its meaning must be defined before it’s used. This simple fact is at the foundation of data warehousing, but it often seems forgotten in the rush to the lakeshore.

Big data governance has to start from that most simple act of naming. Much big data arrives nameless or cryptically named. Names, relationships, and boundaries of use must all be established before the data is put to business use. It should not be forgotten that in the world of traditional data, data modelers labored long and hard to do this work before the data was allowed into the warehouse. Now, data scientists must do it for themselves, on the fly with every data set that arrives.
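As a small illustration of that simple act of naming (a hypothetical sketch, not any particular tool): given a nameless delimited feed, the first governance step is just to declare what each column means, with a name, a type and, later, allowed ranges and relationships.

```python
import csv
import io

# Invented nameless feed: no header row, cryptic content.
raw_feed = "20140918,4711,39.90\n20140919,4712,12.50\n"

# The naming step: a person (or a profiling tool) declares what each column means.
schema = [
    ("transaction_date", str),
    ("product_id", int),
    ("amount_eur", float),
]

rows = []
for record in csv.reader(io.StringIO(raw_feed)):
    named = {name: caster(value) for (name, caster), value in zip(schema, record)}
    rows.append(named)

print(rows[0])  # {'transaction_date': '20140918', 'product_id': 4711, 'amount_eur': 39.9}
```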

New tools are beginning to emerge, of course, that emphasize data governance and simplify and automate the process. What these tools do is re-create meaning and structure in the data. They differentiate between data that is suitable for this purpose or totally inappropriate for that task. And once you start that process, your data is no longer undifferentiated lake water; it has been purified and processed, drawn from the lake and bottled for a specific use.

I’ll be discussing “Drowning not Waving in the Data Lake” in more detail at Strata New York, on 16 October, as well as moderating a panel discussion “Hadoop Responsibly with Big Data Governance” with Sunil Soares, author of several books on data governance, Joe DosSantos of EMC Consulting, and Jay Zaidi, Director of Enterprise Data Management at Fannie Mae, sponsored by Waterline Data Science. Do join me at both of these sessions!


September 11, 2014  3:23 PM

Data Virtualization vs. Data Lake

Barry Devlin
Data virtualization, Data warehousing

With Cisco’s Data Virtualization Day, where I’m a panelist along with Rick van der Lans, coming up fast on 1 Oct in New York, it’s a good moment to revisit the topic.

Data virtualization has come of age. For a few, it still remains unheard of, even if I mention some of its noms de plume, most often data federation. For others, now a decreasing minority found especially in the data warehouse field, it remains akin to devil worship! I’m a big supporter (of virtualization, that is!), but a new challenge is emerging – the Data Lake. To understand why, let’s quickly review the history and the pros and cons of data virtualization.

As the original proponent of data warehousing in 1988, I was certainly not impressed by data virtualization when it was first talked about in 1991, as Enterprise Data Access and the EDA/SQL product from Information Builders, and as a component of the IBM Information Warehouse Architecture. (I suspect there are few enough of us left in the industry who can talk about that era from firsthand experience!) Back then, and through the 1990s, I believed that virtualization was a technology with very limited potential. Data consistency and quality were still the major drivers of the BI industry, and real-time integration of data from multiple sources was still a very high-risk endeavor.

By the turn of the millennium, however, I was having a change of heart! IBM was preparing to introduce Information Integrator, a product aimed at the market then known as Enterprise Information Integration (EII). The three principal use cases – real-time data access, combining data marts, and combining data and content – were gaining traction. And they continue to be the principal use cases for data virtualization from a data warehouse point of view today. My change of heart seemed obvious to me then and now: the use cases were real and growing in importance, they could not be easily satisfied in the traditional data warehouse architecture, and the data quality of the systems to be accessed was gradually improving. Still, I was probably the first data warehousing expert to accept a role for data virtualization; it was not a very popular stance back then!

Within the past five years, data virtualization has become more mainstream. There is a broader acceptance that data exists and will continue to exist on multiple platforms, and therefore there is a need to access and join them in situ.
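As a minimal sketch of that in-situ idea (hypothetical sources and names, with SQLite and pandas standing in for the real platforms): rather than copying both data sets onto one platform, a federated query pulls just what it needs from each source and joins the results at query time.

```python
import io
import sqlite3

import pandas as pd

# Source 1: a relational "warehouse" table (an in-memory SQLite database stands in for it).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (cust_id INTEGER, segment TEXT)")
warehouse.executemany("INSERT INTO customers VALUES (?, ?)",
                      [(1, "retail"), (2, "corporate")])
warehouse.commit()

# Source 2: a flat file of web events (a CSV string stands in for it).
events = pd.read_csv(io.StringIO("cust_id,clicks\n1,42\n2,7\n3,19\n"))

# The virtualization step: query each source in place, then join the results at query time.
customers = pd.read_sql_query("SELECT cust_id, segment FROM customers", warehouse)
combined = events.merge(customers, on="cust_id", how="left")
print(combined)
```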

Long may that recognition continue! For now there is another wave of platform consolidation being proposed. It’s called the Data Lake, and it’s probably one of the most insidious concepts yet proposed – well, that’s my view. The Data Lake is a new attempt to consolidate data. In that sense, it echoes data warehouse thinking. The significant difference, however, is that there is no thought of reconciling the meanings, rationalizing the relationships or considering the timings of the data pouring in. “Just store it,” is the refrain; we’ll figure it out later. To my mind, this is as dangerous as allowing all unmonitored and likely polluted sources of water to flow into a real lake and then declaring it fit to drink. Not a good idea. In my view, as explained in “Business unIntelligence”, a combination of data warehouse and virtualization is what’s needed.

I’ll return to the Data Lake in more detail soon, and I’ll also be speaking about it at Strata New York, on 16 October.

Do join me at one or the other of these events!



August 18, 2014  9:29 AM

Big Data and the Return of the Data Marts

Barry Devlin
Big Data, Data Management, Data marts, Data quality, Data warehouse

Datameer demonstrates a Hadoop-based data mart.

Stefan Groschupf, CEO of @Datameer (from the German Sea of Data), has a great way with one-liners. At the #BBBT last Friday, he suggested that doing interactive SQL on Hadoop HDFS was reminiscent of a relational database using tape drives for storage. The point is that HDFS is a sequential file access approach optimized for the large read or write batch jobs typical of MapReduce. A good point that’s often overlooked in the elephant hype.
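To picture the access pattern Stefan is describing, here is a toy map-and-reduce pass in plain Python (not Hadoop itself, with invented log lines): the whole file is scanned sequentially and aggregated in one batch, which is what HDFS is optimized for and what an interactive, selective SQL query is not.

```python
from collections import Counter

# Invented web-log lines standing in for a large file scanned end to end.
log_lines = [
    "GET /home 200",
    "GET /product/1 404",
    "GET /home 200",
]

# Map phase: emit a (key, 1) pair for every line in the scan.
mapped = ((line.split()[-1], 1) for line in log_lines)

# Reduce phase: sum the emitted counts per key.
status_counts = Counter()
for status, count in mapped:
    status_counts[status] += count

print(status_counts)  # Counter({'200': 2, '404': 1})
```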

Another great sound bite was that Datameer could be seen as the Business Objects of the Hadoop world. And it’s that thought that leads me to the actual topic of this post: data marts.

As one of the oldest and most divisive debates since the earliest days of business intelligence, it’s hardly surprising that the old time-to-value discussions of data warehouse vs. data mart should reemerge in the world of Hadoop. After all, Hadoop is increasingly being used to integrate data from a wide variety of sources for analysis. Such integration always begs the question: do it in advance to create data quality or do it as part of the analysis to reduce time to value? As seen in the image above, Datameer is clearly at the latter end of the spectrum. It’s a data mart.

And in the big data world, it’s certainly not the only data mart type of offering. A growing number of products built in the Hadoop ecosystem are touting typical data mart values: time to value, ease-of-use, focus on analysis and visualization, self-service, and so on. What’s different about Datameer is that it has been around for nearly 5 years and has an impressive customer base.

At an architectural level, we should consider how the quality vs. timeliness, mart vs. warehouse trade-off applies in the world of big data, including the emerging Internet of Things (IoT), discussed at length in my Business unIntelligence book. Are the characteristics of this world sufficiently different from those of traditional BI that we can reevaluate the balance between these two approaches? The answer boils down to the level of consistency and integrity demanded by the business uses of the data involved. Simple analytic uses of big data such as sentiment analysis, churn prediction, etc. are seldom mission-critical, so quality and integrity are less demanding. However, more care is required when such data is combined with business-critical transactional or reference data. This latter data is well-managed (or, at least, it should be) and combining it with poorly curated big data leads inevitably to a result set of lower quality and integrity. Understanding the limitations of such data is vital.

This is particularly important in the case of the growing—and, in my view, unfortunate—popularity of the Data Lake or Data Reservoir concept. In this approach, previously cleansed and integrated business data from operational systems is copied into Hadoop, an environment notorious for poor data management and governance. The opportunities to introduce all sorts of integration or quality errors multiply enormously. In such cases, the data mart approach may amount to nothing more than a fast track to disaster.


August 6, 2014  7:13 AM

Data… Network… Action!

Barry Devlin
Cisco networks, Data virtualization, Networking

Cisco’s acquisition of Composite Software begins to bear fruit.

Last Friday, the Cisco / Composite crowd turned up at the #BBBT. And I do mean crowd; seven participants from Cisco were present, including four speakers. Even a full year after its acquisition by Cisco, the Composite name still lingers, for me at least, even though the Cisco branding and technical emphasis were on full view.

A major portion of this year’s session focused on Cisco’s vision of an architectural framework to enable the Internet of Things (IoT) or, in Cisco’s own words, the Internet of Everything. This fascinating presentation by Jim Green, former Composite CEO and now CTO of the Data and Analytics Business Group at Cisco, provoked extensive discussion among the members. Unfortunately for readers of this post, the details are under NDA, but I look forward to writing about them in the future. However, as a teaser, I will say this. Given the enormous growth predicted in the number of sensor-enabled devices—some 50 billion by 2020—our current data warehouse and big data architectures, which focus on centralized storage and manipulation of data, will need to be significantly augmented with real-time, streaming processing distributed throughout the network to simply cope with the volumes of data being generated. I even learned a new marketing buzzword—fog computing… to bring the cloud down to earth!
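As a toy sketch of the “process in the network” idea (invented readings and threshold, not Cisco’s design): instead of shipping every sensor reading to a central store, an edge node filters and forwards only the exceptions the centre actually needs to see.

```python
# Invented stream of (sensor_id, temperature) readings arriving at an edge node.
readings = [("s1", 21.0), ("s1", 21.1), ("s1", 35.2), ("s2", 19.8), ("s2", 20.1)]

THRESHOLD = 30.0  # only unusual readings are worth central storage

def edge_filter(stream):
    """Yield only the readings that need to travel on to the central store."""
    for sensor_id, temperature in stream:
        if temperature > THRESHOLD:
            yield {"sensor": sensor_id, "temp": temperature, "alert": True}

forwarded = list(edge_filter(readings))
print(f"{len(forwarded)} of {len(readings)} readings forwarded:", forwarded)
```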

For BBBT old-timers like me, the primary interest remains data virtualization, particularly as it relates to BI, and how the acquisition helped or hindered the development of that core function. One fear I had is that this small element of a bigger picture could get lost in the big corporation messaging of Cisco. That fear remains. For example, it took me quite some time to find the “Composite portion” of the Cisco website… it’s here, if you want to bookmark it. This is an aspect of the merged positioning that needs more attention.

But what about data virtualization? Virtualization functionality is a mandatory component of any modern approach to business intelligence and decision-making support. It is a key component (under the name reification) of the process space of my Business unIntelligence REAL architecture. BI folks always focus on the data, of course. But there’s much more to it.

Here’s what I wrote last year soon after the acquisition and last BBBT Cisco / Composite appearance: “One of the biggest challenges for virtualization is to understand and optimize the interaction between databases and the underlying network. When data from two or more distributed databases must be joined in a real-time query, the query optimizer needs to know, among other things, where the data resides, the volumes in each location, the available processing power of each database, and the network considerations for moving the data between locations. Data virtualization tools typically focus on the first three database concerns, probably as a result of their histories. However, the last concern, the network, increasingly holds the key to excellent optimization… And who better to know about the network and even tweak its performance profile to favor a large virtualization transfer than a big networking vendor like Cisco? The fit seems just right.” (See the full blog here.)
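A back-of-the-envelope sketch of that optimization decision (entirely hypothetical numbers): whether to move a table across the network or push the work to where the data sits depends as much on link speed as on table size.

```python
def transfer_seconds(rows, bytes_per_row, network_mb_per_sec):
    """Rough time to ship a table across the network."""
    return rows * bytes_per_row / (network_mb_per_sec * 1_000_000)

# Hypothetical join between a small local table and a large remote one.
small_local = {"rows": 10_000, "bytes_per_row": 200}
large_remote = {"rows": 50_000_000, "bytes_per_row": 200}
link_speed_mb_s = 100

ship_remote_here = transfer_seconds(large_remote["rows"],
                                    large_remote["bytes_per_row"], link_speed_mb_s)
ship_local_there = transfer_seconds(small_local["rows"],
                                    small_local["bytes_per_row"], link_speed_mb_s)

# The optimizer picks the cheaper movement: send the small table to the big data.
print(f"Move remote table here: ~{ship_remote_here:,.0f}s; "
      f"move local table there: ~{ship_local_there:,.2f}s")
```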

Clearly, this work is progressing with combined hardware/software offerings based on the Cisco (was Composite) Information Server and Cisco’s server and router offerings, although little was provided by way of technical detail or performance claims. Within the data virtualization space itself, the emphasis is on two key aspects: (1) to simplify the use and adoption of data virtualization and (2) to expand data virtualization sources particularly for big data and the Internet of Things. While these two aspects can be taken separately, I believe that it is in their combination that most benefit can be found. It has long been clear to me that big data and IoT data do not belong in the traditional data warehouse. Nor does it make sense to move business transaction data from well-managed relational environments to Hadoop—the Data Lake approach. (In my view, copying legally binding transaction and contractual data into a loosely defined and poorly controlled Data Lake risks corrupting and losing control over vital business assets.)

Virtualization across these two environments, and others, is the most sensible—and perhaps only possible—way to enable business users to combine such very different types of data. Furthermore, providing users with the ability to understand and use such combined data through abstraction and directory services is vital. So, the direction articulated by the Cisco Data and Analytics Business Group brings together two of the vital components to address the growing need for business users to understand, find and combine information from traditional and (so-called) big data sources. A grand challenge, indeed…

I look forward to Cisco’s Data Virtualization Day in New York on 1 October, from where I expect to report further developments on this interesting roadmap.


July 22, 2014  4:14 PM

Big data management

Barry Devlin
Data Management

Data management is back on the agenda, finally with a big data flavor.

I’d like to think that Teradata was driven by my blog of 10 July, “So, how do you eat a Hadoop Elephant?”, in the acquisitions it announced today of Hadapt and, of more interest here, Revelytix. Of course, I do know that the timing is coincidental. However, the move does emphasize my contention that it will be the traditional data warehouse companies that ultimately drive real data management into the big data environment. And hopefully kill the data lake moniker in the process!

To recap, my point two weeks ago was: “The challenge was then [in the early days of data warehousing]—and remains now—how to unlock the value embedded in information that was not designed, built or integrated for that purpose. In fact, today’s problem is even bigger. The data in question in business intelligence was at least owned and designed by someone in the business; big data comes from external sources of often unknown provenance, limited explicit definitions and rapidly changing structures. [This demands] defining processes and methodologies for governance, and automating and operationalizing the myriad steps as far as possible. It is precisely this long and seemingly tedious process that is largely missing today from the Hadoop world.”

Revelytix is (or was) a Boston-based startup focusing on the problems of data scientists in preparing data for analytic use in Hadoop. The Revelytix process begins with structuring the incoming soft (or loosely structured) data into a largely tabular format. This is unsurprising to anyone who understands how business analysts have always worked. These tables are then explored iteratively using a variety of statistical and other techniques before being transformed and cleansed into the final structures and value sets needed for the required analytic task. The process and the tasks will be very familiar to anybody involved in ETL or data cleansing in data warehousing. The output—along with more structured data—is, of course, metadata, consisting of table and column names, data types and ranges, etc., as well as the lineage of the transformations applied. In short, the Revelytix tools produce basic technical-level metadata in the Hadoop environment, the initial component of any data management or governance approach.
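As a minimal sketch of that kind of output (hypothetical table and column names, not Revelytix’s actual format): profiling a prepared table yields names, types and value ranges, while each transformation step appends a lineage record.

```python
import pandas as pd

# A small prepared table standing in for the data scientist's working set.
table = pd.DataFrame({
    "session_id": [101, 102, 103],
    "duration_s": [12.5, 300.0, 45.2],
    "channel":    ["web", "mobile", "web"],
})

# Technical metadata: column names, inferred types, and value ranges.
metadata = {
    col: {
        "dtype": str(table[col].dtype),
        "min": table[col].min() if pd.api.types.is_numeric_dtype(table[col]) else None,
        "max": table[col].max() if pd.api.types.is_numeric_dtype(table[col]) else None,
        "distinct": int(table[col].nunique()),
    }
    for col in table.columns
}

# Lineage: a simple log of the transformations that produced the table.
lineage = [
    {"step": 1, "operation": "parse raw clickstream into rows"},
    {"step": 2, "operation": "filter out bot sessions"},
    {"step": 3, "operation": "cast duration to seconds"},
]

print(metadata["duration_s"])
print(lineage[-1])
```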

In my book, “Business unIntelligence”, I proposed for a variety of reasons that we should start thinking about context-setting information (or CSI, for short), rather than metadata. A key driver was to remind ourselves that this is actually information that extends far beyond the limited technical metadata we usually consider coming from ETL. And if I might be so bold as to advise Teradata on what to focus on with their new baby, I would suggest that they place emphasis on the business-related portion of the CSI being created in the world of the data scientists. It is there that the business meaning for external data emerges. And it is there that it must be captured and managed for proper data governance.


July 17, 2014  7:12 AM

Big data: necessary but insufficient for analytics

Barry Devlin
Analytics, Hadoop

As the big data market matures, the focus shifts from the new data itself to its use in concert with traditional operational business data.

For the business analyst, big data can be very seductive. It exists in enormous quantities. It contains an extensive and expanding record of every interaction that makes up people’s various daily behaviors. According to all the experts, the previously unnoticed correlations it contains hold the potential for discovering customer preferences, understanding their next actions, and even creating brand new business models. Trailblazing businesses in every industry, especially Internet startups, are already doing this, largely on Hadoop. The future beckons…

However, a key data source—traditional business transactions and other operational and informational data—has been largely isolated from this big data scene. And although the Hadoop store is the default destination for all big data, this older but most important data of the actual business—the customer and product records, the transactions, and so on—usually resides elsewhere entirely, in the relational databases of the business’ operational and informational systems. This data is key to many of the most useful analyses the business user may desire. The graphic above depicts how a modern customer journey accesses and creates data in a wide variety of places and formats, suggesting the range of sources required for comprehensive analytics and the importance of the final purchasing stage.

There are a number of approaches to bringing these disparate data sources together. For some businesses, copying a subset of big data to traditional platforms is a preferred tactic. Others, particularly large enterprises, prefer a data virtualization approach as described in the IDEAL architecture of Business unIntelligence. For businesses based mostly or largely in the cloud, bringing operational data into the Hadoop environment often makes sense, given that the majority of their data resides there or in other cloud platforms. The challenge that arises, however, is how to make analytics of this combined data most usable. Technical complexity and a lack of contextual information in Hadoop can be serious barriers to adoption of big data analytics on this platform by ordinary business analysts.

To overcome these issues, four areas of improvement in today’s big data analytics are needed:
1. Combine data from traditional and new sources
2. Create context for data while maintaining agile structure
3. Support iterative, speed-of-thought analytics
4. Enable business-user-friendly analytical interface

Big data is commonly loaded directly into Hadoop in any of a range of common formats, such as CSV, JSON, web logs and more. Operational and informational data, however, must first be extracted from its normal relational database environments before being loaded in a flat-file format. Careful analysis and modeling are needed to ensure that such extracts faithfully represent the actual state of the business. Such skills are often to be found in the ETL (extract-transform-load) teams responsible for traditional business intelligence systems, and should be applied here too.
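As a minimal sketch of that extract step (hypothetical table and column names, with SQLite standing in for the operational database): relational data is pulled out with SQL and landed as a flat file that Hadoop-side tooling can ingest.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for the operational system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10, 99.90), (2, 11, 15.00)])
conn.commit()

# Extract with SQL, then land as a flat file for the big data environment.
orders = pd.read_sql_query("SELECT order_id, cust_id, total FROM orders", conn)
orders.to_csv("orders_extract.csv", index=False)
print(orders)
```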

To process such data, users need to be able to define the meaning of the data before exploring and playing with it, in order to address improvement #2 above. Given analysts’ familiarity with tabular data formats, such as spreadsheets and relational tables, a simple modeling and enhancement tool that overlays such a structure on the data is a useful approach. This separates the user from the technical underlying programming methods.

At the level of the physical data access and processing required to return results to the users, one approach is to translate the users’ queries into MapReduce batch programs to run directly against the Hadoop file store. Another approach adds a columnar, compressed, in-memory appliance. This provides iterative, speed-of-thought analytics, in line with improvement #3, by offering an analytic data mart sourced from Hadoop. In this environment, the analyst interacts iteratively with visual dashboards. This is analogous to BI tools, operating on top of a relational database. This top layer provides for the fourth required improvement: a business-user-friendly analytical interface.
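As a toy illustration of why the columnar route supports speed-of-thought interaction (a sketch of the general layout idea, not Platfora’s engine): an aggregation over a column-oriented structure touches only the columns it needs, whereas a row-oriented pass drags every field of every row along.

```python
# Toy data held two ways: as rows (row-oriented) and as columns (column-oriented).
n = 100_000
rows = [{"region": i % 4, "revenue": float(i % 100), "notes": "x" * 40} for i in range(n)]
columns = {
    "region":  [r["region"] for r in rows],
    "revenue": [r["revenue"] for r in rows],
}

# Row-oriented aggregation: every row object, including unrelated fields, is visited.
row_total = sum(r["revenue"] for r in rows)

# Column-oriented aggregation: only the single column of interest is scanned.
col_total = sum(columns["revenue"])

print(row_total == col_total)  # same answer, far less data touched column-wise
```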

The four improvement areas listed here are at the heart of Platfora’s approach to delivering big data analytics. For a more detailed explanation, as well as descriptions of a number of customer implementations, please see my white paper, “Demystifying big data analytics” or the accompanying webinar on this topic.


