When it comes to externally-sourced data, data scientists are left to pick up the pieces. New tools can help, but let’s also address the deeper issues.
Trifacta presented at the Boulder BI Brain Trust (#bbbt) last Friday, 13 March to a generally positive reaction from the members. In a sentence, @Trifacta offers a visual data preparation and cleansing tool for (typically) externally-sourced data to ease the burden on data scientists, as well as other power data users, who today can spend 80% of their time getting data ready for analysis. In this, the tool does a good job. The demo showed an array of intuitively invoked methods for splitting data out of fields, assessing the cleanliness of data within a set, correcting data errors, and so on. As the user interacts with the data, Trifacta suggests possible cleansing approaches, based on both common algorithms and what the user has previously done when cleaning such data. The user’s choices are recorded as transformation scripts that preserve the lineage of what has been done and that can be reused. Users start with a sample of data to explore and prove their cleansing needs, with the scaled-up transformations running on Hadoop within a monitoring and feedback loop.
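To make the idea of recorded, reusable transformation scripts concrete, here is a minimal Python sketch. It is not Trifacta's API or script format; the class names (`Step`, `TransformationScript`), the two cleansing functions and the sample row are all hypothetical, chosen only to show how each user choice might be captured as a named step that can be replayed on new data and listed as lineage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class Step:
    """One recorded cleansing choice: a human-readable label plus the function."""
    description: str
    apply: Callable[[Row], Row]


@dataclass
class TransformationScript:
    """Hypothetical container for an ordered, replayable list of steps."""
    steps: List[Step] = field(default_factory=list)

    def add(self, description: str, fn: Callable[[Row], Row]) -> "TransformationScript":
        self.steps.append(Step(description, fn))
        return self

    def run(self, rows: List[Row]) -> List[Row]:
        cleaned = []
        for row in rows:
            for step in self.steps:
                # Copy the row so the input sample is never mutated.
                row = step.apply(dict(row))
            cleaned.append(row)
        return cleaned

    def lineage(self) -> List[str]:
        """Return the ordered record of what was done to the data."""
        return [step.description for step in self.steps]


def split_name(row: Row) -> Row:
    # Split a combined field into two separate fields.
    first, _, last = row.pop("name").partition(" ")
    row["first_name"], row["last_name"] = first, last
    return row


def clean_country(row: Row) -> Row:
    # Correct a simple data error: stray whitespace and inconsistent case.
    row["country"] = row["country"].strip().upper()
    return row


script = (
    TransformationScript()
    .add("split 'name' into first_name/last_name", split_name)
    .add("trim and upper-case 'country'", clean_country)
)

sample = [{"name": "Ada Lovelace", "country": " uk "}]
cleaned = script.run(sample)
```

The same `script` object could later be run against the full data set, and `script.lineage()` gives the list of steps that a production team would need in order to understand what was done to the data.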
This is clearly a useful tool for the data scientist and power user that tackles a persistent bottleneck in the journey from data to insight. It also prompts discussion on the process that should exist around the ingestion and use of external data.
There is a persistent desire to reduce the percentage (to zero if possible!) of time spent by data scientists in preparing and cleansing data. Yet, if we accept that such practitioners are indeed scientists, we should recognize that in “real” science, most of the effort goes into experiment design, construction and data gathering/preparation; the statistical validity and longer term success of scientific work depends on this upfront work. Should it be different with data scientists? I believe not. The science resides in the work of experimentation and preparation. Of course, easing the effort involved and automating reuse is always valid, so Trifacta is a useful tool. But, we should not be fooled that the oft quoted 80% can or should be reduced to even 50% in real data science cases. And among power users, their exploration of data is also, to some degree, scientific research. Preparation and discovery are iterative and interdependent processes.
What is often further missed in the hype around analytics is that after science comes engineering: how to put into production the process and insights derived by the data scientists. While there is real value in the “ah-ha” moment when the unexpected but profitable correlation (or even better, in a scientific view, causation) is found, the longer term value can only be wrought by eliminating the data scientists and explorers, and automating the findings within the ongoing processes of the business. This requires reverting to all the old-fashioned procedures and processes of data governance and management, with the added challenge that the incoming data is, almost by definition, dirty, unreliable, changeable, and a list of other undesirable adjectives. The knowledge of preparation and cleansing built by the data scientists is key here, so Trifacta’s inclusion of lineage tracking is an important step towards this move to production.
Remember lastminute.com? How is this for their last word on personal data?
“Important information about your personal data
With effect from today, the lastminute.com business has been acquired by Bravofly Rumbo Group. As a result, your personal data has been transferred to LMnext UK Ltd (a member of the Bravofly Rumbo Group) registered in England and Wales with company registration number 9399258.
LMnext UK Ltd is committed to respect the confidentiality of your personal data and will process it fairly and lawfully and in accordance with applicable data protection law.
You are also reminded that you may exercise your rights of access, rectification or removal of your personal data from our database at any time by sending a written request to lastminute.com, Dukes Court, Duke Street, Woking, Surrey, GU21 5BH providing a copy of your ID.
Please do not hesitate to contact us if you have any queries
The team at lastminute.com and Bravofly Rumbo Group”
I assume that they know my name, since they are holding my personal data, but they can’t rise to a mail-merge process for customer relationship?
More irritatingly, they demand a physical instruction with a scan of my ID for a removal. Why? Is it because there is more interesting data about me to be scraped from said ID? Or is it just to discourage me from asking?
So, no, I won’t be asking for removal from their database. Nor will I ever do business with them or any company to whom they pass my data. This e-mail is symptomatic of the lack of respect in which many companies hold our personal data. In itself, it is not a big deal. But, taken in a broader context, it epitomises the old adage: caveat emptor or even caveat scriptor!
In building out its Internet of Things, is HDS acquiring a data refinery, a data lake or a data swamp? See also Part 1
The Data Lake has been filling up nicely since its 2010 introduction by James Dixon, with a number of vendors and analysts sailing forth on the concept. Its precise, architectural meaning has proven somewhat fluid, to continue the metaphor. I criticized it in an article in April last, struggling to find a firm basis for discussion of a concept that is so architecturally vague that it has already spawned multiple interpretations. Dixon commented in a September blog that I was mistaken and set forth that: “A single data lake houses data from one source. You can have multiple lakes, but that does not equal a data mart or data warehouse” and “A Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake.” This doesn’t clarify much for me, especially when read in conjunction with Dixon’s response to one of his commenters: “The fact that [Booz Allen Hamilton] are putting data from multiple data sources into what they call a ‘Data Lake’ is a minor change to the original definition.”
This “minor change” is actually one of the major problems I see from a data management viewpoint, and Dixon admits as much in his next couple of sentences. “But it leads to confusion about the model because not all of the data is necessarily equal when you do that, and metadata becomes much more of an issue. In practice these conceptual differences won’t make much, if any, impact when it comes to the implementation. If you have two data sources your architecture, technology, and capabilities probably won’t differ much whether you consider it to be one data lake or two.” In my opinion, this is the sort of weak-as-water architectural thinking about data that can drown implementers very quickly indeed. Apply it to the data swamp that is the Internet of Things, and I am convinced that you will end up on the Titanic. Given the obvious focus of HDS on the IoT, alarm bells are already ringing loudly indeed.
But there’s more. Recently, Dixon has gone further, suggesting that the Data Lake could become the foundation of a cleverly named “Union of the State”: a complete history of every event and change in data in every application running in the business, an “Enterprise Time Machine” that can recreate on demand the entire state of the business at any instant of the past. In my view, this concept has many philosophical misunderstandings, business misconceptions, and technical impracticalities. (For a much more comprehensive and compelling discussion of temporal data, I recommend Tom Johnston’s “Managing Time in Relational Databases: How to Design, Update and Query Temporal Data”, which actually applies far beyond relational databases.) However, within the context of the HDS acquisition, my concern is how to store, never mind manage, the entire historical data record of even that subset of the Internet of Things that would be of interest to Hitachi or one of its customers. To me, this would truly result in a data quagmire of unimaginable proportions and projects of such size and complexity that would dwarf even the worst data warehouse or ERP project disasters we have seen.
To me, the Data Lake concept is vaguely defined and dangerous. I can accept its validity as a holding pond for the vast quantities of data that pour into the enterprise at high speed, with ill-defined and changeable structures, and often dubious quality. For immediate analysis and quick, but possibly dirty, decisions, a Data Lake could be ideal. Unfortunately, common perceptions of the Data Lake are that, in the longer term, all of the data in the organization could reside there in its original form and structure. This is, in my view, and in the view of Gartner analysts and Michael Stonebraker, to name but a few, not only dangerous in terms of data quality but a major retrograde step for all aspects of data management and governance.
Dixon says of my original criticism “Barry Devlin is welcome to fight a battle against the term ‘Data Lake’. Good luck to him. But if he doesn’t like it he should come up with a better idea.” I fully agree, tilting at well-established windmills is pointless. And as we discovered in our last EMA/9sight Big Data survey (available soon, see preview presentation from January), Data Lake implementations, however variously defined, are already widespread. I believe I have come up with a better idea, too, in the IDEAL and REAL information architectures, defined in depth in my book, Business unIntelligence.
To close on the HDS acquisition of Pentaho, I believe it represents a good deal for both companies. Pentaho gets access to a market and investment stream that can drive and enhance its products and business. And, IoT is big business. HDS gets a powerful set of tools that complement its IoT direction. Together, the two companies should have the energy and resources to clean up the architectural anomalies and market misunderstandings of the Data Lake by formally defining the boundaries and describing the structures required for comprehensive data management and governance.
In building out its Internet of Things, is HDS acquiring a data refinery, a data lake or a data swamp?
This week’s announcement of Hitachi Data Systems’ (HDS, @HDScorp) intention to acquire @Pentaho poses some interesting strategic and architectural questions about big data that are far more important than the announcement’s bland declaration about it being “the largest private big data acquisition transaction to date”. We also need to look beyond the traditional acquisition concerns about integrating product lines, as the companies’ products come from very different spaces. No, the real questions circle around the Internet of Things, the data it produces, and how to manage and use that data.
As HDS and Pentaho engaged as partners and flirted with the prospect of marriage, we may assume that for HDS, aligning with Hitachi’s confusingly named Social Innovation Business was key. Coming from BI, you might imagine that Social Innovation refers to social media and other human-sourced information. In fact, it is Hitachi’s Internet of Things (IoT) play. Hitachi, as a manufacturer of everything from nuclear power plants to power tools, from materials and components to home appliances, as well as being involved in logistics and financial services, is clearly positioned at the coalface of IoT. With data as the major product, the role of HDS storage hardware and storage management software is obvious. What HDS lacked was the software and skills to extract value from the data. Enter Pentaho.
Pentaho comes very much from the BI and, more recently, big data space. Empowering business users to access and use data for decision making has been their business for over 10 years. Based on open source, Pentaho have focused on two areas. First, they provide BI, analysis and dashboard tools for end-users. Second, they offer data access and integration tools across a variety of databases and big data stores. Both aspects are certainly of interest to HDS. Greg Knieriemen (@Knieriemen), Hitachi Data Systems Technology Evangelist, agrees and adds big data and cloud embedding for good measure. The BI and analytics aspect is straightforward: Pentaho offers a good set of functionality and it’s open source. A good match for the HDS needs and vision, job done. The fun begins with data integration.
Dan Woods (@danwoodsearly) lauds the acquisition and links it to his interesting concept of a “Data Supply Chain… that accepts data from a wide variety of sources, both internal and external, processes that data in various nodes of the supply chain, passing data where it is needed, transforming it as it flows, storing key signals and events in central repositories, triggering action immediately when possible, and adding data to a queue for deeper analysis.” The approach is often called a “data refinery”, by Pentaho and others. Like big data, the term has a range of meanings. In simple terms, it is an evolution of the ETL concept to include big data sources and a wider range of targets. Mike Ferguson (@mikeferguson1) provides perhaps the most inclusive vision in a recent white paper (registration required). However broadly or narrowly we define data refinery, HDS is getting a comprehensive set of tooling from Pentaho in this space.
However, along with Pentaho’s data integration tooling, HDS is also getting the Data Lake concept, through its cofounder and CTO, James Dixon, who could be called the father of the Data Lake, having introduced the term in 2010. This could be more problematical, given the debates that rage between supporters and detractors of the concept. I fall rather strongly in the latter camp, so I should, in fairness, provide context for my concerns by reviewing some earlier discussions. This deserves more space than I have here, so please stay tuned for part 2 of this blog!
Why, oh why does the relationship between analytics, automation, profit and employment seem to elude so many people?
A nicely rounded post by Scott Mongeau, “Manager-machine: analytics, artificial intelligence, and the uncertain future of management”, from last October came to my attention today via James Kobielus’ recent response, “Cognitive Computing and the Indelible Role of Human Judgment”. Together, they reminded me again of a real-world problem that has been bothering me since the publication of my book, “Business unIntelligence”.
Mongeau gives a reasoned analysis of the likely increasing impact of analytics and artificial intelligence on the role of management. His thesis appears very realistic: over the coming few decades, many of the more routine tasks of management will fall within the capability of increasingly powerful machines. From driverless cars to advanced logistics management, many more tasks only recently considered the sole remit of humans can be automated. Mongeau also provides a list of tasks where analytics and automation may never (or perhaps more slowly) encroach: he cites strategic decision making, and tasks requiring leadership and personal engagement, although, even in strategic decisions, IBM’s Watson is already making a play. He also offers some possible new job roles for displaced managers. However, he misses what I believe is the key implication, to which I’ll return in a moment.
Sadly, Kobielus misses the same point, choosing instead to focus on the irrefutable argument (at least for the foreseeable future) that there will always be some tasks where human judgment or oversight is required. Such tasks will remain, of course, with humans. A sideswipe at Luddism also adds nothing to the argument.
So, what is the missed implication? It seems self-evident, to me, at least, that manufacturing and increasingly services can be delivered more cheaply in many cases, using analytics and automation, by machines rather than people. As both analytics and automation improve exponentially according to Moore’s Law, the disparity can only increase. Therefore, industry progressively invests in the capital of hardware and software rather than labor, driven directly by the profit motive. Given that it is through their labor that the vast majority of consumers earn the money needed to buy industry’s goods and services, at what point will consumption be adversely affected by the resulting growing level of unemployment? This is not an argument about when, if ever, machines can do everything a person can do. It is simply about envisaging a tipping point when a sufficient percentage of the population can no longer afford the goods and services delivered by industry, no matter how cheaply.
Hence, the equation implied in the title of this post: analytics and automation, driven by profit, reduce employment. The traditional economic argument is that technology-driven unemployment has always been counteracted by new jobs at a higher level of skill for those displaced by the new technology. This argument simply cannot be applied in the current situation; the “skill level” of analytics and automation is increasing far faster (and actually accelerating) than that of humans.
So, I use this first post of 2015 to reiterate the questions I posed in a series of blogs early last year. To be very frank, I do not know what the answers should be. And the politicians, economists and business leaders, who should be leading the thinking in this area, appear to be fully disengaged. In summary, the quest is: how can we reinvent the current economic system in light of the reality that cheaper and more efficient analytics and automation are driving every industry to reduce or eliminate labor costs without consideration for the fact that employment is also the foundation for consumption and, thus, profit?
Image: Nexi. Credit: Spencer Lowell
“Gold is down almost 40% since it peaked in 2011. But it’s still up almost 350% since 2000. Although since 1980, on an inflation-adjusted basis, it’s basically flat. However, since the early-1970s it’s up over 7% per year (or about 3.4% after inflation).” Ben Carlson, an institutional investment manager, provides this wonderful example of how statistical data can be abused, in this case by playing with time horizons. Ben is talking about making investment decisions. Let me replay his conclusions, but with a more general view (my changes in bold).
“It’s very easy to cherry-pick historical data that fits your narrative to prove a point about anything. It doesn’t necessarily mean you’re right or wrong. It just means that the world is full of conflicting evidence because the results over most time frames are nowhere close to average. If the performance of everything was predictable over any given time horizon, there would be no risk.”
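The effect of the chosen time horizon can be shown with simple arithmetic. The Python sketch below converts a cumulative percentage change into an annualized rate. The two cumulative figures come from the quote above; the year counts (4 and 15) are my assumptions about the horizons involved, since the quote does not state the end date.

```python
def annualized_return(total_return_pct: float, years: float) -> float:
    """Convert a cumulative percentage change into a compound annual rate (in %)."""
    growth_factor = 1 + total_return_pct / 100
    return (growth_factor ** (1 / years) - 1) * 100


# Cumulative figures from the quote; the horizons in years are assumed.
since_2011_peak = annualized_return(-40, 4)   # "down almost 40%" over an assumed 4 years
since_2000 = annualized_return(350, 15)       # "up almost 350%" over an assumed 15 years

print(f"Since the 2011 peak: {since_2011_peak:.1f}% per year")
print(f"Since 2000:          {since_2000:.1f}% per year")
```

The same asset yields a strongly negative or a strongly positive annual rate depending solely on where the measurement starts, which is exactly the room for cherry-picking that the quote describes.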
We have entered a period of history where information has become super-abundant. It would be wise, I suggest, to consider all the ways this information can be misinterpreted or abused. Through ignorance, so-called confirmation bias, intention to deceive, and a dozen other causes, we can mislead, be misled, or slip into analysis paralysis. How can we avoid these pitfalls? Before attempting my own answer, let’s take a look at an example of dangerous thinking that can be found even among big data experts.
Jean-Luc Chatelain, a Big Data Technology & Strategy Executive, recently declared “an end to data torture” courtesy of Data Lakes. Arguing that a leading driver is cost, he says Data Lakes “enable massive amount of information to be stored at a very economically viable point [versus] traditional IT storage hardware”. While factually correct, this latter statement actually says nothing about overall cost: the growth in data volumes probably exceeds the rate of decline in computing costs and, more importantly, data governance costs grow with increasing volumes and disparity of data stored.
More worryingly, he goes on to say: “the truly important benefit that Data-Lakes bring to the ‘information powered enterprise’ is… ‘High quality actionable insights’”. This conflation of vast stores of often poorly-defined and -managed data with high quality actionable insights flies in the face of common sense. High quality actionable insights more likely stem from high quality, well-defined, meaningful information rather than from large, ill-defined data stores. Actionable insights require the very human behavior of contextualizing new information within personal or organizational experience. No amount of Lake Data can address this need. Finally, choosing actions may be based on the best estimate of whether the information offers a valid forecast about the outcome… or may be based on the desires, intentions, vision, etc. of the decision maker, especially if the information available is deemed to be a poor indicator of the future likely outcome. And Chatelain’s misdirected tirade against ETL (extract, torture and lose, as he labels it) ignores most of the rationale behind the process in order to cherry-pick some well-known implementation weaknesses.
Whether data scientist or business analyst, the first step with data, especially with disparate, dirty data, is always to structure and cleanse it; basically, to make it fit for analytic purpose. Despite a very short history, it is already recognized that 80% or more of data scientists’ effort goes into this data preparation. Attempts to automate this process and to apply good governance principles are already underway from start-ups like @WaterlineData and @AlpineDataLabs, as well as long-standing companies like @Teradata and @IBMbigdata. But, as always, the choice of what to use and how to use it depends on human skill and experience. And make no mistake, most big data analytics moves very quickly from “all the data” to a subset that is defined by its usefulness and applicability to the issue in hand. Big data rapidly becomes focused data in production situations. Returning again and again to the big data source for additional “insights” is governed by the law of diminishing returns.
It is my belief that our current fascination with collecting data about literally everything is taking us down a misleading path. Of course, in some cases, more data and, preferably, better data can offer a better foundation for insight and decision making. However, it is wrong to assume that more data always leads to more insight or better decisions. As in the past evolution of BI, we are again focusing on the tools and technology. Where we need to focus is on improving our human ability to contextualize data and extract valid meaning from it. We need to train ourselves to see the limits of data’s ability to predict the future and the privacy and economic dangers inherent in quantifying everything. We need to take responsibility for our intentions and insights, our beliefs and intuitions that underpin our decisions in business and in life.
“The data made me do it” is a deeply disturbing rationale.
Gartner’s new acronym, HTAP. What does it mean and why should you care?
What if we lived in a world where business users didn’t have to think about using different systems, depending on whether they wanted to use current or historical data? Where staff didn’t have to distinguish between running and managing the business? Where IT didn’t have to design and manage complex processes to copy and cleanse all data from operational systems to data warehouses and marts for business intelligence (BI)? The reality of today’s accelerating business drivers is that we urgently need to enable those new world behaviors of both business and IT.
My 2013 book, “Business unIntelligence”, described how the merging of business and technology is transforming and reinventing business processes. Such dramatic changes demand that current and historical data are combined in a more integrated, closed-loop way. In many industries, the success—and even survival—of companies will depend on their ability to bridge the current divide between their operational and informational systems. In 2014, Gartner (1) coined the term hybrid transaction/analytical processing (HTAP) to describe the same need. In terms of its implementation, they pointed to the central role of in-memory databases. This technology is certainly at the core, but other hardware and software considerations come into play.
My recent white paper “The Emergent Operational/Informational World” explores this topic in depth. Starting from the original divergence of operational and informational systems in the 1970s and 1980s, the paper explains how we arrived in today’s layered data world and why it must change, citing examples from online retailing, financial services and the emerging Internet of Things. It describes the three key technological drivers enabling the re-convergence: (1) in-memory databases, (2) techniques to reduce contention in data access, and (3) scaling out of relational databases. It also describes how the modern NuoDB relational database product addresses these drivers.
For a brief introduction to the topic, join me and Steve Cellini of NuoDB on December 2nd, 1pm EST for our webinar “The Future of Data: Where Does HTAP Fit?”
(1) Gartner Press Release, “Hybrid Transaction/Analytical Processing Will Foster Opportunities for Dramatic Business Innovation”, G00259033, 28 January 2014
Data-driven business looks like it’s emerging as the next big buzz phrase. Should you be worried?
Back in Cape Town after six weeks on the road in the US and Europe, my first task was to step on stage at Mammoth BI and do some myth busting about data-driven business.
Mammoth BI is the brain-child of local entrepreneur, Jason Haddock of Saratoga, a local IT solutions company. The one-day conference was modelled on the TED format of 15-minute entertaining and informative presentations, but focusing on big data, analytics and BI. This inaugural event was a big success, drawing a large audience to the Cape Town International Conference Centre, including a large number of university students, who were offered free attendance as a give-back to the community.
I was presenter number 13 of 17. Amongst 15 presenters extolling the virtues of big data and analytics, describing their successes and techniques, and one great professional comedian (Gareth Woods), my role was to be the old curmudgeon! Or more gently, the reality check against the overwhelming enthusiasm for data-driven solutions to every imaginable problem and data-based silver bullets for every opportunity. Among the many issues I could have chosen, here are the four myths I chose to bust:
- Collect all the data to get the best information: Not! Data Lakes epitomize this idea. Anybody who has been involved in data warehousing over the years should know that data is often dirty. Inconsistent, incomplete, simply incorrect. This is like pouring sewage into a lake. You need to be choosy about what you store and apply strict governance procedures.
- Decision-making is a fully rational, data-based process: Not! Lovers, advertisers and even executives know this is not true. Better and more trusted data can influence the direction of thinking, but many important decisions eventually come down to a mix of information, experience, emotions and intentions. Sometimes called gut-feel or intuition. You need a mix of (wo)man and machine.
- Big data can be safely anonymized: Not! The ever increasing set of variables being collected about every man, woman and child is now so large that individuals can always be pinpointed and identified. Privacy is no longer an option. And target marketing can be a nice word for discrimination. Democracy also suffers when all opinions can be charted.
- Data-driven processes will solve world hunger: Not! While there are many benefits and opportunities to improve the world through big data and automation, the point that everybody seems to miss is that while the cost of goods drops ever lower by (among other factors) eliminating human labour, these displaced workers no longer have the cash to buy even the cheapest goods. Economists presume that new types of jobs will emerge, as happened in the industrial revolutions; unfortunately, none of them can imagine what those jobs might be.
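The third myth above, on anonymization, can be illustrated with a small Python sketch. It measures what fraction of records become unique, and therefore potentially identifiable, as more variables are combined. The function name `unique_fraction` and the four records are invented purely for illustration; real re-identification studies use far larger data sets and more careful methods.

```python
from collections import Counter
from typing import Dict, List, Sequence


def unique_fraction(records: List[Dict], fields: Sequence[str]) -> float:
    """Fraction of records whose combination of the given fields is unique."""
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Hypothetical "anonymized" records: no names, only a few attributes.
people = [
    {"postcode": "8001", "birth_year": 1970, "gender": "F"},
    {"postcode": "8001", "birth_year": 1970, "gender": "M"},
    {"postcode": "8001", "birth_year": 1985, "gender": "F"},
    {"postcode": "7700", "birth_year": 1970, "gender": "F"},
]

by_postcode = unique_fraction(people, ["postcode"])
by_postcode_and_year = unique_fraction(people, ["postcode", "birth_year"])
by_all_three = unique_fraction(people, ["postcode", "birth_year", "gender"])
```

In this toy data, each added variable raises the share of people who can be singled out, until every record is unique with just three attributes.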
These four problems are presented in order of increasing impact. Everybody in the data-driven industry needs to consider them carefully. I hope that I’m being too pessimistic, especially in the last two. Please prove me wrong! I’d love to return to the next Mammoth BI, planned for August 2015, with some real answers.
Let’s meet in New York, compliments of Cisco Information Server, NuoDB, Strata, and Waterline Data.
The latest McKinsey Quarterly, celebrating its 50th anniversary, suggests we need a significantly upgraded “Management intuition for the next 50 years”, explaining that “the collision of technological disruption, rapid emerging-markets growth, and widespread aging is upending long-held assumptions that underpin strategy setting, decision making, and management”. Regular readers of this blog will perhaps be surprised only in how long it has taken McKinsey to notice!
The biz-tech ecosystem concept I introduced in “Business unIntelligence” (how time flies; the book has been out almost a full year) pointed to a few other real world trends, but the result was the same: wrap them together with the current exponential rate of change in technology, and the world of business, and indeed society as a whole, can and must transform in response. W.B. Yeats was more dramatic: “All changed, changed utterly: A terrible beauty is born”.
Much of the excitement around changing technology has focused on big data, particularly all things Hadoop. I’ve covered that in my last post and will be discussing “Drowning not Waving in the Data Lake” in more detail at Strata New York, on 16 October, as well as moderating a panel discussion “Hadoop Responsibly with Big Data Governance” with Sunil Soares, Joe DosSantos, and Jay Zaidi, sponsored by Waterline Data Science, also at Strata on 17 October.
A second important aspect is virtualization of the data resource. This becomes ever more important as data volumes grow, and copying it all into data warehouses or migrating it to Hadoop is difficult or costly. I also dealt with that topic in a recent blog and will be addressing it at Cisco’s Data Virtualization Day, with Rick van der Lans in New York, next Wednesday, 1 October.
However, there is one other aspect that has received less attention: the practical challenge of the existing layered architecture, where the data warehouse is “copied” from the operational environment. There are many good reasons for this approach, but it also has its drawbacks, most especially the latency it introduces in the decision making environment and issues related to distributed and large scale implementations. In “Business unIntelligence”, I discussed the emerging possibility of combining the operational and informational environments, particularly with in-memory database technology. Gartner coined a new acronym, HTAP (Hybrid Transaction/Analytical Processing), last January to cover this possibility. With its harkening back to the old OLTP and OLAP phraseology, the name doesn’t inspire, but the concept is certainly coming of age.
One particularly interesting approach to this topic comes from NuoDB, whose Swifts 2.1 release went to beta a couple of weeks ago. I blogged on this almost a year ago, where I noted that “real-time decision needs also demand the ability to support both operational and informational needs on the primary data store. NuoDB’s Transaction Engine architecture and use of Multi-Version Concurrency Control together enable good performance of both read/write and longer-running read-only operations seen in operational BI applications”. With general availability of this functionality in November, NuoDB is placing emphasis on the idea that a fully distributed, in-memory, relational database is the platform needed to address the issues arising from a layered operational/informational environment. I’ll be speaking to this in New York on 15 October at NuoDB’s breakfast session, where I’ll also be signing copies of my book, compliments of the sponsor.
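For readers unfamiliar with Multi-Version Concurrency Control, the toy Python sketch below shows the general principle: writers append new versions rather than overwriting, so a long-running read-only query can keep reading a consistent snapshot while updates continue. This is a deliberately simplified, single-process illustration under my own assumptions. The `MVCCStore` class is hypothetical and says nothing about how NuoDB actually implements the technique.

```python
from typing import Any, Dict, List, Optional, Tuple


class MVCCStore:
    """Toy multi-version store: every write adds a new timestamped version."""

    def __init__(self) -> None:
        # key -> list of (commit_timestamp, value), in ascending timestamp order
        self._versions: Dict[str, List[Tuple[int, Any]]] = {}
        self._clock = 0

    def write(self, key: str, value: Any) -> int:
        """Commit a new version of the key and return its timestamp."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self) -> int:
        """Return a timestamp that identifies the current committed state."""
        return self._clock

    def read(self, key: str, snapshot_ts: int) -> Optional[Any]:
        """Return the newest version committed at or before the snapshot."""
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None


store = MVCCStore()
store.write("stock_level", 10)

report_snapshot = store.snapshot()   # a long-running BI query starts here
store.write("stock_level", 7)        # an operational update arrives meanwhile

seen_by_report = store.read("stock_level", report_snapshot)
seen_by_new_reader = store.read("stock_level", store.snapshot())
```

The report continues to see the value as of its own starting point, while new readers see the latest update, and neither blocks the other. That is the property that makes one data store plausible for both operational and informational work.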
So, New York, New York… here I come!
Doing big data governance can save you from drowning in the Data Lake.
So, how do you design and build a reservoir? Simplistically, you actually design and build a dam, clear the area behind it of everything of value—people, animals and things—and wait for the water to fill the existing valley, drowning everything in its wake.
Of course, I really want to talk about Data Lakes and Data Reservoirs, what the concept might mean, and its implications for data management and governance. Data Lakes, sometimes called Data Reservoirs, are all the rage at the moment. They seem to provide ideal vacationing spots for marketing folks from big data vendors. We’re treated to pictures of sailboats and waterskiing against pristine mountain backdrops. But beyond exhortations to move all our data to a Hadoop-based platform and save truckloads of money by decommissioning decades of investment in relational systems, I’ve so far found little in the way of thoughtful architecture or design.
The metaphor of a lake offers, of course, the opportunity to talk about water and data flowing in freely and users able to dip in with ease for whatever cupful they need. Playful images of recreational use suggest the freedom and fun that business users would have if only they didn’t have to worry about where the data comes from or how it’s structured. Like the crystal clear water in the lake, it is suggested that all data is the same, pure substance waiting to be consumed.
Deeper thinking, even at the level of the lake metaphor, reminds us that there’s more to it. Lake water must undergo significant treatment and cleansing before it’s considered fit to drink. Many lakes are filled with effluent from the rivers that feed them. Even the pleasure seekers on the lake understand that there may be dangerous shallows or hidden rocks.
The rush to discredit the data warehouse, with its structures and rules, its loading and cleansing processes, its governance and management, has led its detractors to throw out the baby with the lake water. It is important to remember that not all data is created equal. It varies along many axes: value, importance, cleanliness, reliance, security, aggregation, and more. Each characteristic demands thought before an individual data element is put to use in the business. At even the most basic level, its meaning must be defined before it’s used. This simple fact is at the foundation of data warehousing, but it often seems forgotten in the rush to the lakeshore.
Big data governance has to start from that most simple act of naming. Much big data arrives nameless or cryptically named. Names, relationships, and boundaries of use must all be established before the data is put to business use. It should not be forgotten that in the world of traditional data, data modelers labored long and hard to do this work before the data was allowed into the warehouse. Now, data scientists must do it for themselves, on the fly with every data set that arrives.
New tools are beginning to emerge, of course, that emphasize data governance and simplify and automate the process. What these tools do is re-create meaning and structure in the data. They differentiate between data that is suitable for one purpose and data that is totally inappropriate for another. And once you start that process, your data is no longer undifferentiated lake water; it has been purified and processed, drawn from the lake and bottled for a specific use.
I’ll be discussing “Drowning not Waving in the Data Lake” in more detail at Strata New York, on 16 October, as well as moderating a panel discussion “Hadoop Responsibly with Big Data Governance” with Sunil Soares, author of several books on data governance, Joe DosSantos of EMC Consulting, and Jay Zaidi, Director of Enterprise Data Management at Fannie Mae, sponsored by Waterline Data Science. Do join me at both of these sessions!