Volkswagen alerts us to missing ethical thinking in the otherwise powerful concept of data-driven business. Thanks Guys!
One of the most intriguing examples of being supremely data-driven came recently from, of all places, the automobile industry. I refer to the emissions-rigging escapade of Volkswagen. As an exemplar of using sensors and analytics to achieve a business goal, the company’s approach was impressive, however questionable the ethics. Personally, I would be unsurprised to see further similar approaches exposed in the automobile and other industries. Because, beyond the disappointment and outrage being vented on the company and individuals therein, the scam exposes two widespread misconceptions about data-driven business today.
There is currently immense interest in building data-driven business models across the IT industry. The explosion of so-called big data first from social media and human web interaction and more recently from various sensors and other data generating devices has galvanized technology development in every aspect of hardware, networking and software. Every vendor worth their silicon is developing an Internet of Things (IoT) solution. Every consultant is discussing how business will be disrupted, while start-ups in every industry are actively doing it. Even traditional businesses are jumping aboard the IoT bandwagon. I’m not going to reel off the possibilities here; there are hundreds, if not thousands, of posts and papers out there already. My interest is in a couple of assumptions that underlie every opportunity: (1) that the data collected is valid and sufficiently reliable for the purpose to which it’s being put and (2) that (even partially) automated decisions based on the analysis of such data are better than those made by biased or otherwise debilitated humans. Both of those assumptions have been fairly well trashed by Volkswagen; I fear that few in the industry have noticed.
The possibility that IoT devices may provide invalid or unreliable data is often discussed in terms of failures of the devices themselves or the network infrastructure that connects them. Such failures are, of course, a reality. However, a more subtle and insidious possibility is that the devices may be designed deliberately to provide misleading data. With estimates of 20-30 billion devices on the IoT within 5 years, the opportunity to make mischief seems unmissable. Theo Priestly correctly questions who governs whether the data generated is even valid in the first place. In a legislative environment where pollutants are monitored and being driven ever lower, the value in devices that under-measure pollutants has been made obvious. Most governments now favor self-regulation by industry players themselves, so the risks of getting caught are limited, while the opportunities for fraud and deception are widespread in every sector. A tiny over-measurement of the gas, water or electricity flowing into a few thousand random households in a city could enable a significant increase in profit for the supplier with minimal risk of detection. As devices become more sophisticated and more programmable remotely, further opportunities for deceit and cover-up are emerging.
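To make the question of who checks the data a little more concrete, here is a minimal sketch of the kind of audit an independent party might run: compare a sample of device readings against trusted reference readings taken at the same time and flag any systematic bias. The function names, the sample values and the 0.5% tolerance are all my own illustrative assumptions, not an established standard or any regulator's method.

```python
from statistics import mean


def mean_relative_bias(device_readings, reference_readings):
    """Average of (device - reference) / reference over paired readings."""
    if not device_readings or len(device_readings) != len(reference_readings):
        raise ValueError("need equal-length, non-empty lists of paired readings")
    return mean([(d - r) / r for d, r in zip(device_readings, reference_readings)])


def looks_biased(device_readings, reference_readings, tolerance=0.005):
    """Flag a device whose average relative error exceeds the (assumed) tolerance."""
    return abs(mean_relative_bias(device_readings, reference_readings)) > tolerance


# Hypothetical paired readings: the device reads about 1% high every time.
reference = [100.0, 200.0, 50.0]
suspect_device = [101.0, 202.0, 50.5]

print(mean_relative_bias(suspect_device, reference))
print(looks_biased(suspect_device, reference))
```

A check like this only catches consistent bias in the sampled readings. As Volkswagen showed, a device that behaves well when it detects it is being tested would pass it, which is why the question of intent cannot be answered by measurement alone.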
In the case of automated decision-making systems, the increasing sophistication of the algorithms in use is already recognized as being beyond the comprehension of the vast majority of business people, and in many cases even beyond the programmers of the systems themselves. In these circumstances, discerning the intention of the algorithm (or actually of its designer) after the fact becomes largely impossible. Gartner is already talking about an algorithm economy in which billions of algorithms exist, their workings declared proprietary. Here, the question arises of who governs the ethics of the algorithms; indeed, who is even thinking about the topic? The responsibility lies with the businesses involved, because regulation at this scale is logistically impossible. And the suggestion made by some that there should be third-party devices and algorithms responsible for monitoring the other devices and algorithms only adds another layer of indirection to the question of ultimate responsibility for the correct/fair/balanced/just operation of these technologies. This, in the last analysis, comes down to the ethics of those directly involved.
So, for those businesses jumping aboard the data-driven business train, the question of intent should be front and center in the minds of CIOs and even CEOs. What is the intent behind the design of the sensors and algorithms being built? Is there a firm ethical foundation behind their design and implementation? Or is the intention merely to obey the letter of the law and avoid being caught out, in the interest of maximizing profit? If the ethical foundation is missing, the consequences for your business may be far-reaching. Indeed, the consequences for the entire data and IT industry may be deeply destructive.
In short, we might yet be thankful to the folks at Volkswagen for alerting us to this missing thinking in the otherwise powerful concept of data-driven business.
Image: The Death of Socrates, Jacques-Louis David (1787)
It’s indisputable that technology is displacing many of today’s jobs. The question is: what should, or can, we do about it? This series—parts 1, 2 and 3—explores the possible consequences of this shift and how information use and decision making support can, and should, drive a better outcome through enhanced and expanded Business Intelligence. In this final Part 4, I examine how that (somewhat) utopian outcome can be achieved.
Business Intelligence and analytics allow exploration of business performance and prediction of future trends. These tools are also used to explore where the overall economy is going. Unfortunately, few economists look beyond traditional measures, and most fail to explore novel economic approaches to the changes ongoing at present. As a non-economist, and without access to the base data, all I can do here is propose some analyses and, based on common sense, suggest what the results might look like in the accompanying graphs. There are, no doubt, other valuable analyses, such as how this would look in the developing world.
[A]: trends in labor and technology costs diverge, suggesting that producers will increasingly switch investment from labor to technology, driving an unequal shift of wealth from labor to capital.
[B]: for individuals, increased income leads to higher consumption (up to some maximum), but in today’s society as a whole, with the income of mainstream consumers dropping due to trends [A], overall consumption drops.
[C]: profit, driver of the capitalist economy, decreases towards zero as both costs [A] and consumption [B] drop, even if percentage profit margin remains relatively stable.
[D]: irrespective of economics, consumption cannot increase indefinitely due to the finite limit of the Earth’s carrying capacity. The current model is boom and bust; an ideal model would respect a limit to overall consumption.
The (common sense) data shown in the first two graphs suggest that the current economic model—linking production to middle class work and income and on to mass consumption—is no longer viable as the lines in [A] approach/cross and wealth moves from labor to capital. Unless the curves in [A] are radically wrong, which I don’t believe, the only viable solution is to break the work-income linkage for the middle classes as it has long been effectively broken for the upper classes. This is the essential idea behind the concept of basic income—an income unconditionally granted to all on an individual basis, without means test or work requirement—in this context an equitable distribution of income from capital as income from labor dries up.
Basic income, supported by the likes of Martin Luther King, Jr., thus offers the first and essential component of a utopia: freedom from the drudgery of work necessary to put bread on the table and a roof overhead. In such an environment, a true gig economy—where independent people take on discrete tasks only when they want to for advancement or even pleasure—becomes possible and attractive, rather than one chosen in desperation by the hungry and used by employers to reduce committed costs. The approach is already under consideration in a number of European countries.
Graph [C] exposes another economic reality in a technology-enabled economy: the profit motive becomes financially unattractive. Although we have had profit as one of the tenets of capitalism for a couple of hundred years, it should not be seen as sacrosanct. The modern (re-)emergence of the concept of the Commons, particularly in its expression by Jeremy Rifkin as the Collaborative Commons, offers a very different basis for the economics of a society beyond traditional capitalism. This second element of a utopia is based on universal communication and connectivity (Internet and Internet of Things) as well as democratization of energy production and manufacturing through advances in solar energy, 3D printing, and other technologies.
The final component of this (somewhat) utopia emerges from graph [D]. The consumer society is, exactly as the name implies, driven by the idea of ever growing consumption. While it has much deeper roots, an indication of its motivation is given by Victor Lebow, who wrote in his Journal of Retailing article “Price Competition in 1955”: “Our enormously productive economy demands that we make consumption our way of life… that we seek our spiritual satisfactions, our ego satisfactions, in consumption.” In recent years, growth in population and individual consumption levels poses the question of where the limit lies in the Earth’s ability to support an ever-growing level of total consumption. Scientific opinion on climate change, ocean warming, etc. suggests we are near that limit. As a consequence, curbing population and consumption seems reasonable, if not inevitable.
I began with three stances on the impact of exponentially improving technology on employment and the economy: head in the sand, dystopian and (somewhat) utopian. My impression is that sentiment is moving today from the first to the second. That’s a bit depressing! Business intelligence and analytics offer the possibility to explore some more innovative and positive implications. I invite those of you with the tools and skills, as well as access to the relevant economic data, to take another look at what’s possible. Despite all our fears, perhaps it’s now the moment to consider a possible (somewhat) utopian future! Good luck with your analysis…
It’s indisputable that technology is displacing many of today’s jobs. The question is: what should, or can, we do about it? This series explores the possible consequences of this shift and how information use and decision making support can, and should, drive a better outcome through enhanced and expanded Business Intelligence. Part 3 is dystopian; I apologize in advance.
“The massive forces of globalization and technological progress are removing the need for a lot of the previous kind of white-collar workers,” according to Andrew McAfee of the Center for Digital Business at the M.I.T. Sloan School of Management in a recent New York Times article. It’s just the logical outcome of the trends described in Part 1 and Part 2 of this series. The outcome is the increasing technological displacement of traditional middle and lower-middle job types, combined with continuing downward pressure on wages for these jobs. In response, job seekers are forced to accept lower skilled and paid jobs, often as gig work, without long term security, as well as holding down multiple jobs to make ends meet. Lower incomes and less leisure time will drive down consumption of mass-produced goods, and cause producers to cut costs further, driving further unemployment. A classic race to the bottom. This simple analysis applies to Western economies. In developing economies, further factors come into play, which deserve deeper consideration, but the end result would appear to be largely the same. The economic impacts are severe. In fact, the economic model—as we have operated it for more than two centuries—becomes untenable.
This is not at all about the allegedly coming Singularity; it’s simply about the progression of technology. When technology displaces some yet to be determined percentage of labor, this system becomes unbalanced: there are simply not enough people with sufficient money to buy the products made, no matter how cheaply. We have not yet reached this tipping point because, throughout most of the past two hundred years, the new jobs created by technology have largely offset the losses. However, recent employment trends in the Western world suggest that this offsetting effect is weakening.
The subsequent societal disruption can be imagined to be catastrophic. Increasing inequality, already visible today, drives social unrest. Strikes, both legal and “wildcat”, become widespread. Many people, whose sense of identity and self-worth is tied to a productive job, drop out, abuse drugs and self-destruct. Violent protest against technologically driven change, already visible around Uber, grows by leaps and bounds. Economic migration, within and across national borders, in search of sustenance becomes endemic. Ghettoization of society ensues: vast sprawling near-shanty towns house the disempowered, while the reducing numbers of the elite retreat into gated communities behind high walls, razor wire and armed patrols.
The outcome may not (yet) reach the “Mad Max” scenario, but a visitor to Brazil or South Africa—to name but two of many examples—can immediately get an idea of how such a dystopian society can emerge, and is already doing so. Crime becomes a way of life, corruption abounds and society disintegrates. I believe that the most likely outcome of the “head in the sand” stance being taken by many economists and most politicians today is to end badly in a dystopian nightmare.
Whither BI in such an environment? With marketing, customer service and even worker productivity having become memories of a bygone era, the role of BI must inevitably move to the maintenance of wealth and power for those who have them. While Mad Max may focus on mean machines built from scrap automobiles (and they make for more visceral movies), the dispossessed will continue to hack communications and computer security, making the role of data analytics as a defense even more important. But it’s a restricted and increasingly inwardly-focused BI in dystopia.
As a technologist or data management expert following this series, I imagine that the head-in-the-sand and dystopian stances make, at best, distressing reading. Must it end like this? Is there anything that we can do to avoid the Fall? I believe there is. There is a better way, and as I shall demonstrate in the fourth and final part of this series, business intelligence, big data and analytics will be important enablers of the utopian stance.
It’s indisputable that technology is displacing many of today’s jobs. The question is: what should, or can, we do about it? This series explores the possible consequences of this shift and how information use and decision making support can, and should, drive a better outcome through enhanced and expanded Business Intelligence. Part 2 looks at the head in the sand reaction.
In Part 1 of this series, I introduced the three common stances that are taken when confronted with the issue of technological unemployment. Let’s take a deeper look at the first of them now.
Head in the sand
Many mainstream technologists and economists suggest that the jobs market is simply going through a period of re-adjustment—albeit a rather large and painful one—as new technology is adopted. This opinion seems founded mostly on the basis that in previous technology revolutions, such as the move from agriculture to industry in the 1800s and the move from industry to services still ongoing, new jobs have always been created to replace those displaced. Of course, the above timeframes apply to Western economies; emerging economies are at different stages in these transitions. The proposed solutions center on improved and ongoing education, as well as skills diversification. The underlying premise is that there exist, or will soon be created, jobs where robots or algorithms cannot perform better, faster and/or especially cheaper than humans.
The history of predictions of what automated software/hardware solutions cannot do gives little confidence, however. To give but one example, in “The New Division of Labor”, in 2004, the authors describe how driving an automobile requires such complex, instantaneous decisions and actions that it would be extremely difficult for a computer ever to handle it; Google debuted its autonomous car within six years. To be fair to the authors, few people actually get the consequences of the exponential growth rate in computing power that doubles every two years or so. Today’s computers are some 30-40 times more powerful and considerably more cost effective than those of 2004. Whether driving cars or analyzing images for cancerous cells, picking goods from warehouse shelves or making evidence-based recommendations or predictions, technology is displacing an ever-increasing number of previously human activities. My recent TechCrunch article gives some idea of the numbers: they’re not pretty.
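The arithmetic behind that doubling is worth a quick sanity check. The sketch below simply compounds a two-year doubling period over an assumed span of ten or eleven years since 2004; the function name and the choice of years are my own illustrative assumptions, and real hardware progress is, of course, lumpier than a clean exponential.

```python
def growth_factor(years, doubling_period_years=2.0):
    """Multiplicative growth after `years` if capacity doubles every period."""
    return 2.0 ** (years / doubling_period_years)


# Assumed elapsed time since 2004: roughly a decade.
for years in (10, 11):
    print(f"{years} years -> about {growth_factor(years):.0f}x")
```

Ten years of doubling every two years gives a factor of 32, and eleven years gives about 45, which is in the same ballpark as the 30-40 times quoted above.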
On the plus side, new job types are indeed being created. However, their numbers seem small in comparison to those being displaced. A brief review of the possible top jobs in the next ten years, including sex workers (!), from three leading futurists does little to convince that the jobs envisaged will replace the roughly 4 million driving and support jobs threatened by autonomous cars and trucks.
A recent Fortune article offers the more hopeful view that jobs demanding human accountability, collaborative decision making and interpersonal skills will both be in demand and resistant to automation. I will return to this possibility, in conjunction with “real” BI (actually, Business unIntelligence), as key aspects of the (somewhat) utopian stance. However, from a more contrary viewpoint, we also see robotics aimed at displacing roles that demand human empathy and interaction. The US National Science Foundation (NSF) is spending roughly $1.2 million to fund research on how robots could dress the elderly. Meanwhile, SoftBank has created Pepper, “a social robot able to converse with you, recognize and react to your emotions, move and live autonomously”—seriously!
In the head in the sand stance, business intelligence (BI) plays the traditional role for which it is widely criticized in many businesses: as a means of justification of and reporting on maintaining the status quo. There are always facts and figures to be found and trends to be discovered that justify any viewpoint, especially a mainstream, entrenched view. And, who better to do that than those with their heads in the sand and a deep attachment to the mechanistic, overly rational decision making approaches of the past? In these circumstances, BI definitely makes a meaningful contribution for those involved, but it offers nothing to the understanding or solution of the real issue involved here. Namely, from where will the new sources of income emerge that keep the old consumerist wheel turning?
In part 3 of this series, I address one possible outcome of the wheel seizing up: the dystopian stance where the economy crashes and burns.
It’s indisputable that technology is displacing many of today’s jobs. The question is: what should, or can, we do about it? This series explores the possible consequences of this shift and how information use and decision making support can, and should, drive a better outcome through enhanced and expanded Business Intelligence.
I’ve written occasionally, and at length in a Feb-Mar 2014 series, on the impact of technology advances on employment. My basic thesis was, and is, as follows. Mass production and competition, facilitated by ever improving technology, have been delivering better and cheaper products and improving many people’s lives (at least in the developed world) for nearly two centuries. Capital, in the form of technology, and people, in the form of labor, have worked together relatively well in the consumer society to produce goods that people purchase largely using earnings from their labor. Until now…
As technology grows exponentially better, the return on capital investment in automation technology is improving significantly in comparison to return on investment in labor. The primary goal of the capitalist model is to maximize return on investment. As a result, an ever greater range of jobs becomes open to displacement by technology. To me, at least, the above logic is largely unarguable. For example, driverless vehicles, from trucks to automobiles, are set to eliminate some 4 million jobs in the US alone. Any complacency that only manual/physical jobs will be displaced by automation is erroneous; many administrative and professional roles are already being outsourced to rapidly improving software solutions. Across the entire gamut of industries and job roles, technology (hardware, software and, increasingly, a combination of both) is proving better and/or faster than human labor, and is indisputably cheaper, particularly in developed consumer economies.
What are the possible outcomes from such a dramatic shift in the relative roles and importance of capital (technology) and labor (people)? Let’s keep it simple and restrict the discussion to three main stances that I’ll introduce briefly here, but consider later in depth:
- Head in the sand: the belief of many mainstream technologists and economists that we’re simply going through an adjustment period, after which “normal service will resume” in the market
- Dystopian: the story that our economic and social system is so deeply embedded and increasingly fragile that the shock of such change will lead to a rapid descent to a “Mad Max” world order
- (Somewhat) Utopian: the possibility that we can create a better world for everyone through automation and the transformation of our current economic and social paradigms
Of course, my preference is for option three above! But, how might it work and how would we get there? I believe that judicious application of many of the principles and approaches of Business Intelligence (BI), data warehousing and big data governance, in the broadest sense of the concepts, will play a vital role in the new world, and particularly in the transition to it. BI et al. is fundamentally about how decisions are made and how the people who make them can be supported. And business includes the business of government. In the old, narrow sense, BI meant simply providing data from internal systems to decision makers. In the widest sense, which I call “Business unIntelligence”, it encompasses the full scope of such decision making support, from the ingestion and contextualization of all real-world information to the psychological and sociological aspects involved in real humans making optimal decisions. These are decisions that increasingly need to go beyond the bottom line of profit.
As of now, I’m not clear where this discussion will take us. But I’d love to incorporate your views and comments. In the next post, I’ll explore the above-mentioned possible stances on the effects of technological unemployment.
Part 2 tackles the head in the sand stance.
The need to clarify the context of information is becoming vital as big data and the Internet of Things become ever more important sources in today’s biz-tech ecosystem.
Suddenly, it seems, it’s almost three months since my last blog entry. My apologies to readers: it’s been a busy time with consulting work, slide preparation for a number of upcoming events in Munich, Rome and Singapore over the coming weeks, and a revamp of my website with a cleaner, fresher look and a mobile friendly layout.
I pick up on a topic that’s close to my heart: the discovery and creation of context around information, triggered by last week’s BBBT appearance of a new startup, Alation, specializing in this very area. It’s a hot topic at present with a variety of new companies and acquisitions making the news over the past 6 to 12 months.
For a number of years now, the IT industry has been besotted with big data. The trend is set to continue as the Internet of Things offers an ever expanding set of bright, shiny, data-producing baubles. The increasing use of data, in real time and at high volumes, is driving a biz-tech ecosystem where business value and competition depend entirely on the effective use of IT. What the press often misses (and many of the vendors and analysts too) is that such data is meaningless and, thus, close to useless unless its context can be determined or created. Some point to metadata as the solution. However, as I’ve explored at length in my book, “Business unIntelligence”, metadata is really too small a word to cover the topic. I prefer to call it context setting information (CSI), because it’s information rather than data, its role is simply to set the context of other information, and, ultimately, it is indistinguishable from business information: one man’s business information is another woman’s CSI. In order to describe the full extent of context setting information, I introduced m³, the modern meaning model, that relates information to knowledge and meaning, as shown above. A complete explanation of this model is beyond the scope of this blog, so let’s return to Alation and what’s interesting about the product.
Alation CEO, Satyen Sangani, @satyx, posed the question of what it means to be data literate. At a basic level, this is about knowing what a field means, what a table contains or how a column is calculated. Pressing a little further, questions about the source and currency of data, in essence its quality, arise. Social aspects of its use, such as how often it has been used and who uses it for what, complete the picture. Understanding this level of context about data is a vital prerequisite for its meaningful use within the business.
When dealing with externally sourced data, where precise meanings of fields or calculations of values are unreliable or unavailable, the social and quality aspects of CSI become particularly important. It is often pointed out that data scientists can spend up to 80% of their time “wrangling” big data (see my last blog on Trifacta). However, what is often missed is that this 80% may be repeated again and again by different data scientists at different times on the same data, because the results of prior thinking and analysis are not easily available for reuse. To address this, Alation goes beyond gathering metadata like schemas and comments from databases and data stores to analyzing documentation from wikis to source code, gathering query and usage data, and linking it all to the identity of people who have created or used the data. Making this CSI available in a collaborative fashion to analysts, stewards and IT enables use cases from discovery and analytics to data optimization and governance.
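To give a feel for what a single piece of CSI might hold, here is a minimal sketch of a record that ties a column's meaning and source to the people who have used it and why. This is purely my own illustration: the class, its fields and the sample values are hypothetical and say nothing about how Alation actually stores or exposes this information.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ContextRecord:
    """Hypothetical context setting information for one column of data."""
    column: str
    meaning: str          # what the field means or how it is calculated
    source: str           # where the data came from (a quality aspect)
    uses: List[Tuple[str, str]] = field(default_factory=list)  # (user, purpose)

    def record_use(self, user: str, purpose: str) -> None:
        """Log who used the column and for what (the social aspect)."""
        self.uses.append((user, purpose))

    def usage_by_user(self) -> Counter:
        """Count how often each person has used the column."""
        return Counter(user for user, _ in self.uses)


record = ContextRecord(
    column="net_revenue",
    meaning="gross revenue minus refunds",
    source="external partner feed",
)
record.record_use("ana", "churn analysis")
record.record_use("ben", "quarterly report")
record.record_use("ana", "pricing model")

print(record.usage_by_user().most_common(1))
```

Even a structure this simple lets the next data scientist see that someone has already worked out what the field means and who to ask about it, rather than repeating the 80%.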
This broad market is red-hot at the moment and rightly so. Big data and the Internet of Things demand a level of context setting previously unheard of. I’ve previously mentioned products in this space, such as Waterline Data Science and Teradata Loom. A challenge they all face is how to define a market that does not carry the baggage of old failed or difficult initiatives such as metadata management, data governance or information quality. Don’t get me wrong, these are all vital initiatives; they have just received very bad press over the years. In addition, there is a strong need to move from perceived IT-centric approaches to something much more business driven. Might I suggest context setting information as a convenient and clarifying category?
When it comes to externally-sourced data, data scientists are left to pick up the pieces. New tools can help, but let’s also address the deeper issues.
Trifacta presented at the Boulder BI Brain Trust (#bbbt) last Friday, 13 March to a generally positive reaction from the members. In a sentence, @Trifacta offers a visual data preparation and cleansing tool for (typically) externally-sourced data to ease the burden on data scientists, as well as other power data users, who today can spend 80% of their time getting data ready for analysis. In this, the tool does a good job. The demo showed an array of intuitively invoked methods for splitting data out of fields, assessing the cleanliness of data within a set, correcting data errors, and so on. As the user interacts with the data, Trifacta suggests possible cleansing approaches, based on both common algorithms and what the user has previously done when cleaning such data. The user’s choices are recorded as transformation scripts that preserve the lineage of what has been done and that can be reused. Users start with a sample of data to explore and prove their cleansing needs, with the scaled-up transformations running on Hadoop within a monitoring and feedback loop.
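The idea of recording a user's cleansing choices as a reusable script with lineage can be sketched in a few lines. What follows is my own toy illustration of the general pattern, not Trifacta's script format or API: every class, function and field name here is hypothetical.

```python
class TransformScript:
    """A toy, ordered list of named cleansing steps that can be replayed."""

    def __init__(self):
        self.steps = []  # list of (description, function) pairs

    def add_step(self, description, fn):
        """Record a cleansing step; returns self so calls can be chained."""
        self.steps.append((description, fn))
        return self

    def run(self, rows):
        """Apply every recorded step, in order, to a copy of the rows."""
        out = [dict(row) for row in rows]
        for _, fn in self.steps:
            out = [fn(row) for row in out]
        return out

    def lineage(self):
        """The ordered descriptions of what has been done to the data."""
        return [description for description, _ in self.steps]


def split_name(row):
    """Split a 'name' field into 'first' and 'last' at the first space."""
    first, _, last = row["name"].partition(" ")
    return {**row, "first": first, "last": last}


def tidy_city(row):
    """Trim whitespace and normalize capitalization of the 'city' field."""
    return {**row, "city": row["city"].strip().title()}


script = (
    TransformScript()
    .add_step("split name into first/last", split_name)
    .add_step("trim and title-case city", tidy_city)
)

sample = [{"name": "Ada Lovelace", "city": "  london "}]
cleaned = script.run(sample)
print(cleaned)
print(script.lineage())
```

The same script object, proven on a small sample, could then be run unchanged over the full data set, and its lineage tells anyone downstream exactly what was done to the data before they received it.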
This is clearly a useful tool for the data scientist and power user that tackles a persistent bottleneck in the journey from data to insight. It also prompts discussion on the process that should exist around the ingestion and use of external data.
There is a persistent desire to reduce the percentage (to zero if possible!) of time spent by data scientists in preparing and cleansing data. Yet, if we accept that such practitioners are indeed scientists, we should recognize that in “real” science, most of the effort goes into experiment design, construction and data gathering/preparation; the statistical validity and longer term success of scientific work depends on this upfront work. Should it be different with data scientists? I believe not. The science resides in the work of experimentation and preparation. Of course, easing the effort involved and automating reuse is always valid, so Trifacta is a useful tool. But we should not be fooled into thinking that the oft quoted 80% can or should be reduced to even 50% in real data science cases. And among power users, their exploration of data is also, to some degree, scientific research. Preparation and discovery are iterative and interdependent processes.
What is often further missed in the hype around analytics is that after science comes engineering: how to put into production the process and insights derived by the data scientists. While there is real value in the “ah-ha” moment when the unexpected but profitable correlation (or even better, in a scientific view, causation) is found, the longer term value can only be wrought by eliminating the data scientists and explorers, and automating the findings within the ongoing processes of the business. This requires reverting to all the old-fashioned procedures and processes of data governance and management, with the added challenge that the incoming data is, almost by definition, dirty, unreliable, changeable, and a list of other undesirable adjectives. The knowledge of preparation and cleansing built by the data scientists is key here, so Trifacta’s inclusion of lineage tracking is an important step towards this move to production.
Remember lastminute.com? How is this for their last word on personal data?
“Important information about your personal data
With effect from today, the lastminute.com business has been acquired by Bravofly Rumbo Group. As a result, your personal data has been transferred to LMnext UK Ltd (a member of the Bravofly Rumbo Group) registered in England and Wales with company registration number 9399258.
LMnext UK Ltd is committed to respect the confidentiality of your personal data and will process it fairly and lawfully and in accordance with applicable data protection law.
You are also reminded that you may exercise your rights of access, rectification or removal of your personal data from our database at any time by sending a written request to lastminute.com, Dukes Court, Duke Street, Woking, Surrey, GU21 5BH providing a copy of your ID.
Please do not hesitate to contact us if you have any queries
The team at lastminute.com and Bravofly Rumbo Group”
I assume that they know my name, since they are holding my personal data, but they can’t rise to a mail-merge process for customer relationship?
More irritatingly, they demand a physical instruction with a scan of my ID for a removal. Why? Is it because there is more interesting data about me to be scraped from said ID? Or is it just to discourage me from asking?
So, no, I won’t be asking for removal from their database. Nor will I ever do business with them or any company to whom they pass my data. This e-mail is symptomatic of the lack of respect in which many companies hold our personal data. In itself, it is not a big deal. But, taken in a broader context, it epitomises the old adage: caveat emptor or even caveat scriptor!
In building out its Internet of Things, is HDS acquiring a data refinery, a data lake or a data swamp? See also Part 1
The Data Lake has been filling up nicely since its 2010 introduction by James Dixon, with a number of vendors and analysts sailing forth on the concept. Its precise, architectural meaning has proven somewhat fluid, to continue the metaphor. I criticized it in an article in April last, struggling to find a firm basis for discussion of a concept that is so architecturally vague that it has already spawned multiple interpretations. Dixon commented in a September blog that I was mistaken and set forth that: “A single data lake houses data from one source. You can have multiple lakes, but that does not equal a data mart or data warehouse” and “A Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake.” This doesn’t clarify much for me, especially when read in conjunction with Dixon’s response to one of his commenters: “The fact that [Booz Allen Hamilton] are putting data from multiple data sources into what they call a ‘Data Lake’ is a minor change to the original definition.”
This “minor change” is actually one of the major problems I see from a data management viewpoint, and Dixon admits as much in his next couple of sentences. “But it leads to confusion about the model because not all of the data is necessarily equal when you do that, and metadata becomes much more of an issue. In practice these conceptual differences won’t make much, if any, impact when it comes to the implementation. If you have two data sources your architecture, technology, and capabilities probably won’t differ much whether you consider it to be one data lake or two.” In my opinion, this is the sort of weak-as-water architectural thinking about data that can drown implementers very quickly indeed. Apply it to the data swamp that is the Internet of Things, and I am convinced that you will end up on the Titanic. Given the obvious focus of HDS on the IoT, alarm bells are already ringing loudly indeed.
But there’s more. Recently, Dixon has gone further, suggesting that the Data Lake could become the foundation of a cleverly named “Union of the State”: a complete history of every event and change in data in every application running in the business, an “Enterprise Time Machine” that can recreate on demand the entire state of the business at any instant of the past. In my view, this concept has many philosophical misunderstandings, business misconceptions, and technical impracticalities. (For a much more comprehensive and compelling discussion of temporal data, I recommend Tom Johnston’s “Managing Time in Relational Databases: How to Design, Update and Query Temporal Data”, which actually applies far beyond relational databases.) However, within the context of the HDS acquisition, my concern is how to store, never mind manage, the entire historical data record of even that subset of the Internet of Things that would be of interest to Hitachi or one of its customers. To me, this would truly result in a data quagmire of unimaginable proportions and projects of such size and complexity that they would dwarf even the worst data warehouse or ERP project disasters we have seen.
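A toy version of the “time machine” idea shows both its appeal and its cost. The sketch below is my own illustration, not Dixon’s design: it keeps every change as an event and rebuilds the state at a chosen instant by replaying the log. The Event class, the state_as_of function and the sample orders are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Event:
    at: datetime            # when the change happened
    key: str                # which business object changed
    value: Optional[str]    # new value; None means the object was deleted


def state_as_of(events: Iterable[Event], instant: datetime) -> Dict[str, str]:
    """Rebuild the state at `instant` by replaying all earlier events in order."""
    state: Dict[str, str] = {}
    for event in sorted(events, key=lambda e: e.at):
        if event.at > instant:
            break
        if event.value is None:
            state.pop(event.key, None)
        else:
            state[event.key] = event.value
    return state


log = [
    Event(datetime(2015, 1, 1), "order-1", "open"),
    Event(datetime(2015, 1, 5), "order-1", "shipped"),
    Event(datetime(2015, 1, 3), "order-2", "open"),
    Event(datetime(2015, 1, 9), "order-2", None),
]

print(state_as_of(log, datetime(2015, 1, 4)))
print(state_as_of(log, datetime(2015, 1, 10)))
```

Even in this toy, nothing may ever be discarded and each query replays the history from the beginning. Snapshots and indexes can soften that, but the log itself still grows with every event from every source, which is the storage and management concern raised above.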
To me, the Data Lake concept is vaguely defined and dangerous. I can accept its validity as a holding pond for the vast quantities of data that pour into the enterprise at high speed, with ill-defined and changeable structures, and often dubious quality. For immediate analysis and quick, but possibly dirty, decisions, a Data Lake could be ideal. Unfortunately, common perceptions of the Data Lake are that, in the longer term, all of the data in the organization could reside there in its original form and structure. This is, in my view, and in the view of Gartner analysts and Michael Stonebraker, to name but a few, not only dangerous in terms of data quality but a major retrograde step for all aspects of data management and governance.
Dixon says of my original criticism: “Barry Devlin is welcome to fight a battle against the term ‘Data Lake’. Good luck to him. But if he doesn’t like it he should come up with a better idea.” I fully agree: tilting at well-established windmills is pointless. And as we discovered in our last EMA/9sight Big Data survey (available soon, see preview presentation from January), Data Lake implementations, however variously defined, are already widespread. I believe I have come up with a better idea, too, in the IDEAL and REAL information architectures, defined in depth in my book, Business unIntelligence.
To close on the HDS acquisition of Pentaho, I believe it represents a good deal for both companies. Pentaho gets access to a market and investment stream that can drive and enhance its products and business. And, IoT is big business. HDS gets a powerful set of tools that complement its IoT direction. Together, the two companies should have the energy and resources to clean up the architectural anomalies and market misunderstandings of the Data Lake by formally defining the boundaries and describing the structures required for comprehensive data management and governance.
In building out its Internet of Things, is HDS acquiring a data refinery, a data lake or a data swamp?
This week’s announcement of Hitachi Data Systems’ (HDS, @HDScorp) intention to acquire @Pentaho poses some interesting strategic and architectural questions about big data that are far more important than the announcement’s bland declaration about it being “the largest private big data acquisition transaction to date”. We also need to look beyond the traditional acquisition concerns about integrating product lines, as the companies’ products come from very different spaces. No, the real questions circle around the Internet of Things, the data it produces, and how to manage and use that data.
As HDS and Pentaho engaged as partners and flirted with the prospect of marriage, we may assume that for HDS, aligning with Hitachi’s confusingly named Social Innovation Business was key. Coming from BI, you might imagine that Social Innovation refers to social media and other human-sourced information. In fact, it is Hitachi’s Internet of Things (IoT) play. Hitachi, as a manufacturer of everything from nuclear power plants to power tools, from materials and components to home appliances, as well as being involved in logistics and financial services, is clearly positioned at the coalface of IoT. With data as the major product, the role of HDS storage hardware and storage management software is obvious. What HDS lacked was the software and skills to extract value from the data. Enter Pentaho.
Pentaho comes very much from the BI and, more recently, big data space. Empowering business users to access and use data for decision making has been their business for over 10 years. Based on open source, Pentaho have focused on two areas. First, they provide BI, analysis and dashboard tools for end-users. Second, they offer data access and integration tools across a variety of databases and big data stores. Both aspects are certainly of interest to HDS. Greg Knieriemen (@Knieriemen), Hitachi Data Systems Technology Evangelist, agrees and adds big data and cloud embedding for good measure. The BI and analytics aspect is straightforward: Pentaho offers a good set of functionality and it’s open source. A good match for the HDS needs and vision, job done. The fun begins with data integration.
Dan Woods (@danwoodsearly) lauds the acquisition and links it to his interesting concept of a “Data Supply Chain… that accepts data from a wide variety of sources, both internal and external, processes that data in various nodes of the supply chain, passing data where it is needed, transforming it as it flows, storing key signals and events in central repositories, triggering action immediately when possible, and adding data to a queue for deeper analysis.” The approach is often called a “data refinery”, by Pentaho and others. Like big data, the term has a range of meanings. In simple terms, it is an evolution of the ETL concept to include big data sources and a wider range of targets. Mike Ferguson (@mikeferguson1) provides perhaps the most inclusive vision in a recent white paper (registration required). However broadly or narrowly we define data refinery, HDS is getting a comprehensive set of tooling from Pentaho in this space.
However, along with Pentaho’s data integration tooling, HDS is also getting the Data Lake concept, through its cofounder and CTO, James Dixon, who could be called the father of the Data Lake, having introduced the term in 2010. This could be more problematical, given the debates that rage between supporters and detractors of the concept. I fall rather strongly in the latter camp, so I should, in fairness, provide context for my concerns by reviewing some earlier discussions. This deserves more space than I have here, so please stay tuned for part 2 of this blog!