It’s indisputable that technology is displacing many of today’s jobs. The question is: what should, or can, we do about it? This series explores the possible consequences of this shift and how information use and decision-making support, through enhanced and expanded Business Intelligence, might help.
I’ve written occasionally, and at length in a Feb-Mar 2014 series, on the impact of technology advances on employment. My basic thesis was, and is, as follows. Mass production and competition, facilitated by ever-improving technology, have been delivering better and cheaper products and improving many people’s lives (at least in the developed world) for nearly two centuries. Capital, in the form of technology, and people, in the form of labor, have worked together relatively well in the consumer society to produce goods that people purchase largely using earnings from their labor. Until now…
As technology grows exponentially better, the return on capital investment in automation technology is improving significantly in comparison to return on investment in labor. The primary goal of the capitalist model is to maximize return on investment. As a result, an ever greater range of jobs become open to displacement by technology. To me, at least, the above logic is largely unarguable. For example, driverless vehicles, from trucks to automobiles, are set to eliminate some 4 million jobs in the US alone. Any complacency that only manual/physical jobs will be displaced by automation is erroneous; many administrative and professional roles are already being outsourced to rapidly improving software solutions. Across the entire gamut of industries and job roles, technology—both hardware and software; and, increasingly, a combination of both—is proving better and/or faster than human labor, and is indisputably cheaper, particularly in developed consumer economies.
What are the possible outcomes from such a dramatic shift in the relative roles and importance of capital (technology) and labor (people)? Let’s keep it simple and restrict the discussion to three main stances that I’ll introduce briefly here, but consider later in depth:
- Head in the sand: the belief of many mainstream technologists and economists that we’re simply going through an adjustment period, after which “normal service will resume” in the market
- Dystopian: the story that our economic and social system is so deeply embedded and increasingly fragile that the shock of such change will lead to a rapid descent to a “Mad Max” world order
- (Somewhat) Utopian: the possibility that we can create a better world for everyone through automation and the transformation of our current economic and social paradigms
Of course, my preference is for option three above! But how might it work and how would we get there? I believe that judicious application of many of the principles and approaches of Business Intelligence (BI), data warehousing, big data and governance, in the broadest sense of the concepts, will play a vital role in the new world, and particularly in the transition to it. BI et al. is fundamentally about how decisions are made and how the people who make them can be supported. And business includes the business of government. In the old, narrow sense, BI meant simply providing data from internal systems to decision makers. In the widest sense, which I call “Business unIntelligence”, it encompasses the full scope of such decision making support, from the ingestion and contextualization of all real-world information to the psychological and sociological aspects involved in real humans making optimal decisions. Decisions that increasingly need to go beyond the bottom line of profit.
As of now, I’m not clear where this discussion will take us. But I’d love to incorporate your views and comments. In the next post, I’ll explore the above-mentioned possible stances on the effects of technological unemployment.
The need to clarify the context of information is becoming vital as big data and the Internet of Things become ever more important sources in today’s biz-tech ecosystem.
Suddenly, it seems, it’s almost three months since my last blog entry. My apologies to readers: it’s been a busy time with consulting work, slide preparation for a number of upcoming events in Munich, Rome and Singapore over the coming weeks, and a revamp of my website with a cleaner, fresher look and a mobile friendly layout.
I pick up on a topic that’s close to my heart: the discovery and creation of context around information, triggered by last week’s BBBT appearance of a new startup, Alation, specializing in this very area. It’s a hot topic at present with a variety of new companies and acquisitions making the news over the past 6 to 12 months.
For a number of years now, the IT industry has been besotted with big data. The trend is set to continue as the Internet of Things offers an ever expanding set of bright, shiny, data-producing baubles. The increasing use of data, in real time and at high volumes, is driving a biz-tech ecosystem where business value and competition depend entirely on the effective use of IT. What the press often misses (and many of the vendors and analysts too) is that such data is meaningless and, thus, close to useless unless its context can be determined or created. Some point to metadata as the solution. However, as I’ve explored at length in my book, “Business unIntelligence”, metadata is really too small a word to cover the topic. I prefer to call it context setting information (CSI), because it’s information rather than data, its role is simply to set the context of other information, and, ultimately, it is indistinguishable from business information: one man’s business information is another woman’s CSI. In order to describe the full extent of context setting information, I introduced m³, the modern meaning model, that relates information to knowledge and meaning, as shown above. A complete explanation of this model is beyond the scope of this blog, so let’s return to Alation and what’s interesting about the product.
Alation CEO, Satyen Sangani, @satyx, posed the question of what it means to be data literate. At a basic level, this is about knowing what a field means, what a table contains or how a column is calculated. Pressing a little further, questions about the source and currency of data, in essence its quality, arise. Social aspects of its use, such as how often it has been used and who uses it for what, complete the picture. Understanding this level of context about data is a vital prerequisite for its meaningful use within the business.
When dealing with externally sourced data, where precise meanings of fields or calculations of values are unreliable or unavailable, the social and quality aspects of CSI become particularly important. It is often pointed out that data scientists can spend up to 80% of their time “wrangling” big data (see my last blog on Trifacta). However, what is often missed is that this 80% may be repeated again and again by different data scientists at different times on the same data, because the results of prior thinking and analysis are not easily available for reuse. To address this, Alation goes beyond gathering metadata like schemas and comments from databases and data stores to analyzing documentation from wikis to source code, gathering query and usage data, and linking it all to the identity of people who have created or used the data. Making this CSI available in a collaborative fashion to analysts, stewards and IT enables use cases from discovery and analytics to data optimization and governance.
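To make the three levels of context concrete (basic meaning, quality, and social use), here is a minimal sketch of what a single CSI record for one column might hold. It is purely illustrative: the class name, field names and sample values are my own assumptions and are not taken from Alation or any other product.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ColumnContext:
    """Hypothetical CSI record for one column; all field names are illustrative."""
    table: str
    column: str
    meaning: str = ""          # basic level: what the field means
    source: str = ""           # quality level: where the data came from
    last_refreshed: str = ""   # quality level: currency of the data
    usage: Counter = field(default_factory=Counter)   # social level: who queried it, how often
    notes: list[str] = field(default_factory=list)    # findings from prior analysis, kept for reuse

    def log_query(self, user: str) -> None:
        """Record that a user queried this column."""
        self.usage[user] += 1

    def top_users(self, n: int = 3) -> list[str]:
        """Return the n most frequent users of this column."""
        return [user for user, _ in self.usage.most_common(n)]


# Invented sample data, for illustration only.
ctx = ColumnContext("orders", "net_amt",
                    meaning="order value net of tax",
                    source="external feed")
for user in ["ana", "raj", "ana"]:
    ctx.log_query(user)
ctx.notes.append("negative values are refunds, not errors")
```

The `notes` list is the point of the sketch: once one analyst's finding is stored next to the data, the next analyst does not have to repeat the same wrangling to rediscover it.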
This broad market is red-hot at the moment and rightly so. Big data and the Internet of Things demand a level of context setting previously unheard of. I’ve previously mentioned products in this space, such as Waterline Data Science and Teradata Loom. A challenge they all face is how to define a market that does not carry the baggage of old failed or difficult initiatives such as metadata management, data governance or information quality. Don’t get me wrong, these are all vital initiatives; they have just received very bad press over the years. In addition, there is a strong need to move from perceived IT-centric approaches to something much more business driven. Might I suggest context setting information as a convenient and clarifying category?
When it comes to externally-sourced data, data scientists are left to pick up the pieces. New tools can help, but let’s also address the deeper issues.
Trifacta presented at the Boulder BI Brain Trust (#bbbt) last Friday, 13 March to a generally positive reaction from the members. In a sentence, @Trifacta offers a visual data preparation and cleansing tool for (typically) externally-sourced data to ease the burden on data scientists, as well as other power data users, who today can spend 80% of their time getting data ready for analysis. In this, the tool does a good job. The demo showed an array of intuitively invoked methods for splitting data out of fields, assessing the cleanliness of data within a set, correcting data errors, and so on. As the user interacts with the data, Trifacta suggests possible cleansing approaches, based on both common algorithms and what the user has previously done when cleaning such data. The user’s choices are recorded as transformation scripts that preserve the lineage of what has been done and that can be reused. Users start with a sample of data to explore and prove their cleansing needs, with the scaled-up transformations running on Hadoop within a monitoring and feedback loop.
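The idea of recording a user's cleansing choices as a reusable script that also documents lineage can be sketched in a few lines. This is not Trifacta's implementation or API; `TransformScript`, the step functions and the sample rows are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

Row = dict[str, str]


@dataclass
class TransformScript:
    """Hypothetical recorder: an ordered, described list of cleansing steps."""
    steps: list[tuple[str, Callable[[Row], Row]]] = field(default_factory=list)

    def record(self, description: str, fn: Callable[[Row], Row]) -> None:
        """Remember a step the user chose, with a human-readable description."""
        self.steps.append((description, fn))

    def apply(self, rows: list[Row]) -> list[Row]:
        """Replay every recorded step, in order, on each row."""
        cleaned = []
        for row in rows:
            for _, fn in self.steps:
                row = fn(row)
            cleaned.append(row)
        return cleaned

    def lineage(self) -> list[str]:
        """Describe what has been done to the data, in order."""
        return [description for description, _ in self.steps]


def split_name(row: Row) -> Row:
    first, _, last = row["name"].partition(" ")
    return {**row, "first": first, "last": last}


def tidy_city(row: Row) -> Row:
    return {**row, "city": row["city"].strip().title()}


script = TransformScript()
script.record("split 'name' into first/last", split_name)
script.record("trim and title-case 'city'", tidy_city)

# Explore on a small sample first, then replay the same script on the full set.
sample = [{"name": "Ada Lovelace", "city": "  london "}]
full = sample + [{"name": "Alan Turing", "city": "WILMSLOW"}]
cleaned_sample = script.apply(sample)
cleaned_full = script.apply(full)
```

In a real deployment the replay would run at scale on something like Hadoop rather than in a Python loop, but the principle is the same: the script, not the analyst's memory, carries the knowledge of how the data was prepared.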
This is clearly a useful tool for the data scientist and power user that tackles a persistent bottleneck in the journey from data to insight. It also prompts discussion on the process that should exist around the ingestion and use of external data.
There is a persistent desire to reduce the percentage (to zero if possible!) of time spent by data scientists in preparing and cleansing data. Yet, if we accept that such practitioners are indeed scientists, we should recognize that in “real” science, most of the effort goes into experiment design, construction and data gathering/preparation; the statistical validity and longer term success of scientific work depends on this upfront work. Should it be different with data scientists? I believe not. The science resides in the work of experimentation and preparation. Of course, easing the effort involved and automating reuse is always valid, so Trifacta is a useful tool. But, we should not be fooled that the oft quoted 80% can or should be reduced to even 50% in real data science cases. And among power users, their exploration of data is also, to some degree, scientific research. Preparation and discovery are iterative and interdependent processes.
What is often further missed in the hype around analytics is that after science comes engineering: how to put into production the process and insights derived by the data scientists. While there is real value in the “ah-ha” moment when the unexpected but profitable correlation (or even better, in a scientific view, causation) is found, the longer term value can only be wrought by eliminating the data scientists and explorers, and automating the findings within the ongoing processes of the business. This requires reverting to all the old-fashioned procedures and processes of data governance and management, with the added challenge that the incoming data is, almost by definition, dirty, unreliable, changeable, and a list of other undesirable adjectives. The knowledge of preparation and cleansing built by the data scientists is key here, so Trifacta’s inclusion of lineage tracking is an important step towards this move to production.
Remember lastminute.com? How is this for their last word on personal data?
“Important information about your personal data
With effect from today, the lastminute.com business has been acquired by Bravofly Rumbo Group. As a result, your personal data has been transferred to LMnext UK Ltd (a member of the Bravofly Rumbo Group) registered in England and Wales with company registration number 9399258.
LMnext UK Ltd is committed to respect the confidentiality of your personal data and will process it fairly and lawfully and in accordance with applicable data protection law.
You are also reminded that you may exercise your rights of access, rectification or removal of your personal data from our database at any time by sending a written request to lastminute.com, Dukes Court, Duke Street, Woking, Surrey, GU21 5BH providing a copy of your ID.
Please do not hesitate to contact us if you have any queries
The team at lastminute.com and Bravofly Rumbo Group”
I assume that they know my name, since they are holding my personal data, but they can’t rise to a mail-merge process for customer relationship?
More irritatingly, they demand a physical instruction with a scan of my ID for a removal. Why? Is it because there is more interesting data about me to be scraped from said ID? Or is it just to discourage me from asking?
So, no, I won’t be asking for removal from their database. Nor will I ever do business with them or any company to whom they pass my data. This e-mail is symptomatic of the lack of respect in which many companies hold our personal data. In itself, it is not a big deal. But, taken in a broader context, it epitomises the old adage: caveat emptor or even caveat scriptor!
In building out its Internet of Things, is HDS acquiring a data refinery, a data lake or a data swamp? See also Part 1
The Data Lake has been filling up nicely since its 2010 introduction by James Dixon, with a number of vendors and analysts sailing forth on the concept. Its precise, architectural meaning has proven somewhat fluid, to continue the metaphor. I criticized it in an article in April last, struggling to find a firm basis for discussion of a concept that is so architecturally vague that it has already spawned multiple interpretations. Dixon commented in a September blog that I was mistaken and set forth that: “A single data lake houses data from one source. You can have multiple lakes, but that does not equal a data mart or data warehouse” and “A Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake.” This doesn’t clarify much for me, especially when read in conjunction with Dixon’s response to one of his commenters: “The fact that [Booz Allen Hamilton] are putting data from multiple data sources into what they call a ‘Data Lake’ is a minor change to the original definition.”
This “minor change” is actually one of the major problems I see from a data management viewpoint, and Dixon admits as much in his next couple of sentences. “But it leads to confusion about the model because not all of the data is necessarily equal when you do that, and metadata becomes much more of an issue. In practice these conceptual differences won’t make much, if any, impact when it comes to the implementation. If you have two data sources your architecture, technology, and capabilities probably won’t differ much whether you consider it to be one data lake or two.” In my opinion, this is the sort of weak-as-water architectural thinking about data that can drown implementers very quickly indeed. Apply it to the data swamp that is the Internet of Things, and I am convinced that you will end up on the Titanic. Given the obvious focus of HDS on the IoT, alarm bells are already ringing loudly indeed.
But there’s more. Recently, Dixon has gone further, suggesting that the Data Lake could become the foundation of a cleverly named “Union of the State”: a complete history of every event and change in data in every application running in the business, an “Enterprise Time Machine” that can recreate on demand the entire state of the business at any instant of the past. In my view, this concept has many philosophical misunderstandings, business misconceptions, and technical impracticalities. (For a much more comprehensive and compelling discussion of temporal data, I recommend Tom Johnston’s “Managing Time in Relational Databases: How to Design, Update and Query Temporal Data”, which actually applies far beyond relational databases.) However, within the context of the HDS acquisition, my concern is how to store, never mind manage, the entire historical data record of even that subset of the Internet of Things that would be of interest to Hitachi or one of its customers. To me, this would truly result in a data quagmire of unimaginable proportions and projects of such size and complexity that would dwarf even the worst data warehouse or ERP project disasters we have seen.
To me, the Data Lake concept is vaguely defined and dangerous. I can accept its validity as a holding pond for the vast quantities of data that pour into the enterprise at high speed, with ill-defined and changeable structures, and often dubious quality. For immediate analysis and quick, but possibly dirty, decisions, a Data Lake could be ideal. Unfortunately, common perceptions of the Data Lake are that, in the longer term, all of the data in the organization could reside there in its original form and structure. This is, in my view, and in the view of Gartner analysts and Michael Stonebraker, to name but a few, not only dangerous in terms of data quality but a major retrograde step for all aspects of data management and governance.
Dixon says of my original criticism “Barry Devlin is welcome to fight a battle against the term ‘Data Lake’. Good luck to him. But if he doesn’t like it he should come up with a better idea.” I fully agree, tilting at well-established windmills is pointless. And as we discovered in our last EMA/9sight Big Data survey (available soon, see preview presentation from January), Data Lake implementations, however variously defined, are already widespread. I believe I have come up with a better idea, too, in the IDEAL and REAL information architectures, defined in depth in my book, Business unIntelligence.
To close on the HDS acquisition of Pentaho, I believe it represents a good deal for both companies. Pentaho gets access to a market and investment stream that can drive and enhance its products and business. And, IoT is big business. HDS gets a powerful set of tools that complement its IoT direction. Together, the two companies should have the energy and resources to clean up the architectural anomalies and market misunderstandings of the Data Lake by formally defining the boundaries and describing the structures required for comprehensive data management and governance.
In building out its Internet of Things, is HDS acquiring a data refinery, a data lake or a data swamp?
This week’s announcement of Hitachi Data Systems’ (HDS, @HDScorp) intention to acquire @Pentaho poses some interesting strategic and architectural questions about big data that are far more important than the announcement’s bland declaration about it being “the largest private big data acquisition transaction to date”. We also need to look beyond the traditional acquisition concerns about integrating product lines, as the companies’ products come from very different spaces. No, the real questions circle around the Internet of Things, the data it produces, and how to manage and use that data.
As HDS and Pentaho engaged as partners and flirted with the prospect of marriage, we may assume that for HDS, aligning with Hitachi’s confusingly named Social Innovation Business was key. Coming from BI, you might imagine that Social Innovation refers to social media and other human-sourced information. In fact, it is Hitachi’s Internet of Things (IoT) play. Hitachi, as a manufacturer of everything from nuclear power plants to power tools, from materials and components to home appliances, as well as being involved in logistics and financial services, is clearly positioned at the coalface of IoT. With data as the major product, the role of HDS storage hardware and storage management software is obvious. What HDS lacked was the software and skills to extract value from the data. Enter Pentaho.
Pentaho comes very much from the BI and, more recently, big data space. Empowering business users to access and use data for decision making has been their business for over 10 years. Based on open source, Pentaho have focused on two areas. First, they provide BI, analysis and dashboard tools for end-users. Second, they offer data access and integration tools across a variety of databases and big data stores. Both aspects are certainly of interest to HDS. Greg Knieriemen (@Knieriemen), Hitachi Data Systems Technology Evangelist, agrees and adds big data and cloud embedding for good measure. The BI and analytics aspect is straightforward: Pentaho offers a good set of functionality and it’s open source. A good match for the HDS needs and vision, job done. The fun begins with data integration.
Dan Woods (@danwoodsearly) lauds the acquisition and links it to his interesting concept of a “Data Supply Chain… that accepts data from a wide variety of sources, both internal and external, processes that data in various nodes of the supply chain, passing data where it is needed, transforming it as it flows, storing key signals and events in central repositories, triggering action immediately when possible, and adding data to a queue for deeper analysis.” The approach is often called a “data refinery”, by Pentaho and others. Like big data, the term has a range of meanings. In simple terms, it is an evolution of the ETL concept to include big data sources and a wider range of targets. Mike Ferguson (@mikeferguson1) provides perhaps the most inclusive vision in a recent white paper (registration required). However broadly or narrowly we define data refinery, HDS is getting a comprehensive set of tooling from Pentaho in this space.
However, along with Pentaho’s data integration tooling, HDS is also getting the Data Lake concept, through its cofounder and CTO, James Dixon, who could be called the father of the Data Lake, having introduced the term in 2010. This could be more problematical, given the debates that rage between supporters and detractors of the concept. I fall rather strongly in the latter camp, so I should, in fairness, provide context for my concerns by reviewing some earlier discussions. This deserves more space than I have here, so please stay tuned for part 2 of this blog!
Why, oh why does the relationship between analytics, automation, profit and employment seem to elude so many people?
A nicely rounded post by Scott Mongeau, “Manager-machine: analytics, artificial intelligence, and the uncertain future of management”, from last October came to my attention today via James Kobielus’ recent response, “Cognitive Computing and the Indelible Role of Human Judgment”. Together, they reminded me again of a real-world problem that has been bothering me since the publication of my book, “Business unIntelligence”.
Mongeau gives a reasoned analysis of the likely increasing impact of analytics and artificial intelligence on the role of management. His thesis appears very realistic: over the coming few decades, many of the more routine tasks of management will fall within the capability of increasingly powerful machines. From driverless cars to advanced logistics management, many more tasks only recently considered the sole remit of humans can be automated. Mongeau also provides a list of tasks where analytics and automation may never (or perhaps more slowly) encroach: he cites strategic decision making, and tasks requiring leadership and personal engagement, although, even in strategic decisions, IBM’s Watson is already making a play. He also offers some possible new job roles for displaced managers. However, he misses what I believe is the key implication, to which I’ll return in a moment.
Sadly, Kobielus misses the same point, choosing instead to focus on the irrefutable argument (at least for the foreseeable future) that there will always be some tasks where human judgment or oversight is required. Such tasks will remain, of course, with humans. A sideswipe at Luddism also adds nothing to the argument.
So, what is the missed implication? It seems self-evident, to me, at least, that manufacturing and increasingly services can be delivered more cheaply in many cases, using analytics and automation, by machines rather than people. As both analytics and automation improve exponentially according to Moore’s Law, the disparity can only increase. Therefore, industry progressively invests in the capital of hardware and software rather than labor, driven directly by the profit motive. Given that it is through their labor that the vast majority of consumers earn the money needed to buy industry’s goods and services, at what point will consumption be adversely affected by the resulting growing level of unemployment? This is not an argument about when, if ever, machines can do everything a person can do. It is simply about envisaging a tipping point when a sufficient percentage of the population can no longer afford the goods and services delivered by industry, no matter how cheaply.
Hence, the equation implied in the title of this post: analytics and automation, driven by profit, reduce employment. The traditional economic argument is that technology-driven unemployment has always been counteracted by new jobs at a higher level of skill for those displaced by the new technology. This argument simply cannot be applied in the current situation; the “skill level” of analytics and automation is increasing far faster (and actually accelerating) than that of humans.
So, I use this first post of 2015 to reiterate the questions I posed in a series of blogs early last year. To be very frank, I do not know what the answers should be. And the politicians, economists and business leaders, who should be leading the thinking in this area, appear to be fully disengaged. In summary, the quest is: how can we reinvent the current economic system in light of the reality that cheaper and more efficient analytics and automation are driving every industry to reduce or eliminate labor costs without consideration for the fact that employment is also the foundation for consumption and, thus, profit?
Image: Nexi. Credit: Spencer Lowell
“Gold is down almost 40% since it peaked in 2011. But it’s still up almost 350% since 2000. Although since 1980, on an inflation-adjusted basis, it’s basically flat. However, since the early-1970s it’s up over 7% per year (or about 3.4% after inflation).” Ben Carlson, an institutional investment manager, provides this wonderful example of how statistical data can be abused, in this case by playing with time horizons. Ben is talking about making investment decisions. Let me replay his conclusions, but with a more general view (my changes in bold).
“It’s very easy to cherry-pick historical data that fits your narrative to prove a point about anything. It doesn’t necessarily mean you’re right or wrong. It just means that the world is full of conflicting evidence because the results over most time frames are nowhere close to average. If the performance of everything was predictable over any given time horizon, there would be no risk.”
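A little arithmetic shows how far apart the stories drift when the same asset is measured over different horizons. The sketch below converts two of the cumulative figures quoted above into compound annual rates. The percentages come from the quote; the year counts (3 and 14) are my assumption that the figures were measured to 2014, and the function name is invented for illustration.

```python
def annualized(cumulative_return: float, years: float) -> float:
    """Convert a cumulative return (0.5 means +50%) into a compound annual rate."""
    return (1.0 + cumulative_return) ** (1.0 / years) - 1.0


# Cumulative figures from the quote; year counts are assumptions (measured to 2014).
horizons = {
    "since 2011 peak": (-0.40, 3),
    "since 2000": (3.50, 14),
}

rates = {label: annualized(total, years) for label, (total, years) in horizons.items()}
for label, rate in rates.items():
    print(f"{label}: {rate:+.1%} per year")
```

Under those assumptions, the first horizon comes out at a loss of roughly 16% per year and the second at a gain of roughly 11% per year. Both numbers are "true", and each supports the opposite narrative.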
We have entered a period of history where information has become super-abundant. It would be wise, I suggest, to consider all the ways this information can be misinterpreted or abused. Through ignorance, so-called confirmation bias, intention to deceive, and a dozen other causes, we can mislead, be misled, or slip into analysis paralysis. How can we avoid these pitfalls? Before attempting my own answer, let’s take a look at an example of dangerous thinking that can be found even among big data experts.
Jean-Luc Chatelain, a Big Data Technology & Strategy Executive, recently declared “an end to data torture” courtesy of Data Lakes. Arguing that a leading driver is cost, he says Data Lakes “enable massive amount of information to be stored at a very economically viable point [versus] traditional IT storage hardware”. While factually correct, this latter statement actually says nothing about overall cost, with the growth in data volumes probably exceeding the rate of decline in computing costs and, more importantly, the fact that data governance costs grow with increasing volumes and disparity of data stored.
More worryingly, he goes on to say: “the truly important benefit that Data-Lakes bring to the ‘information powered enterprise’ is… ‘High quality actionable insights’”. This conflation of vast stores of often poorly-defined and -managed data with high quality actionable insights flies in the face of common sense. High quality actionable insights more likely stem from high quality, well-defined, meaningful information rather than from large, ill-defined data stores. Actionable insights require the very human behavior of contextualizing new information within personal or organizational experience. No amount of Lake Data can address this need. Finally, choosing actions may be based on the best estimate of whether the information offers a valid forecast about the outcome… or may be based on the desires, intentions, vision, etc. of the decision maker, especially if the information available is deemed to be a poor indicator of the future likely outcome. And Chatelain’s misdirected tirade against ETL (extract, torture and lose, as he labels it) ignores most of the rationale behind the process in order to cherry-pick some well-known implementation weaknesses.
Whether data scientist or business analyst, the first step with data, especially with disparate, dirty data, is always to structure and cleanse it; basically, to make it fit for analytic purpose. Despite a very short history, it is already recognized that 80% or more of data scientists’ effort goes into this data preparation. Attempts to automate this process and to apply good governance principles are already underway from start-ups like @WaterlineData, @AlpineDataLabs as well as long-standing companies like @Teradata and @IBMbigdata. But, as always, the choice of what to use and how to use it depends on human skill and experience. And make no mistake, most big data analytics moves very quickly from “all the data” to a subset that is defined by its usefulness and applicability to the issue in hand. Big data rapidly becomes focused data in production situations. Returning again and again to the big data source for additional “insights” is governed by the law of diminishing returns.
It is my belief that our current fascination with collecting data about literally everything is taking us down a misleading path. Of course, in some cases, more data and, preferably, better data can offer a better foundation for insight and decision making. However, it is wrong to assume that more data always leads to more insight or better decisions. As in the past evolution of BI, we are again focusing on the tools and technology. Where we need to focus is on improving our human ability to contextualize data and extract valid meaning from it. We need to train ourselves to see the limits of data’s ability to predict the future and the privacy and economic dangers inherent in quantifying everything. We need to take responsibility for our intentions and insights, our beliefs and intuitions that underpin our decisions in business and in life.
“The data made me do it” is a deeply disturbing rationale.
Gartner’s new acronym, HTAP. What does it mean and why should you care?
What if we lived in a world where business users didn’t have to think about using different systems, depending on whether they wanted to use current or historical data? Where staff didn’t have to distinguish between running and managing the business? Where IT didn’t have to design and manage complex processes to copy and cleanse all data from operational systems to data warehouses and marts for business intelligence (BI)? The reality of today’s accelerating business drivers is that we urgently need to enable those new world behaviors of both business and IT.
My 2013 book, “Business unIntelligence”, described how the merging of business and technology is transforming and reinventing business processes. Such dramatic changes demand that current and historical data are combined in a more integrated, closed-loop way. In many industries, the success—and even survival—of companies will depend on their ability to bridge the current divide between their operational and informational systems. In 2014, Gartner (1) coined the term hybrid transaction/analytical processing (HTAP) to describe the same need. In terms of its implementation, they pointed to the central role of in-memory databases. This technology is certainly at the core, but other hardware and software considerations come into play.
My recent white paper “The Emergent Operational/Informational World” explores this topic in depth. Starting from the original divergence of operational and informational systems in the 1970s and 1980s, the paper explains how we arrived in today’s layered data world and why it must change, citing examples from online retailing, financial services and the emerging Internet of Things. It describes the three key technological drivers enabling the re-convergence: (1) in-memory databases, (2) techniques to reduce contention in data access, and (3) scaling out of relational databases. It also describes how the modern NuoDB relational database product addresses these drivers.
For a brief introduction to the topic, join me and Steve Cellini of NuoDB on December 2nd, 1pm EST for our webinar “The Future of Data: Where Does HTAP Fit?”
(1) Gartner Press Release, “Hybrid Transaction/Analytical Processing Will Foster Opportunities for Dramatic Business Innovation”, G00259033, 28 January 2014
Data-driven business looks like it’s emerging as the next big buzz phrase. Should you be worried?
Back in Cape Town after six weeks on the road in the US and Europe, my first task was to step on stage at Mammoth BI and do some myth busting about data-driven business.
Mammoth BI is the brain-child of local entrepreneur, Jason Haddock of Saratoga, a local IT solutions company. The one-day conference was modelled on the TED format of 15-minute entertaining and informative presentations, but focusing on big data, analytics and BI. This inaugural event was a big success, drawing a large audience to the Cape Town International Conference Centre, including a large number of university students, who were offered free attendance as a give-back to the community.
I was presenter number 13 of 17. Amongst 15 presenters extolling the virtues of big data and analytics, describing their successes and techniques, and one great professional comedian (Gareth Woods), my role was to be the old curmudgeon! Or more gently, the reality check against the overwhelming enthusiasm for data-driven solutions to every imaginable problem and data-based silver bullets for every opportunity. Among the many issues I could have chosen, here are the four myths I chose to bust:
- Collect all the data to get the best information: Not! Data Lakes epitomize this idea. Anybody who has been involved in data warehousing over the years should know that data is often dirty. Inconsistent, incomplete, simply incorrect. This is like pouring sewage into a lake. You need to be choosy about what you store and apply strict governance procedures.
- Decision-making is a fully rational, data-based process: Not! Lovers, advertisers and even executives know this is not true. Better and more trusted data can influence the direction of thinking, but many important decisions eventually come down to a mix of information, experience, emotions and intentions, sometimes called gut-feel or intuition. You need a mix of (wo)man and machine.
- Big data can be safely anonymized: Not! The ever increasing set of variables being collected about every man, woman and child is now so large that individuals can always be pinpointed and identified. Privacy is no longer an option. And target marketing can be a nice word for discrimination. Democracy also suffers when all opinions can be charted.
- Data-driven processes will solve world hunger: Not! While there are many benefits and opportunities to improve the world through big data and automation, the point that everybody seems to miss is that while the cost of goods drops ever lower by (among other factors) eliminating human labour, these displaced workers no longer have the cash to buy even the cheapest goods. Economists presume that new types of jobs will emerge, as happened in the industrial revolutions; unfortunately, none of them can imagine what those jobs might be.
These four problems are presented in order of increasing impact. Everybody in the data-driven industry needs to consider them carefully. I hope that I’m being too pessimistic, especially in the last two. Please prove me wrong! I’d love to return to the next Mammoth BI, planned for August 2015, with some real answers.
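To make the third myth a little more concrete, here is a minimal Python sketch of why stripping names does not by itself anonymize a data set. The records, the field names and the `unique_fraction` helper are all invented for illustration; they are assumptions, not taken from any real data set or tool. The sketch simply counts how many records become one of a kind as more variables are combined.

```python
from collections import Counter

# Hypothetical "anonymized" records: names are gone, but a few
# everyday variables (quasi-identifiers) remain.
records = [
    {"postcode": "8001", "birth_year": 1975, "gender": "F"},
    {"postcode": "8001", "birth_year": 1968, "gender": "F"},
    {"postcode": "8001", "birth_year": 1982, "gender": "M"},
    {"postcode": "7700", "birth_year": 1982, "gender": "M"},
    {"postcode": "7700", "birth_year": 1990, "gender": "F"},
]


def unique_fraction(rows, fields):
    """Return the share of rows whose combination of `fields` occurs only once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Each added variable makes more records one of a kind.
for fields in (["gender"],
               ["postcode", "gender"],
               ["postcode", "birth_year", "gender"]):
    print(fields, unique_fraction(records, fields))
```

In this toy data, nobody is singled out by one variable alone, but every record is unique once all three are combined. Real data sets carry far more variables per person, which is the heart of the concern.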