Largest cloud provider getting graph database ready for GA
Amazon Web Services (AWS) announced its entry to the graph database market at its AWS re:Invent conference in Las Vegas in November last year. It was a notable announcement for a couple of reasons. It was the first graph database from the company (which already offers a range of relational and NoSQL databases as a service). It also shone a rather bright light on a database category that has often been considered niche, complex and expensive.
Neptune is currently in preview before it reaches general availability, but we expect that to happen soon. So should you be bothered?
A graph database is one that uses graph structures to enable the data to be queried, using the concepts of nodes, edges and properties to represent and store data. The key concept is the fact that the graph directly records the relationships between different data items in the database. Because the graph links related objects directly, it means those that have a relationship with one another can often be retrieved in one operation.
In relational databases, there are no such direct connections between related objects as data is stored in rows and columns. To create a relationship between different elements developers must write a ‘join’. But joins can become unwieldy and affect database performance.
The characteristics of graph databases enable the simple and fast retrieval of complex hierarchical structures that would be harder or even prohibitively time-consuming to model in relational databases.
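To make the idea of nodes, edges and properties a little more concrete, here is a minimal sketch in Python. It is a toy in-memory structure, not how Neptune or any other product stores data; the class name TinyGraph, the KNOWS label and the sample people are all invented for illustration. The point it tries to show is that each relationship is recorded directly against the node it starts from, so following it is a lookup rather than a join.

```python
from collections import defaultdict


class TinyGraph:
    """Toy in-memory property graph: nodes and edges both carry properties."""

    def __init__(self):
        self.nodes = {}                      # node id -> properties dict
        self.out_edges = defaultdict(list)   # node id -> [(label, target id, properties)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, label, target, **props):
        # The relationship is stored directly against the source node,
        # so following it later needs no join.
        self.out_edges[source].append((label, target, props))

    def neighbours(self, node_id, label):
        """Return ids of nodes reached from node_id via edges with this label."""
        return [t for (l, t, _) in self.out_edges[node_id] if l == label]

    def traverse(self, start, label, depth):
        """Follow `label` edges up to `depth` hops and return every node reached."""
        seen, frontier = set(), {start}
        for _ in range(depth):
            reached = {n for f in frontier for n in self.neighbours(f, label)}
            frontier = reached - seen - {start}
            seen |= frontier
        return seen


# Hypothetical sample data.
g = TinyGraph()
for name in ("alice", "bob", "carol", "dave"):
    g.add_node(name, kind="person")
g.add_edge("alice", "KNOWS", "bob", since=2015)
g.add_edge("bob", "KNOWS", "carol", since=2016)
g.add_edge("carol", "KNOWS", "dave", since=2017)

friends_of_friends = g.traverse("alice", "KNOWS", depth=2)
```

In a relational schema the same two-hop question would typically need a self-join on a relationships table for each hop, which is where the unwieldiness mentioned above starts to creep in.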
The slight drawback with graph databases is that they cannot easily be queried with the de facto querying language for relational databases, Structured Query Language (SQL). Not only that, but in the graph database world there is not yet an equivalent de facto query language — there are a number of industry standard languages but there is likely to be a shakeout of some of these as graph databases become more popular and a clear winner possibly emerges.
Amazon says it built Neptune specifically for the cloud, which has its pluses and minuses. The drawback is there isn’t an on-premises version. The advantage though is that due to its economies of scale AWS tends to be able to offer good value subscriptions. As with other AWS managed services Amazon Neptune is highly available, with read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across AWS Availability Zones.
It can store billions of relationships and the graph can be queried with millisecond latency. Neptune supports encryption at rest and in transit. As for that thorny issue of which query languages to support, AWS has hedged its bets with the option of Apache TinkerPop Gremlin or SPARQL (Microsoft’s cloud graph offering, Azure Cosmos DB, supports Gremlin or Gremlin-compatible languages such as Apache Spark GraphX).
I would have liked to see Cypher, a language developed by graph database pioneer Neo4j, added alongside both, as I believe it has very widespread adoption. Neo4j donated it to the openCypher Project in 2015 and as well as Neo4j it’s supported in SAP HANA Graph, Redis and AgensGraph databases.
Early adopters of Neptune are likely to be existing AWS users who have some or all of their data in the cloud already: AWS already offers a range of databases including relational and NoSQL options.
Amazon envisages that Neptune will power graph use cases such as recommendation engines, fraud detection, knowledge graphs, drug discovery, and network security. Security is probably the most common area where graph databases have been pressed into action, but they are also used in logistics, supply chain management, master data management, life sciences, e-commerce and even the hospitality industry.
Companies having a play with Neptune in preview include AstraZeneca, Thomson Reuters, Siemens, and the Financial Industry Regulatory Authority (FINRA). Amazon has been looking into how it can use it to improve its own Amazon Alexa system.
I believe AWS’ move into the graph database space is significant for the sector. It will make it simpler than ever for people to have a play with a graph database inexpensively. With Neptune, you don’t need to worry about hardware provisioning, software patching, setup, configuration, or backups.
It’s not that there are not other graph-as-a-service offerings, but few have quite the reach of AWS. With so many companies already having at least some of their data on AWS, this is an opportunity to see what a graph database can do for you.
There are too many graph databases to mention them all here, but here is a selection of firms large and small (in alphabetical order) to add to those mentioned above. Most offer some kind of pre-production free trial, so you can kick the tires before you jump right in.
Do you have any experience of using graph databases? I’d be interested to hear your thoughts in the comments section.
IoT Back to Basics, chapter 4: IoT projects risk failure without careful consideration of data management processes and analytics. Their ultimate goal after all is to glean valuable information from data coming from the ‘things’ on the network – sensors and smart devices – in order to act on it.
So I thought I’d look at some of the novel trends in in-memory data processing: in-memory databases as well as data fabrics and data streaming engines.
The use of memory in computing is not new. But while memory is faster than disk by an order of magnitude, it is also an order of magnitude more expensive. That has for the most part left memory relegated to acting as a caching layer, while nearly all of the data is stored on disk. However in recent years, the cost of memory has been falling, making it possible to put far larger datasets in memory for data processing tasks, rather than use it simply as a cache.
It’s not just that it is now possible to store larger datasets in memory for rapid analytics, it is also that it is highly desirable. In the era of IoT, data often streams into the data centre or the cloud – the likes of sensor data from anything from a production line to an oilrig. The faster the organization is able to spot anomalies in that data, the better the quality of predictive maintenance. In-memory technologies are helping firms see those anomalies close to, or in, real-time. Certainly much faster than storing data in a disk-based database and having to move packets of data to a cache for analytics.
I expect take-up of in-memory data processing to accelerate dramatically, as companies come to grips with their data challenges and move beyond more traditional data analytics in the era of IoT. In-memory databases are 10 to 100 times faster than traditional databases, depending on the exact use case. When one considers that some IoT use cases involve the collection, processing and analysis of millions of events per second, you can see why in-memory becomes so much more appealing.
There’s another big advantage with in-memory databases. Traditionally, databases have been geared toward one of two main uses: handling transactions, or enabling rapid analysis of those transactions – analytics. The I/O limitations of disk-based databases meant that those handling transactions would slow down considerably when also being asked to return the results of data queries. That’s why data was often exported from the transactional database into another platform – a data warehouse – where it could more rapidly be analyzed without impacting the performance of the system.
Hybrid operational and analytical databases
With in-memory databases, it’s becoming increasingly common for both operational and analytic workloads to be able to run in memory rather than on disk. With an in-memory database, all (or nearly all) of the data is held in memory, making reads and writes an order of magnitude faster – so much so that both transactional duties and analytic queries can be handled by the same database.
There are a number of in-memory database players vying for what has become an even more lucrative market in the era of IoT. The largest incumbent database vendors such as Oracle, IBM and Microsoft have added in-memory capabilities to their time-tested databases. SAP has spent many millions of dollars educating the market about the benefits of its in-memory HANA database, saying it will drop support for all other third party databases under its enterprise software by 2025. There are also smaller vendors vying for market share such as Actian, Altibase, MemSQL and VoltDB.
Data grids & fabrics
Then there is the in-memory data grid (sometimes known as a data fabric) segment. This is an in-memory technology that you ‘slide’ between the applications and the database, thereby speeding up the applications by keeping frequently-used data in memory. It acts as a large in-memory cache, but using clustering techniques (hence being called an in-memory grid) it’s possible to store vast amounts of data on the grid.
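The caching role described above can be sketched in a few lines of Python. This is only an illustration of the read-through pattern on a single machine; a real data grid partitions the data across a cluster and also handles eviction, writes and failover, none of which is shown. The names ReadThroughCache, load_from_database and the sensor records are hypothetical.

```python
class ReadThroughCache:
    """Toy read-through cache sitting between an application and a slow store."""

    def __init__(self, load_from_store):
        self._load = load_from_store   # callable that hits the backing database
        self._data = {}                # in-memory copy of records already requested
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._data:
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = self._load(key)        # only touch the database on a miss
        self._data[key] = value
        return value


# Hypothetical backing store: a dict standing in for a disk-based database.
database = {"sensor-1": {"site": "north"}, "sensor-2": {"site": "south"}}
db_calls = []


def load_from_database(key):
    db_calls.append(key)               # record each trip to the "database"
    return database[key]


cache = ReadThroughCache(load_from_database)
first = cache.get("sensor-1")          # miss: loaded from the database
second = cache.get("sensor-1")         # hit: served from memory
```

The application only ever talks to the cache, which is why this approach tends to need little or no change to the application or the database behind it.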
In recent years their role has evolved beyond mere caching. They still speed up applications and reduce the load on the database, and have the advantage of requiring little or no rewriting of applications, or interference with the original database. But now as well as caching, they are being pressed into action as data platforms in their own right: they can be queried (very fast, in comparison with a database), they add another layer of high availability and fault tolerance – possibly across data centers – and they are increasingly being used as a destination for machine learning.
There are data grid offerings from a handful of vendors, amongst them Oracle, IBM, Software AG, Amazon Web Services, Pivotal, Red Hat, Tibco, GigaSpaces, Hazelcast, GridGain Systems and ScaleOut Software.
Data streaming engines
The third category, streaming, is also notable in the context of the Internet of Things. Data streaming involves the rapid ingestion and movement of data from one source to another data store. It employs in-memory techniques to give it the requisite speed. Streaming engines ingest data, potentially filter some of it, and also perform analytics on it. They can raise alerts, help to detect patterns, and start to form a level of understanding of what is actually going on with the data (and hence with the sensors, actuators or systems that are being monitored).
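As a rough illustration of what 'ingest, filter, analyse and alert' can look like, the Python sketch below keeps a small window of recent readings in memory and flags any value that strays too far from the recent average. It is a simplification under stated assumptions: the function name stream_alerts, the window size, the tolerance and the sample feed are all made up, and commercial streaming engines do far more than this.

```python
from collections import deque
from statistics import mean


def stream_alerts(readings, window=5, tolerance=10.0):
    """Yield (index, value, rolling_mean) for readings far from the recent mean."""
    recent = deque(maxlen=window)      # bounded in-memory window of recent values
    for index, value in enumerate(readings):
        if len(recent) == window:      # only judge once the window is full
            baseline = mean(recent)
            if abs(value - baseline) > tolerance:
                yield index, value, baseline
        recent.append(value)


# Hypothetical temperature feed with one spike.
feed = [20.0, 21.0, 20.0, 19.0, 20.0, 20.0, 45.0, 21.0]
alerts = list(stream_alerts(feed))
```

Because the function is a generator, it processes each reading as it arrives rather than waiting for the whole dataset, which is the essential difference from querying data at rest.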
While streaming was once largely confined to the lowest-latency environments, such as algorithmic trading in the financial sector, more and more use cases in the IoT space are now latency sensitive: e-commerce, advertising, online gaming and gambling, sentiment analysis and more.
There are relatively few vendors with data streaming technology. But they include IBM with Streams, Amazon Web Services’ Kinesis in the cloud, Informatica with its Ultra Messaging Streaming Edition, SAS’ Event Stream Processing (ESP), Impetus Technologies with its StreamAnalytix and also TIBCO, Software AG and SAP (which bought StreamBase Systems, Apama and Aleri, respectively).
Smaller competitors include DataTorrent, which has a stream processing application that sits on a Hadoop cluster and can be used to analyze the data as it streams in, and SQL-based event-processing specialist SQLstream. Another young company is Striim.
In the open source space, Apache Spark Streaming and Apache Storm both offer streaming – most vendors have added support for Spark rather than Storm. But that, as they say, is a story for another day.
IoT Back to Basics, chapter 3: It’s no surprise that security and governance are important considerations when it comes to the IoT, but quite how incredibly important they are may not be immediately obvious.
Ensuring that users of IoT systems and smart devices remain safe and secure – which requires that their data stays protected and carefully governed – is vital if businesses and public sector institutions are to initiate successful IoT projects. There isn’t just the risk to a user’s privacy, and the possibility of big fines from regulatory bodies when things go awry, but also the issue of reputational risk and the commercial consequences of confidence in your brand being undermined.
Of course, security should be high on the agenda in all areas of IT. A targeted and sustained ransomware attack on the NHS, in May last year, was just one example of how sophisticated some of the hackers – and their malware – have become. At a machine data analytics conference last year, the chief security officer at Travis Perkins, a British builders’ merchant and home improvement retailer, told us that his organization had faced 3,851 ransomware attacks in just one month last summer.
The extra problem with IoT is that it vastly increases the potential ‘attack surface’ – there are more connected devices and gateways, and hence more areas of potential vulnerability, which gives those with nefarious intent greater opportunity to wreak havoc. And while many existing technologies and data governance methodologies can also be used in the era of IoT, they cannot make up for the broader attack surface.
Some of the ‘things’, such as sensors, are relatively dumb and therefore unlikely to bring much gratification to hackers. There’s not a huge amount of twisted satisfaction to be gained from interrupting temperature or wind-speed readings from a sensor in a wind turbine, for example.
But when you consider that IoT also includes the likes of connected vehicles, wear-at-home medical devices, industrial and hospital equipment, you can see why security is such a vital consideration.
For instance, in 2015 a group of researchers from the University of California, San Diego, discovered a serious weakness in vehicle security that allows hackers to take remote control of a car or lorry, thanks to small black dongles that are connected to the vehicles’ diagnostic ports.
These are common in both cars and lorries, fitted by insurance companies and fleet operators, as a way of tracking vehicles and collecting data such as fuel efficiency and the number of miles driven.
But the researchers found that the dongles could be hacked by sending them SMS text messages, which relayed commands to the car’s internal systems. The hack was demonstrated on a Corvette, where the researchers showed they were able to apply the brakes or even disable them (albeit as long as the car was at low speed).
You can imagine the repercussions of such a hack as we move ever-closer to driverless cars.
There have been other worrying security lapses around IoT that give pause for thought. In 2013, for instance, the US Federal Trade Commission (FTC) filed a complaint against TRENDNet, a Californian maker of home-security cameras that can be monitored over the Internet, for failing to implement sufficient security measures.
TRENDNet’s cameras were hacked via the Internet, leading to the display of private areas of users’ homes on the Web, and allowing unauthorized surveillance of adults as well as children going about their usual daily lives. As well as an invasion of privacy, there was the potential that such covert surveillance could be used to monitor the comings and goings of the occupants of a premises, and hence give rise to further criminal activity once the hacker knows when there is no one at home.
Clearly, some IoT initiatives have different risk profiles to others. For instance, ‘white hat’ hackers last year demonstrated that they had been able to hack into a smart domestic appliance network and turn off ovens made by the British company AGA. Being able to turn them on and adjust the temperature would be more dangerous, but the ramifications are still worrying.
Another penetration testing company discovered that hackers could remotely compromise a connected kettle with relative ease and thus potentially gain unfettered access to a person’s wireless network, from which they could change DNS settings and monitor all web traffic for access to bank accounts and other sensitive data.
It’s obvious that the companies involved in implementing IoT need to be just as sophisticated about their security processes and protocols as the most sophisticated hackers – but time and again we have seen companies outsmarted by either ‘white hat’ or, worse, ‘black hat’ hackers.
The potential security risks around IoT are very real
Organizations contemplating the benefits of IoT projects (or in the case of local or federal government, their citizens) would be wise to consider security and data governance very carefully indeed. Authentication and authorization technologies are likely to be necessary. Data masking (removing attributes that would enable a hacker to identify specific people and their habits, for instance) may also be called for, and in some cases even mandated by law.
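For readers who want a feel for what data masking involves, here is a minimal Python sketch. It drops directly identifying attributes and replaces a device identifier with a salted hash. The field names, the salt and the function name mask_record are assumptions made for illustration, and a salted hash is pseudonymisation rather than full anonymisation, so it should not be read as a compliance recipe.

```python
import hashlib


def mask_record(record, drop=("name", "address"),
                pseudonymise=("device_id",), salt="example-salt"):
    """Return a copy of record with identifying fields removed or replaced."""
    masked = {}
    for field, value in record.items():
        if field in drop:
            continue                       # remove the attribute entirely
        if field in pseudonymise:
            raw = (salt + str(value)).encode("utf-8")
            # Stable token in place of the raw identifier.
            masked[field] = hashlib.sha256(raw).hexdigest()[:12]
        else:
            masked[field] = value
    return masked


# Hypothetical reading from a wear-at-home medical device.
reading = {"name": "A. Patient", "address": "1 High Street",
           "device_id": "monitor-0042", "heart_rate": 71}
safe = mask_record(reading)
```

The measurement itself survives, so analytics can still be run on it, while the attributes that tie it to a named individual do not travel with it.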
Ensuring privacy is also an issue. While some consumers or citizens are quite happy to share various data with organizations, others are not. Organizations must therefore ensure that they ask users to ‘opt in’ to IoT-related projects or systems, rather than opting them in without explicit consent (even if they subsequently offer an opt-out).
Companies that don’t do this run the risk of annoying customers and falling foul of auditors and legislators. If potential fines are not sufficient to deter some companies from taking security and data governance seriously, the potential reputational damage certainly should be!
It’s all too easy to conflate the kind of AI being hyped in the industry at the moment with the science fiction notion of machine sentience. We are still a long way from the latter, though, whether you see it as WALL-E or the Terminator.
What’s mostly on offer today from IT vendors and service providers is really just advanced data analytics. With the power and scalability of modern cloud platforms, rules and inferences can be constantly updated and refined as more data is accumulated (machine learning), and applied in near real-time. This can of course create the illusion of intelligence, but sentient computers it isn’t – not yet, at least.
Having said that, we shouldn’t underestimate the potential for today’s kind of AI to have a huge impact on the way some things are done in business. If you work in IT, this is something you need to get to grips with sooner rather than later. Why? Because AI capability is rapidly becoming a lot more accessible, and before long will be pervasive across our application and service estates.
The pace at which things are evolving became clear to me over the course of a number of briefings and conversations I had towards the end of last year. This began with a session at IBM, during which a case study at a major oil and gas company was discussed. The Watson ‘cognitive computing’ platform had been used to create a virtual assistant that was transforming IT support by providing users with advice and guidance via a multi-lingual text and speech interface. The results achieved in terms of service level metrics were impressive, but to get there required a substantial professional services engagement – i.e. lots of consulting time and expertise.
In contrast, I then had quite a different conversation with Salesforce.com, which has been acquiring, building and integrating AI capability into its cloud platform for a number of years. In the words of John Taschek, Senior VP of Strategy at the company, “A lot of what we are doing is aimed at making AI a seamless and embedded part of the business process”.
Moving AI into the software stack
One of the examples we discussed was advanced sales forecasting powered by Einstein – the overarching brand name for most things AI in Salesforce.com. The key point here is the notion that you shouldn’t need lots of specialist expertise or coding and integration effort to exploit the potential of AI. It will increasingly be a case of ‘switch on, configure and go’.
More recently, at its January Tech Summit in Birmingham, I heard Microsoft do a pretty good job of spelling out the different routes to AI goodness. If you have the expertise and want to get really ‘down and dirty’, the Azure platform is increasingly going to offer fine-grain AI and machine learning capability, right down to FPGA level. For mainstream developers who need to AI-enable their application without having to worry about the detail, higher-level services are offered so you can access natural-language functionality. For example, a set of APIs can hide all of the underlying AI complexity. Then, further up the software stack, we’ll increasingly be seeing AI smarts embedded seamlessly into Microsoft applications and tools, from Office 365 to its CRM and ERP offerings.
I’ve only mentioned three players here, but technology companies large and small, from Google and Apple to highly-innovative specialist vendors, will be surfacing AI capability in all kinds of different ways. That includes embedding it in the systems and security management tooling used by IT teams.
The upshot is that AI will increasingly find its way into the world of IT professionals – there really won’t be a way of avoiding it. So you need to start thinking now about the implications in relation to changing user expectations, application design and implementation, service management and support, and not least, security, privacy and compliance.
When we surveyed several hundred IT professionals on the topic of All-Flash Arrays, one thing that came out was just how broad was the chasm in thinking between those whose organisations already owned and used AFAs, and those who did not.
Most current AFA users were positive about the technology’s value, both to the wider business and to IT specifically. However, non-users were much more likely to be cautious or even sceptical about the strategic value and operational benefits of AFA.
We also found that these two groups had quite different ideas of which business workloads work well on AFAs. For those with no direct experience, the top target workloads were database applications and virtual servers, both of which were thoroughly hyped up in the early days of AFA, of course.
Once again, familiarity with the technology had a robust effect: the experienced group were using AFAs to support a much broader range of workloads. As well as databases and VMs, they included online transaction processing, mobile apps and services, virtual desktops, big data, and real-time analytics.
Going beyond simple workload suitability, we also asked about using AFAs to enable business and IT transformation. Here, we were thinking about those changes that derive from Flash working differently from disk, such as its ability to deliver consistent performance and reliable quality of service. The majority of those with direct experience agreed that AFAs were a strategic enabler for both business and IT transformation, while those without direct experience were rather more cautious.
We were also thinking about the way AFAs bring more opportunities for automation, and sure enough the second most significant benefit reported in our survey was that they need less management and tuning. As well as the opportunity to free up skills and redeploy them to create real business value, this also implies less downtime resulting from ‘human errors’.
Of course, we didn’t know exactly why the non-users were non-users. It could be they were indeed sceptical of AFA’s value, or perhaps they simply couldn’t get the budget, hadn’t had a trigger to change, or thought that their applications weren’t appropriate for Flash storage. The result was the same though – actual experience is key to understanding the possibilities of the technology, and they didn’t have that experience, hence the awareness chasm.
There may also be an element of ignorance and working on outdated information. Not everyone is aware of how fast AFA technology has evolved over the last couple of years from the niche-oriented first generation systems, or of how quickly its effective price per GB has fallen. As a result, there is still some residual uncertainty and doubt about the enterprise relevance of Flash – doubt which our experienced users tell us is largely unwarranted today.
Either way, our research shows that, when it comes to understanding and achieving the potential of AFA, experience is a massive help. Once you have worked with it, you ‘get’ it.
But as the saying goes, there’s a first time for everything, and even if you don’t have direct experience to help you, you can still bridge that chasm and build a good business case. Reading our report (it’s free to download) will help when it comes to understanding just how many of your applications could benefit, for instance, as will talking to those who have already gone along the AFA route.
Then it’s careful planning, of course. Put that business case together, profile and test your apps – that’s a key tip from our experienced users – and make sure you choose a supplier with good post-sales support and the ability to advise on best practices.
And if you’re trying to sell the idea of investing in AFA to someone else in your organisation, remember that they might well have a rather distorted idea of what it’s good for!
Microsoft’s revenues are up, but compared to its biggest competitors — Amazon, Apple, Google and Facebook — its mindshare and perceived market relevance are down.
Satya Nadella’s Microsoft is still going through the process (and pain) of reinvention, morphing from the PC-based Windows & Office-centric company that everyone knows, to the cloud-based Azure & AI services company that investors want to have in their portfolios. But with so much riding on this phase of the company’s evolution, Microsoft must convince businesses, enterprises, governments, consumers and partners that it has something useful, if not essential, to offer across a complex mix of markets and sectors.
Normal people don’t think about ‘computing’
We’ll never really know what people thought about Apple’s recent ‘What’s a computer’ iPad Pro promo video, because Apple (unlike Microsoft) doesn’t let people add comments to its YouTube content. However, from a consumer perspective, the ad makes a valid point, in that no one playing a game, using a business application or doing their homework on a PC, mobile phone or tablet device ever thinks of this as ‘computing’.
The visible components of Microsoft’s ‘more personal computing’ strategy currently revolve around Windows 10 and the company’s range of adaptable, yet expensive, Surface devices. This combination will maintain the company’s relevance in its traditional desktop domain, but from a platform perspective, Microsoft needs to engage with the growing number of non-Windows, non-PC users.
A cloud that can listen, learn and predict
For normal people, the word ‘computing’ is something that happens in the cloud or in that place we call the data centre. Of course, Microsoft has a major presence in both locations, and is therefore well placed to service the hybrid IT needs of organisations, but to remain relevant in the noisy consumer market, it must find new and authentic ways to convey the notion of it being useful or, better still, essential.
Microsoft failed in its attempts to make its Windows Phone platform either useful or essential during the mobile technology wave, so it needs to think about the lessons it learnt as we enter the ‘digital assistants’ technology wave, driven by voices, smart speakers and intelligent devices. There’s a lot of hype around this topic, but very little of it features Microsoft’s own digital personality, Cortana. So, Microsoft must find a way to insert its intelligent cloud and intelligent edge technologies into the equation, and getting developers to build intelligent applications on its Azure platform is the most obvious option.
Nothing important happens in the office
Microsoft isn’t alone when it talks about reinventing productivity and business processes, but it’s one of the loudest. Office 365 is undoubtedly changing the way that productivity tools and associated services are delivered to employees and end users, but there’s scant evidence to suggest that it’s radically changing the way that people use Microsoft Office.
Organisations don’t really differentiate themselves by the way they manage file servers, email servers or telephony systems. OK, some users are a whizz when it comes to using Word, Excel and SharePoint, but it’s what happens outside of Microsoft Office that ultimately matters to the organisation. This is why partners will ultimately determine the relevance of Microsoft in 2018, and why Microsoft needs them now more than ever.
This article is part of a series on the challenges facing major technology firms in 2018. For more, please see the main Write Side Up blog page.
IoT Back to Basics, chapter 2: In the era of the Internet of Things (IoT) it is becoming increasingly important to be able to process, filter and analyse data close to where it is created, so it can be acted on remotely, rather than having to bring it back to a data-centre or the cloud for filtering and analysis.
The other reason to implement analytics at the edge of the network is that use cases for IoT continue to grow, and in many situations, the volume of data generated at the edge requires bandwidth levels – as well as computing power – that overwhelm the available resources. So it’s possible that streams of data from smart devices, sensors and the like could swamp datacentres designed for more traditional enterprise scale needs.
For example, a temperature reading from a wind turbine motor’s sensor, that falls within the normal range, shouldn’t necessarily be stored every second, as the data volume can soon add up. Rather, it is the readings that fall outside of a normal range or signify a trend – perhaps pointing towards an imminent failure of a component – that should create an alert, and possibly be stored centrally only after that first anomaly, for subsequent analysis.
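The wind turbine example can be sketched in Python as follows. This is one possible reading of the approach: in-range readings are discarded at the edge until the first anomaly, after which everything is forwarded so the run-up to a possible failure can be analysed centrally. The function name filter_at_edge, the thresholds and the sample readings are hypothetical, and trend detection is left out for brevity.

```python
def filter_at_edge(readings, low, high):
    """Return (forwarded, dropped_count) for a sequence of (timestamp, value).

    Normal readings are discarded until the first out-of-range value;
    from then on every reading is forwarded for central analysis.
    """
    forwarded, dropped, anomaly_seen = [], 0, False
    for timestamp, value in readings:
        out_of_range = not (low <= value <= high)
        if out_of_range:
            anomaly_seen = True        # this is the point at which an alert would be raised
        if anomaly_seen:
            forwarded.append((timestamp, value, out_of_range))
        else:
            dropped += 1               # normal reading, not worth the bandwidth
    return forwarded, dropped


# Hypothetical motor temperatures in degrees Celsius, one per second.
samples = [(0, 60.0), (1, 61.0), (2, 95.0), (3, 62.0)]
forwarded, dropped = filter_at_edge(samples, low=40.0, high=80.0)
```

With these sample values, two of the four readings never leave the edge node, and the saving grows with every uneventful second the turbine runs.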
There are too many vendors in this space to produce an exhaustive list here. But it’s perhaps notable that last year, a company formerly known as JustOne Database performed a root and branch rebranding exercise. It renamed not only its products, but also its company name, which is now Edge Intelligence. It told me it was seeing such good traction for its database – that can run on relatively compact servers at the edge of the network, a data-centre or the cloud – that it changed its name after over six years in the business.
So what are some of the characteristics of edge analytics that you might want to consider if you are trying to push at least some analytics to the edge?
Standards and protocol translation
Although there is likely to be a shakeout of some of the standards in this space, opting for technologies that support standards is likely to make future integrations easier. Again there is a vast array of standards and APIs in this area. Standards and protocols include POSIX and HDFS APIs for file access, SQL for querying, a Kafka API for event streams, and HBase and perhaps an OJAI (Open JSON Application Interface) API to help with compatibility with NoSQL databases. There’s also the need to be able to support older, proprietary telemetry protocols so that legacy equipment (which often has a lifetime measured in decades) can be connected to more modern IoT frameworks. This is especially true in the industrial space, where IoT is of particular value for the likes of predictive maintenance.
Distributed data aggregation
This is to some extent the bread and butter of edge analytics, providing high-speed local processing, which is especially useful for location-restricted or sensitive data such as personally identifiable information (PII), and can be used also to consolidate IoT data from edge sites.
This refers to technologies that adjust throughput from the edge to the cloud and/or data centre, even with occasionally-connected sensors or devices.
Combines operational decision-making with real-time analysis of data at the edge.
Security and identity management
End-to-end IoT security provides authentication, authorization, and access control from the edge to the central clusters. In certain circumstances it will be desirable to offer secure encryption on the wire for data communicated between the edge and the main data centre. Identity management is also a thorny issue: it’s necessary to be able to manage the ‘things’ in terms of their authentication, authorization and privileges within or across system and enterprise boundaries.
Delivers a reliable computing environment to handle multiple hardware failures that can occur in remote, isolated deployments.
Integration with the cloud
Even if not now, there may be a requirement in the future to have good integration between an edge analytics node and the cloud. This is so that alert data and even ‘baseline’ data points can be stored in the cloud rather than in one’s own data centre. In this regard integration with your cloud provider of choice – if you have one – would be a wise idea. If you don’t already do much in the way of data processing and storage in the cloud, some of the likely execution venues in your future could include Amazon Web Services, Google Cloud Platform or Microsoft Azure, but it wouldn’t do any harm to know there is support for the open source OpenStack infrastructure as a service (IaaS).
Edge analytics has come on in leaps and bounds in the past several years as IoT use cases have shaken out. At the very least it might be worth asking if edge computing has a role to play in any IoT projects that you may be thinking of embarking on.
Most of us already recognise that technology has the potential to wipe out our privacy, if checks and balances are not in place – or at least I hope we do! What’s scary then in the recent hoo-hah about fitness trackers revealing secret locations is that it shows how bad we are – both as users and as technology developers – at spotting those privacy risks ahead of time.
Soldiers and other security staff have been warned for years against revealing their location via social networks. The risks are obvious: in 2007, Iraqi insurgents used geotagged photos to locate and destroy four US attack helicopters, for instance. More recently, geotagged selfies contradicted official Russian claims by revealing Russian soldiers in Ukraine, fighting alongside Ukrainian rebels.
Yet here we are, with people acting all surprised that, when the Strava fitness tracking app openly publishes its users’ location and movement data, it reveals where soldiers exercise, as well as civilians.
You have to wonder what on earth those military users thought they were doing, leaving a tracker wirelessly-connected when they’ve been warned for years about geotagged photos, Facebook Places, Foursquare and all the rest. Did they fail to spot the privacy options on their Strava settings page? (It’s easily done – they are buried a few layers down.) Or did they, as so many of us do, assume that it’s just ephemeral data, of no interest to anyone else?
The tracking scare should remind everyone, not just the world’s militaries, that even a direct order is sometimes not enough. And if it’s an indirect order or mere advice, you’re lucky these days if the recipient scans the first paragraph before muttering “Whatever” and clicking Accept. There must be training too, plus active checks on compliance and probably some form of pen-testing or white-hat hacking.
Beyond that, it also shows why – as the GDPR will require – you need to get a user to actively opt-in to data processing, and why it must be informed consent. Simply providing an opt-out, without a clear explanation of the risks, is nowhere near enough.
To be fair, Strava does recognise that some individuals want anonymity. In a statement it said, “Our global heatmap represents an aggregated and anonymized view of over a billion activities uploaded to our platform. It excludes activities that have been marked as private and user-defined privacy zones.”
Real anonymity is hard
The problem is that this concept of anonymity looks too much like, “Oh, that could be just anyone out there, jogging around Area 51 or that Syrian airbase!” If any more proof were needed that some people in technology have no idea what anonymisation really means, this is it.
There’s a whole bunch of lessons in here, both for Strava and the rest of us. I’ve already mentioned a couple – that privacy needs to be the default, not an opt-out extra, and that anonymisation doesn’t just mean taking the names out. Another is that there is nothing intrinsically good in big data, it’s all in how it’s used – and in who’s using it.
And perhaps it’s also to beware vanity, although that can be a tough challenge for the Instagram generation. Whether it’s soldiers keen to be top of the exercise leaderboard or app developers trumpeting how many million users they have, they’re showing off. Wanting to do your best is one thing, but as the saying goes, pride comes before a fall.
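To make the earlier point about anonymisation concrete, here is a minimal sketch of one common safeguard: publishing a map cell only when at least k distinct users contributed to it. This is purely illustrative and is not how Strava builds its heatmap; the `safe_heatmap` function, the cell names and the value of k are all assumptions, and a threshold like this is only one ingredient of real anonymisation.

```python
from collections import defaultdict


def safe_heatmap(points, k):
    """Return {cell: distinct_user_count}, dropping cells with fewer than k users.

    `points` is an iterable of (user_id, cell) pairs. Cells visited by only
    a handful of people are suppressed, because activity there could be
    traced back to individuals even with the names removed.
    """
    users_per_cell = defaultdict(set)
    for user_id, cell in points:
        users_per_cell[cell].add(user_id)
    return {
        cell: len(users)
        for cell, users in users_per_cell.items()
        if len(users) >= k
    }


# Made-up data: three users in a busy park, one user repeatedly at a remote site.
points = [
    ("a", "city-park"),
    ("b", "city-park"),
    ("c", "city-park"),
    ("a", "remote-base"),
    ("a", "remote-base"),
]
published = safe_heatmap(points, k=3)
```

The remote site disappears from the published map even though the lone user ran there twice: what counts is how many different people were there, not how many activities were logged.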
Some assumptions have been held by IT pros for so long that they have almost become articles of faith. One of these is the idea that content management, particularly for files, semi-structured and unstructured content, is so difficult that only the foolhardy attempt to tackle it for anything other than information that regulators say has to be ‘actively managed’.
It’s fair to say that, until very recently, this assumption may even have underestimated the challenges involved in getting an effective content management system in place, even for relatively small sets of data and files. But things are changing.
An important development has been recent work to make some of the core elements of content management simpler and more effective. These tasks all begin with data discovery: “What do I have in my storage systems?” Even data protection suppliers such as Veritas, Arcserve and Commvault, amongst others, have started to produce tools that make data discovery something that can be contemplated without fear.
However, data discovery is just step one. Moving towards managing content and information across the board, rather than confining it to those files you are legally forced to look after, requires technology that automates the classification of files in line with the organisation’s business needs. Traditionally, finding files has relied on where they live in the file system and folder structure, which is how users search for and surface them. And users often “misplace” or move files around, making finding them later something of a challenge.
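A toy sketch may help show the difference between classifying a file by what it contains and classifying it by where it happens to be stored. The rules, labels and `classify` function below are hypothetical and far cruder than anything a commercial tool would use.

```python
import re

# Hypothetical business rules: label -> pattern suggesting the label applies.
RULES = {
    "personal-data": re.compile(
        r"\b(date of birth|passport|national insurance)\b", re.I
    ),
    "financial": re.compile(r"\b(invoice|purchase order|iban)\b", re.I),
}


def classify(text):
    """Return the sorted labels whose pattern matches the text.

    The result depends only on the content, so it stays the same if a
    user moves the file to a different folder.
    """
    labels = sorted(
        label for label, pattern in RULES.items() if pattern.search(text)
    )
    return labels or ["unclassified"]


invoice_labels = classify("Invoice 1234 for consulting services")
mixed_labels = classify("Passport copy and invoice attached")
other_labels = classify("Lunch menu for Friday")
```

Simple rules like these will miss plenty of files, which fits the 80:20 expectation: catch the majority automatically and leave the remainder for human review.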
An era of genuinely-usable data discovery is dawning
But this too is now being addressed, as vendors like Veritas and M-Files bring tools to market that, while not perfect by any means, can at least pass the 80:20 rule of dealing with the majority of files. We are at the start of an era when finding data, and using human insight to turn it into valuable information on demand, should become routine.
Of course, technology developments alone are unlikely to trigger an avalanche of user-adoption without business triggers to fire that process. That said, many organisations today have visible challenges bearing down upon them.
Some have been around for a long time, such as pressure to use storage cost-effectively or ensure data is protected appropriately, but have been placed in the ‘too hard to look at now’ folder. Others, such as various regulatory drivers around data privacy, are charging forwards at high speed with GDPR a major consideration in the boardroom.
I hope that drivers such as GDPR, combined with better technology solutions, will see organisations look more deeply at managing information, and especially at following often-valuable user-generated content throughout its lengthening, but now bounded, lifespan.
There is an additional upside if you do information management well for all the files in the organisation: you can generate new business value by exploiting data that was previously hard to locate when needed. And with tools like M-Files and Veritas making it possible to do so without having to move everything into yet another silo, the age of enterprise-wide information management may finally be dawning.
Europeans will in future be able to bring US-style class actions for (alleged) privacy violations, instead of having to sue individually and expensively. It’s thanks to a little-known clause of the EU’s GDPR, which comes into force in May.
Rich and arrogant organisations have long relied on delaying tactics to evade certain of their responsibilities to individuals and small businesses. Who among us has the time and money needed to seek redress at law, when our opponent has a full-time legal staff with nothing better to do than dispute and obstruct? Especially if our reward might only be a few hundred pounds or euro.
A solution used (and yes, some would say abused) in the US is the class action. This allows a single party to lodge a claim on behalf of a group, such as all the shareholders or customers of a company. Add the ability of lawyers to work on a contingency basis, meaning they get nothing if they lose but a percentage of the total – which can be considerable, for a large group – if they win, and infringing organisations can no longer afford to be quite so arrogant.
True, the GDPR does not use the words ‘class’ or ‘group’. But it’s a logical extension of Article 80, which includes the following:
Representation of data subjects
The data subject shall have the right to mandate a not-for-profit body, organisation or association …. to lodge the complaint on his or her behalf
I say it’s a logical extension because several European countries already allow representative or collective actions in a range of cases. Typically these have been restricted to the area of consumer protection, but they demonstrate that the potential advantages to the judicial process – e.g. cost, clarity, equal treatment for claimants – are already understood.
My privacy – none of your business?
One of the first to take up the challenge, if not the first, is Max Schrems, the Austrian lawyer and privacy campaigner whose case against Facebook has been winding its way through the Austrian and European courts for almost four years (a final decision is expected soon). Schrems claims that Facebook Ireland (the company’s EU arm) has spent considerable time and legal effort simply trying to get the case thrown out on procedural grounds, such as the validity of class actions.
So he and others have formed just such an Article 80 body, called None Of Your Business, to take on class action privacy cases in the future. As well as empowering individuals to defend their GDPR rights, NOYB says it wants to support businesses that seek to comply with the law, for example by publishing guidelines and best practices, and by making it harder for cheats to gain competitive advantage.
It’s just one more incentive, if any were needed, for organisations to come to terms with the GDPR and with privacy more generally. Get it right, and you could see profitable spin-offs in areas such as data governance and customer trust; get it wrong, and you could be in the legal – and financial – firing line.