A few years ago, in graduate school, we studied market-basket analysis, where retail companies looked at receipts to try to find patterns. At the time we called it Data Mining, but when you look at the number of receipts we were talking about (thousands per day, at hundreds of stores, for a year, averaging two dozen items or more) … today we might call this big data.
The claim was that the stores ran the analysis and found that, late at night, customers tended to purchase beer and diapers together. The theory was that the beer was an impulse purchase made by a husband on a diaper run. The company began to put some beer next to the diapers, and boom, sales went up. The success of this program led to more data warehouses, more data mining, and, to some extent, inspired big data.
Except it isn’t true, or at least, not the way the story implies. The Register tracked down the story; it originated in 1992 at Osco Drug stores, which do not currently place beer near diapers. Most of the details of the story turned out to be false; for example, in graduate school I was told it was Meijer, a local mega store. At the time, the professor in my information systems policy class said there were “tons of” additional examples, but was unable to produce any of them.
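Whatever the truth of the beer-and-diapers story, the kind of analysis it describes is easy to sketch. The snippet below is a minimal illustration, not what any retailer actually ran: the receipts are invented, and `pair_lift` is a hypothetical helper name. It counts how often each pair of items shares a receipt and compares that with what chance alone would predict, a ratio usually called lift.

```python
from collections import Counter
from itertools import combinations

# Hypothetical receipts, invented for illustration; each set is one basket.
receipts = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread", "diapers"},
]

def pair_lift(baskets):
    """Return {(item_a, item_b): lift} for every pair that shares a basket."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
        # Sort so each pair is always stored in the same order.
        pair_counts.update(combinations(sorted(basket), 2))

    lift = {}
    for (a, b), count in pair_counts.items():
        support_pair = count / n  # share of baskets containing both items
        expected = (item_counts[a] / n) * (item_counts[b] / n)  # if independent
        lift[(a, b)] = support_pair / expected
    return lift

lift = pair_lift(receipts)
for pair, value in sorted(lift.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 2))
```

A lift well above 1 suggests two items turn up together more often than chance would explain; a lift near 1 means there is no pattern worth rearranging a store for. Real receipts would also need a minimum-support cutoff so that rare pairs don't dominate the list.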
The new generation of Data Mining is big data: data so large, and sometimes so unstructured, that it needs to be processed in something less like a relational database and more like Google. As one Microsoft employee at a conference put it to me last summer, the reason Facebook, Chrome, and other tools are free is to get your data. By monitoring media consumption, the advertisers will be able to serve up exactly what the customer wants, when they want it, delivered in such a way that the customer isn’t really even aware they are being sold to.
Color me skeptical.
I doubt those tools can actually do what they promise to do. When Edward Snowden, the American expatriate, is afraid of government surveillance, America tends to listen, but we seem to be ignoring the same risks when it comes to corporate data gathering.
If you look at the big data claims very carefully, they tend to be about telling the future. To the extent that a company can forecast sales of cold medicine during cold season from previous year-over-year trends and seasonal adjustments, that makes sense. Over the past two or three days, though, most of the ads injected into my Facebook feed have been for hotels for vacations I have already booked, credit cards I already have, and products I am not going to buy. There are certainly exceptions. A few years ago I worked with an insurance company that found large populations taking name-brand medicines that were much more expensive than the generics, and sent them letters, but they did all that with good old-fashioned SQL queries.
At this point, I’ve got two concerns: first, that the technology can’t live up to its promises; but more importantly, that as we combine all those technologies to provide a history of one “user”, we are subtly eroding privacy. All it takes is a single ring to combine the information behind a unique key, something like, perhaps, an email address.
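To show how little it takes, here is a rough sketch of that combining step. Everything in it is made up for illustration: the three data sources, the records, and the `build_profiles` helper are hypothetical, and no real service's schema is implied. The only thing the sources share is an email address.

```python
from collections import defaultdict

# Hypothetical data sets from three unrelated services, invented for illustration.
sources = {
    "browsing": [
        {"email": "pat@example.com", "url": "example.org/hotels"},
        {"email": "sam@example.com", "url": "example.org/news"},
    ],
    "purchases": [
        {"email": "Pat@Example.com", "item": "cold medicine"},
    ],
    "video": [
        {"email": "pat@example.com", "watched": "travel vlog"},
    ],
}

def build_profiles(sources, key="email"):
    """Group every record from every source under one normalized key."""
    profiles = defaultdict(lambda: defaultdict(list))
    for source_name, records in sources.items():
        for record in records:
            ident = record[key].strip().lower()  # normalize the shared key
            details = {k: v for k, v in record.items() if k != key}
            profiles[ident][source_name].append(details)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {ident: dict(by_source) for ident, by_source in profiles.items()}

profiles = build_profiles(sources)
print(profiles["pat@example.com"])
```

None of the three data sets says much on its own. Grouped under one key, they start to read like a diary.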
Snowden worked in a government facility using PRISM software, which listed all the social and web activity of individuals. The government claimed that looking up an individual required a court order; Snowden said new hires generally started out by looking up old ex-girlfriends.
Google Chrome now has my browsing history, tied to my Gmail login. Is that something I want to give away?
Oh, I suppose I could use Firefox without signing in. The convenience of being logged in, of getting YouTube and Amazon recommendations, seems to exceed, for me, the risk to my privacy.
But should it?