Yottabytes: Storage and Disaster Recovery

Mar 27 2016   4:48PM GMT

Open Source Database Raises Health Data Privacy Concerns

Sharon Fisher


It sounds like a noble cause: A company, Ambry Genetics, is making a database of information it’s collected about 10,000 people with breast and ovarian cancer freely available, in the hopes that other researchers can use it to help develop preventions and cures for such diseases. But while the company no doubt has great intentions, release of medical data like this can create health data privacy concerns.

“The 10,000 people all have or have had breast or ovarian cancer and were tested by Ambry to see if they have genetic variants that increase the risk of those diseases,” writes Andrew Pollack in the New York Times. “Ambry returned to the samples from those customers and, at its own expense, sequenced their exomes — the roughly 1.5 percent of a person’s genome that contains the recipes for the proteins produced by the body. Since proteins perform most of the functions in the body, sequencing just that part of the genome provides considerable information, and is less expensive than sequencing the entire genome.” The company spent $20 million on the project, he adds.

What makes this whole story particularly poignant is that Ambry founder and CEO Charles Dunlop suffers from cancer himself, which he attributes to a genetic mutation, and recently stepped down as CEO. “I would not be resigning if it weren’t for having stage four prostate cancer, which is now in remission,” he writes. “Cancer sucks. The stress of the job coupled with my gene mutation leaves a high likelihood of bringing the cancer back.”

This isn’t the first time databases of such anonymous medical data have been collected. Icelandic company deCODE is working to develop a database of health data for as much as two-thirds of the population of the country. Because the Icelandic population is relatively insular, that data is a treasure trove for researchers, writes Emma Jane Kirby for the BBC.

“With little significant immigration since the Norsemen first settled here in the 9th Century, Iceland is among the most homogeneous nations on earth,” Kirby writes. “With so little background noise to filter in the small population of just 320,000 people, it’s much easier for scientists to isolate faulty genes than it is in larger multi-ethnic countries such as Britain or the US. Iceland also has a database containing the genealogy of the entire nation dating back 1,100 years.”

The Ambry Genetics database, known as AmbryShare, is nominally anonymous, Pollack writes. “AmbryShare will not contain the actual exome of each person, because that would pose a risk to patient privacy,” he writes. “Rather it will contain aggregated data on the genetic variants. For example, a researcher could look up how frequently a particular mutation occurs among the 10,000 people. Ones that occur frequently in these 10,000 patients, but not among healthy people, could raise the risk of developing those cancers.”
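The kind of aggregate lookup Pollack describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the variant labels, the toy cohorts, and the `min_ratio` threshold are all hypothetical and are not taken from AmbryShare, whose actual interface and methods the article does not describe.

```python
from collections import Counter

def variant_frequencies(cohort):
    """Return the fraction of people in `cohort` who carry each variant.

    `cohort` is a list of sets; each set holds the variant labels one person carries.
    Only these aggregate fractions are exposed, never an individual's full set.
    """
    counts = Counter(variant for person in cohort for variant in person)
    return {variant: n / len(cohort) for variant, n in counts.items()}

def enriched_variants(patients, controls, min_ratio=2.0):
    """List variants at least `min_ratio` times as common in patients as in controls.

    `min_ratio` is an arbitrary illustrative threshold, not a clinical standard.
    """
    patient_freq = variant_frequencies(patients)
    control_freq = variant_frequencies(controls)
    flagged = []
    for variant, freq in patient_freq.items():
        base = control_freq.get(variant, 0.0)
        # A variant never seen in controls is flagged outright.
        if base == 0.0 or freq / base >= min_ratio:
            flagged.append(variant)
    return sorted(flagged)

# Made-up cohorts with made-up variant labels.
patients = [{"A", "B"}, {"A"}, {"A", "C"}, {"B"}]
controls = [{"B"}, {"B", "C"}, set(), {"C"}]

print(variant_frequencies(patients)["A"])
print(enriched_variants(patients, controls))
```

In this toy data, variant "A" shows up in three of four patients and in no controls, so it is the only one flagged. A real analysis would need far larger cohorts and proper statistical testing.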

But health data privacy research has shown that “anonymous” medical data isn’t necessarily so and that individuals can be identified by a remarkably short list of data. In fact, the combination of gender, birthdate, and zip code alone is unique for 87 percent of the U.S. population, wrote Seth Schoen for the Electronic Frontier Foundation in 2009.

“The notion of “anonymized” or “sanitized” data is then problematic; researchers habitually share, or even publish, data sets which assign code numbers to individuals,” Schoen wrote. “There have already been conspicuous problems with this practice, like when AOL published “anonymized” search logs, which turned out to identify some individuals from the content of their search terms alone.”
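To see why a short list of attributes can single people out, here is a minimal Python sketch that measures how many records in a data set are unique on gender, birthdate, and zip code. The function name, field names, and records are hypothetical, invented for illustration; this is not the methodology behind the 87 percent figure.

```python
from collections import Counter

def unique_fraction(records, keys=("gender", "birthdate", "zip")):
    """Return the fraction of records whose combination of `keys` appears only once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Made-up "anonymized" records: no names, only a code number and three attributes.
records = [
    {"code": 1, "gender": "F", "birthdate": "1970-03-14", "zip": "92656"},
    {"code": 2, "gender": "F", "birthdate": "1970-03-14", "zip": "92656"},
    {"code": 3, "gender": "M", "birthdate": "1982-11-02", "zip": "92656"},
    {"code": 4, "gender": "F", "birthdate": "1965-07-30", "zip": "10001"},
]

print(unique_fraction(records))
```

Records 3 and 4 are each the only ones with their combination, so half of this toy data set could be matched to a person by anyone who knows those three facts from another source, such as a voter roll.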

Also recall that law enforcement agencies have been doing what they can to mine genetic information from various private companies that collect it, such as 23andMe. While the Ambry database includes only people with breast or ovarian cancer, that doesn’t necessarily mean it could help law enforcement track down only people with those conditions. Certain components of DNA are passed down through the father and mother, so if a relative of a criminal had been tested and was in the database, that could help narrow down the search.

Health data privacy is likely to become even more of an issue in light of President Barack Obama’s Precision Medicine Initiative, which is intended to create a database of medical information for a million people and is expected to cost as much as $1 billion over the next four years.

“When information from one million people is brought together, it would make an attractive target for a hacker working to link the data back to individuals,” writes Dina Fine Maron in Scientific American. “Such a breach could rob both patients and their families of their privacy. Data for research are typically scrubbed of identifying factors like a patient’s name and birth date, but someone with enough information about an individual’s family tree may be able to connect some dots.”

In fact, health data privacy concerns have been enough to keep some people from participating in studies, Maron notes. But the PMI database could also include existing databases with participants who didn’t consent to this specific sort of aggregation, but who agreed that their data could continue to be used for research.

The downside of such privacy concerns is that not making the data accessible is a loss to research. “Admittedly, there’s not much loss to society if IMS Health can’t sell prescription data to marketers,” wrote the late tech journalist Steve Wildstrom in 2011, in response to a legal case on the issue of “anonymous” health databases that turned out not to be. “But there could be a considerable loss if researchers lose access to great masses of aggregated data. We are just at the point where the collection and analysis of vast amounts of data is becoming routinely practical. While there may be considerable risks in assembling that data, there is also a wealth of information about ourselves and our society that could be obtained from them. The debate must weigh both benefits and risks.”

1 Comment on this Post

  • wolfgang2
    Your title is a little bit misleading. The topic has nothing to do with "open source databases" but with databases with open content.
