Posted by: Sharon Fisher
big data, data center, data storage, federal government, flash, flash drives
Legality and ethics aside — which is something for the law and politics blogs to talk about — the interesting part about the alleged NSA tracking issue are the logistical issues involved. How much data is there? How did the NSA manage it? Where did they put it?
“The NSA is storing all those Verizon (and, presumably, other carrier records) in a massive database system called Accumulo, which it built itself (on top of Hadoop) a few years ago because there weren’t any other options suitable for its scale and requirements around stability or security,” claims Derrick Harris in GigaOm. “The NSA is currently storing tens of petabytes of data in Accumulo.”
With Accumulo, the NSA has said it can process a 4.4-trillion-node (callers), 70-trillion-edge (connections between two callers) graph, according the NSA’s own slide presentation, Harris says. “By way of comparison, the graph behind Facebook’s Graph Search feature contains billions of nodes and trillions of edges.”
(And, in keeping with the Obama Administration’s effort to use open source tools, the NSA donated Accumulo to the Apache Foundation in 2011.)
While people like to think that the amount of data required to spy on a country just isn’t realistic, in practice it doesn’t take that much – particularly if the data stored is only traffic analysis, or the date, length, time, duration, and parties to a call, and not the content of the call itself. Researcher John Villasenor noted last summer that storage costs had dropped by a factor of a million since 1984. And the NSA has always been on the bleeding edge of such research.
“In the 1960s, the National Security Agency used rail cars to store magnetic tapes containing audio recordings and other material that the agency had collected but had never managed to examine, said James Bamford, an author of three books on the agency,” reported Scott Shane in the New York Times, about Villasenor’s work. “In those days, the agency used the I.B.M. 350 disk storage unit, bigger than a full-size refrigerator but with a capacity of 4.4 megabytes of data. Today, some flash drives that are small enough to put on a keychain hold a terabyte of data, about 227,000 times as much.”
And how many calls are we talking now? AT&T has 107.3 million wireless customers and 31.2 million landline customers; Verizon has 98.9 million wireless customers and 22.2 million landline customers; and Sprint has 55 million customers in total, according to the Wall Street Journal. Another Journal article went into more detail on the requirements.
“The task of storing and processing the metadata for all the calls in the U.S. is actually rather trivial, according to Jack Norris, chief marketing officer at MapR Technologies Inc., a company that provides commercial-grade services based on open source database technology such as Hadoop, originally developed by Google Inc. ‘This amount of data is easily analyzed on a MapR Hadoop cluster,’ Mr. Norris said in an email. He assumed, in his calculation, that there are 250 million teenagers and adults in the U.S., each making an average of 10 calls a day, or 2.5 billion calls in total. He also assumed that the average call data record is 2,000 kilobytes. That means all the calls records take up five terabytes worth of storage.”
That would be per day. The storage costs, the Journal quoted Gartner analyst David Cappuccio as saying, would be 46.8 million — 20% less if open source technologies were used.
Reportedly, the data is being stored and analyzed in a $2 billion NSA data center in Utah, code-named Bumblehive (a double allusion to Utah’s large LDS population, which uses the bee as its symbol – to the extent that state highway road signs are beehive-shaped). The 1 million sq. ft. facility is thought to use flash storage for improved performance. Some reports said the center was due to be completed in March of this year, others this fall, while others said it could be as far away as 2016.
Interestingly, this isn’t the first time the NSA Utah data center has come up. The Salt Lake City Tribune reported on the data center as long ago as July, 2009. “The enormous building, which will have a footprint about three times the size of the Utah State Capitol building, will be constructed on a 200-acre site near the Utah National Guard facility’s runway.”
Numerous other sites have reported on the progress of the Utah data center over the years since then, along with data centers from vendors such as Oracle and eBay being built there – not to mention Twitter, which would be convenient for NSA monitoring of its data. (And hey! The NSA data center is reportedly compliant with the silver-level Leadership in Energy and Environmental Design specifications in sustainable development!) In fact, many states and small cities love such data center developments for the construction jobs (as many as 10,000 for the NSA site) and other economic development benefits they bring.
According to the Tribune, the Baltimore Sun had reported in 2006 that the NSA was forced to move outside the Beltway area because it had maxed out the local power grid. The Utah data center will reportedly use up to 65 megawatts of power — or as much as the entire city of Salt Lake City itself. The facility also asked to be annexed into the nearby city of Bluffdale to ensure it would have an adequate water supply for cooling the computers.
In fact, the state of Utah passed a law earlier this year that would enable it to add a new 6% tax to the power used, which could raise up to $2.4 million annually on the expected power costs of $40 million. The NSA is apparently quite put out at the potential additional charges.