The Electronic Frontier Foundation has announced that two vendors, Apple and Dropbox, have signed a pledge to help support its Digital Due Process initiative, which calls for a rewrite of the Electronic Communications Privacy Act to better protect user data.
The initiative has more than 50 members, including Amazon, AT&T, Facebook, Google, Microsoft, Twitter, and Yahoo!, which were called out in April as being major computer vendors that should support the proposal. The proposal’s steps include telling users about data demands, being transparent about government requests, fighting for user privacy in the courts, and fighting for user privacy in Congress. Companies received from one to four stars (including partial stars) depending on how well they are implementing each of these policies.
Dropbox was a particularly interesting addition, because the company has been criticized for its policies on protecting user data in its cloud storage service.
Other vendors of the 13 that the EFF called out in April that have not yet responded include Comcast, Myspace, Skype (since purchased by Microsoft, which is a member), and Verizon.
Organizations such as the American Civil Liberties Union and the Center for Democracy & Technology are also members.
It’s typically a good idea to take vendor surveys with a grain of salt; they tend to be slanted and unscientific. Not so with Symantec; they have actual scientific surveys with margins of error and everything.
Not to say, of course, that they’re completely unbiased; recall in this case that Symantec purchased Clearwell earlier this year in an attempt to improve its ranking after a recent Gartner Magic Quadrant on eDiscovery vendors.
That said, its Information Retention and eDiscovery Survey makes some interesting points, not the least of which is actual evidence from users that implementing an information retention policy saves money.
- Respondents using best practices reported a 64% faster response time with a 2.3 times higher success rate when responding to eDiscovery requests.
- They were 78% less likely to be sanctioned by the courts and 47% less likely to find themselves in a compromised legal position.
- They were also 20% less likely to have fines levied against them. In addition, they were 45% less likely to disclose too much information.
- Nearly half of respondents do not have an information retention plan in place.
- 30% are only discussing how to do so.
- 14% have no plan to do so.
- When asked why they don’t have information retention programs, respondents indicated the top reasons are: lack of need (41%); too costly (38%); nobody has been chartered with that responsibility (27%); don’t have time (26%); and lack of expertise (21%).
The part about “too costly” is particularly telling in light of the results.
Respondents who said they’d been asked to respond to a legal, compliance or regulatory request for electronically stored information reported the following results:
- Completely failed to fulfill the request 10%
- Partially failed to fulfill the request 10%
- Successfully fulfilled the request, but more slowly than the requestor would like 25%
- Successfully fulfilled the request in a timeframe that is acceptable to the requestor 35%
- Damage to Enterprise reputation or embarrassment 42%
- Fines 41%
- Compromised legal position 38%
- Sanctions by courts 28%
- Hampered our ability to make decisions in a timely fashion 26%
- Raised our profile as a potential litigation target 25%
The thing is, it’s true. Even though Internet speeds continue to increase, the amount of data we want to transmit continues to increase, too.
Which is why the various Internet denizens have developed… workarounds for large file transfers, which also provide the opportunity for the wonderful Internet pastime of geekly arguing.
Which brings us to station wagons, pigeons, and Blu-ray.
The canonical statement, by Andrew Tanenbaum in his 1996 book Computer Networks, is basically “Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.” And ever since then, there have been numerous websites devoted to how-many-angels-can-dance-on-the-head-of-a-pin discussions about just what that bandwidth would be.
You can tell how old the websites are based on what figures they use for comparable Internet bandwidth, the size of a magnetic tape, and so on. The Wikipedia entry for “Sneakernet” appears to have the most up-to-date calculations.
(The actual calculation using today’s technologies is left as an exercise for the reader.)
The Internet being the Internet, the calculations have been extended, ranging from petabytes in a sailboat to Blu-ray discs in a 747 (which, as it turns out, would actually be too heavy for a 747 to carry), to, more mundanely, the number of SD cards that fit into a FedEx box, as well as the bandwidth of a Netflix movie shipment through the mail.
And then there’s the pigeons.
Really truly, carrier pigeons have been used for a remarkable amount of data transfer in history — not just short messages, and aerial photography predating satellites, but things like blueprints from military installations in the U.S.
In fact, in 1982, Computerworld ran an article about how Lockheed Missile & Space Co. used pigeons to carry microfilm copies of blueprints to a research facility in Santa Cruz, because it was cheaper than printing out and transporting hard copies. And if you have $100 per half hour for someone to dig it up, you can apparently get a copy of Dan Rather introducing a story about it on CBS News.
Consequently, not one but two April Fool’s Internet protocols were developed — Transmission of IP Datagrams on Avian Carriers, and Transmission of IP Datagrams on Avian Carriers with Quality Control — for transmitting Internet data by carrier pigeon. The first one was even demonstrated, and while the experiment left something to be desired, Wikipedia points out that “during the last 20 years, the information density of storage media and thus the bandwidth of an Avian Carrier has increased 3 times faster than the bandwidth of the Internet.”
That’s not all. In various remote areas, such as rural U.K., Australia, and parts of South Africa, people have used carrier pigeons to demonstrate that they’re faster than what passes for high-speed Internet there.
The point is this: No matter how fat a pipe you have to the Internet, at some given amount of data, it’s going to be faster, cheaper, or both to use some manual method to ship data on some storage medium. It makes sense for you to do a back-of-the-envelope calculation to figure out where the data boundaries are for different mediums and different shipping methods, and update them as technology changes.
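As a rough illustration of that back-of-the-envelope calculation, here is a minimal Python sketch. The function names and the example figures (a 100 Mbit/s link, a 24-hour courier) are assumptions for illustration only, and the sketch ignores the time needed to copy data onto and off the storage medium.

```python
def network_hours(data_tb, link_mbps):
    """Hours needed to send data_tb terabytes over a link_mbps megabit/s link."""
    bits = data_tb * 1e12 * 8           # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6)  # megabits/s -> bits/s
    return seconds / 3600


def break_even_tb(link_mbps, shipping_hours):
    """Data size (TB) at which the link takes exactly as long as shipping.

    Above this size, shipping the media is faster than the network,
    under the simplifying assumption that copying to media is free.
    """
    bytes_per_hour = link_mbps * 1e6 / 8 * 3600
    return bytes_per_hour * shipping_hours / 1e12


if __name__ == "__main__":
    # Hypothetical inputs: 100 Mbit/s link versus overnight (24 h) shipping.
    print(f"1 TB over the link: {network_hours(1, 100):.1f} hours")
    print(f"Break-even size: {break_even_tb(100, 24):.2f} TB")
```

Rerunning the numbers whenever link speeds, media capacities, or shipping options change is the whole point of keeping the calculation this small.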
Tape’s not dead. Really. Products supporting the Linear Tape Open (LTO) 5 specification just began shipping this year, but already vendors are starting to make noises about LTO 6, for which there isn’t even an availability date announced yet.
In sort of the tape storage equivalent to Moore’s Law, a consortium of three vendors (Hewlett Packard, IBM, and Quantum, known as the Technology Provider Companies, or TPC) gets together every few years and decides upon specifications for tape cartridges with a steady increase in speed and capacity. This helps keep users convinced that there’s still a future for tape.
For example, the specifications for LTO 5 (as well as LTO 6) were announced in December 2004, but it took until January 2010 before licenses for the LTO 5 specification were available, and products supporting it started to be available in the second quarter of that year.
Similarly, the LTO TPCs announced in June of this year that licenses for the LTO 6 specification were available. By extrapolation, one can assume that LTO 6 products could be announced any day.
LTO 6 is defined as having a capacity of 8 TB with a data transfer speed of up to 525 MB/s, assuming a 2.5:1 compression. This is in comparison to LTO 5, which has a capacity of 3 TB with a data transfer speed of up to 280 MB/s, assuming a 2:1 compression.
Lest people get fidgety about the future of tape after that, the LTO TPC announced this spring the next two generations, LTO 7 and LTO 8, with compressed capacities of 16 TB and 32 TB and data transfer speeds of 788 MB/s and 1180 MB/s, respectively. As with LTO 6, no dates were announced, but one might expect each to come out about two to three years after its predecessor.
The thing to remember, also, is that each LTO generation can typically only read two generations before it, meaning users need to either rewrite their tape library every few years or keep a bunch of old LTO machines around. “By the time LTO 8 is released, organizations will need, at a minimum, LTO 3 drives to read LTO 1 through LTO 3 cartridges; LTO 6 drives to read LTO 4 through LTO 6 cartridges; and LTO 8 drives to read the LTO 7 and LTO 8 cartridges,” wrote Graeme Elliott earlier this year.
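The read-compatibility rule can be turned into a small calculation. The sketch below assumes, as described above, that a drive reads its own generation plus the two before it; the function name and the greedy approach are illustrative choices, not anything published by the LTO consortium.

```python
def drives_needed(newest_gen, read_back=2):
    """Smallest set of drive generations that can read cartridges 1..newest_gen.

    Assumes a generation-g drive reads generations g-read_back through g.
    Greedy strategy: for the oldest cartridge not yet covered, keep the
    newest drive that can still read it.
    """
    drives = []
    oldest_unread = 1
    while oldest_unread <= newest_gen:
        drive = min(oldest_unread + read_back, newest_gen)
        drives.append(drive)
        oldest_unread = drive + 1
    return drives


if __name__ == "__main__":
    print(drives_needed(8))
```

For eight generations this yields LTO 3, LTO 6, and LTO 8 drives, the same minimum set Elliott describes.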
The best part about IBM’s experimental 120-petabyte hard drive is reading all the ways that writers try to explain how big it is.
- 2.4 million Blu-ray disks
- 24 million HD movies
- 24 billion MP3s
- 1 trillion files
- Eight times as large as the biggest disk array available previously
- More than twice the entire written works of mankind from the beginning of recorded history in all languages
- 6,000 Libraries of Congress (a standard unit of data measure)
- Almost as much data as Google processes every week
- Or, four Facebooks
It is not one humungo drive; it is, in fact, an array of 200,000 conventional hard drives (not even solid-state disk) hooked together (which would make them an average of 600 GB each).
Unfortunately, you’re not going to be able to trundle down to Fry’s and get one anytime soon. No, this is something being put together by the IBM Almaden research lab in San Jose, Calif., according to MIT Technology Review.
What exactly it’s going to be used for, IBM wouldn’t say, only that it was for “an unnamed client that needs a new supercomputer for detailed simulations of real-world phenomena.” Most writers speculated that that meant weather, though Popular Science thought it could be used for seismic monitoring, or by the NSA for spying on people.
Like the Cray supercomputer back in the day, and some high-powered PCs even now, the system is reportedly water-cooled rather than fan-cooled.
Needless to say, it also uses a different file system than a typical PC: IBM’s General Parallel File System (GPFS), which according to Wikipedia has been available on IBM’s AIX since 1998, on Linux since 2001, and on Microsoft Windows Server since 2008, and which some tests have shown can work up to 37 times faster than a typical system. (The Wikipedia entry also has an interesting comparison with the file system used by the big data framework Hadoop.)
GPFS “provides higher input/output performance by ‘striping’ blocks of data from individual files over multiple disks, and reading and writing these blocks in parallel.”
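To make the striping idea concrete, here is a toy Python sketch of round-robin block striping. It is not GPFS’s actual on-disk layout, and it does no real parallel I/O; the function names, the in-memory "disks," and the tiny block size are all assumptions chosen for illustration.

```python
def stripe(data, num_disks, block_size):
    """Split data into fixed-size blocks and deal them round-robin to disks."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        disks[block_index % num_disks].append(data[offset:offset + block_size])
    return disks


def unstripe(disks):
    """Reassemble the original data by reading one block per disk in turn."""
    blocks = []
    longest = max(len(disk) for disk in disks)
    for row in range(longest):
        for disk in disks:
            if row < len(disk):
                blocks.append(disk[row])
    return b"".join(blocks)


if __name__ == "__main__":
    layout = stripe(b"abcdefghij", num_disks=3, block_size=2)
    print(layout)
    print(unstripe(layout))
```

Because consecutive blocks land on different disks, a real system can issue the reads or writes for one file to many drives at once, which is where the performance gain comes from.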
The system also has a kind of super-mondo RAID that lets dying disks store copies of themselves and then get replaced, which reportedly gives the system a mean time between failure of a million years.
Technology Review didn’t say how much space it took up, but if a typical drive is, say, 4 in. x 5.75 in. x 1 in., we’re talking 4.6 million cubic inches just for the drives themselves, not counting the cooling system and cables and so on. That’s a 20-ft. x 20-ft. square about 6.7 feet high, just of drives. (These are all back-of-the-envelope calculations.)
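For anyone who wants to check or vary that estimate, the arithmetic fits in a few lines of Python. The drive dimensions are the assumed "typical" figures above, and the 20-ft. square footprint is simply a convenient shape for picturing the volume.

```python
DRIVES = 200_000                   # drive count reported for the array
DRIVE_VOLUME_IN3 = 4 * 5.75 * 1    # assumed drive dimensions, in inches

total_in3 = DRIVES * DRIVE_VOLUME_IN3

# Spread that volume over an assumed 20 ft x 20 ft footprint.
footprint_in2 = (20 * 12) ** 2
height_ft = total_in3 / footprint_in2 / 12

if __name__ == "__main__":
    print(f"Total drive volume: {total_in3:,.0f} cubic inches")
    print(f"Stack height on a 20 ft square: {height_ft:.2f} ft")
```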
In fact, the system needs two petabytes of its storage just to keep track of all the index files and metadata, Technology Review reported.
In the winter, I keep my thermostat set to a particular temperature. When I leave the house, or go to bed, I turn the thermostat down, and when I get home or wake up, I turn it back up. This ensures that the house is comfortable when I’m using it, and more energy-efficient when I’m not.
Now, someone is talking about doing the same thing for hard disk drives.
Eran Tal, a hardware engineer at Facebook, is talking about the idea. In case you didn’t know, Facebook has some of the largest data centers in the world, and has begun publicizing some details of their design to help other data center managers leverage what Facebook has learned in the process.
Consequently, earlier this year, Facebook created what it called the Open Compute Project, which is, essentially, to hardware design what open source is to software design. Thus far, the site’s blog has a grand total of two postings, along with a number of comments on them.
And that’s where Tal comes in. A few days ago, he made one of those two posts, musing about what it would be like to have hard disks with a toggle switch between low speed and high speed, so that as the data on them became older and less actively used, the switch could be toggled to put the hard disks on a lower speed — saving energy in the process, without having to do the data migration that active tiering requires.
Reducing HDD RPM by half would save roughly 3-5W per HDD. Data centers today can have up to tens and even hundreds of thousands of cold drives, so the power savings impact at the data center level can be quite significant, on the order of hundreds of kilowatts, maybe even a megawatt. The reduced HDD bandwidth due to lower RPM would likely still be more than sufficient for most cold use cases, as a data rate of several (perhaps several dozen) MBs should still be possible. In most cases a user is requesting less than a few MBs of data, meaning that they will likely not notice the added service time for their request due to the reduced speed HDDs.
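The scale of those savings is easy to sanity-check. The sketch below multiplies the 3-5W per-drive figure from the post by a few fleet sizes; the fleet sizes and the function name are assumptions for illustration, not Facebook numbers.

```python
def fleet_savings_kw(num_drives, watts_saved_per_drive):
    """Total power saved, in kilowatts, across a fleet of slowed-down drives."""
    return num_drives * watts_saved_per_drive / 1000


if __name__ == "__main__":
    for drives in (50_000, 100_000, 200_000):  # hypothetical cold-drive counts
        low = fleet_savings_kw(drives, 3)
        high = fleet_savings_kw(drives, 5)
        print(f"{drives:,} drives: {low:.0f} to {high:.0f} kW saved")
```

At 100,000 cold drives the range is 300 to 500 kW, and at 200,000 drives the upper end reaches a megawatt, consistent with the estimate in the post.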
Once upon a time (seven whole years ago) there was a vendor that did something like this: Copan, with what it called its Massive Array of Idle Disks (MAID) technology, produced disk arrays in which no more than 25% of the drives were spinning at a time. Unfortunately, after getting new funding as recently as February 2009, Copan declared bankruptcy in 2010 and was bought by SGI (yes, it’s still around), which still markets the technology, after a fashion at least.
Several other vendors, including Nexsan with its AutoMAID technology, also have products in this area.
The big trick with any of these systems is ensuring that the data on them really isn’t used very much, because it can take up to 30 seconds for the disk to start from zero, and up to 15 seconds from the slower speed. But as Derrick Harris of GigaOm writes, the savings for a data center the size of Facebook’s can be considerable, and the technology could end up trickling down in the process.
Another e-Discovery vendor has been purchased: Hewlett-Packard has announced its intent to purchase UK vendor Autonomy — which, like Symantec purchasee Clearwell earlier this year, was also in the Leaders section of Gartner’s e-Discovery Magic Quadrant released in May.
In that report, Gartner predicted that consolidation would have eliminated one in four enterprise e-Discovery vendors by 2014, with the acquirers likely to be mainstream companies such as Hewlett-Packard, Oracle, Microsoft, and storage vendors. Autonomy itself acquired Iron Mountain’s archiving, e-discovery and online backup business in May for US$ 380 million in cash.
HP offered the US equivalent of $42.11 per share for Autonomy, which it said was a 64% premium over the one-day stock price and a 58% premium over the one-month average stock price. The overall price is on the order of $10 billion.
“Autonomy is a brand and marketing powerhouse that appears on many clients’ shortlists,” Gartner said in its earlier report. “Although we have seen little appetite for ‘full-service e-discovery platforms’ from clients as yet, Autonomy is positioned to seize these opportunities when they do arise. Indeed, the overall market may evolve in that direction.”
HP’s chief executive officer, Leo Apotheker, formerly of SAP, has said he wants to focus on higher-margin businesses such as software and de-emphasize the personal computer business, said the New York Times. The company also said it is eliminating its WebOS business and is reportedly considering spinning off its PC business, just a decade after acquiring major PC vendor Compaq.
“[T]he decision to buy Autonomy also marks a change of course for HP, one that makes HP’s trajectory look remarkably similar to rival IBM’s nearly a decade ago. IBM, a key player in building the PC market in the 1980s, sold its PC business in 2004 to focus on software and services, which aren’t as labor- or component-intensive as building computer hardware.”
However, such a transition may not be easy, said an article in the Wall Street Journal, which examined how IBM had made that transition.
The Autonomy deal offered another advantage to HP, noted a different New York Times article. Like Microsoft’s purchase of Skype earlier this year, it gives HP the opportunity to spend money it had earned outside the U.S. — reportedly as much as $12 billion — without having to pay taxes on that money by bringing it into the U.S.
Other e-Discovery vendors include FTI Technology, Guidance Software, and kCura, the remaining vendors in the “Leaders” section in the Gartner Magic Quadrant. Less attractive, but also likely to be less expensive and, maybe, more desperate, will be the other vendors, such as AccessData Group, CaseCentral, Catalyst Repository Systems, CommVault, Exterro, Recommind and ZyLab in the “visionaries” quadrant, and Daegis, Epiq Systems, Integreon, Ipro, and Kroll Ontrack, as well as the e-Discovery components of Lexis/Nexis and Xerox Litigation Services, in the “niche” quadrant.
To anybody who’s run a USB memory stick through the laundry or left one sitting in a remote machine, there’s no surprise in the results from the recent Ponemon Institute study, The State of USB Drive Security.
Ponemon, which performed the study on behalf of Kingston, a manufacturer of encrypted USB thumb drives, did not fully describe its methodology, but said it had surveyed 743 IT and IT security practitioners with an average of 10 years of relevant experience.
Interesting tidbits from the survey include the following:
- More than 70% of respondents in this study say that they are absolutely certain (47%) or believe that it was most likely (23%) that a data breach was caused by sensitive or confidential information contained on a missing USB drive within the past two years.
- The majority of organizations (67%) that had lost data confirmed that they had multiple loss events, in some cases more than 10 separate events.
- More than 40% of organizations surveyed report having more than 50,000 USB drives in use in their organizations, with nearly 20% having more than 100,000 drives in circulation.
- On average, organizations had lost more than 12,000 records about customers, consumers and employees as a result of missing USBs.
- The average cost per record of a data breach is $214, making the average cost of lost records to an organization $2,568,000.
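The last figure in that list is simply the per-record cost multiplied by the record count, which makes it easy to plug in an organization’s own numbers. The snippet below uses the survey’s two averages; the function name is an illustrative choice.

```python
def breach_cost_usd(records_lost, cost_per_record_usd):
    """Estimated cost of a breach as records lost times cost per record."""
    return records_lost * cost_per_record_usd


if __name__ == "__main__":
    # Survey averages: 12,000 records lost, $214 per record.
    print(f"${breach_cost_usd(12_000, 214):,}")
```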
This isn’t new; there’ve been numerous incidents of data loss via USB memory stick, either by losing them or by theft, ever since the handy little things came out. But those have been largely anecdotal reports, while this was a more broadly based survey.
And that’s just data going out. Another issue is that of malware coming in, also via thumb drive. Again, we have heard of anecdotal incidents, but the survey reported that incoming security was an issue as well.
“The most recent example of how easily rogue USB drives can enter an organization can be seen in a Department of Homeland Security test in which USBs were ‘accidentally’ dropped in government parking lots. Without any identifying markings on the USB stick, 60% of employees plugged the drives into government computers. With a ‘valid’ government seal, the plug-in rate reached 90%.”
For example, the survey found that free USB sticks from conferences/trade shows, business meetings and similar events are used by 72% of employees, even in organizations that mandate the use of secure USBs. And there aren’t very many of those: Only 29% felt that their organizations had adequate policies to prevent USB misuse.
The report went on to list 10 USB security recommendations — which many or most organizations do not practice:
1. Providing employees with approved, quality USB drives for use in the workplace.
2. Creating policies and training programs that define acceptable and unacceptable uses of USB drives.
3. Making sure employees who have access to sensitive and confidential data only use secure USB drives.
4. Determining USB drive reliability and integrity before purchasing by confirming compliance with leading security standards and ensuring that there is no malicious code on these tools.
5. Deploying encryption for data stored on the USB drive.
6. Monitoring and tracking USB drives as part of asset management procedures.
7. Scanning devices for virus or malware infections.
8. Using passwords or locks.
9. Encrypting sensitive data on USB drives.
10. Having procedures in place to recover lost USB drives.
The conventional wisdom has always been that solid state “flash” storage offered higher performance but cost more, making it too expensive to replace spinning disk technology other than in certain high-performance applications. However, eBay is putting that notion to bed by moving to flash storage at a price it says is comparable to spinning disk technology.
“It probably costs on par or less than a standard spindle-based solution out there,” says Michael Craft, manager of QA systems administration for eBay. Moreover, because the Nimbus flash storage takes up half a rack in comparison to the four to six racks of gear it replaced, power and cooling costs are less, he says.
The acquisition cost for a 10TB Nimbus S-class with a year’s support was $129,536 while the 10TB usable NetApp system was $135,168, based on 450GB, 15,000rpm drives. The purchase of years 2-5 support was $90,112 from NetApp and $45,056 from Nimbus, giving a five-year fixed cost for NetApp of $227,908 and $174,920 for Nimbus, $52,988 less…The five-year total cost of ownership was much lower with Nimbus, even without taking performance into account. Here, according to Nimbus, it creamed NetApp, producing 800,000 4K IOPS versus the FAS array’s 20,000. The cost/IOPS was $0.22 for the S-class and $11.39 for the NetApp array.
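The cost/IOPS figures follow directly from the quoted five-year costs and IOPS numbers, as this short Python check shows. It takes the five-year totals as quoted rather than recomputing them from the purchase and support prices, and the function name is an illustrative choice.

```python
def cost_per_iops(five_year_cost_usd, iops):
    """Dollars of five-year fixed cost per I/O operation per second."""
    return five_year_cost_usd / iops


# Figures as quoted above.
nimbus = cost_per_iops(174_920, 800_000)
netapp = cost_per_iops(227_908, 20_000)

if __name__ == "__main__":
    print(f"Nimbus S-class: ${nimbus:.4f} per IOPS")
    print(f"NetApp FAS:     ${netapp:.4f} per IOPS")
    print(f"Ratio:          {netapp / nimbus:.0f}x")
```

The results agree with the quoted $0.22 and $11.39 to within a cent, a gap of roughly 50 to 1.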
Flash storage also has a reputation for getting “tired” over time with performance decreasing after the media has been overwritten a certain number of times. Nimbus CEO Tom Isakovich claimed that with his company’s new generation of flash technology, the endurance problem was taken care of; Craft was more circumspect, saying only that in six months eBay had had no failures and that it was working with Nimbus to replace storage systems before they failed. “Everything fails,” he says. “Just be proactive about it so we can replace it during business hours.”
EBay wasn’t looking for flash technology specifically, but was looking to meet certain performance requirements, and the Nimbus products were the only ones that met them, Craft says. The time required to perform certain tasks has typically dropped by a factor of five, and in some cases more, such as a task that used to take 40 to 45 minutes now taking five, he says. He’s also looking forward to implementing the dedupe functionality now in the Nimbus flash storage product in hopes of eliminating up to 90% of writes, he adds.
In total, eBay has deployed up to 100 TB of flash storage, and is replacing its existing spinning disk storage as fast as it can, Craft says. Another advantage is that it takes up less space, making it easier for the company to fit into its new data center. “When we started getting into it, the physical footprint became an ‘oh no’ situation,” but with a pure flash play, it was a “real nice fit,” he says.
A while back, I wrote a piece on how the Arizona State University School of Earth and Space Exploration (SESE) moved a petabyte of data from its previous storage system to its new one. That was pretty impressive.
Now, how about 30 petabytes?
- More than 10 times as much as stored in the hippocampus of the human brain
- All the data used to render Avatar’s 3D effects — times 30
- More than the amount of data passed daily through AT&T or Google
So what is there bigger than AT&T or Google? It could only be Facebook, which added the last 10 petabytes just in the past year, when it was *already* the largest Hadoop cluster in the world. Writes Paul Yang on the Facebook Engineering blog:
During the past two years, the number of shared items has grown exponentially, and the corresponding requirements for the analytics data warehouse have increased as well…By March 2011, the cluster had grown to 30 PB — that’s 3,000 times the size of the Library of Congress! At that point, we had run out of power and space to add more nodes, necessitating the move to a larger data center.
What was particularly ambitious is that Facebook wanted to do this without shutting down, which is why it couldn’t just move all the existing machines to the new space, Yang described. Instead, the company built a giant new cluster, and then replicated all the existing data to it — while the system was still up. Then, after all the data was replicated, all the data that had changed since the replication started was copied over as well.
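The two-phase approach (bulk copy first, then catch up on whatever changed in the meantime) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Facebook’s replication system: the `Cluster` class, its write log, and both function names are hypothetical, and deletions are not handled.

```python
class Cluster:
    """Toy stand-in for a storage cluster that logs every write."""

    def __init__(self):
        self.files = {}
        self.log = []  # ordered record of every path written

    def write(self, path, data):
        self.files[path] = data
        self.log.append(path)


def bulk_copy(src, dst):
    """Phase 1: copy everything and return the log position at copy start."""
    start = len(src.log)
    for path, data in list(src.files.items()):
        dst.files[path] = data
    return start


def catch_up(src, dst, since):
    """Phase 2: re-copy every path written after log position `since`."""
    for path in set(src.log[since:]):
        dst.files[path] = src.files[path]
    return len(src.log)


if __name__ == "__main__":
    old, new = Cluster(), Cluster()
    old.write("/warehouse/a", "v1")
    old.write("/warehouse/b", "v1")
    mark = bulk_copy(old, new)
    old.write("/warehouse/a", "v2")  # writes that arrive while the copy runs
    old.write("/warehouse/c", "v1")
    mark = catch_up(old, new, mark)
    print(new.files == old.files)
```

Repeating the catch-up step shrinks the set of outstanding changes each time, which is why the old cluster can stay in service until the final switchover.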
Facebook uses Hive for analytics, which means it uses the Hadoop distributed file system (HDFS), which is particularly well suited for big data, Yang said — which has the potential for being useful more broadly in the future, he added:
As an additional benefit, the replication system also demonstrated a potential disaster-recovery solution for warehouses using Hive. Unlike a traditional warehouse using SAN/NAS storage, HDFS-based warehouses lack built-in data-recovery functionality. We showed that it was possible to efficiently keep an active multi-petabyte cluster properly replicated, with only a small amount of lag.
Yang didn’t say how long all this took. But the capability will stand Facebook in good stead in the future, as the company builds a new data center in Prineville, Ore., as well as another one in North Carolina, noted GigaOm.