Yottabytes: Storage and Disaster Recovery

September 14, 2011  12:25 PM

Pigeons, Station Wagons, Blu-ray, and Data Transfer

Sharon Fisher

XKCD on File Transfers

Sound familiar?

The thing is, it’s true. Even though Internet speeds continue to increase, the amount of data we want to transmit continues to increase, too.

Which is why the various Internet denizens have developed… workarounds for large file transfers, which also provide the opportunity for the wonderful Internet pastime of geekly arguing.

Which brings us to station wagons, pigeons, and Blu-ray.

The canonical statement, by Andrew Tanenbaum in his 1996 book Computer Networks, is basically “Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.” And ever since then, there have been numerous websites devoted to how-many-angels-can-dance-on-the-head-of-a-pin discussions about just what that bandwidth would be.

You can tell how old the websites are based on what figures they use for comparable Internet bandwidth, the size of a magnetic tape, and so on. The Wikipedia entry for “Sneakernet” appears to have the most up-to-date calculations.

(The actual calculation using today’s technologies is left as an exercise for the reader.)

The Internet being the Internet, the calculations have been extended, ranging from petabytes in a sailboat to Blu-ray discs in a 747 (which, as it turns out, would actually be too heavy for a 747 to carry), to, more mundanely, the number of SD cards that fit into a FedEx box, as well as the bandwidth of a Netflix movie shipment through the mail.

And then there’s the pigeons.

Really truly, carrier pigeons have been used for a remarkable amount of data transfer in history — not just short messages, and aerial photography predating satellites, but things like blueprints from military installations in the U.S.

In fact, in 1982, Computerworld ran an article about how Lockheed Missile & Space Co. used pigeons to carry microfilm copies of blueprints to a research facility in Santa Cruz, because it was cheaper than printing out and transporting hard copies. And if you have $100 per half hour for someone to dig it up, you can apparently get a copy of Dan Rather introducing a story about it on CBS News.

Consequently, not one but two April Fool’s Internet protocols were developed for transmitting Internet data by carrier pigeon: A Standard for the Transmission of IP Datagrams on Avian Carriers, and IP over Avian Carriers with Quality of Service. The first one was even demonstrated, and while the experiment left something to be desired, Wikipedia points out that “during the last 20 years, the information density of storage media and thus the bandwidth of an Avian Carrier has increased 3 times faster than the bandwidth of the Internet.”

That’s not all. In various remote areas, such as rural U.K., Australia, and parts of South Africa, people have used carrier pigeons to demonstrate that they’re faster than what passes for high-speed Internet there.

The point is this: No matter how fat a pipe you have to the Internet, at some given amount of data, it’s going to be faster, cheaper, or both to use some manual method to ship data on some storage medium. It makes sense for you to do a back-of-the-envelope calculation to figure out where the data boundaries are for different mediums and different shipping methods, and update them as technology changes.
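
As a rough sketch of that back-of-the-envelope calculation: the break-even point is simply the amount of data your link can move in the time a shipment takes. The function names, the 100 Mbps link, and the 24-hour courier below are illustrative assumptions, not figures from any of the sources above, and the sketch ignores the time needed to write and read the storage medium at either end.

```python
def network_hours(data_tb, link_mbps):
    """Hours to push data_tb terabytes (decimal) through a link_mbps link."""
    bits = data_tb * 1e12 * 8
    return bits / (link_mbps * 1e6) / 3600


def break_even_tb(link_mbps, shipping_hours):
    """Data size (TB) at which the network takes as long as the shipment."""
    bytes_per_hour = link_mbps * 1e6 / 8 * 3600
    return bytes_per_hour * shipping_hours / 1e12


if __name__ == "__main__":
    # Assumed values: a fully used 100 Mbps link versus a 24-hour courier.
    print(break_even_tb(100, 24), "TB")
```

Anything larger than the printed figure gets there sooner in the box than over the wire, under those assumptions. Rerun it whenever your link speed or shipping options change.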

September 7, 2011  4:46 PM

Are You Ready for the New LTO 6 Tape Standard?

Sharon Fisher

Tape’s not dead. Really. Products supporting the Linear Tape Open (LTO) 5 specification just began shipping this year, but already vendors are starting to make noises about LTO 6, for which there isn’t even an availability date announced yet.

In sort of the tape storage equivalent to Moore’s Law, a consortium of three vendors (Hewlett-Packard, IBM, and Quantum, known as the Technology Provider Companies, or TPC) gets together every few years and decides upon specifications for tape cartridges with a steady increase in speed and capacity. This helps keep users convinced that there’s still a future for tape.

For example, the specifications for LTO 5 (as well as LTO 6) were announced in December 2004, but it took until January 2010 before licenses for the LTO 5 specification were available, and products supporting it started to be available in the second quarter of that year.

Similarly, the LTO TPCs announced in June of this year that licenses for the LTO 6 specification were available. By extrapolation, one can assume that LTO 6 products could be announced any day.

LTO 6 is defined as having a capacity of 8 TB with a data transfer speed of up to 525 MB/s, assuming 2.5:1 compression. This is in comparison to LTO 5, which has a capacity of 3 TB with a data transfer speed of up to 280 MB/s, assuming 2:1 compression.

Lest people get fidgety about the future of tape after that, the LTO TPC announced this spring the next two generations, LTO 7 and LTO 8, with compressed capacities of 16 TB and 32 TB and data transfer speeds of 788 MB/s and 1180 MB/s, respectively. As with LTO 6, no dates were announced, but one might expect each to follow its predecessor by about two to three years.
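
One thing worth pulling out of those figures: capacity is growing faster than speed, so each generation takes longer to fill a cartridge. Here is a small sketch of that arithmetic, using the compressed capacities and speeds quoted above. It assumes decimal units and a drive that sustains its maximum rate the whole time, which real drives may not; the names are made up for illustration.

```python
# generation -> (compressed capacity in TB, compressed transfer rate in MB/s)
LTO = {5: (3, 280), 6: (8, 525), 7: (16, 788), 8: (32, 1180)}


def hours_to_fill(capacity_tb, rate_mb_s):
    """Hours to write a full cartridge at the quoted maximum rate."""
    seconds = capacity_tb * 1e12 / (rate_mb_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    for gen, (capacity_tb, rate_mb_s) in sorted(LTO.items()):
        print(f"LTO {gen}: {hours_to_fill(capacity_tb, rate_mb_s):.1f} hours")
```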

The thing to remember, also, is that each LTO generation can typically read only the two generations before it, meaning users need to either rewrite their tape library every few years or keep a bunch of old LTO machines around. “By the time LTO 8 is released, organizations will need, at a minimum, LTO 3 drives to read LTO 1 through LTO 3 cartridges; LTO 6 drives to read LTO 4 through LTO 6 cartridges; and LTO 8 drives to read the LTO 7 and LTO 8 cartridges,” wrote Graeme Elliott earlier this year.
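
Elliott's list falls out of a simple greedy calculation. The sketch below assumes the rule of thumb stated above (a drive reads its own generation plus the two before it) and uses a made-up function name; it is an illustration, not anything from the LTO specification.

```python
def drives_needed(latest_gen, read_back=2):
    """Smallest set of drive generations that can read cartridges 1..latest_gen.

    Assumes a generation-g drive reads generations g - read_back through g.
    """
    drives = []
    oldest_unread = 1
    while oldest_unread <= latest_gen:
        # Pick the newest drive that can still read the oldest uncovered
        # cartridge, but never a drive newer than what exists.
        drive = min(oldest_unread + read_back, latest_gen)
        drives.append(drive)
        oldest_unread = drive + 1
    return drives


if __name__ == "__main__":
    print(drives_needed(8))
```

For LTO 8 this gives generations 3, 6, and 8, the same three drives Elliott names.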

August 31, 2011  2:46 PM

IBM’s 120-Petabyte Hard Drive Causes Flurry of Analogies

Sharon Fisher

The best part about IBM’s experimental 120-petabyte hard drive is reading all the ways that writers try to explain how big it is.

  • 2.4 million Blu-ray disks
  • 24 million HD movies
  • 24 billion MP3s
  • 1 trillion files
  • Eight times as large as the biggest disk array available previously
  • More than twice the entire written works of mankind from the beginning of recorded history in all languages
  • 6,000 Libraries of Congress (a standard unit of data measure)
  • Almost as much data as Google processes every week
  • Or, four Facebooks

It is not one humungo drive; it is, in fact, an array of 200,000 conventional hard drives (not even solid-state disk) hooked together (which would make them an average of 600 GB each).

Unfortunately, you’re not going to be able to trundle down to Fry’s and get one anytime soon. No, this is something being put together by the IBM Almaden research lab in San Jose, Calif., according to MIT Technology Review.

What exactly it’s going to be used for IBM wouldn’t say, only that it was “an unnamed client that needs a new supercomputer for detailed simulations of real-world phenomena.” Most writers speculated that that meant weather, though Popular Science thought it could be used for seismic monitoring — or by the NSA for spying on people.

Like the Cray supercomputer back in the day, and some high-powered PCs even now, the system is reportedly water-cooled rather than cooled by fans.

Needless to say, it also uses a different file system than a typical PC: IBM’s General Parallel File System (GPFS), which according to Wikipedia has been available on IBM’s AIX since 1998, on Linux since 2001, and on Microsoft Windows Server since 2008, and which some tests have shown can work up to 37 times faster than a typical system. (The Wikipedia entry also has an interesting comparison with the file system used by big data provider Hadoop.)

“GPFS provides higher input/output performance by “striping” blocks of data from individual files over multiple disks, and reading and writing these blocks in parallel.”
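
To make the striping idea concrete, here is a toy sketch of round-robin striping: a file is cut into fixed-size blocks that are dealt out across several disks like cards. This is not GPFS code or its actual on-disk layout; the function names, block size, and disk count are assumptions for illustration, and the "disks" are just Python lists.

```python
def stripe(data, num_disks, block_size):
    """Deal fixed-size blocks of data out to num_disks disks, round-robin."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        disks[block_index % num_disks].append(data[offset:offset + block_size])
    return disks


def unstripe(disks):
    """Reassemble the original bytes by reading one block per disk in turn."""
    blocks = []
    longest = max(len(disk) for disk in disks)
    for row in range(longest):
        for disk in disks:
            if row < len(disk):
                blocks.append(disk[row])
    return b"".join(blocks)


if __name__ == "__main__":
    original = bytes(range(10))
    striped = stripe(original, num_disks=3, block_size=2)
    print(striped)
    print(unstripe(striped) == original)
```

Because consecutive blocks live on different disks, a real system can fetch them at the same time instead of waiting on a single spindle, which is where the performance gain comes from.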

The system also has a kind of super-mondo RAID that lets dying disks store copies of themselves and then get replaced, which reportedly gives the system a mean time between failure of a million years.

Technology Review didn’t say how much space it took up, but if a typical drive is, say, 4 in. x 5.75 in. x 1 in., we’re talking 4.6 million cubic inches just for the drives themselves, not counting the cooling system and cables and so on. That’s a 20-ft. x 20-ft. square about 6.7 feet high, just of drives. (These are all back-of-the-envelope calculations.)
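
For anyone who wants to check the envelope, here is the same arithmetic as a short sketch. The drive dimensions and the 20-by-20-foot footprint are the guesses used above, not measurements from IBM, and the function name is made up.

```python
def stack_height_ft(num_drives, drive_dims_in, footprint_ft):
    """Height in feet of a solid block of drives on the given footprint."""
    width, depth, height = drive_dims_in
    total_cubic_in = num_drives * width * depth * height
    footprint_sq_in = (footprint_ft[0] * 12) * (footprint_ft[1] * 12)
    return total_cubic_in / footprint_sq_in / 12


if __name__ == "__main__":
    # 200,000 drives, each assumed to be 4 x 5.75 x 1 inches.
    print(round(stack_height_ft(200_000, (4, 5.75, 1), (20, 20)), 2), "ft")
```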

In fact, the system needs two petabytes of its storage just to keep track of all the index files and metadata, Technology Review reported.

August 24, 2011  11:16 PM

A Thermostat for Your Hard Drives?

Sharon Fisher

In the winter, I keep my thermostat set to a particular temperature. When I leave the house, or go to bed, I turn the thermostat down, and when I get home or wake up, I turn it back up. This ensures that the house is comfortable when I’m using it, and more energy-efficient when I’m not.

Now, someone is talking about doing the same thing for hard disk drives.

Eran Tal, a hardware engineer at Facebook, is talking about the idea. In case you didn’t know, Facebook has some of the largest data centers in the world, and has begun publicizing some details of their design to help other data center managers leverage what Facebook has learned in the process.

Consequently, earlier this year, Facebook created what it called the Open Compute Project, which is, essentially, to hardware design what open source is to software design. Thus far, the site’s blog has a grand total of two postings, along with a number of comments on them.

And that’s where Tal comes in. A few days ago, he made one of those two posts, musing about what it would be like to have hard disks with a toggle switch between low speed and high speed, so that as the data on them became older and less actively used, the switch could be toggled to put the hard disks on a lower speed — saving energy in the process, without having to do the data migration that active tiering requires.

Reducing HDD RPM by half would save roughly 3-5W per HDD. Data centers today can have up to tens and even hundreds of thousands of cold drives, so the power savings impact at the data center level can be quite significant, on the order of hundreds of kilowatts, maybe even a megawatt. The reduced HDD bandwidth due to lower RPM would likely still be more than sufficient for most cold use cases, as a data rate of several (perhaps several dozen) MBs should still be possible. In most cases a user is requesting less than a few MBs of data, meaning that they will likely not notice the added service time for their request due to the reduced speed HDDs.
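
Tal's numbers are easy to sanity-check. The sketch below multiplies his 3-5W per-drive figure by a couple of drive counts; the specific counts (100,000 and 200,000) are picked for illustration from his "tens and even hundreds of thousands" range, and the function name is made up.

```python
def savings_kw(num_drives, watts_saved_per_drive):
    """Total power saved, in kilowatts, by slowing num_drives cold drives."""
    return num_drives * watts_saved_per_drive / 1000


if __name__ == "__main__":
    # Low end: assumed 100,000 cold drives saving 3 W each.
    print(savings_kw(100_000, 3), "kW")
    # High end: assumed 200,000 cold drives saving 5 W each.
    print(savings_kw(200_000, 5), "kW")
```

Those two cases bracket the "hundreds of kilowatts, maybe even a megawatt" he describes.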

Once upon a time (seven whole years ago) there was a vendor that did something like this: Copan, with what it called its Massive Array of Idle Disks (MAID) technology, produced storage arrays in which only up to 25% of the drives were on at a time. Unfortunately, after getting new funding as recently as February 2009, Copan declared bankruptcy in 2010 and was bought by SGI (yes, it’s still around), which still markets the technology, after a fashion at least.

Several other vendors, including Nexsan with its AutoMAID technology, also have products in this area.

The big trick with any of these systems is ensuring that the data on them really isn’t used very much, because it can take up to 30 seconds for the disk to start from zero, and up to 15 seconds from the slower speed. But as Derrick Harris of GigaOm writes, the savings for a data center the size of Facebook’s can be considerable, and the technology could end up trickling down in the process.

August 18, 2011  9:20 PM

Another One Bites the Dust: HP Buys Autonomy

Sharon Fisher

Another e-Discovery vendor has been purchased: Hewlett-Packard has announced its intent to purchase UK vendor Autonomy — which, like Symantec purchasee Clearwell earlier this year, was also in the Leaders section of Gartner’s e-Discovery Magic Quadrant released in May.

In that report, Gartner predicted that consolidation would have eliminated one in four enterprise e-Discovery vendors by 2014, with the acquirers likely to be mainstream companies such as Hewlett-Packard, Oracle, Microsoft, and storage vendors. Autonomy itself acquired Iron Mountain’s archiving, e-discovery and online backup business in May for US$380 million in cash.

HP offered the US equivalent of $42.11 per share for Autonomy, which it said was a 64% premium over the one-day stock price and a 58% premium over the one-month average stock price. The overall price is on the order of $10 billion.

“Autonomy is a brand and marketing powerhouse that appears on many clients’ shortlists,” Gartner said in its earlier report. “Although we have seen little appetite for ‘full-service e-discovery platforms’ from clients as yet, Autonomy is positioned to seize these opportunities when they do arise; indeed, the overall market may evolve in that direction.”

HP’s chief executive officer, Leo Apotheker, formerly of SAP, has said he wants to focus on higher-margin businesses such as software and de-emphasize the personal computer business, said the New York Times. The company also said it is eliminating its WebOS business and is reportedly considering spinning off its PC business, just a decade after acquiring major PC vendor Compaq.

The AP, in fact, went so far as to say:

“[T]he decision to buy Autonomy also marks a change of course for HP, one that makes HP’s trajectory look remarkably similar to rival IBM’s nearly a decade ago. IBM, a key player in building the PC market in the 1980s, sold its PC business in 2004 to focus on software and services, which aren’t as labor- or component-intensive as building computer hardware.”

However, such a transition may not be easy, said an article in the Wall Street Journal, which examined how IBM had made that transition.

The Autonomy deal offered another advantage to HP, noted a different New York Times article. Like Microsoft’s purchase of Skype earlier this year, it gives HP the opportunity to spend money it had earned outside the U.S. — reportedly as much as $12 billion — without having to pay taxes on that money by bringing it into the U.S.

Other e-Discovery vendors include FTI Technology, Guidance Software, and kCura, the remaining vendors in the “Leaders” section in the Gartner Magic Quadrant. Less attractive, but also likely to be less expensive and, maybe, more desperate, will be the other vendors, such as AccessData Group, CaseCentral, Catalyst Repository Systems, CommVault, Exterro, Recommind and ZyLab in the “visionaries” quadrant, and Daegis, Epiq Systems, Integreon, Ipro, and Kroll Ontrack, as well as the e-discovery components of Lexis/Nexis and Xerox Litigation Services, in the “niche” quadrant.

August 13, 2011  7:47 PM

Rule of ‘Thumb’: You’re Going to Lose Them

Sharon Fisher

To anybody who’s run a USB memory stick through the laundry or left one sitting in a remote machine, there’s no surprise in the results from the recent Ponemon Institute study, The State of USB Drive Security.

Ponemon, which performed the study on behalf of Kingston, a manufacturer of encrypted USB thumb drives, did not fully describe its methodology, but said it had surveyed 743 IT and IT security practitioners with an average of 10 years of relevant experience.

Interesting tidbits from the survey include the following:

  • More than 70% of respondents in this study say that they are absolutely certain (47%) or believe that it was most likely (23%) that a data breach was caused by sensitive or confidential information contained on a missing USB drive within the past two years.
  • The majority of organizations (67%) that had lost data confirmed that they had multiple loss events, in some cases more than 10 separate events.
  • More than 40% of organizations surveyed report having more than 50,000 USB drives in use in their organizations, with nearly 20% having more than 100,000 drives in circulation.
  • On average, organizations had lost more than 12,000 records about customers, consumers and employees as a result of missing USBs.
  • The average cost per record of a data breach is $214, making the average cost of lost records to an organization $2,568,000.
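
That last figure is just the two survey averages multiplied together, as this small sketch shows. The function name is made up; plug in your own record count to see what a lost drive might cost you under the same per-record assumption.

```python
def breach_cost(records_lost, cost_per_record):
    """Estimated cost of lost records at a flat per-record cost."""
    return records_lost * cost_per_record


if __name__ == "__main__":
    # Survey averages quoted above: 12,000 records at $214 each.
    print(f"${breach_cost(12_000, 214):,}")
```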


This isn’t new; there’ve been numerous incidents of data loss via USB memory stick, either by losing them or by theft, ever since the handy little things came out. But those have been largely anecdotal reports, while this was a more broadly based survey.

And that’s just data going out. Another issue is that of malware coming in, also via thumb drive. Again, we have heard of anecdotal incidents, but the survey reported that incoming security was an issue as well.

The most recent example of how easily rogue USB drives can enter an organization can be seen in a Department of Homeland Security test in which USBs were ‘accidentally’ dropped in government parking lots. Without any identifying markings on the USB stick, 60% of employees plugged the drives into government computers. With a ‘valid’ government seal, the plug-in rate reached 90%.

For example, the survey found that free USB sticks from conferences/trade shows, business meetings and similar events are used by 72% of employees, even in organizations that mandate the use of secure USBs. And there aren’t very many of those: Only 29% felt that their organizations had adequate policies to prevent USB misuse.

The report went on to list 10 USB security recommendations — which many or most organizations do not practice:

1. Providing employees with approved, quality USB drives for use in the workplace.
2. Creating policies and training programs that define acceptable and unacceptable uses of USB drives.
3. Making sure employees who have access to sensitive and confidential data only use secure USB drives.
4. Determining USB drive reliability and integrity before purchasing by confirming compliance with leading security standards and ensuring that there is no malicious code on these tools.
5. Deploying encryption for data stored on the USB drive.
6. Monitoring and tracking USB drives as part of asset management procedures.
7. Scanning devices for virus or malware infections.
8. Using passwords or locks.
9. Encrypting sensitive data on USB drives.
10. Having procedures in place to recover lost USB drives.

August 6, 2011  11:05 PM

Flash Storage Gets Big Boost With EBay Win

Sharon Fisher

The conventional wisdom has always been that solid state “flash” storage offered higher performance but cost more, making it too expensive to replace spinning disk technology other than in certain high-performance applications. However, eBay is putting that notion to bed by moving to flash storage at a price it says is comparable to spinning disk technology.

“It probably costs on par or less than a standard spindle-based solution out there,” says Michael Craft, manager of QA systems administration for eBay. Moreover, because the Nimbus flash storage takes up half a rack in comparison to the four to six racks of gear it replaced, power and cooling costs are less, he says.

Craft didn’t provide specific pricing information for the reportedly seven-figure deal, but Chris Mellor of Channel Register performed a detailed pricing analysis.

The acquisition cost for a 10TB Nimbus S-class with a year’s support was $129,536 while the 10TB usable NetApp system was $135,168, based on 450GB, 15,000rpm drives. The purchase of years 2-5 support was $90,112 from NetApp and $45,056 from Nimbus, giving a five-year fixed cost for NetApp of $227,908 and $174,920 for Nimbus, $52,988 less…The five-year total cost of ownership was much lower with Nimbus, even without taking performance into account. Here, according to Nimbus, it creamed NetApp, producing 800,000 4K IOPS versus the FAS array’s 20,000. The cost/IOPS was $0.22 for the S-class and $11.39 for the NetApp array.
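
The cost/IOPS figures in Mellor's analysis appear to be the five-year fixed cost divided by the claimed IOPS. The sketch below reproduces that division using only the numbers quoted above; the function name is made up, and the assumption that this is how he derived the figures is mine.

```python
def cost_per_iops(five_year_cost, iops):
    """Dollars of five-year fixed cost per I/O operation per second."""
    return five_year_cost / iops


if __name__ == "__main__":
    print(f"Nimbus: ${cost_per_iops(174_920, 800_000):.4f} per IOPS")
    print(f"NetApp: ${cost_per_iops(227_908, 20_000):.4f} per IOPS")
```

Both results land within a cent of the quoted $0.22 and $11.39.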

Flash storage also has a reputation for getting “tired” over time with performance decreasing after the media has been overwritten a certain number of times. Nimbus CEO Tom Isakovich claimed that with his company’s new generation of flash technology, the endurance problem was taken care of; Craft was more circumspect, saying only that in six months eBay had had no failures and that it was working with Nimbus to replace storage systems before they failed. “Everything fails,” he says. “Just be proactive about it so we can replace it during business hours.”

EBay wasn’t looking for flash technology specifically, but was looking to meet certain performance requirements, and the Nimbus products were the only ones that provided it, Craft says. The time required to perform certain tasks has typically dropped by five times, such as taking five minutes to perform a task that used to take 40 to 45, he says. He’s also looking forward to implementing the dedupe functionality now in the Nimbus flash storage product in hopes of eliminating up to 90% of writes, he adds.

In total, eBay has deployed up to 100 TB of flash storage, and is replacing its existing spinning disk storage as fast as it can, Craft says. Another advantage is that it takes up less space, making it easier for the company to fit into its new data center. “When we started getting into it, the physical footprint became an ‘oh no’ situation,” but with a pure flash play, it was a “real nice fit,” he says.

July 29, 2011  9:05 PM

How Would *You* Move 30 Petabytes of Data?

Sharon Fisher

A while back, I wrote a piece on how the Arizona State University School of Earth and Space Exploration (SESE) moved a petabyte of data from its previous storage system to its new one. That was pretty impressive.

Now, how about 30 petabytes?

First of all, here’s some perspective on how much 30 petabytes is:

  • More than 10 times as much as stored in the hippocampus of the human brain
  • All the data used to render Avatar’s 3D effects — times 30
  • More than the amount of data passed daily through AT&T or Google

So what is there bigger than AT&T or Google? It could only be Facebook, which added the last 10 petabytes just in the past year, when it was *already* the largest Hadoop cluster in the world. Writes Paul Yang on the Facebook Engineering blog:

During the past two years, the number of shared items has grown exponentially, and the corresponding requirements for the analytics data warehouse have increased as well…By March 2011, the cluster had grown to 30 PB — that’s 3,000 times the size of the Library of Congress! At that point, we had run out of power and space to add more nodes, necessitating the move to a larger data center.

What was particularly ambitious is that Facebook wanted to do this without shutting down, which is why it couldn’t just move all the existing machines to the new space, Yang described. Instead, the company built a giant new cluster, and then replicated all the existing data to it — while the system was still up. Then, after all the data was replicated, all the data that had changed since the replication started was copied over as well.
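
The two-phase approach Yang describes (bulk copy first, then catch up on whatever changed in the meantime) can be sketched in a few lines. This is a toy model with dictionaries standing in for clusters, not Facebook's replication system; every name here is hypothetical, and it ignores deletions and writes that arrive during the catch-up itself.

```python
def bulk_copy(source, dest, dirty, writes_during_copy):
    """Phase 1: copy everything present at the start.

    Writes that land mid-copy are applied to the source and recorded as dirty.
    """
    for name in list(source):  # snapshot of the names at the start
        dest[name] = source[name]
        if writes_during_copy:  # simulate the live system still taking writes
            changed_name, data = writes_during_copy.pop(0)
            source[changed_name] = data
            dirty.add(changed_name)


def catch_up(source, dest, dirty):
    """Phase 2: re-copy only what changed since phase 1 began."""
    while dirty:
        name = dirty.pop()
        dest[name] = source[name]


if __name__ == "__main__":
    old_cluster = {"a": 1, "b": 2, "c": 3}
    new_cluster, dirty = {}, set()
    bulk_copy(old_cluster, new_cluster, dirty, [("a", 10), ("d", 4)])
    print(new_cluster == old_cluster)  # stale until the catch-up runs
    catch_up(old_cluster, new_cluster, dirty)
    print(new_cluster == old_cluster)
```

The point of tracking changes separately is that the second pass only has to move the data that changed, which is presumably far smaller than the full 30 petabytes.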

Facebook uses Hive for analytics, which means it uses the Hadoop distributed file system (HDFS), which is particularly well suited for big data, Yang said. The replication approach has the potential to be useful more broadly in the future, he added:

As an additional benefit, the replication system also demonstrated a potential disaster-recovery solution for warehouses using Hive. Unlike a traditional warehouse using SAN/NAS storage, HDFS-based warehouses lack built-in data-recovery functionality. We showed that it was possible to efficiently keep an active multi-petabyte cluster properly replicated, with only a small amount of lag.

Yang didn’t say how long all this took. But the capability will stand Facebook in good stead in the future, as the company builds a new data center in Prineville, Ore., as well as another one in North Carolina, noted GigaOm.

July 24, 2011  1:14 PM

I Want a Terabyte On My Laptop

Sharon Fisher

Okay, I don’t usually talk about speeds and feeds here, but this is cool. Western Digital has designed a hard disk drive that lets you have a 1-terabyte drive on a notebook.

Heck, the brick I do my backups on isn’t that big.

(You have to understand, I’m old. When I bought my first computer, in 1983, a Hewlett-Packard HP-150 (hey! it had a touchscreen! it was ahead of its time!), I could have gotten a 10-*mega*byte hard disk with it that was the same size as the computer and cost just as much, even with my employee discount. So this is my tell-us-about-the-first-time-you-saw-a-car-Grandma moment.)

So here’s the deal. The Scorpio Blue is a 2.5-inch form factor drive with standard 9.5 mm height, which means it can fit into a notebook. But it’s the first drive with this type of capacity. The way WD was able to do it was by being able to fit 500 GB on each of two platters, rather than the three platters most drives require, said Jason Bache at MyCE:

Traditional terabyte drives use three 334-GB platters to achieve their capacity, which inevitably makes the drives too thick to fit in anything but a desktop or a specially-modified notebook case. Both Samsung’s and Western Digital’s new drives use two 500-GB platters, made possible by advances in platter formatting.

While Samsung announced a similar drive in June, WD is the first vendor to be able to ship one, Bache says, something that is also repeated in numerous other articles, though vendors are apparently taking orders for it.

Shopping for it can be a little challenging; doing a Google shopping search, for example, you run into all sorts of things, including the 12.5 mm version, and ones that aren’t actually 1 TB even if that’s what you search for. (Oddly, there are also some priced at $4,000 or more; I wonder if it’s some sort of automatic pricing issue.) Anyway, here’s the real thing, and it seems to be $120 or so.

CNET notes that it spins at 5200 rpm instead of 5400 rpm, which means it’s going to be slower (and is probably also behind the low power requirements, low noise, and low operating temperature that WD is touting).

No doubt, as with the quadruple toe loop, everyone will be doing it now that someone proved they can; if nothing else, Seagate will be doing it because it is acquiring Samsung.

For me, it’s still in the waltzing-bear stage, but it’s tempting, just for the size queen aspect of it.

July 18, 2011  4:26 PM

Why It’s Dangerous to Rely on the Cloud — At Least, If You Use Comcast

Sharon Fisher

It’s not always fun to be right.

Less than three weeks ago, I wrote a piece about the downsides of backing up to the cloud vs. backing up to one’s own storage, talking about several potential problems, including that of (as Steven J. Vaughan-Nichols of ZDNet had said before me)

“Wouldn’t that be just wonderful! Locked out of your local high-speed ISP for a year because you spent too much time working on Office 365 and watching The Office reruns.”

As it happens, exactly that has happened — except the culprit wasn’t even The Office reruns, but a cloud backup service!

André Vrignaud, an entertainment industry consultant, described in his blog this week how he was cut off from Comcast — about the only broadband provider in his area — for a year for breaking his 250-gigabyte bandwidth cap two months in a row because, as it turns out, he was backing up his voluminous picture and video files to the cloud.

You know, like writers and vendors keep telling you that you should be doing.

While he knew about the cap, he didn’t realize that data he uploaded counted against it as well as data he downloaded.

One could argue that Vrignaud — who worked at home — shouldn’t have been using a consumer service in the first place. He points out, however, that the business service is considerably more expensive for a lesser service, that he would also be required to sign up for a long-term plan and buy additional equipment he didn’t need, and that, in any event, it was now moot because Comcast had banned him from *all* its services.

Stacey Higginbotham of GigaOm went on to note that it also isn’t always easy to determine what would constitute “business use,” and that neither ISPs like Comcast nor cloud service providers are doing a good job of alerting users to the potential problem or telling them what to do should it occur. Moreover, she added, aside from the issue of whether such a cap was productive, the particular cap Comcast instituted was archaic; the median usage when Comcast implemented the cap in 2008 was 2 to 4 GB per month, and it has now more than doubled to 4 to 6 GB per month, but with no increase in the cap.

With services such as cloud backup, online phone calling, and music services such as Spotify becoming more prevalent, this is likely to become more of an issue. Jason Garland, who identifies himself as a senior voice network engineer for Amazon, posted a spreadsheet on Google+ demonstrating that, depending on the speed of the connection, users could hit the cap in less than five hours of a single day. It’s hard to imagine that cloud application providers are going to sit still for this for long.
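
I haven't reproduced Garland's spreadsheet, but the underlying arithmetic is simple enough to sketch. The connection speeds below are assumptions chosen for illustration, the function name is made up, and the calculation assumes the link runs flat out the whole time with decimal gigabytes.

```python
def hours_to_cap(cap_gb, link_mbps):
    """Hours of continuous transfer at link_mbps needed to use up cap_gb."""
    bits = cap_gb * 1e9 * 8
    return bits / (link_mbps * 1e6) / 3600


if __name__ == "__main__":
    for mbps in (10, 50, 100, 120):  # assumed connection speeds
        print(f"{mbps} Mbps: {hours_to_cap(250, mbps):.1f} hours to hit 250 GB")
```

Under these assumptions it takes a connection a bit faster than 110 Mbps to burn through the cap in under five hours, but even a much slower one gets there well within a month of steady backing up.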
