Yottabytes: Storage and Disaster Recovery

Jul 29 2011   9:05PM GMT

How Would *You* Move 30 Petabytes of Data?



Posted by: Sharon Fisher
Tags:
facebook
hadoop
migration
replication

A while back, I wrote a piece on how the Arizona State University School of Earth and Space Exploration (SESE) moved a petabyte of data from its previous storage system to its new one. That was pretty impressive.

Now, how about 30 petabytes?

First of all, here’s some perspective on how much 30 petabytes is:

  • More than 10 times as much as is stored in the hippocampus of the human brain
  • All the data used to render Avatar’s 3D effects — times 30
  • More than the amount of data passed daily through AT&T or Google

So what's bigger than AT&T or Google? It could only be Facebook, which added the last 10 petabytes in just the past year, when it *already* had the largest Hadoop cluster in the world. Paul Yang writes on the Facebook Engineering blog:

During the past two years, the number of shared items has grown exponentially, and the corresponding requirements for the analytics data warehouse have increased as well…By March 2011, the cluster had grown to 30 PB — that’s 3,000 times the size of the Library of Congress! At that point, we had run out of power and space to add more nodes, necessitating the move to a larger data center.

What was particularly ambitious was that Facebook wanted to do this without shutting down, which is why it couldn't simply move the existing machines to the new space, Yang explained. Instead, the company built a giant new cluster and replicated all the existing data to it while the system was still up. Then, once the bulk replication was complete, any data that had changed since it started was copied over as well.
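
Yang's post doesn't spell out the tooling, but the two-phase approach he describes (a bulk copy while the source cluster stays live, followed by a catch-up pass for anything that changed in the meantime) can be sketched with Hadoop's stock distcp utility. Here's a minimal sketch in Python with hypothetical cluster names and paths; Facebook's production migration relied on its own replication system, not this:

    # Hypothetical sketch of a live, two-phase HDFS-to-HDFS migration using
    # Hadoop's stock distcp tool. Cluster names and paths are made up;
    # Facebook's actual migration used custom tooling built for the job.
    import subprocess

    SOURCE = "hdfs://old-cluster:8020/warehouse"
    DEST = "hdfs://new-cluster:8020/warehouse"

    def run(cmd):
        """Run a command and raise if it fails."""
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Phase 1: bulk copy of the existing data while the source stays live.
    run(["hadoop", "distcp", SOURCE, DEST])

    # Phase 2: catch-up pass. -update copies only files that are new or
    # changed on the source since the bulk copy, skipping identical files.
    run(["hadoop", "distcp", "-update", SOURCE, DEST])

In practice the catch-up pass would be repeated until the remaining delta is small enough to cut over, but the shape of the job is the same.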

Facebook uses Hive for analytics, which means it uses the Hadoop Distributed File System (HDFS), which is particularly well suited to big data, Yang said. He added that the replication approach has the potential to be useful more broadly in the future:

As an additional benefit, the replication system also demonstrated a potential disaster-recovery solution for warehouses using Hive. Unlike a traditional warehouse using SAN/NAS storage, HDFS-based warehouses lack built-in data-recovery functionality. We showed that it was possible to efficiently keep an active multi-petabyte cluster properly replicated, with only a small amount of lag.
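
Yang doesn't describe the replication system's internals, but the idea of keeping a standby HDFS warehouse in sync with only a small lag can be approximated by running an incremental copy on a fixed schedule. The sketch below is an assumption-laden illustration, not Facebook's implementation; the cluster names, paths, and interval are all hypothetical:

    # Hypothetical sketch: keep a standby HDFS warehouse loosely in sync by
    # running an incremental distcp pass on a fixed interval. Names, paths,
    # and the interval are assumptions, not Facebook's actual setup.
    import subprocess
    import time

    PRIMARY = "hdfs://primary:8020/warehouse"
    STANDBY = "hdfs://standby:8020/warehouse"
    INTERVAL_SECONDS = 15 * 60  # replication lag is bounded by this interval

    while True:
        # -update copies only new or changed files from the primary;
        # -delete removes standby files that no longer exist on the primary.
        subprocess.run(
            ["hadoop", "distcp", "-update", "-delete", PRIMARY, STANDBY],
            check=True,
        )
        time.sleep(INTERVAL_SECONDS)

A real disaster-recovery setup would also need to keep the Hive metastore (the table and partition definitions) in step with the replicated files, which a file-copy sketch like this one does not attempt.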

Yang didn’t say how long all this took. But the capability will stand Facebook in good stead in the future, as the company builds a new data center in Prineville, Ore., as well as another one in North Carolina, noted GigaOm.
