Posted by: Sharon Fisher
facebook, hadoop, migration, replication
A while back, I wrote a piece on how the Arizona State University School of Earth and Space Exploration (SESE) moved a petabyte of data from its previous storage system to its new one. That was pretty impressive.
Now, how about 30 petabytes?
- More than 10 times as much as is stored in the hippocampus of the human brain
- All the data used to render Avatar’s 3D effects — times 30
- More than the amount of data passed daily through AT&T or Google
So what's bigger than AT&T or Google? It could only be Facebook, which added the last 10 petabytes in just the past year, when it *already* ran the largest Hadoop cluster in the world. Writes Paul Yang on the Facebook Engineering blog:
During the past two years, the number of shared items has grown exponentially, and the corresponding requirements for the analytics data warehouse have increased as well… By March 2011, the cluster had grown to 30 PB — that’s 3,000 times the size of the Library of Congress! At that point, we had run out of power and space to add more nodes, necessitating the move to a larger data center.
What was particularly ambitious is that Facebook wanted to do this without shutting down, which is why it couldn't simply move the existing machines to the new space, Yang explained. Instead, the company built a giant new cluster and replicated all the existing data to it while the system was still up. Then, once that bulk replication finished, any data that had changed since it started was copied over as well.
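The two-phase pattern Yang describes can be sketched in miniature. This is not Facebook's actual replication system (which operated on HDFS at petabyte scale); it is a minimal Python illustration of the same idea, using an ordinary directory tree as a stand-in for the cluster and file modification times to detect what changed during the bulk copy. All function names here are illustrative.

```python
import os
import shutil
import time

def bulk_copy(src, dst):
    """Phase 1: copy every file present when migration begins.

    Returns the timestamp at which the copy started, so the
    catch-up phase knows what counts as 'changed since'.
    """
    start = time.time()
    for root, _, files in os.walk(src):
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(dst, os.path.relpath(s, src))
            os.makedirs(os.path.dirname(d), exist_ok=True)
            shutil.copy2(s, d)  # copy2 preserves modification times
    return start

def catch_up(src, dst, since):
    """Phase 2: re-copy only files modified after the bulk copy began."""
    changed = []
    for root, _, files in os.walk(src):
        for name in files:
            s = os.path.join(root, name)
            if os.path.getmtime(s) >= since:
                d = os.path.join(dst, os.path.relpath(s, src))
                os.makedirs(os.path.dirname(d), exist_ok=True)
                shutil.copy2(s, d)
                changed.append(os.path.relpath(s, src))
    return changed
```

The key design point is that phase 1 can take as long as it needs while the source stays live; only the (much smaller) delta accumulated during that window has to be reconciled in phase 2, which is what keeps the final cutover short.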
Facebook uses Hive for analytics, which means it also uses the Hadoop Distributed File System (HDFS), a file system particularly well suited to big data, Yang said. And the replication approach has the potential to be useful more broadly in the future, he added:
As an additional benefit, the replication system also demonstrated a potential disaster-recovery solution for warehouses using Hive. Unlike a traditional warehouse using SAN/NAS storage, HDFS-based warehouses lack built-in data-recovery functionality. We showed that it was possible to efficiently keep an active multi-petabyte cluster properly replicated, with only a small amount of lag.
Yang didn’t say how long all this took. But the capability will stand Facebook in good stead in the future, as the company builds a new data center in Prineville, Ore., as well as another one in North Carolina, noted GigaOm.