Yottabytes: Storage and Disaster Recovery

Jan 31 2015   4:50PM GMT

Here’s a Backup Job: the Entire Internet

Sharon Fisher Sharon Fisher Profile: Sharon Fisher

Think your backup job is tough? How about backing up the entire Internet?

That’s the role fulfilled by the Internet Archive, which is suddenly getting a lot of attention these days, with major articles in publications such as the New Yorker and Medium.

You may wonder, what’s the point of archiving the Internet? Do we really need to save all those memes and cat pictures? But the Internet is more than that, insist preservationists.

Jill Lepore leads off her New Yorker article by noting that the Internet Archive’s web preservation service, known as the Wayback Machine, was the only remaining source of evidence that Ukraine separatists had posted that they had shot down Malaysia Airlines Flight 17 on a Russian social media site – a site that the Internet Archive had begun saving just two weeks before.  “On July 17th, at 3:22 P.M. G.M.T., the Wayback Machine saved a screenshot of Strelkov’s VKontakte post about downing a plane,” she writes. “Two hours and twenty-two minutes later, Arthur Bright, the Europe editor of the Christian Science Monitor, tweeted a picture of the screenshot, along with the message ‘Grab of Donetsk militant Strelkov’s claim of downing what appears to have been MH17.’ By then, Strelkov’s VKontakte page had already been edited: the claim about shooting down a plane was deleted. The only real evidence of the original claim lies in the Wayback Machine.”

In addition to web pages, the Internet Archive – which, incidentally, is hosted in a former Christian Science church because it looked like the organization’s logo — also hosts books, videos, “ephemeral” films such as advertising, audio recordings, concert recordings, audio books, television news broadcasts, and historical software (including Oregon Trail and Leisure Suit Larry in the Land of the Lounge Lizards), writes Andy Baio in Medium. Altogether, it includes 500,000 pieces of software, more than 2 million books, 3 million hours of TV, and 430 billion web pages, writes Justin Ellis. “In a single day, they digitize more than 1,000 books. They capture TV 24 hours a day. In a week, they save more than 1 billion URLs.”

So how do pages get saved into the Wayback Machine? There’s three ways, Lepore writes:

  • There’s a crawler that attempts to make a copy of every Web page it can find every two months or so, though she points out that the New Yorker’s home page gets saved about six times a day
  • Librarians choose certain pages to be archived in certain subject areas, through a service called Archive It, at archive-it.org, which also lets individuals and institutions build their own archives
  • Anyone who wants to can preserve a Web page, at any time, by going to archive.org/web, typing in a URL, and clicking “Save Page Now,” which is how five of the twelve screenshots of the Malaysian Airlines post were made

At this point, the Wayback Machine has archived more than 430 billion Web pages, comprising 20 petabytes of storage – which is double its 2012 figure, Lepore writes. 600,000 people use it every day, conducting 2,000 searches a second, she adds.

That said, it’s not difficult to keep the Wayback Machine from trawling a site; all it takes is a single text file, Lepore writes – which has the effect of deleting all the archives as well. “Blocking a Web crawler requires adding only a simple text file, ‘robots.txt,’ to the root of a Web site,” she writes. “The Wayback Machine will honor that file and not crawl that site, and it will also, when it comes across a robots.txt, remove all past versions of that site. When the Conservative Party in Britain deleted ten years’ worth of speeches from its Web site, it also added a robots.txt, which meant that, the next time the Wayback Machine tried to crawl the site, all its captures of those speeches went away, too.”

The biggest problem with the Internet Archive is that it’s so big it’s really difficult to search, Lepore writes, because it lacks the tools. “You can do something more like keyword searching in smaller subject collections, but nothing like Google searching (there is no relevance ranking, for instance), because the tools for doing anything meaningful with Web archives are years behind the tools for creating those archives,” she writes. “Doing research in a paper archive is to doing research in a Web archive as going to a fish market is to being thrown in the middle of an ocean; the only thing they have in common is that both involve fish.”

To this end, the Internet Archive was recently one of 22 organizations to share in $3 million of grants from is the Knight Foundation through the Knight News Challenge, towards projects that provide new tools and ideas for making libraries more accessible. “The Internet Archive will get $600,000 to develop new technology to give users more control over how materials are uploaded, categorized, and curated in the archive,” Ellis writes. “What they plan to do with the funding from Knight is create a simpler upload system that works across any browser, a contributor management system that lets one or many people work on collections, expanded search functions, and improved tools for organizing what material can be added to certain collections.”

Including cat pictures, one presumes.

2  Comments on this Post

 
There was an error processing your information. Please try again later.
Thanks. We'll let you know when a new response is added.
Send me notifications when other members comment.
  • InfosysMgmt
    The title and premise of this article grossly overstates what is getting backed up.  Seriously, the "entire" Internet?  At least give a tiny bit of thought to what "entire" means.
    115 pointsBadges:
    report
  • Sharon Fisher
    Titles can only be so long. "A Significant Portion of the Web Pages On the Internet" wouldn't fit.  
    9,720 pointsBadges:
    report

Forgot Password

No problem! Submit your e-mail address below. We'll send you an e-mail containing your password.

Your password has been sent to:

Share this item with your network: