Yottabytes: Storage and Disaster Recovery

Dec 30 2017   11:20PM GMT

Library of Congress Twitter Collection to Stop

Sharon Fisher

social media

Ever decided you were going to save all your data until you found out how much space it was going to take up, and then changed your mind? The Library of Congress is in that position, which is why it recently announced it is no longer going to save every Tweet.

The Library had started saving all the Tweets in 2010, retroactive to the beginning of Twitter in 2006, but as of January, that will stop. While the Library is going to continue to keep the Tweets it already has, it will save Tweets only on a selective basis.

“The technical infrastructure for the Library’s Twitter archive follows the same general practices for monitoring and managing other digital collection data at the Library,” the Library wrote in January 2013. “Tape archives are the Library’s standard for preservation and long-term storage. Files are copied to two tape archives in geographically different locations as a preservation and security measure. The volume of tweets the Library receives each day has grown from 140 million beginning in February, 2011 to nearly half a billion tweets each day as of October, 2012.”

In addition, “Since 2000, the library has been collecting pages from websites that document government information and activity,” writes Doug Criss for CNN. “Today, that archive is more than 300 terabytes in size and represents tens of thousands of different sites. The library’s entire collection of printed books has been estimated to total about 10 terabytes of data.”

Actually, there’s more to it than a simple matter of storage – it’s a matter of being able to find the information afterwards. The Library had said when it started saving the Tweets that it would develop a system by which people could search them, and that hasn’t happened yet, nor is it clear when such search functionality will be available. In fact, the Library of Congress said public access to the archive would be blocked until it could figure out “a cost-effective and sustainable” way to let people view and use it, Criss writes.

“Six years after the announcement, the Library of Congress still hasn’t launched the heralded tweet archive, and it doesn’t know when it will,” Andrew McGill wrote in the Atlantic in August, 2016 – indeed, predicting that the Library might eventually cut the project off. “No engineers are permanently assigned to the project. So, for now, staff regularly dump unprocessed tweets into a server—the digital equivalent of throwing a bunch of paperclipped manuscripts into a chest and giving it a good shake. There’s certainly no way to search through all that they’ve collected.”

In fact, some, such as Kalev Leetaru in Forbes, doubt that the Library’s decision has anything to do with storage at all. “Given that the Library’s collection is, in its current form, essentially a dark archive saved to cold storage, enhanced compression and no query access mean the actual storage costs are minimal,” he writes. “By 2013 the Library’s Twitter archive totaled just 133 terabytes ‘for two compressed copies.’ Even assuming a massive growth rate, a full petabyte of data stored securely with ‘99.999999999% durability’ (9 nine’s), costs just around $7,300 a month for immediate access or as low as $4,000 a month for batch access (5-12 hour access delay) in today’s modern commercial cloud. If the Library just needed to store the Twitter firehose securely and durably without any kind of user access (its current model), price and technical capability would not seem to be a limiting factor.”
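To put Leetaru’s figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above; the decimal conversion (1 petabyte = 1,000 terabytes), the assumption that price scales linearly with size, and the function and constant names are illustrative assumptions, not anything from the Library or a cloud provider’s price list.

```python
# Back-of-the-envelope storage cost, using only figures quoted by Leetaru.
# Assumptions (not from the source): 1 PB = 1,000 TB, and cost scales
# linearly with the amount stored.

TB_PER_PB = 1000                      # assumed decimal units
IMMEDIATE_USD_PER_PB_MONTH = 7300     # quoted: immediate access
BATCH_USD_PER_PB_MONTH = 4000         # quoted: batch access, 5-12 hour delay
ARCHIVE_TB_2013 = 133                 # quoted: two compressed copies, 2013


def monthly_cost(size_tb, usd_per_pb_month):
    """Estimated monthly cost in dollars for size_tb terabytes."""
    return size_tb / TB_PER_PB * usd_per_pb_month


if __name__ == "__main__":
    tiers = [
        ("immediate access", IMMEDIATE_USD_PER_PB_MONTH),
        ("batch access", BATCH_USD_PER_PB_MONTH),
    ]
    for label, rate in tiers:
        cost = monthly_cost(ARCHIVE_TB_2013, rate)
        print(f"{ARCHIVE_TB_2013} TB, {label}: ${cost:,.2f} per month")
```

Under those assumptions, the 2013 archive would come to roughly $971 a month for immediate access and $532 a month for batch access, which is the scale Leetaru has in mind when he calls the storage costs minimal.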

Similarly, Leetaru continues, if the Library thinks text-only Tweets are no longer useful, why not just find a way to save the attached imagery? “It is unclear why the Library has concluded that it should simply abandon archiving Twitter, rather than making the argument that if a lack of multimedia content and links is the problem, then perhaps it should add multimedia and link archiving to its preservation pipeline or partner with organizations like the Internet Archive to preserve this content,” he writes. “After all, the Library has evolved countless times over the last two centuries as technology has changed.”

The timing on all this is actually interesting given how important Twitter is in international politics right now, such as the #MeToo movement being named the Time “Person of the Year.” “At a time when an increasing number of world leaders are taking to Twitter to discuss major policy issues, make formal statements and engage with their citizens at the same time we are talking about Russian influence operations, terrorist recruiting, trolls and harassment on social media, it might seem that this is absolutely the wrong time to suddenly announce with no warning that the Library of Congress will no longer be archiving the full Twitter firehose,” Leetaru writes.

“Generally, the tweets collected and archived will be thematic and event-based, including events such as elections, or themes of ongoing national interest, e.g. public policy,” the library said in a statement. But Leetaru criticized this sentiment. “One of the most basic problems with hand selecting specific topics over time is that it is often unclear at any given moment what will be important in the future,” he writes. “Not only does this mean that myriad society-influencing topics will not be archived at all, but even those topics that are archived will not be included until long after they have begun trending, meaning the early formative conversations around issues like future #metoo movements will be lost forever.”

“That’s a huge blow to the ability of Americans to hold politicians and companies accountable,” agrees Kara Alaimo in CNN. “In particular, social media has become more important to our politics than ever before, so we need more tools to hold politicians responsible for their behavior on social media — not less. But it’s precisely because Twitter has become such a big part of our national conversation that we need the tools to monitor it. If additional funding is necessary for the Library of Congress to be able to maintain a complete record, including visual and deleted tweets, Congress should provide it.”

Not everyone disagrees with the Library’s decision. “Its goal has never been to archive the web as a whole, only to preserve portions of it,” writes Jacob Brogan in Slate. “With that in mind, continuing to archive all of Twitter as such seems largely unnecessary, and possibly even counterproductive if future scholars really do want to look into the platform’s rise.”
