Web archives at the British Library - Big UK Domain Data for the Arts and Humanities

This is a post by one of the project team from the British Library. Peter Webster is Web Archiving Engagement and Liaison Manager. Peter writes:

At the British Library, the web archiving team are delighted to be involved in this important new project, which builds on two earlier collaborative projects, both funded by the JISC: the AADDA project (with the IHR) and (at the Oxford Internet Institute) Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research.

The web archive holdings of the British Library for the UK consist of three main parts, each compiled under different conditions at different times, and accessed in different ways. One of the intended long-term outcomes of this project is to bring some of these access arrangements closer together.

The first of these three sets of data is the Open UK Web Archive, accessible at webarchive.org.uk. This is a selective archive, consisting of resources selected by curators at the Library and at partner organisations. The archived sites are made available publicly by permission of the site owners.

Since April 2013 the Library, in partnership with the five other legal deposit libraries for the UK, has had the right to receive a copy of all non-print publications from the UK, under a framework known as Non-Print Legal Deposit. As I write, the second “domain crawl” is under way: our annual archiving of all the web content we know to be from the UK. The Legal Deposit UK Web Archive is, however, only available to users on premises controlled by one of the legal deposit libraries.

The third component of our holdings is the one with which this project is concerned: the JISC UK Web Domain Dataset. This is a copy of the holdings of the Internet Archive for the .uk top level domain for the period 1996 to 2013. The search and analysis interface for this is not yet publicly available, although individual items within it are available from the Internet Archive’s own site if you know the URL you need. There are also several datasets derived from it available for download on a public domain basis.

Although not a complete record of the UK web for that period, it is the most comprehensive such archive in existence. We are delighted to be working with arts and humanities researchers to develop the next generation of search and analysis tools to interrogate this unique dataset. Over time, those new tools should also greatly enhance the ways in which users can work with all three components of our web archives for the UK.