Collect, preserve, access – 10 years of archiving the online government record

On 21 January 2014, Suzy Espley, Tom Storrar and Simon Demissie from The National Archives gave a fascinating presentation about the UK Government Web Archive at the IHR’s ‘Archives and Society’ seminar. The UKGWA is very different from the dataset with which the Big UK Domain Data for the Arts and Humanities project will be working (derived from the UK domain crawl 1996-2013), in that it is both freely and fully accessible and actively curated by an expert team. (The scale of the wider national domain crawl, and the complex legal framework that is currently in place, prohibit access and intervention of this kind, something which is true of most national web archives, Portugal being a notable exception.)

The UKGWA, we learnt, consists of more than 80 terabytes of data and is accessed by over 20 million users a month. The latter figure in particular is very impressive, as well as being suggestive of the interest in and value of web archives generally. While the earliest piece of born-digital content dates from January 1994, coverage really begins from 1996 (as is true of both the Internet Archive and the data held by the British Library). Around 70 websites a month are crawled, and there are additional crawls ahead of planned website closures or major changes (for example, a change of government).

The process for adding websites to the archive is complex, both technically and from a collection development standpoint. Quality assurance involves comparison with the live site that has been crawled, to ensure that all content and functionality have transferred correctly. Manual intervention may be required at this point, although the team do try to reduce this by providing guidance to government departments about how to design a website so that it can be crawled effectively. Technical challenges are constantly emerging, and social media and audio-visual materials are already problematic. A range of social media content embedded within gov.uk websites has to be archived, and it has been necessary to build a bespoke video player to capture moving image material. Interestingly, while developing a solution for capturing Twitter, the team have devised a way to show where TinyURLs resolve to, although these external sites themselves remain outside the scope of the collection.

Decisions about which sites to include are equally difficult. The ownership boundary between governmental, party political and private or charitable organisations, for example, is often hard to determine when considering sites for the collection. In a relatively new field, there is very little guidance available, a problem compounded by the speed with which decisions may have to be taken.

Turning to the way in which researchers engage with the UKGWA, it’s clear that a great deal of thought has gone into supporting informed use. The words ‘[ARCHIVED CONTENT]’ appear in the page title (and in Google search results); the date of the crawl forms part of the URL; and a banner at the top of each page makes it obvious that this is not the live web. The relatively controlled nature of government web space also means that seamless redirection to an archived page within the UKGWA is possible, rather than the clicked link resulting in a 404 error page (a bridging page explains that the user is now moving into the archived web).
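
For researchers who handle archived pages programmatically, these markers can be read by a script as well as by eye. What follows is only a minimal sketch in Python, not a description of how the UKGWA itself works. It assumes that the crawl date appears in the URL as a 14-digit timestamp (YYYYMMDDhhmmss) immediately before the original address, a layout that was not spelled out in the presentation. The function names and the example address are hypothetical.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# Assumption: the crawl date is a 14-digit path segment (YYYYMMDDhhmmss)
# placed directly before the original URL. This layout is not confirmed
# by the presentation and may differ in practice.
TIMESTAMP_SEGMENT = re.compile(r"/(\d{14})/(https?://.+)$")


def parse_archived_url(url: str) -> Optional[Tuple[datetime, str]]:
    """Return (crawl time, original URL), or None if no timestamp is found."""
    match = TIMESTAMP_SEGMENT.search(url)
    if match is None:
        return None
    crawl_time = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return crawl_time, match.group(2)


def has_archived_marker(page_title: str) -> bool:
    """Check for the '[ARCHIVED CONTENT]' label described in the talk."""
    return "[ARCHIVED CONTENT]" in page_title


# Hypothetical example address, for illustration only.
example = (
    "http://webarchive.nationalarchives.gov.uk/"
    "20100512000000/http://www.example.gov.uk/about"
)
parsed = parse_archived_url(example)
if parsed is not None:
    crawl_time, original_url = parsed
    print(crawl_time.date(), original_url)
```

A check of this kind would let a researcher record the date of capture alongside each page cited, which matters when the same address has been crawled many times.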

The presentation ended with some suggestions from Simon Demissie about the value of web archives for researchers, most often as one element in a range of sources. Already, and this is likely increasingly to be the case, a website may be the only form of a record transferred to The National Archives – there will be no paper equivalent. The 2011 London riots illustrated the point nicely. Complementing formal parliamentary discussion and government responses are a host of online materials such as the Riots Communities and Victims Panel website which would otherwise be lost. Intriguingly, too, Freedom of Information requests which result in answers being published online are bringing into the public domain material which is still closed in paper format.

Suzy Espley ended the seminar with a suggested research project – a study of central government departments in the six months before and after the 2010 general election in the UK. An analysis of the scale and nature of the change would offer new insight into the political process, and the way in which it is communicated online. Such a project would also, rather neatly, complement another IHR project, ‘Digging into Linked Parliamentary Data’, which, among other things, will be examining changes in political language before and after elections in the UK, Canada and the Netherlands.

I’m very grateful to Suzy Espley for helpful feedback on this post.

Announcing the project

We are delighted to have been awarded AHRC funding for a new research project, ‘Big UK Domain Data for the Arts and Humanities’. The project aims to transform the way in which researchers in the arts and humanities engage with the archived web, focusing on data derived from the UK web domain crawl for the period 1996-2013. Web archives are an increasingly important resource for arts and humanities researchers, yet we have neither the expertise nor the tools to use them effectively. Both the data itself, totalling approximately 65 terabytes and constituting many billions of words, and the process of collection are poorly understood, and it is possible to draw only the broadest of conclusions from current analysis.

A key objective of the project will be to develop a theoretical and methodological framework within which to study this data, which will be applicable to the much larger on-going UK domain crawl, as well as in other national contexts. Researchers will work with developers at the British Library to co-produce tools which will support their requirements, testing different methods and approaches. In addition, a major study of the history of UK web space from 1996 to 2013 will be complemented by a series of small research projects from a range of disciplines, for example contemporary history, literature, gender studies and material culture.

The project, one of 21 to be funded as part of the AHRC’s Big Data Projects call, is a collaboration between the Institute of Historical Research, University of London, the British Library, the Oxford Internet Institute and Aarhus University.