IHR workshop on web archiving - Big UK Domain Data for the Arts and Humanities

On 11 November the IHR held a workshop, ‘An Introduction to Web Archiving for Historians’, for which we welcomed back two old friends from the BUDDAH project as speakers.

The day opened with Jane Winters talking about why historians should be using web archives. You can see the slides of Jane’s talk here, including a couple courtesy of a fascinating presentation from Andy Jackson about the persistence of live web pages. This was followed by Helen Hockx-Yu, formerly of the British Library’s web archiving team but now at the Internet Archive. Helen described the Internet Archive’s mind-boggling scale and its ambitious plans for the future; Helen’s slides are here. Jane then returned to talk about the UK Government Web Archive and Parliament Web Archive (more slides here).

Having heard about various web archives, attendees were able to try the Shine interface for themselves. This is an interface to a huge archive – the UK’s web archive covering 1996 to 2013 – all now searchable as full text. Shine was one of the major outputs of the BUDDAH project and we were delighted to see how fast and responsive it now is, thanks to the continuing work of the web archiving team at the British Library.

Before lunch there was time for Marty Steer to lead a quick canter through the command line tool wget. Marty explained how flexible this tool is for downloading web pages or whole sites (and the importance of using the settings provided to avoid hammering sites with a blizzard of requests). You can even use wget to create complete WARC files. Marty’s presentation, with all of the commands used, can be read here.
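As a rough illustration of the kind of polite settings described above (this is a sketch of our own, not one of the commands from Marty’s presentation), a wget call can be wrapped in a small shell function. The function name polite_fetch, the URL and the output name in the usage line are all placeholders, and the delay and rate values are arbitrary choices rather than recommendations.

```shell
# polite_fetch URL NAME
# Hypothetical helper: download a page and its immediate links slowly,
# and save the traffic to NAME.warc.gz as well as to ordinary files.
polite_fetch() {
  # --wait / --random-wait : pause (roughly 2 seconds, varied) between requests
  # --limit-rate           : cap the download speed
  # --recursive --level=1  : follow links, but only one level deep
  # --no-parent            : never climb above the starting directory
  # --page-requisites      : also fetch the images and stylesheets a page needs
  # --warc-file            : write a WARC file alongside the normal download
  wget --wait=2 --random-wait --limit-rate=200k \
       --recursive --level=1 --no-parent --page-requisites \
       --warc-file="$2" "$1"
}

# Example call (placeholder URL and name; uncomment to run):
# polite_fetch "https://example.org/" "example-capture"
```

Nothing is downloaded until the function is actually called, so it is safe to paste the definition into a terminal and adjust the values first.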

After lunch Rowan Aust of Royal Holloway described her research on the BBC’s reaction to the Jimmy Savile scandal and how it has removed Savile from instances of its web and programme archives. Rowan’s earlier account of the research, written for the BUDDAH project, is on our institutional repository.

Then it was back to the command line, as Jonathan Blaney explained how easy it is to interrogate very large text files by typing a few lines of text. On Mac and Linux machines a fully-featured ‘shell’, bash, is provided by default; for this session, which was run on Windows machines, Jonathan had installed ‘Git bash’, a free, lightweight version of bash (there are useful installation instructions here). The group looked at links to external websites in the Shine dataset, using a random sample of 0.01% of the full link file; this still amounted to about 1.5 million lines (the full file, at 19GB, can be downloaded from the UKWA site). The main command used for this hands-on session was grep, a powerful yet simple search utility which is ideal for searching very large files or large numbers of files.
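To give a flavour of this kind of searching, here is a minimal sketch that can be tried without downloading anything. The file links_sample.txt and the lines in it are invented for illustration; we are only assuming a one-link-per-line text file, and the real link file’s layout may well differ.

```shell
# Build a tiny stand-in for the link file (made-up lines, hypothetical layout).
cat > links_sample.txt <<'EOF'
1996 example.ac.uk www.bbc.co.uk
2003 example.org.uk www.nationalarchives.gov.uk
2010 example.co.uk www.bbc.co.uk
2013 example.ac.uk www.bl.uk
EOF

# Count the lines that mention a given host.
# -c prints a count instead of the lines; -F treats the pattern as plain text.
bbc_count=$(grep -c -F 'bbc.co.uk' links_sample.txt)
echo "$bbc_count"

# Show only the lines that begin with a given year (^ means 'start of line').
grep '^2010 ' links_sample.txt

# Count the lines that end in .gov.uk ($ means 'end of line';
# the backslashes make each dot match a literal full stop).
gov_count=$(grep -c '\.gov\.uk$' links_sample.txt)
echo "$gov_count"
```

The same commands work unchanged on a file of millions of lines, because grep reads its input line by line rather than loading it all into memory.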

The day ended with the group using webrecorder.io, a free online tool which allows the archiving of web pages through a simple and intuitive interface.

We’d like to thank everyone who came to the workshop: this was the first time we had run such an event on web archiving and their enthusiastic participation and constructive feedback have given us the confidence to run this course again in the future.