How big is the UK web?

The British Library is about to embark on its annual task of archiving the entire UK web space. We will be pushing the button, sending out our ‘bots to crawl every British domain for storage in the UK Legal Deposit Web Archive. How much will we capture? Even our experts can only make an educated guess.

You’ve probably played the time-honoured village fete game: guess how many jelly beans are in the jar, and the winner gets a prize. Well, perhaps we can ask you to guess the size of the UK web, and the nearest guess gets … the glory of being right. Some facts from last year might help.

2013 Web Crawl
In 2013 the Library conducted the first crawl of all .uk websites. We started with 3.86 million seeds (websites), which led to the capture of 1.9 billion URLs (web pages, documents, images) and resulted in 30.84 terabytes (TB) of data. It took the Library’s robots 70 days to collect it all.
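For a rough sense of scale, here is a quick back-of-the-envelope calculation in Python, using only the figures quoted above:

```python
# Back-of-the-envelope averages from the 2013 crawl figures above.
seeds = 3.86e6   # websites used as crawl seeds
urls = 1.9e9     # URLs captured (pages, documents, images)
data_tb = 30.84  # total data collected, in terabytes
days = 70        # crawl duration

print(f"URLs per seed:    {urls / seeds:,.0f}")                       # ~492
print(f"Avg size per URL: {data_tb * 1e12 / urls / 1024:,.1f} KiB")   # ~15.9
print(f"Data per day:     {data_tb / days:.2f} TB")                   # ~0.44
```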

Geolocation
In addition to the .uk domains, the Library has the scope to collect websites that are hosted in the UK, so we will attempt to geolocate IP addresses within the geographical confines of the UK. This means that we will be pulling in .com, .net, .info and many other top-level domains (TLDs). How many extra websites? How much data? We just don’t know at this time.
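We haven’t described our geolocation pipeline here, but the general idea can be sketched quite simply. The sketch below assumes a MaxMind-style country database and the Python geoip2 package; neither is necessarily what our crawler actually uses.

```python
# A minimal sketch of IP geolocation: resolve a hostname, then look its
# address up in a country database. Illustrative only -- the GeoLite2
# database and geoip2 package are assumed stand-ins, not the Library's
# actual tooling.
import socket
import geoip2.database
import geoip2.errors

def is_hosted_in_uk(hostname: str, reader: geoip2.database.Reader) -> bool:
    """Resolve a hostname and check whether its IP geolocates to GB."""
    ip = socket.gethostbyname(hostname)
    try:
        return reader.country(ip).country.iso_code == "GB"
    except geoip2.errors.AddressNotFoundError:
        return False

with geoip2.database.Reader("GeoLite2-Country.mmdb") as reader:
    print(is_hosted_in_uk("example.com", reader))
```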

De-duplication
A huge issue in collecting the web is the large number of duplicates that are captured and saved, which can add a great deal to the volume collected. Of the 1.9 billion URLs captured in 2013, a significant number were probably copies, and our technical team have worked hard this time to reduce this through ‘de-duplication’. We are, however, uncertain at the moment how much effect this will eventually have on the total volume of data collected.
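The core idea behind de-duplication is content-digest matching: hash each captured payload and store only the first copy, recording a lightweight ‘revisit’ for any later identical capture. A minimal, purely illustrative Python sketch (not our crawler’s actual code):

```python
# Store a payload only the first time its content hash is seen; record a
# lightweight "revisit" otherwise. Illustrative only.
import hashlib

seen_digests: set[str] = set()

def store_or_skip(url: str, payload: bytes) -> str:
    digest = hashlib.sha1(payload).hexdigest()
    if digest in seen_digests:
        return f"revisit {url} (duplicate of digest {digest[:12]}...)"
    seen_digests.add(digest)
    return f"stored  {url} ({len(payload)} bytes, digest {digest[:12]}...)"

print(store_or_skip("http://example.co.uk/a", b"<html>hello</html>"))
print(store_or_skip("http://example.co.uk/b", b"<html>hello</html>"))  # duplicate
```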

Predictions
In summary then, in 2014 we will be looking to collect all of the .uk domain names plus all the websites that we can find that are hosted in the UK (.com, .net, .info etc.): overall, a big increase in the number of ‘seeds’ (websites). It is hard, however, to predict what effect these changes will have compared with last year. What the final numbers might be is anyone’s guess. What do you think?

Let us know in the comments below, or on Twitter (@UKWebArchive), YOUR predictions for 2014: the number of URLs, the size in terabytes (TB) and (if you are feeling very brave) the number of hosts. Note that organisations like the BBC and the NHS each consist of lots of websites but count as a single ‘host’.

We want:

  • URLs (in billions)
  • Size (in terabytes)
  • Hosts (in millions)

#UKWebCrawl2014

We will announce the winner when all the data is safely on our servers sometime in the summer. Good luck.

First posted at http://britishlibrary.typepad.co.uk/webarchive/2014/06/how-big-is-the-uk-web.html by Jason Webber, 2 June 2014

Welcome to our 11 bursary holders

One of the main aims of the project is to involve arts and humanities researchers in the development of tools for analysing web archives, thereby ensuring that those tools meet real rather than perceived researcher needs. We recently ran an open competition inviting researchers across a range of disciplines to submit proposals focusing on the archived web, and have selected 11 from a tremendously strong and varied set of applications. The topics that will be studied over the next eight months are:

  • Rowan Aust – Tracing notions of heritage
  • Rona Cran – Beat literature in the contemporary imagination
  • Richard Deswarte – Revealing British Euroscepticism in the UK web domain and archive
  • Chris Fryer – The UK Parliament Web Archive
  • Saskia Huc-Hepher – An ethnosemiotic study of London French habitus as displayed in blogs
  • Alison Kay – Capture, commemoration and the citizen-historian: Digital Shoebox archives relating to P.O.W.s in the Second World War
  • Gareth Millward – Digital barriers and the accessible web: disabled people, information and the internet
  • Marta Musso – A history of the online presence of UK companies
  • Harry Raffal – The Ministry of Defence’s online development and strategy for recruitment between 1996 and 2013
  • Lorna Richardson – Public archaeology: a digital perspective
  • Helen Taylor – Do online networks exist for the poetry community?

We very much look forward to working with our bursary holders over the coming months, and will be showcasing some of their research findings on this blog.

Twenty-five years of the web

Image: Shutterstock 185840144 © Evlakhov Valeriy

12 March 2014 marked the 25th birthday of the web. As you would expect, there was a great deal of coverage online, both in relatively formal reporting contexts (e.g. newspaper interviews with Sir Tim Berners-Lee) and in social media. The approach taken by Nominet (the registry for .uk domain names) was among the most interesting. It published a brief report (The Story of the Web: Celebrating 25 Years of the World Wide Web) and a rather nice timeline of the web’s defining moments. The report, written by Jack Schofield, reminds us that Yahoo! (with that exclamation mark) ‘became the first web giant’ (p. 5); that Netscape Navigator dominated web browsing in the early years, and indeed ‘almost became synonymous with the web’ (p. 5); and that Google has only been part of our lives since 1997, Wikipedia since 2001 (pp. 6, 7). It concludes that ‘The web is now so deeply engrained in modern life that the issue isn’t whether people will leave, but how long it will take for the next two billion to join us’.

All of this is not just nostalgia – it will be impossible for historians to understand life in the late 20th and early 21st centuries without studying how the internet and the web have shaped our lives, for better and worse. This analysis requires that the web – ephemeral by its very nature – be archived. We have already lost some of our web history. The web is 25 years old, but the Internet Archive only began to collect website snapshots in 1996, that is, 18 years ago. The Institute of Historical Research launched its first website (then described as a hypertext internet server) in August 1993, but it was first captured by the Wayback Machine only in December 1996. At the time of writing, it has been saved 192 times, with the last capture occurring on 30 October 2013. Without the work of the Internet Archive, and now national institutions such as The National Archives and the British Library in the UK, we would not have any of this data. Researchers and web archivists can work together to ensure that in 2039, we will have 50 years’ worth of primary source materials to work with.

Collect, preserve, access – 10 years of archiving the online government record

On 21 January 2014, Suzy Espley, Tom Storrar and Simon Demissie from The National Archives gave a fascinating presentation about the UK Government Web Archive at the IHR’s ‘Archives and Society’ seminar. The UKGWA is very different from the dataset with which the Big UK Domain Data for the Arts and Humanities project will be working (derived from the UK domain crawl 1996-2013), in that it is both freely and fully accessible and actively curated by an expert team. (The scale of the wider national domain crawl, and the complex legal framework that is currently in place, prohibit access and intervention of this kind, something which is true of most national web archives, Portugal being a notable exception).

The UKGWA, we learnt, consists of more than 80 terabytes of data and is accessed by over 20 million users a month. The latter figure in particular is very impressive, as well as being suggestive of the interest in and value of web archives generally. While the earliest piece of born-digital content dates from January 1994, coverage really begins from 1996 (as is true of both the Internet Archive and the data held by the British Library). Around 70 websites a month are crawled, and there are additional crawls ahead of planned website closures or major changes (for example, a change of government).

The process for adding websites to the archive is complex, both technically and from a collection development standpoint. Quality assurance involves comparison with the live site that has been crawled, to ensure that all content and functionality have transferred correctly. Manual intervention may be required at this point, although the team do try to reduce this by providing guidance to government departments on how to design a website so that it can be crawled effectively. Technical challenges are constantly emerging: social media and audio-visual materials are already problematic. A range of social media content embedded within gov.uk websites has to be archived, and it has been necessary to build a bespoke video player to capture moving image material. Interestingly, while developing a solution for capturing Twitter, the team have devised a way to show where TinyURLs resolve to, although these external sites themselves remain outside the scope of the collection.
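The presentation didn’t go into implementation detail, but resolving a shortened URL generally amounts to following its chain of HTTP redirects. A minimal sketch using the Python requests library – my illustration, not the team’s actual code:

```python
# Follow HTTP redirects from a shortened URL and report the final
# destination. Illustrative only.
import requests

def resolve_short_url(short_url: str) -> str:
    """Follow redirects and return the final destination URL."""
    response = requests.head(short_url, allow_redirects=True, timeout=10)
    return response.url

print(resolve_short_url("https://tinyurl.com/example"))
```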

Decisions about which sites to include are equally difficult. The ownership boundary between governmental, party political and private or charitable organisations, for example, is often hard to determine when considering sites for the collection. In a relatively new field, there is very little guidance available, a problem compounded by the speed with which decisions may have to be taken.

Turning to the way in which researchers engage with the UKGWA, it’s clear that a great deal of thought has gone into supporting informed use. The words ‘[ARCHIVED CONTENT]’ appear in the page title (and in Google search results); the date of the crawl forms part of the URL; and a banner at the top of each page makes it obvious that this is not the live web. The relatively controlled nature of government web space also means that seamless redirection to an archived page within the UKGWA is possible, rather than a clicked link resulting in a 404 error (a bridging page explains that the user is moving into the archived web).
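To illustrate, archived addresses of this kind typically follow a ‘wayback’-style pattern in which the crawl timestamp is embedded in the URL. The sketch below assumes that pattern for the UKGWA; the exact format is my inference from the description above, not an official specification.

```python
# Construct a UKGWA-style archived URL from a crawl timestamp and the
# original address. The wayback-style pattern here is an assumption.
ARCHIVE_PREFIX = "http://webarchive.nationalarchives.gov.uk"

def archived_url(timestamp: str, original: str) -> str:
    """timestamp is YYYYMMDDhhmmss; original is the live URL."""
    return f"{ARCHIVE_PREFIX}/{timestamp}/{original}"

print(archived_url("20100506120000", "http://www.direct.gov.uk/"))
```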

The presentation ended with some suggestions from Simon Demissie about the value of web archives for researchers, most often as one element in a range of sources. Already, and this is likely to be increasingly the case, a website may be the only form of a record transferred to The National Archives; there will be no paper equivalent. The 2011 London riots illustrate the point nicely. Complementing formal parliamentary discussion and government responses is a host of online materials, such as the website of the Riots Communities and Victims Panel, which would otherwise be lost. Intriguingly, too, Freedom of Information requests which result in answers being published online are bringing into the public domain material that is still closed in paper format.

Suzy Espley ended the seminar with a suggested research project: a study of central government departments in the six months before and after the 2010 general election in the UK. An analysis of the scale and nature of the change would offer new insight into the political process, and the way in which it is communicated online. Such a project would also, rather neatly, complement another IHR project, ‘Digging into Linked Parliamentary Data’, which, among other things, will be examining changes in political language before and after elections in the UK, Canada and the Netherlands.

I’m very grateful to Suzy Espley for helpful feedback on this post.

Announcing the project

We are delighted to have been awarded AHRC funding for a new research project, ‘Big UK Domain Data for the Arts and Humanities’. The project aims to transform the way in which researchers in the arts and humanities engage with the archived web, focusing on data derived from the UK web domain crawl for the period 1996-2013. Web archives are an increasingly important resource for arts and humanities researchers, yet we have neither the expertise nor the tools to use them effectively. Both the data itself, totalling approximately 65 terabytes and constituting many billions of words, and the process of collection are poorly understood, and it is possible to draw only the broadest of conclusions from current analyses.

A key objective of the project will be to develop a theoretical and methodological framework within which to study this data, one which will be applicable to the much larger ongoing UK domain crawl, as well as in other national contexts. Researchers will work with developers at the British Library to co-produce tools which will support their requirements, testing different methods and approaches. In addition, a major study of the history of UK web space from 1996 to 2013 will be complemented by a series of small research projects from a range of disciplines, for example contemporary history, literature, gender studies and material culture.

The project, one of 21 to be funded as part of the AHRC’s Big Data Projects call, is a collaboration between the Institute of Historical Research, University of London, the British Library, the Oxford Internet Institute and Aarhus University.