Web archives at the British Library

This is a post by one of the project team members at the British Library. Peter Webster is Web Archiving Engagement and Liaison Manager. Peter writes:

At the British Library, the web archiving team are delighted to be involved in this important new project, which builds on two earlier collaborative projects, both funded by the JISC: the AADDA project (with the IHR) and, at the Oxford Internet Institute, Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research.

The web archive holdings of the British Library for the UK consist of three main parts, each compiled under different conditions at different times, and accessed in different ways. One of the long-term outcomes of this project is to help us bring some of these access arrangements closer together.

The first of these three sets of data is the Open UK Web Archive, accessible at webarchive.org.uk. This is a selective archive, consisting of resources selected by curators at the Library and among partner organisations. The archived sites are made available publicly by permission of the site owners.

Since April 2013 the Library, in partnership with the five other legal deposit libraries for the UK, has had the right to receive a copy of all non-print publications from the UK, a framework known as Non-Print Legal Deposit. As I write, the second “domain crawl” is under way: our annual archiving of all the web content we know to be from the UK. The Legal Deposit UK Web Archive is, however, only available to users on premises controlled by one of the legal deposit libraries.

The third component of our holdings is the one with which this project is concerned: the JISC UK Web Domain Dataset. This is a copy of the holdings of the Internet Archive for the .uk top level domain for the period 1996 to 2013. The search and analysis interface for this is not yet publicly available, although individual items within it are available from the Internet Archive’s own site if you know the URL you need. There are also several datasets derived from it available for download on a public domain basis.

Although not a complete record of the UK web for that period, it is the most comprehensive such archive in existence. We are delighted to be working with arts and humanities researchers to develop the next generation of search and analysis tools to interrogate this unique dataset. Over time, those new tools should also greatly enhance the ways in which users can work with all three components of our web archives for the UK.

Recreational bugs: the limits of representing the past through web archives

This is a post from team member Josh Cowls, cross-posted from his blog. Josh is also on Twitter: @JoshCowls.

I am in Aarhus this week for the ‘Web Archiving and Archived Web’ seminar organised by Netlab at Aarhus University. Before the seminar got underway, I had time to walk around ‘The Old Town’ (Den Gamle By), a vibrant, open-air reconstruction of historic Danish buildings from the eighteenth century to the present. The Old Town is described as an open-air museum, but in many ways it’s much more than that: it’s filled with actors who walk around impersonating townsfolk from across history, interacting with guests to bring the old town more vividly to life.

[Image: a window in Aarhus (photo: Josh Cowls)]

As it turned out, an afternoon spent in The Old Town has provided a great theoretical context for the web archiving seminar. Interest in web archiving has grown significantly in recent years, as the breadth of participants – scholars, curators and technologists – represented at the seminar shows. But web archiving remains replete with methodological and theoretical questions which are only starting to be answered.

One of the major themes already emerging from the discussions relates to exactly how the act of web archiving is conceptualised. A popular myth is that web archives, particularly those accessible in the Internet Archive through the Wayback Machine interface, serve as direct, faithful representations of the web as it used to be. It’s easy to see why this view can seem seductive: the Wayback Machine offers an often seamless experience which functions much like the live web itself: enter a URL, select a date, surf the web of the past. Yet, as everyone at the seminar already knows painfully well, there are myriad reasons why this is a false assumption. Even within a single archive, plenty of discrepancies emerge, in terms of when each page was archived, exactly what was archived, and so on. Combining data from multiple archives is exponentially more problematic still.

Moreover, the ‘Web 2.0’ platforms, such as social networks, that have transformed the live web experience have proved difficult to archive. Web archiving emerged to suit the ‘Web 1.0’ era, a primarily ‘old media’ environment of text, still images and other content joined together, crucially, by hyperlinks. But with more people sharing more of their lives online with more sophisticated expressive technology, the data flowing over the Internet is of a qualitatively richer variety. Some of the more dramatic outcomes of this shift have already emerged – such as Edward Snowden’s explosive NSA revelations, or the incredible value of personal data to corporations – but the broader implications of this data for our understanding of society are still emerging.

Web archives may be one of the ways in which scholars of the present and future learn more about contemporary society. Yet the potential this offers must be accompanied by a keener understanding of what archives do and don’t represent. Most fundamentally, the myth that web archives faithfully represent the web as it was needs to be exposed and explained. Web archives can be a more or less accurate representation of the web of the past, but they can never be a perfect copy. The ‘Old Town’ in Aarhus is a great recreation of the past, but I was obviously never under the illusion that I was actually seeing the past – those costumed townsfolk were actors after all. I was always instinctively aware, moreover, that the museum’s curators affected what I saw. Web archives offer no such obvious cues, and given the seemingly neutral nature implied by the term ‘archive’, this trap is more easily fallen into in their case. Understanding that web archives, however seamless, will never be a perfectly faithful recreation of the experience of users at the time – or put even more simply, that these efforts are always a recreation and not the original experience itself – is an important first step in a more appropriate appreciation of the opportunities that they offer.

Moreover, occasions like this seminar give scholars at the forefront of preserving and using archived material from the web a chance to reflect on the significance of the design decisions taken now around data capture and analysis for generations of researchers in future. History may be written by the victors, but web history is facilitated, in essence, by predictors: those charged with anticipating exactly which data, tools and techniques will be most valuable to posterity.

How big is the UK web?

The British Library is about to embark on its annual task of archiving the entire UK web space. We will be pushing the button, sending out our bots to crawl every British domain for storage in the UK Legal Deposit web archive. How much will we capture? Even our experts can only make an educated guess.

You’ve probably played the time-honoured village fete game: guess how many jelly beans are in the jar, and the winner gets a prize. Well, perhaps we can ask you to guess the size of the UK internet, and the nearest gets … the glory of being right. Some facts from last year might help.

2013 Web Crawl
In 2013 the Library conducted the first crawl of all .uk websites. We started with 3.86 million seeds (websites), which led to the capture of 1.9 billion URLs (web pages, documents, images). All this resulted in 30.84 terabytes (TB) of data! It took the Library’s robots 70 days to collect.
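For a rough sense of scale, a back-of-envelope calculation (our own illustration, not part of the original crawl report) shows what those figures imply per capture:

```python
# Rough back-of-envelope sketch based on the 2013 figures quoted above.
total_bytes = 30.84e12   # 30.84 TB of data collected
captures = 1.9e9         # 1.9 billion URLs captured
print(f"~{total_bytes / captures / 1024:.0f} KiB per capture on average")
# prints roughly 16 KiB per archived URL, before any de-duplication
```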

Geolocation
In addition to the .uk domains, the Library has the scope to collect websites that are hosted in the UK, so we will attempt to geolocate IP addresses within the geographical confines of the UK. This means that we will be pulling in websites from .com, .net, .info and many other top-level domains (TLDs). How many extra websites? How much data? We just don’t know at this time.
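As an illustration of the kind of check involved (a minimal sketch only, not the Library’s actual crawler code, and assuming a local copy of a country-level GeoIP database such as MaxMind’s free GeoLite2):

```python
import socket
import geoip2.database  # pip install geoip2; needs a local GeoLite2-Country.mmdb file
import geoip2.errors

def hosted_in_uk(hostname, reader):
    """Resolve a hostname and test whether its IP address geolocates to the UK."""
    try:
        ip = socket.gethostbyname(hostname)
        return reader.country(ip).country.iso_code == "GB"
    except (socket.gaierror, geoip2.errors.AddressNotFoundError):
        return False

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")
for host in ["example.com", "example.net"]:  # hypothetical candidate non-.uk hosts
    print(host, hosted_in_uk(host, reader))
reader.close()
```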

De-duplication
A huge issue in collecting the web is the large number of duplicates that are captured and saved, something that can add a great deal to the volume collected. Of the 1.9 billion URLs captured, a significant number are probably copies, and our technical team have worked hard this time to reduce this by ‘de-duplicating’. We are, however, uncertain at the moment as to how much effect this will eventually have on the total volume of data collected.
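One common approach (a simplified sketch of content-digest de-duplication in general, not a description of the Library’s actual pipeline) is to hash each captured payload and store identical content only once, recording later captures as ‘revisits’ of the original, much as the WARC format’s revisit records do:

```python
import hashlib

seen = {}  # payload digest -> URL of the capture that first stored this content

def classify_capture(url, payload: bytes):
    """Return ('response', url) for new content, or ('revisit', original_url) for a duplicate."""
    digest = hashlib.sha1(payload).hexdigest()
    if digest in seen:
        return ("revisit", seen[digest])   # identical content: don't store the bytes again
    seen[digest] = url
    return ("response", url)               # new content: store it in full
```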

Predictions
In summary then, in 2014 we will be looking to collect all of the .uk domain names plus all the websites that we can find that are hosted in the UK (.com, .net, .info etc.), overall a big increase in the number of ‘seeds’ (websites). It is hard, however, to predict what effect these changes will have compared to last year. What the final numbers might be is anyone’s guess. What do you think?

Let us know in the comments below, or on Twitter (@UKWebArchive), YOUR predictions for 2014: the number of URLs, the size in terabytes (TB) and (if you are feeling very brave) the number of hosts. Organisations like the BBC and NHS each consist of lots of websites but count as one ‘host’.

We want:

  • URLs (in billions)
  • Size (in terabytes)
  • Hosts (in millions)

#UKWebCrawl2014

We will announce the winner when all the data is safely on our servers sometime in the summer. Good luck.

First posted at http://britishlibrary.typepad.co.uk/webarchive/2014/06/how-big-is-the-uk-web.html by Jason Webber, 2 June 2014

Welcome to our 11 bursary holders

One of the main aims of the project is to involve arts and humanities researchers in the development of tools for analysing web archives, thereby ensuring that those tools meet real rather than perceived researcher needs. We recently ran an open competition inviting researchers to submit proposals across a range of disciplines which focus on the archived web, and have selected 11 from a tremendously strong and varied set of applications. The topics that will be studied over the next eight months are:

  • Rowan Aust – Tracing notions of heritage
  • Rona Cran – Beat literature in the contemporary imagination
  • Richard Deswarte – Revealing British Euroscepticism in the UK web domain and archive
  • Chris Fryer – The UK Parliament Web Archive
  • Saskia Huc-Hepher – An ethnosemiotic study of London French habitus as displayed in blogs
  • Alison Kay – Capture, commemoration and the citizen-historian: Digital Shoebox archives relating to P.O.W.s in the Second World War
  • Gareth Millward – Digital barriers and the accessible web: disabled people, information and the internet
  • Marta Musso – A history of the online presence of UK companies
  • Harry Raffal – The Ministry of Defence’s online development and strategy for recruitment between 1996 and 2013
  • Lorna Richardson – Public archaeology: a digital perspective
  • Helen Taylor – Do online networks exist for the poetry community?

We very much look forward to working with our bursary holders over the coming months, and will be showcasing some of their research findings on this blog.

Preserving the present: the unique challenges of archiving the web

This post is by project team member Josh Cowls of the OII.

In March 2012, as Mitt Romney was seeking to win over conservative voters in his bid to become the Republican Party’s presidential nominee, his adviser Eric Fehrnstrom discussed concerns over his appeal to moderate voters later in the campaign, telling a CNN interviewer, “For the fall campaign … everything changes. It’s almost like an Etch A Sketch. You can kind of shake it up, and we start all over again.” Fehrnstrom’s unfortunate response provided a memorable metaphor for the existing perception of Romney as a ‘flip-flopper’. Fehrnstrom’s opposite number in the Obama campaign, David Axelrod, would later jibe that “it’s hard to Etch-A-Sketch the truth away”, and indeed, tying Romney to his less appetising positions and comments formed a core component of the President’s successful re-election strategy.

[Image: Etch A Sketch Animator]

Clearly, in the harsh spotlight of an American presidential election, when a candidate’s every utterance is recorded, it is indeed “hard to Etch-A-Sketch the truth away”. Yet even in our digital era, a time at which – as recent revelations have suggested – vast hoards of our communication records may be captured every day, the Romney example is more the exception than the rule. In fact, even at the highest levels and in the most important contexts, it can be surprisingly easy for digitised information to simply go missing or at least become inaccessible: the web is more of an Etch-A-Sketch than it might appear.

Take the case of the Conservative Party’s attempts to block access to political material pre-dating the 2010 general election. It remains unclear whether these efforts were thoroughly Machiavellian or rather less malign (and in any case a secondary archive continued to provide access to the materials). Regardless, the incident certainly challenged the prevailing assumption that all materials which were once online will stay there.

In fact, the whole notion of staying there on the web is an illogical one. Certainly, the web has democratised the distribution of information: publishing used to be the preserve of anyone rich enough to own a printing press, but with the advent of the web, all it takes to publish virtual content is a decent blogging platform. Yet it’s crucial to remember that the exponential growth in the number of publishers online does not mean that the underlying process of publishing has entirely changed.

Although many prominent social media sites mask this well, there is still a core distinction on any web page between writer and reader. In fact, this distinction is baked into the DNA of the web: any user can freely browse the web through the system of URLs, but each individual site is operated, and its HTML code modified, by a specific authorised user. (Of course, there are certain sites like wikis which do allow any user to make substantial edits to a page.) As such, the distinction between writer and reader remains relevant on the web today.

In fact, there is at least one way in which the web entrenches more rather than less control in the hands of publishers as compared with traditional media. Spared the need to make a physical copy of their work, publishers can make changes to published content without leaving a shred of evidence: I might have removed a typo from the previous paragraph a minute before you loaded this page, and you’d never know the difference. And it’s not only trivial changes to a single web page, but also the wholesale removal of entire websites and other electronic resources which can pass unnoticed online. At a time when more and more aspects of life take place on the Internet, the importance of this to both academics and the public more broadly is becoming increasingly clear.

This of course is where the practice of web archiving comes in. I’m of the belief that web archiving should be conceived as broadly as possible, namely as any attempt to preserve any part of the web at any particular time. Contained within the scope of this definition is a huge range of different archiving activities, from a screenshot of a single web page to a sweep of an entire national web domain or the web in its entirety. Given the huge technical constraints involved, difficult decisions usually have to be made in choosing exactly what to archive; the core tension is often between breadth of coverage and depth in terms of snapshot frequency and the handling of technically complicated objects, for example. These decisions will affect exactly how the archived web will look in the future. 
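At the ‘single page’ end of that spectrum, the basic act of preservation can be sketched in a few lines (a toy illustration only; institutional crawlers work at far greater scale and write standardised WARC files rather than loose HTML):

```python
import datetime
import pathlib
import requests  # pip install requests

def snapshot(url, out_dir="snapshots"):
    """Save one page's HTML alongside a capture timestamp: web archiving at its simplest."""
    response = requests.get(url, timeout=30)
    timestamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    safe_name = url.replace("://", "_").replace("/", "_")
    path = pathlib.Path(out_dir) / f"{timestamp}_{safe_name}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(response.content)
    return path

# snapshot("http://www.example.com/")
```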

Yet our discussion of the challenges around web archiving shouldn’t take place in a vacuum. Certainly the archiving of printed records comes with its own challenges too, not least over access: in the case of Soviet Russia, for example, it was only after the Cold War had finished that archives were opened to historians, and then only partially. Web archives in contrast have the virtue that they can be – and typically are – made freely available over the web itself for analysis. And, just as with the spurt of scholarship that followed the opening of the Soviet archives, we should be sure to see the preservation of web archives not merely as a challenge but also as an opportunity. Analysing web archives can enhance our ability to talk about the first twenty-five years of life on the web – and unearth new insights about society more generally.

A central purpose of this project is to support the work of scholars at the cutting edge of exactly this sort of research. There are just a couple more days to submit a proposal for one of the project’s research bursaries; see here for more details.

Twenty-five years of the web

[Image: Shutterstock 185840144 © Evlakhov Valeriy]

12 March 2014 marked the 25th birthday of the web. As you would expect, there was a great deal of coverage online, both in relatively formal reporting contexts (e.g. newspaper interviews with Sir Tim Berners-Lee) and in social media. The approach taken by Nominet (one of the major internet registry companies) was among the most interesting. It published a brief report (The Story of the Web: Celebrating 25 Years of the World Wide Web) and a rather nice timeline of the web’s defining moments. The report, written by Jack Schofield, reminds us that Yahoo! (with that exclamation mark) ‘became the first web giant’ (p. 5); that Netscape Navigator dominated web browsing in the early years, and indeed ‘almost became synonymous with the web’ (p. 5); and that Google has only been part of our lives since 1997, Wikipedia since 2001 (pp. 6, 7). It concludes that ‘The web is now so deeply engrained in modern life that the issue isn’t whether people will leave, but how long it will take for the next two billion to join us’.

All of this is not just nostalgia – it will be impossible for historians to understand life in the late 20th and early 21st century without studying how the internet and the web have shaped our lives, for better and worse. This analysis requires that the web – ephemeral by its very nature – be archived. We have already lost some of our web history. The web is 25 years old, but the Internet Archive only began to collect website snapshots in 1996, that is, 18 years ago. The Institute of Historical Research launched its first website (then described as a hypertext internet server) in August 1993, but it was first captured by the Wayback Machine only in December 1996. At the time of writing, it has been saved 192 times, with the last capture occurring on 30 October 2013. Without the work of the Internet Archive, and now national institutions such as the National Archives and the British Library in the UK, we would not have any of this data. Researchers and web archivists can work together to ensure that in 2039, we will have 50 years’ worth of primary source materials to work with.
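Capture histories of this kind can also be checked programmatically. A minimal sketch using the Internet Archive’s public CDX API (the IHR’s current address, www.history.ac.uk, is used here purely as an example; the counts quoted above were simply read off the Wayback Machine at the time of writing):

```python
import requests  # pip install requests

# Ask the Internet Archive's CDX API for every capture of a given URL.
resp = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={"url": "www.history.ac.uk", "output": "json", "fl": "timestamp"},
    timeout=60,
)
rows = resp.json()
timestamps = [row[0] for row in rows[1:]]  # the first row is a header
print(f"{len(timestamps)} captures; first {timestamps[0]}, latest {timestamps[-1]}")
```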

Our first workshop

On 26 February we held a very successful half-day workshop on web archiving and research. Despite the blue skies and sun over London – something almost lost to living memory – about 40 people took part in the event.

The Principal Investigator, Jane Winters, introduced the day by emphasising how keen we are to receive applications for our bursaries. A strong focus of the whole workshop was to explain what could be done with a web archive in terms of providing evidence for researchers (its pitfalls as well as its benefits), to explain about the bursaries we are offering, and to answer any questions from potential applicants.

Peter Webster of the British Library then talked through the various incarnations of the UK web domain archive that he and his colleagues curate, as well as the progress made on tools and an interface to the ‘dark archive’ produced by a previous project, AADDA. Peter enlivened his talk with some examples from his own research, using the web evidence of the furore created by the former Archbishop of Canterbury’s 2008 comments on sharia law.

Josh Cowls of the Oxford Internet Institute gave a taste of some of the work the OII is doing on mapping the UK web domain’s history, by, for example, analysing the way links between domains such as ac.uk and co.uk have changed over time.

Many of the researchers who took part in the AADDA project attended the workshop and one of them, Richard Deswarte, described the research he had done for that project, looking at Euroscepticism. Richard then asked other members of the research group to describe their own experiences: all of these seemed to follow a similar trajectory of initial uncertainty, followed by great excitement about the possibilities of web archives as a research tool, and finally some recalibrating of expectations as the technical impediments became apparent.

Suzy Espley and Tom Storrar of the National Archives gave an introduction to the work the TNA is doing in archiving the UK government webspace. The contrast between this archive and the UK domain archive is interesting: we might think of the former as narrow and deep and the latter as broader and shallower. The TNA has been developing its archive and offering an interface to the public for longer, and it was encouraging to learn that it is now accessed 20 million times a month: proof that there is a great appetite for web archives.

The final speaker was Niels Brügger, who had come all the way from Aarhus to give our keynote presentation. Niels is our consultant on the project and, as a founder member of the RESAW project (which seeks to foster the study of national web archives), was ideally placed to address a room full of researchers. Niels explained that a complete archival copy of a national web is an impossibility: the web is constantly changing in all dimensions. For example, if an archival copy of a web page is taken at a particular moment, is it necessary to have archival copies of everything that page linked to? These linked pages, if archived at all, may have been captured at different times from the page of origin. Niels raised many other interesting questions, which represent not so much hindrances to web archive research as things that must be borne in mind by the researcher.

We finished the afternoon with a wide-ranging discussion, and a final encouragement from the organisers to consider applying for a bursary. But the bursaries are by no means restricted to those who attended the workshop, so check out the link above if you may be interested in applying. Applications close on 25 April 2014.

Collect, preserve, access – 10 years of archiving the online government record

On 21 January 2014, Suzy Espley, Tom Storrar and Simon Demissie from The National Archives gave a fascinating presentation about the UK Government Web Archive at the IHR’s ‘Archives and Society’ seminar. The UKGWA is very different from the dataset with which the Big UK Domain Data for the Arts and Humanities project will be working (derived from the UK domain crawl 1996-2013), in that it is both freely and fully accessible and actively curated by an expert team. (The scale of the wider national domain crawl, and the complex legal framework that is currently in place, prohibit access and intervention of this kind, something which is true of most national web archives, Portugal being a notable exception).

The UKGWA, we learnt, consists of more than 80 terabytes of data and is accessed by over 20 million users a month. The latter figure in particular is very impressive, as well as being suggestive of the interest in and value of web archives generally. While the earliest piece of born-digital content dates from January 1994, coverage really begins from 1996 (as is true of both the Internet Archive and the data held by the British Library). Around 70 websites a month are crawled, and there are additional crawls ahead of planned website closures or major changes (for example, of government).

The process for adding websites to the archive is complex, both technically and from a collection development standpoint. Quality assurance involves comparison with the live site that has been crawled, to ensure that all content and functionality have transferred correctly. Manual intervention may be required at this point, although the team do try to reduce this by providing guidance to government departments about how to design a website so that it can be crawled effectively. Technical challenges are constantly emerging, and social media and audio-visual materials are already problematic. A range of social media content embedded within gov.uk websites has to be archived, and it has been necessary to build a bespoke video player to capture moving image material. Interestingly, while developing a solution for capturing Twitter, the team have devised a way to show where TinyURLs resolve to, although these external sites themselves remain outside the scope of the collection.
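The general technique for resolving a shortened link is simply to follow its redirect chain to the final destination. A minimal sketch (the UKGWA team’s actual tooling was not described in detail, so this is an illustration of the idea only):

```python
import requests  # pip install requests

def resolve_short_url(short_url):
    """Follow the redirect chain from a URL-shortener link and return the final destination."""
    resp = requests.head(short_url, allow_redirects=True, timeout=30)
    return resp.url

# e.g. resolve_short_url("https://tinyurl.com/example")  # hypothetical short link
```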

Decisions about which sites to include are equally difficult. The ownership boundary between governmental, party political and private or charitable organisations, for example, is often hard to determine when considering sites for the collection. In a relatively new field, there is very little guidance available, a problem compounded by the speed with which decisions may have to be taken.

Turning to the way in which researchers engage with the UKGWA, it’s clear that a great deal of thought has gone into supporting informed use. The words ‘[ARCHIVED CONTENT]’ appear in the page title (and in Google search results); the date of the crawl forms part of the URL; and a banner at the top of each page makes it obvious that this is not the live web. The relatively controlled nature of government web space also means that seamless redirection to an archived page within the UKGWA is possible, rather than the clicked link resulting in a 404 error page (a bridging page explains that the user is now moving into the archived web).
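Because the crawl date is embedded in the URL, it can be recovered programmatically. A small sketch, assuming a Wayback-style URL layout with a 14-digit timestamp in the path (the example address below is illustrative, not an actual archived page):

```python
import re
from datetime import datetime

def capture_date(archived_url):
    """Pull the 14-digit capture timestamp out of a Wayback-style archive URL."""
    match = re.search(r"/(\d{14})/", archived_url)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S") if match else None

print(capture_date(
    "http://webarchive.nationalarchives.gov.uk/20100401120000/http://www.example.gov.uk/"
))  # -> 2010-04-01 12:00:00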

The presentation ended with some suggestions from Simon Demissie about the value of web archives for researchers, most often as one element in a range of sources. Already, and this is increasingly likely to be the case, a website may be the only form of a record transferred to The National Archives – there will be no paper equivalent. The 2011 London riots illustrated the point nicely. Complementing formal parliamentary discussion and government responses is a host of online materials, such as the Riots Communities and Victims Panel website, which would otherwise be lost. Intriguingly, too, Freedom of Information requests which result in answers being published online are bringing into the public domain material which is still closed in paper format.

Suzy Espley ended the seminar with a suggested research project – a study of central government departments in the six months before and after the 2010 general election in the UK. An analysis of the scale and nature of the change would offer new insight into the political process, and the way in which it is communicated online. Such a project would also, rather neatly, complement another IHR project, ‘Digging into Linked Parliamentary Data’, which, among other things, will be examining changes in political language before and after elections in the UK, Canada and the Netherlands.

I’m very grateful to Suzy Espley for helpful feedback on this post.

Announcing the project

We are delighted to have been awarded AHRC funding for a new research project, ‘Big UK Domain Data for the Arts and Humanities’. The project aims to transform the way in which researchers in the arts and humanities engage with the archived web, focusing on data derived from the UK web domain crawl for the period 1996-2013. Web archives are an increasingly important resource for arts and humanities researchers, yet we have neither the expertise nor the tools to use them effectively. Both the data itself, totalling approximately 65 terabytes and constituting many billions of words, and the process of collection are poorly understood, and it is possible only to draw the broadest of conclusions from current analysis.

A key objective of the project will be to develop a theoretical and methodological framework within which to study this data, which will be applicable to the much larger on-going UK domain crawl, as well as in other national contexts. Researchers will work with developers at the British Library to co-produce tools which will support their requirements, testing different methods and approaches. In addition, a major study of the history of UK web space from 1996 to 2013 will be complemented by a series of small research projects from a range of disciplines, for example contemporary history, literature, gender studies and material culture.

The project, one of 21 to be funded as part of the AHRC’s Big Data Projects call, is a collaboration between the Institute of Historical Research, University of London, the British Library, the Oxford Internet Institute and Aarhus University.