June 2014 - Big UK Domain Data for the Arts and Humanities

This is a post from team member Josh Cowls, cross-posted from his blog. Josh is also on Twitter: @JoshCowls.

I am in Aarhus this week for the ‘Web Archiving and Archived Web’ seminar organised by Netlab at Aarhus University. Before the seminar got underway, I had time to walk around ‘The Old Town’ (Den Gamle By), a vibrant, open-air reconstruction of historic Danish buildings from the eighteenth century to the present. The Old Town is described as an open-air museum, but in many ways it’s much more than that: it’s filled with actors who walk around impersonating townsfolk from across history, interacting with guests to bring the old town more vividly to life.

As it turned out, an afternoon spent in The Old Town has provided a great theoretical context for the web archiving seminar. Interest in web archiving has grown significantly in recent years, as the breadth of participants – scholars, curators and technologists – represented at the seminar shows. But web archiving remains replete with methodological and theoretical questions which are only starting to be answered.

One of the major themes already emerging from the discussions relates to exactly how the act of web archiving is conceptualised. A popular myth is that web archives, particularly those accessible in the Internet Archive through the Wayback Machine interface, serve as direct, faithful representations of the web as it used to be. It’s easy to see why this view can seem seductive: the Wayback Machine offers an often seamless experience which functions much like the live web itself: enter URL, select date, surf the web of the past. Yet, as everyone at the seminar already knows painfully well, there are myriad reasons why this is a false assumption. Even within a single archive, plenty of discrepancies emerge, in terms of when each page was archived, exactly what was archived, and so on. Combining data from multiple archives is more problematic still.

Moreover, ‘Web 2.0’ platforms such as social networks, which have transformed the live web experience, have proved difficult to archive. Web archiving emerged to suit the ‘Web 1.0’ era, a primarily ‘old media’ environment of text, still images and other content joined together, crucially, by hyperlinks. But with more people sharing more of their lives online with more sophisticated expressive technology, the data flowing over the Internet is of a qualitatively richer variety. Some of the more dramatic outcomes of this shift have already emerged – such as Edward Snowden’s explosive NSA revelations, or the incredible value of personal data to corporations – but the broader-based implications of this data for our understanding of society are still emerging.

Web archives may be one of the ways in which scholars of the present and future learn more about contemporary society. Yet the potential this offers must be accompanied by a keener understanding of what archives do and don’t represent. Most fundamentally, the myth that web archives faithfully represent the web as it was needs to be exposed and explained. Web archives can be a more or less accurate representation of the web of the past, but they can never be a perfect copy. The ‘Old Town’ in Aarhus is a great recreation of the past, but I was obviously never under the illusion that I was actually seeing the past – those costumed townsfolk were actors after all. I was always instinctively aware, moreover, that the museum’s curators affected what I saw. Yet given that they are, and given the seemingly neutral nature implied by the term ‘archive’, this trap is more easily fallen into in the case of web archives. Understanding that web archives, however seamless, will never be a perfectly faithful recreation of the experience of users at the time – or put even more simply, that these efforts are always a recreation and not the original experience itself – is an important first step in a more appropriate appreciation of the opportunities that they offer.

Moreover, occasions like this seminar give scholars at the forefront of preserving and using archived material from the web a chance to reflect on the significance of the design decisions taken now around data capture and analysis for generations of researchers in future. History may be written by the victors, but web history is facilitated, in essence, by predictors: those charged with anticipating exactly which data, tools and techniques will be most valuable to posterity.

The British Library is about to embark on its annual task of archiving the entire UK web space. We will be pushing the button, sending out our ‘bots to crawl every British domain for storage in the UK Legal deposit web archive. How much will we capture? Even our experts can only make an educated guess.

You’ve probably played the time-honoured village fete game in which you guess how many jelly beans are in the jar and the winner gets a prize. Well, perhaps we can ask you to guess the size of the UK internet and the nearest gets … the glory of being right. Some facts from last year might help.

2013 Web Crawl
In 2013 the Library conducted the first crawl of all .uk websites. We started with 3.86 million seeds (websites), which led to the capture of 1.9 billion URLs (web pages, docs, images). All this resulted in 30.84 terabytes (TB) of data! It took the Library’s robots 70 days to collect.
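
As a rough aid to guessing, the 2013 figures above can be turned into a few averages. The sketch below is only back-of-the-envelope arithmetic: the function name `crawl_summary` is made up for illustration, and it assumes decimal terabytes (10^12 bytes), which the post does not specify.

```python
# Figures quoted for the 2013 crawl.
SEEDS = 3.86e6        # seeds (websites)
URLS = 1.9e9          # URLs captured
DATA_TB = 30.84       # terabytes collected
DAYS = 70             # duration of the crawl

# Assumption: decimal terabytes; the post does not say which unit is meant.
BYTES_PER_TB = 10**12
SECONDS_PER_DAY = 86_400


def crawl_summary(seeds, urls, data_tb, days):
    """Derive simple averages from the headline crawl figures."""
    total_bytes = data_tb * BYTES_PER_TB
    return {
        "urls_per_seed": urls / seeds,
        "avg_kb_per_url": total_bytes / urls / 1_000,
        "urls_per_second": urls / (days * SECONDS_PER_DAY),
        "gb_per_day": total_bytes / days / 1e9,
    }


if __name__ == "__main__":
    for name, value in crawl_summary(SEEDS, URLS, DATA_TB, DAYS).items():
        print(f"{name}: {value:,.1f}")
```

On these assumptions the averages come out at roughly 490 URLs per seed and about 16 KB per URL, which gives a starting point for scaling up a 2014 guess.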

Geolocation
In addition to the .uk domains, the Library has the scope to collect websites that are hosted in the UK, so we will attempt to geolocate IP addresses within the geographical confines of the UK. This means that we will be pulling in many .com, .net, .info and other Top Level Domains (TLDs). How many extra websites? How much data? We just don’t know at this time.
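
The scope rule described here (a .uk domain, or any domain hosted in the UK) can be sketched as a simple check. This is an illustration only and not the Library's actual crawler logic: the function names are hypothetical, and `UK_NETWORKS` holds a placeholder documentation range (192.0.2.0/24) standing in for a real geolocation database of UK-allocated address blocks.

```python
import ipaddress

# Hypothetical placeholder: a reserved documentation range standing in for
# the real list of UK-allocated IP blocks, which would come from a
# geolocation database.
UK_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_uk_domain(host):
    """True if the hostname falls under the .uk top-level domain."""
    return host.lower().rstrip(".").endswith(".uk")


def is_uk_hosted(ip, uk_networks=UK_NETWORKS):
    """True if the IP address lies inside one of the listed UK networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in uk_networks)


def in_scope(host, ip):
    """A site is in scope if it is a .uk domain OR is hosted in the UK."""
    return is_uk_domain(host) or is_uk_hosted(ip)


if __name__ == "__main__":
    print(in_scope("example.co.uk", "198.51.100.7"))  # .uk domain
    print(in_scope("example.com", "192.0.2.10"))      # non-.uk, placeholder "UK" IP
    print(in_scope("example.com", "198.51.100.7"))    # neither condition holds
```

The "or" is what makes the 2014 total hard to predict: every .com, .net or .info site whose address geolocates to the UK is added on top of the .uk seeds.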

De-duplication
A huge issue in collecting the web is the large number of duplicates that are captured and saved, something that can add a great deal to the volume collected. Of the 1.9 billion web pages etc., a significant number are probably copies, and our technical team have worked hard this time to reduce this, or ‘de-duplicate’. We are, however, uncertain at the moment as to how much effect this will eventually have on the total volume of data collected.
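
To make the idea concrete, one common way to spot exact duplicates is to compare a hash of each captured payload. The sketch below assumes that approach purely for illustration; the post does not say how the Library's technical team de-duplicates, and the `deduplicate` function and sample URLs are hypothetical.

```python
import hashlib


def deduplicate(captures):
    """Split (url, payload) pairs into first-seen copies and exact duplicates.

    Two captures count as duplicates when their payload bytes hash to the
    same SHA-256 digest, regardless of URL.
    """
    first_url_for_digest = {}
    unique = []
    duplicates = []  # (duplicate_url, url_of_first_copy)
    for url, payload in captures:
        digest = hashlib.sha256(payload).hexdigest()
        if digest in first_url_for_digest:
            duplicates.append((url, first_url_for_digest[digest]))
        else:
            first_url_for_digest[digest] = url
            unique.append((url, payload))
    return unique, duplicates


if __name__ == "__main__":
    sample = [
        ("http://a.example/logo.png", b"same bytes"),
        ("http://b.example/logo.png", b"same bytes"),
        ("http://c.example/index.html", b"different bytes"),
    ]
    unique, duplicates = deduplicate(sample)
    saved = sum(len(p) for _, p in sample) - sum(len(p) for _, p in unique)
    print(f"{len(unique)} unique, {len(duplicates)} duplicate, {saved} bytes saved")
```

A scheme like this only catches byte-identical copies; pages that differ by a timestamp or session token would still be stored separately, which is one reason the effect on total volume is hard to predict.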

Predictions
In summary then, in 2014 we will be looking to collect all of the .uk domain names plus all the websites that we can find that are hosted in the UK (.com, .net, .info etc.), overall a big increase in the number of ‘seeds’ (websites). It is hard, however, to predict what effect these changes will have compared to last year. What the final numbers might be is anyone’s guess. What do you think?

Let us know in the comments below, or on Twitter (@UKWebArchive), YOUR predictions for 2014: number of URLs, size in terabytes (TB) and (if you are feeling very brave) the number of hosts. For example, organisations like the BBC and NHS consist of lots of websites each but are one ‘host’.

We want:

URLs (in billions)
Size (in terabytes)
Hosts (in millions)

#UKWebCrawl2014

We will announce the winner when all the data is safely on our servers sometime in the summer. Good luck.

First posted at http://britishlibrary.typepad.co.uk/webarchive/2014/06/how-big-is-the-uk-web.html by Jason Webber, 2 June 2014


Recreational bugs: the limits of representing the past through web archives

How big is the UK web?