Project case studies now available

We are delighted that we can now make available five of the case studies written by researchers across the humanities and social sciences. More will be available via this blog soon.

At the beginning of the project we had a number of aspirations for what the case studies could achieve. Firstly, of course, we wanted to show the variety of research that could be undertaken across different disciplines with web archives. Secondly, we wanted the researchers to give us feedback on the interface to the archive that the project was developing. This they did at monthly meetings, and we are very grateful to them for attending and giving their views; this process improved the interface markedly. Thirdly, we hoped that some of the researchers might become advocates for web archiving among their peers.

The last is already being realised. At a conference on web archiving in Aarhus in June, no fewer than four of our researchers gave papers. Given their enthusiasm, we are sure that they will also present their work at events in their own subject areas.

The first five case studies we are making available are:

Web archives as big data

Peter Webster, a member of the project team, here reflects on the conference we held at the IHR on 3 December. Peter writes:

In early December the project held an excellent day conference on the theme of web archives as big data. A good part of the day was taken up with short presentations from eight of our bursary holders, reflecting both on the substantive research findings they have achieved, and also on the experience of using the SHINE interface and on web archives as source material in general.

In early 2015 these results will appear on this blog as a series of reports, one from each bursary holder. So, rather than attempt to summarise each presentation in turn, this post reflects on some common methodological themes that emerged during the course of the day.

Perhaps the single most prominent note of the whole day was of the sheer size of the archive. “Too much data!” was a common cry heard during the project, and with good reason, since there are few other archives in common use with data of this magnitude, at least amongst those used by humanists. In an archive with more than 2 billion resources recorded in the index, the researchers found that queries needed to be a great deal more specific than most users are accustomed to; and that even the slightest ambiguity in the choice of search terms in particular led very quickly to results sets containing many thousands of results. Gareth Millward also drew attention to the difficulties in interpreting patterns in the incidence of any but the most specific search terms across time across the whole dataset, since almost any search term a user can imagine may have more than one meaning in an archive of the whole UK web.

One common strategy to come to terms with the size of the archive was to “think small”: to explore some very big data by means of a series of small case studies, which could then be articulated together. Harry Raffal, for example, focused on a succession of captures of a small set of key pages in the Ministry of Defence’s web estate; Helen Taylor on a close reading of the evolution of the content and structure of certain key poetry sites as they changed over time. This approach had much in common with that of Saskia Huc-Hepher on the habitus of the London French community as reflected in a number of key blogs. Rowan Aust also read important things from the presence and absence of content in the BBC’s web estate in the wake of the Jimmy Savile scandal.

An encouraging aspect of the presentations was the methodological holism on display, with this particular dataset being used in conjunction with other web archives, notably the Internet Archive. In the case of Marta Musso’s work on the evolution of the corporate web space, this data was but one part of a broader enquiry employing questionnaire and other evidence in order to create a rounded picture.

One particular and key difference between the SHINE interface and other familiar services is that search results in SHINE are not prioritised by any algorithmic intervention, but are presented in the archival order. This brought into focus one of the recurrent questions in the project: in the context of superabundant data, how attached is the typical user to a search service that (as it were) second-guesses what it was that the user *really* wanted to ask, and presents results in that order? If such a service is what is required, then how transparent must the operation of the algorithm be in order to be trusted? Richard Deswarte powerfully drew attention to how fundamental the effect of Google has been on user expectations of the interfaces they use. Somewhat surprisingly (at least for me), more than one of the speakers was prepared to accept results without such machine prioritisation: indeed, in some senses it was preferable to be able to utilise what Saskia Huc-Hepher described as the “objective power of arbitrariness”. If a query produced more results than could be inspected individually, then both Saskia and Rona Cran were more comfortable with making their own decisions about taking smaller samples from those results than relying on a closed algorithm to make that selection. In a manner strikingly akin to the functionality of the physical library, such arbitrariness also led on occasion to a creative serendipitous juxtaposition of resources: a kind of collage in the web archive.

Big Data in the Humanities: lessons from papyrus and Instagram

This is a cross-posting of an item that our colleague Josh Cowls has just written for his own blog. Thanks to Josh for permission to repost here.

 

I’m currently in Washington DC to attend the IEEE International Conference on Big Data. The first day is set aside for workshops, and I’ve just attended a really insightful one on ‘Big Humanities Data’. The diversity of work presented was immense, covering a huge sweep of history: from fragments of ancient Greek text to Instagram photos taken during the Ukraine revolution, via the Irish Rebellion of 1641 and the Spanish flu outbreak of a century ago. Nonetheless, certain patterns stuck out from most of the talks given.

The workshop started with a fascinating keynote from Michael Levy and Michael Haley Goldman from the US Holocaust Memorial Museum here in DC, which laid out the transformative effect of collecting and digitizing large volumes of documentation relating to the Holocaust. As they put it, the role of the institution has changed because of what can be done with this data, initiating a more interactive, collaborative relationship with the public. The historical specificity of the Holocaust as an event has yielded a ‘narrow but deep’ and diverse set of resources – from aerial bombing photography to thousands of hours of oral history – enabling new and unusual research questions, and changing the nature of historical enquiry in this area. I was able to plug the UK domain data project when I argued for the power of search for engaging both professional researchers and the public at large.

This ability to ask new questions in new ways was a prevalent theme across all the talks. Whether the data itself is weeks or centuries old, the innovative methods being used allow novel perspectives and findings. Research into the difference between text in different versions of the Bible, and a study of media text during and after the Fukushima disaster, both showed the new scale at which phenomena new and old could be analysed.

Yet challenges undoubtedly remain for the integration of these new tools into existing humanities research. The issue of data quality was frequently mentioned, no matter whether the data is born- or naturalised-digital; Owen Conlan described how scholars researching digitised records of the Irish Rebellion want both certainty of recall at a top level and scrutability of individual data points, while Alise Tifentale pointed out that photos taken during the Ukrainian Revolution were not a representative record of the protests.

In response, many presenters advocated a dialectical approach between historians versed in traditional questions of validity and the computer scientists (and digital humanists) who build algorithms, software and other tools for analysis. To speak in cliched big data parlance for a moment, the volume of humanities data which can now be analysed and the velocity at which this can be done is clearly new, but it became clear that by the nature of their training and experience, humanities researchers are ahead of the game when it comes to the challenges of verifying highly varied data.

The workshop was rounded off with a panel discussion with representatives of major funding bodies, which took a broader view on wider issues going forward, such as developing infrastructure, the maintenance of funding and the necessity of demonstrating the impact of this research to governments and the public. Overall, it was great to get a taste of the wealth of research being done using new data, tools and skills at this workshop, and to reflect on how many of the challenges and solutions suggested relate to research I’m part of back home.

Search results for historical material

This is a guest post by Jaspreet Singh, a researcher at the L3S Research Center in Hanover. Jaspreet writes:

When people use a commercial search engine to search for information, they represent their intent using a set of keywords. In most cases this is to quickly look up a piece of information and move on to the next task. For scholars, however, the information intent is usually very different from that of the casual user and often hard to express as keywords. The fact that the advanced query feature of the BL’s web archive search engine is quite popular is strong evidence of this.

By working closely with scholars, though, we can gain better insights into their search intents and design the search engine accordingly. In my master’s thesis I focus specifically on search result ranking when the user search intent is historical.

Let us consider the user intent, ‘I want to know the history of Rudolph Giuliani, the ex-mayor of New York City’. We can safely assume that history refers to the important time periods and aspects of Rudolph Giuliani’s life. The user would most likely input the keywords ‘rudolph giuliani’ and expect to see a list of documents that give him a general overview of Giuliani’s major historically relevant facts. From here the user can modify his query or filter the results using facets to dig deeper into certain aspects. A standard search engine however is unaware of this intent. It only receives keywords as input and tries to serve the most relevant documents to the user.

At the L3S Research Center we have developed a prototype search engine specifically for historical search intents. We use temporal and aspect based search result diversification techniques to serve users with documents which cover a topic’s most important historical facts within the top n results. For example, when searching for ‘rudolph giuliani’ we try to retrieve documents that cover his election campaigns, his mayoralty, his run for senate and his personal life so that the user gets a quick gist of the important facts. Using our system, the user can explore the results by time using an interactive timeline or modify the query. The prototype showcases the various state of the art algorithms used for search diversification as well as our own algorithm, ASPTD. We use the New York Times 1987-2007 news archive as our corpus of study. In the interface we present only the top 30 results at a time.
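
To make the idea of temporal and aspect-based diversification a little more concrete, here is a minimal sketch in Python. It is not ASPTD or any of the algorithms in the prototype: it is a simple greedy re-ranking, and the document fields (`id`, `score`, `year`, `aspect`) and the penalty formula are assumptions made purely for illustration.

```python
from collections import Counter


def diversify(docs, n=30):
    """Greedily pick up to n documents, trading relevance against novelty.

    Each doc is a dict with hypothetical keys: 'id', 'score' (relevance),
    'year' (publication year) and 'aspect' (a topic label).
    """
    selected = []
    seen_years = Counter()
    seen_aspects = Counter()
    remaining = list(docs)

    while remaining and len(selected) < n:

        def gain(doc):
            # Discount relevance by how often this year and this aspect
            # already appear among the documents chosen so far.
            penalty = 1 + seen_years[doc["year"]] + seen_aspects[doc["aspect"]]
            return doc["score"] / penalty

        best = max(remaining, key=gain)
        remaining.remove(best)
        selected.append(best)
        seen_years[best["year"]] += 1
        seen_aspects[best["aspect"]] += 1

    return selected
```

With a plain relevance ranking, two highly scored articles about the same year and the same aspect would sit next to each other at the top. With the discount above, the second of them is pushed down in favour of a less relevant document that covers a different period or aspect, which is the behaviour a historical overview needs.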

In the future, we plan to test our approach on a much larger news archive like the 100 year London Times corpus. We also intend to strengthen the algorithm to work with web archives and work with the BL to integrate such methods in the current BL web archive search system so that users can explore the archive better.

Link to the system: http://pharos.l3s.uni-hannover.de:7080/ArchiveSearch/starterkit/

Project progress, an update

Josh Cowls reflects on recent developments and our goals towards the end of the project:

 

We are already well past the half-way mark of the project, and exciting new developments mean that our eleven researchers are well on their way to producing high-quality humanities research using the massive UK Web Domain Dataset.

The project team meets with the researchers on a regular basis, and these meetings always involve really constructive dialogue between the researchers accessing and using the data, and the development team at the British Library who are improving the interface of the archive all the time.

Our most recent meeting in September was no exception. We first got a brief update from all the researchers present about how their work was taking shape. This led seamlessly into a wider discussion of what researchers want from the interface. The top priority was for the creation of accounts for each individual user, enabling users to save the often-complex search queries that they generate. Another high priority was the ability to search within results sets, enabling more iterative searching.

Among the other enhancements suggested by the researchers were a number of proposed tweaks to the interface. One suggestion to save researchers time was for a snippet view on the results page, showing the search term in context – meaning researchers could skip over pages clearly irrelevant to their interest. On the other hand, it was not felt that URLs should necessarily appear on results pages.

Other requested tweaks to the interface included:

  • An option to choose the number of search results per page and to show more results per page by default
  • The ability to filter results from advanced as well as simple search queries
  • Tailoring of the ‘show more’ feature depending on the facet
  • A ‘show me a sample’ feature for large amounts of results, with a range of sampling methods, including a random sample option.
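
As an illustration of the random sample option in the last item, here is a minimal Python sketch of reservoir sampling (Algorithm R), which draws a uniform random sample from a results set without holding the whole set in memory. It is not the project's implementation; the function name and the `seed` parameter are assumptions for the sake of the example.

```python
import random


def sample_results(results, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    reservoir = []
    for i, item in enumerate(results):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir
```

Passing a fixed `seed` means a researcher can regenerate exactly the same sample later, provided the underlying results arrive in the same order, which may help when a sample needs to be cited or checked.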

As well as these interface issues, the conversation also focussed on more academic questions, especially in regard to how results should be cited from the dataset. A ‘cite me’ button was suggested, which would allow a quick way of citing results, and similarly, when viewing individual results on the Internet Archive, an outer frame could include citation details. But of course, exactly what form these citation details should take raised other questions: should the British Library be cited as the provider of the data, or should the Internet Archive as the original collector? How should collections of results be cited, given that the British Library’s search functionality generated the results?

Inevitably, some of these questions couldn’t be answered definitively at the meeting, but the experience shows the value of involving researchers – who are able to raise vital questions from an academic perspective – while the development of the interface is still in progress. Since the meeting, many of the proposed changes have already been implemented – including, crucially, the introduction of log-ins for researchers, enabling the preservation and retrieval of search queries. The researchers are encouraged to bring more requests to our next meeting, at the British Library next week. From then, the pace of the project will accelerate still further, with a demo of the project to the general public at the AHRC’s Being Human Festival in November, and the ‘Web archives as big data’ conference in early December, when the researchers will present their findings.

Web archives at the British Library

This is a post by one of the project team from the British Library. Peter Webster is Web Archiving Engagement and Liaison Manager. Peter writes:

At the British Library, the web archiving team are delighted to be involved in this important new project, which builds on two earlier collaborative projects, both funded by the JISC: the AADDA project (with the IHR) and (at the Oxford Internet Institute) Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research.

The web archive holdings of the British Library for the UK consist of three main parts, each compiled under different conditions at different times, and accessed in different ways. One of the long-term outcomes of this project is to help us bring some of these access arrangements closer together.

The first of these three sets of data is the Open UK Web Archive, accessible at webarchive.org.uk. This is a selective archive, consisting of resources selected by curators at the Library and among partner organisations. The archived sites are made available publicly by permission of the site owners.

Since April 2013 the Library, in partnership with the five other legal deposit libraries for the UK, have had the right to receive a copy of all non-print publications from the UK; a framework known as Non-Print Legal Deposit. As I write, the second “domain crawl” is under way; our annual archiving of all the web content we know to be from the UK. The Legal Deposit UK Web Archive is, however, only available to users on premises controlled by one of the legal deposit libraries.

The third component of our holdings is the one with which this project is concerned: the JISC UK Web Domain Dataset. This is a copy of the holdings of the Internet Archive for the .uk top level domain for the period 1996 to 2013. The search and analysis interface for this is not yet publicly available, although individual items within it are available from the Internet Archive’s own site if you know the URL you need. There are also several datasets derived from it available for download on a public domain basis.

Although not a complete record of the UK web for that period, it is the most comprehensive such archive in existence. We are delighted to be working with arts and humanities researchers to develop the next generation of search and analysis tools to interrogate this unique dataset. Over time, those new tools should also greatly enhance the ways in which users can work with all three components of our web archives for the UK.

Recreational bugs: the limits of representing the past through web archives

This is a post from team member Josh Cowls, cross-posted from his blog. Josh is also on Twitter: @JoshCowls.

I am in Aarhus this week for the ‘Web Archiving and Archived Web’ seminar organised by Netlab at Aarhus University. Before the seminar got underway, I had time to walk around ‘The Old Town’ (Den Gamle By), a vibrant, open-air reconstruction of historic Danish buildings from the eighteenth century to the present. The Old Town is described as an open-air museum, but in many ways it’s much more than that: it’s filled with actors who walk around impersonating townsfolk from across history, interacting with guests to bring the old town more vividly to life.

As it turned out, an afternoon spent in The Old Town has provided a great theoretical context for the web archiving seminar. Interest in web archiving has grown significantly in recent years, as the breadth of participants – scholars, curators and technologists – represented at the seminar shows. But web archiving remains replete with methodological and theoretical questions which are only starting to be answered.

One of the major themes already emerging from the discussions relates to exactly how the act of web archiving is conceptualised. A popular myth is that web archives, particularly those accessible in the Internet Archive through the Wayback Machine interface, serve as direct, faithful representations of the web as it used to be. It’s easy to see why this view can seem seductive: the Wayback Machine offers an often seamless experience which functions much like the live web itself: enter URL, select date, surf the web of the past. Yet, as everyone at the seminar already knows painfully well, there are myriad reasons why this is a false assumption. Even within a single archive, plenty of discrepancies emerge, in terms of when each page was archived, exactly what was archived, and so on. Combining data from multiple archives is exponentially more problematic still.

Moreover, ‘Web 2.0’ platforms such as social networks, which have transformed the live web experience, have proved difficult to archive. Web archiving emerged to suit the ‘Web 1.0’ era, a primarily ‘old media’ environment of text, still images and other content joined together, crucially, by hyperlinks. But with more people sharing more of their lives online with more sophisticated expressive technology, the data flowing over the Internet is of a qualitatively richer variety. Some of the more dramatic outcomes of this shift have already emerged – such as Edward Snowden’s explosive NSA revelations, or the incredible value of personal data to corporations – but the broader-based implications of this data for our understanding of society are still emerging.

Web archives may be one of the ways in which scholars of the present and future learn more about contemporary society. Yet the potential this offers must be accompanied by a keener understanding of what archives do and don’t represent. Most fundamentally, the myth that web archives faithfully represent the web as it was needs to be exposed and explained. Web archives can be a more or less accurate representation of the web of the past, but they can never be a perfect copy. The ‘Old Town’ in Aarhus is a great recreation of the past, but I was obviously never under the illusion that I was actually seeing the past – those costumed townsfolk were actors after all. I was always instinctively aware, moreover, that the museum’s curators affected what I saw. Yet given that they are, and given the seemingly neutral nature implied by the term ‘archive’, this trap is more easily fallen into in the case of web archives. Understanding that web archives, however seamless, will never be a perfectly faithful recreation of the experience of users at the time – or put even more simply, that these efforts are always a recreation and not the original experience itself – is an important first step in a more appropriate appreciation of the opportunities that they offer.

Moreover, occasions like this seminar give scholars at the forefront of preserving and using archived material from the web a chance to reflect on the significance of the design decisions taken now around data capture and analysis for generations of researchers in future. History may be written by the victors, but web history is facilitated, in essence, by predictors: those charged with anticipating exactly which data, tools and techniques will be most valuable to posterity.

How big is the UK web?

The British Library is about to embark on its annual task of archiving the entire UK web space. We will be pushing the button, sending out our ‘bots to crawl every British domain for storage in the UK Legal deposit web archive. How much will we capture? Even our experts can only make an educated guess.

You’ve probably played the time-honoured village fete game, to guess how many jelly beans are in the jar and the winner gets a prize? Well perhaps we can ask you to guess the size of the UK internet and the nearest gets … the glory of being right. Some facts from last year might help.

2013 Web Crawl
In 2013 the Library conducted the first crawl of all .uk websites. We started with 3.86 million seeds (websites), which led to the capture of 1.9 billion URLs (web pages, docs, images). All this resulted in 30.84 terabytes (TB) of data! It took the library robots 70 days to collect.
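
If you want a starting point for your guess, a little arithmetic on the 2013 figures gives some useful averages. The sketch below is only a back-of-envelope calculation in Python: it assumes decimal terabytes (10^12 bytes) and a crawl that ran continuously for the full 70 days, neither of which is stated above.

```python
# Figures quoted for the 2013 crawl.
SEEDS = 3.86e6      # starting websites
URLS = 1.9e9        # captured URLs (pages, documents, images)
DATA_TB = 30.84     # total data collected, in terabytes
DAYS = 70           # length of the crawl

BYTES_PER_TB = 10**12  # assumption: decimal terabytes, not tebibytes

urls_per_seed = URLS / SEEDS
bytes_per_url = DATA_TB * BYTES_PER_TB / URLS
# Assumption: the crawl ran around the clock for all 70 days.
urls_per_second = URLS / (DAYS * 24 * 60 * 60)
tb_per_day = DATA_TB / DAYS

print(f"URLs per seed:   {urls_per_seed:,.0f}")
print(f"Bytes per URL:   {bytes_per_url:,.0f}")
print(f"URLs per second: {urls_per_second:,.0f}")
print(f"TB per day:      {tb_per_day:.2f}")
```

On those assumptions the averages come out at roughly 490 URLs per seed, about 16 KB per URL, a little over 300 URLs captured per second and around 0.44 TB per day.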

Geolocation
In addition to the .uk domains the Library has the scope to collect websites that are hosted in the UK so we will therefore attempt to geolocate IP addresses within the geographical confines of the UK. This means that we will be pulling in many .com, .net, .info and many other Top Level Domains (TLDs). How many extra websites? How much data? We just don’t know at this time.

De-duplication
A huge issue in collecting the web is the large number of duplicates that are captured and saved, something that can add a great deal to the volume collected. Of the 1.9 billion web pages etc. a significant number are probably copies and our technical team have worked hard this time to attempt to reduce this or ‘de-duplicate’. We are, however, uncertain at the moment as to how much effect this will eventually have on the total volume of data collected.
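
For readers curious about what de-duplication involves, one common general approach is to compare a hash of each captured payload and keep only the first copy. The Python sketch below illustrates that idea only; it is not a description of the method our technical team uses, it catches exact byte-for-byte copies and nothing subtler, and the function name and the input format of (URL, payload) pairs are assumptions for the example.

```python
import hashlib


def deduplicate(captures):
    """Split (url, payload_bytes) pairs into first-seen records and duplicates.

    Returns (unique, duplicates), where unique holds (url, digest) pairs and
    duplicates holds (url, url_of_first_copy) pairs.
    """
    first_seen = {}  # digest -> URL where this payload was first captured
    unique, duplicates = [], []
    for url, payload in captures:
        digest = hashlib.sha256(payload).hexdigest()
        if digest in first_seen:
            # Identical bytes already stored: record a pointer, not a copy.
            duplicates.append((url, first_seen[digest]))
        else:
            first_seen[digest] = url
            unique.append((url, digest))
    return unique, duplicates
```

Storing a pointer to the first copy instead of the bytes themselves is what saves space, which is also why it is hard to say in advance how much smaller the final total will be.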

Predictions
In summary then, in 2014 we will be looking to collect all of the .uk domain names plus all the websites that we can find that are hosted in the UK (.com, .net, .info etc.), overall a big increase in the number of ‘seeds’ (websites). It is hard, however, to predict what effect these changes will have compared to last year. What the final numbers might be is anyone’s guess. What do you think?

Let us know in the comments below, or on Twitter (@UKWebArchive), YOUR predictions for 2014 – number of URLs, size in terabytes (TB) and (if you are feeling very brave) the number of hosts, e.g. organisations like the BBC and NHS consist of lots of websites each but are one ‘host’.

We want:

  • URLs (in billions)
  • Size (in terabytes)
  • Hosts (in millions)

#UKWebCrawl2014

We will announce the winner when all the data is safely on our servers sometime in the summer. Good luck.

First posted at http://britishlibrary.typepad.co.uk/webarchive/2014/06/how-big-is-the-uk-web.html by Jason Webber, 2 June 2014

Welcome to our 11 bursary holders

One of the main aims of the project is to involve arts and humanities researchers in the development of tools for analysing web archives, thereby ensuring that those tools meet real rather than perceived researcher needs. We recently ran an open competition inviting researchers to submit proposals across a range of disciplines which focus on the archived web, and have selected 11 from a tremendously strong and varied set of applications. The topics that will be studied over the next eight months are:

  • Rowan Aust – Tracing notions of heritage
  • Rona Cran – Beat literature in the contemporary imagination
  • Richard Deswarte – Revealing British Euroscepticism in the UK web domain and archive
  • Chris Fryer – The UK Parliament Web Archive
  • Saskia Huc-Hepher – An ethnosemiotic study of London French habitus as displayed in blogs
  • Alison Kay – Capture, commemoration and the citizen-historian: Digital Shoebox archives relating to P.O.W.s in the Second World War
  • Gareth Millward – Digital barriers and the accessible web: disabled people, information and the internet
  • Marta Musso – A history of the online presence of UK companies
  • Harry Raffal – The Ministry of Defence’s online development and strategy for recruitment between 1996 and 2013
  • Lorna Richardson – Public archaeology: a digital perspective
  • Helen Taylor – Do online networks exist for the poetry community?

We very much look forward to working with our bursary holders over the coming months, and will be showcasing some of their research findings on this blog.

Preserving the present: the unique challenges of archiving the web

This post is by project team member Josh Cowls of the OII.

In March 2012, as Mitt Romney was seeking to win over conservative voters in his bid to become the Republican Party’s presidential nominee, his adviser Eric Fehrnstrom discussed concerns over his appeal to moderate voters later in the campaign, telling a CNN interviewer, “For the fall campaign … everything changes. It’s almost like an Etch A Sketch. You can kind of shake it up, and we start all over again.” Fehrnstrom’s unfortunate response provided a memorable metaphor for the existing perception of Romney as a ‘flip-flopper’. Fehrnstrom’s opposite number in the Obama campaign, David Axelrod, would later jibe that “it’s hard to Etch-A-Sketch the truth away”, and indeed, tying Romney to his less appetising positions and comments formed a core component of the President’s successful re-election strategy.

Clearly, in the harsh spotlight of an American presidential election, when a candidate’s every utterance is recorded, it is indeed “hard to Etch-A-Sketch the truth away”. Yet even in our digital era, a time at which – as recent revelations have suggested – vast hoards of our communication records may be captured every day, the Romney example is more the exception than the rule. In fact, even at the highest levels and in the most important contexts, it can be surprisingly easy for digitised information to simply go missing or at least become inaccessible: the web is more of an Etch-A-Sketch than it might appear.

Take the case of the Conservative Party’s attempts to block access to political material pre-dating the 2010 general election. It remains unclear whether these efforts were thoroughly Machiavellian or rather less malign (and in any case a secondary archive continued to provide access to the materials). Regardless, the incident certainly challenged the prevailing assumption that all materials which were once online will stay there.

In fact, the whole notion of staying there on the web is an illogical one. Certainly, the web has democratised the distribution of information: publishing used to be the preserve of anyone rich enough to own a printing press, but with the advent of the web, all it takes to publish virtual content is a decent blogging platform. Yet it’s crucial to remember that the exponential growth in the number of publishers online does not mean that the underlying process of publishing has entirely changed.

Although many prominent social media sites mask this well, there is still a core distinction on any web page between writer and reader. In fact, this distinction is baked into the DNA of the web: any user can freely browse the web through the system of URLs, but each individual site is operated, and its HTML code modified, by a specific authorised user. (Of course, there are certain sites like wikis which do allow any user to make substantial edits to a page.) As such, the distinction between writer and reader remains relevant on the web today.

In fact, there is at least one way in which the web entrenches more rather than less control in the hands of publishers as compared with traditional media. Spared of the need to make a physical copy of their work, publishers can make changes to published content without leaving a shred of evidence: I might have removed a typo from the previous paragraph a minute before you loaded this page, and you’d never know the difference. And it’s not only trivial changes to a single web page, but also the wholesale removal of entire web sites and other electronic resources which can pass unnoticed online. At a time when more and more aspects of life take place on the Internet, the importance of this to both academics and the public more broadly is becoming increasingly clear.

This of course is where the practice of web archiving comes in. I’m of the belief that web archiving should be conceived as broadly as possible, namely as any attempt to preserve any part of the web at any particular time. Contained within the scope of this definition is a huge range of different archiving activities, from a screenshot of a single web page to a sweep of an entire national web domain or the web in its entirety. Given the huge technical constraints involved, difficult decisions usually have to be made in choosing exactly what to archive; the core tension is often between breadth of coverage and depth in terms of snapshot frequency and the handling of technically complicated objects, for example. These decisions will affect exactly how the archived web will look in the future. 

Yet our discussion of the challenges around web archiving shouldn’t take place in a vacuum. Certainly the archiving of printed records comes with its own challenges, too, not least over access: in the case of Soviet Russia, for example, it was only after the Cold War had finished that archives were open to historians, and then only partially. Web archives in contrast have the virtue that they can be – and typically are – made freely available over the web itself for analysis. And, just as with the spurt of scholarship that followed the opening of the Soviet archives, we should be sure to see the preservation of web archives not merely as a challenge but also as an opportunity. Analysing web archives can enhance our ability to talk about the first twenty five years of life on the web – and unearth new insights about society more generally.

A central purpose of this project is to support the work of scholars at the cutting edge of exactly this sort of research. There’s just a couple more days to submit a proposal for one of the project’s research bursaries; see here for more details.