Big Data in the Humanities: lessons from papyrus and Instagram

This is a cross-posting of an item that our colleague Josh Cowls has just written for his own blog. Thanks to Josh for permission to repost here.


I’m currently in Washington DC to attend the IEEE International Conference on Big Data. The first day is set aside for workshops, and I’ve just attended a really insightful one on ‘Big Humanities Data’. The diversity of work presented was immense, covering a huge sweep of history: from fragments of ancient Greek text to Instagram photos taken during the Ukraine revolution, via the Irish Rebellion of 1641 and the Spanish flu outbreak of a century ago. Nonetheless, certain patterns stood out from most of the talks given.

The workshop started with a fascinating keynote from Michael Levy and Michael Haley Goldman from the US Holocaust Memorial Museum here in DC, which laid out the transformative effect of collecting and digitizing large volumes of documentation relating to the Holocaust. As they put it, the role of the institution has changed because of what can be done with this data, initiating a more interactive, collaborative relationship with the public. The historical specificity of the Holocaust as an event has yielded a ‘narrow but deep’ and diverse set of resources – from aerial bombing photography to thousands of hours of oral history – enabling new and unusual research questions, and changing the nature of historical enquiry in this area. I was able to plug the UK domain data project when I argued for the power of search for engaging both professional researchers and the public at large.

This ability to ask new questions in new ways was a prevalent theme across all the talks. Whether the data itself is weeks or centuries old, the innovative methods being used allow novel perspectives and findings. Research into the differences between the texts of different versions of the Bible, and a study of media text during and after the Fukushima disaster, both showed the new scale at which phenomena new and old could be analysed.

Yet challenges undoubtedly remain for the integration of these new tools into existing humanities research. The issue of data quality was frequently mentioned, no matter whether the data is born- or naturalised-digital; Owen Conlan described how scholars researching digitised records of the Irish Rebellion want both certainty of recall at a top level and scrutability of individual data points, while Alise Tifentale pointed out that photos taken during the Ukrainian Revolution were not a representative record of the protests.

In response, many presenters advocated a dialectical approach between historians versed in traditional questions of validity and the computer scientists (and digital humanists) who build algorithms, software and other tools for analysis. To speak in cliched big data parlance for a moment, the volume of humanities data which can now be analysed and the velocity at which this can be done are clearly new, but it became clear that by the nature of their training and experience, humanities researchers are ahead of the game when it comes to the challenges of verifying highly varied data.

The workshop was rounded off with a panel discussion with representatives of major funding bodies, which took a broader view of issues going forward, such as developing infrastructure, the maintenance of funding and the necessity of demonstrating the impact of this research to governments and the public. Overall, it was great to get a taste of the wealth of research being done using new data, tools and skills at this workshop, and to reflect on how many of the challenges and solutions suggested relate to research I’m part of back home.

Search results for historical material

This is a guest post by Jaspreet Singh, a researcher at the L3S Research Center in Hanover. Jaspreet writes:

When people use a commercial search engine to search for information, they represent their intent using a set of keywords. In most cases this is to quickly look up a piece of information and move on to the next task. For scholars, however, the information intent is usually very different from that of the casual user and is often hard to express as keywords. The popularity of the advanced query feature of the BL’s web archive search engine is strong evidence of this.

By working closely with scholars, though, we can gain better insights into their search intents and design the search engine accordingly. In my master’s thesis I focus specifically on search result ranking when the user’s search intent is historical.

Let us consider the user intent, ‘I want to know the history of Rudolph Giuliani, the ex-mayor of New York City’. We can safely assume that history refers to the important time periods and aspects of Rudolph Giuliani’s life. The user would most likely input the keywords ‘rudolph giuliani’ and expect to see a list of documents that give him a general overview of the major historically relevant facts about Giuliani. From here the user can modify his query or filter the results using facets to dig deeper into certain aspects. A standard search engine, however, is unaware of this intent. It only receives keywords as input and tries to serve the most relevant documents to the user.

At the L3S Research Center we have developed a prototype search engine specifically for historical search intents. We use temporal and aspect-based search result diversification techniques to serve users with documents which cover a topic’s most important historical facts within the top n results. For example, when searching for ‘rudolph giuliani’ we try to retrieve documents that cover his election campaigns, his mayoralty, his run for the Senate and his personal life so that the user gets a quick gist of the important facts. Using our system, the user can explore the results by time using an interactive timeline or modify the query. The prototype showcases the various state-of-the-art algorithms used for search diversification as well as our own algorithm, ASPTD. We use the New York Times 1987-2007 news archive as our corpus of study. In the interface we present only the top 30 results at a time.
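
To make the idea of temporal diversification concrete, here is a minimal Python sketch. It is not ASPTD, nor any of the algorithms in the prototype; it is a simple greedy re-ranker written under assumptions made purely for illustration. Each document is assumed to carry a precomputed relevance score and a publication year, years are grouped into fixed-size bins, and a document's score is discounted once for every document already chosen from the same bin. The names `Doc`, `diversify_by_time`, `bin_size` and `penalty` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    title: str
    year: int
    relevance: float  # assumed to come from an ordinary keyword ranker


def diversify_by_time(docs, n, bin_size=5, penalty=0.5):
    """Greedily pick up to n docs, discounting docs whose time bin is already covered."""
    remaining = list(docs)
    chosen = []
    covered = {}  # time bin -> number of docs already chosen from that bin

    def discounted(doc):
        # Multiply relevance by `penalty` once per doc already taken from this bin.
        return doc.relevance * (penalty ** covered.get(doc.year // bin_size, 0))

    while remaining and len(chosen) < n:
        best = max(remaining, key=discounted)
        chosen.append(best)
        remaining.remove(best)
        time_bin = best.year // bin_size
        covered[time_bin] = covered.get(time_bin, 0) + 1
    return chosen
```

With a `penalty` of 1.0 this reduces to plain relevance ranking; lower values push the top n results to spread across more time periods, which is the behaviour a user with a historical intent is assumed to want.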

In the future, we plan to test our approach on a much larger news archive like the 100-year London Times corpus. We also intend to strengthen the algorithm to work with web archives and work with the BL to integrate such methods into the current BL web archive search system so that users can explore the archive better.

Link to the system:

Project progress, an update

Josh Cowls reflects on recent developments and our goals towards the end of the project:


We are already well past the half-way mark of the project, and exciting new developments mean that our eleven researchers are well on their way to producing high-quality humanities research using the massive UK Web Domain Dataset.

The project team meets with the researchers on a regular basis, and these meetings always involve really constructive dialogue between the researchers accessing and using the data, and the development team at the British Library who are improving the interface of the archive all the time.

Our most recent meeting in September was no exception. We first got a brief update from all the researchers present about how their work was taking shape. This led seamlessly into a wider discussion of what researchers want from the interface. The top priority was the creation of accounts for each individual user, enabling users to save the often-complex search queries that they generate. Another high priority was the ability to search within result sets, enabling more iterative searching.

Among the other enhancements suggested by the researchers were a number of proposed tweaks to the interface. One suggestion to save researchers time was for a snippet view on the results page, showing the search term in context – meaning researchers could skip over pages clearly irrelevant to their interest. On the other hand, it was not felt that URLs should necessarily appear on results pages.
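
As a rough illustration of what a snippet view involves, the Python sketch below returns the first occurrence of a search term with a fixed amount of context on either side. It is only a sketch under simple assumptions (plain-text pages, case-insensitive matching, a single search term) and is not how the British Library interface is implemented; the function name `snippet` and the `width` parameter are hypothetical.

```python
def snippet(text, term, width=40):
    """Return the first match of `term` with up to `width` characters of
    context on each side, or None if the term does not occur."""
    # Simplification: assumes lower-casing does not change string length,
    # which holds for most but not all Unicode text.
    pos = text.lower().find(term.lower())
    if pos == -1:
        return None
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

A results page could then show `snippet(page_text, query)` under each hit, letting a researcher judge relevance without opening the archived page.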

Other requested tweaks to the interface included:

  • An option to choose the number of search results per page and to show more results per page by default
  • The ability to filter results from advanced as well as simple search queries
  • Tailoring of the ‘show more’ feature depending on the facet
  • A ‘show me a sample’ feature for large numbers of results, with a range of sampling methods, including a random sample option.
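
To illustrate the last of these requests, here is a short Python sketch of two sampling methods over a list of results: a simple random sample and a systematic sample that takes evenly spaced results from the ranked list. The function names and the `seed` parameter are hypothetical, and the researchers did not specify which methods beyond a random sample they had in mind.

```python
import random


def random_sample(results, k, seed=None):
    """Return k results chosen uniformly at random, without replacement.
    Passing a seed makes the sample reproducible, which helps when citing it."""
    if k >= len(results):
        return list(results)
    return random.Random(seed).sample(list(results), k)


def systematic_sample(results, k):
    """Return k evenly spaced results, preserving the original ranking order."""
    if k <= 0:
        return []
    if k >= len(results):
        return list(results)
    step = len(results) / k
    return [results[int(i * step)] for i in range(k)]
```

A fixed seed is a deliberate design choice here: it means two researchers running the same query can recover the same sample.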

As well as these interface issues, the conversation also focussed on more academic questions, especially in regard to how results from the dataset should be cited. A ‘cite me’ button was suggested, which would allow a quick way of citing results, and similarly, when viewing individual results on the Internet Archive, an outer frame could include citation details. But of course, exactly what form these citation details should take raised other questions: should the British Library be cited as the provider of the data, or the Internet Archive, as the original collector? How should collections of results be cited, given that the British Library’s search functionality generated the results?

Inevitably, some of these questions couldn’t be answered definitively at the meeting, but the experience shows the value of involving researchers – who are able to raise vital questions from an academic perspective – while the development of the interface is still in progress. Since the meeting, many of the proposed changes have already been implemented – including, crucially, the introduction of log-ins for researchers, enabling the preservation and retrieval of search queries. The researchers are encouraged to bring more requests to our next meeting, at the British Library next week. From then, the pace of the project will accelerate still further, with a demo of the project to the general public at the AHRC’s Being Human Festival in November, and the ‘Web archives as big data’ conference in early December, when the researchers will present their findings.