Big Data in the Humanities: lessons from papyrus and Instagram

This is a cross-posting of an item that our colleague Josh Cowls has just written for his own blog. Thanks to Josh for permission to repost here.


I’m currently in Washington DC to attend the IEEE International Conference on Big Data. The first day is set aside for workshops, and I’ve just attended a really insightful one on ‘Big Humanities Data’. The diversity of work presented was immense, covering a huge sweep of history: from fragments of ancient Greek text to Instagram photos taken during the Ukrainian revolution, via the Irish Rebellion of 1641 and the Spanish flu outbreak of a century ago. Nonetheless, certain patterns stood out across most of the talks given.

The workshop started with a fascinating keynote from Michael Levy and Michael Haley Goldman from the US Holocaust Memorial Museum here in DC, which laid out the transformative effect of collecting and digitizing large volumes of documentation relating to the Holocaust. As they put it, the role of the institution has changed because of what can be done with this data, initiating a more interactive, collaborative relationship with the public. The historical specificity of the Holocaust as an event has yielded a ‘narrow but deep’ and diverse set of resources – from aerial bombing photography to thousands of hours of oral history – enabling new and unusual research questions, and changing the nature of historical enquiry in this area. I was able to plug the UK domain data project when I argued for the power of search in engaging both professional researchers and the public at large.

This ability to ask new questions in new ways was a prevalent theme across all the talks. Whether the data itself is weeks or centuries old, the innovative methods being used allow novel perspectives and findings. Research into textual differences between versions of the Bible, and a study of media texts during and after the Fukushima disaster, both showed the new scale at which phenomena, old and new, can be analysed.

Yet challenges undoubtedly remain for the integration of these new tools into existing humanities research. The issue of data quality came up frequently, whether the data is born-digital or naturalised-digital; Owen Conlan described how scholars researching digitised records of the Irish Rebellion want both certainty of recall at the top level and scrutability of individual data points, while Alise Tifentale pointed out that photos taken during the Ukrainian Revolution were not a representative record of the protests.

In response, many presenters advocated a dialectical approach between historians versed in traditional questions of validity and the computer scientists (and digital humanists) who build algorithms, software and other tools for analysis. To speak in clichéd big data parlance for a moment, the volume of humanities data which can now be analysed and the velocity at which this can be done are clearly new, but it also became clear that, by the nature of their training and experience, humanities researchers are ahead of the game when it comes to the challenges of verifying highly varied data.

The workshop was rounded off with a panel discussion with representatives of major funding bodies, which took a broader view of the issues going forward, such as developing infrastructure, maintaining funding and demonstrating the impact of this research to governments and the public. Overall, it was great to get a taste of the wealth of research being done using new data, tools and skills at this workshop, and to reflect on how many of the challenges and solutions suggested relate to research I’m part of back home.