London’s Pulse and the challenge of big data

In my project, Metropolitan Medical Officers, I am working with the annual reports of the London Medical Officers of Health. The Wellcome Library have recently digitised the entire run of these reports, from 1848 to 1972, to create their London’s Pulse resource. These reports are packed with detail about the life of a capital city and the work of its medical officers over more than a century. But for the professional scholar, the digitisation of these reports presents some new challenges.

Traditionally, historians do a lot of reading. The work of trawling through texts for relevant material can be time-consuming but it is also an important aspect of scholarship. Even if we do not end up drawing specifically upon the majority of a text, it often provides valuable context for those words we do want to focus upon. And, of course, one can stumble across the unexpectedly interesting and the interestingly unexpected. Scholarship is often diverted from its original trajectory by such serendipities, as new approaches or questions are suggested, and part of the joy of research is reading to discover without having a clear expectation of what one will come across.

When texts are digitised, none of this has to be lost, even though the material experience of reading may change. But if we continue to use digital texts only in the way we used paper ones then we are not making any substantial gains (except perhaps in time and travel expenses). The challenge presented by a resource like London’s Pulse, then, is to do something not possible with the paper texts from which it has been created.

The keyword search that is available does offer the possibility of searching for a specific word or phrase. Know what it is you want to find, and the search function will take you straight to potentially relevant pages. But what if you don’t know exactly what you’re looking for, or exactly what words to use? Or what if your keyword search turns up no hits, or too many? This method cannot be guaranteed to find all relevant sections of text (consider variant spellings, alternative terminology and even mis-spellings) or to exclude irrelevant ones (imagine you are searching for ‘treatment’ or another such word with a more general meaning as well as its medical one). Keyword searches also cannot give any information about the structure or the meaning of the text as a whole.
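To make the limitation concrete, here is a minimal Python sketch. It is not the London’s Pulse search itself: the four sample sentences are invented for illustration, and the function names `exact_hits` and `fuzzy_hits` are my own. It compares an exact keyword match with a looser match based on spelling similarity, using the standard library’s `difflib`.

```python
import re
from difflib import get_close_matches

# Hypothetical sample sentences, invented for illustration only;
# they are not quotations from the reports.
pages = [
    "Cases of diarrhoea were notified in the district.",
    "Several deaths from diarrhea occurred among infants.",
    "An outbreak of diarrhoae was reported in June.",
    "The treatment of sewage was improved.",
]

def tokens(text):
    """Split a sentence into lower-case words."""
    return re.findall(r"[a-z]+", text.lower())

def exact_hits(term, docs):
    """Indexes of documents containing the term exactly as typed."""
    return [i for i, d in enumerate(docs) if term in tokens(d)]

def fuzzy_hits(term, docs, cutoff=0.8):
    """Indexes of documents containing a word spelled similarly to the term."""
    return [
        i for i, d in enumerate(docs)
        if get_close_matches(term, tokens(d), n=1, cutoff=cutoff)
    ]

print(exact_hits("diarrhoea", pages))   # misses the variant and the mis-spelling
print(fuzzy_hits("diarrhoea", pages))   # catches all three spellings
print(exact_hits("treatment", pages))   # a hit, but about sewage, not medicine
```

Even the looser match only compares spellings. It knows nothing about alternative terminology, and it cannot tell a medical ‘treatment’ from the treatment of sewage.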

If we are to make the most of the opportunity offered by digitisation, we need to find a way to engage with a digital corpus like London’s Pulse holistically. However, this archive, estimates my colleague Luke Blaxill, consists of something like 75 million words and, as he says, this is more than any scholar could read in a lifetime. To historians, dealing with such ‘big data’ is new and daunting territory. We are taking our lead on how to engage with such data from those in other fields, particularly corpus linguistics. The aim is to find ways of using digital technology, such as text mining, to generate new methodologies that will be valuable tools for historians in the brave new age of digital humanities.

Text mining is not a substitute for traditional historical scholarship, but it offers some useful methods and opens up new possibilities for research. With text mining, we can use digital technology to extract not just the words from a text but their meaning, once the programme has been ‘taught’ about the structure of texts and how words relate to each other. Thus, we can search for concepts, not just exact terms, and we can be surer of finding all, and only, relevant material. Moreover, text-mining programmes can generate information about associations within the text between the concept under examination and other ideas or terms, and so yield insights about the content of a large corpus of collected texts, such as the London’s Pulse archive. This ability to parse meaning from bodies of text too large to be digested by a human reader opens new avenues of research.
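One very simple way of picturing such ‘associations’ is to count which words tend to appear near a term of interest, a technique corpus linguists call collocation. The Python sketch below is an illustration under my own assumptions, not a description of how any particular text-mining programme works: the sample text, the stopword list, the window size and the function name `collocates` are all hypothetical.

```python
import re
from collections import Counter

def collocates(text, target, window=4, stopwords=frozenset()):
    """Count words occurring within `window` words of each use of `target`."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word == target:
            start = max(0, i - window)
            # Words just before and just after the target, excluding the target.
            neighbours = words[start:i] + words[i + 1:i + 1 + window]
            counts.update(n for n in neighbours if n not in stopwords)
    return counts

# Hypothetical sample text, invented for illustration only.
sample = (
    "Overcrowding and measles spread together in the poorer wards. "
    "Measles deaths rose where overcrowding was worst."
)

# A tiny illustrative stopword list; a real one would be much longer.
common_words = frozenset({"and", "in", "the", "where"})

result = collocates(sample, "measles", stopwords=common_words)
print(result.most_common(3))
```

Counting neighbours in this way captures co-occurrence, not meaning. Programmes that have been ‘taught’ about the structure of texts go further, but the underlying idea of surfacing patterns that no reader set out to find is the same.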

There are currently two projects using text-mining techniques to engage with London’s Pulse. In Manchester, historians are collaborating with the National Centre for Text Mining (NaCTeM) to develop a programme that is fluent in the vocabulary and associations that will make it useful for historical research. My own project, funded directly by the Wellcome Library, will use an off-the-peg, much simpler, text-mining programme to enable me to look at several reports without sacrificing depth of historical engagement. Where will text mining take medical history? Watch this space…

Jane K. Seymour
Wellcome Library Fellow
Centre for History in Public Health