Case Study OED API: Exploring word meaning in historical texts with computational methods
In this case study, Exploring word meaning in historical texts with computational methods: when time makes a difference, Dr Barbara McGillivray details her experience using the OED API to aid her research.
This study is part of Living with Machines, a five-year project funded by UK Research and Innovation and based at The Alan Turing Institute and the British Library. Living with Machines aims to mine large historical datasets and develop new computational methods to gain new insights into the way the lives of ordinary people were affected by mechanisation in the long nineteenth century. The study is described in this paper.
A fundamental question for the project is: which kinds of entities and objects should we focus on to investigate mechanisation? Because we are interested in mechanisation, words related to machines are an obvious place to start. But words have many meanings and nuances. Our study made a first attempt at uncovering the semantic complexities of words in their historical context. We focussed on the word “machine” first, and then extended the methodology to other words with the aim of understanding how the semantics of words (i.e. their meanings) interacts with historical change (i.e. the change happening in the world).
Current computational linguistics research has made successful attempts at modelling word meaning at scale, but a lot remains to be done to put these computational models to the test of historical scholarship and see how they can help us answer pressing research questions. Importantly, a lot of computational research looks at texts in a historical vacuum, or “synchronically”, as linguists would say. This is particularly true of research on identifying the meaning of words in texts (known as “word sense disambiguation”), which was our topic of interest. Instead, we wanted to see whether knowing about the time in which a text was written made a difference to the way these methods work.
Take the entry for machine, which has twenty-six senses (and definitions) in the OED. One of them is ‘A complex device consisting of a number of interrelated parts, each having a definite function, together applying, using, or generating mechanical or (later) electrical power to perform a certain kind of work’ (sense IV 6b of “machine, n.”).
Now let us take three examples where the word machine is used:
- The calculating machine now constructing under the superintendence of the inventor.
- The Church was excellent as a national refrigerating machine.
- Examples of mobile earthmoving plant are bulldozers, graders and scrapers.
To the human eye, the first example is clearly related to the definition we gave above, while the second one is not. The third one contains the word plant rather than machine, but this word is used with a sense related to the one we are interested in. How can we replicate this at scale? In other words, how can we get a computer to recognise usages of a word that are related to a specific definition of interest in historical texts? We turned to the OED API, machine learning and word embeddings for this.
Why did you choose the OED API to aid your research?
The OED is an incredibly valuable resource to anyone interested in tracing the meaning of English words historically. It has a very rich inventory of definitions for the different meanings a word can have, and it connects these definitions with real examples (or “quotations”) from historical sources.
In our case, we wanted to make the best use of this rich resource to gather as much information as possible about how quotations and definitions are linked together. Being able to use the OED API meant that we could mine these data at scale, across many dictionary entries. Moreover, the OED is linked to the Historical Thesaurus of English (HTE) at the level of individual senses. So we can find words that are semantically close to a specific sense of machine (and many other words). For example, the definition of plant as ‘As a mass noun: machinery and apparatus, either fixed or movable, used in an industrial or engineering process.’ (sense 5.d. of “plant, n.”) is related to the one we are considering here. This allowed us to considerably expand the set of entries and therefore the amount of quotation evidence to be used by our algorithm.
Given a definition of interest (for example machine as ‘A complex device consisting of a number of interrelated parts […]’), we collected all the OED quotations associated with this definition and dated between 1760 and 1850. In addition, we collected all quotations of senses semantically related to the one in focus. For example, we also collected the quotations of sense 5.d. of “plant, n.”. This way, we had a lot of data that our algorithm could learn from. We trained a series of word embedding models (i.e. geometrical representations of the words) on this quotation dataset and created a semantic representation for each sense of the words. In other words, we were able to represent each sense as a geometrical object (a vector or embedding). Then, we took a sentence containing the word (e.g. machine) and tried to decide, automatically, whether this usage related to the sense of interest or not. We did this by establishing whether the embedding for the sentence was sufficiently close to the sense embedding.
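To make the last step concrete, here is a minimal Python sketch of comparing a sentence embedding with a sense embedding. It is an illustration under stated assumptions, not the method used in the study: the word vectors are made-up three-dimensional toys, averaging word vectors is just one simple way to build sentence and sense embeddings, cosine similarity is an assumed measure of closeness, and the function names and the 0.8 threshold are hypothetical.

```python
import numpy as np

# Toy 3-dimensional word vectors, invented for illustration only.
# A real model would be trained on the quotation dataset.
WORD_VECTORS = {
    "calculating": np.array([0.9, 0.1, 0.0]),
    "engine": np.array([0.8, 0.2, 0.1]),
    "steam": np.array([0.7, 0.3, 0.0]),
    "power": np.array([0.8, 0.1, 0.2]),
    "church": np.array([0.0, 0.9, 0.3]),
    "national": np.array([0.1, 0.8, 0.4]),
}


def embed(text):
    """Average the vectors of the known words in `text` (an assumed, simple choice)."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None  # no known words, so no embedding
    return np.mean(vectors, axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def sense_embedding(quotations):
    """Represent a sense as the mean embedding of its quotations."""
    vectors = [v for v in (embed(q) for q in quotations) if v is not None]
    return np.mean(vectors, axis=0)


def matches_sense(sentence, sense_vec, threshold=0.8):
    """True if the sentence embedding is close enough to the sense embedding."""
    v = embed(sentence)
    return v is not None and cosine(v, sense_vec) >= threshold


# Placeholder "quotations" for the sense of interest (not real OED data).
sense_vec = sense_embedding(["steam engine power", "calculating engine"])
print(matches_sense("calculating steam power", sense_vec))
print(matches_sense("national church", sense_vec))
```

With these toy vectors the first sentence is accepted and the second is rejected. In practice the threshold would have to be tuned on annotated examples.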
Crucially, in some of our experiments these sense embeddings were time sensitive because they were created using the information of the dates of the quotations. We found that building embeddings using data from the time period we are interested in leads to better results than embeddings built on contemporary data or older data. The results are nuanced and stronger for some words than others.
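One simple way to picture a time-sensitive setup is to restrict the quotations to the period of interest before any embedding is built. The sketch below assumes this reading; the `Quotation` class, the `quotations_in_period` function and the sample records are hypothetical placeholders, and the study may have used the dates differently.

```python
from dataclasses import dataclass


@dataclass
class Quotation:
    year: int  # year of the quotation
    text: str  # quotation text


def quotations_in_period(quotations, start=1760, end=1850):
    """Keep only quotations dated within [start, end], inclusive."""
    return [q for q in quotations if start <= q.year <= end]


# Placeholder records, not real OED quotations.
sample = [
    Quotation(1700, "placeholder quotation A"),
    Quotation(1800, "placeholder quotation B"),
    Quotation(1900, "placeholder quotation C"),
]
in_period = quotations_in_period(sample)
print([q.year for q in in_period])
```

The filtered list could then be used to train period-specific embeddings, which can be compared against embeddings built from earlier or later quotations.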
What was your experience of working with the OED API?
The OED API was a core contribution to this research. Without it, the research would simply not have happened. It provided us with a wealth of high-quality historical textual evidence linked to semantic properties of words, which is exactly what we needed to let our algorithm learn patterns of association between words and meanings in historical times. Our method can be further expanded to work on other languages for which historical dictionary data are available, and even beyond the dictionary world. For example, a corpus of historical texts annotated at the level of their semantics could be used instead of the OED quotations, but the availability of the OED API meant that we could develop this method in the first place. We hope that our research will be useful to other scholars interested in exploring word meaning in historical texts at scale. We are also pleased to have drawn attention to historical texts within the natural language processing community and to have shown that computational methods capable of handling the specificities of historical texts are of great relevance not just to digital humanities and historical linguistics research, but also to the cultural heritage sector.