Using corpora to track the language of Covid-19: update 2
Back in April, when the first update of Covid-19-related vocabulary was published in the OED, we wrote about some developments that we’d been monitoring using our language corpora. In this update we take a look at some of the linguistic changes that have happened since then, and describe the ways that OED lexicographers use corpora to track such changes. The analysis is mainly based on our monitor corpus of English, which currently contains over 10 billion words of web-based news content from 2017 to the present day, and is updated each month.
Tracking spikes and surges in frequency
Each month we run searches to identify words which are markedly more frequent in that month than in the corpus as a whole (‘keywords’ for those months)[i]. As described in our April blog post, keywords in January, February, and March included those describing and naming the virus and the disease (coronavirus, Covid-19, respiratory, etc.), and those referring to the social consequences and medical response (social distancing, self-isolation, self-quarantine, lockdown, PPE, ventilator, etc.). The keywords in April, May, and June show further shifts and changes:
In April, the vocabulary continues to centre on the social and economic impacts of Covid-19, including lockdown, social or physical distancing, and – following the introduction of the UK’s Coronavirus Job Retention Scheme in late March – furlough. And as millions of people adapted to communicating remotely, references to the video-chat application Zoom became widespread, including use as a verb. Mask and covering are also keywords in April, May, and June, reflecting continuing discussions of when and where face coverings should be worn.
In May, we see the first signs of looking ahead to life post-lockdown: reopen, phased (as in phased return to work, phased reopening), and easing (especially in easing of restrictions/measures, easing of the lockdown) are all keywords this month. There is also an interesting pattern of contrast with virtual life as people start to think about, or tentatively restart, face-to-face interaction: in-person has increased in frequency, and is now used in contexts where it would not previously have been necessary (since the ‘in-person’ version was the norm), as in in-person worship and in-person graduation.
The search for effective medical treatments continues, reflected in the frequency of references to hydroxychloroquine in the news and, in June, dexamethasone: these and other medical and scientific terms included in this update are discussed in this article.
For OED lexicographers, these keyword searches help to highlight significant terms that we need to consider adding to the dictionary: thus, this month’s new additions include contact tracer n., face covering n., and Zoom v.2, among others. They also highlight entries that need revisiting in light of recent developments, such as furlough n. and v. For any given term we can then dig further into the data: for example, a quick analysis shows that the most common recent use of the noun covering is in face covering (facial covering is also used but much less frequently), and that this term has significantly increased in frequency since April, as shown in the chart below[ii]. In fact, the term has a long history – our new entry shows that it dates to a1732, in a general sense – but its current frequency and cultural significance make it an important addition to the dictionary now.
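For readers who would like to see the idea behind a keyword search in concrete terms, here is a minimal sketch. It assumes the 'simple maths' score described in the Sketch Engine documentation, in which each word's frequency per million in the focus corpus is divided by its frequency per million in the reference corpus, with a small constant added to both. The function names, the smoothing value of 1, and all the counts are invented for illustration; they are not Oxford Corpus data, and this is not necessarily how our own searches are configured.

```python
from collections import Counter


def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(focus_counts, focus_size, ref_counts, ref_size, smoothing=1.0):
    """Rank focus-corpus words by (fpm_focus + n) / (fpm_ref + n).

    The smoothing constant n avoids division by zero for words that are
    absent from the reference corpus.
    """
    scores = {}
    for word, count in focus_counts.items():
        fpm_focus = per_million(count, focus_size)
        fpm_ref = per_million(ref_counts.get(word, 0), ref_size)
        scores[word] = (fpm_focus + smoothing) / (fpm_ref + smoothing)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented toy counts: one month (focus) against the whole corpus (reference).
focus = Counter({"furlough": 300, "the": 600_000, "lockdown": 900})
reference = Counter({"furlough": 200, "the": 60_000_000, "lockdown": 1_000})

ranked = keyness(focus, 10_000_000, reference, 1_000_000_000)
for word, score in ranked:
    print(f"{word}: {score:.2f}")
```

In this toy example a very common word such as the scores close to 1, because its relative frequency is the same in both corpora, while words that are unusually frequent in the focus month rise to the top of the list.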
We also regularly carry out a ‘trends’ search, which identifies words which have surged in frequency over a particular period of time, even if their overall frequency isn’t as high as those on the monthly keywords lists[iii]. This is a useful tool for catching words as they emerge into usage, which we can monitor for future updates. For example, deconfinement was one of the top trending words in May. This word, borrowed from French déconfinement and referring to the process of coming out of lockdown (confinement in French), was initially used with reference to France and other French-speaking countries but started to be used more widely to refer to any country’s lockdown-easing process. However, its overall frequency is not particularly high (0.4 per million tokens at its peak; contrast this with face covering above, which in June had a frequency of over 30 per million), and it has become slightly less frequent in June. It may not last as a word in English, in this sense: we’ll continue to track its usage.
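The difference between a word that is frequent and a word that is trending can be sketched in a few lines. The example below converts monthly counts to frequencies per million tokens and then fits a plain least-squares slope to them. The choice of a least-squares slope is an assumption made for illustration rather than a description of the method used by any particular corpus tool, and the monthly counts and corpus sizes are invented placeholders.

```python
def per_million(count, tokens):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / tokens


def trend_slope(values):
    """Least-squares slope of values plotted against 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two time points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Invented toy data: four months, each with 200 million tokens.
monthly_tokens = [200_000_000] * 4
rising_counts = [0, 4, 20, 80]            # rare overall, but climbing fast
steady_counts = [6000, 6000, 6000, 6000]  # frequent, but flat

rising_fpm = [per_million(c, t) for c, t in zip(rising_counts, monthly_tokens)]
steady_fpm = [per_million(c, t) for c, t in zip(steady_counts, monthly_tokens)]

rising_slope = trend_slope(rising_fpm)
steady_slope = trend_slope(steady_fpm)
print(rising_fpm, rising_slope)
print(steady_fpm, steady_slope)
```

The flat word is far more frequent in every month, yet only the rare word has a positive slope, which is why a trends search can surface items that never reach a monthly keyword list.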
Another word on a recent trends list was teleconsultation (a health-care consultation carried out remotely using telecommunications technology). This is not a new word – there’s evidence going back to the 20th century on the various databases we consult – but before the Covid-19 pandemic it was relatively infrequent. If the recently increased frequency continues, it may be a candidate to add to the OED to join related words which are already covered, like telemedicine and telehealth. Therefore, this is another word on our watch list.
Using corpora to inform editorial decisions
As well as highlighting potential dictionary additions, corpora provide objective data about word usage. In the following sections we give some examples of ways that corpus analysis helps ensure that dictionary entries accurately represent the way words are typically used.
The practice in the OED is to give the most common modern British form of a word as the headword or lemma, with additional spellings listed as variant forms. We use corpora to identify the most frequent form: for example, our most recent corpus data shows that face covering is approximately 17 times more frequent than hyphenated face-covering, so the former is the lemma form in the OED.
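As a rough illustration of this kind of comparison, the sketch below picks the most frequent spelling from a set of variant counts and reports how many times more frequent it is than the runner-up. The function name is hypothetical, and the counts are invented placeholders chosen only to reproduce the ratio of about 17 mentioned above; they are not the actual corpus figures.

```python
def choose_headword(variant_counts):
    """Return (most frequent form, ratio to the runner-up or None)."""
    if not variant_counts:
        raise ValueError("at least one variant is required")
    ranked = sorted(variant_counts.items(), key=lambda item: item[1], reverse=True)
    top_form, top_count = ranked[0]
    # With a single variant, or a runner-up that never occurs, no ratio exists.
    if len(ranked) < 2 or ranked[1][1] == 0:
        return top_form, None
    return top_form, top_count / ranked[1][1]


# Invented placeholder counts for the two spellings.
counts = {"face covering": 1700, "face-covering": 100}
headword, ratio = choose_headword(counts)
print(headword, ratio)
```

A real comparison would also need to restrict the counts to the relevant variety and period, since the practice described above concerns the most common modern British form.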
Some words are more complex. There has been quite a lot of discussion online about whether Covid-19 should be spelled with an initial capital (as in this article) or with full capitals, COVID-19, and different official bodies and news organizations follow different practices. As the charts below show, the pattern varies according to variety of English. Specifically, in UK English there is a clear preference for the form Covid-19, while in the US the preference is for COVID-19, although with a very slight shift towards Covid-19 in recent months. There may be fluctuations as time goes on, and this is something we’ll continue to track.
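A comparison of this kind boils down to working out, for each variety of English, what share of all occurrences each spelling accounts for. The sketch below shows one way to do that. The function name and the counts are invented for illustration: only the direction of the preference in each variety follows the description above, and the numbers themselves are not corpus data.

```python
def form_shares(counts_by_region):
    """For each region, return each form's share of that region's total."""
    shares = {}
    for region, counts in counts_by_region.items():
        total = sum(counts.values())
        shares[region] = {
            form: (count / total if total else 0.0)
            for form, count in counts.items()
        }
    return shares


# Invented placeholder counts; matching must be case-sensitive so that
# Covid-19 and COVID-19 are tallied separately.
counts_by_region = {
    "UK": {"Covid-19": 80, "COVID-19": 20},
    "US": {"Covid-19": 30, "COVID-19": 70},
}

shares = form_shares(counts_by_region)
preferred = {region: max(forms, key=forms.get) for region, forms in shares.items()}
print(shares)
print(preferred)
```

Running the same calculation month by month would show whether a preference is stable or shifting over time.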
Corpora also provide useful information about the distribution of a word in different varieties of English. For example, as the chart below shows, although frontliner is used worldwide, it is particularly frequent in South East Asia, especially the Philippines and Malaysia: in other countries the more usual term is frontline worker or similar. For this reason we have labelled frontliner n. sense 2 as ‘now chiefly South-East Asian’.
Such corpus data is often useful in confirming editorial hunches. The editors working on self-isolate, self-quarantine, and related words felt that although there are technical differences between the two terms, they are often used interchangeably, the main difference being in regional distribution. To confirm this we looked at various corpora, and the clearest picture can be seen in the Coronavirus Corpus, a corpus of news articles relating to Covid-19 on english-corpora.org. As shown in the charts below, self-quarantine is more common in the US than in Canada, Great Britain, Ireland, Australia, and New Zealand, where self-isolate and self-isolation are preferred. A note to this effect has been added to our updated entry for self-quarantine v.: “In recent use, in the context of the Covid-19 pandemic, self-isolate and self-quarantine have often been used interchangeably, with self-quarantine being more common in the United States.”
Usage and collocations
Finally, corpora are invaluable in highlighting the contexts in which a word is used, often indicating particular nuances or senses. As we were drafting frontliner, we had another look at frontline. The sense of the adjective as used in frontline worker/employee/staff, etc., had been defined as “Of a person: working at the forefront of an organization’s public activity, typically as the point of direct contact with customers, clients, users of the organization’s services, etc.” This was an accurate summary when the entry was first revised a few years ago, but the focus of the sense has shifted during the Covid-19 pandemic. We compared salient collocates of frontline – that is, words occurring near frontline with a statistically significant frequency – in 2020 with those of previous years. Some had remained unchanged – frontline staff has been a consistently common collocation, for example – but the following stood out as much more frequent in 2020:
- frontline nurse/medic/caregiver
- frontline healthcare/health-care workers
- frontline warrior/hero
- courageous/heroic frontline workers
- essential frontline worker
This very positive sentiment associated with frontline workers, and the focus on such workers as carrying out essential roles, especially in health care, led us to expand the definition as follows: “Of a person: working at the forefront of an organization’s public activity, typically as the point of direct contact with customers, clients, users of the organization’s services, etc., (now) esp. designating such an employee who provides a service regarded as vital within the community, such as a health-care worker, teacher, etc.; often in frontline worker.”
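The collocate comparison described above can be sketched as follows. The example counts words occurring within a small window of a node word and scores them with logDice, an association measure defined as 14 + log2(2 × f(node, collocate) / (f(node) + f(collocate))). Note that logDice is swapped in here as a simple stand-in and is not a test of statistical significance. The window size, the function name, and the tiny tokenized 'corpora' are all invented for illustration.

```python
import math
from collections import Counter


def collocate_scores(sentences, node, window=2):
    """Return {collocate: logDice} for words within `window` tokens of `node`."""
    word_freq = Counter()
    pair_freq = Counter()
    for tokens in sentences:
        word_freq.update(tokens)
        for i, token in enumerate(tokens):
            if token != node:
                continue
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pair_freq[tokens[j]] += 1
    node_freq = word_freq[node]
    return {
        word: 14 + math.log2(2 * freq / (node_freq + word_freq[word]))
        for word, freq in pair_freq.items()
    }


# Invented toy sentences standing in for an earlier and a recent period.
earlier = [
    ["frontline", "staff", "answer", "customer", "calls"],
    ["frontline", "staff", "serve", "customers"],
]
recent = [
    ["frontline", "staff", "need", "ppe"],
    ["heroic", "frontline", "nurse", "praised"],
    ["frontline", "nurse", "shortage"],
]

earlier_scores = collocate_scores(earlier, "frontline")
recent_scores = collocate_scores(recent, "frontline")
new_collocates = sorted(set(recent_scores) - set(earlier_scores))
print(new_collocates)
```

Collocates that appear in both periods, such as staff in this toy data, drop out of the comparison, leaving the words that are newly associated with the node in the recent period.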
Of course, while Covid-19 has been one of the defining features of 2020 so far, the other major topic in the news has been the Black Lives Matter movement and the protests following the killing of George Floyd on 25th May. The corpus keywords in June (in the table at the beginning of this article) reflect the enormous impact of these events, with references to racism, injustice, and police brutality, calls to defund the police, and discussions of the removal of Confederate statues and other monuments. The enduring effects of these events – on society and on language – are still unknown. But the OED will continue to record and review them for future updates.
[i] For an explanation of keywords in corpus linguistics see https://www.sketchengine.eu/my_keywords/keyword/. The reference corpus was the whole Oxford Corpus; the focus corpus was the section for the given month. Proper names were excluded.
[ii] Throughout this article, charts based on Oxford Corpus data show frequencies per million tokens. (Tokens are the smallest units of a corpus, typically either words or punctuation marks: for consistency, corpus sizes are usually measured in tokens rather than words.) Also, variant spellings and inflected forms are included: for example, figures for face covering include those for face-covering, face-coverings, face covering, face coverings, etc.
[iii] For an explanation of trends searches see https://www.sketchengine.eu/guide/trends/
Header image: created by Kevin Kobsic, Unsplash