Corpus analysis of the language of Covid-19

Last week the OED was updated with some of the words and phrases which have become increasingly familiar in the context of the current global crisis, such as self-isolation, social distancing, and flatten the curve. OED editors are continually monitoring linguistic developments, and one of the ways of doing this is through analysis of language corpora. In this article we summarize some recent trends, using data from our monitor corpus of English. This corpus contains over 8 billion words of web-based news content from 2017 to the present day, and is updated each month.

Coronavirus, COVID-19, and other words denoting the virus and the disease

The charts below show the frequency in the last four months of coronavirus, COVID-19, and other words denoting the novel coronavirus and the disease it causes [1].
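For readers who want to reproduce this kind of figure on their own data, the normalization described in note [1] (frequency per million tokens, with variant spellings counted together) can be sketched in a few lines of Python. This is only an illustration, not the pipeline behind the charts: the variant table and the function names are hypothetical, and a real corpus tool would also handle inflected forms through lemmatization.

```python
from collections import Counter


def per_million(count, corpus_tokens):
    """Normalized frequency: occurrences per million tokens."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return count * 1_000_000 / corpus_tokens


# Hypothetical variant table: maps lower-cased spellings to one headword.
VARIANTS = {
    "covid-19": "COVID-19",
    "covid19": "COVID-19",
    "coronavirus": "coronavirus",
    "coronaviruses": "coronavirus",
}


def grouped_frequencies(tokens, variants=VARIANTS):
    """Count each headword across its variants, then normalize per million."""
    counts = Counter()
    for tok in tokens:
        head = variants.get(tok.lower())
        if head is not None:
            counts[head] += 1
    total = len(tokens)  # punctuation marks count as tokens too
    return {head: per_million(n, total) for head, n in counts.items()}
```

Normalizing by corpus size is what makes months of different sizes comparable: a raw count of 2 in a 5-token sample and a raw count of 400,000 in a million-token sample give the same per-million figure.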

Most of the words have, to different degrees, become more frequent, including the shortened forms corona [2] and covid. (We’ve also seen evidence of the further shortenings rone and rona, mainly on social media.) The exceptions are the abbreviations of novel coronavirus, nCoV and 2019-nCoV, which peaked in February and have since become less common.

The most striking change has been the huge increase in frequency of the words coronavirus and COVID-19 themselves. Before 2020, coronavirus was relatively rare outside medical and scientific discourse, while COVID-19 was only coined in February; both now dominate global discourse.

The charts below illustrate the extent to which the word coronavirus has become overwhelmingly frequent. The first compares it with words referring to other major news topics in recent times: climate, Brexit, and impeachment. The second compares it with one of the most frequently used nouns in the English language, time.

Words used with coronavirus

The changing contexts in which a word is used can give insight into shifting perceptions and concerns. The table below shows the top twenty collocates of coronavirus in the last three months: that is, words occurring near coronavirus with a statistically significant frequency [3]. Collocates occur in different patterns: for example, outbreak, novel, spread, and fight are all collocates of coronavirus in the following phrases: coronavirus outbreak; novel coronavirus; spread of coronavirus; fight the coronavirus.
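The collocation method summarized in note [3] (a window of three words on either side, function words excluded, ranking by logDice) can be sketched roughly as follows. The logDice formula, 14 + log2(2·f_xy / (f_x + f_y)), is the published definition of the measure; everything else here is an assumption made for illustration, including the function names, the whitespace-style token list, and the way co-occurrences are counted. Sketch Engine's own implementation will differ in its details.

```python
import math
from collections import Counter


def log_dice(f_xy, f_x, f_y):
    """logDice score: 14 + log2 of the Dice coefficient 2*f_xy / (f_x + f_y)."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def collocates(tokens, node, window=3, stopwords=frozenset()):
    """Rank words found within `window` tokens of `node` by logDice.

    Simplification: each token position is counted at most once, even if it
    falls inside the windows of two separate occurrences of the node word.
    """
    freq = Counter(tokens)
    positions = set()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            positions.update(range(start, end))
    co = Counter(
        tokens[j]
        for j in positions
        if tokens[j] != node and tokens[j] not in stopwords
    )
    scored = {w: log_dice(n, freq[node], freq[w]) for w, n in co.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

Because logDice depends only on the frequencies of the two words and of their co-occurrence, not on the total corpus size, scores remain comparable as the corpus grows month by month.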

The linguistic impact of the coronavirus pandemic

The impact of the current pandemic on the English language can be explored by looking at corpus keywords in the last three months: that is, words which were significantly more frequent in those months than in the corpus as a whole[4].
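As a rough illustration of how such keywords can be extracted, the sketch below compares a word's per-million frequency in a focus corpus (one month) with its per-million frequency in a reference corpus (the whole corpus), adding a small smoothing constant to both sides. This follows the "simple maths" style of keyness score documented by Sketch Engine; whether the analysis in this article used exactly this score or this smoothing value is an assumption, and the function names and example counts are hypothetical.

```python
def keyness(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Ratio of smoothed per-million frequencies (focus over reference)."""
    focus_fpm = focus_count * 1_000_000 / focus_size
    ref_fpm = ref_count * 1_000_000 / ref_size
    return (focus_fpm + smoothing) / (ref_fpm + smoothing)


def top_keywords(focus_counts, focus_size, ref_counts, ref_size,
                 n=20, smoothing=1.0):
    """Return the n words with the highest keyness score in the focus corpus."""
    scores = {
        word: keyness(count, focus_size, ref_counts.get(word, 0),
                      ref_size, smoothing)
        for word, count in focus_counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical counts, for illustration only.
march = {"lockdown": 50, "time": 100}
whole = {"lockdown": 5, "time": 1000}
print(top_keywords(march, 1_000_000, whole, 10_000_000, n=1))
```

A word such as time scores close to 1 because it is equally common in both corpora, while a word that surges in the focus month scores well above 1. The smoothing constant prevents division by zero for words absent from the reference corpus, and raising it shifts the ranking towards more frequent words.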

The table below shows the top twenty keywords for January, February, and March; those relating to the coronavirus crisis are highlighted in red. In January and February, some of the keywords related to coronavirus; others referred to world events such as the Australian bushfires, the assassination of Qasem Soleimani, Donald Trump’s impeachment and acquittal, the Democratic caucuses, locust swarms in East Africa, and investigations into the Astros sign-stealing scandal. In March, however, every one of the top twenty keywords was in some way related to coronavirus.

It is also revealing to compare the coronavirus-related keywords from January to March. In January, the words mainly relate to naming and describing the virus: coronavirus, SARS, virus, human-to-human, respiratory, flu-like. By March the keywords reflect the social impact of the virus, and issues surrounding the medical response: social distancing, self-isolation and self-quarantine, lockdown, non-essential (as in non-essential travel), and postpone are all especially frequent, as are PPE and ventilator.

As noted in last week’s update, many of the words used in the context of the current crisis are not completely new, but were relatively uncommon before this year. The chart below shows the increase in frequency of two particularly salient sets of terms: social distancing/social distance and self-isolation/self-isolate.

Our research into the language of Covid-19 is ongoing, and we will share updates on the blog as we continue to monitor our corpora and track linguistic developments.


[1] Throughout this article, charts show frequencies per million tokens. (Tokens are the smallest units of a corpus, typically either words or punctuation marks: for consistency, corpus sizes are usually measured in tokens rather than words.) Also, variant spellings and inflected forms are included: for example, figures for COVID-19 include those for Covid-19, COVID19, etc., and figures for self-isolate include those for self-isolated, self-isolating, etc.

[2] Figures for corona are estimates, based on analysing samples of uses of corona (which has a number of senses) and extrapolating overall frequency of the use as a shortening of coronavirus.

[3] The corpus interface used was the Sketch Engine. Collocates within three words on either side of coronavirus were retrieved (excluding prepositions and other function words), and ordered by statistical significance using the logDice measure: see https://www.sketchengine.eu/my_keywords/logdice/.

[4] For an explanation of keywords in corpus linguistics see https://www.sketchengine.eu/my_keywords/keyword/. The reference corpus was the whole Oxford Corpus; the focus corpus was the section for the given month. Proper names were excluded.

Image by John Cameron on Unsplash

