Applying a semantic tool to the OED: the Linguistic DNA Project

The questions which could not be addressed during the webinar session were answered by Dr Mehl, and the answers are available to view below.


Could you explain about the name ‘Linguistic DNA’? What do you mean by DNA here?

We’ve described the project as an attempt to pinpoint the ‘building blocks’ of modern discourse, hence the comparison to DNA as the ‘building blocks’ of life. We see the interactions between key words (i.e. their co-occurrences) as constituting these building blocks, and we map them using high-performance computing that’s comparable to what is used by geneticists mapping genomes. 

As regards Mutual Info, what about asymmetries of Mutual Info within trios?

Yes, Mutual Information is asymmetrical within trios, and can be measured in multiple ways: between pairs of words A and B, B and C, and A and C, for example. In Fano’s (1960) first explanation of Mutual Information, he also proposes measuring it for trios, based on the probability that event (or word) C will occur, given that events (or words) A and B occur. That’s how we measure Mutual Information for trios.
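As a rough illustration of the conditional measure described above, here is a minimal Python sketch. It assumes a pointwise form, log2(P(C | A, B) / P(C)), with probabilities estimated from raw counts; the function name, the parameters, and the exact normalisation are assumptions for illustration and are not taken from the project's own code.

```python
import math

def trio_mi(count_abc, count_ab, count_c, total_tokens):
    """Pointwise Mutual Information for C given the pair (A, B).

    Assumed form: log2( P(C | A, B) / P(C) ), where
      P(C | A, B) is estimated as count_abc / count_ab, and
      P(C)        is estimated as count_c / total_tokens.
    All four arguments are hypothetical count inputs.
    """
    p_c_given_ab = count_abc / count_ab
    p_c = count_c / total_tokens
    return math.log2(p_c_given_ab / p_c)

# C follows half of all (A, B) co-occurrences, but makes up only 1/32 of the corpus.
score = trio_mi(count_abc=8, count_ab=16, count_c=32, total_tokens=1024)
print(score)  # 4.0
```

Because the score conditions on a particular pair, swapping which word plays the role of C changes both counts in the ratio, which is one way to see why different orderings of the same trio can receive different scores.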

Can the trios be used to identify corpuses?

Yes, and there’s a lot of potential here. People already use keywords to build corpora. For example, a researcher might build a corpus of newspaper articles whose titles contain the word ‘crisis’, or whose texts contain any word from a list related to crises. We could use our tool to build corpora that contain specific trios or quads.
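A minimal Python sketch of how a trio might be used as a selection criterion for a corpus is given below. It is only an approximation under stated assumptions: texts are already tokenised, the first word of the trio is treated as the anchor, and the 50-token window is borrowed from the description of our processor elsewhere in these answers. The function names and the toy texts are hypothetical.

```python
def contains_trio(tokens, trio, window=50):
    """True if the other two words both fall within `window` tokens
    on either side of some occurrence of the trio's first word."""
    first, *rest = trio
    for i, tok in enumerate(tokens):
        if tok == first:
            span = tokens[max(0, i - window): i + window + 1]
            if all(word in span for word in rest):
                return True
    return False

def build_corpus(texts, trio, window=50):
    """Keep only the texts (name -> token list) in which the trio co-occurs."""
    return {name: toks for name, toks in texts.items()
            if contains_trio(toks, trio, window)}

# Toy, made-up texts for illustration only.
texts = {
    "text_a": "heaven and earth and god".split(),
    "text_b": "earth only".split(),
}
selected = build_corpus(texts, ("heaven", "earth", "god"), window=5)
print(sorted(selected))  # ['text_a']
```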

On the slide on sermon trios, the pair ‘heaven-earth’ appears also as ‘earth-heaven’. Is this because you used one of these words as search ‘pivot’ each time? If so, pairings will appear differently, I understand. (In the science slide, the ‘body-part’ pair always appears in this order.)

A trio has six different orderings (ABC, ACB, BAC, BCA, CAB, CBA), and each can have a different Mutual Information score. Our processor identifies a first word (or node word or, as you put it, a pivot), and then identifies all the co-occurring pairs within 50 tokens on each side. Then, it identifies for each pair all co-occurring trios within 50 tokens on each side. So, our raw data outputs will include each ordering of a given trio, with a Mutual Information score. The public interface allows the option of viewing each ordering and its score, or only viewing the ordering with the highest Mutual Information score.
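The two ideas in this answer, the six orderings and the window around a node word, can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's processor: the function names are hypothetical, and it covers just the first step (collecting the words that co-occur with a node word within a fixed window).

```python
from itertools import permutations

def window_neighbours(tokens, node, window=50):
    """Collect every word found within `window` tokens on either side
    of each occurrence of `node` (the node position itself is skipped)."""
    neighbours = set()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = i + window + 1
            for j in range(lo, min(hi, len(tokens))):
                if j != i:
                    neighbours.add(tokens[j])
    return neighbours

def trio_orderings(a, b, c):
    """All six orderings of a trio; each could carry its own score."""
    return list(permutations((a, b, c)))

tokens = "heaven and earth shall pass away".split()
print(sorted(window_neighbours(tokens, "earth", window=2)))
# ['and', 'heaven', 'pass', 'shall']
print(len(trio_orderings("heaven", "earth", "god")))  # 6
```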

Can you tell us a few of the texts that you are looking at in this project?

I showed some examples from Speed’s history of the British Isles, and several others, but our data contains over 60,000 texts. Some of them can be viewed here:

Could you explain why you have not included adverbs?

Good question. We automatically identify parts of speech for each word using a tool called MorphAdorner, which was developed for Early Modern English using traditional ‘philological’ parts of speech (rather than more contemporary linguistic analyses of parts of speech). It combines adverbs together with ‘particles’, ‘conjunctions’, and ‘prepositions’. This is a rather large group that contains more than we would ideally want it to. Our statistical analysis is based on frequencies for each part of speech, and we felt that this rather large group was too messy to be viable, so we left it out. 
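To make the exclusion step concrete, here is a small Python sketch of dropping one part-of-speech group before counting frequencies per part of speech. The tag labels below are invented placeholders, not MorphAdorner's actual tag names, and the input format (a list of word and tag pairs) is likewise an assumption.

```python
from collections import Counter

# Hypothetical label standing in for the combined group of adverbs,
# particles, conjunctions and prepositions; not a real MorphAdorner tag.
EXCLUDED_GROUPS = {"adv_particle_conj_prep"}

def keep_tokens(tagged_tokens):
    """Drop every (word, pos) pair whose part of speech is excluded."""
    return [(word, pos) for word, pos in tagged_tokens
            if pos not in EXCLUDED_GROUPS]

def pos_frequencies(tagged_tokens):
    """Frequency of each remaining part of speech."""
    return Counter(pos for _, pos in keep_tokens(tagged_tokens))

tagged = [
    ("heaven", "noun"),
    ("and", "adv_particle_conj_prep"),
    ("earth", "noun"),
    ("pass", "verb"),
]
print(pos_frequencies(tagged))
```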

Professor Mehl gave the URL to connect to the trio data, and I neglected to note it down. Could you please let me know the URL?

Yes, you can link to it via the project website (which is easy to remember), or go directly to it here: