{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![API model](img/api_model.png)\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------------------------------------------\n", "\n", "# The OED researcher API\n", "\n", "The OED API is an interface that enables clients to do things with information derived from the OED.\n", "* 'Clients' = primarily programs and applications, rather than people.\n", "\n", "--------------------------------\n", "\n", "### Usage\n", "\n", "Documentation and sign-up:\n", "https://languages.oup.com/research/oed-researcher-api/\n", "\n", "Base URL: https://oed-researcher-api.oxfordlanguages.com/oed/api/v0.2/\n", "\n", "\n", "---------------------------------\n", "\n", "### Sample queries\n", "\n", "Entry or entries for the word _monitor_: \n", "https://oed-researcher-api.oxfordlanguages.com/oed/api/v0.2/words/?lemma=monitor\n", "\n", "Senses of the word _monitor_ that existed in 1700: \n", "https://oed-researcher-api.oxfordlanguages.com/oed/api/v0.2/senses/?lemma=monitor&current_in=1700\n", "\n", "Words formed with the suffix _–esque_: \n", "https://oed-researcher-api.oxfordlanguages.com/oed/api/v0.2/word/esque_su01/derivatives/\n", "\n", "Senses to do with tennis: \n", "https://oed-researcher-api.oxfordlanguages.com/oed/api/v0.2/senses/?topic=Tennis\n", "\n", "Quotations by women authors between 1780 and 1800: \n", "https://oed-researcher-api.oxfordlanguages.com/oed/api/v0.2/quotations/?year=1780-1800&author_gender=female\n", "\n", "… and the same where these provide the earliest evidence for a word: \n", "https://oed-researcher-api.oxfordlanguages.com/oed/api/v0.2/quotations/?year=1780-1800&author_gender=female&first_in_word=true\n", "\n", "Words derived from Hungarian: \n", "https://oed-researcher-api.oxfordlanguages.com/oed/api/v0.2/words/?etymon_language=Hungarian" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----------------------------------\n", "\n", "# Basic API usage" ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Imports and constants" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "import json\n", "import pprint\n", "import requests\n", "\n", "\n", "API_BASE_URL = 'https://oed-researcher-api.oxfordlanguages.com/oed/api/v0.2/'\n", "# Parameters to be included as headers to each API request\n", "# - required for authorization.\n", "with open('credentials.json') as f:\n", "    credentials = json.load(f)\n", "HEADERS = {\n", "    'app_id': credentials.get('APP_ID'),\n", "    'app_key': credentials.get('APP_KEY'),\n", "}\n", "\n", "\n", "def _make_api_request(endpoint, query_params, show_url=False):\n", "    \"\"\"\n", "    Make the API request.\n", "\n", "    Parameters\n", "    ----------\n", "    endpoint : str\n", "        The API endpoint, e.g. 'senses'.\n", "\n", "    query_params : dict\n", "        Additional query parameters to include in the request.\n", "\n", "    show_url : bool, optional\n", "        If True, print the full request URL. Defaults to False.\n", "\n", "    Returns\n", "    -------\n", "    list\n", "        A list of dicts, each dict being the JSON representation\n", "        of a word, sense, etc., as returned by the API.\n", "    \"\"\"\n", "    response = requests.get(\n", "        API_BASE_URL + endpoint + '/',\n", "        params=query_params,\n", "        headers=HEADERS,\n", "    )\n", "    if show_url:\n", "        print(response.url + '\\n')\n", "    if response.status_code != 200:\n", "        _error_report(response)\n", "        exit()\n", "    else:\n", "        return response.json()['data']\n", "\n", "\n", "def _error_report(response):\n", "    \"\"\"\n", "    Print out an error report for any response that does\n", "    not have a 200 status code.\n", "\n", "    Parameters\n", "    ----------\n", "    response : requests.Response object\n", "    \"\"\"\n", "    print('! 
Status code {code} returned by URL {url}'.format(\n", "        code=response.status_code,\n", "        url=response.url,\n", "    ))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List senses for a lemma\n", "Using the OED API _/senses/_ endpoint.\n", "\n", "* retrieves all the senses of a lemma;\n", "* (optionally) filters for the subset of senses current in a given period (include a year=yyyy keyword argument);\n", "* returns senses in date order." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "def list_possible_meanings(formatted=True):\n", "    lemma = input('WORD: ')\n", "    year = input('YEAR: ').strip()\n", "    display_senses(\n", "        lemma.strip(),\n", "        # A blank year means no date filter\n", "        year=int(year) if year else None,\n", "        formatted=formatted,\n", "    )\n", "\n", "\n", "def display_senses(lemma, year=None, formatted=True):\n", "    \"\"\"\n", "    Fetch and display the senses of a lemma.\n", "\n", "    Parameters\n", "    ----------\n", "    lemma : str\n", "        The lemma (word) for which senses are sought.\n", "\n", "    year : int, optional\n", "        If specified, results are filtered so that only\n", "        senses that were current in this year are included.\n", "        Defaults to None.\n", "\n", "    formatted : bool, optional\n", "        If True, display a formatted version of the sense.\n", "        If False, display the raw JSON representation of\n", "        the sense. 
Defaults to True.\n", "    \"\"\"\n", "    # Set the parameters for the API request\n", "    query_params = {'lemma': lemma, 'current_in': year}\n", "    senses = _make_api_request('senses', query_params, show_url=True)\n", "    for sense in senses:\n", "        if formatted:\n", "            _display_formatted(sense)\n", "        else:\n", "            _display_raw(sense)\n", "\n", "\n", "def _display_raw(sense):\n", "    \"\"\"\n", "    Display the raw JSON of a sense returned by the API.\n", "\n", "    Parameters\n", "    ----------\n", "    sense : dict\n", "        JSON representation of an OED sense, as returned by the API.\n", "    \"\"\"\n", "    pprint.pprint(sense, indent=2, width=80, compact=False, sort_dicts=False)\n", "    print('')\n", "\n", "\n", "def _display_formatted(sense):\n", "    \"\"\"\n", "    Display a formatted view of selected features of a sense\n", "    returned by the API.\n", "\n", "    Parameters\n", "    ----------\n", "    sense : dict\n", "        JSON representation of an OED sense, as returned by the API.\n", "    \"\"\"\n", "    print('{pos}: {defn}\\n\\t{date}\\n\\t{ref} {url}\\n'.format(\n", "        pos=sense['part_of_speech'],\n", "        defn=sense['definition'],\n", "        date=sense['daterange']['rangestring'],\n", "        ref=sense['oed_reference'],\n", "        url=sense['oed_url'],\n", "    ))\n", "\n", "\n", "list_possible_meanings(formatted=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------------------------------\n", "\n", "# Parsing a piece of text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### From John Marston's satire _The scourge of villanie_ (1598)\n", "> But I am vexed, when swarmes of _Iulians_  \n", "> Are still manur'd by lewd Precisians:  \n", "> Who scorning Church rites, take the simbole vp  \n", "> As slouenly, as carelesse Courtiers slup  \n", "> Their mutton gruell. Fie, who can with-hold,  \n", "> But must of force make his milde Muse a scold?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------------------------------\n", "\n", "### Processing each token in a sentence\n", "Using the OED API _/lemmatizetext/_ endpoint:\n", "* tokenizes the input sentence;\n", "* skips punctuation and core vocabulary tokens;\n", "* identifies possible lemmatizations for non-core vocabulary.\n", "\n", "Candidate lemmatizations are returned in order of likelihood, taking into account:\n", "* the date of the text;\n", "* some basic part-of-speech tagging. (This can be improved by pre-processing the text.)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "TEXT = \"\"\"\n", "But I am vexed, when swarmes of Iulians\n", "Are still manur'd by lewd Precisians:\n", "Who scorning Church rites, take the simbole vp\n", "As slouenly, as carelesse Courtiers slup\n", "Their mutton gruell.\n", "\"\"\"\n", "\n", "\n", "def parse_text(text, year):\n", "    \"\"\"\n", "    Use the OED API to parse a string of text.\n", "\n", "    Parameters\n", "    ----------\n", "    text : str\n", "        The string of text to be parsed.\n", "\n", "    year : int\n", "        The (approximate) date of the text.\n", "    \"\"\"\n", "    text = text.replace('\\n', ' ').strip()\n", "    # Set the parameters for the API request\n", "    query_params = {'text': text, 'year': year}\n", "    tokens = _make_api_request('lemmatizetext', query_params)\n", "    for token in tokens:\n", "        process_token(token, year)\n", "\n", "\n", "def process_token(token, year):\n", "    \"\"\"\n", "    Print out information for a single token.\n", "\n", "    Parameters\n", "    ----------\n", "    token : dict\n", "        The dict of features for a single token\n", "        (see documentation for the OED API /lemmatizetext/\n", "        endpoint).\n", "\n", "    year : int\n", "        The (approximate) date of the source text.\n", "    \"\"\"\n", "    print(' ' + token['token'])\n", "    for entry in (t['word'] for t in token['lemmatizations']):\n", "        print(' {e} ({date})'.format(\n", "            
e=entry['oed_reference'],\n", "            date=entry['daterange']['rangestring'],\n", "        ))\n", "        break\n", "\n", "\n", "parse_text(TEXT, 1598)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------------------------------\n", "\n", "### Meanings\n", "Using the OED API _/word/{id}/senses/_ endpoint.\n", "\n", "For a given word, this:\n", "* retrieves all the senses of the word, as listed in OED;\n", "* (optionally) filters for the subset of senses current in a given period;\n", "* returns senses in date order.\n", "\n", "For simplicity:\n", "* we assume that the first lemmatization candidate is correct - so we're only retrieving senses for this word;\n", "* we're skipping higher-frequency words - we're only interested in senses for lower-frequency words." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "def process_token(token, year):\n", "    \"\"\"\n", "    Print out information for a single token.\n", "\n", "    Parameters\n", "    ----------\n", "    token : dict\n", "        The dict of features for a single token\n", "        (see documentation for the OED API /lemmatizetext/\n", "        endpoint).\n", "\n", "    year : int\n", "        The (approximate) date of the source text.\n", "    \"\"\"\n", "    print(' ' + token['token'])\n", "    if token['lemmatizations']:\n", "        entry = token['lemmatizations'][0]['word']\n", "        print(' {e} ({date})'.format(\n", "            e=entry['oed_reference'],\n", "            date=entry['daterange']['rangestring'],\n", "        ))\n", "        fetch_senses(entry, year)\n", "\n", "\n", "def fetch_senses(entry, year):\n", "    \"\"\"\n", "    Fetch and print out the set of senses belonging to a given word,\n", "    filtered for the subset of senses that were current in the year\n", "    specified.\n", "\n", "    Parameters\n", "    ----------\n", "    entry : dict\n", "        The entry whose senses are sought.\n", "\n", "    year : int\n", "        The year used to filter senses for currency.\n", "    \"\"\"\n", "    # Bail out if this is a high-frequency word\n", "    if entry['frequency'] 
and entry['frequency'][-1][1] > 2:\n", "        return\n", "\n", "    query_params = {'current_in': year}\n", "    endpoint = 'word/{id}/senses'.format(id=entry['id'])\n", "    senses = _make_api_request(endpoint, query_params)\n", "    for sense in senses[0:3]:  # just the first 3 senses\n", "        print(' \u2043 \"{defn}...\" ({date})'.format(\n", "            defn=sense['definition'][0:80],\n", "            date=sense['daterange']['rangestring'],\n", "        ))\n", "        fetch_synonyms(sense['id'], year)\n", "\n", "\n", "def fetch_synonyms(sense_id, year):\n", "    pass  # stub\n", "\n", "\n", "parse_text(TEXT, 1598)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------------------------------\n", "\n", "### Synonyms\n", "Using the OED API _/sense/{id}/synonyms/_ endpoint.\n", "\n", "For a given sense, this:\n", "* retrieves all senses in the same node of the semantic taxonomy (~synonyms);\n", "* (optionally) filters for the subset of synonyms current in a given period;\n", "* returns synonyms in alphabetical order by lemma. (Here we post-process the API response to re-sort into date order.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "def fetch_synonyms(sense_id, year):\n", "    \"\"\"\n", "    Fetch and print out the set of synonyms for a given sense,\n", "    filtered for the subset of synonyms that were current in\n", "    the year specified.\n", "\n", "    Parameters\n", "    ----------\n", "    sense_id : str\n", "        The ID of the sense whose synonyms are sought.\n", "\n", "    year : int\n", "        The year used to filter synonyms for currency.\n", "    \"\"\"\n", "    query_params = {\n", "        'current_in': year,\n", "    }\n", "    endpoint = 'sense/{id}/synonyms'.format(id=sense_id)\n", "    synonyms = _make_api_request(endpoint, query_params)\n", "    # Re-sort synonyms into date order\n", "    synonyms.sort(key=lambda s: s['daterange']['start'])\n", "    for synonym in synonyms:\n", "        if synonym['id'] == sense_id:\n", "            continue\n", "        print(' \u2023 {lemma} ({date})'.format(\n", 
" lemma=synonym['lemma'],\n", " date=synonym['daterange']['rangestring'],\n", " ))\n", "\n", "\n", "parse_text(TEXT, 1598)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }