Rhyming with NLP and Shakespeare

“Natural Language Processing with Python” (read my review) has lots of motivating examples for natural language processing. It focuses on NLTK, which, among other things, does a nice job of collecting NLP datasets and algorithms into one library.

Let’s take one of Shakespeare’s sonnets and see if we can recommend alternate rhymes:

import nltk
from nltk.corpus import cmudict

sonnet = ['O thou, my lovely boy, who in thy power',
   'Dost hold Time\'s fickle glass, his sickle, hour;',
   'Who hast by waning grown, and therein showst',
   'Thy lovers withering as thy sweet self growst;',
   'If Nature, sovereign mistress over wrack,',
   'As thou goest onwards, still will pluck thee back,',
   'She keeps thee to this purpose, that her skill',
   'May time disgrace and wretched minutes kill.',
   'Yet fear her, O thou minion of her pleasure!',
   'She may detain, but not still keep, her treasure:',
   'Her audit, though delayed, answered must be,',
   'And her quietus is to render thee.']

First, we tokenize the sonnet and grab the last word of each line.

from nltk.tokenize import wordpunct_tokenize
tokens = [wordpunct_tokenize(s) for s in sonnet]
punct = set(['.', ',', '!', ':', ';'])
filtered = [[w for w in sentence if w not in punct] for sentence in tokens]
last = [sentence[-1] for sentence in filtered]
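
For this sonnet, last should now hold the final word of each line:

['power', 'hour', 'showst', 'growst', 'wrack', 'back', 'skill', 'kill', 'pleasure', 'treasure', 'be', 'thee']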

One of the corpora included in NLTK is the CMU pronouncing dictionary, which maps words to their pronunciations, so the next step is to look up the pronunciation of each last word:

[[(word, pron) for (word, pron) in cmudict.entries() if word == w] for w in last]

[[('power', ['P', 'AW1', 'ER0'])], 
[('hour', ['AW1', 'ER0']), 
('hour', ['AW1', 'R'])], 
[], [], 
[('wrack', ['R', 'AE1', 'K'])], 
[('back', ['B', 'AE1', 'K'])], 
[('skill', ['S', 'K', 'IH1', 'L'])], 
[('kill', ['K', 'IH1', 'L'])], 
[('pleasure', ['P', 'L', 'EH1', 'ZH', 'ER0'])], 
[('treasure', ['T', 'R', 'EH1', 'ZH', 'ER0'])], 
[('be', ['B', 'IY1']), ('be', ['B', 'IY0'])], 
[('thee', ['DH', 'IY1'])]]
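
As an aside, scanning all of cmudict.entries() once per word is slow; the corpus also exposes cmudict.dict(), a plain word-to-pronunciations mapping, so an equivalent (and much faster) lookup would be roughly:

pron_dict = cmudict.dict()
[[(w, pron) for pron in pron_dict.get(w, [])] for w in last]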

Unfortunately this falls apart for words the dictionary doesn’t recognize, like “showst” and “growst” above. In a real system, we might want a fallback, such as comparing the end of the unknown word to other words in a dictionary after reducing letters to a common set: removing silent letters (e.g. a trailing “e”), collapsing letter combinations to common sounds (“ie” -> “y”, “c” -> “k”), and so on. Obviously this is tightly tied to English, but it would help with made-up or rare words.
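
A minimal sketch of that kind of fallback, with arbitrary illustrative rules and a made-up spelling_rhymes helper:

def normalize(word):
  # crude, English-specific normalization: collapse some letter
  # combinations to common sounds, then drop a silent trailing 'e'
  word = word.lower()
  for old, new in [('ie', 'y'), ('ph', 'f'), ('ck', 'k'), ('c', 'k')]:
    word = word.replace(old, new)
  return word[:-1] if word.endswith('e') else word

def spelling_rhymes(w, vocabulary, tail=3):
  # treat words whose normalized spellings share a tail as candidates
  target = normalize(w)[-tail:]
  return [v for v in vocabulary if v != w and normalize(v).endswith(target)]

Calling spelling_rhymes('showst', [w for w, p in cmudict.entries()]) would then at least return something for words the pronouncing dictionary misses.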

We can retrieve the number of sounds in a similar fashion:

syllables = \
  [[(word, len(pron)) for (word, pron) in cmudict.entries() if word == w]
    for w in last]

[[('power', 3)], 
 [('hour', 2), 
  ('hour', 2)], 
  [], 
  [], 
  [('wrack', 3)], 
  [('back', 3)], 
  [('skill', 4)], 
  [('kill', 3)], 
  [('pleasure', 5)], 
  [('treasure', 5)], 
  [('be', 2), 
   ('be', 2)], 
   [('thee', 2)]]
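
As an aside, len(pron) counts phonemes rather than syllables; since the CMU vowel symbols carry a stress digit, a rough true syllable count would be:

[[(word, len([ph for ph in pron if ph[-1].isdigit()]))
  for (word, pron) in cmudict.entries() if word == w]
 for w in last]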

Combining the last two together, we can get the word, number of sounds, and phoneme list in one tuple. What the following code shows is that even though Python is simpler than Perl, it’s still easy to build data structures that are impossible to read.

syllables = \
  [[(word, len(pron), pron) for (word, pron) in cmudict.entries() if word == w]
    for w in last]

[[('power', 3, ['P', 'AW1', 'ER0'])], 
 [('hour', 2, ['AW1', 'ER0']), 
 ('hour', 2, ['AW1', 'R'])], 
  [], 
  [], 
 [('wrack', 3, ['R', 'AE1', 'K'])], 
 [('back', 3, ['B', 'AE1', 'K'])], 
 [('skill', 4, ['S', 'K', 'IH1', 'L'])], 
 [('kill', 3, ['K', 'IH1', 'L'])], 
 [('pleasure', 5, ['P', 'L', 'EH1', 'ZH', 'ER0'])], 
 [('treasure', 5, ['T', 'R', 'EH1', 'ZH', 'ER0'])], 
 [('be', 2, ['B', 'IY1']), ('be', 2, ['B', 'IY0'])], 
 [('thee', 2, ['DH', 'IY1'])]]

Now that we have this, we need a test for whether two words rhyme. To start, we drop the first sound and compare the remainder of the two pronunciations:

def rhymes(s):
  try:
    (w, l, p) = s[0]  # use the first pronunciation found
    try:
      return [wt for (wt, pt) in cmudict.entries()
              if l == len(pt) and p[1:] == pt[1:]]
    except:
      return [w]
  except:
    return []  # the word wasn't in the dictionary

[rhymes(s) for s in syllables]

This returns a lot of results – many of them are good, but many are also garbage.

[['bauer', 'baur', 'bougher', 'bower', 'cower',
 'dauer', 'dour', 'gauer', 'gower', 'hauer', 'hower',
 "how're", 'kauer', 'knauer', 'lauer', 'mauer',
 'nauer', 'power', 'rauer', 'sauer', 'schauer', 
'shower', 'sour', 'tauer', 'tower'], ['auer', 'ayer', 
'ayer', 'eyer', 'for', 'her', 'hour', 'iyer', 'our',
 'oyer', 'uher', 'were'], [], [], ['back', 'backe',
 'bak', 'bakke', 'cac', 'caq', 'dac', 'dack', 'dak',
 'fac', 'hack', 'hacke', 'haq', 'haque', 'jac', 'jack',
 'jacques', 'knack', 'lac', 'lack', 'lak', 'mac',
 'mack', 'macke', 'mak', 'nack', 'nacke', 'pac', 
'pack', 'pak', 'paque', 'ptak', 'rack', 'rak', 'sac',
 'sack', 'sak', 'schack', 'shack', 'shaq', 'tac', 
't_a_c', 'tack', 'tacke', 'tak', 'wack', 'whack', 
'wrack', 'yack', 'yak', 'zach', 'zack', 'zak'], 
['back', 'backe', 'bak', 'bakke', 'cac', 'caq', 
'dac', 'dack', 'dak', 'fac', 'hack', 'hacke', 'haq',
 'haque', 'jac', 'jack', 'jacques', 'knack', 'lac', 
'lack', 'lak', 'mac', 'mack', 'macke', 'mak', 'nack',
 'nacke', 'pac', 'pack', 'pak', 'paque', 'ptak', 'rack',
... truncated, since this goes on for a while...

One issue with the above list is that a lot of these “words” are names or garbage tokens, not words used in a literary sense. We can filter them down by joining to Wordnet, a corpus with some semantic information that is more focused on real words.

from nltk.corpus import wordnet

def rhymes(s):
  try:
    (w, l, p) = s[0]
    try:
      filtered = [wt for (wt, pt) in cmudict.entries()
                  if l == len(pt)
                  and p[1:] == pt[1:]
                  and len(wordnet.synsets(wt)) > 0]
      return filtered
    except:
      return [w]
  except:
    return []

This is better – we get far fewer results, although not all of these words could be used in context, and most of them are pretty obvious (if you need a rhyming word, you can always run through the alphabet in your head, which doesn’t require software).

[['bower', 'cower', 'dour', 'power', 'shower', 'sour', 'tower'], 
['hour', 'were'], [], [], ['back', 'dak', 'hack', 
'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak', 
'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack',
 'whack', 'wrack', 'yack', 'yak'], ['back', 'dak', 'hack',
 'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak',
 'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack',
 'whack', 'wrack', 'yack', 'yak'], ['skill'], ['bill', 
'chill', 'dill', 'fill', 'gill', 'hill', 'kill', 'lille',
 'mil', 'mill', 'nil', 'pill', 'rill', 'shill', 'sill',
 'thill', 'till', 'will', 'zill'], ['pleasure'], 
['treasure'], ['b', 'be', 'bee', 'c', 'd', 'de', 'dea', 
'fee', 'g', 'gee', 'ghee', 'he', 'ji', 'kea', 'key', 'ki',
 'knee', 'lea', 'lee', 'leigh', 'li', 'me', 'mi', 'ne', 
'nee', 'ni', 'p', 'pea', 'pee', 'qi', 'quay', 're', 'sea', 
'see', 'si', 't', 'te', 'tea', 'tee', 'ti', 'v', 'vi', 'wee', 
'xi', 'yi', 'z', 'zea', 'zee'], ['b', 'be', 'bee', 'c', 'd', 
'de', 'dea', 'fee', 'g', 'gee', 'ghee', 'he', 'ji', 'kea', 
'key', 'ki', 'knee', 'lea', 'lee', 'leigh', 'li', 'me', 'mi',
 'ne', 'nee', 'ni', 'p', 'pea', 'pee', 'qi', 'quay', 're',
 'sea', 'see', 'si', 't', 'te', 'tea', 'tee', 'ti', 'v', 
'vi', 'wee', 'xi', 'yi', 'z', 'zea', 'zee']]

The next version is a minor improvement: it compares the last two sounds, rather than everything after the first. Note that for very short words this removes all the random garbage – previously, every single-sound word matched every other one, since we were comparing zero-length lists.
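
To see why, compare the two slices on one-phoneme pronunciations:

['B'][1:] == ['T'][1:]    # [] == [] -> True: 'b' and 't' "rhymed"
['B'][-2:] == ['T'][-2:]  # ['B'] == ['T'] -> False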

def rhymes(s):
  try:
    (w, l, p) = s[0]
    try:
      filtered = [wt for (wt, pt) in cmudict.entries() 
                  if l == len(pt) 
                  and p[-2:] == pt[-2:] 
                  and len(nltk.corpus.wordnet.synsets(wt)) > 0]
      return filtered
    except:
      return [w]
  except:
    return []

[['bower', 'cower', 'dour', 'power', 'shower', 'sour', 'tower'], 
['hour'], [], [], ['back', 'dak', 'hack', 'jack', 'knack', 'lac',
 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 'sac', 
'sack', 'shack', 'tack', 'whack', 'wrack', 'yack', 'yak'], 
['back', 'dak', 'hack', 'jack', 'knack', 'lac', 'lack', 'mac',
 'mack', 'mak', 'pac', 'pack', 'rack', 'sac', 'sack', 'shack',
 'tack', 'whack', 'wrack', 'yack', 'yak'], ['brill', 'drill',
'frill', 'grill', 'grille', 'krill', 'quill', 'shrill', 'skill', 
'spill', 'still', 'swill', 'thrill', 'trill', 'twill'], ['bill', 
'chill', 'dill', 'fill', 'gill', 'hill', 'kill', 'lille', 'mil', 
'mill', 'nil', 'pill', 'rill', 'shill', 'sill', 'thill', 'till',
 'will', 'zill'], ['closure', 'crozier', 'pleasure', 'treasure'],
 ['closure', 'crozier', 'pleasure', 'treasure'], 
['b', 'be', 'bee'], []]

This returns a few words that are very similar to the original. To reduce this, we can check the “edit distance” – the number of letters you would have to change to turn one word into another. Unfortunately, on its own this takes away some good rhymes, so we also check whether the first letters differ. This allows “tower” and “power” to match (one edit apart, but different beginnings), while rejecting “old” and “olde”. (The reason you will still see some pairs like this in the results is that words are compared to the original – not to each other.)

def rhymes(s):
  try:
    (w, l, p) = s[0]
    try:
      filtered = [wt for (wt, pt) in cmudict.entries()
                  if l == len(pt)
                  and p[-2:] == pt[-2:]
                  and (nltk.edit_distance(w, wt) > 2
                       or not w[0:2] == wt[0:2])
                  and len(nltk.corpus.wordnet.synsets(wt)) > 0]
      return filtered
    except:
      return [w]
  except:
    return []

[['bower', 'cower', 'dour', 'shower', 'sour', 'tower'], 
[], [], [], ['back', 'dak', 'hack', 'jack', 'knack', 
'lac', 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 
'rack', 'sac', 'sack', 'shack', 'tack', 'whack', 'yack',
 'yak'], ['dak', 'hack', 'jack', 'knack', 'lac', 'lack',
 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 'sac', 
'sack', 'shack', 'tack', 'whack', 'wrack', 'yack', 
'yak'], ['brill', 'drill', 'frill', 'grill', 'grille', 
'krill', 'quill', 'shrill', 'spill', 'still', 'swill', 
'thrill', 'trill', 'twill'], ['bill', 'chill', 'dill', 
'fill', 'gill', 'hill', 'lille', 'mil', 'mill', 'nil', 
'pill', 'rill', 'shill', 'sill', 'thill', 'till', 'will',
 'zill'], ['closure', 'crozier', 'treasure'], ['closure', 
'crozier', 'pleasure'], ['b'], []]

From here, we can remove some words that don’t make sense in context by checking which parts of speech Wordnet has seen them in. We get the set of all possible part-of-speech tags for a word and compare it to the tags of a potentially rhyming word – if there is overlap, we keep it.

def cmp(p, wt):
  # keep wt if its Wordnet POS tags overlap p's, or either set is empty
  pt = set([syn.pos() for syn in wordnet.synsets(wt)])
  return len(pt & p) > 0 or len(p) == 0 or len(pt) == 0

def rhymes(s):
  try:
    (w, l, p) = s[0]
    try:
      pos = set([syn.pos() for syn in wordnet.synsets(w)])
      filtered = [wt for (wt, pt) in cmudict.entries()
                  if l == len(pt)
                  and p[-2:] == pt[-2:]
                  and (nltk.edit_distance(w, wt) > 2
                       or not w[0:2] == wt[0:2])
                  and cmp(pos, wt)
                  and len(nltk.corpus.wordnet.synsets(wt)) > 0]
      return filtered
    except:
      return [w]
  except:
    return []

Note this removes “dour” in the first part, where we expect to see a noun of some sort. The list isn’t too repetitive now, although it is getting quite short.

[['bower', 'cower', 'shower', 'sour', 'tower'], 
[], [], [], ['back', 'dak', 'hack', 'jack', 'knack', 
'lac', 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 
'rack', 'sac', 'sack', 'shack', 'tack', 'whack', 
'yack', 'yak'], ['dak', 'hack', 'jack', 'knack', 'lac',
 'lack', 'mac', 'mack', 'mak', 'pac', 'pack', 'rack', 
'sac', 'sack', 'shack', 'tack', 'whack', 'wrack', 'yack', 
'yak'], ['brill', 'drill', 'frill', 'grill', 'grille',
 'krill', 'quill', 'spill', 'still', 'swill', 'thrill',
 'trill', 'twill'], ['bill', 'chill', 'dill', 'fill', 
'gill', 'hill', 'lille', 'mil', 'mill', 'nil', 'pill', 
'rill', 'shill', 'sill', 'thill', 'till', 'will', 
'zill'], ['closure', 'crozier', 'treasure'],
['closure', 'crozier', 'pleasure'], ['b'], []]

One minor issue here is that “sour” is in the first set. Let’s try one last approach – instead of checking all possible roles a word could play, let’s check the role it actually plays, by running the text through a part of speech tagger. Finally, we can justify having kept the whole text of each line, even though we’ve been ignoring it thus far.

The NLTK part of speech tagger is much more detailed than what Wordnet has – this is because Wordnet acts as a dictionary of nouns, verbs, adjectives, and adverbs, but is less concerned with their grammatical roles, so we need to establish a mapping.

tagged = [nltk.pos_tag(t) for t in tokens]
[[('O', 'NNP'), ('thou', 'VBD'), (',', ','), 
('my', 'PRP$'), ('lovely', 'RB'), ('boy', 'JJ'), 
(',', ','), ('who', 'WP'), ('in', 'IN'), ('thy', 'JJ'), 
('power', 'NN')], [('Dost', 'NNP'), ('hold', 'VBD'), 
('Time', 'NNP'), ("'", 'POS'), ('s', 'NNS'), 
('fickle', 'VBP'), ('glass', 'NN'), (',', ','), 
('his', 'PRP$'), ('sickle', 'NN'), (',', ','), 
('hour', 'PRP$'), (';', ':')], [('Who', 'WP'), 
('hast', 'VBN'), ('by', 'IN'), ('waning', 'NN'),
 ('grown', 'IN'), (',', ','), ('and', 'CC'), 
('therein', 'VB'), ('showst', 'JJ')], [('Thy', 'NNP'),
 ('lovers', 'NNS'), ('withering', 'VBG'), ('as', 'IN'), 
('thy', 'JJ'), ('sweet', 'NN'), ('self', 'NN'), 
('growst', 'NN'), (';', ':')], [('If', 'IN'), 
('Nature', 'NNP'), (',', ','), ('sovereign', 'NN'), 
('mistress', 'NN'), ('over', 'IN'), ('wrack', 'NN'), 
(',', ',')], [('As', 'IN'), ('thou', 'PRP'), 
('goest', 'VBP'), ('onwards', 'NNS'), (',', ','),
... truncated ...

def replace(tag):
  # map Penn Treebank tags to Wordnet POS letters
  # ('s' is Wordnet's adjective-satellite tag)
  try:
    return {
      'CC': '',
      'CD': 's',
      'DT': '',
      'EX': 'r',
      'FW': 'n',
      'IN': '',
      'JJ': 's',
      'JJR': 's',
      'JJS': 's',
      'LS': 'n',
      'MD': 'v',
      'NN': 'n',
      'NNS': 'n',
      'NNP': 'n',
      'NNPS': 'n',
      'PDT': 'v',
      'POS': 'n',
      'PRP': 'n',
      'PRP$': '',
      'RB': 'r',
      'RBR': 'r',
      'RBS': 'r',
      'RP': '',
      'TO': '',
      'UH': '',
      'VB': 'v',
      'VBD': 'v',
      'VBG': 'v',
      'VBN': 'v',
      'VBP': 'v',
      'VBZ': 'v',
      'WDT': '',
      'WP': 'n',
      'WP$': 'v',
      'WRB': 'r'
    }[tag]
  except:
    return ''

new_tags = \
  [[(w, replace(t)) \
   for w, t in line if not replace(t) == ''] \
   for line in tagged]

last = [[(w, pos) for (w, pos) in line[-1:]] \
        for line in new_tags]

syllables = \
  [[[(word, len(phonemes), phonemes, pos) \
     for (word, phonemes) in cmudict.entries() if word == w] \
    for (w, pos) in line] \
   for line in last]

def cmp(pos, wt):
  pt = set([syn.pos() for syn in wordnet.synsets(wt)])
  return pos in pt or len(pos) == 0 or len(pt) == 0

def rhymes(s):
  try:
    (w, l, p, pos) = s[0][0]
    try:
      filtered = [wt for (wt, pt) in cmudict.entries()
                  if l == len(pt)
                  and p[-2:] == pt[-2:]
                  and (nltk.edit_distance(w, wt) > 2
                       or not w[0:2] == wt[0:2])
                  and cmp(pos, wt)
                  and len(nltk.corpus.wordnet.synsets(wt)) > 0]
      return filtered
    except:
      return [w]
  except:
    return []

Interestingly, this actually returns more than the prior technique. This is because NLTK can tag many more words than Wordnet covers, since Wordnet is aimed at specific semantic interests. The results keep properties of the previous list (“dour” is still out), but less has been removed, and overall the word diversity is quite high. Clearly, one could spend a lot of time even on this simple task.

[rhymes(r) for r in syllables]

[['bower', 'shower', 'sour', 'tower'], 
['aerial', 'amble', 'angel', 'angle', 'ankle', 
'anvil', 'april', 'arousal', 'arrival', 'axle', 
'babble', 'babel', 'baffle', 'bagel', 'barrel', 
'barrel', 'basel', 'basil', 'basle', 'battle',
 'bauble', 'beadle', 'beagle', 'beetle', 'beigel',
 'beryl', 'betel', 'bethel', 'bevel', 'bible', 
'bobble', 'boodle', 'bottle', 'bubble', 'buckle', 
'bushel', 'bustle', 'cable', 'cackle', 'camel', 
'carol', 'carol', 'carrel', 'carroll', 'carroll',
 'castle', 'cattle', 'channel', 'chapel', 'chattel', 
'chisel', 'choral', 'chuckle', 'chunnel', 'circle',
 'cobble', 'cockle', 'colonel', 'coral', 'couple',
 'cuddle', 'cudgel', 'cycle', 'cyril', 'dazzle',
 'dental', 'devil', 'dibble', 'diesel', 'diesel',
 'doodle', 'double', 'duffel', 'durrell', 'equal',
 'fable', 'facial', 'faisal', 'fennel', 'fiddle',
 'final', 'fipple', 'fizzle', 'foible', 'fossil', 
'fuel', 'funnel', 'gable', 'gaggle', 'gavel', 'geisel',
 'gesell', 'giggle', 'girdle', 'gobble', 'goral',
 'gurgle', 'hackle', 'haggle', 'hassel', 'hassle',
 'hatchel', 'havel', 'hazel', 'heckle', 'hegel', 'herbal',
 'herschel', 'hobble', 'hovel', 'hubble', 'huddle', 
'hurdle', 'hustle', 'jackal', 'jiggle', 'jostle',
 'journal', 'juggle', 'kennel', 'kernel', 'kettle',
 'kibble', 'knuckle', 'label', 'ladle', 'laurel', 
'level', 'libel', 'little', 'local', 'lovell',
 'mammal', 'maple', 'medal', 'memel', 'metal', 
'methyl', 'mettle', 'michael', 'mickle', 'middle',
 'missal', 'missile', 'mitchell', 'mobile', 'modal',
 'model', 'mogul', 'moral', 'mosul', 'motile', 
'muckle', 'muddle', 'muffle', 'muscle', 'mussel', 
'muzzle', 'myrtle', 'nasal', 'natal', 'navel', 'needle', 
'nestle', 'nettle', 'nibble', 'nickel', 'nipple', 'noble', 
'noodle', 'novel', 'nozzle', 'paddle', 'panel', 'pebble',
 'pedal', 'people', 'peril', 'petal', 'phenol', 'pickle', 
'piddle', 'pommel', 'poodle', 'pottle', 'puddle', 'purple',
 'puzzle', 'rabble', 'rachel', 'raffle', 'rattle', 'ravel',
'razzle', 'rebel', 'revel', 'riddle', 'riffle', 'rifle',
 'rigel', 'ripple', 'rival', 'roble', 'rommel', 'rouble',
 'rubble', 'rubel', 'ruble', 'ruddle', 'ruffle', 'russell',
 'rustle', 'sable', 'saddle', 'seckel', 'segal', 'seidel',
 'settle', 'shackle', 'shekel', 'shovel', 'shuffle',
 'shuttle', 'sibyl', 'social', 'sorrel', 'table', 
'tackle', 'tassel', 'tattle', 'tercel', 'thermal', 
'thistle', 'tickle', 'tipple', 'title', 'tittle',
 'toggle', 'tootle', 'total', 'trial', 'tunnel', 
'turtle', 'tussle', 'umbel', 'uncle', 'vassal', 'vergil', 
'vessel', 'vigil', 'vinyl', 'virgil', 'vocal', 'waddle',
 'waffle', 'wattle', 'weasel', 'weevil', 'whistle', 
'whittle', 'wiesel', 'wiggle', 'wobble', 'wrestle', 
'wriggle', 'yodel'], [], [], ['back', 'dak', 'hack',
 'jack', 'knack', 'lac', 'lack', 'mac', 'mack', 'mak',
 'pac', 'pack', 'rack', 'sac', 'sack', 'shack', 'tack',
 'whack', 'yack', 'yak'], [], ['brill', 'drill', 'frill',
 'grill', 'grille', 'krill', 'quill', 'spill', 'still',
 'swill', 'thrill', 'trill', 'twill'], ['bill', 'chill',
 'fill', 'hill', 'mill', 'shill', 'till', 'will'], 
['closure', 'crozier', 'treasure'],
 ['closure', 'crozier', 'pleasure'], [], []]

9 Replies to “Rhyming with NLP and Shakespeare”

  1. This is really interesting – I know some publishers and data suppliers include metadata in their electronic documents to say whether they rhyme. I believe this is typically done manually by editors, but using a similar approach would be a big time saver.

    1. My suspicion is that many places that provide tagged databases use a machine to make an educated guess, and correct the tags, because it’s more accurate than doing it alone (someone on Hacker News suggested Metaphone as an algorithm for this). I’ve thought about doing an example of this process by sending test data to Amazon Mechanical Turk for corrections, but haven’t come up with a good test case yet.

  2. Hi Gary,

    In PHP you can find words that sound the same using the soundex function. You can easily go through a dictionary and run the soundex function on every word and group the words with the same soundex together.

    I wonder whether Python has a similar function.

    1. Yeah, you certainly can do that. There is a similar algorithm called Metaphone that is a little more sophisticated than Soundex. I would suggest these only be used as a fallback – pronunciation dictionaries are based on surveys of how words are actually pronounced, whereas Soundex and Metaphone are educated guesses (and both may fall apart for regional and historical differences). There was a survey done of word pronunciations across the U.S. called TIMIT, which might also provide an interesting set of data, as it would show regional differences.
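
      A minimal sketch of the grouping idea – soundex here is a hand-rolled approximation that skips some edge cases of the full algorithm:

      from collections import defaultdict
      from nltk.corpus import cmudict

      CODES = {c: str(d) for d, letters in
               enumerate(['bfpv', 'cgjkqsxz', 'dt', 'l', 'mn', 'r'], 1)
               for c in letters}

      def soundex(word):
        word = word.lower()
        out, prev = word[0].upper(), CODES.get(word[0])
        for c in word[1:]:
          code = CODES.get(c)
          if code and code != prev:
            out += code
          if c not in 'hw':  # h/w don't separate repeated codes
            prev = code
        return (out + '000')[:4]

      # group the pronouncing dictionary's words by code
      groups = defaultdict(list)
      for word, pron in cmudict.entries():
        groups[soundex(word)].append(word)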

  3. Soundex and Metaphone both concentrate on the start of words, while rhymes depend on the end of words. Both are adequate for short words. Of the two, Metaphone is better because it encodes the first character, whereas Soundex leaves it as-is; so, for example, ‘ceiling’ and ‘sealing’ should be equal in Metaphone, but in Soundex they would start with C and S respectively. In both, ‘revealing’ would start with ‘R’ and would not match either of the earlier words, even though it rhymes with both of them.

    1. It’s not in a separate file – you should be able to use the code in the post. Typically I aim to put exactly what I did in the posts so they are reproducible.

  4. Just wanted to say thank you for an awesome article – I have found it incredibly useful. I am looking at sensible, if not intelligent, sentence construction with NLP, and I was wondering if you have ever attempted something like this, or if you could point to a useful resource. Anyway, thanks for your excellent NLTK posts.

    1. I haven’t, but one area of research you might find interesting is entailment, which is studying the relationships between verbs (you can get similar relationship dictionaries for nouns with Wordnet). It’s not exactly what you’re talking about, but it might get you somewhat closer depending what you’re trying to do.
