RethinkDB: Compute Levenshtein Edit Distance in python

RethinkDB does not have a built-in distance metric like Levenshtein edit distance, but it is fairly easy to update rows in place. I found some examples of people attempting to build regular expressions to match specific Levenshtein edit distances, but this does not seem like a high performance approach, especially for finding things more than one character apart.

To start, you must first install the Levenstein library for Python:

pip install python-levenshtein

Having done that, you can load your database:

import Levenshtein

import rethinkdb as r
r.connect("localhost", 28015).repl()

db = r.db("search")
table = db.table("industries")

Once you have the table, it’s relatively easy to get a single test row, modify it, and upload it to the database. Remember to add .run() to the end of the query. Once you do this, you can filter to rows that contain the column you’re interested in (in this case, you should get one row back from the last query). The reason that you can update a document you downloaded previously is that RethinkDB automatically pulls down the document’s ID.

doc = [x for x in table.limit(1).run()]

doc['Title_Query_Distance'] = \
  Levenshtein.ratio(doc['search_query'], doc['Title'])
doc['Title_Description_Distance'] = \
  Levenshtein.ratio(doc['search_query'], doc['Description'])

table.update(doc).run()

[x for x in table.filter('Title_Query_Distance').run()]

Having done all this, you can run this query for each record in the database (if you’re splitting the work up, you may wish to do this only on rows with the value not populated):

def updateDoc(table, doc):
  doc['Title_Query_Distance'] = \
    Levenshtein.ratio(doc['search_query'], doc['Title'])
  doc['Title_Description_Distance'] = \
    Levenshtein.ratio(doc['search_query'], doc['Description'])
  table.update(doc).run()

# this query takes about ten seconds on a 31k row table, with no indexes
toUpdate = \
  [doc for doc in \
    table.filter(r.row.has_fields("Title_Query_Distance").not_()).run()]

# this query can take hours (30k row table)
[updateDoc(table, doc) for doc in toUpdate]

Once we’ve done this, it’s a good practice to do some diagnostic queries, to verify everything updated as expected (these two should return matching counts)

table.filter(lambda x: x['Title_Query_Distance']).count().run()
table.count().run()

Leave a Reply

Your email address will not be published. Required fields are marked *