Finding template documents with RethinkDB and the Bing API in Python

Many businesses require new clients to fill out a form, either to pre-qualify them or to gather the information needed to quote a project. Given a list of industries, we can use search engine results to filter that list down to just the industries that require forms (you could use this to find industries to target for a form-filling application, for instance).

It would be relatively cheap to have this analysis done using Amazon Mechanical Turk, but this is a good project to learn how RethinkDB works.

RethinkDB is advertised as a real-time, document-oriented database. “Document-oriented” means that it stores semi-structured data, such as the JSON output from an API. Bing provides an API that is very easy to use but rate-limited, so we’ll build out a small data warehouse of API results that we can later explore to find content generation ideas.
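
For a sense of what that means here, a saved search result might end up looking something like the following (a hypothetical sketch: Title and Url come back from Bing, and the rest are metadata fields we will attach ourselves later in this post):

{
  "Title": "REQUEST FOR PROPOSAL ...",
  "Url": "http://example.com/rfp.pdf",
  "search_query": "caulking proposal",
  "api": "Web",
  "rank": 1,
  "src": "Bing",
  "retrieved": "10:52AM on March 04, 2014"
}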

RethinkDB is very simple to install:

yum -y install rethinkdb

rethinkdb --bind all

Once it’s installed, we need to create the tables we want to use (note that we don’t need to define a schema). Also note that constructing a query doesn’t execute it until you call “run”, which lets you mix query construction into the rest of your code.

import rethinkdb as r

# connect to the local instance and make this the default connection
r.connect("localhost", 28015).repl()

# create the database first; this errors harmlessly if it already exists
try:
  r.db_create("search").run()
except r.RqlRuntimeError as e:
  print e

db = r.db("search")

tables = ["search_results", "industries"]
for t in tables:
  try:
    db.table_create(t).run()
  except r.RqlRuntimeError as e:
    print e

To populate the industries list, we need a list of business types. This could come from the headings in the phone book, or from a business taxonomy like NAICS codes. NAICS codes have the nice property that they are grouped by industry type, so you can, for instance, pull a large list of only construction businesses.

I pulled this list from a couple of places, and found that I needed to clean it up (remove extra whitespace, duplicates, etc.).

The RethinkDB insert API supports batch inserts, which is convenient here. It apparently also supports upserts, although I couldn’t figure out how to do a batch upsert in Python.

src = r'd:/projects/bing/industries.txt'
with open(src, mode="r") as f:
    industries = f.readlines()

# drop blank lines, normalize whitespace and case, de-duplicate, sort
industries = filter(lambda x: x != '\n', industries)
industries = [x.strip().lower() for x in industries]
industries = sorted(set(industries))

data = [{'industry': value} for value in industries]

db = r.db("search")
table = db.table("industries")
query = table.insert(data)
print query.run()
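
For what it’s worth, a single-call upsert looks something like the sketch below; depending on the driver version it is either an upsert flag or a conflict argument (untested here). Note that an upsert only helps if your documents carry their primary key; since we let RethinkDB generate ids, duplicates would still be inserted:

query = table.insert(data, upsert=True)          # older drivers
# query = table.insert(data, conflict="update")  # newer drivers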

Now that we have this list, we can start querying the Bing API. You can get a key from the Azure Data portal for free with a Microsoft account; they let you run up to 5,000 queries a month, so you’ll likely want to pick just one industry for testing purposes.

Interestingly, the API uses Basic Auth for authentication, with an empty username and the key as the password.

def search(keyword):
  import urllib2
  import json

  keyBing = ''
  # Basic Auth with an empty username and the key as the password;
  # [:-1] strips the trailing newline that .encode('base64') appends
  credentialBing = 'Basic ' + (':%s' % keyBing).encode('base64')[:-1]
  searchString = '%27' + keyword.replace(' ', '+') + '%27'
  top = 20
  offset = 0

  # available sources: Web, Image, News, Video
  url = 'https://api.datamarket.azure.com/Bing/Search/Web?' + \
      'Query=%s&$top=%d&$skip=%d&$format=json' % \
      (searchString, top, offset)

  request = urllib2.Request(url)
  request.add_header('Authorization', credentialBing)
  requestOpener = urllib2.build_opener()
  response = requestOpener.open(request)

  return json.load(response)
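
Calling it is straightforward; the 'd'/'results' nesting below is how the Datamarket JSON comes back, as used in saveResults later (the keyword here is just an example):

results = search('caulking proposal')
print len(results['d']['results'])         # up to 20, per $top
print results['d']['results'][0]['Title']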

We also need to define a function that lets us save the API results. When writing an ETL process, it’s good practice to store all the data exactly as it was given to you, so that if you misunderstand something about the original interface you can backtrack easily later. It’s also good practice to record information about where you got the data: the time, the search query, and the search engine. If you were scaling this up, it might even be good to record the API key, as this would let you re-generate the records you care about later.

def saveResults(keyword, results):
  import datetime
  currentTime = datetime.datetime.now().strftime("%I:%M%p on %B %d, %Y")

  db = r.db("search")
  table = db.table("search_results")

  # save each result as-is, plus metadata about where it came from
  for index, row in enumerate(results['d']['results'], start=1):
    record = row
    record['search_query'] = keyword
    record['api'] = 'Web'
    record['rank'] = index
    record['src'] = 'Bing'
    record['retrieved'] = currentTime

    table.insert(record).run()

Before we start populating this into RethinkDB, we need to write a query that checks to see if we’ve run this search already, to prevent overrunning our API limits.

One of the neat things about RethinkDB is that we can write WHERE-style clauses as lambdas, which are tokenized into the lower-level protocol language and then run on the database. One thing that takes some getting used to is that field names are case-sensitive, unlike column names in most SQL databases.

def check(keyword):
  db = r.db('search')
  table = db.table('search_results')
  # count how many saved results came from this exact query
  rows = table.filter(lambda x: x['search_query'] == keyword)
  return rows.count().run()

Once we’ve done all this, we can bind everything together. I got a few errors from the Bing API (roughly two per 800 calls), so I just catch them and continue, but for a more robust system you’d want real error handling.

def executeQuery(keyword):
  if check(keyword) == 0:
    try:
      results = search(keyword)
      saveResults(keyword, results)
    except Exception as e:
      print e

All we know now is what results Bing returns for a given query, which doesn’t answer the original question of whether a given business requires new client forms.

Buried in this data are several interesting scenarios: some top-ranked matches are sites doing SEO, others are perfect matches with spelling differences between our query and the results, or use terms synonymous with our query that are more standard for the industry in question.

As one approach, we can update all the documents in the database with the Levenshtein edit distance between the search query and the document title. This takes a little while; RethinkDB does have secondary indexes, which I haven’t investigated yet.

Note that because “not” is a reserved word in Python, the API uses “not_”. Another nice API feature is that query results are cursors, so we can operate on each document in the stream individually. Doing updates on these is simple, because each document RethinkDB returns includes its internal ID.

import Levenshtein

table = r.db("search").table("search_results")

def updateDoc(table, doc):
  # Levenshtein.ratio returns a score between 0 and 1 for the two strings
  doc['Title_Query_Distance'] = \
    Levenshtein.ratio(doc['search_query'], doc['Title'])
  table.update(doc).run()

# this query takes about ten seconds on a 31k row table, with no indexes
toUpdate = \
  table.filter(r.row.has_fields("Title_Query_Distance").not_()).run()

for doc in toUpdate:
  updateDoc(table, doc)
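
As a sketch of what indexing could buy us (untested against this dataset), a secondary index on rank would let the final query below use get_all instead of scanning the whole table:

# one-time setup: index the rank field and wait for it to build
table.index_create("rank").run()
table.index_wait("rank").run()

# equivalent to filter({'rank': 1}), but served from the index
top_ranked = table.get_all(1, index="rank")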

Once we’ve done all this, we can filter on the edit distance and search result rank, to find some good matches:

list(table.filter({'rank': 1})
          .filter(lambda x: x['Title_Query_Distance'] < .05)
          .pluck("Title", "search_query", "Url")
          .limit(10)
          .run())

The following are the results of this query; you can see these look pretty good. Getting all of the good results will require some manual intervention, but the manual work is now on a dataset that resembles what we really want in the end.

[{u'Title': u'REQUEST FOR PROPOSALS GYM FLOOR SANDING AND REFINISHING ...',
  u'Url': u'http://www.ri.net/middletown/rfp/RFP_MHS_GymFloorSanding.pdf',
  u'search_query': u'floor resurfacing proposal'},
 {u'Title': u'REQUEST FOR PROPOSAL (DESIGN BUILD FEE) CONSTRUCTION OF ...',
  u'Url': u'http://www.harborhousefl.com/wp-content/uploads/2012/05/Harbor-House-RFP-Final_Perimeter-Wall-2012.pdf',
  u'search_query': u'gate installation proposal'},
 {u'Title': u'REQUEST FOR PROPOSAL COUNTYWIDE RECORDS MANAGEMENT AND ...',
  u'Url': u'http://www.state.nj.us/counties/mercer/pdfs/exempt_rfpss200801.pdf',
  u'search_query': u'storage consultants proposal'},
 {u'Title': u'REQUEST FOR PROPOSAL #6645 SEALANT & CAULKING SERVICES FOR ...',
  u'Url': u'http://www.ewashtenaw.org/government/departments/finance/purchasing/online_bids/bid-status-previous-years/previous-year-assets/assets_2011/rfp6645.pdf',
  u'search_query': u'caulking proposal'}]