Thought Experiments - Gary Sieling

If you’re comparing IBM Watson (or Alchemy API) and Open Calais, you’re most likely interested in entity extraction: the natural language processing technique that tags references to things like people, companies, and places in text. One of the notable features of any entity extraction system is the ability to identify previously unseen items of a class (e.g. if you’re looking for the make and model of vehicles mentioned in a document, it should also pick up models that are novel to the algorithm). Separately, the number of different classes of entities recognized may be important, as it takes quite a bit of human labor to train these tools.
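To make "novel to the algorithm" concrete, here is a minimal sketch that splits extracted entity strings into ones already on a known list and ones that are new. The function name, the vehicle list, and the sample values are all made up for illustration; neither API works this way internally, this is just a way to eyeball how many unseen values an extractor surfaces.

```python
def split_known_and_novel(extracted, known_values):
    """Partition extracted entity strings into (known, novel).

    Matching is case-insensitive; original casing and order are preserved.
    """
    known_lookup = {value.casefold() for value in known_values}
    known, novel = [], []
    for text in extracted:
        if text.casefold() in known_lookup:
            known.append(text)
        else:
            novel.append(text)
    return known, novel


# Hypothetical data: a small list of vehicles we already know about,
# and the values an extractor might have tagged in a document.
known_vehicles = ["Ford F-150", "Toyota Camry"]
extracted = ["Toyota Camry", "Rivian R1T", "ford f-150"]

known, novel = split_known_and_novel(extracted, known_vehicles)
print("known:", known)
print("novel:", novel)
```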

I ran a few thousand files (transcripts, talk descriptions, and author bios; ~15k requests) through each, for a search engine that finds good audio lectures.

The IBM Watson line spans many different products and features: there are additional APIs for relationship extraction, microformat tagging, extracting text from HTML, speech to text, and some image processing utilities I haven’t explored. It appears that IBM is trying to make the “Watson” line the AWS of artificial intelligence tools. This includes some features like sentiment and emotion analysis, which seem clearly aimed at marketers (e.g. finding sentiment of tweets). Watson apparently identifies hundreds of different types of entities (cities, different forms of times, health conditions, etc.). Watson also offers a tool to build your own models, although this costs approximately $4k per month (and scales with team size) plus another $3.5k per month to host the model, so this is clearly out of startup range.

Open Calais, by contrast, is a very focused API, aiming at just entity and relationship extraction. This reflects the internal uses the product has within Thomson Reuters: tagging legal and financial documents for companies, law firms, and individuals referenced in text. In my opinion, it is unlikely that any company has spent as much time training entity extraction systems as Thomson Reuters, as this has been used against many of their existing database properties to generate new product lines. I don’t know how much Open Calais is used directly for internal products, as opposed to internal APIs. In my testing, though, I found that Open Calais typically returned double the number of entities Watson did. It is also a bit slower, but returns detailed context information about each entity.
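If you want to check the "double the entities" observation on your own documents, a rough sketch like the following compares the entity names each service returned for the same text. It assumes you have already pulled the names out of each response into plain lists of strings; the function name and the sample values are hypothetical, not results from my runs.

```python
def compare_entity_sets(calais_entities, watson_entities):
    """Summarize how two lists of entity names overlap (case-insensitive)."""
    calais = {name.casefold() for name in calais_entities}
    watson = {name.casefold() for name in watson_entities}
    return {
        "calais_count": len(calais),
        "watson_count": len(watson),
        "shared": sorted(calais & watson),
        "calais_only": sorted(calais - watson),
        "watson_only": sorted(watson - calais),
        # Ratio is undefined when Watson returned nothing.
        "ratio": len(calais) / len(watson) if watson else None,
    }


# Made-up example values for a single document.
calais_names = ["Microsoft", "Bill Gates", "Seattle", "Windows"]
watson_names = ["Microsoft", "Seattle"]

summary = compare_entity_sets(calais_names, watson_names)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Averaging the ratio over a few hundred documents gives a better picture than any single file, since entity counts vary a lot with document length.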

In cases where something can be detected as a known quantity (e.g. “Microsoft”), Open Calais returns identifiers from Thomson Reuters’ own internal system, whereas Alchemy API returns identifiers into datasets that normal people can actually make use of (e.g. http://wiki.dbpedia.org/).

Which API you’d want really depends on what you’re building. Both products have generous free plans and communicate through JSON payloads, so it’s pretty easy to try out both. I was able to port my application from Watson to Open Calais in about an hour, for instance. While the marketing sites for both products are a bit of a pain, the IBM site (“BlueMix”) is particularly terrible. When I tried to upgrade to a paid account, the site refused to believe I’d read the agreements and required support to get around it – the whole experience is like this, so budget extra time for wrestling with the site.
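Porting between the two is mostly a matter of mapping each JSON payload onto one internal shape, so the rest of the application never sees which service produced an entity. The sketch below shows that idea. The field names (`entities`, `text`, `type`, `disambiguated`, `dbpedia` on the Watson/Alchemy side; `_typeGroup`, `_type`, `name` on the Open Calais side) are my assumptions about simplified response shapes, and the sample payloads and the Calais identifier URL are invented, so check them against real responses before relying on this.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    """Service-independent view of one extracted entity."""
    text: str
    kind: str
    link: Optional[str] = None  # DBpedia URL or service identifier, if any


def from_watson(payload: dict) -> List[Entity]:
    """Assumed shape: {"entities": [{"text", "type", "disambiguated": {...}}]}."""
    entities = []
    for item in payload.get("entities", []):
        link = item.get("disambiguated", {}).get("dbpedia")
        entities.append(Entity(item["text"], item["type"], link))
    return entities


def from_calais(payload: dict) -> List[Entity]:
    """Assumed shape: a dict keyed by identifier, entities marked by _typeGroup."""
    entities = []
    for identifier, item in payload.items():
        # Skip metadata and non-entity records (relations, topics, etc.).
        if isinstance(item, dict) and item.get("_typeGroup") == "entities":
            entities.append(Entity(item["name"], item["_type"], identifier))
    return entities


# Invented sample payloads, trimmed to the fields used above.
watson_payload = {
    "entities": [
        {
            "text": "Microsoft",
            "type": "Company",
            "disambiguated": {"dbpedia": "http://dbpedia.org/resource/Microsoft"},
        }
    ]
}
calais_payload = {
    "doc": {"meta": {"language": "English"}},
    "http://example.com/calais/id/1": {
        "_typeGroup": "entities",
        "_type": "Company",
        "name": "Microsoft",
    },
}

print(from_watson(watson_payload))
print(from_calais(calais_payload))
```

With both adapters returning the same `Entity` type, switching services comes down to changing which function you call after the HTTP request.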


IBM Watson vs Open Calais