Gary Sieling

Experiments with Named Entity Recognition APIs

I’ve been exploring APIs for Named Entity Recognition (and other language processing / AI techniques) as part of a project to discover university lectures and historical speeches.

Named Entity Recognition is a collection of techniques used to label and classify “entities” mentioned in a piece of text – e.g. to list countries mentioned in a speaker’s bio, makes and models of cars mentioned in accident reports, and so on. The automated labeling process typically tags parts of speech, then trains a machine learning algorithm to recognize the desired classes of values from manually tagged texts. Entity recognition systems may also return a unique identifier (useful for companies, people, etc), and thus must attempt to disambiguate two entities with the same name, and unify references to the same entity under different names (e.g. Microsoft, MSFT).
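
In its most basic form, the unification step described above is a lookup from surface forms to one identifier. The sketch below is a toy illustration only, not how any of the systems discussed here work internally; the alias table and the `org:microsoft` identifier are made up for the example.

```python
# Hypothetical alias table: every key is a lowercased surface form,
# every value is a made-up canonical identifier.
ALIASES = {
    "microsoft": "org:microsoft",
    "microsoft corporation": "org:microsoft",
    "msft": "org:microsoft",
}


def canonical_id(mention):
    """Map a mention to its canonical identifier, or None if unknown."""
    key = " ".join(mention.lower().split())  # normalize case and whitespace
    return ALIASES.get(key)


def unify(mentions):
    """Group mentions by canonical identifier; unknown mentions are dropped."""
    groups = {}
    for mention in mentions:
        entity_id = canonical_id(mention)
        if entity_id is not None:
            groups.setdefault(entity_id, []).append(mention)
    return groups


print(unify(["Microsoft", "MSFT", "Contoso"]))
```

A table like this covers unification but not disambiguation: two different entities that share a name can only be told apart from context.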

Once trained, the algorithm is expected to recognize values it has never seen before – new people, companies, countries, and so on.

There are several commercial and open-source systems that implement portions of this functionality. Training such a system requires a large dataset, so I suspect that as time progresses, the commercial offerings will greatly outstrip the free ones.

The Stanford Named Entity Recognizer is promoted as being good at tagging people, organizations, and locations. I’ve investigated two commercial systems in detail, AlchemyAPI (IBM) and Open Calais (Thomson Reuters); both have metered pricing.

AlchemyAPI

AlchemyAPI advertises recognition of several hundred entity types, which is likely the broadest coverage of any of these systems. In my experience, some of these types produce a lot of false positives, but I imagine this will improve with time. AlchemyAPI also lets you upload your own data and train new entity types, although this is quite expensive. One nice feature is that when it can disambiguate an entity, it links to existing open data sets (DBpedia and Freebase in the example below).

{
  "type": "Facility",
  "relevance": "0.817492",
  "count": "12",
  "text": "Alban Berg Quartet"
},
{
  "type": "Organization",
  "relevance": "0.350804",
  "count": "1",
  "text": "Cavatina Chamber Music Trust"
},
{
  "type": "Person",
  "relevance": "0.344306",
  "count": "2",
  "text": "Emma Parker",
  "disambiguated": {
    "name": "Emma Parker",
    "dbpedia": "http://dbpedia.org/resource/Emma_Parker",
    "freebase": "http://rdf.freebase.com/ns/m.0dln2wr"
  }
},
{
  "type": "JobTitle",
  "relevance": "0.326067",
  "count": "4",
  "text": "Ernest Bloch Lecturer"
}
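
A response like the one above usually needs some filtering before it is useful. The sketch below is a minimal example, assuming the entities have been collected into a list shaped like the excerpt; the function name `confident_entities` and the 0.34 cutoff are illustrative choices, not part of AlchemyAPI. Note that `relevance` and `count` arrive as strings in the excerpt, so they are converted before comparison.

```python
import json

# Entities copied from the excerpt above, wrapped in a list for parsing.
ENTITIES = json.loads("""
[
  {"type": "Facility", "relevance": "0.817492", "count": "12",
   "text": "Alban Berg Quartet"},
  {"type": "Organization", "relevance": "0.350804", "count": "1",
   "text": "Cavatina Chamber Music Trust"},
  {"type": "Person", "relevance": "0.344306", "count": "2",
   "text": "Emma Parker",
   "disambiguated": {
     "name": "Emma Parker",
     "dbpedia": "http://dbpedia.org/resource/Emma_Parker",
     "freebase": "http://rdf.freebase.com/ns/m.0dln2wr"}},
  {"type": "JobTitle", "relevance": "0.326067", "count": "4",
   "text": "Ernest Bloch Lecturer"}
]
""")


def confident_entities(entities, min_relevance=0.34):
    """Keep entities at or above a relevance cutoff (the cutoff is an assumption)."""
    kept = []
    for entity in entities:
        relevance = float(entity["relevance"])  # sent as a string
        if relevance < min_relevance:
            continue
        links = entity.get("disambiguated", {})
        kept.append({
            "type": entity["type"],
            "text": entity["text"],
            "relevance": relevance,
            "count": int(entity["count"]),
            # Present only when the service disambiguated the entity.
            "dbpedia": links.get("dbpedia"),
        })
    return kept


for entity in confident_entities(ENTITIES):
    print(entity["type"], entity["text"], entity["dbpedia"])
```

A relevance cutoff does nothing about a wrong type label: the highest-scoring entity in this sample is a string quartet tagged as a Facility.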

One thing that surprises me about entity recognition systems is that more of them don’t use fixed lists of values to check against. For instance, there are hundreds of “former countries and territorial entities”, but few of these were identified by AlchemyAPI in my testing.
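
Checking text against a fixed list is cheap to do yourself alongside an API. Below is a rough sketch of such a lookup; the five names are a small illustrative sample rather than a real list of former countries, and the function names are hypothetical.

```python
import re

# Illustrative sample only; a real list would come from a maintained source.
FORMER_COUNTRIES = [
    "Austria-Hungary",
    "Ottoman Empire",
    "Prussia",
    "Soviet Union",
    "Yugoslavia",
]


def build_matcher(names):
    """Compile one case-insensitive pattern that matches any listed name."""
    # Longest names first, so a longer name wins over a shorter one it contains.
    ordered = sorted(names, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_listed_names(text, names):
    """Return listed names found in text, in order of first appearance."""
    canonical = {name.lower(): name for name in names}
    found = []
    for match in build_matcher(names).finditer(text):
        name = canonical[match.group(0).lower()]
        if name not in found:
            found.append(name)
    return found


text = "She compared Prussia with the ottoman empire, then Prussia again."
print(find_listed_names(text, FORMER_COUNTRIES))
```

A plain list match has no notion of context, so it works best as a supplement to the trained tagger rather than a replacement for it.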

Open Calais

Open Calais is run by Thomson Reuters. I found that it typically returns about twice as many entities as AlchemyAPI, but in fewer categories. It is a more REST-oriented API, in that it returns URLs for the items it finds, so you can use these as IDs. It also returns context clues, so at least you know where it found something (I imagine this might be useful in a search engine):

{
  "_typeGroup": "entities",
  "_type": "Position",
  "forenduserdisplay": "false",
  "name": "Professor of the History of Christianity and Leverhulme Major Research Fellow",
  "_typeReference": "http://s.opencalais.com/1/type/em/e/Position",
  "instances": [
    {
      "detection": "[ Professor in the History of Religion. He is also ]Professor of the History of Christianity and Leverhulme Major Research Fellow[ at Durham University.  \nHis first series of]",
      "prefix": " Professor in the History of Religion. He is also ",
      "exact": "Professor of the History of Christianity and Leverhulme Major Research Fellow",
      "suffix": " at Durham University.  \nHis first series of",
      "offset": 80,
      "length": 77
    }
  ],
  "relevance": 0.2
}

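The context fields are enough to build a search-result style snippet. The sketch below uses only `prefix`, `exact`, and `suffix` from the instance shown above; I have not verified how `offset` is counted against the submitted document, so it is carried along but not used to slice any text. The `snippet` function and the `*` marker are illustrative choices.

```python
# Fields copied from the instance shown above.
INSTANCE = {
    "prefix": " Professor in the History of Religion. He is also ",
    "exact": "Professor of the History of Christianity and "
             "Leverhulme Major Research Fellow",
    "suffix": " at Durham University.  \nHis first series of",
    "offset": 80,
    "length": 77,
}


def snippet(instance, marker="*"):
    """Join prefix, match, and suffix into one line with the match marked."""
    raw = (instance["prefix"]
           + marker + instance["exact"] + marker
           + instance["suffix"])
    return " ".join(raw.split())  # collapse newlines and repeated spaces


# The excerpt is self-consistent: "length" is the length of "exact".
assert len(INSTANCE["exact"]) == INSTANCE["length"]

print(snippet(INSTANCE))
```
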
On the other hand, Open Calais placed talks from Gresham College in Gresham, Oregon, although the college is in London, which illustrates the peril of relying too heavily on these types of systems.
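
One cheap guard against this kind of mistake is to distrust a place that only ever appears inside a longer organization name. The sketch below is a hypothetical heuristic, not something either API provides; the simplified `type` and `name` fields are assumptions and do not match the Open Calais response format.

```python
import re


def suspicious_locations(text, entities):
    """Flag cities whose name occurs in text only inside an organization name."""
    orgs = [e["name"] for e in entities if e["type"] == "Organization"]
    flagged = []
    for entity in entities:
        if entity["type"] != "City":
            continue
        place = entity["name"].split(",")[0].strip()  # "Gresham, Oregon" -> "Gresham"
        covering = [org for org in orgs if place in org and org != place]
        if not covering:
            continue
        # Blank out the organization mentions, then look for the bare place name.
        remainder = text
        for org in covering:
            remainder = remainder.replace(org, " ")
        if not re.search(r"\b" + re.escape(place) + r"\b", remainder):
            flagged.append(entity["name"])
    return flagged


entities = [
    {"type": "Organization", "name": "Gresham College"},
    {"type": "City", "name": "Gresham, Oregon"},
]
print(suspicious_locations("A lecture given at Gresham College.", entities))
```

This only flags results for review; it cannot say where the organization actually is.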

Problem Areas

Finally, if you want to use these systems, it’s important to know where they break down.

Through this project I’ve made a few tens of thousands of entity extraction calls, so if you have questions feel free to ask below.
