{"id":1002,"date":"2013-05-07T00:40:21","date_gmt":"2013-05-07T00:40:21","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=1002"},"modified":"2013-05-07T00:40:21","modified_gmt":"2013-05-07T00:40:21","slug":"entity-recognition-with-scala-and-stanford-nlp-named-entity-recognizer","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/entity-recognition-with-scala-and-stanford-nlp-named-entity-recognizer\/","title":{"rendered":"Entity recognition with Scala and Stanford NLP Named Entity Recognizer"},"content":{"rendered":"<p>The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it&#8217;s fairly good at finding nouns, but not always at identifying the type of each noun.<\/p>\n<p>In this example, the entities I&#8217;d like to see are different &#8211; companies, law firms, lawyers, etc, but this test is good enough. The default examples provided let you choose different sets of things that can be recognized: {Location, Person, Organization}, {Location, Person, Organization, Misc}, and {Time, Location, Organization, Person, Money, Percent, Date}. The process of extracting PDF data and processing it takes about five seconds.<\/p>\n<p>For this text, selecting different options sometimes led to the classifier picking different options for a noun &#8211; one time it&#8217;s a person, another time it&#8217;s an organization, etc. One improvement might be to run several classifiers and to allow them to vote. This classifier also loses words sometimes &#8211; if a subject is listed with a first, middle, and last name, it sometimes picks just two words. 
I&#8217;ve noticed similar issues with company names.<\/p>\n<pre lang=\"Java\">\nimport org.apache.tika.parser.pdf._\nimport org.apache.tika.metadata._\nimport org.apache.tika.parser._\nimport java.io._\nimport org.xml.sax._\nimport edu.stanford.nlp.ie.crf.CRFClassifier\nimport edu.stanford.nlp.ling.CoreAnnotations\n\nobject pdfHandler extends ContentHandler {\n  val contents: StringBuffer = new StringBuffer()\n\n  def characters(ch: Array[Char], start: Int, length: Int) {\n    \/\/ Append only the slice the parser reports; the array may be a larger, reused buffer\n    contents.append(ch, start, length)\n  }\n\n  def endDocument() {\n  }\n\n  def endElement(uri: String, localName: String, qName: String) {\n  }\n\n  def endPrefixMapping(prefix: String) {\n  }\n\n  def ignorableWhitespace(ch: Array[Char], start: Int, length: Int) {\n  }\n\n  def processingInstruction(target: String, data: String) {\n  }\n\n  def setDocumentLocator(locator: Locator) {\n  }\n\n  def skippedEntity(name: String) {\n  }\n\n  def startDocument() {\n  }\n\n  def startElement(uri: String, localName: String, qName: String, atts: Attributes) {\n  }\n\n  def startPrefixMapping(prefix: String, uri: String) {\n  }\n}\n\nobject pdf extends App {\n  val file = \"\"\"e:\\data\\11-1285_i4dk.pdf\"\"\"\n\n  val pdf: PDFParser = new PDFParser()\n\n  val stream: InputStream = new FileInputStream(file)\n  val handler: ContentHandler = pdfHandler\n  val metadata: Metadata = new Metadata()\n  val context: ParseContext = new ParseContext()\n\n  pdf.parse(stream,\n    handler,\n    metadata,\n    context)\n\n  stream.close()\n\n  val contents: String = pdfHandler.contents.toString()\n  println(contents)\n\n  val src = \"stanford-ner-2013-04-04\/classifiers\/\"\n  val classifier1 = \"english.all.3class.distsim.crf.ser.gz\"\n  val classifier2 = \"english.conll.4class.distsim.crf.ser.gz\"\n  val classifier3 = \"english.muc.7class.distsim.crf.ser.gz\"\n\n  val serializedClassifier = src + classifier1\n\n  val classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier)\n  val out = 
classifier.classify(contents)\n\n  var words = 0\n  for (i <- 0 to out.size() - 1) {\n    val sentence = out.get(i)\n\n    var foundWord = \"\"\n    var oldWordClass = \"\"\n\n    for (j <- 0 to sentence.size() - 1) {\n      val word = sentence.get(j)\n      val wordClass = word.get(classOf[CoreAnnotations.AnswerAnnotation]) + \"\"\n\n      if (!oldWordClass.equals(wordClass)) {\n        if (!oldWordClass.equals(\"O\") &#038;&#038; !oldWordClass.equals(\"\")) {\n          print(\"[\/\" + oldWordClass + \"]\")\n        }\n      }\n\n      if (!wordClass.equals(\"O\") &#038;&#038; !wordClass.equals(\"\")) {\n        if (!oldWordClass.equals(wordClass)) {\n          print(\"[\" + wordClass + \"]\")\n        }\n      }\n\n      oldWordClass = wordClass\n\n      words = words + 1\n      print(word);\n      print(\" \");\n\n      if (words > 10) {\n        words = 0\n        println(\" \")\n      }\n    }\n  }\n}\n<\/pre>\n<pre>\n11-1285 [ORGANIZATION]US Airways , Inc. [\/ORGANIZATION]v.  \n[PERSON]McCutchen [\/PERSON]-LRB- 4\\\/16\\\/13 -RRB- 1 -LRB-  \nSlip Opinion -RRB- OCTOBER TERM ,  \n2012 Syllabus NOTE : Where it  \nis feasible , a syllabus -LRB-  \nheadnote -RRB- will be released ,  \nas isbeing done in connection with  \nthis case , at the time  \nthe opinion is issued . The  \nsyllabus constitutes no part of the  \nopinion of the Court but has  \nbeenprepared by the Reporter of Decisions  \nfor the convenience of the reader  \n. See [LOCATION]United States [\/LOCATION]v. [ORGANIZATION]Detroit  \nTimber & Lumber Co. [\/ORGANIZATION], 200  \nU. S. 321 , 337 .  \nSUPREME COURT OF THE [ORGANIZATION]UNITED STATES  \nSyllabus US AIRWAYS [\/ORGANIZATION], INC. ,  \nIN ITS CAPACITY AS FIDUCIARY AND  \nPLAN ADMINISTRATOR OF THE [LOCATION]US [\/LOCATION]AIRWAYS  \n, INC. . EMPLOYEE BENEFITS PLAN  \nv. [PERSON]MCCUTCHEN [\/PERSON]ET AL. . CERTIORARI  \nTO THE [ORGANIZATION]UNITED STATES [\/ORGANIZATION]COURT OF  \nAPPEALS FOR THE THIRD CIRCUIT No.  \n11 -- 1285 . 
Argued November  \n27 , 2012 -- Decided April  \n16 , 2013 The health benefits  \nplan established by petitioner [ORGANIZATION]US Airways  \n[\/ORGANIZATION]paid $ 66,866 in medical expenses  \nfor injuries suffered by respondentMcCutchen ,  \na [ORGANIZATION]US Airways [\/ORGANIZATION]employee , in  \na car accident caused by athird  \nparty . The plan entitled [ORGANIZATION]US  \nAirways [\/ORGANIZATION]to reimbursement if \n[PERSON]McCutchen [\/PERSON]\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it&#8217;s fairly good at finding nouns, but not always at identifying the type of each noun. In this example, the entities I&#8217;d like to &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/entity-recognition-with-scala-and-stanford-nlp-named-entity-recognizer\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Entity recognition with Scala and Stanford NLP Named Entity 
Recognizer&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[4,5,6],"tags":[300,378,385,417,480,495,530],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1002"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=1002"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1002\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=1002"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=1002"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=1002"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}