Extracting Dates and Times from Text with Stanford NLP and Scala

Stanford NLP is a library for processing natural-language text, which can tokenize, parse, and annotate it. Applications that operate on text typically first split it into sentences and words, then tag each word with its part of speech, using a combination of heuristics and statistical models. Later operations on the text build upon these earlier annotations using the same kinds of techniques (heuristics and statistical algorithms over previous results), which naturally leads to a pipeline model.

Here, for instance, we see two techniques for constructing a pipeline: one based on configuration, and one manual. Since this example is going to extract dates and times from text, we add the TimeAnnotator class to the end of the pipeline:

import java.util.Properties

import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.time.TimeAnnotator

object Main {
  def main(args: Array[String]): Unit = {
    // Configuration-based: the annotator list is supplied as a property.
    val props = new Properties()
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref")

    val pipeline = new StanfordCoreNLP(props)

    // Manual: append the SUTime annotator to the end of the pipeline.
    val timeAnnotator = new TimeAnnotator()
    pipeline.addAnnotator(timeAnnotator)

    ...
  }
}

Once this is working, you simply tell the pipeline to annotate the text, and then wait for a bit.

val text =
  "Last summer, they met every Tuesday afternoon, from 1:00 pm to 3:00 pm."
val doc = new Annotation(text)
pipeline.annotate(doc)

The reason this takes time is that the pipeline loads a handful of model and rule files from disk on first use; while the files themselves are small, they contain large numbers of pre-defined rules. Consider the following sample, a small piece of the time-matching portion of the library (there are around a thousand lines of this sort of thing):

BASIC_NUMBER_MAP = {
  "one": 1,
  "two": 2,
  "three": 3,
  ...
}

BASIC_ORDINAL_MAP = {
  "first": 1,
  "second": 2,
  "third": 3,
  ...
}

PERIODIC_SET = {
  "centennial": TemporalCompose(MULTIPLY, YEARLY, 100),
  "yearly": YEARLY,
  "annually": YEARLY,
  "annual": YEARLY,
  ...
}

This sample demonstrates two things – you can pull out more than just exact times (e.g. “last summer”, “next century”, ranges, times without dates), and the library handles a large number of equivalence classes for you.
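To make the equivalence-class idea concrete, here is a simplified sketch in plain Scala of how lookup maps like these normalize many surface forms to one canonical value. The names and structure here are illustrative only, not the library's actual code:

```scala
// Illustrative only: normalize number words the way SUTime's rule maps do.
val basicNumberMap: Map[String, Int] =
  Map("one" -> 1, "two" -> 2, "three" -> 3)

val basicOrdinalMap: Map[String, Int] =
  Map("first" -> 1, "second" -> 2, "third" -> 3)

// "two" and "second" both normalize to 2, so later rules can treat
// cardinals and ordinals uniformly.
def normalize(word: String): Option[Int] =
  basicNumberMap.get(word).orElse(basicOrdinalMap.get(word))
```

Once every synonym collapses to the same value, the downstream matching rules only need to handle the canonical forms.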

One of the most likely issues you’ll run into trying to get this working is getting the classpath and parsing pipeline set up right. The pipeline is simple to look at, but if you try to customize it, you’ll need to develop an understanding of how the library is actually structured.
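For the classpath part, if you’re building with sbt, the dependencies look roughly like this. The version shown is one that was current around the time of writing and may well have moved on; the "models" classifier is the important part, as it pulls in the large jar carrying the trained models and rule files:

```scala
// build.sbt (sketch) - the "models" classifier artifact contains the
// model and rule files that the pipeline loads at startup.
libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.2.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.2.0" classifier "models"
)
```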

Once you run it, you can get dates out:

    import scala.collection.JavaConverters._

    // The annotation is a Java list, so convert it before iterating.
    val timexAnnotations = doc.get(classOf[TimeAnnotations.TimexAnnotations])
    for (timexAnn <- timexAnnotations.asScala) {
      val timeExpr = timexAnn.get(classOf[TimeExpression.Annotation])
      val temporal = timeExpr.getTemporal()
      val range = temporal.getRange()

      println(temporal)
      println(range)
    }

For the above example, this gives you the following (note alternating "temporal" and "range" lines). Note the "OFFSET" lines as well: "XXXX-SU" is a summer whose year is unresolved, and "P-1Y" offsets it back one year. You can set a reference date for these, if you wish, so that they resolve to concrete dates.

XXXX-SU OFFSET P-1Y
(XXXX-SU OFFSET P-1Y,XXXX-SU OFFSET P-1Y,)
XXXX-WXX-2TAF
null
T13:00
(T13:00:00.000,T13:00:59.999,PT1M)
T15:00
(T15:00:00.000,T15:00:59.999,PT1M)

If you add the following line before calling annotate, these offsets will resolve themselves:

doc.set(classOf[CoreAnnotations.DocDateAnnotation], "2013-07-14")

Which gives you:

2012-SU
(2012-06-01,2012-09,P3M)
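Conceptually, what the document date enables is simple arithmetic: apply the offset to the reference date, then fill in the partial fields. Here is a rough illustration of that resolution logic in plain Scala using java.time. This mimics what SUTime does for the example above; it is not the library's implementation:

```scala
import java.time.LocalDate

// Illustrative only: resolve "last summer" (XXXX-SU OFFSET P-1Y)
// against a reference date, yielding a (start, end) range.
def resolveLastSummer(reference: LocalDate): (LocalDate, LocalDate) = {
  val year = reference.minusYears(1).getYear  // apply the P-1Y offset
  // Fill in the season: treat summer as the three months starting in June.
  (LocalDate.of(year, 6, 1), LocalDate.of(year, 9, 1))
}

val range = resolveLastSummer(LocalDate.parse("2013-07-14"))
// Mirrors the (2012-06-01, 2012-09, P3M) range printed above.
```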

One thing I haven't figured out yet is whether you can specify locale settings for this. Presumably, when parsing dates, you at least want to know which of month/day is intended to come first, even if you treat the text as English as a whole. That said, this may require first identifying a language and building a different pipeline - e.g. if the library can't handle French, Spanish, etc., being able to handle their dates is irrelevant.

For more examples, and information on customizing this, see the Stanford NLP SUTime documentation [1].

  1. http://nlp.stanford.edu/software/sutime.shtml