Rhyming with NLP and Shakespeare

“Natural Language Processing with Python” (read my review) has lots of motivating examples for natural language processing, focused on NLTK, which, among other things, does a nice job of collecting NLP datasets and algorithms into one library. Let’s take one of Shakespeare’s sonnets and see if we can recommend alternate rhymes: import nltk from nltk.corpus […]

Visualizing Six Million Files and Folders

Each year there are nearly 300,000 of these in Federal Civil Court and 1.3-1.6 million in Federal Bankruptcy Court, but this pales in comparison to state courts, which accept just over 100 million cases each year. Even a small extract of these takes up a fair amount of space: This is what a court docket […]

Building a Directory Structure Index in Python

I’m working through examples in “Natural Language Processing with Python” (read my review) and found that the corpus I have to work with is large enough to require special performance tuning exercises. If you have a large enough directory structure, it becomes difficult to walk with os.walk – for instance any failure in longer scripts […]
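
One way to avoid re-walking a large tree is to walk it once and save the file paths to disk, so that later scripts read the saved list instead. The following is only a minimal sketch of that idea, not the approach from the post itself; the function names and the one-path-per-line index format are assumptions made for illustration, and it assumes file names contain no newline characters.

```python
import os

def build_directory_index(root, index_path):
    """Walk `root` once and write every file path, one per line, to `index_path`.

    Returns the number of files recorded. The index file should live
    outside `root` so that it does not end up listing itself.
    """
    count = 0
    with open(index_path, "w", encoding="utf-8") as index_file:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # sorting in place makes the traversal order deterministic
            for name in sorted(filenames):
                index_file.write(os.path.join(dirpath, name) + "\n")
                count += 1
    return count

def read_directory_index(index_path):
    """Yield file paths from a previously written index without touching the tree."""
    with open(index_path, encoding="utf-8") as index_file:
        for line in index_file:
            yield line.rstrip("\n")
```

A long-running script can then iterate over read_directory_index(...) and, after a failure, resume from a known line of the index rather than starting the walk again.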

Creating N-Gram Indexes with Python

“Natural Language Processing with Python” (read my review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time – I have a corpus of legal texts, and build a set of n-gram indices from it. The first index is a list of just tokenized text, […]
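
As a rough illustration of what an n-gram index can look like, here is a minimal sketch using only the standard library. The whitespace tokenizer, the function names, and the two sample documents are assumptions for the example; the post's own indices are built from a corpus of legal texts and may be structured differently.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return text.lower().split()

def ngrams(tokens, n):
    """Return the n-token tuples that occur in `tokens`, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_ngram_index(documents, n):
    """Map each n-gram to the (doc_id, position) pairs where it starts."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, gram in enumerate(ngrams(tokenize(text), n)):
            index[gram].append((doc_id, position))
    return index

# Hypothetical sample documents, for illustration only.
documents = {
    "a": "the court finds the motion",
    "b": "the motion is denied",
}
bigram_index = build_ngram_index(documents, 2)
```

Looking up bigram_index[("the", "motion")] then returns every document and token offset where that bigram begins, without re-tokenizing the corpus.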

Uncovering Lexical Relationships with Python and NLP

WordNet is a database containing hierarchies of certain types of relationships – “a tree is part of a forest”, “a car is a type of motor vehicle”, “an engine is part of a car” (meronyms, holonyms, and hypernyms). “Natural Language Processing with Python” (read my review) suggests that you might discover these relationships in a corpus by […]


NLP Analysis in Python using Modal Verbs

Modal verbs are auxiliary verbs that indicate semantic information about an action, e.g. likelihood (will, should), permission (could, may), or obligation (shall, must). One interesting concept to explore is whether the presence of these verbs varies over different types of text, and whether that means anything. “Natural Language Processing with Python” (read my review) has an […]
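
To make the comparison concrete, the sketch below counts modal verbs per text type using only the standard library. The list of modals, the punctuation stripping, and the two sample texts are assumptions for illustration; this is not the book's example, and a naive token match will also count non-modal uses such as the month "May".

```python
from collections import Counter

MODALS = ("can", "could", "may", "might", "must", "shall", "should", "will", "would")

def modal_counts(text):
    """Count how often each modal verb appears in `text`.

    Tokens are lowercased, split on whitespace, and stripped of
    surrounding punctuation; no part-of-speech tagging is done.
    """
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    counts = Counter(t for t in tokens if t in MODALS)
    return {modal: counts[modal] for modal in MODALS}

# Hypothetical sample texts, one per text type.
texts = {
    "statute": "The clerk shall file the notice. The court may extend the deadline.",
    "forecast": "It will rain tomorrow, and it could snow. It might clear by noon.",
}
table = {genre: modal_counts(text) for genre, text in texts.items()}
```

Comparing the rows of table shows which modals dominate each text type; on a real corpus the counts would normally be divided by text length before comparing.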

Identifying important keywords using Lunr.js and the Blekko API

Lunr.js is a simple full-text search engine in JavaScript. Full-text search ranks documents returned from a query by how closely they resemble the query, based on word frequency and grammatical considerations – frequently occurring words have minimal effect, whereas if a rare word occurs in a document several times, it boosts the ranking significantly. This […]
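
The ranking idea described above (common words count for little, repeated rare words count for a lot) can be sketched with a plain tf-idf score. This is an illustration in Python, not Lunr.js's actual scoring function; the function names and sample documents are assumptions, and stemming and other grammatical handling are left out.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return text.lower().split()

def score_documents(query, documents):
    """Return {doc_id: score}, where score is the summed tf-idf of the query terms."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(term_counts)
    scores = {doc_id: 0.0 for doc_id in term_counts}
    for term in tokenize(query):
        # Document frequency: how many documents contain the term at all.
        doc_freq = sum(1 for counts in term_counts.values() if term in counts)
        if doc_freq == 0:
            continue  # a term that appears nowhere cannot affect the ranking
        idf = math.log(n_docs / doc_freq)  # zero when every document has the term
        for doc_id, counts in term_counts.items():
            scores[doc_id] += counts[term] * idf
    return scores

def rank(query, documents):
    """Return document ids ordered from best to worst match."""
    scores = score_documents(query, documents)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sample documents, for illustration only.
documents = {
    "d1": "the habeas petition cites habeas precedent",
    "d2": "the court heard the case",
    "d3": "the petition was filed",
}
scores = score_documents("the habeas", documents)
```

Here "the" appears in every document and so contributes nothing, while "habeas" appears twice in a single document and puts it at the top of the ranking.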

Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js

Technologies used: Vagrant + Virtualbox, Node.js, node-static, Lunr.js, node-lazy, phantomjs. Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (XX millions), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, and decided […]