Entries by Gary

Visualizing Six Million Files and Folders

Each year there are nearly 300,000 of these in Federal Civil Court, 1.3-1.6 million in Federal Bankruptcy Court, but this pales in comparison to state courts, which accept just over 100 million cases each year. Even a small extract of these takes up a fair amount of space: This is what a court docket […]

Building a Directory Structure Index in Python

I’m working through examples in “Natural Language Processing with Python” (read my review) and found that the corpus I have to work with is large enough to require special performance tuning exercises. If you have a large enough directory structure, it becomes difficult to walk with os.walk – for instance any failure in longer scripts […]
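As a rough illustration of the idea in this excerpt, the sketch below walks a tree once with os.walk and saves every file path to a plain-text index, so later scripts can read the index instead of re-walking the directories. The function names `build_index` and `read_index` and the one-path-per-line format are assumptions made for this example, not taken from the post.

```python
import os


def build_index(root, index_path):
    """Walk `root` once and write one file path per line to `index_path`.

    Hypothetical helper for illustration. Assumes `index_path` lies outside
    `root` and that no file name contains a newline character.
    """
    count = 0
    with open(index_path, "w", encoding="utf-8") as out:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # in-place sort makes the traversal order deterministic
            for name in sorted(filenames):
                out.write(os.path.join(dirpath, name) + "\n")
                count += 1
    return count


def read_index(index_path):
    """Yield the stored paths lazily, without touching the directory tree."""
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")
```

A long-running job can then iterate over `read_index(...)` and, if it fails partway through, resume from a known position in the file rather than starting the walk again.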

Creating N-Gram Indexes with Python

“Natural Language Processing with Python” (read my review) has lots of motivating examples for natural language processing. I quickly found it valuable to build indices ahead of time – I have a corpus of legal texts, and build a set of n-gram indices from it. The first index is a list of just tokenized text, […]
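To make the idea of precomputed n-gram indices concrete, here is a minimal sketch using only the standard library. The regex tokenizer, the function names, and the choice of a `Counter` per n-gram length are assumptions for this example; the post itself may build its indices differently (for instance with NLTK).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def build_ngram_index(tokens, max_n=3):
    """Map each n in 1..max_n to a Counter of n-gram frequencies."""
    return {n: Counter(ngrams(tokens, n)) for n in range(1, max_n + 1)}


tokens = tokenize("The court held that the court erred.")
index = build_ngram_index(tokens)
print(index[2].most_common(1))  # the most frequent bigram and its count
```

Building the counters once and storing them means later queries are dictionary lookups rather than fresh passes over the corpus.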

Optimizing WordPress Tag Pages

Normally I don’t like to write about “blogging,” but since website traffic generates some interesting data, it’s worth looking at it from a computer science perspective, to see the issues involved. By default, WordPress has two multi-valued fields associated with an article, “Categories” and “Tags.” Categories are treated as a closed, hierarchical set, and tags […]


NLP Analysis in Python using Modal Verbs

Modal verbs are auxiliary verbs which indicate semantic information about an action, such as likelihood (will, should), permission (could, may), or obligation (shall, must). One interesting concept to explore is whether the presence of these verbs varies over different types of text, and whether that means anything. “Natural Language Processing with Python” (read my review) has an […]
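One simple way to explore this question is to count how often each modal appears in a text and compare the rates across text types. The sketch below is an assumption-laden illustration: the modal list is limited to the six verbs named above, the tokenizer is a plain regex, and `modal_counts` is a hypothetical helper rather than anything from the post or from NLTK.

```python
import re
from collections import Counter

# Only the modals named in the excerpt; a fuller list would include others.
MODALS = ("will", "should", "could", "may", "shall", "must")


def modal_counts(text):
    """Return (per-modal counts, total token count) for one text.

    Lowercasing means the month "May" is counted as the modal "may";
    a part-of-speech tagger would be needed to tell them apart.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in MODALS)
    return {m: counts[m] for m in MODALS}, len(tokens)


counts, total = modal_counts(
    "The tenant shall pay rent. The tenant must not sublet, and may terminate."
)
for modal, n in counts.items():
    print(f"{modal:7s} {n / total:.3f}")  # relative frequency per token
```

Dividing by the token count makes texts of different lengths comparable.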

Making Maps with TileMill

TileMill is a piece of map-making software for rendering beautiful maps. You can export the maps to MapBox for a Google Maps feel, or combine them with a tool like D3.js for interactive infographics. There are a surprising number of data sources: weather, earthquake locations, crime statistics, and ship and plane locations. A lot of this is from federal and municipal agencies […]