Below are some of the most popular articles I've written.

Table of Contents

Book Review: Natural Language Processing with Python

Building a full-text index of git commits using lunr.js and Github APIs

Converting git commit history to a solr full-text index

Expert Search Statistics

Identifying important keywords using Lunr.js and the Blekko API

Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js

Scraping PDF text with Python



Book Review: Natural Language Processing with Python

“Natural Language Processing with Python” provides a nice overview of NLP techniques and Python, using NLTK (Natural Language Toolkit), a framework maintained by the book’s authors. It’s intended for use as (I assume) an undergraduate textbook (some of its examples of “difficult” code will not appear difficult to more experienced programmers). Don’t be put […] Read More...

Building a full-text index of git commits using lunr.js and Github APIs

Github has a nice API for inspecting repositories – it lets you read gists, issues, commit history, files and so on. Git repository data lends itself to demonstrating the power of combining full text and faceted search, as there is a mix of free text fields (commit messages, code) and enumerable fields (committers, dates, committer […] Read More...
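As a rough illustration of combining free-text and faceted search over commit data, here is a minimal Python sketch. It runs over a small in-memory list rather than the Github API or lunr.js, and the record fields (`message`, `committer`, `date`) and sample values are hypothetical, chosen only for the example.

```python
from collections import Counter

# Hypothetical commit records; the field names are illustrative and are
# not the schema returned by the Github API.
COMMITS = [
    {"message": "Fix tokenizer bug in search index", "committer": "alice", "date": "2013-01-04"},
    {"message": "Add search box to home page", "committer": "bob", "date": "2013-01-05"},
    {"message": "Update README", "committer": "alice", "date": "2013-02-11"},
]


def search(commits, text=None, **facets):
    """Return commits whose message contains `text` and whose fields equal every facet value."""
    results = []
    for commit in commits:
        # Free-text part: a naive case-insensitive substring match.
        if text and text.lower() not in commit["message"].lower():
            continue
        # Faceted part: exact match on enumerable fields such as committer.
        if any(commit.get(field) != value for field, value in facets.items()):
            continue
        results.append(commit)
    return results


def facet_counts(commits, field):
    """Count how many commits fall under each value of an enumerable field."""
    return Counter(commit[field] for commit in commits)


hits = search(COMMITS, text="search")
by_committer = facet_counts(hits, "committer")
alice_hits = search(COMMITS, text="search", committer="alice")
```

The facet counts are what a search UI would show beside the results, so a reader can narrow a text query down to one committer.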

Converting git commit history to a solr full-text index

I built a 4 million document archive from Github commits, which lets you search for open source experts, ranked by commit count. Click here to try the demo. Solr is a relatively recent addition to the world of Lucene (2007); it adds a web-app UI over Lucene, scaling (highly available reads), and configuration. For those […] Read More...

Expert Search Statistics

The following are some interesting statistics about the Github expert-finder.

Unique repositories: 18,977
Source git repos: 250+ GB
Solr index size: 3.2 GB
Time to build index: ~12 hours spread over several days (had to restart indexer several times)
Number of commits: 4,579,236

Read More...

Identifying important keywords using Lunr.js and the Blekko API

Lunr.js is a simple full-text search engine in JavaScript. Full-text search ranks documents returned from a query by how closely they resemble the query, based on word frequency and grammatical considerations – frequently occurring words have minimal effect, whereas if a rare word occurs in a document several times, it boosts the ranking significantly. This […] Read More...
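The word-frequency part of that ranking can be sketched with a plain TF-IDF score. This is a minimal Python illustration of the general idea, not Lunr.js's actual scoring code; the function names and sample documents are made up for the example, and tokenization is just a lowercase whitespace split.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stop-word handling)."""
    return text.lower().split()


def tf_idf_scores(query, documents):
    """Score each document against the query using term count times inverse document frequency."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in counts:
                # A term found in every document gets idf = log(1) = 0,
                # so very common words contribute nothing to the score.
                idf = math.log(n_docs / doc_freq[term])
                score += counts[term] * idf
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "the lucene index and the lucene query",
    "the dog and the cat",
]
scores = tf_idf_scores("the lucene", docs)
```

Here "the" appears in every document and so has no effect, while "lucene" is rare and occurs twice in the second document, which is the only one with a non-zero score.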

Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js

Technologies used: Vagrant + Virtualbox, Node.js, node-static, Lunr.js, node-lazy, phantomjs

Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (XX millions), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, and decided […] Read More...

Scraping PDF text with Python

If you want to extract text from a PDF with Python, there is a library called PDFMiner (beware: does not work in Python 3). This example will walk a directory structure, look for PDFs, and make a “.txt” file next to the PDF with a text rendition.

import sys
from pdfminer.pdfparser import PDFDocument, PDFParser
from […] Read More...
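The directory-walking half of that task can be sketched independently of PDFMiner. In this minimal Python sketch, `extract_text` is a hypothetical caller-supplied function that turns a PDF path into a string (it stands in for whatever PDF library is used), and naming the output `report.txt` for `report.pdf` is an assumption about what "next to the PDF" means.

```python
import os


def find_pdfs(root):
    """Yield the path of every file under `root` whose name ends in .pdf (any case)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)


def txt_path_for(pdf_path):
    """Return the sibling .txt path for a PDF (assumed naming: report.pdf -> report.txt)."""
    return os.path.splitext(pdf_path)[0] + ".txt"


def convert_tree(root, extract_text):
    """Write a .txt file beside each PDF under `root` and return the paths written.

    `extract_text` is a hypothetical callable (path -> str) supplied by the
    caller; it is not a PDFMiner function.
    """
    written = []
    for pdf_path in find_pdfs(root):
        out_path = txt_path_for(pdf_path)
        with open(out_path, "w", encoding="utf-8") as handle:
            handle.write(extract_text(pdf_path))
        written.append(out_path)
    return written
```

Passing the extractor in as a function keeps the walk testable without any PDF library installed, and lets the extraction step be swapped out later.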