Web scraping (also called harvesting or data extraction) is the process of gathering data from websites. Various tools exist to download, clean, and process this data, from built-in UNIX utilities to custom software and programming-language APIs. Some processing is almost always required, whether the data is structured or unstructured. Sites that generate pages from templates but do not provide an API are often easy to parse, because their markup is consistent. Unstructured data is common when extracting from PDFs, and in some cases manual intervention is required, forcing you to use a system like Amazon Mechanical Turk to have humans process the data in batches.

Below you will find some of the most popular pieces I’ve written, often proof-of-concept tools built to test new technologies and explore database systems. With some patience, a great deal of data is readily available: legal documents such as court cases, financial filings, weather readings, advertisement text, search results, auction data, and so on.


Scala: wget/curl example

The following example shows how to implement the bones of a wget- or curl-style application in Scala, including checking for HTTPS and obtaining the certificate chain.

val testUrl = "https://" + domain
val url = new URL(testUrl)
val conn: HttpsURLConnectionImpl = url.openConnection() match {
  case httpsConn: HttpsURLConnectionImpl => httpsConn
  case conn => {
    println(conn.getClass)
[…]

Script to get values from wikiart pages

The following script will pull data values from a wikiart page (an excellent index of paintings):

import glob
from bs4 import BeautifulSoup

indir = 'D:\\projects\\art\\'
for filename in glob.glob(indir + "*.html"):
    print(filename)
    file = open(filename, 'r')
    soup = BeautifulSoup(file, 'html.parser')
    spans = soup.select("span[itemprop]")
    ahrefs = soup.select("a[itemprop]")
[…]

Scraping Tabular Websites into a CSV file using PhantomJS

While there are many tools for scraping website content, two of my current favorites are PhantomJS (JavaScript) and BeautifulSoup (Python). Many small-scale problems are easily solved by downloading files with wget and then post-processing them; this works well because extracting clean data typically takes dozens of iterations, and working from local copies avoids re-downloading the site on each pass. If you’re wondering about the legitimacy […]
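As a sketch of that download-then-post-process loop, here is a minimal Python example that pulls table rows out of locally saved HTML files and writes them to one CSV. It uses only the standard library rather than BeautifulSoup or PhantomJS, and the glob pattern and output path are assumptions, not paths from the original post:

```python
import csv
import glob
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

def html_tables_to_csv(pattern, out_path):
    # Walk every downloaded page and append its table rows to one CSV,
    # so the extraction can be re-run cheaply while it is being refined.
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for filename in glob.glob(pattern):
            parser = TableExtractor()
            with open(filename, encoding="utf-8") as f:
                parser.feed(f.read())
            writer.writerows(parser.rows)
```

Because the parsing is separated from the download, each refinement of `TableExtractor` only re-reads local files.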

Scraping a list of links from a document into a CSV file

First, right-click an element you are interested in and select “Inspect Element”. In the Developer Tools window, select “Copy XPath”. If all goes well, this will be an array-valued path, and you can modify it slightly to return all matching nodes instead of just the selected item.

nodes = document.evaluate(
  '//*[@id="hm-lower-background"]/div/a',
  document,
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null
)
[…]
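The same idea can be sketched outside the browser in Python, using the standard library’s limited XPath support in xml.etree; the page content here is an invented stand-in, with only the element id carried over from the console snippet above:

```python
import csv
import io
import xml.etree.ElementTree as ET

# A stand-in for a downloaded page; only the id "hm-lower-background"
# comes from the example above, the links are invented.
PAGE = """<html><body>
<div id="hm-lower-background">
  <div><a href="/posts/1">First post</a></div>
  <div><a href="/posts/2">Second post</a></div>
</div>
</body></html>"""

def links_to_rows(html_text):
    root = ET.fromstring(html_text)
    # ElementTree supports simple predicates like [@id='...'],
    # though not the full //* syntax available in the browser console.
    anchors = root.findall(".//div[@id='hm-lower-background']/div/a")
    return [[a.get("href"), a.text] for a in anchors]

def rows_to_csv(rows):
    # Write (href, text) pairs into an in-memory CSV.
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()
```

Note that ET.fromstring requires well-formed markup, so for real pages a forgiving parser such as BeautifulSoup is the safer choice.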


Counting Citations in U.S. Law

The U.S. Congress recently released a series of XML documents containing U.S. laws. The structure of these documents allows us to find which sections of the law are most commonly cited. Examining which citations occur most frequently shows what Congress has spent the most time thinking about. Citations occur for many reasons: […]
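The counting itself can be sketched with the standard library; this is a minimal, hypothetical example — the sample text and the regex are placeholders of mine, not the actual structure of the released XML:

```python
import re
from collections import Counter

# Hypothetical text mentioning citations of the form "42 U.S.C. 1983".
text = """
Amended by 42 U.S.C. 1983 and later restated.
See 26 U.S.C. 501 and 42 U.S.C. 1983 for details.
Compare 26 U.S.C. 501.
"""

# A placeholder pattern for "<title> U.S.C. <section>" citations.
CITATION = re.compile(r"\d+ U\.S\.C\. \d+")

# Tally every match, then rank citations by frequency.
counts = Counter(CITATION.findall(text))
for citation, n in counts.most_common():
    print(citation, n)
```

The real documents mark citations structurally in XML, so a parser would replace the regex, but the Counter-based ranking stays the same.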

Examining Citations in Federal Law using Python

Congress frequently passes laws which amend or repeal sections of prior laws; this produces a series of edits to the law that programmers will recognize as resembling source-control history. In concept this is simple, but in practice it is incredibly complex; for instance, like source control, the system must handle renumbering. What […]

Converting JSON to a CSV file with Python

In a previous post, I showed how to extract data from the Google Maps API, which leaves a series of JSON files like this:

{"address_components": [{"long_name":"576","short_name":"576","types":["street_number"]}, {"long_name":"Concord Road","short_name":"Concord Road","types":["route"]},{"long_name":"Glen Mills","short_name":"Glen Mills","types":["locality","political"]},{"long_name":"PA","short_name":"PA","types":…

Ideally we want selections from these as a CSV for manual review and import into mapping software. First, we load a list of files: […]
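As a sketch of the conversion step, here is a minimal stdlib-only example. The field choices (street number, route, locality) are mine, and the record is the sample shown above, completed with placeholder types so it parses as valid JSON:

```python
import csv
import io
import json

# One Google-Maps-style geocoding record; the final "types" list is a
# placeholder completing the truncated sample from the post.
RECORD = json.loads("""
{"address_components": [
  {"long_name": "576", "short_name": "576", "types": ["street_number"]},
  {"long_name": "Concord Road", "short_name": "Concord Road", "types": ["route"]},
  {"long_name": "Glen Mills", "short_name": "Glen Mills", "types": ["locality", "political"]},
  {"long_name": "PA", "short_name": "PA", "types": ["political"]}
]}
""")

def component(record, wanted):
    """Return the long_name of the first component tagged with `wanted`."""
    for part in record["address_components"]:
        if wanted in part["types"]:
            return part["long_name"]
    return ""

def to_csv_row(record):
    # Pick out the columns we want for manual review.
    return [component(record, t) for t in ("street_number", "route", "locality")]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["street_number", "route", "locality"])
writer.writerow(to_csv_row(RECORD))
```

In practice the same two functions would run inside a loop over the list of JSON files, appending one row per file.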