Finding Corporate Sponsors of Open Source

I copied about 19,000 git repositories into a full-text solr index. Because commits are tied to email addresses this provides interesting insight into corporate open source contributions. The search front-end I added lets you search for programmers or companies, grouped by the number of commits. For example, searching for Linux returns the following results: linux-foundation […]

Reading the Youtube API from PhantomJS

The following code will retrieve the duration of a video from the Youtube API. The output must be in JSON, because PhantomJS currently doesn’t handle the XML correctly (from email threads it appears this will be fixed in a future release) address = ‘http://gdata.youtube.com/feeds/api/videos/’ + id + ‘?v=2&alt=json’; page.open(address, function (status) { var duration = […]

, ,

Scraping Adsense Ads with PhantomJS

PhantomJS is a headless WebKit, which lets you run Javascript in a browser from the command line. It adds additional API calls which facilitate automated testing, screenshots, and scraping. I thought it would be interesting to write a script to retrieve Adsense destination URLs and text with PhantomJS. Extracting advertisement blocks requires fairly simple CSS […]

Don’t use Access-Control-Allow-Origin

Access-Control-Allow-Origin is an HTTP header that allows servers to specify which hosts may send cross domain AJAX requests. Let’s say you were building an ad network, fetching content via AJAX. You would add this header to HTTP responses, once for each allowed domain. Clearly this is not scalable, but it’s a bad idea for other reasons […]

1/3 of old Flippa website auctions point to abandoned sites

Flippa is an auction site for buying and selling websites as businesses. Browsing the listings shows many low quality products. With careful inspection, there are often interesting, quality listings, but they are swallowed in the noise. Occasionally there are successful e-commerce sites, un-maintained high-traffic developer forums, or fire-sales on start-ups. Often these are educational, but […]

Detecting parked domains

Looking at old Flippa auctions that I scraped, it would be interesting to determine if domains are parked. I found this post, which describes a few options, including checking DNS entries for redirects, finding a blacklist, or content inspection. Some people have built APIs, but none appear maintained, and DNS providers rate limit whois lookups. […]

Fixing org.apache.solr.common.SolrException: Length Required

I received the following exception, after making no code changes: org.apache.solr.common.SolrException: Length Required The issue is that CommonsHttpSolrServer does not send a Content-Length header in updates. The root cause of my issue was switching the front-end proxy from Apache to Nginx, which apparently is more strict about headers.  

,

Detecting auction spam with Weka

Weka is an open-source data-mining tool written in Java, providing a host of data mining algorithms. I am using it to build a proof-of-concept model that can classify auctions based on their value: fraudulent listing, zero valued listing, overpriced listing, or underpriced listing. I’ve scraped some data from Flippa, a website/business auction site, to facilitate […]