Detecting parked domains

While looking at old Flippa auctions that I scraped, I thought it would be interesting to determine whether the domains are parked. I found this post, which describes a few options, including checking DNS entries for redirects, finding a blacklist, or inspecting page content. Some people have built APIs, but none appear to be maintained, and DNS providers rate-limit whois lookups. […]
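
As a rough illustration of the content-inspection option, the sketch below flags a page as likely parked when its HTML contains one of a few telltale phrases. The phrase list and the function name `looks_parked` are assumptions made up for this example, not taken from the linked post; a real list would need tuning against known parked pages.

```python
import re

# Hypothetical phrases that often show up on parking pages; this list is an
# assumption for illustration and would need tuning on real data.
PARKED_PHRASES = (
    "this domain is for sale",
    "buy this domain",
    "domain is parked",
    "parked free",
)


def looks_parked(html: str) -> bool:
    """Return True if the page text contains any known parking phrase."""
    # Strip tags crudely and collapse whitespace so phrases split across
    # elements or lines still match.
    text = re.sub(r"<[^>]+>", " ", html)
    text = re.sub(r"\s+", " ", text).lower()
    return any(phrase in text for phrase in PARKED_PHRASES)
```

Keyword matching like this is cheap but prone to false positives, so it probably works best combined with the DNS and blacklist checks.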

Fixing org.apache.solr.common.SolrException: Length Required

I received the following exception after making no code changes: org.apache.solr.common.SolrException: Length Required. The issue is that CommonsHttpSolrServer does not send a Content-Length header in updates. The root cause was that I had switched the front-end proxy from Apache to Nginx, which is apparently stricter about headers.
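
HTTP status 411 Length Required means the server refused a request body that arrived without a Content-Length header. A minimal sketch of the client-side fix, in Python rather than the SolrJ client from the post: serialize the update body up front so its byte length is known, and declare that length explicitly. The URL and the function name are hypothetical.

```python
import urllib.request


def build_update_request(url: str, xml_body: str) -> urllib.request.Request:
    """Build a POST whose body length is known and declared up front."""
    data = xml_body.encode("utf-8")  # encode first: the length is in bytes
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            "Content-Length": str(len(data)),
        },
    )


# Hypothetical Solr update endpoint; nothing is sent until urlopen() is called.
req = build_update_request("http://localhost:8983/solr/update", "<commit/>")
```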

Detecting auction spam with Weka

Weka is an open-source data-mining tool written in Java that provides a host of data-mining algorithms. I am using it to build a proof-of-concept model that classifies auctions by their value: fraudulent listing, zero-valued listing, overpriced listing, or underpriced listing. I’ve scraped some data from Flippa, a website/business auction site, to facilitate […]

Generating ARFF files for Weka from Postgres

Since all my scraped data is in Postgres, generating the ARFF file with a SQL query is the easiest way to get it out – the fastest iteration possible. At some point I’ll probably switch to a Java library. It’s interesting to see, but probably the only lesson from this is that all ETL scripts are ugly. WITH advertisers_ranked AS ( SELECT […]
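
For reference, an ARFF file is a plain-text header (`@relation`, then one `@attribute` line per column) followed by `@data` and comma-separated rows. The sketch below builds one from rows that have already been fetched from the database. It is written in Python rather than the SQL used in the post, and the relation name, column names, and sample rows are made up for illustration.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, type) pairs, where type is either an ARFF
    type string such as "numeric" or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(kind) + "}"  # nominal attribute
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        # ARFF writes missing values as "?".
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical columns and rows, standing in for a Postgres result set.
arff = to_arff(
    "auctions",
    [("price", "numeric"), ("banned", ["yes", "no"])],
    [(1200, "no"), (None, "yes")],
)
```

This sketch does not quote values, so string columns containing spaces or commas would need quoting before they are written out.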

A brief introduction to Weka

Weka is a GPL-licensed data-mining tool written in Java, published by the University of Waikato. It includes an extensive series of pre-implemented machine learning algorithms, including well-known classification and clustering algorithms. If you’ve ever been curious how Bayes’ theorem works, this is a great tool to get up and running. Weka uses a […]
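
For readers new to it, Bayes' theorem says P(A|B) = P(B|A) · P(A) / P(B). A small worked sketch with made-up numbers (they are not from any dataset in these posts): given how often a word appears in spam and in legitimate listings, compute the probability that a listing containing the word is spam.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(A|B) from P(A), P(B|A), and P(B|not A), via Bayes' theorem."""
    # P(B) by the law of total probability.
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence


# Made-up numbers: 20% of listings are spam; the word appears in 60% of
# spam listings and 5% of legitimate ones.
p_spam_given_word = posterior(prior=0.2, likelihood=0.6, likelihood_given_not=0.05)
```

With these inputs the numerator is 0.12 and the evidence is 0.16, so the posterior works out to 0.75: seeing the word raises the spam estimate from 20% to 75%.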

Expert Search Statistics

The following are some interesting statistics about the GitHub expert-finder.

Unique repositories: 18,977
Source git repos: 250+ GB
Solr index size: 3.2 GB
Time to build index: ~12 hours spread over several days (had to restart the indexer several times)
Number of commits: 4,579,236

Advertisers used by banned sellers in Flippa auctions

In a previous post, I listed the top Flippa advertisers, gathered with the Node.js web scraper. Which advertisers are mentioned most often in auctions by banned sellers? As you can see, there is a big drop in the “unknown” category, and a big increase in banned accounts associated with Infolinks and CJ. After visual inspection, […]

ExtJS TreePanel Example

Problem: You want to display a file-manager-style tree grid. Solution: Use Ext.tree.Panel, and define columns on it to render the tree as a grid. Discussion: The official ExtJS documentation shows how to build a tree grid from an array, which I don’t find terribly helpful. More realistic applications require JSON […]

Diagnosing Connection Leaks in Node.js and Postgres

While building a website scraper with Chrome and Node.js, I made mistakes that led to connection leaks. In this application, the scraper runs in a browser and connects to a Node.js server, which saves data to a database. Once you know what the issues look like, they are easy to see, but otherwise often difficult […]