Don’t use Access-Control-Allow-Origin

Access-Control-Allow-Origin is an HTTP header that lets servers specify which hosts may send cross-domain AJAX requests. Let’s say you were building an ad network that fetches content via AJAX. You would add this header to HTTP responses, once for each allowed domain. Clearly this is not scalable, but it’s a bad idea for other reasons […]
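As a rough illustration of the scaling problem: the header carries a single origin (or `*`), so a common workaround is to keep an allowlist on the server and echo back the request's Origin only when it matches. The sketch below assumes that approach; the function name `cors_headers` and the example origins are hypothetical and not taken from the post.

```python
from typing import Dict, Iterable, Optional


def cors_headers(origin: Optional[str], allowed: Iterable[str]) -> Dict[str, str]:
    """Return the CORS response headers for one request (hypothetical helper).

    `origin` is the value of the request's Origin header, or None if absent.
    `allowed` is the server-side allowlist of origins.
    """
    if origin is not None and origin in set(allowed):
        # Echo the single matching origin; "Vary: Origin" tells caches that
        # the response differs depending on the request's Origin header.
        return {"Access-Control-Allow-Origin": origin, "Vary": "Origin"}
    # Unknown or missing origin: send no CORS headers, so the browser blocks
    # cross-domain reads of the response.
    return {}
```

For example, `cors_headers("https://publisher.example", ["https://publisher.example"])` returns the header pair, while an origin outside the list gets an empty dict.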

1/3 of old Flippa website auctions point to abandoned sites

Flippa is an auction site for buying and selling websites as businesses. Browsing the listings turns up many low-quality products. Careful inspection often reveals interesting, quality listings, but they are lost in the noise. Occasionally there are successful e-commerce sites, unmaintained high-traffic developer forums, or fire sales on start-ups. Often these are educational, but […]

Detecting parked domains

Looking at old Flippa auctions that I scraped, it would be interesting to determine whether the domains are parked. I found this post, which describes a few options, including checking DNS entries for redirects, finding a blacklist, or content inspection. Some people have built APIs, but none appear to be maintained, and DNS providers rate-limit WHOIS lookups. […]
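Of the options listed, content inspection is the easiest to sketch. The example below is a minimal heuristic under stated assumptions: the phrase list and the function name `looks_parked` are hypothetical, chosen for illustration, and a real detector would need a much larger, tested set of signals.

```python
import re

# Hypothetical phrases that often appear on parking pages; this list is an
# assumption for illustration, not one taken from the post.
PARKING_PHRASES = (
    "this domain is for sale",
    "buy this domain",
    "domain is parked",
)


def looks_parked(html: str) -> bool:
    """Guess whether a fetched page is a parking page from its visible text."""
    text = re.sub(r"<[^>]+>", " ", html)        # crude tag stripping
    text = re.sub(r"\s+", " ", text).lower()    # normalize whitespace and case
    return any(phrase in text for phrase in PARKING_PHRASES)
```

The function only takes the HTML as a string, so it can be run over pages already saved by a scraper without further network requests.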

Fixing org.apache.solr.common.SolrException: Length Required

I received the following exception after making no code changes: org.apache.solr.common.SolrException: Length Required. The issue is that CommonsHttpSolrServer does not send a Content-Length header in updates. The root cause in my case was switching the front-end proxy from Apache to Nginx, which is apparently stricter about headers.


Detecting auction spam with Weka

Weka is an open-source data-mining tool written in Java that provides a host of data-mining algorithms. I am using it to build a proof-of-concept model that can classify auctions based on their value: fraudulent listing, zero-valued listing, overpriced listing, or underpriced listing. I’ve scraped some data from Flippa, a website/business auction site, to facilitate […]

Generating ARFF files for Weka from Postgres

Since all my scraped data is in Postgres, this is the easiest way to get it out – the fastest iteration possible. At some point I’ll probably switch to a Java library. It’s interesting to see, but probably the only lesson from this is that all ETL scripts are ugly. WITH advertisers_ranked AS ( SELECT […]
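For comparison with the SQL approach, here is a minimal sketch of producing the same kind of file from rows that have already been fetched. It assumes the standard ARFF layout (`@relation`, `@attribute`, `@data`); the helper name `to_arff` and the example attributes are hypothetical and are not the post's schema.

```python
from typing import Iterable, Optional, Sequence, Tuple, Union

Value = Optional[Union[str, int, float]]


def _format_value(value: Value) -> str:
    """Render one cell: '?' for missing, quoted if it has special characters."""
    if value is None:
        return "?"
    if isinstance(value, str):
        if value == "" or any(ch in value for ch in " ,'\"{}%\t"):
            escaped = value.replace("\\", "\\\\").replace("'", "\\'")
            return "'" + escaped + "'"
        return value
    return str(value)


def to_arff(relation: str,
            attributes: Sequence[Tuple[str, str]],
            rows: Iterable[Sequence[Value]]) -> str:
    """Build ARFF text from (name, type) attribute pairs and data rows."""
    lines = ["@relation " + relation, ""]
    for name, arff_type in attributes:
        lines.append("@attribute " + name + " " + arff_type)
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(_format_value(v) for v in row))
    return "\n".join(lines) + "\n"
```

Rows from any database cursor could be passed straight in as `rows`, with `None` (SQL NULL) written as ARFF's missing-value marker `?`.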


A brief introduction to Weka

Weka is a GPL-licensed data-mining tool written in Java, published by the University of Waikato. It includes an extensive set of pre-implemented machine learning algorithms, including well-known classification and clustering algorithms. If you’ve ever been curious how Bayes’ theorem works, this is a great tool to get up and running. Weka uses a […]

Expert Search Statistics

The following are some interesting statistics about the GitHub expert-finder.
Unique repositories: 18,977
Source git repos: 250+ GB
Solr index size: 3.2 GB
Time to build index: ~12 hours spread over several days (had to restart the indexer several times)
Number of commits: 4,579,236


Advertisers used by banned sellers in Flippa auctions

In a previous post, I listed the top Flippa advertisers, gathered with the Node.js web scraper. Which advertisers are mentioned most often in auctions by banned sellers? As you can see, there is a big drop in the “unknown” category, and a big increase in banned accounts associated with Infolinks and CJ. After visual inspection, […]