New tool explores Flippa seller history and facilitates due diligence

Clinton, owner of Experienced-People.net, released a new data analysis tool to explore Flippa auction history, available at http://stats.experienced-people.net/seller.php. Flippa is one of the largest marketplaces for websites for sale; this tool allows detailed research on sellers to facilitate due diligence and market research. One of the most fascinating parts of browsing Flippa listings, for me, […]

Scraping a List of Adsense Sites Within a Niche

One of the challenges in web crawling and scraping is determining which URLs to scrape. It’s easy for a site to have many URLs that aren’t visited by humans, like a stock photo site that uses an API to supplement its data. Sites with sessionid parameters or dynamic content may make many duplicate or similar […]
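
One way to cut down on the duplicates mentioned above is to normalize each URL before adding it to the crawl queue. The following is a minimal sketch, not the approach from the post itself: the `IGNORED_PARAMS` list and the `canonicalize` and `dedupe` helpers are hypothetical names chosen for illustration, and which parameters are safe to drop depends on the site being crawled.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters assumed not to change page content.
IGNORED_PARAMS = {"sessionid", "sid", "phpsessid", "utm_source", "utm_medium", "utm_campaign"}


def canonicalize(url):
    """Return a normalized form of url for duplicate detection."""
    parts = urlsplit(url)
    # Drop ignored parameters and sort the rest so ordering does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    path = parts.path or "/"
    # The fragment is discarded because it is never sent to the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), ""))


def dedupe(urls):
    """Keep the first URL seen for each canonical form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

For example, `dedupe` treats `http://example.com/a?sessionid=1` and `http://example.com/a?sessionid=2` as the same page and keeps only the first. Pages that differ only in dynamic content would still need a content-based check, which this sketch does not attempt.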


How to find Pitches in Music

Have you ever wanted to find what chords were in a song? A good first step is to read the notes played during a short time interval. If several notes are played simultaneously, in order to figure out the notes, we need to separate the waveform into the individual frequencies. Fortunately there is a well-established […]
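
The excerpt is cut off before it names a technique, so the following is only a sketch under an assumption: that a discrete Fourier transform is used to split the waveform into frequencies. The function names, the equal-temperament note mapping (A4 = 440 Hz), and the sample values are all illustrative and not taken from the post.

```python
import cmath
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def dft_magnitudes(samples):
    """Naive O(n^2) discrete Fourier transform; returns magnitudes for bins 0..n/2-1."""
    n = len(samples)
    return [
        abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def strongest_frequencies(samples, sample_rate, count):
    """Return the frequencies (Hz) of the `count` largest bins, skipping the DC bin."""
    n = len(samples)
    mags = dft_magnitudes(samples)
    bins = sorted(range(1, len(mags)), key=lambda k: mags[k], reverse=True)[:count]
    return sorted(k * sample_rate / n for k in bins)


def note_name(freq):
    """Map a frequency to the nearest equal-tempered note, assuming A4 = 440 Hz."""
    midi = round(69 + 12 * math.log2(freq / 440.0))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)


# Synthetic interval: two pure tones (440 Hz and 660 Hz) sampled at 2000 Hz.
sample_rate = 2000
samples = [
    math.sin(2 * math.pi * 440 * t / sample_rate) + math.sin(2 * math.pi * 660 * t / sample_rate)
    for t in range(400)
]
notes = [note_name(f) for f in strongest_frequencies(samples, sample_rate, 2)]
print(notes)  # ['A4', 'E5']
```

The synthetic tones land exactly on DFT bins, which keeps the example clean. Real recordings have harmonics and spectral leakage, so picking the largest bins directly would need windowing and peak detection, and a fast Fourier transform library would replace the naive loop.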


Using Prolog to Generate Test Data

I’ve built several reporting systems where work was divided evenly between a charting UI and database scripts – an ETL job, report SQL, and database schema. It’s nice to divide work between UI and database developers to take advantage of specialization, but not having data is always an issue for the first week or two […]


Scraping Adsense Ads with PhantomJS

PhantomJS is a headless WebKit browser, which lets you run JavaScript in a browser from the command line. It adds API calls that facilitate automated testing, screenshots, and scraping. I thought it would be interesting to write a script to retrieve Adsense destination URLs and text with PhantomJS. Extracting advertisement blocks requires fairly simple CSS […]

Detecting parked domains

Looking at old Flippa auctions that I scraped, it would be interesting to determine if domains are parked. I found this post, which describes a few options, including checking DNS entries for redirects, finding a blacklist, or content inspection. Some people have built APIs, but none appear maintained, and DNS providers rate-limit WHOIS lookups. […]


Detecting auction spam with Weka

Weka is an open-source data-mining tool written in Java, providing a host of data-mining algorithms. I am using it to build a proof-of-concept model that can classify auctions based on their value: fraudulent listing, zero-valued listing, overpriced listing, or underpriced listing. I’ve scraped some data from Flippa, a website/business auction site, to facilitate […]


A brief introduction to Weka

Weka is a GPL-licensed data-mining tool written in Java, published by the University of Waikato. It includes an extensive series of pre-implemented machine learning algorithms, including well-known classification and clustering algorithms. If you’ve ever been curious how Bayes’ theorem works, this is a great tool to get up and running. Weka uses a […]