Finding Corporate Sponsors of Open Source

I copied about 19,000 git repositories into a full-text Solr index. Because commits are tied to email addresses, this provides interesting insight into corporate open source contributions. The search front-end I added lets you search for programmers or companies, with results grouped by company and ranked by the number of commits.
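
As a rough illustration of the grouping idea (not the code behind the search engine), the sketch below counts commits per company from a list of author emails, such as the output of git log --format=%ae. The helper names and sample addresses are made up, and the rule "the company is the label before the top-level domain" is a naive assumption that breaks on domains like example.co.uk.

```javascript
// Naive assumption: the company is the label just before the top-level domain,
// so "dev@mail.redhat.com" and "dev@redhat.com" both count toward "redhat".
function companyFromEmail(email) {
  const at = email.lastIndexOf("@");
  if (at < 0) return null;
  const labels = email.slice(at + 1).toLowerCase().split(".").filter(Boolean);
  if (labels.length < 2) return null;
  return labels[labels.length - 2];
}

// Count commits per company and sort the largest contributors first.
function countCommitsByCompany(authorEmails) {
  const counts = new Map();
  for (const email of authorEmails) {
    const company = companyFromEmail(email);
    if (company === null) continue;
    counts.set(company, (counts.get(company) || 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

// Made-up sample addresses, one per commit.
const sample = [
  "dev1@redhat.com",
  "dev2@redhat.com",
  "someone@gmail.com",
  "dev3@mail.redhat.com",
];
console.log(countCommitsByCompany(sample));
```

With this rule, free mail providers such as gmail show up as if they were companies, which matches the results listed here.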

For example, searching for Linux returns the following results:

linux-foundation …………….
gmail …………..
redhat ……..
suse …….


Note that “gmail” indicates the use of @gmail.com addresses, so it is not a company result. Searching for Android returns:

google …………………………………………………………..
android ………………….
gmail …..
motorola ..
josefson ..
ziplinegames ..
suse ..
mosabuam ..
.google ..
davemloft ..
linux-foundation ..
sonyericsson ..


Try out the search engine here, or view the source code on GitHub.

Recursive cat

Recursively combine Java files:

 find . -name '*.java' -exec cat {} + > java.text

How to find the URL of the current tab in a Chrome extension

It’s a little hard to find: getCurrent() returns the current hidden page for the extension, not the tab the user is looking at, so use getSelected() instead.

chrome.tabs.getSelected(null, function(tab){
   // tab is the selected tab in the current window
   var baseUrl = tab.url;
});
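
Later Chrome releases deprecated getSelected() in favor of chrome.tabs.query(). A minimal sketch of the same lookup with the newer call follows; the helper name getActiveTabUrl and its chromeApi parameter are hypothetical, added only so the function can be tried outside a browser.

```javascript
// chromeApi is passed in (hypothetical parameter) so the helper can be tried
// outside a browser; inside an extension you would pass the global chrome object.
function getActiveTabUrl(chromeApi, callback) {
  // Ask for the active tab of the window the user is currently in.
  chromeApi.tabs.query({ active: true, currentWindow: true }, function (tabs) {
    callback(tabs.length > 0 ? tabs[0].url : null);
  });
}
```

Inside an extension this would be called as getActiveTabUrl(chrome, function (url) { ... }).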

Reading the Youtube API from PhantomJS

The following code retrieves the duration of a video from the YouTube API. The output must be JSON, because PhantomJS currently doesn’t handle the XML correctly (from email threads, it appears this will be fixed in a future release).

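// Assumes id holds the YouTube video ID.
var page = require('webpage').create();
var address = 'http://gdata.youtube.com/feeds/api/videos/' + id + '?v=2&alt=json';
page.open(address, function (status) {
   // Runs in the page context, where the JSON response is shown inside a pre element.
   var duration = page.evaluate(function(){
       var data = JSON.parse(document.querySelector("pre").innerText);
       return data.entry.media$group.media$content[0].duration;
   });
   console.log("video duration: " + duration);
   phantom.exit();
});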

Scraping Adsense Ads with PhantomJS

PhantomJS is a headless WebKit browser, which lets you run JavaScript in a browser from the command line. It adds API calls which facilitate automated testing, screenshots, and scraping. I thought it would be interesting to write a script to retrieve AdSense destination URLs and text with PhantomJS.

Extracting advertisement blocks requires fairly simple CSS selectors. Google can’t change the format too often, since each publisher must paste a code snippet into their site. Some ad networks render advertisements inside an iframe, so the script may run into browser security restrictions. Extracting ad data from a page of Home Depot’s website gives us the following results:

Drywall Materials Sale, http://www.compare99.com/compare.html%3Fq%3Ddrywall-products%26ort%3DDrywall-Materials-Sale%26adid%3DiaCkp56m1aqplM3OkH6Tp8bUzJKepofRzm52pdrZxJ2eYK7D15aknMLO1lelcMjD2KRYlsnD1W6W
Sheetrock, http://shopping.yahoo.com/search%3B_ylc%3DX3oDMTJ1dGkyY2Y5BF9TAzk2MDc5MjYwBGsDc2hlZXRyb2NrBHNlbV9hY3QDMjYyOTkxMDA5MARzZW1fYWRnAzE5NjgwNTY2MwRzZW1fY21wAzM3NDI5MTMEc2VtX2t3aWQDMTU0NTgwMDE-%3Fp%3Dsheetrock%26sem%3DGoogle
Sheetrock Material Sale, http://www.buycheapr.com/us/result.jsp%3Fga%3Dus19%26q%3Dsheetrock%2Bmaterial
Installation Framing Door, http://www.moifriefacility.com
Architectural GFRG, http://www.sbgrace.com
WallBuilders Library, http://www.logos.com/products/details/2982%3Fgoogleads

I’ve written a short demo, which retrieves ad text and a screenshot for testing. It is invoked as follows (source is below, and on GitHub). The URL is quoted so the shell does not interpret the & characters:

phantomjs adsense.js "http://www.homedepot.com/Building-Materials-Drywall/FibaTape/h_d1/N-5yc1vZar3dZ38m/h_d2/Navigation?catalogId=10053&Nu=P_PARENT_ID&langId=-1&storeId=10051"

The code is almost a little too easy: tell PhantomJS to load a page, run JavaScript in the page context, and parse the AdSense URL format. As a programming paradigm, it’s a little complex to track scope, since some code runs in the PhantomJS context and some in the page context. PhantomJS scripts do not exit when a script ends, because many browser actions are asynchronous. This requires scripts to track state and add phantom.exit() calls at the end of every branch.

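var page = require('webpage').create(),
    t, address, output;
if (phantom.args.length === 0) {
    console.log('Usage: phantomjs adsense.js <URL> [screenshot file]');
    phantom.exit(1);
} else {
    t = Date.now();
    address = phantom.args[0];
    output = phantom.args[1]; // optional screenshot file name
    page.viewportSize = { width: 600, height: 600 };
    // Relay console.log calls made in the page context to the terminal.
    page.onConsoleMessage = function (msg) {
        console.log('Console log: ' + msg);
    };
    page.open(address, function (status) {
        if (status !== 'success') {
            console.log('FAIL to load the address');
            phantom.exit(1);
        } else {
            // Everything inside evaluate() runs in the page context.
            page.evaluate(function () {
                // Split the query string of a click URL into key/value pairs.
                var parse = function(query) {
                    var vars = query.split("?")[1].split("&");
                    var res = {};
                    for (var i = 0; i < vars.length; i++) {
                        var pair = vars[i].split("=");
                        res[pair[0]] = unescape(pair[1]);
                    }
                    return res;
                };
                console.log(document.title);
                var ads = document.querySelectorAll('#googleAdSenseLeft ul li a');
                for (var i=0; i<ads.length; i++){
                     var adQuery = ads[i].href;
                     var adContents = parse(adQuery);
                     adContents.url = adQuery;
                     adContents.text = ads[i].innerText;
                     // One JSON line per ad, relayed by onConsoleMessage.
                     console.log(JSON.stringify(adContents));
                }
            });
            t = Date.now() - t;
            console.log('Loading time ' + t + ' msec');
            // Give the page a moment to finish rendering, then save and exit.
            window.setTimeout(function () {
                if (output) {
                    page.render(output);
                }
                phantom.exit();
            }, 200);
        }
    });
}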

The actual script output is JSON, and a little tedious to read:

Console log: Building Materials - Drywall - FibaTape at The Home Depot
Console log: {"sa":"l","ai":"CtXg4mpkIUPygG6Ol0AGRloCYCcupmcoEi9O58FOzp_mMrgEQByDHxt8eKAxQ_aetrgRgyb6miYyk1A-gAe2HlNYDyAEBqgQbT9CqfdRE5uzDpvzqUgRUNqtZ3ouY_UBn7VGD","num":"7","sig":"AOD64_0jc0_3b0Au9uLLeud6cAI77O6zrQ","adurl":"http://www.compare99.com/compare.html?q=drywall-products&ort=Drywall-Materials-Sale&adid=iaCkp56m1aqplM3OkH6Tp8bUzJKepofRzm52pdrZxJ2eYK7D15aknMLO1lelcMjD2KRYlsnD1W6W","url":"http://www.google.com/aclk?sa=l&ai=CtXg4mpkIUPygG6Ol0AGRloCYCcupmcoEi9O58FOzp_mMrgEQByDHxt8eKAxQ_aetrgRgyb6miYyk1A-gAe2HlNYDyAEBqgQbT9CqfdRE5uzDpvzqUgRUNqtZ3ouY_UBn7VGD&num=7&sig=AOD64_0jc0_3b0Au9uLLeud6cAI77O6zrQ&adurl=http://www.compare99.com/compare.html%3Fq%3Ddrywall-products%26ort%3DDrywall-Materials-Sale%26adid%3DiaCkp56m1aqplM3OkH6Tp8bUzJKepofRzm52pdrZxJ2eYK7D15aknMLO1lelcMjD2KRYlsnD1W6W","text":"Drywall Materials Sale"}
Console log: {"sa":"L","ai":"CHaiampkIUPygG6Ol0AGRloCYCZu8jlqzr4eAA9G9rwcQCCDHxt8eKAxQ7dCHowNgyb6miYyk1A_IAQGqBB5P0Loa2ETp7MOm_KJSgnGFYNjxasFE9TqY0t8TLpE","num":"8","ggladgrp":"2492582816717521168","gglcreat":"9712000621987456871","sig":"AOD64_1WwDM7Zp2jGv1pdrozELP2CSkZUA","adurl":"http://shopping.yahoo.com/search;_ylc=X3oDMTJ1dGkyY2Y5BF9TAzk2MDc5MjYwBGsDc2hlZXRyb2NrBHNlbV9hY3QDMjYyOTkxMDA5MARzZW1fYWRnAzE5NjgwNTY2MwRzZW1fY21wAzM3NDI5MTMEc2VtX2t3aWQDMTU0NTgwMDE-?p=sheetrock&sem=Google","url":"http://www.google.com/aclk?sa=L&ai=CHaiampkIUPygG6Ol0AGRloCYCZu8jlqzr4eAA9G9rwcQCCDHxt8eKAxQ7dCHowNgyb6miYyk1A_IAQGqBB5P0Loa2ETp7MOm_KJSgnGFYNjxasFE9TqY0t8TLpE&num=8&ggladgrp=2492582816717521168&gglcreat=9712000621987456871&sig=AOD64_1WwDM7Zp2jGv1pdrozELP2CSkZUA&adurl=http://shopping.yahoo.com/search%3B_ylc%3DX3oDMTJ1dGkyY2Y5BF9TAzk2MDc5MjYwBGsDc2hlZXRyb2NrBHNlbV9hY3QDMjYyOTkxMDA5MARzZW1fYWRnAzE5NjgwNTY2MwRzZW1fY21wAzM3NDI5MTMEc2VtX2t3aWQDMTU0NTgwMDE-%3Fp%3Dsheetrock%26sem%3DGoogle","text":"Sheetrock"}
Console log: {"sa":"L","ai":"CELX4mpkIUPygG6Ol0AGRloCYCc-MjpECz_OgsCKf8OKPCRAJIMfG3x4oDFDFt4T4-f____8BYMm-pomMpNQPyAEBqgQeT9CaZ9JE6OzDpvyiUrJ702HY8WrBRPU6mNLfEy6R","num":"9","sig":"AOD64_39oakLNPF7SIjdARg9y73otRYZhQ","adurl":"http://www.buycheapr.com/us/result.jsp?ga=us19&q=sheetrock+material","url":"http://www.google.com/aclk?sa=L&ai=CELX4mpkIUPygG6Ol0AGRloCYCc-MjpECz_OgsCKf8OKPCRAJIMfG3x4oDFDFt4T4-f____8BYMm-pomMpNQPyAEBqgQeT9CaZ9JE6OzDpvyiUrJ702HY8WrBRPU6mNLfEy6R&num=9&sig=AOD64_39oakLNPF7SIjdARg9y73otRYZhQ&adurl=http://www.buycheapr.com/us/result.jsp%3Fga%3Dus19%26q%3Dsheetrock%2Bmaterial","text":"Sheetrock Material Sale"}
Console log: {"sa":"L","ai":"C-S13mpkIUPygG6Ol0AGRloCYCaS2oM0D5P7ugla6r8cGEAogx8bfHigMULiOjo_9_____wFgyb6miYyk1A_IAQGqBBhP0Np5zUTr7MOm_LNT0dUZSgBnPsWIwzM","num":"10","sig":"AOD64_0rVOUBr5lndIy_sed-v9kmQBeqjw","adurl":"http://www.moifriefacility.com","url":"http://www.google.com/aclk?sa=L&ai=C-S13mpkIUPygG6Ol0AGRloCYCaS2oM0D5P7ugla6r8cGEAogx8bfHigMULiOjo_9_____wFgyb6miYyk1A_IAQGqBBhP0Np5zUTr7MOm_LNT0dUZSgBnPsWIwzM&num=10&sig=AOD64_0rVOUBr5lndIy_sed-v9kmQBeqjw&adurl=http://www.moifriefacility.com","text":"Installation Framing Door"}
Console log: {"sa":"L","ai":"CnycAmpkIUPygG6Ol0AGRloCYCem7q4oEqYSS7FKunu8KEAsgx8bfHigMUJbYl_L8_____wFgyb6miYyk1A_IAQGqBB5P0IonikTq7MOm_KJShBu3Z9jxasFE9TqY0t8TLpE","num":"11","sig":"AOD64_1MZGFM0lJ7DsgtuZZ-rv2CP6vcxA","adurl":"http://www.sbgrace.com","url":"http://www.google.com/aclk?sa=L&ai=CnycAmpkIUPygG6Ol0AGRloCYCem7q4oEqYSS7FKunu8KEAsgx8bfHigMUJbYl_L8_____wFgyb6miYyk1A_IAQGqBB5P0IonikTq7MOm_KJShBu3Z9jxasFE9TqY0t8TLpE&num=11&sig=AOD64_1MZGFM0lJ7DsgtuZZ-rv2CP6vcxA&adurl=http://www.sbgrace.com","text":"Architectural GFRG"}
Console log: {"sa":"l","ai":"C86EtmpkIUPygG6Ol0AGRloCYCcComAjImM7lA5iY2DAQDCDHxt8eKAxQzZGFtAFgyb6miYyk1A-gAZCCsf8DyAEBqgQbT9D6P8RE7ezDpvzqUgRUNqtZ3ouY_UBn7VGT","num":"12","sig":"AOD64_0n--_8h8e-W75X5eNYIOhLyJ7ezQ","adurl":"http://www.logos.com/products/details/2982?googleads","url":"http://www.google.com/aclk?sa=l&ai=C86EtmpkIUPygG6Ol0AGRloCYCcComAjImM7lA5iY2DAQDCDHxt8eKAxQzZGFtAFgyb6miYyk1A-gAZCCsf8DyAEBqgQbT9D6P8RE7ezDpvzqUgRUNqtZ3ouY_UBn7VGT&num=12&sig=AOD64_0n--_8h8e-W75X5eNYIOhLyJ7ezQ&adurl=http://www.logos.com/products/details/2982%3Fgoogleads","text":"WallBuilders Library"}
Console log: {"sa":"l","ai":"C86EtmpkIUPygG6Ol0AGRloCYCcComAjImM7lA5iY2DAQDCDHxt8eKAxQzZGFtAFgyb6miYyk1A-gAZCCsf8DyAEBqgQbT9D6P8RE7ezDpvzqUgRUNqtZ3ouY_UBn7VGT","num":"12","sig":"AOD64_0n--_8h8e-W75X5eNYIOhLyJ7ezQ","adurl":"http://www.logos.com/products/details/2982?googleads","url":"http://www.google.com/aclk?sa=l&ai=C86EtmpkIUPygG6Ol0AGRloCYCcComAjImM7lA5iY2DAQDCDHxt8eKAxQzZGFtAFgyb6miYyk1A-gAZCCsf8DyAEBqgQbT9D6P8RE7ezDpvzqUgRUNqtZ3ouY_UBn7VGT&num=12&sig=AOD64_0n--_8h8e-W75X5eNYIOhLyJ7ezQ&adurl=http://www.logos.com/products/details/2982%3Fgoogleads","text":"www.logos.com/"}
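
Each click URL carries the advertiser's landing page in its adurl parameter, so it can also be pulled out with a standard URL parser instead of splitting strings by hand. The sketch below assumes plain Node.js (not PhantomJS), where the URL class is available globally; extractAdUrl is a name chosen for illustration.

```javascript
// Return the advertiser's landing page encoded in a click-tracking URL,
// or null if the URL has no adurl parameter.
function extractAdUrl(clickUrl) {
  return new URL(clickUrl).searchParams.get("adurl");
}

// One of the click URLs from the output above.
const clickUrl =
  "http://www.google.com/aclk?sa=l" +
  "&ai=C86EtmpkIUPygG6Ol0AGRloCYCcComAjImM7lA5iY2DAQDCDHxt8eKAxQzZGFtAFgyb6miYyk1A-gAZCCsf8DyAEBqgQbT9D6P8RE7ezDpvzqUgRUNqtZ3ouY_UBn7VGT" +
  "&num=12&sig=AOD64_0n--_8h8e-W75X5eNYIOhLyJ7ezQ" +
  "&adurl=http://www.logos.com/products/details/2982%3Fgoogleads";

console.log(extractAdUrl(clickUrl));
// prints http://www.logos.com/products/details/2982?googleads
```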

Looking at the output, some design decisions made by Google’s engineers become apparent. Google must track all clicks in order to charge advertisers and pay publishers, so every ad link goes through a redirect on google.com, much like a URL shortener. The latency must be low, or else the viewer will give up waiting for a site to load.

Links contain all the information required to load the advertised site, so no database reads are required. The URL contains what appear to be hashes (such as the sig parameter), which presumably prevent a malicious user from modifying the URL. I suspect that these URLs also expire, by including the date in a hashed value. Clicks are likely written to a sharded database (e.g. BigTable; see also Redis, Cassandra, etc.) and reconciled later.
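
To make the guess about signed, expiring links concrete, here is a minimal sketch of how such a scheme could work, using an HMAC over the destination and an expiry time. This is not Google's actual scheme; the secret, the ads.example.com host, the exp parameter, and all function names are hypothetical. It assumes Node.js and its built-in crypto module.

```javascript
const crypto = require("crypto");

// Hypothetical secret; a real system would load this from secure configuration.
const SECRET = "demo-secret";

// Sign the destination together with its expiry time (seconds since epoch).
function sign(adurl, expires) {
  return crypto.createHmac("sha256", SECRET).update(adurl + "|" + expires).digest("hex");
}

// Build a self-contained click URL: destination, expiry, and signature.
function buildClickUrl(adurl, expires) {
  const params = new URLSearchParams({
    adurl: adurl,
    exp: String(expires),
    sig: sign(adurl, expires),
  });
  return "http://ads.example.com/click?" + params.toString();
}

// Return the destination if the link is unmodified and unexpired, else null.
// No database read is needed: everything is checked from the URL itself.
function verifyClickUrl(clickUrl, now) {
  const params = new URL(clickUrl).searchParams;
  const adurl = params.get("adurl");
  const expires = Number(params.get("exp"));
  const sig = params.get("sig") || "";
  if (!adurl || !(expires > now)) return null;
  const given = Buffer.from(sig);
  const expected = Buffer.from(sign(adurl, expires));
  if (given.length !== expected.length) return null;
  return crypto.timingSafeEqual(given, expected) ? adurl : null;
}
```

A redirect server would call verifyClickUrl on each request, log the click, and redirect only when it returns a destination.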

Many thanks to Ariele for editing

Don’t use Access-Control-Allow-Origin

Access-Control-Allow-Origin is an HTTP header that lets a server specify which origins may send it cross-domain AJAX requests. Let’s say you were building an ad network, fetching content via AJAX. You would add this header to HTTP responses, and since it carries only a single origin (or the wildcard *), the server must choose its value for each allowed domain. This is awkward to scale, but it’s a bad idea for other reasons as well.
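
For reference, the usual way to serve several domains is to compare the request's Origin header against an allowlist and echo it back. The sketch below shows that pattern under assumed names: the allowlist entries and the corsHeadersFor helper are hypothetical.

```javascript
// Hypothetical allowlist of publisher sites that may fetch ad content.
const ALLOWED_ORIGINS = new Set([
  "http://publisher-one.example",
  "http://publisher-two.example",
]);

// Return the CORS headers for a request's Origin header,
// or an empty object if the origin is missing or not allowed.
function corsHeadersFor(origin) {
  if (!origin || !ALLOWED_ORIGINS.has(origin)) return {};
  return {
    "Access-Control-Allow-Origin": origin,
    // Tell caches that the response differs per requesting origin.
    "Vary": "Origin",
  };
}

console.log(corsHeadersFor("http://publisher-one.example"));
console.log(corsHeadersFor("http://unknown.example"));
```

In a Node.js http server, the result could be passed straight to res.writeHead(200, corsHeadersFor(req.headers.origin)). Every new publisher still means a change to the allowlist, which is the scaling problem described here.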

Access-Control-Allow-Origin is tempting for a developer, because it allows you to build a lean multi-server set-up without proxying requests. The real problem is entirely outside your control: corporate firewall proxies. The WatchGuard firewall is very aggressive by default, blocking content based on a variety of heuristics. It removes HTTP headers it considers dangerous, including Access-Control-Allow-Origin, so a site built with this header will never work for anyone behind that firewall.

The header is primarily there for the client-side browser, which uses it to enforce cross-origin request policies. This protects end-users from malicious JavaScript. For example, JavaScript might be inserted into a blog comment and, if incorrectly escaped, could run when a visitor loads the page, modifying content or redirecting the user to another domain. In spite of this, the header is apparently too risky for some proxies, so be careful.

1/3 of old Flippa website auctions point to abandoned sites

Flippa is an auction site for buying and selling websites as businesses. Browsing the listings shows many low quality products. With careful inspection, there are often interesting, quality listings, but they are swallowed in the noise. Occasionally there are successful e-commerce sites, un-maintained high-traffic developer forums, or fire-sales on start-ups. Often these are educational, but usually incomplete. Almost as interesting is the proliferation of scams, from the obvious puffery to elaborately faked numbers.

I’d like to see how auctioned goods fare long after the listings end. If buyers determine maintenance isn’t worth the effort, they abandon the project, having lost the money they spent. If sellers can’t find a buyer, they too will abandon goods not worth the time and price of hosting. Flippa provides a tool to browse open and recent sales by tag which provides insight into recent listings.

Exploring abandoned projects may be a way to find a few good dropped domains, and possibly a fraud detection algorithm or website background checker. Using some scraped data, I extracted each domain name from 76,513 auctions, and ran tests to determine which domains are unresponsive, parked, or still operating. These auctions are a mix of domain sales and full websites – some listings contain more than one domain, but this data set represents only the main item for sale. Only 1,028 listings are missing domains.

Out of 41,172 distinct domains, 11,723 are unresponsive and 2,178 are parked, mostly by Godaddy (the parked numbers are likely lower than reality).
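
The "1/3" in the title follows directly from these counts. A quick check, using only the numbers above:

```javascript
// Counts from the paragraph above.
const total = 41172;
const unresponsive = 11723;
const parked = 2178;

// Format a share of the whole as a percentage with one decimal place.
function percent(part, whole) {
  return ((100 * part) / whole).toFixed(1) + "%";
}

console.log("unresponsive: " + percent(unresponsive, total));       // 28.5%
console.log("parked: " + percent(parked, total));                   // 5.3%
console.log("abandoned: " + percent(unresponsive + parked, total)); // 33.8%
```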


Let’s dig deeper. These are probably low-cost, low-quality listings. Let’s check:

Status | Average Price | Median Price
All | $34761 | $147
Working | $52000 | $160
Missing | $159000 | $3000
Parked | $746 | $110

Note these are only from records that have pricing or link data, not all domains. Not surprisingly, the medians are a lot lower than the averages. There are a number of sellers building one-off sites, easily identified through templated listings.

Flippa computes a few fields to aid buyers in their due-diligence. Google Links is a measure of how many pages are indexed in a search engine, obtained by searching for “site:garysieling.com”. This is a measure of site size:

Status | Avg Google Links | Median Google Links
All | 64304 | 3
Working | 82125 | 4
Missing | 99 | 5
Parked | 18 | 2

Inbound Links is a measure of how many links point to a site, obtained through Majestic SEO. In my experience these metrics vary widely, depending on the speed and thoroughness of the scraper; Google Webmaster Tools tends to show higher and more immediate results than other tools. Nonetheless, the numbers are likely useful when compared within a single tool.

This is a measure of SEO value:

Status | Avg Inbound Links | Median Inbound Links
All | 230341 | 254
Working | 75252 | 314
Missing | 228542 | 4000
Parked | 5491 | 77

I’m surprised that the median price of unmaintained domains is as high as it is. The average may be high due to very high Buy It Now (BIN) prices; some of the highest prices I’ve seen in the database are from fraudulent auctions. For SEO value, it would seem that there are some good domains out there that are unowned or not being used (my data does not currently distinguish). I also checked whether advertising types (CPA/CPC/Text/Affiliate) and advertising companies are referenced evenly across each type of listing, and found nothing significant to note.

Some portion of those sales must be fraudulent. Let’s see how that plays out in the data:

Status | Seller Suspended | Seller Banned | Seller Ok | Percent Ok
All | 3956 | 1728 | 35844 | 84%
Working | 2298 | 1024 | 24190 | 87%
Missing | 1419 | 597 | 9794 | 80%
Parked | 239 | 107 | 1860 | 82%

Some sellers have been banned, but it remains a crude predictor of future performance. Who abandons sites more, buyers or sellers?

Domain Status | Sold | Unsold | Negotiated Post Auction | Private Sale (Sold) | Private Sale (Ended) | Unknown
All | 25989 | 15823 | 518 | 132 | 688 | 373
Working | 16795 | 10343 | 383 | 85 | 486 | 321
Missing | 7767 | 4621 | 102 | 45 | 188 | 31
Parked | 1427 | 859 | 33 | 2 | 14 | 21

It looks like private sales don’t do too well, in terms of percentage sold. (Edit: Fixed two numbers I had transposed. I’m assuming “Private Sale Ended” means the auction ended without a sale, but that may be incorrect. This may also be an artifact of how the data is scraped.)

The ratios of Sold to Unsold for each category are almost all the same (1.6-1.7), which indicates that buyers could do much better at due diligence. The number of unsold listings indicates that sellers could do a lot better at finding ways to provide value, a fact clear to anyone who has casually browsed auction listings. In the end, 4,000+ sites sold and abandoned represents a bit over $7,000,000. As they say, caveat emptor.
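
The Sold to Unsold ratios can be recomputed from the Sold and Unsold columns of the table above:

```javascript
// Sold and Unsold counts per domain status, copied from the table above.
const outcomes = {
  All: { sold: 25989, unsold: 15823 },
  Working: { sold: 16795, unsold: 10343 },
  Missing: { sold: 7767, unsold: 4621 },
  Parked: { sold: 1427, unsold: 859 },
};

// Ratio of sold to unsold listings, rounded to two decimal places.
function soldRatio(row) {
  return (row.sold / row.unsold).toFixed(2);
}

for (const status of Object.keys(outcomes)) {
  console.log(status + ": " + soldRatio(outcomes[status]));
}
// All: 1.64, Working: 1.62, Missing: 1.68, Parked: 1.66
```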

For further discussion, check out this thread at experienced-people.net.


Detecting parked domains

Looking at old Flippa auctions that I scraped, I wanted to determine which domains are parked. I found this post, which describes a few options, including checking DNS entries for redirects, finding a blacklist, or inspecting page content. Some people have built APIs, but none appear to be maintained, and WHOIS lookups are rate-limited. Google also has an internal, proprietary method for doing this.

I’ve compiled a list of text included in parked pages. I found that GoDaddy is a very common domain parker, and encompasses most of the results in my dataset, but the others are useful too. Many hosting companies operate through common providers, so the text often follows a pattern. A portion of parked sites sell their traffic to third parties, often “mom” sites – these can’t be detected through this method. If you find a parked domain that doesn’t match one of these strings, please comment below!

Identifying parked domains (buydomains.com)

The domain outfit.com is for sale. To purchase, call BuyDomains.com at +1 339-222-5147 or 866-836-6791. Click here for more details.


This domain is for sale. Click here for more information.


This page provided to the domain owner free by Sedo's Domain Parking. Disclaimer: Domain owner and Sedo maintain no relationship with third party advertisers. Reference to any specific service or trade mark is not controlled by Sedo or domain owner and does not constitute or imply its association, endorsement or recommendation.


This page is provided courtesy of GoDaddy.com

NTCHosting, and other hosts (NTCHosting and astmonastir.com vary):

NTCHosting has registered astmonastir.com for one of its customers.

Suspended pages hosted on cPanel hosts:

This page has been suspended

LiquidNet (and many others):

LiquidNet Ltd Hosting has registered valuesalesdirect.com for one of its users.

Domains that are listed “for sale”, for the right price:

"The domain back.in may be for sale. Click here for details."
"The domain back.in is for sale. Click here for details."
DNS設定が見つかりません! / DNS ERROR!

I’m not sure which company hosts this one, but it’s very common. These two strings are included in a JavaScript script block.
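
The strings above can be turned into a simple content-inspection check. The sketch below is one way to do it, not the script used for the Flippa data: the labels and the detectParked name are made up, and the regular expressions generalize the sample strings by replacing the domain name with a wildcard. It assumes the page HTML has already been fetched.

```javascript
// Patterns derived from the sample strings above; labels are arbitrary.
// More specific patterns come first, since the first match wins.
const PARKED_PATTERNS = [
  { label: "buydomains", pattern: /To purchase, call BuyDomains\.com/i },
  { label: "sedo", pattern: /provided to the domain owner free by Sedo's Domain Parking/i },
  { label: "godaddy", pattern: /This page is provided courtesy of GoDaddy\.com/i },
  { label: "hosting-placeholder", pattern: /has registered \S+ for one of its (customers|users)/i },
  { label: "suspended", pattern: /This page has been suspended/i },
  { label: "for-sale", pattern: /(This domain|The domain \S+) (is|may be) for sale/i },
  { label: "dns-error", pattern: /DNS設定が見つかりません/ },
];

// Return the label of the first matching pattern, or null if none match.
function detectParked(html) {
  for (const { label, pattern } of PARKED_PATTERNS) {
    if (pattern.test(html)) return label;
  }
  return null;
}

console.log(detectParked("The domain back.in may be for sale. Click here for details."));
// prints for-sale
```

As noted earlier, parked sites that sell their traffic to third parties will not match any of these strings.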


Fixing org.apache.solr.common.SolrException: Length Required

I received the following exception, after making no code changes:

org.apache.solr.common.SolrException: Length Required

“Length Required” is the HTTP 411 status: CommonsHttpSolrServer does not send a Content-Length header with updates. The root cause in my case was switching the front-end proxy from Apache to Nginx, which is apparently stricter about headers.


Detecting auction spam with Weka

Weka is an open-source data-mining tool written in Java, providing a host of data mining algorithms. I am using it to build a proof-of-concept model that can classify auctions based on their value: fraudulent listing, zero valued listing, overpriced listing, or underpriced listing. I’ve scraped some data from Flippa, a website/business auction site, to facilitate data mining experiments, particularly to see how difficult it would be to detect spam or fraudulent auctions.

An ideal classifier would identify which listings are over-priced, under-priced, or worthless due to fraud or puffery. I suspect that many auctions fit into the zero-valued and fraudulent categories. In browsing listings, one sees domains with WordPress and an installed template, but no true potential, trademark infringement, unfixable copyright infringement (for example, 10,000 articles about movies copied from IMDB), etc. There are high risk assets with potential, such as discarded startups. Some sites have high traffic, but are declining due to Google algorithm changes or to being in the MySpace ecosystem.

In previous experiments, I noted that template-based auctions can be detected programmatically, in unexpected ways. While this does not reveal fraud, it does identify sellers who build sites solely to sell on Flippa. The data set contained attributes for whether the seller has been banned, what advertisers are referenced in the auction, description length, and number of header tags used in the description. Generally these are meant to determine if an auction’s text is built from a template.

Only a small portion of sellers are banned. A naive algorithm to predict whether an auction is from a banned seller would assume that all auctions are good; this achieves about 96% accuracy. Many algorithms in Weka fall back to this behavior if the data is otherwise inconclusive. This is a terrible algorithm from a buyer’s point of view: it is better to err on the side of false positives to discourage risky purchases.

In this test, each auction has 129 attributes as detailed above. There is a boolean attribute for each advertising company that may be mentioned in a listing. I generated Weka’s ARFF file directly from Postgres. The best performing algorithm was BayesNet.
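
For readers who have not seen ARFF, it is a plain-text format: a header declaring each attribute and its type, followed by comma-separated data rows. The sketch below builds such a file from in-memory rows rather than from Postgres. The toArff helper, the attribute names, and the sample rows are all made up for illustration, and it does not quote values containing commas or spaces.

```javascript
// Build ARFF text. Each attribute has a name and either a type string
// such as "numeric" or an array of allowed nominal values.
function toArff(relation, attributes, rows) {
  const lines = ["@relation " + relation, ""];
  for (const attr of attributes) {
    const type = Array.isArray(attr.type) ? "{" + attr.type.join(",") + "}" : attr.type;
    lines.push("@attribute " + attr.name + " " + type);
  }
  lines.push("", "@data");
  for (const row of rows) {
    const cells = attributes.map(function (attr) {
      const value = row[attr.name];
      // ARFF marks missing values with a question mark.
      return value === undefined || value === null ? "?" : String(value);
    });
    lines.push(cells.join(","));
  }
  return lines.join("\n") + "\n";
}

// Hypothetical attributes and rows; the real data set has 129 attributes.
const attributes = [
  { name: "description_length", type: "numeric" },
  { name: "header_tags", type: "numeric" },
  { name: "mentions_adsense", type: ["true", "false"] },
  { name: "seller_banned", type: ["true", "false"] },
];
const rows = [
  { description_length: 1200, header_tags: 3, mentions_adsense: true, seller_banned: false },
  { description_length: 85, header_tags: 0, mentions_adsense: false, seller_banned: true },
];

console.log(toArff("auctions", attributes, rows));
```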

This is the result of the naive algorithm, which assumes everything is ok:
Correctly Classified Instances: 48943 95.9836 %
Incorrectly Classified Instances: 2048 4.0164 %

This is the result of BayesNet which is, technically, a worse result:
Correctly Classified Instances: 48853 95.8071 %
Incorrectly Classified Instances: 2138 4.1929 %

However, comparing the confusion matrices, all of the errors BayesNet adds are false positives: it flags 146 good auctions as coming from banned sellers, and in exchange it correctly catches 56 auctions from banned sellers that the naive algorithm misses entirely. A small victory, certainly, which more than anything shows the amount of work needed to tune these algorithms.

=== Confusion Matrix, naive ===
not banned, banned
48943 0 (not banned)
2048 0 (banned)

=== Confusion Matrix, Bayes Net ===
not banned, banned
48797 146 (not banned)
1992 56 (banned)
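
Accuracy hides the difference between the two classifiers, so it helps to compute precision and recall for the "banned" class from the same matrices. The sketch below treats "banned" as the positive class; the metrics helper and the tn/fp/fn/tp field names are introduced here for illustration.

```javascript
// Cells follow the matrices above: rows are actual classes, columns are
// predictions, and "banned" is treated as the positive class.
const naive = { tn: 48943, fp: 0, fn: 2048, tp: 0 };
const bayesNet = { tn: 48797, fp: 146, fn: 1992, tp: 56 };

// Precision and recall are null when their denominator is zero.
function metrics(m) {
  const total = m.tn + m.fp + m.fn + m.tp;
  return {
    accuracy: (m.tn + m.tp) / total,
    precision: m.tp + m.fp === 0 ? null : m.tp / (m.tp + m.fp),
    recall: m.tp + m.fn === 0 ? null : m.tp / (m.tp + m.fn),
  };
}

console.log("naive", metrics(naive));
console.log("BayesNet", metrics(bayesNet));
```

The naive classifier never predicts "banned", so its recall is zero and its precision is undefined. BayesNet recovers a small fraction of banned sellers, at the cost of 146 false alarms.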

To improve this in future, I will augment the source data with new fields that may indicate problems, such as the ratio of price to traffic. Whether a seller has been banned is a crude way of identifying low quality listings, as the ban may be for only one listing. In the future I will also look at whether sites continue to exist after the sale, whether they use trademarked content, and whether they are re-sold and hold their value.

Thanks to Ariele for editing