Discovering Corporate Open Source Contributions

A year or two ago, I saw a Microsoft director deliver the keynote at a conference. He claimed that Microsoft had entered a new era of supporting open source development, which raised a few eyebrows considering the company’s history. This was at a Solr conference, which made me think it would be interesting to build a search engine to test these claims.

Microsoft’s position is not one of altruism; they freely admit that it is born of a desire to make various systems work well on Azure. You can see this in their open source marketing materials, which show a sudden interest in PHP, Node.js, Python, etc.

Microsoft started building Azure in 2008, made it commercially available in 2010, and in 2012 released an update that added support for Node.js and PHP.

I built a tool to test this, searching the commit history of several popular open source repositories, including the ones Microsoft mentions specifically. You can see here that Microsoft had a number of commits from 2010-2012, with the peak in 2011. This is likely more of a tactical play on their part than a long-term strategy, but better than some alternatives they’ve pursued in the past.

[Figure: corporate]

The numbers above count commits to the GitHub repositories listed below whose author has an @microsoft.com email address (so some contributions may be missing, for instance if someone emailed in a patch).
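As a rough illustration of that count, here is a minimal sketch in plain Java rather than the Solr query used for the chart. It assumes each commit has already been exported as a line of the form `author-email|YYYY-MM-DD`; the class name `CommitCounter`, the method `countByYear`, and the sample addresses are all made up for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CommitCounter {
    // Counts commits per year whose author email ends with "@" + domain.
    // Assumed input format (not from the original tool): "email|YYYY-MM-DD".
    static Map<Integer, Integer> countByYear(List<String> logLines, String domain) {
        Map<Integer, Integer> counts = new TreeMap<>();
        String suffix = "@" + domain.toLowerCase();
        for (String line : logLines) {
            String[] parts = line.split("\\|");
            if (parts.length != 2) {
                continue; // skip lines that do not match the assumed format
            }
            String email = parts[0].trim().toLowerCase();
            String date = parts[1].trim();
            if (!email.endsWith(suffix) || date.length() < 4) {
                continue;
            }
            try {
                int year = Integer.parseInt(date.substring(0, 4));
                counts.merge(year, 1, Integer::sum);
            } catch (NumberFormatException e) {
                // ignore lines whose date does not start with a four-digit year
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample data, for illustration only.
        List<String> sample = List.of(
            "a.b@microsoft.com|2010-03-01",
            "c.d@microsoft.com|2011-07-15",
            "e.f@example.org|2011-08-02",
            "a.b@microsoft.com|2011-11-30");
        System.out.println(countByYear(sample, "microsoft.com"));
    }
}
```

One way to produce input in this shape would be `git log --pretty=format:'%ae|%ad' --date=short`, run in each repository. Matching on the email suffix has the limitation already noted: anyone committing from a personal address is not counted.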

The search index covers the following projects: Drupal, Git, Lucene, Lucene.NET, Solr, Mono, Node.js, PHP, Postgres, and Vagrant. Collectively this represents 2.1 GB of Git history, including 232,839 commits, which compresses to 132 MB when indexed in Solr.

The indexing process is quite simple: a small Java class copies data from Git to Solr. You can also use the resulting index to find who the senior developers are on various projects:

[Figure: uml]

Inferring a commit author’s company can be trivially accomplished from an email address:

// x.y@google.com => google.com
String company = emailAddress.split("@")[1];

if (company.contains("."))
{
  // google.com => google
  company = company.substring(0, company.lastIndexOf("."));
}

return company;
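The fragment above assumes every address contains an "@", and it strips only the last label, so an address at mail.google.com would come out as "mail.google". The sketch below is a variation, not the code used for the index: it guards against malformed addresses and keeps the second-to-last domain label instead. The class name `CompanyGuesser` and the method `companyFromEmail` are hypothetical.

```java
import java.util.Optional;

class CompanyGuesser {
    // Hypothetical helper: best-guess company name for an email address,
    // or empty if the address has no usable domain part.
    static Optional<String> companyFromEmail(String emailAddress) {
        if (emailAddress == null) {
            return Optional.empty();
        }
        int at = emailAddress.lastIndexOf('@');
        if (at < 0) {
            return Optional.empty(); // no "@" at all
        }
        String domain = emailAddress.substring(at + 1).trim().toLowerCase();
        if (domain.isEmpty()) {
            return Optional.empty(); // nothing after the "@"
        }
        String[] labels = domain.split("\\.");
        if (labels.length < 2) {
            return Optional.of(domain); // single-label domain, e.g. "localhost"
        }
        // research.microsoft.com => microsoft
        return Optional.of(labels[labels.length - 2]);
    }

    public static void main(String[] args) {
        System.out.println(companyFromEmail("x.y@google.com"));
        System.out.println(companyFromEmail("x@research.microsoft.com"));
        System.out.println(companyFromEmail("not-an-address"));
    }
}
```

Taking the second-to-last label is still a heuristic. It gives the wrong answer for country-code domains such as example.co.uk, where it would return "co", so treat the result as a guess rather than an identification.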

In the end, this shows not just that Microsoft has made some overtures to the rest of the software world, but also that you can readily combine journalistic research with coding to produce interesting results. Any activity that occurs in public and on a regular basis can be queried for answers, given the will and patience to look and some basic technical knowledge.

If you want to look into this further, you may find the following resources of interest: