My First A/B Test… with Results

A/B testing gets a lot of attention on Hacker News, inbound.org, and other forums, and appeals to me as a data analysis exercise. As a software engineer with a practical bent, I like the concept of data analysis techniques which produce useful results while treating a system as a black box. This stands in contrast to algorithms that aim to analyze data and tell a story, for instance applying agent-based modeling to political science and the study of war and peace.

Testing two variations of a site to see how people react turns out to be extremely difficult to tinker with on a blog. You need to have the right sort of problems to justify using statistics, and it’s challenging to create those problems to happen to justify the experiment. From another angle, I’ve long been leery of using split testing at all as keeping WordPress stable has been a real pain so I prefer to avoid additional operational complexity.

Enter Adzerk, which provides a hosted ad server, removing the need for additional infrastructure. The function of an ad server is to let you upload media (pictures, etc) and set up business rules for when to display each “ad”, effectively letting you run something resembling an Adsense clone, minus all the AI. Adzerk has a nice free plan, which covers you up to a ridiculously high number of impressions, so it’s not really necessary. I’ve been really happy with the site, although from their perspective I’m likely a terrible customer, as they’re not making any money off me.

The logical products to promote on a programming site seem like jobs, books, and developer tools – right now I’m just running campaigns on a couple sites consisting of links to Amazon pages. Here’s a screenshot a campaign set up very recently:

Screen Shot 2013-05-22 at 9.36.38 PM

Once you get enough impressions to compare, you can just turn off each entry and make new ones for the next test. There’s no particular reason you have to use traditional display advertising with this – as cheap as it is, one could easily build a very dynamic site using their API, for instance to pump in hot news, thumbnails for suggested stories, etc.

I’m running campaigns on a couple different sites – After a slow build-up in readership, and a burst of Hacker News traffic, I was able to finally achieve statistically significant results for clickthroughs on a test of four Javascript books:

TitleImpressionsClickthrough %Clicks
Javascript the Good Parts98010.220
Javascript Patterns100920.2222
Ext CookBook99440.1111
ExtJS in Action100600.4343

 

To test this, I made a table of each combination of values, comparing them using the chi-squared test on Evan’s Awesome A/B testing tools. Fortunately this showed ExtJS in Action to be a clear winner over all the rest as I hoped- one risk of this technique is that there is a clear loser or a couple equivalent winners, where I was hoping for one to be the best.

xno differenceno differenceloses
no differencexno differenceloses
no differenceno differencexloses
winswinswinsx

 

In all, this took around nine months to achieve, but with little effort on my part. The lesson I take is this: while this is fun, it may be better to look for big wins elsewhere- especially borrowing ideas from people who already run these tests. Additionally, the tests would complete much faster running only two variations, and many can run in parallel.

Building a full-text index of git commits using lunr.js and Github APIs

Github has a nice API for inspecting repositories – it lets you read gists, issues, commit history, files and so on. Git repository data lends itself to demonstrating the power of combining full text and faceted search, as there is a mix of free text fields (commit messages, code) and enumerable fields (committers, dates, committer employers). Github APIs return JSON, which has the nice property of resembling a tree structure – results can be recursed over without fear of infinite loops. Note that to download the entire commit history for a repository, you need to page through it by sha hash. The API I use here lacks diffs, which must be retrieved elsewhere.

To test this, access a URL like so. The configurable arguments are the repository owner and name fields.
https://api.github.com/repos/torvalds/linux/commits

This is what a commit looks like:

{
  "sha": "7638417db6d59f3c431d3e1f261cc637155684cd",
  "url": "https://api.github.com/repos/octocat/Hello-World/git/commits/7638417db6d59f3c431d3e1f261cc637155684cd",
  "author": {
    "date": "2008-07-09T16:13:30+12:00",
    "name": "Scott Chacon",
    "email": "[email protected]"
  },
  "committer": {
    "date": "2008-07-09T16:13:30+12:00",
    "name": "Scott Chacon",
    "email": "[email protected]"
  },
  "message": "my commit message",
  "tree": {
    "url": "https://api.github.com/repos/octocat/Hello-World/git/trees/827efc6d56897b048c772eb4087f854f46256132",
    "sha": "827efc6d56897b048c772eb4087f854f46256132"
  },
  "parents": [
    {
      "url": "https://api.github.com/repos/octocat/Hello-World/git/commits/7d1b31e74ee336d15cbd21741bc88a537ed063a0",
      "sha": "7d1b31e74ee336d15cbd21741bc88a537ed063a0"
    }
  ]
}

To make the test simple, I download these as JSON locally, then start a python webserver. Were I to make many such calls on a public site, I’d set up a proxy to the github APIs.

python -m SimpleHTTPServer

This data has a number of nested objects and must be flattened to fit into the lunr.js full-text index. This example uses the commit number (0, 1, 2..N) as the location in the index, but a real environment should use the commit hash to allow partitioning the ingestion process. Nested objects are flattened by joining subsequent keys with underscores in between. A production-worthy solution needs to escape these to prevent collisions.

var documents = [];
 
function recurse(doc_num, base, obj, value) {
  if ($.isPlainObject(value)) {
    $.each(value, function (k, v) {
      recurse(doc_num, base + obj + "_", k, v);
    });
  } else {
    process(doc_num, base + obj, value);
  }
}
 
function process(doc_num, key, value) {
  if (documents.length <= doc_num)
    documents[doc_num] = {};
 
  if (value !== null)
    documents[doc_num][key] = value + '';
}
 
$.each(data, function(doc_num, commit) {
  $.each(commit, function(k, v) {
    recurse(doc_num, '', k, v)
  });
});

Normally, one sets up a lunr full-text index by specifying all the fields, much like Solr’s numerous XML config files. Lunr doesn’t have nearly as many configuration options, since you only specify the ‘boost’ parameter to increase the value of certain fields in ranking. I imagine this will change as the project grows, at the very least to include type hints.

Given the simplicity of field objects, you can infer infer the field list from JSON payloads. The code below provides two modes, one where you inspect the entire JSON payload, or one where you limit how many commits you check, a good option when JSON data is consistent.

The function accepts configuration objects resembling ExtJS config objects, which lets you override as desired. If fields derived from existing data are required, they can be inserted after any documents are inserted.

function inferIndex(documents, config) {
  return lunr(function() {
    this.ref('id');
    var found = {};
    var idx = this;
 
    $.each(documents,
      function(doc_num, doc) {
 
        if (config &&
            config.limit &&
            config.limit < doc_num)
          return;
 
        $.each(doc, function(k, v) {
          if (!found[k]) {
            if (config && config[k]) {
              idx.field(k, config[k]);
            } else {
              idx.field(k);
            }
            found[k] = true;
          }
        });
    });
  });
}
 
var index =
  inferIndex(documents,
    {limit: 1,
     'commit_author_name':{boost:10}});

Inserting flattened documents into the index becomes simple. The method below provides a callback, should you desire to add calculated fields fields.

$.each(documents,
  function(doc_num, attrs, doc_cb) {
    var doc =
      $.extend(
        {id: doc_num}, attrs);
 
    if (doc_cb) {
      doc = doc_cb(doc);
    }
 
    index.add(doc);
});

At this point we’ve indexed the entire commit history from a git repository, which lets us search for commits by topic. While this is useful, it’d be really nice to be able to facet on fields, which would return the number of documents in a category, like a SQL group by. I’ve found it particularly convenient to facet on author, date, or author’s company.

If you have access to the original documents, you can easily construct facets based on the results of a lunr search:

function facet(index, query, data, field) {
  var results = index.search(query);
 
  var facets = {};
  $.each(results, function(index, searchResult) {
    var doc = data[searchResult.ref];
 
    facets[doc[field]] =
      (facets[doc[field]] === undefined ? 0 :
      facets[doc[field]]) + 1; } );
 
  return facets;
}

Commit messages in repositories where I work often contain names of clients who requested a feature or bug fix. Consequently doing a search faceted by author provides a list of who worked with each client the most – this can also tell you who has worked with various pieces of technology.

The following query demonstrates this technique:

var facets =
   facet(index,
        'driver',
        documents,
        'commit_author_name');
{"Wolfram Sang":24,"Linus Torvalds":3}

The approach shown here works well, but requires retrieving results requires access to the original document data. If we want to filter the results to a category, we need a richer search API than lunr currently provides, as well as callback options within the search API. In Solr there are also options to skip lower-casing data, as that may be inappropriate for category titles. Mitigating these issues will be explored further in future essays.

If you enjoyed this, you may also be interested in:

Full-Text Indexing PDFs in Javascript

I once worked for a company that sold access to legal and financial databases (as they call it, “intelligent information“). Most court records are PDFS available through PACER, a website developed specifically to distribute court records. Meaningful database products on this dataset require building a processing pipeline that can extract and index text from the 200+ million PDFs that represent 20+ years of U.S. litigation. These processes can take many months of machine time, which puts a lot of pressure on the software teams that build them.

Mozilla Labs received a lot of attention lately for a project impressive in it’s ambitions: rendering PDFs in a browser using only Javascript. The PDF spec is incredibly complex, so best of luck to the pdf.js team! On a different vein, Oliver Nightingale is implementing a Javascript full-text indexer in the Javascript – combining these two projects allows reproducing the PDF processing pipeline entirely in web browsers.

As a refresher, full text indexing lets a user search unstructured text, ranking resulting documents by a relevance score determined by word frequencies. The indexer counts how often each word occurs per document and makes minor modifications the text, removing grammatical features which are irrelevant to search. E.g. it might subtract “-ing” and change vowels to phonetic common denominators. If a word shows up frequently across the document set it is automatically considered less important, and it’s effect on resulting ranking is minimized. This differs from the basic concept behind Google PageRank, which boosts the rank of documents based on a citation graph.

Most database software provides full-text indexing support, but large scale installations are typically handled in more powerful tools. The predominant open-source product is Solr/Lucene, Solr being a web-app wrapper around the Lucene library. Both are written in Java.

Building a Javascript full-text indexer enables search in places that were previously difficult such as Phonegap apps, end-user machines, or on user data that will be stored encrypted. There is a whole field of research to encrypted search indices, but indexing and encrypting data on a client machine seems like a good way around this naturally challenging problem.

To test building this processing pipeline, we first look at how to extract text from PDFs, which will later be inserted into a full text index. The code for pdf.js is instructive, in that the Mozilla developers use browser features that aren’t in common use. Web Workers, for instance, let you set up background processing threads.

The pdf.js APIS make heavy use of Promises, which hold references in code to operations that haven’t completed yet. You operate on them using callbacks:

var pdf = PDFJS.getDocument('http://www.pacer.gov/documents/pacermanual.pdf');
 
var pdf = PDFJS.getDocument('pacermanual.pdf');
pdf.then(function(pdf) {
 // this code is called once the PDF is ready
});

This API seems immature yet- ideally you should be able to do promise.then(f(x)).then(g(x)).then(h(x)) etc, but that isn’t yet available.

For rendering PDFs the Promise pattern makes a lot of sense, as it leaves room for parallelizing the rendering process. For merely extracting the text from a PDF it feels like a lot of work – you need to be confident that your callbacks run in order and track which one is last.

The following demonstrates how to extract all the PDF text, which is then printed to the browser console log:

‘use strict’;
var pdf = PDFJS.getDocument('http://www.pacer.gov/documents/pacermanual.pdf');
 
var pdf = PDFJS.getDocument('pacermanual.pdf');
pdf.then(function(pdf) {
 var maxPages = pdf.pdfInfo.numPages;
 for (var j = 1; j <= maxPages; j++) {
    var page = pdf.getPage(j);
 
    // the callback function - we create one per page
    var processPageText = function processPageText(pageIndex) {
      return function(pageData, content) {
        return function(text) {
          // bidiTexts has a property identifying whether this
          // text is left-to-right or right-to-left
          for (var i = 0; i < text.bidiTexts.length; i++) {
            str += text.bidiTexts[i].str;
          }
 
          if (pageData.pageInfo.pageIndex ===
              maxPages - 1) {
            // later this will insert into an index
            console.log(str);
          }
        }
      }
    }(j);
 
    var processPage = function processPage(pageData) {
      var content = pageData.getTextContent();
 
      content.then(processPageText(pageData, content));
    }
 
    page.then(processPage);
 }
});

It’s not trivial to identify where headings and images are. This would require hooking into the rendering code, and possibly a deep understanding of PDF commands (PDFs appear to be represented as stream of rendering commands, similar to RTF).

Lunr
Creating a Lunr index and adding text is straightforward- all the APIs operate on JSON bean objects, which is a pleasantly simple API:

doc1 = {
    id: 1,
    title: 'Foo',
    body: 'Foo foo foo!'
  };
 
doc2 = {
    id: 2,
    title: 'Bar',
    body: 'Bar bar bar!'
  }
 
doc3 = {
    id: 3,
    title: 'gary',
    body: 'Foo Bar bar bar!'
  }
 
index = lunr(function () {
    this.field('title', {boost: 10})
    this.field('body')
    this.ref('id')
  })
 
// Add documents to the index
index.add(doc1)
index.add(doc2)
index.add(doc3)

Searching is simple – one neat tidbit I found is that you can inspect the index easily, since it’s just a JS object:

// Run a search
index.search(“foo”)
 
// Inspect the actual index to see which docs match a term
index2.tokenStore.root.f.o.o.docs

When I was first introduced to full-text indexing, I was confused by what is meant by a “document” – this generalizes beyond a PDF or Office document to any database row, possibly including large blobs of text.

Full-text search would be pretty dumb if you had to build the index every time, and Lunr makes it really easy to serialize and deserialize the index itself:

var serializedIndex = JSON.stringify(index1.toJSON())
var deserializedIndex = JSON.parse(serializedIndex)
var index2 = lunr.Index.load(deserializedIndex)

Index.toJSON also returns a “bean” style object (not a string). I’ve never seen an API like this, and I really like the idea – it gives you a clean Javascript object with only the data that requires serialization.

The following are attributes of the index:

  • corpusTokens – Sorted list of tokens
  • documentStore – list of each document – catenate
  • fields – The fields used to describe each document (similar to database columns)
  • pipeline – The pipeline object used to process tokens
  • tokenStore – Where and how often words are referenced in each document

One great thing about this type of index is that the work can be done in parallel and then combined as a map-reduce job. Only three entries from the above object need to be combined, as “fields” and “pipeline” are static. The following demonstrates the implementation of the reduction step (note jQuery is referenced):

(function reduce(a, b) {
  var j1 = a.toJSON();
  var j2 = b.toJSON();
 
  // The "unique" function does uniqueness by sorting,
  // which we need here.
  var corpusTokens =
      $.unique(
          $.merge(
              $.merge([], j1.corpusTokens),
                           j2.corpusTokens));
 
  // It's important to create new arrays and
  // objects throughout, or else you modify 
  // the source indexes, which is disastrous.
  var documentStore =
     {store: $.extend({},
                      j1.documentStore.store,
                      j2.documentStore.store),
      length: j1.documentStore.length + j2.documentStore.length};
 
  var jt1 = j1.tokenStore;
  var jt2 = j2.tokenStore;
 
  // The 'true' here triggers a deep copy
  var tokenStore = {
    root: $.extend(true, {}, jt1.root, jt2.root),
    length: jt1.length + jt2.length
  };
 
  return {version: j1.version,
          fields: $.merge([], j1.fields),
          ref: j1.ref,
          documentStore: documentStore,
          tokenStore: tokenStore,
          corpusTokens: corpusTokens,
          pipeline: $.merge([], j1.pipeline)};
})(index1, index2)

I tested this by creating three indexes: index1, index2, and index3. index1 is {doc1}, index2 is {doc2, doc3}, and index3 is {doc1, doc2, doc3}. To test the code, you need simply diff:

JSON.stringify(index3.toJSON())
 
JSON.stringify(combine(index1, index2))

Possibilities

Overall this technique has a lot of wasted network I/O, making this seem silly. On the other hand, there are listings on ebay and fiverr selling for “traffic”, which typically comes from pop-unders, botnets, hidden iframes, etc. You can easily find listings like “20,000 hits for $3”, and less in bulk. This is typically cheap because it has little commercial value other than perpetrating various forms of fraud.

You’d need a cheap VM with loads of bandwidth to use as a proxy, as well as publically available data – you couldn’t use this as a scraping technique due to browser protections against cross-domain requests. You’d also need to generate unique document IDs in a unique fashion, perhaps using the original URL.

If a traffic source runs on modern browsers, one could use this as a source of potentially cheap and unlimited processing power, even for point of combining the indexes, although provisions must be made for the natural instability of the system.

If you enjoyed this, you might also enjoy the following:

Lessons Learned from 0 to 40,000 Readers

Starting Out

I started writing a little over a year ago, after finding “Technical Blogging” by Antonio Cangiano through Hacker News. Since then, a bit over 40,000 people have read articles I’ve written, not a huge number in the grand scheme of things, but enough to draw a few lessons. The more I write, the more good things happen on their own, but at random moments.

Before starting to write, I came upon ~650 followers on Google Plus through a mutual-follow program on Hacker News and was interested in the SEO potential. If you’re signed into Google, people you follow are promoted in search results. Before this I built a site for a family member selling beehive plans and learned a bit about SEO. It turns out that making money by selling $10 print books in a narrow niche is difficult, especially when you can’t do the writing yourself, so I’ve pivoted to pursue technical subjects.

Choosing Material

Antonio recommends writing in a focused niche, but I am still developing direction. I started with vague ideas based on things that interest me – database tuning, machine learning, data science, statistics, and practical applications of these. In practice, I’ve written on wider subjects – anything within “full stack web development” is fair game, trying to focus on new, or popular tech – Scala, DevOps (Vagrant/Chef/Virtualization), Hadoop, R, and scraping. Typically I write with an audience in mind – e.g. a niche forum or subreddit. When this works well, it often ends up in the forum without my help, as in the case of writing about Flippa auction data, where I received links from the experienced-people forum and the Flippa corporate blog.

Choosing a strategy for generating material can be challenging. There are several models for success: just write what you’re interested in; pick topics with a good ratio of searches to existing quality content; pick an area where you own a product; or document knowledge from your consulting services. The last approach is particularly valuable, as it’s easy to establish business value. This has been used to build info-products with great success by Patrick Mckenzie, Brennan Dunn, and Nathan Barry.

I’ve followed the first two approaches – there are a few tools to help you pick topics, such as the very spammy looking, but useful Market Samurai. For a person with a generic IT blog, several types of articles naturally arise from this approach: proof of concepts, weird undocumented error messages, things that ought to be in product documentation, and things that are in product documentation, but can’t be found through Google for whatever reason.

I’ve found a few motivating examples of the last case- “R” is hard to search for, and a lot of their API documentation is in PDFs (weird…). Before receiving VC funding, ExtJS had a period where their primary developers would mock bloggers on their forums, which seems to have somewhat limited the press they received outside their forum. They also made their help in a Rich Javascript UI that Google had trouble spidering.

Detailed proof of concepts lead to people with the title “Founder” contacting you about jobs, an improvement over typical recruiter spam. Writing resembling technical documentation or error messages leads to a lot of people contacting me for help. I get enough traffic from this approach that I’m established as an authority who can review Javascript books.

Another article generation technique that works well is “_Old Concept_ with _New Tool_”. E.g. “Building a JSON Webservice in R“, “Implementing K-Means in Scala” – basically trend-surfing. A key success of this is that someone actually applied to a job where I work because of a post called “Scala vs. Clojure.” The appeal of this tactic for the author is the ease of writing; for the reader, it allows them to see a familiar and easily-digestable concept in a new language.

Editing

I have a long list of potential titles, which I update whenever I recognize something that might be of interest. I also have an editing checklist, which I refine over time. For programming blogs, how to format code is important – I use a WordPress plugin based on GeSHi. I found that code written in a REPL is often very dense, and I’ve had complaints from people on Reddit, so I try to format the code with more whitespace than I would normally use. I also push most of my code to github (if there is enough to justify it) – this lets me get feedback from people who are using it.

Promotion

Recently, Google Webmaster Tools showed a massive spike (400%) in how often they showed my blog in search. This was because they put my wordpress /tag/scraping at the top of search results for a day. Few people clicked through – my WP tag pages were not setup to look appealing, and I’ve since begun to fix that. To that aim, I wrote a script to extract counts from Twitter, HN, and Reddit, to estimate popularity, so I could make each tag page a “Guide” to the subject in question, filtered to the most popular and useful articles I’ve written on each subject.

The other SEO lesson is that the “not provided” keyword in Google Analytics removes a lot of insight. When a user is logged into their Gmail account, keywords are blocked being sent to the recipient site by Google. A lot of traffic is also missing the name of the referring domain entirely, an frequent occurrence when one of my articles is posted on reddit. For me, “not provided” has risen from 50% of search traffic a year ago to 80% today. Since Google Analytics lets you see top articles by referring domain, having detailed URLs helps make up for the lack of keyword data.

I must confess, I’ve thought for a long time that Twitter was kind of stupid, but recently I’ve been disabused of this notion, although I may be the last person to figure it out. While attending Philly ETE I realized that during the keynotes people sit there checking their phones – looking around the room you see Twitter apps.Writing summaries of conference talks using the conference hashtag resulted in a bunch of local people reading my articles and following me. Software that follows twitter hashtags for you is essentially an niche search engine, of which it turns out there are many – hnsearch, and the WordPress plugin search.

I’ve written several articles which have been posted to Twitter by 20+ people. I only noticed in hindsight – t.co is the 6th most productive source of visits for me. Some of these posts are automated – e.g. from HN or DZone.One of the most popular posts of these posts was an entirely hypothetical commentary on using map-reduce techniques to farm database queries out to browsers. It’s the only thing I’ve written that seems to have received significant attention on Google+ (19 events on G+, 24 on Twitter). Discovering how often this was shared prompted this post – I was surprised as it was something I’d whipped off in half an hour.

A few people have shared my articles on Reddit, which has introduced me to new subreddits, but most of these posts have fared poorly due to poor targeting. Posts I’ve made received more votes, even though they are self posts, because they are at least relevant. Unlike Reddit or Hacker News, DZone actively promotes it’s writers. If your articles do well on the social media/voting side of their forum, they invite you to join their network, which I have found very helpful.

Technical Considerations

Most default WordPress themes render terribly in mobile devices, so it’s worth testing, preferably on both Android and iOS. For me, about 10% of my visitors are on phones. The most common rendering issue is that they render the same view a desktop would see, but scaled down so small that you have to zoom and scroll both left-right and up-down. Installing a mobile plugin made my blog render usably, although not pretty, which has reduced the number of people who just give up.

For people who enter through mobile, most come from an app (Reddit, Hacker News, or Twitter), an email link, or an RSS subscription. An important consideration is whether it’s actually possible for someone interested to follow you if they are on a phone. On a desktop you can easily copy a URL to an RSS reader, or type in an email address, but on a phone it’s harder, which I suspect calls for tighter integration with twitter, etc, but I haven’t proved this out yet.

The second technical issue worth paying attention to is page load time. Even when it’s fast for you, it’s not for others. I set this blog up on a Linode host, and have been slowly improving this (e.g. APC, Image sizes, more memory from Linode, Apache caching). The WPEngine blog is extremely helpful for finding useful tidbits – the wasted effort I’ve gone through tuning this site seems to be what they are trying to help businesses avoid.

If you run your own server, you eventually will get hacked, especially if you use popular software – I’ve recently started using Better WP Security to mitigate this risk.

It should come as no surprise that spam has become very sophisticated- for instance, there are people scraping comments from other sites to post on yours. I accepted one comment I never was able to figure out – a character named “Weng Fu”, who goes around to tech blogs praising the glories of VB6 (!), presumably an exercise in trolling. I forgot about this comment until recently, when I found that Google Webmaster Tools shows I rank well for “Weng Fu VB6” – who knew.

ExtJs JSON Reader Example

I received the following email from a reader:
Thank you very much for finding time to read my mail.
I came across your blog http://garysieling.com/blog/extjs-pie-chart-example
It would be greatly helpful, if you could provide me with the code of binding the data dynamically to the DS. I have already generated the data in JSON format via Servlet:

{“success”:true,”campaignList”:[{"NumberOfCampaigns":1,"CamapaignScheduleDate":"Mar 23, 2013 12:00:00 AM"}]}

How do I attach this here..

var store = Ext.create(‘Ext.data.Store’, {
    model: ‘PopulationPoint’,
    data: MY_DATA
  });

Any help would be of great use.
 
Thanks and regards
Response:
There are a couple ways to approach this problem. The linked example that I wrote uses a raw Javascript object to render a pie chart. Lets say your servlet returns JSON in a variable – you could parse this yourself and use the result in the store, but Ext provides the JSONReader for this purpose:
var store = new Ext.data.JsonStore({
    url: '/servlet',
    root: 'campaignList',
    fields: [{name:'NumberOfCampaigns', type: 'int'},
             {name:'CampaignScheduleDate', type:'date'}]
});
A couple things to note here- this assumes you know the types in advance, and in particular, you want to test dates carefully to make sure you render them in a format Ext can parse.  If needed, you can specify a date format as well, like so: dateFormat: ‘m-d-Y g:i A’
You can add a “mapping” to the fields, to alias columns from the servlet. The root value is optional – if a servlet returns a raw array this is not necessary. You may need to add “totalProperty” to the store as well – this specifies the name of a property in the JSON payload which specifies the total number of records. This is only needed for paging scenarios, where not all the results are returned at once.

Entity recognition with Scala and Stanford NLP Named Entity Recognizer

The following sample will extract the contents of a court case and attempt to recognize names and locations using entity recognition software from Stanford NLP. From the samples, you can see it’s fairly good at finding nouns, but not always at identifying the type of each noun.

In this example, the entities I’d like to see are different – companies, law firms, lawyers, etc, but this test is good enough. The default examples provided let you choose different sets of things that can be recognized: {Location, Person, Organization}, {Location, Person, Organization, Misc}, and {Time, Location, Organization, Person, Money, Percent, Date}. The process of extracting PDF data and processing it takes about five seconds.

For this text, selecting different options sometimes led to the classifier picking different options for a noun – one time it’s a person, another time it’s an organization, etc. One improvement might be to run several classifiers and to allow them to vote. This classifier also loses words sometimes – if a subject is listed with a first, middle, and last name, it sometimes picks just two words. I’ve noticed similar issues with company names.

import org.apache.tika.parser.pdf._
import org.apache.tika.metadata._
import org.apache.tika.parser._
import java.io._
import org.xml.sax._
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.CoreAnnotations
 
object pdfHandler extends ContentHandler {
  val contents: StringBuffer = new StringBuffer()
 
  def characters(ch: Array[Char], start: Int, length: Int) {
    contents.append(new String(ch))
  }
 
  def endDocument() {
  }
 
  def endElement(uri: String, localName: String, qName: String) {
  }
 
  def endPrefixMapping(prefix: String) {
  }
 
  def ignorableWhitespace(ch: Array[Char], start: Int, length: Int) {
  }
 
  def processingInstruction(target: String, data: String) {
  }
 
  def setDocumentLocator(locator: Locator) {
  }
 
  def skippedEntity(name: String) {
  }
 
  def startDocument() {
  }
 
  def startElement(uri: String, localName: String, qName: String, atts: Attributes) {
  }
 
  def startPrefixMapping(prefix: String, uri: String) {
  }
}
 
object pdf extends App {
  val file = """e:\data\11-1285_i4dk.pdf"""
 
  val pdf: PDFParser = new PDFParser();
 
  val stream: InputStream = new FileInputStream(file)
  val handler: ContentHandler = pdfHandler
  val metadata: Metadata = new Metadata()
  val context: ParseContext = new ParseContext()
 
  pdf.parse(stream,
    handler,
    metadata,
    context)
 
  stream.close()
 
  val contents: String = pdfHandler.contents.toString()
  println(contents)
 
  val src = "stanford-ner-2013-04-04/classifiers/"
  val classifier1 = "english.all.3class.distsim.crf.ser.gz"
  val classifier2 = "english.conll.4class.distsim.crf.ser.gz"
  val classifier3 = "english.muc.7class.distsim.crf.ser.gz"
 
  val serializedClassifier = src + classifier1
 
  val classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier)
  val out = classifier.classify(contents)
 
  var words = 0
  for (i <- 0 to out.size() - 1) {
    val sentence = out.get(i)
 
    var foundWord = ""
    var oldWordClass = ""
 
    for (j <- 0 to sentence.size() - 1) {
      val word = sentence.get(j)
      val wordClass = word.get(classOf[CoreAnnotations.AnswerAnnotation]) + ""
 
      if (!oldWordClass.equals(wordClass)) {
        if (!oldWordClass.equals("O") && !oldWordClass.equals("")) {
          print("[/" + oldWordClass + "]")
        }
      }
 
      if (!wordClass.equals("O") && !wordClass.equals("")) {
        if (!oldWordClass.equals(wordClass)) {
          print("[" + wordClass + "]")
        }
      }
 
      oldWordClass = wordClass
 
      words = words + 1
      print(word);
      print(" ");
 
      if (words > 10) {
        words = 0
        println(" ")
      }
    }
  }
}
11-1285 [ORGANIZATION]US Airways , Inc. [/ORGANIZATION]v.
[PERSON]McCutchen [/PERSON]-LRB- 4\/16\/13 -RRB- 1 -LRB-
Slip Opinion -RRB- OCTOBER TERM ,
2012 Syllabus NOTE : Where it
is feasible , a syllabus -LRB-
headnote -RRB- will be released ,
as isbeing done in connection with
this case , at the time
the opinion is issued . The
syllabus constitutes no part of the
opinion of the Court but has
beenprepared by the Reporter of Decisions
for the convenience of the reader
. See [LOCATION]United States [/LOCATION]v. [ORGANIZATION]Detroit
Timber & Lumber Co. [/ORGANIZATION], 200
U. S. 321 , 337 .
SUPREME COURT OF THE [ORGANIZATION]UNITED STATES
Syllabus US AIRWAYS [/ORGANIZATION], INC. ,
IN ITS CAPACITY AS FIDUCIARY AND
PLAN ADMINISTRATOR OF THE [LOCATION]US [/LOCATION]AIRWAYS
, INC. . EMPLOYEE BENEFITS PLAN
v. [PERSON]MCCUTCHEN [/PERSON]ET AL. . CERTIORARI
TO THE [ORGANIZATION]UNITED STATES [/ORGANIZATION]COURT OF
APPEALS FOR THE THIRD CIRCUIT No.
11 -- 1285 . Argued November
27 , 2012 -- Decided April
16 , 2013 The health benefits
plan established by petitioner [ORGANIZATION]US Airways
[/ORGANIZATION]paid $ 66,866 in medical expenses
for injuries suffered by respondentMcCutchen ,
a [ORGANIZATION]US Airways [/ORGANIZATION]employee , in
a car accident caused by athird
party . The plan entitled [ORGANIZATION]US
Airways [/ORGANIZATION]to reimbursement if
[PERSON]McCutchen [/PERSON]

Extracting PDF text with Scala

This example extracts the text contents of a PDF for use in other systems. This demonstrates some basic differences from Java: multi-line strings (hooray!), imports, primitive arrays, and what implementing an interface looks like. The big downside to this is that the Eclipse Scala plugin doesn’t seem to have the ability to fill in interface methods on an object.

import java.io._
 
import org.apache.tika.parser.pdf._
import org.apache.tika.metadata._
import org.apache.tika.parser._
import org.xml.sax._
 
object pdfHandler extends ContentHandler {
	def characters(ch : Array[Char], start: Int, length: Int) {
		println(new String(ch))
	}
 
	def endDocument() {
	}
 
	def endElement(uri: String, localName: String, qName: String) {
	}
 
	def endPrefixMapping(prefix: String) {
	}
 
	def ignorableWhitespace(ch: Array[Char], start: Int, length: Int) {
	}
 
	def processingInstruction(target: String, data: String) {
	}
 
	def setDocumentLocator(locator: Locator) {
	}
 
	def skippedEntity(name: String) {
	}
 
	def startDocument() {
	}
 
	def startElement(uri: String, localName: String, qName: String, atts: Attributes) {
	}
 
	def startPrefixMapping(prefix: String, uri: String) {
	}
}
 
object pdf extends App {
	val folder = """\\nas\Files\Data\pacer2\"""
	val subfolder = """\00\00\gov.uscourts.rid.6064\"""
	val file = """gov.uscourts.rid.6064.20.0.pdf"""
 
	val pdf : PDFParser = new PDFParser();
 
	val stream : InputStream = new FileInputStream(folder + subfolder + file)
	val handler : ContentHandler = pdfHandler
	val metadata : Metadata = new Metadata()
	val context : ParseContext = new ParseContext()
 
	pdf.parse(stream,
         handler,
         metadata,
         context)
 
    stream.close()
}

Output:

UNITED STATES DISTRICT COURT
FOR THE DISTRICT OF RHODE ISLAND
...
It is hereby agreed by and between the parties that the above-captioned matter be
dismissed, with prejudice, no interest, no costs.

Implementing k-means in Scala

To generate sample data, I selected two points, (10, 20) and (25, 5), then generated a list of normally distributed points around those two – the exact points used are in the code below.

This implements Lloyd’s algorithm, which tries to cluster points in iterations in a simple manner:

1. Assume a certain number of clusters
2. Group the points at random
3. Compute the center of each cluster
4. For each point, compute which cluster is closest
5. Move all the points into new groupings
6. Repeat 3-5 a few times, until you’re happy with the results

I like how the functional programming style forces you to recreate all the data structures, in this case. It might be tempting to implement this in an imperative style, modifying data structures in place, but since steps 4-5 require separate data, you are protected against making it more difficult. You can see the full source below, or on github.

Since this example is fairly contrived, this converges pretty quickly:

Initial State:
  Cluster 0
  Mean: (17.83517750970944, 12.242720407317105)
    (10.8348626966492, 18.7800980127523))
    (7.7875624720831, 20.1569764307574))
    (11.9096128931784, 21.1855674228972))
    (22.4668345067162, 8.9705504626857))
    (7.91362116378194, 21.325928219919))
    (22.636600400773, 2.46561420928429))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (25.1439536911272, 3.58469981317611))
    (23.5359486724204, 4.07290025106778))
    (11.7493214262468, 17.8517235677469))
    (12.4277617893575, 19.4887691804508))
    (11.931275122466, 18.0462702532436))
    (25.4645673159779, 7.54703465191098))
    (21.8031183153743, 5.69297814349064))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (16.95249500233747, 12.848199048608048)
    (11.7265904596619, 16.9636039793709))
    (10.7751248849735, 22.1517666115673))
    (23.6587920739353, 3.35476798095758))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (7.03171204763376, 19.1985058633283))
    (23.7722765903534, 3.74873642284525))
    (10.259545802461, 23.4515683763173))
    (28.1587146197594, 3.70625885635717))
    (10.1057940183815, 18.7332929859685))
    (8.90149362263775, 19.6314465074203))
    (12.4353462881232, 19.6310467981989))
    (24.3793349065557, 4.59761596097384))
    (22.5447925324242, 2.99485404382734))
    (26.8942422516129, 5.02646862012427))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (27.0339042858296, 4.4151109960116))
    (11.0118378554584, 20.9773232834654))
Iteration: 0
  Cluster 0
  Mean: (23.781370272978315, 5.754127202865132)
    (11.7265904596619, 16.9636039793709))
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.296576237184727, 20.09138475584863)
    (10.8348626966492, 18.7800980127523))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 1
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 2
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 3
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 4
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
Iteration: 5
  Cluster 0
  Mean: (24.415832368416023, 5.164154740943777)
    (23.6587920739353, 3.35476798095758))
    (22.4668345067162, 8.9705504626857))
    (21.4930923464916, 3.28999356823389))
    (26.4748241341303, 9.25128245838802))
    (22.636600400773, 2.46561420928429))
    (23.7722765903534, 3.74873642284525))
    (25.1439536911272, 3.58469981317611))
    (28.1587146197594, 3.70625885635717))
    (23.5359486724204, 4.07290025106778))
    (24.3793349065557, 4.59761596097384))
    (25.4645673159779, 7.54703465191098))
    (22.5447925324242, 2.99485404382734))
    (21.8031183153743, 5.69297814349064))
    (26.8942422516129, 5.02646862012427))
    (23.9177161897547, 8.1377950229489))
    (24.5349708443852, 5.00561881333415))
    (26.2100410238973, 5.06220487544192))
    (27.0339042858296, 4.4151109960116))
    (23.7770902983858, 7.19445492687232))
  Cluster 1
  Mean: (10.371840143630894, 19.92676471498138)
    (10.8348626966492, 18.7800980127523))
    (11.7265904596619, 16.9636039793709))
    (7.7875624720831, 20.1569764307574))
    (10.7751248849735, 22.1517666115673))
    (11.9096128931784, 21.1855674228972))
    (7.91362116378194, 21.325928219919))
    (7.03171204763376, 19.1985058633283))
    (13.0838514816799, 20.3398794353494))
    (11.7396623802245, 17.7026240456956))
    (10.259545802461, 23.4515683763173))
    (10.1057940183815, 18.7332929859685))
    (11.7493214262468, 17.8517235677469))
    (8.90149362263775, 19.6314465074203))
    (12.4277617893575, 19.4887691804508))
    (12.4353462881232, 19.6310467981989))
    (11.931275122466, 18.0462702532436))
    (6.56491029696013, 21.5098251711267))
    (8.87507602702847, 21.4823134390704))
    (11.0118378554584, 20.9773232834654))
class Point(dx: Double, dy: Double) {
  val x: Double = dx
  val y: Double = dy
 
  override def toString(): String = {
    "(" + x + ", " + y + ")"
  }
 
  def dist(p: Point): Double = {
    return x * p.x + y * p.y
  }
}
 
object kmeans extends App {
  val k: Int = 2
 
  // Correct answers to centers are (10, 20) and (25, 5)
  val points: List[Point] = List(
    new Point(10.8348626966492, 18.7800980127523),
    new Point(10.259545802461, 23.4515683763173),
    new Point(11.7396623802245, 17.7026240456956),
    new Point(12.4277617893575, 19.4887691804508),
    new Point(10.1057940183815, 18.7332929859685),
    new Point(11.0118378554584, 20.9773232834654),
    new Point(7.03171204763376, 19.1985058633283),
    new Point(6.56491029696013, 21.5098251711267),
    new Point(10.7751248849735, 22.1517666115673),
    new Point(8.90149362263775, 19.6314465074203),
    new Point(11.931275122466, 18.0462702532436),
    new Point(11.7265904596619, 16.9636039793709),
    new Point(11.7493214262468, 17.8517235677469),
    new Point(12.4353462881232, 19.6310467981989),
    new Point(13.0838514816799, 20.3398794353494),
    new Point(7.7875624720831, 20.1569764307574),
    new Point(11.9096128931784, 21.1855674228972),
    new Point(8.87507602702847, 21.4823134390704),
    new Point(7.91362116378194, 21.325928219919),
    new Point(26.4748241341303, 9.25128245838802),
    new Point(26.2100410238973, 5.06220487544192),
    new Point(28.1587146197594, 3.70625885635717),
    new Point(26.8942422516129, 5.02646862012427),
    new Point(23.7770902983858, 7.19445492687232),
    new Point(23.6587920739353, 3.35476798095758),
    new Point(23.7722765903534, 3.74873642284525),
    new Point(23.9177161897547, 8.1377950229489),
    new Point(22.4668345067162, 8.9705504626857),
    new Point(24.5349708443852, 5.00561881333415),
    new Point(24.3793349065557, 4.59761596097384),
    new Point(27.0339042858296, 4.4151109960116),
    new Point(21.8031183153743, 5.69297814349064),
    new Point(22.636600400773, 2.46561420928429),
    new Point(25.1439536911272, 3.58469981317611),
    new Point(21.4930923464916, 3.28999356823389),
    new Point(23.5359486724204, 4.07290025106778),
    new Point(22.5447925324242, 2.99485404382734),
    new Point(25.4645673159779, 7.54703465191098)).sortBy(
      p => (p.x + " " + p.y).hashCode())
 
  def clusterMean(points: List[Point]): Point = {
    val cumulative = points.reduceLeft((a: Point, b: Point) => new Point(a.x + b.x, a.y + b.y))
 
    return new Point(cumulative.x / points.length, cumulative.y / points.length)
  }
 
  def render(points: Map[Int, List[Point]]) {
    for (clusterNumber <- points.keys.toSeq.sorted) {
      System.out.println("  Cluster " + clusterNumber)
 
      val meanPoint = clusterMean(points(clusterNumber))
      System.out.println("  Mean: " + meanPoint)
 
      for (j <- 0 to points(clusterNumber).length - 1) {
        System.out.println("    " + points(clusterNumber)(j) + ")")
      }
 
      System.out.println("")
    }
  }
 
  val clusters =
    points.zipWithIndex.groupBy(
      x => x._2 % k) transform (
        (i: Int, p: List[(Point, Int)]) => for (x <- p) yield x._1)
 
  System.out.println("Initial State: ")
  render(clusters)
 
  def iterate(clusters: Map[Int, List[Point]]): Map[Int, List[Point]] = {
    val unzippedClusters =
      (clusters: Iterator[(Point, Int)]) => clusters.map(cluster => cluster._1)
 
    // find cluster means
    val means =
      (clusters: Map[Int, List[Point]]) =>
        for (clusterIndex <- clusters.keys)
          yield clusterMean(clusters(clusterIndex))
 
    // find the closest index
    def closest(p: Point, means: Iterable[Point]): Int = {
      val distances = for (center <- means) yield p.dist(center)
      return distances.zipWithIndex.min._2
    }
 
    // assignment step
    val newClusters =
      points.groupBy(
        (p: Point) => closest(p, means(clusters)))
 
    render(newClusters)
 
    return newClusters
  }
 
  var clusterToTest = clusters
  for (i <- 0 to 5) {
    System.out.println("Iteration: " + i)
    clusterToTest = iterate(clusterToTest)
  }
}

Scala zip/zipAll Example

The zip function combines two lists into tuples. If the lists are of differing lengths, the shorter length is used. If you don’t like this behavior, the zipAll function will keep all elements, filling in specified values for the blanks (compare this to the recycling rule in R, which lets you continuously cycle through the shorter list).

val a = List("a", "b", "c", "d")
val b = List(1, 2, 3);
 
println(a.zipAll(b, "for missing values", 100))

And here is the output for each:

List((a,1), (b,2), (c,3))
List((a,1), (b,2), (c,3), (d,100))

Scala Tree Sort Example

This demonstrates basic language features – case classes, iteration, anonymous functions, etc.

abstract class Node
case class LeafNode(data: String) extends Node;
case class FullNode(data: String, left: Node, right: Node) extends Node
case class LeftNode(data: String, left: Node) extends Node
case class RightNode(data: String, right: Node) extends Node
 
object test extends App {
  val words = List("revertively", "dispelled", "overmoral",
    "sylphid", "nonhabitability", "noiselessness",
    "undisconnected", "shoveling", "visalia", "ilo");
 
  def construct(A: List[String]): Node = {
    def insert(tree: Node, value: String): Node = {
      tree match {
        case null => LeafNode(value)
        case LeafNode(data) => if (value > data) {
          LeftNode(data, LeafNode(value))
        } else {
          RightNode(data, LeafNode(value))
        }
        case LeftNode(data, left) => if (value > data) {
          LeftNode(value, LeftNode(data, left))
        } else {
          FullNode(data, left, LeafNode(value))
        }
        case RightNode(data, right) => if (value > data) {
          FullNode(data, LeafNode(value), right)
        } else {
          RightNode(value, RightNode(data, right))
        }
        case FullNode(data, left, right) => if (value > data) {
          FullNode(data, insert(left, value), right)
        } else {
          FullNode(data, left, insert(right, value))
        }
      }
    }
 
    var tree: Node = null;
 
    for (item <- A) {
      tree = insert(tree, item)
    }
 
    return tree
  };
 
  val f = (A: String) =>
    System.out.println(A);
 
  words.map(f);
 
  var x = construct(words);
 
  def recurseNode(A: Node, depth: Int) {
    def display(data: String, depth: Int) {
      for (i <- 1 to depth * 2) { System.out.print("-") }
      System.out.println(data);
    }
    A match {
      case null => {
        display("[]", depth)
      }
      case LeafNode(data) => {
        display(data, depth)
        recurseNode(null, depth + 1)
        recurseNode(null, depth + 1)
      }
      case FullNode(data, left, right) => {
        display(data, depth)
        recurseNode(left, depth + 1)
        recurseNode(right, depth + 1)
      }
      case RightNode(data, right) => {
        display(data, depth)
        recurseNode(null, depth + 1)
        recurseNode(right, depth + 1)
      }
      case LeftNode(data, left) => {
        display(data, depth)
        recurseNode(left, depth + 1)
        recurseNode(null, depth + 1)
      }
    }
  }
 
  def output(A: Node, recurse: (Node, Int) => Unit) = {
    recurse(A, 0)
  }
 
  def renderTree(A: Node) = {
    output(x, recurseNode);
  }
 
  renderTree(x);
 
  def sortedRender(A: Node, depth: Int) {
    def display(data: String, depth: Int) {
      System.out.println(data);
    }
    A match {
      case null => {
      }
      case LeafNode(data) => {
        display(data, depth)
      }
      case FullNode(data, left, right) => {
        sortedRender(left, depth + 1)
        display(data, depth)
        sortedRender(right, depth + 1)
      }
      case RightNode(data, right) => {
        sortedRender(null, depth + 1)
        display(data, depth)
        sortedRender(right, depth + 1)
      }
      case LeftNode(data, left) => {
        sortedRender(left, depth + 1)
        display(data, depth)
      }
    }
  }
 
  def renderTreeSorted(A: Node) = {
    output(x, sortedRender);
  }
 
  System.out.println("")
  renderTreeSorted(x);
}