{"id":997,"date":"2013-05-03T21:37:34","date_gmt":"2013-05-03T21:37:34","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=997"},"modified":"2013-05-03T21:37:34","modified_gmt":"2013-05-03T21:37:34","slug":"implementing-k-means-in-scala","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/implementing-k-means-in-scala\/","title":{"rendered":"Implementing k-means in Scala"},"content":{"rendered":"<p>To generate sample data, I selected two points, (10, 20) and (25, 5), then generated a list of normally distributed points around those two &#8211; the exact points used are in the code below.<\/p>\n<p>This implements Lloyd&#8217;s algorithm, which tries to cluster points in iterations in a simple manner:<\/p>\n<p>1. Assume a certain number of clusters<br \/>\n2. Group the points at random<br \/>\n3. Compute the center of each cluster<br \/>\n4. For each point, compute which cluster is closest<br \/>\n5. Move all the points into new groupings<br \/>\n6. Repeat 3-5 a few times, until you&#8217;re happy with the results<\/p>\n<p>I like how the functional programming style forces you to recreate all the data structures, in this case. It might be tempting to implement this in an imperative style, modifying data structures in place, but since steps 4-5 require separate data, you are protected against making it more difficult. You can see the full source below, or <a href=\"https:\/\/github.com\/garysieling\/scala-k-means\">on github<\/a>.<\/p>\n<p>Since this example is fairly contrived, this converges pretty quickly:<\/p>\n<pre>Initial State: \n  Cluster 0\n  Mean: (17.83517750970944, 12.242720407317105)\n    (10.8348626966492, 18.7800980127523))\n    (7.7875624720831, 20.1569764307574))\n    (11.9096128931784, 21.1855674228972))\n    (22.4668345067162, 8.9705504626857))\n    (7.91362116378194, 21.325928219919))\n    (22.636600400773, 2.46561420928429))\n    (13.0838514816799, 20.3398794353494))\n    (11.7396623802245, 17.7026240456956))\n    (25.1439536911272, 3.58469981317611))\n    (23.5359486724204, 4.07290025106778))\n    (11.7493214262468, 17.8517235677469))\n    (12.4277617893575, 19.4887691804508))\n    (11.931275122466, 18.0462702532436))\n    (25.4645673159779, 7.54703465191098))\n    (21.8031183153743, 5.69297814349064))\n    (23.9177161897547, 8.1377950229489))\n    (24.5349708443852, 5.00561881333415))\n    (26.2100410238973, 5.06220487544192))\n    (23.7770902983858, 7.19445492687232))\n\n  Cluster 1\n  Mean: (16.95249500233747, 12.848199048608048)\n    (11.7265904596619, 16.9636039793709))\n    (10.7751248849735, 22.1517666115673))\n    (23.6587920739353, 3.35476798095758))\n    (21.4930923464916, 3.28999356823389))\n    (26.4748241341303, 9.25128245838802))\n    (7.03171204763376, 19.1985058633283))\n    (23.7722765903534, 3.74873642284525))\n    (10.259545802461, 23.4515683763173))\n    (28.1587146197594, 3.70625885635717))\n    (10.1057940183815, 18.7332929859685))\n    (8.90149362263775, 19.6314465074203))\n    (12.4353462881232, 19.6310467981989))\n    (24.3793349065557, 4.59761596097384))\n    (22.5447925324242, 2.99485404382734))\n    (26.8942422516129, 5.02646862012427))\n    (6.56491029696013, 21.5098251711267))\n    (8.87507602702847, 21.4823134390704))\n    (27.0339042858296, 4.4151109960116))\n    (11.0118378554584, 20.9773232834654))\n\nIteration: 0\n  Cluster 0\n  Mean: (23.781370272978315, 5.754127202865132)\n    (11.7265904596619, 16.9636039793709))\n    (23.6587920739353, 3.35476798095758))\n    (22.4668345067162, 8.9705504626857))\n    (21.4930923464916, 3.28999356823389))\n    (26.4748241341303, 9.25128245838802))\n    (22.636600400773, 2.46561420928429))\n    (23.7722765903534, 3.74873642284525))\n    (25.1439536911272, 3.58469981317611))\n    (28.1587146197594, 3.70625885635717))\n    (23.5359486724204, 4.07290025106778))\n    (24.3793349065557, 4.59761596097384))\n    (25.4645673159779, 7.54703465191098))\n    (22.5447925324242, 2.99485404382734))\n    (21.8031183153743, 5.69297814349064))\n    (26.8942422516129, 5.02646862012427))\n    (23.9177161897547, 8.1377950229489))\n    (24.5349708443852, 5.00561881333415))\n    (26.2100410238973, 5.06220487544192))\n    (27.0339042858296, 4.4151109960116))\n    (23.7770902983858, 7.19445492687232))\n\n  Cluster 1\n  Mean: (10.296576237184727, 20.09138475584863)\n    (10.8348626966492, 18.7800980127523))\n    (7.7875624720831, 20.1569764307574))\n    (10.7751248849735, 22.1517666115673))\n    (11.9096128931784, 21.1855674228972))\n    (7.91362116378194, 21.325928219919))\n    (7.03171204763376, 19.1985058633283))\n    (13.0838514816799, 20.3398794353494))\n    (11.7396623802245, 17.7026240456956))\n    (10.259545802461, 23.4515683763173))\n    (10.1057940183815, 18.7332929859685))\n    (11.7493214262468, 17.8517235677469))\n    (8.90149362263775, 19.6314465074203))\n    (12.4277617893575, 19.4887691804508))\n    (12.4353462881232, 19.6310467981989))\n    (11.931275122466, 18.0462702532436))\n    (6.56491029696013, 21.5098251711267))\n    (8.87507602702847, 21.4823134390704))\n    (11.0118378554584, 20.9773232834654))\n\nIteration: 1\n  Cluster 0\n  Mean: (24.415832368416023, 5.164154740943777)\n    (23.6587920739353, 3.35476798095758))\n    (22.4668345067162, 8.9705504626857))\n    (21.4930923464916, 3.28999356823389))\n    (26.4748241341303, 9.25128245838802))\n    (22.636600400773, 2.46561420928429))\n    (23.7722765903534, 3.74873642284525))\n    (25.1439536911272, 3.58469981317611))\n    (28.1587146197594, 3.70625885635717))\n    (23.5359486724204, 4.07290025106778))\n    (24.3793349065557, 4.59761596097384))\n    (25.4645673159779, 7.54703465191098))\n    (22.5447925324242, 2.99485404382734))\n    (21.8031183153743, 5.69297814349064))\n    (26.8942422516129, 5.02646862012427))\n    (23.9177161897547, 8.1377950229489))\n    (24.5349708443852, 5.00561881333415))\n    (26.2100410238973, 5.06220487544192))\n    (27.0339042858296, 4.4151109960116))\n    (23.7770902983858, 7.19445492687232))\n\n  Cluster 1\n  Mean: (10.371840143630894, 19.92676471498138)\n    (10.8348626966492, 18.7800980127523))\n    (11.7265904596619, 16.9636039793709))\n    (7.7875624720831, 20.1569764307574))\n    (10.7751248849735, 22.1517666115673))\n    (11.9096128931784, 21.1855674228972))\n    (7.91362116378194, 21.325928219919))\n    (7.03171204763376, 19.1985058633283))\n    (13.0838514816799, 20.3398794353494))\n    (11.7396623802245, 17.7026240456956))\n    (10.259545802461, 23.4515683763173))\n    (10.1057940183815, 18.7332929859685))\n    (11.7493214262468, 17.8517235677469))\n    (8.90149362263775, 19.6314465074203))\n    (12.4277617893575, 19.4887691804508))\n    (12.4353462881232, 19.6310467981989))\n    (11.931275122466, 18.0462702532436))\n    (6.56491029696013, 21.5098251711267))\n    (8.87507602702847, 21.4823134390704))\n    (11.0118378554584, 20.9773232834654))\n\nIteration: 2\n  Cluster 0\n  Mean: (24.415832368416023, 5.164154740943777)\n    (23.6587920739353, 3.35476798095758))\n    (22.4668345067162, 8.9705504626857))\n    (21.4930923464916, 3.28999356823389))\n    (26.4748241341303, 9.25128245838802))\n    (22.636600400773, 2.46561420928429))\n    (23.7722765903534, 3.74873642284525))\n    (25.1439536911272, 3.58469981317611))\n    (28.1587146197594, 3.70625885635717))\n    (23.5359486724204, 4.07290025106778))\n    (24.3793349065557, 4.59761596097384))\n    (25.4645673159779, 7.54703465191098))\n    (22.5447925324242, 2.99485404382734))\n    (21.8031183153743, 5.69297814349064))\n    (26.8942422516129, 5.02646862012427))\n    (23.9177161897547, 8.1377950229489))\n    (24.5349708443852, 5.00561881333415))\n    (26.2100410238973, 5.06220487544192))\n    (27.0339042858296, 4.4151109960116))\n    (23.7770902983858, 7.19445492687232))\n\n  Cluster 1\n  Mean: (10.371840143630894, 19.92676471498138)\n    (10.8348626966492, 18.7800980127523))\n    (11.7265904596619, 16.9636039793709))\n    (7.7875624720831, 20.1569764307574))\n    (10.7751248849735, 22.1517666115673))\n    (11.9096128931784, 21.1855674228972))\n    (7.91362116378194, 21.325928219919))\n    (7.03171204763376, 19.1985058633283))\n    (13.0838514816799, 20.3398794353494))\n    (11.7396623802245, 17.7026240456956))\n    (10.259545802461, 23.4515683763173))\n    (10.1057940183815, 18.7332929859685))\n    (11.7493214262468, 17.8517235677469))\n    (8.90149362263775, 19.6314465074203))\n    (12.4277617893575, 19.4887691804508))\n    (12.4353462881232, 19.6310467981989))\n    (11.931275122466, 18.0462702532436))\n    (6.56491029696013, 21.5098251711267))\n    (8.87507602702847, 21.4823134390704))\n    (11.0118378554584, 20.9773232834654))\n\nIteration: 3\n  Cluster 0\n  Mean: (24.415832368416023, 5.164154740943777)\n    (23.6587920739353, 3.35476798095758))\n    (22.4668345067162, 8.9705504626857))\n    (21.4930923464916, 3.28999356823389))\n    (26.4748241341303, 9.25128245838802))\n    (22.636600400773, 2.46561420928429))\n    (23.7722765903534, 3.74873642284525))\n    (25.1439536911272, 3.58469981317611))\n    (28.1587146197594, 3.70625885635717))\n    (23.5359486724204, 4.07290025106778))\n    (24.3793349065557, 4.59761596097384))\n    (25.4645673159779, 7.54703465191098))\n    (22.5447925324242, 2.99485404382734))\n    (21.8031183153743, 5.69297814349064))\n    (26.8942422516129, 5.02646862012427))\n    (23.9177161897547, 8.1377950229489))\n    (24.5349708443852, 5.00561881333415))\n    (26.2100410238973, 5.06220487544192))\n    (27.0339042858296, 4.4151109960116))\n    (23.7770902983858, 7.19445492687232))\n\n  Cluster 1\n  Mean: (10.371840143630894, 19.92676471498138)\n    (10.8348626966492, 18.7800980127523))\n    (11.7265904596619, 16.9636039793709))\n    (7.7875624720831, 20.1569764307574))\n    (10.7751248849735, 22.1517666115673))\n    (11.9096128931784, 21.1855674228972))\n    (7.91362116378194, 21.325928219919))\n    (7.03171204763376, 19.1985058633283))\n    (13.0838514816799, 20.3398794353494))\n    (11.7396623802245, 17.7026240456956))\n    (10.259545802461, 23.4515683763173))\n    (10.1057940183815, 18.7332929859685))\n    (11.7493214262468, 17.8517235677469))\n    (8.90149362263775, 19.6314465074203))\n    (12.4277617893575, 19.4887691804508))\n    (12.4353462881232, 19.6310467981989))\n    (11.931275122466, 18.0462702532436))\n    (6.56491029696013, 21.5098251711267))\n    (8.87507602702847, 21.4823134390704))\n    (11.0118378554584, 20.9773232834654))\n\nIteration: 4\n  Cluster 0\n  Mean: (24.415832368416023, 5.164154740943777)\n    (23.6587920739353, 3.35476798095758))\n    (22.4668345067162, 8.9705504626857))\n    (21.4930923464916, 3.28999356823389))\n    (26.4748241341303, 9.25128245838802))\n    (22.636600400773, 2.46561420928429))\n    (23.7722765903534, 3.74873642284525))\n    (25.1439536911272, 3.58469981317611))\n    (28.1587146197594, 3.70625885635717))\n    (23.5359486724204, 4.07290025106778))\n    (24.3793349065557, 4.59761596097384))\n    (25.4645673159779, 7.54703465191098))\n    (22.5447925324242, 2.99485404382734))\n    (21.8031183153743, 5.69297814349064))\n    (26.8942422516129, 5.02646862012427))\n    (23.9177161897547, 8.1377950229489))\n    (24.5349708443852, 5.00561881333415))\n    (26.2100410238973, 5.06220487544192))\n    (27.0339042858296, 4.4151109960116))\n    (23.7770902983858, 7.19445492687232))\n\n  Cluster 1\n  Mean: (10.371840143630894, 19.92676471498138)\n    (10.8348626966492, 18.7800980127523))\n    (11.7265904596619, 16.9636039793709))\n    (7.7875624720831, 20.1569764307574))\n    (10.7751248849735, 22.1517666115673))\n    (11.9096128931784, 21.1855674228972))\n    (7.91362116378194, 21.325928219919))\n    (7.03171204763376, 19.1985058633283))\n    (13.0838514816799, 20.3398794353494))\n    (11.7396623802245, 17.7026240456956))\n    (10.259545802461, 23.4515683763173))\n    (10.1057940183815, 18.7332929859685))\n    (11.7493214262468, 17.8517235677469))\n    (8.90149362263775, 19.6314465074203))\n    (12.4277617893575, 19.4887691804508))\n    (12.4353462881232, 19.6310467981989))\n    (11.931275122466, 18.0462702532436))\n    (6.56491029696013, 21.5098251711267))\n    (8.87507602702847, 21.4823134390704))\n    (11.0118378554584, 20.9773232834654))\n\nIteration: 5\n  Cluster 0\n  Mean: (24.415832368416023, 5.164154740943777)\n    (23.6587920739353, 3.35476798095758))\n    (22.4668345067162, 8.9705504626857))\n    (21.4930923464916, 3.28999356823389))\n    (26.4748241341303, 9.25128245838802))\n    (22.636600400773, 2.46561420928429))\n    (23.7722765903534, 3.74873642284525))\n    (25.1439536911272, 3.58469981317611))\n    (28.1587146197594, 3.70625885635717))\n    (23.5359486724204, 4.07290025106778))\n    (24.3793349065557, 4.59761596097384))\n    (25.4645673159779, 7.54703465191098))\n    (22.5447925324242, 2.99485404382734))\n    (21.8031183153743, 5.69297814349064))\n    (26.8942422516129, 5.02646862012427))\n    (23.9177161897547, 8.1377950229489))\n    (24.5349708443852, 5.00561881333415))\n    (26.2100410238973, 5.06220487544192))\n    (27.0339042858296, 4.4151109960116))\n    (23.7770902983858, 7.19445492687232))\n\n  Cluster 1\n  Mean: (10.371840143630894, 19.92676471498138)\n    (10.8348626966492, 18.7800980127523))\n    (11.7265904596619, 16.9636039793709))\n    (7.7875624720831, 20.1569764307574))\n    (10.7751248849735, 22.1517666115673))\n    (11.9096128931784, 21.1855674228972))\n    (7.91362116378194, 21.325928219919))\n    (7.03171204763376, 19.1985058633283))\n    (13.0838514816799, 20.3398794353494))\n    (11.7396623802245, 17.7026240456956))\n    (10.259545802461, 23.4515683763173))\n    (10.1057940183815, 18.7332929859685))\n    (11.7493214262468, 17.8517235677469))\n    (8.90149362263775, 19.6314465074203))\n    (12.4277617893575, 19.4887691804508))\n    (12.4353462881232, 19.6310467981989))\n    (11.931275122466, 18.0462702532436))\n    (6.56491029696013, 21.5098251711267))\n    (8.87507602702847, 21.4823134390704))\n    (11.0118378554584, 20.9773232834654))<\/pre>\n<pre lang=\"Java\">class Point(dx: Double, dy: Double) {\n  val x: Double = dx\n  val y: Double = dy\n\n  override def toString(): String = {\n    \"(\" + x + \", \" + y + \")\"\n  }\n\n  def dist(p: Point): Double = {\n    return (x - p.x) * (x - p.x) + (y - p.y) * (y - p.y);\n  }\n}\n\nobject kmeans extends App {\n  val k: Int = 2\n\n  \/\/ Correct answers to centers are (10, 20) and (25, 5)\n  val points: List[Point] = List(\n    new Point(10.8348626966492, 18.7800980127523),\n    new Point(10.259545802461, 23.4515683763173),\n    new Point(11.7396623802245, 17.7026240456956),\n    new Point(12.4277617893575, 19.4887691804508),\n    new Point(10.1057940183815, 18.7332929859685),\n    new Point(11.0118378554584, 20.9773232834654),\n    new Point(7.03171204763376, 19.1985058633283),\n    new Point(6.56491029696013, 21.5098251711267),\n    new Point(10.7751248849735, 22.1517666115673),\n    new Point(8.90149362263775, 19.6314465074203),\n    new Point(11.931275122466, 18.0462702532436),\n    new Point(11.7265904596619, 16.9636039793709),\n    new Point(11.7493214262468, 17.8517235677469),\n    new Point(12.4353462881232, 19.6310467981989),\n    new Point(13.0838514816799, 20.3398794353494),\n    new Point(7.7875624720831, 20.1569764307574),\n    new Point(11.9096128931784, 21.1855674228972),\n    new Point(8.87507602702847, 21.4823134390704),\n    new Point(7.91362116378194, 21.325928219919),\n    new Point(26.4748241341303, 9.25128245838802),\n    new Point(26.2100410238973, 5.06220487544192),\n    new Point(28.1587146197594, 3.70625885635717),\n    new Point(26.8942422516129, 5.02646862012427),\n    new Point(23.7770902983858, 7.19445492687232),\n    new Point(23.6587920739353, 3.35476798095758),\n    new Point(23.7722765903534, 3.74873642284525),\n    new Point(23.9177161897547, 8.1377950229489),\n    new Point(22.4668345067162, 8.9705504626857),\n    new Point(24.5349708443852, 5.00561881333415),\n    new Point(24.3793349065557, 4.59761596097384),\n    new Point(27.0339042858296, 4.4151109960116),\n    new Point(21.8031183153743, 5.69297814349064),\n    new Point(22.636600400773, 2.46561420928429),\n    new Point(25.1439536911272, 3.58469981317611),\n    new Point(21.4930923464916, 3.28999356823389),\n    new Point(23.5359486724204, 4.07290025106778),\n    new Point(22.5447925324242, 2.99485404382734),\n    new Point(25.4645673159779, 7.54703465191098)).sortBy(\n      p =&gt; (p.x + \" \" + p.y).hashCode())\n\n  def clusterMean(points: List[Point]): Point = {\n    val cumulative = points.reduceLeft((a: Point, b: Point) =&gt; new Point(a.x + b.x, a.y + b.y))\n\n    return new Point(cumulative.x \/ points.length, cumulative.y \/ points.length)\n  }\n\n  def render(points: Map[Int, List[Point]]) {\n    for (clusterNumber  x._2 % k) transform (\n        (i: Int, p: List[(Point, Int)]) =&gt; for (x  clusters.map(cluster =&gt; cluster._1)\n\n    \/\/ find cluster means\n    val means =\n      (clusters: Map[Int, List[Point]]) =&gt;\n        for (clusterIndex  closest(p, means(clusters)))\n\n    render(newClusters)\n\n    return newClusters\n  }\n\n  var clusterToTest = clusters\n  for (i<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>To generate sample data, I selected two points, (10, 20) and (25, 5), then generated a list of normally distributed points around those two &#8211; the exact points used are in the code below. This implements Lloyd&#8217;s algorithm, which tries to cluster points in iterations in a simple manner: 1. Assume a certain number of &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/implementing-k-means-in-scala\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Implementing k-means in Scala&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[4,5,6],"tags":[62,115,147,152,244,300,327,385,480,532],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/997"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=997"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/997\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=997"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=997"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=997"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}